Program on Monday (Nov. 28, 2011)

Tutorial 1: Frontiers in Multimedia Search

By Alan Hanjalic and Martha Larson
Multimedia Information Retrieval Lab
Delft University of Technology

Multimedia that cannot be found is, in a certain sense, useless. It is lost in a huge collection, or worse, in a back alley of the Internet, never viewed and impossible to reuse. Research in multimedia retrieval is directed at developing techniques that bring image and video content together with users—matching multimedia content and user needs. The aim of this tutorial is to provide insights into the most recent developments in the field of multimedia retrieval and to identify the issues and bottlenecks that could determine the directions of research focus for the coming years. We present an overview of new algorithms and techniques, concentrating on those approaches that are informed by neighboring fields including information retrieval, speech and language processing and network analysis. We also discuss evaluation of new algorithms, in particular, making use of crowdsourcing for the development of the necessary data sets.

The tutorial targets new scientists in the field of multimedia retrieval, providing instruction on how to best approach the multimedia retrieval problem and examples of promising research directions to work on. It is also designed to benefit active multimedia retrieval scientists—those who are searching for new challenges or re-orientation. The material presented is relevant for participants from both academia and industry. It covers issues pertaining to the development of modern multimedia retrieval systems and highlights emerging challenges and techniques anticipated to be important for the future of multimedia retrieval.

The tutorial begins with an overview of “multimedia search in the wild” that covers how and when we use multimedia access and retrieval technologies in our lives, both personal and professional. These considerations serve to inform the selection of research challenges faced by multimedia retrieval as the field continues to grow and expand. The main body of the presentation focuses on possibilities for exploiting and combining available information resources to optimize multimedia search results in view of these usefulness issues. We concentrate on three complementary information sources:

  • User: Exploiting the interaction of the user with the search system, either to enhance the query so that it better reflects the user information need and search intent, or to enrich the collection with implicit or explicit metadata. Approaches discussed include: transaction log analysis, context modeling in multimedia search, (visual) query suggestion and user-supported query expansion.
  • Collection: Exploiting the information inherent in the relationships that exist in the collection and in the search environment, for example, similarities between documents and connections among users. Two categories of approaches and techniques working in this direction will be discussed:
    • Maximizing the quality of the top-ranked search results using IR concepts and cross-modal analysis through e.g., (visual) search reranking, query-class-dependent search and query performance prediction,
    • Integrating social information from networked communities, including use of community-contributed metadata and techniques for exploiting social networks. A case study will address the problem of non-trivial collaborative recommendation expanding the current scope based on the classical collaborative filtering concept.
  • Content: Exploiting all information channels in the content collection itself. Automatic indexing systems (e.g., speech recognition, audio event detection, semantic concept detection) are well known for their imperfections. The key to improving the usefulness of multimedia search is building systems that can elegantly handle their own shortcomings. Instead of endless resistance, multimedia search paradigms are required that can robustly deal with noise and present the user with a result with the highest possible utility.
    • Making use of confidence scores (e.g., effective use of imperfect results, informing user when a result may be less than satisfactory),
    • Exploiting characteristics of multimedia items that are revealed using simple methods of structural analysis,
    • Integrating information from external sources to reduce influence of indexing noise.
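As a minimal illustration of the confidence-score idea listed above, the sketch below ranks search hits by the product of a match score and an indexer confidence, and flags hits whose confidence is low so that the interface can warn the user. The `Hit` fields, the product rule and the `warn_below` threshold are assumptions made for this example only; they are not taken from the tutorial material.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    item_id: str
    match_score: float  # how well the indexed terms match the query, in [0, 1]
    confidence: float   # indexer's confidence in those terms, in [0, 1]


def rank_with_confidence(hits, warn_below=0.5):
    """Order hits by match_score * confidence and flag uncertain ones.

    Returns a list of (item_id, combined_score, needs_warning) tuples.
    The product rule and the threshold are illustrative assumptions.
    """
    ranked = sorted(hits, key=lambda h: h.match_score * h.confidence, reverse=True)
    return [
        (h.item_id, h.match_score * h.confidence, h.confidence < warn_below)
        for h in ranked
    ]


hits = [
    Hit("video_a", match_score=0.9, confidence=0.4),   # strong match, noisy transcript
    Hit("video_b", match_score=0.7, confidence=0.9),
    Hit("video_c", match_score=0.5, confidence=0.95),
]
ranking = rank_with_confidence(hits)
for item_id, score, needs_warning in ranking:
    print(item_id, round(score, 3), "(low confidence)" if needs_warning else "")
```

In this toy data the strongest textual match drops to the bottom of the list because the index entry behind it is unreliable, and it is the only result marked for a warning.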

The final section of the tutorial examines the opportunities to formulate research topics that are closely related to the needs of the user and to carry out work in the newly evolving multimedia search paradigms. A critical aspect of tackling new tasks is developing the data sets necessary to evaluate them. We give an introduction to the practice of crowdsourcing for the generation of data sets suited for the evaluation of new multimedia retrieval algorithms, including best practices for task design and quality control. Tasks and data sets are also offered to the multimedia retrieval research community by benchmarking initiatives. We conclude the tutorial with a short presentation of the MediaEval benchmark, which offers multimedia retrieval tasks concentrated on social and contextual challenges of multimedia, including geo-coordinate prediction, genre detection and prediction of viewer affective response.

Presenter’s biography: Dr. Alan Hanjalic is Associate Professor and Coordinator of the Delft Multimedia Information Retrieval Lab at Delft University of Technology, Netherlands. Dr. Hanjalic’s research interests and expertise are in the broad area of multimedia computing, with a focus on multimedia information retrieval and personalized multimedia content delivery. In his areas of expertise Dr. Hanjalic has (co-)authored more than 100 publications, among them the books Image and Video Databases: Restoration, Watermarking and Retrieval (Elsevier, 2000), Content-Based Analysis of Digital Video (Kluwer Academic Publishers, 2004) and Online Multimedia Advertising (IGI Global, 2010). He was a visiting scientist at Hewlett-Packard Labs, British Telecom Labs, Philips Research Labs and Microsoft Research Asia. Dr. Hanjalic has been on the Editorial Boards of the IEEE Transactions on Multimedia, IEEE Transactions on Affective Computing, Journal of Multimedia, Advances in Multimedia and the Image and Vision Computing journal. He was also a Guest Editor of special issues in a number of journals, including the Proceedings of the IEEE (2008), IEEE Transactions on Multimedia (2009), and Journal of Visual Communication and Image Representation (2009). He has also served on the organizing committees of major conferences in the multimedia field, including the ACM Multimedia Conference (General Chair 2009, Program Chair 2007), ACM CIVR conference (Program Chair 2008), ACM ICMR conference (Program Chair 2011), the WWW conference (Track Chair 2008), Multimedia Modeling Conference (Area Chair 2007), Pacific Rim Conference on Multimedia (Track Chair 2007), IEEE ICME (Track Chair 2007), and the IEEE ICIP conference (Track Chair 2010 and 2011). Dr. Hanjalic was a Keynote Speaker at the Pacific-Rim Conference on Multimedia, Hong Kong, December 2007, and is an elected member of the IEEE TC on Multimedia Signal Processing.

Presenter’s biography: Dr. Martha Larson is a senior researcher at the Delft University of Technology in the Multimedia Information Retrieval Lab. Her expertise is in the area of speech and language technology for multimedia search with a focus on networked communities. Before joining the group at Delft, she researched and lectured in the area of audio-visual retrieval in the NetMedia group at Fraunhofer IAIS and at the University of Amsterdam. Martha Larson holds an MA and PhD in theoretical linguistics from Cornell University and a BS in Mathematics from the University of Wisconsin. She is an organizer of MediaEval, a multimedia retrieval benchmark campaign that emphasizes spoken content and social media. She is a guest editor of the upcoming ACM TOIS special issue on searching spontaneous conversational speech. Recently, much of her research has focused on deriving and exploiting information from multimedia that is “orthogonal to topic”, that is, not directly related to subject matter. Examples include information on user-perceived quality, affective impact and social trust. Such information can be used to improve the quality of multimedia search. Her research interests also include user-generated multimedia content, cultural heritage archives, indexing approaches exploiting multiple modalities, techniques for semantic structuring of spoken content and methods for reducing the impact of speech recognition error on speech-based retrieval. She has participated as both researcher and research coordinator in a number of projects including the EU projects PetaMedia, MultiMatch and SHARE.


Tutorial 2: Internet Multimedia Advertising: Techniques and Technologies

By Tao Mei, Ruofei (Bruce) Zhang, Xian-Sheng Hua
Microsoft Research Asia, China
Yahoo! Labs, USA

Advertising provides financial support for a large portion of today’s Internet ecosystem. Compared to traditional means of advertising, such as a banner outside a store or textual advertisements in newspapers, multimedia advertising has some unique advantages: it is more attractive and more salient than plain text, it is able to instantly grab users’ attention and it carries more information that can also be comprehended more quickly than when reading a text advertisement. Rapid convergence of multimedia, Internet and mobile devices has opened new opportunities for manufacturers and advertisers to more effectively and efficiently reach potential customers. While largely limiting itself to radio and TV channels currently, multimedia advertising is about to break through on the web using various concepts of internet multimedia advertising.

The explosive growth of multimedia data on the web also creates huge opportunities for further monetizing it with multimedia advertisements. Multimedia content becomes a natural information carrier for advertising, much as radio waves carry bits in digital communications. More and more business models are being rolled out that distribute multimedia content for free and recoup the revenue from the multimedia advertisements it carries. With the increasing importance of online multimedia advertising, researchers from the multimedia community have made significant progress in this direction. This tutorial aims at: 1) reviewing and summarizing recent high-quality research on internet multimedia advertising, including basic technologies and applicable systems, and 2) presenting insight into the challenges and future directions in this emerging and promising area.

This tutorial is aimed at the ACM Multimedia audience, including graduate students and senior researchers working in the field of multimedia and/or online advertising, as well as industry practitioners working in search engine development, at video/image content providers, on video/image sharing portals and at IPTV providers. Instead of covering contemporary papers in depth, this three-hour tutorial introduces the important general concepts and themes of this timely topic that are of interest to the MM audience. We will also show extensive demos of contextual multimedia advertising.

Tentative outline of the tutorial:

  1. Introduction of traditional text advertising techniques
  2. Understanding the audience for user-targeted advertising
  3. Image advertising
  4. Video advertising
  5. Mobile advertising
  6. Challenges and future directions

Tao Mei is a Researcher at Microsoft Research Asia, Beijing, China. His current research interests include multimedia content analysis, computer vision, and multimedia applications such as search, advertising, social networking, and mobile computing. He is the editor of one book and the author of seven book chapters and over 100 journal and conference papers in these areas, and holds more than 30 filed or pending applications. He serves as an Associate Editor for Neurocomputing and Journal of Multimedia, and as a Guest Editor for IEEE Multimedia, ACM/Springer Multimedia Systems, and Journal of Visual Communication and Image Representation. He was the principal designer of the automatic video search system that achieved the best performance in the worldwide TRECVID evaluation in 2007. He received the Best Paper and Best Demonstration Awards in ACM Multimedia 2007, the Best Poster Paper Award in IEEE MMSP 2008, and the Best Paper Award in ACM Multimedia 2009. He was awarded the Microsoft Gold Star in 2010. Tao received the B.E. and the Ph.D. degrees from the University of Science and Technology of China, Hefei, in 2001 and 2006, respectively.

Ruofei (Bruce) Zhang is a Senior Scientist in the Advertising Sciences division at Yahoo! Labs, Silicon Valley. He currently manages the information retrieval modeling, response prediction and optimization group, which applies statistical machine learning and time series analysis techniques to solve problems in contextual and display advertising. Bruce joined Yahoo! in June 2005. Prior to working at Yahoo! Labs, he worked on sponsored search query rewriting modeling in Search and Advertising Sciences and led search relevance R&D in Yahoo! Video Search. Bruce’s research interests are in machine learning, large scale data analysis and mining, optimization, time series analysis, image/video processing and analysis, and multimedia information retrieval. He has co-authored a monograph on multimedia data mining and published over two dozen peer-reviewed papers in leading international journals and conferences, as well as several invited papers and book chapters; he is inventor or co-inventor of more than 20 issued and pending patents on online advertising, search relevance, ranking function learning, and multimedia content analysis. Bruce has served on grant review panels for the US NSF and on the program committees of major conferences in these fields. Bruce received a Ph.D. in computer science with a Distinguished Dissertation Award from the State University of New York at Binghamton, and an M.E. and a B.E. from Tsinghua University and Xi’an Jiaotong University, respectively.

Xian-Sheng Hua is now a Principal Research and Development Lead for Bing Multimedia Search with Microsoft. He is responsible for driving a team to design and deliver thought-leading media understanding and indexing features. Before joining Bing in 2011, Dr. Hua was a Lead Researcher with Microsoft Research Asia. During that time, his research interests were in the areas of multimedia search, advertising, understanding, and mining, as well as pattern recognition and machine learning. He has authored or co-authored more than 180 publications in these areas and has more than 60 filed patents or pending applications. Xian-Sheng Hua received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. He serves as an Associate Editor of IEEE Transactions on Multimedia, Associate Editor of ACM Transactions on Intelligent Systems and Technology, Editorial Board Member of Advances in Multimedia and Multimedia Tools and Applications, and editor of Scholarpedia (Multimedia Category). Dr. Hua won the Best Paper Award and Best Demonstration Award in ACM Multimedia 2007, the Best Poster Award in the 2008 IEEE International Workshop on Multimedia Signal Processing, the Best Student Paper Award in the ACM Conference on Information and Knowledge Management 2009, and the Best Paper Award in the International Conference on MultiMedia Modeling 2010. He also won the 2008 MIT Technology Review TR35 Young Innovator Award for his outstanding contributions to video search.

The link of the slide deck for the tutorial on “internet multimedia advertising”:


Tutorial 3: Internet Video Search

By Cees G.M. Snoek and Arnold W.M. Smeulders
University of Amsterdam, Netherlands


In this tutorial, we focus on the challenges in internet video search, present methods for achieving state-of-the-art performance while maintaining efficient execution, and indicate how to obtain improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition (the leading competition for video search engines, run by NIST), where we have achieved consistent top performance over the past years, including the 2008, 2009 and 2010 editions.

Categories and Subject Descriptors: H.3.3 Information Storage and Retrieval: Information Search and Retrieval
General Terms: Algorithms, Experimentation, Performance
Keywords: Visual categorization, video retrieval, information visualization


The scientific topic of video search is dominated by five major challenges:
a) the sensory gap between an object and its many appearances due to the accidental sensing conditions;
b) the semantic gap between a visual concept and its lingual representation;
c) the model gap between the number of notions in the world and the capacity to learn them;
d) the query-context gap between the information need and the possible retrieval solutions;
e) the interface gap between the tiny window the screen offers and the amount of data.

The semantic gap is bridged by forming a dictionary of visual detectors for concepts and events. The largest dictionaries to date consist of hundreds of concepts, which rules out concept-tailored algorithms: developing one per concept would simply take too long. Instead, we come closer to the ideal of one computer vision algorithm tailored automatically to the purpose at hand by employing example data to learn from. We discuss the advantages and limitations of a machine learning approach from examples. We show for what type of semantics the approach is likely to succeed or fail. In compensation for the absence of concept-specific (geometric or appearance) models, we emphasize the importance of good feature sets. They form the basis of the observational model: all possible color, shape, texture or structure invariant features help to characterize the concept and event at hand. Apart from good features, the other essential component is state-of-the-art machine learning in order to get the most out of the learning data.

We integrate the features and machine learning aspects into a complete internet video search engine, which has successfully competed in TRECVID. The multimedia system includes computer vision, machine learning, information retrieval, and human-computer interaction. We follow the video data as they flow through the efficient computational processes, starting from fundamental visual features covering local shape, texture, color and motion, and the crucial need for invariance. Then, we explain how invariant features can be used in concert with kernel-based supervised learning methods to arrive at an event or concept detector. We discuss the important role of fusion on a feature, classifier, and semantic level to improve the robustness and general applicability of detectors. We end our component-wise decomposition of internet video search engines by explaining the complexities involved in delivering a limited set of uncertain concept detectors to an impatient online user. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits.
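Fusion at the classifier level can be pictured with a very small sketch: each feature-specific detector outputs a score per video shot, and the fused score is a weighted average. The feature names, scores and uniform default weights below are made up for illustration, and weighted averaging is only one of many possible fusion rules; it is not claimed to be the scheme used in the presenters' system.

```python
def late_fusion(scores_per_classifier, weights=None):
    """Fuse per-classifier detector scores by a weighted average.

    scores_per_classifier maps a classifier name to {shot_id: score}.
    All classifiers are assumed to have scored the same shots.
    """
    names = list(scores_per_classifier)
    if weights is None:
        weights = {name: 1.0 for name in names}  # uniform weights by default
    total_weight = sum(weights[name] for name in names)
    shots = scores_per_classifier[names[0]].keys()
    return {
        shot: sum(weights[n] * scores_per_classifier[n][shot] for n in names) / total_weight
        for shot in shots
    }


# Hypothetical scores of three feature-specific detectors for one concept.
scores = {
    "color":   {"shot1": 0.8, "shot2": 0.3},
    "texture": {"shot1": 0.6, "shot2": 0.5},
    "motion":  {"shot1": 0.4, "shot2": 0.2},
}
fused = late_fusion(scores)
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

Passing a `weights` dictionary lets a more reliable detector count for more, which is one simple way to trade robustness against sensitivity to a single noisy feature.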

Comparative evaluation of methods and systems is imperative to appreciate progress. We discuss the data, tasks, and results of TRECVID, the leading benchmark. In addition, we discuss the many derived community initiatives in creating annotations, baselines, and software for repeatable experiments. We conclude the course with our perspective on the many challenges and opportunities ahead for the multimedia retrieval community.


This tutorial is supported by STW SEARCHER, FES COMMIT, and the IARPA via Department of Interior National Business Center contract number D11PC20067. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Copyright is held by the author/owner(s).
MM’11, November 28–December 1, 2011, Scottsdale, AZ, USA.
ACM 978-1-60558-933-6/10/10.


Tutorial 4: Semantic Computing in Multimedia

By Simone Santini
Universidad Autonoma de Madrid, Spain

Semantics as a cognitive computing topic (as opposed, for example, to the formal semantics of programming languages) began within artificial intelligence and then, beginning in the 1980s, faded from public attention following the then general decline of interest in symbolic artificial intelligence. That was the heyday of connectionism, and connectionist machines, ex hypothesi, would not model semantics explicitly. In the last ten years, however, there has been a noteworthy resurgence of the technical discourse on semantics, based on the widespread opinion that the large amount of data available today can be properly managed only through a qualitative leap in the processing capabilities of computing machines. A semantic leap, as it were.

The purpose of this tutorial is not only to give the attendees information about standards and programming techniques, but also to give them a better view of the broader topics in which semantic computing is embedded. Learning techniques and standards is the easy part of the job; the difficult part, the one that needs the face-to-face interaction typical of a tutorial, is to know what to do with these standards and techniques, that is, to understand the general theory of semantics and how the different techniques fit in it. Semantics is a complex issue, with a history of many centuries and a variety of different points of view. In order to do serious research on semantics, the computing scientist must be aware of important and complex theoretical questions, and of the solutions and models that the different schools have proposed. This tutorial will try to provide such a background.

After an introductory section with a brief history of semantics and pictorial communication, the tutorial will be divided into two parts, corresponding to the two fundamental approaches, which I shall call the ontological and the hermeneutical.

The first part will deal with all those approaches that try to encode formally the semantics of a document and attach it to the document itself. It will cover, roughly, a terrain that goes from Tarski to current ontologies, with some emphasis on model theory and some forays into partially uncharted waters, such as the use of fuzzy logic for multimedia modeling. This section, in turn, will be divided into two parts. The first, and longest, will be a technical discussion of the different concepts of semantics that have been used in logic, with special emphasis on aspects of formal semantics and on traditional knowledge representation (including ontologies, the semantic web, and their relation to multimedia). The second part will be a brief excursus on the presuppositions that underlie this work. Traditional logic is a discipline of formal reasoning and never quite dealt with the content (viz. the semantics) of statements. From Aristotle, through the systematization of the syllogism by the Scholastics, all the way to the axiomatic programme of Russell and analytic philosophy, logic has been a science of the forms of reasoning, without any reference to the contents of the reasoning activity. We will look into this separation of form and substance, and analyze its plausibility and its consequences for multimedia semantics.

To say that the meaning of a document (multimedia or otherwise) can be characterized by a formal model attached to the document requires certain assumptions, at the basis of which is the idea that a document has a content, independent (more or less) of the linguistic means that are used to express it, and that exists (more or less) intact even if nobody is interpreting the document. We will analyze critically this view of meaning, its plausibility, and the limits of its validity.

The second part will tie in with the non-technical discussion at the end of the first part. While analyzing the presuppositions of the logic approach to semantics, we will also begin to look at alternative views, with a special emphasis on two areas: hermeneutics and structural semantics. We will work to understand the role of the reader in the creation of meaning, the role of the discursive practices of media creation, and that of the cultural conventions that drive the way in which media should be interpreted.

The study of signification carried out in this way will unveil several important characteristics for the design of semantic systems. Meaning is not an attribute of an image or a video, but something that arises when an artifact is used as part of an activity, and only makes sense in the context of that activity. That is, meaning is created when an artifact is interpreted in a context, and as part of an activity. From the point of view of the design of computing systems, this means that, rather than modeling the content of documents, we should model the activities that require access to the images and the context in which these activities take place.

With a bit of simplification, we can say that the ontological approaches use the database as their formal model, attaching a formal model to semi-structured data, while the hermeneutic approaches take interactive systems as their base, and try to make meaning emerge through more sophisticated forms of interaction.

Orthogonal to this distinction is the one that separates the models based on logic from those based on soft computing. The two distinctions (ontological vs. hermeneutical and logic vs. soft computing) are not completely independent: by and large, ontological models tend to use a logic machinery (vide OWL and description logic), while hermeneutics and interaction tend to use soft computing (feature space geometry, latent semantics, self-organizing maps, ...). The reason for this is to be sought in the different characteristics of the two approaches: logic methods are very expressive but brittle; they behave poorly in the presence of inconsistent data and are hard to build by induction from the data. Soft methods are not very expressive, but they can be built inductively and automatically on the basis of available data. We will study the characteristics of these two modes of representing semantics, with particular attention to the models that try to move beyond this dichotomy, such as fuzzy logic ontological models that can be inferred from the data, and non-classical logic models, such as those based on semantic games, that can be usefully applied to interactive systems.

The attendees will receive didactic material specifically designed for the tutorial. The material will not simply be a hard copy of the transparencies used, but a booklet that constitutes a complete reference for the topics studied in the tutorial.


Tutorial 5: Audio and Multimedia Music Signal Processing

By Gael Richard
Telecom ParisTech, France

1 Motivation

The enormous amount of unstructured audio data available nowadays and the spread of its use as a data source in many applications are introducing new challenges to researchers in information and multimedia signal processing. Automatic analysis of audio documents (music, radio broadcast audio streams,...) gathers several research directions including audio indexing and transcription (extraction of informative features leading to audio content recognition or to the estimation of high level concepts such as melody, rhythm, instrumentation or harmony,...), audio classification (grouping by similarity, by music genre or by audio events categories) and content-based retrieval (such as query by example or query by humming approaches).

In this context, the general field of music signal processing is receiving growing interest and is becoming more relevant and more visible in the audio community. Nevertheless, although much work is being done in audio and music signal processing, it is often presented only at specialized music or audio signal processing conferences. In the multimedia community, the focus of interest is often on the image or video signal, with less emphasis on the audio signal and its potential for analyzing or interpreting a multimedia scene.

The aim of the proposed tutorial is therefore to provide a general introduction to audio signal processing, which should be of broad interest to the multimedia community, to review the state of the art in music signal processing (this will be largely based on [1]) and to highlight with some examples the potential of music signal processing for multimedia streams.

2 Intended Audience and Benefits

The tutorial will mostly target an intermediate audience which has some knowledge in multimedia but may not be familiar with audio and music signals. The tutorial will nevertheless include some more advanced concepts which should also be of broad interest to students, researchers and engineers who are more knowledgeable in audio but who are not familiar with decomposition models or audio source separation principles.

The expected benefits for the multimedia community include:

  • a better understanding of audio processing basics and their potential for indexing multimedia streams. The tutorial will also include a brief presentation of existing open source tools that allow an audio indexing module to be designed rapidly.
  • a better understanding of decomposition models for music signals and how they can be efficiently used to represent the signal as a set of objects or sources (with application to audio source separation). This will be illustrated using a number of sound examples in the context of karaoke applications or other audio remixing applications.
  • a better understanding of the potential of multimodality through multimedia music examples. The tutorial will highlight this aspect using two specific examples (a multimedia drum transcription system and a cross-modal music video search).

3 Tutorial Content

As outlined above, the objective of the tutorial is first to introduce some basics of music signal processing, then to provide more insight into decomposition models, which are at the heart of a number of audio signal processing methods, and finally to illustrate with some well-chosen examples how audio processing tools are particularly useful for processing music multimedia streams.

The tutorial is scheduled for half a day and is structured in four main parts:

  • Introduction: this section will provide a general introduction to the domain of audio and music signal processing and will illustrate the interest of this domain through a number of recent applications [1]. A typical architecture of an audio classification system will also be given and further discussed using an illustrative music indexing task (e.g. automatic recognition of musical instruments).
  • Signal representations and decomposition models: this section will start with the traditional Fourier representation of audio signal and related transformations which are particularly well suited for music signal analysis (Constant Q transform, Mel-Frequency transform, chromagrams,...). Decomposition models will then be rapidly presented and will include in particular greedy decomposition models and factorization models which are becoming popular for a wide variety of problems.
  • Application: the signal representations and decomposition models will then be applied to music sources separation with examples on singing voice extraction, drum track separation and bass line separation.
  • Multimodality: the potential of multimodality in music signal processing will then be highlighted through two specific examples: a multimodal drum track separation and audio-based video search for music videos. This part will also be the occasion to discuss some very early results of experiments conducted on the fully multimodal, multiple-sensor database released for the ACM Multimedia 2011 Grand Challenge sponsored by Huawei/3Dlife [1].
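To make the signal-representation part of the outline more concrete, here is a minimal sketch of a chroma vector (the per-frame building block of a chromagram) computed from a Fourier transform of one audio frame. It uses a naive DFT from the standard library so that it stays self-contained; the frame length, sampling rate, Hann window and frequency limits are assumptions chosen for the example, and practical systems would rely on an FFT library and more careful tuning.

```python
import cmath
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_vector(samples, sample_rate, f_min=55.0, f_max=2000.0):
    """Fold the DFT power spectrum of one frame into 12 pitch classes."""
    n = len(samples)
    # Hann window to limit spectral leakage between neighbouring bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    chroma = [0.0] * 12
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if not f_min <= freq <= f_max:
            continue
        # Naive DFT coefficient for bin k (an FFT would be used in practice).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        # Map the bin frequency to the nearest MIDI note (A4 = 440 Hz = note 69).
        midi = round(69 + 12 * math.log2(freq / 440.0))
        chroma[midi % 12] += abs(coeff) ** 2
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma


# One frame of a synthetic 440 Hz tone (concert pitch A).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(1024)]
chroma = chroma_vector(frame, sample_rate)
dominant = PITCH_CLASSES[chroma.index(max(chroma))]
print(dominant)  # prints "A"
```

Stacking such vectors over successive frames gives a chromagram; the same folding idea underlies the harmony-oriented features mentioned in the motivation.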

Prof. Gaël Richard received the State Engineering degree from Télécom ParisTech in 1990, and the Ph.D. and Habilitation à Diriger des Recherches degrees from University of Paris-XI in 1994 and 2001, respectively. He then spent two years at the CAIP Center, Rutgers University (USA), in the speech processing group of Prof. J. Flanagan, where he explored innovative approaches for speech production. Between 1997 and 2001, he successively worked for Matra Nortel Communications and Philips Consumer Communications. In particular, he was the project manager of several large-scale European projects in the field of multimodal speaker verification and audio processing. He then joined Télécom ParisTech, where he is now full Professor and Head of the Audio, Acoustics and Waves research group. Co-author of over 80 papers and inventor on a number of patents, he is also an expert for the European Commission in audio and multimedia signal processing. Prof. Richard is a member of EURASIP, a senior member of IEEE, Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing, and a member of the Audio and Acoustic Signal Processing Technical Committee of the IEEE.


[1] M. Mueller, D. Ellis, A. Klapuri, and G. Richard. Signal processing for music analysis. IEEE Journal on Selected Topics in Signal Processing, 2011, To appear.


Tutorial 6: Acoustic and Multimodal Processing for Multimedia Content Analysis

A 3-Hour Tutorial at ACM Multimedia 2011 for beginners in audio processing, multimedia students, and researchers at an intermediate level.
By Gerald Friedland
International Computer Science Institute, USA


Today's computers are beginning to have enough computational power and memory to process large amounts of data in different sensory modalities. This makes it possible to improve the robustness of current content analysis approaches and to attack problems that are impossible to solve using only a single modality. Just as a human analyst uses multiple sources of information to determine the content of a video, it seems obvious that for video content analysis, the investigation of clues across different sensor modalities and their combination can lead to better results than investigating only one stream of sensor input. This is especially true for the analysis of consumer-produced, “unconstrained” videos, such as YouTube uploads or Flickr content.

Many computer science curricula include basic image processing and computer vision classes but only rarely acoustic content analysis classes, and when they do, acoustic analysis is often reduced to speech recognition. As a result, multimedia content analysis is mostly image- and vision-centric. While visual information is a very important part of a video, acoustic information often complements it. Moreover, with multimodal integration and fusion still being a research topic, many multimedia researchers lack in-depth knowledge of how to combine and integrate modalities in an efficient and effective way.


The objective of the tutorial is to introduce interested multimedia students and researchers who are not specialized in audio to the world of acoustic processing research, with a focus on multimodal content analysis. For example, how accurate is speech recognition, and what open source options are there? Can I do indoor/outdoor detection with audio? How does an acoustic event detector work? What are the toolkits available for me to use? The goal is to enable the participants to include acoustic processing in their research as an addition to image, text, and visual video processing to enhance their multimedia content analysis results, especially on large-scale video corpora. Because a 3-hour tutorial can replace neither a multi-semester lecture series on the topic nor a degree in electrical engineering, the major purpose of this tutorial is to introduce basic concepts and technical terms, along with practical software toolkits and references to key literature. I hope to foster a high degree of cross-media fertilization that will benefit the multimedia community.


The following is a list of topics that the tutorial will cover:
- Useful and common filters
- Features for audio analysis
- Typical machine learning methods used for acoustic processing
- Evaluation methods of acoustic processing
- Example tasks of acoustic analysis
- Toolkits for acoustic analysis
- Multimodal integration
- Multimodal fusion
- Experimental setup for conducting multimodal experiments
- Acoustic and multimodal research challenges
- Discussion


The materials will be based on a class taught on the same topic at UC Berkeley in the fall semester of 2011. Participants will receive the slides of the presentations. In addition, the attendees of the tutorial will have early access to the textbook materials “Introduction to Multimedia Computing” by G. Friedland and R. Jain, which is to be published by Cambridge University Press soon after the tutorial. The textbook complements the tutorial not only with additional explanation but also with pseudo code (for reference purposes) and exercises (for deepening the presented material).

About the Presenter

Dr. Gerald Friedland is a senior research scientist at the International Computer Science Institute (ICSI), a private lab affiliated with the University of California, Berkeley, where he leads multimedia content analysis research, mostly focusing on acoustic techniques. Projects he is involved in include work on acoustic methods for TRECVid MED 2011’s video concept detection task, multimodal location estimation for consumer videos, and multimodal grounded perception for robots. Dr. Friedland has published more than 100 peer-reviewed articles in conferences, journals, and books. He is an associate editor for ACM Transactions on Multimedia Computing, Communications, and Applications, is on the organizing committee of ACM Multimedia 2011, and is a TPC co-chair of IEEE ICME 2012. Dr. Friedland is the recipient of several research and industry recognitions, among them the European Academic Software Award and the Multimedia Entrepreneur Award by the German Federal Department of Economics. Most recently, he led the team that won the ACM Multimedia Grand Challenge in 2009.

Despite being mainly a researcher, Dr. Friedland is a passionate teacher. He teaches a class on the same topic in the fall semester at UC Berkeley, and he is currently authoring a new textbook, “Introduction to Multimedia Computing”, together with Dr. Ramesh Jain, to be published by Cambridge University Press. He is also a proud founder, program director, and instructor of the IEEE International Summer School on Semantic Computing at UC Berkeley, which fosters cross-disciplinary computer science research on content extraction.


Tutorial 7: Graphical Probabilistic Modeling and Applications in Multimedia Content Analysis

By Xiao-Ping Zhang and Zhu Liu
Ryerson University, Canada
AT&T-Research, USA


With the rapidly growing popularity of multimedia content creation and sharing, enabled by high-quality mobile video capture devices and wired or wireless broadband connections, the volume of new content available to us is simply beyond our capacity to consume it. For example, in the last year, video material uploaded to the YouTube video sharing service has increased from about 20 hours per minute to about 35 hours per minute. Tools that can automatically find the most relevant content according to our interests, specified manually or learned from our viewing history, are desired. Multimedia content analysis and understanding is an indispensable component in such tools and many other multimedia applications and services. It has been an active research area in the last two decades, and it has evolved into a combination and interconnection of many subjects such as audio and music processing, image and video processing, natural language processing, computer vision, and machine learning.

Graphical models combine probability theory and graph theory, and they provide a way to represent the joint distribution over all of the random variables as a product of factors, each depending only on a subset of the variables. Such models are flexible and scale to capture complex dependencies among a large number of random variables. Many existing statistical models and methods find a unifying representation in graphical models, for example hidden Markov models, Markov random fields, and Kalman filters. As an emerging framework in machine learning with enormous potential, graphical models have been introduced in the multimedia content analysis area and are widely adopted in many applications. The intrinsic nature of multimedia content analysis tasks, including the high dimensionality of the low-level features, rich prior knowledge of the structure of multimedia content, and the complex spatio-temporal correlation among multiple modalities, makes them a perfect match for graphical models. There is no doubt that the application of graphical models in multimedia content analysis will keep thriving.
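
As a minimal sketch of the factorization idea described above, the following Python snippet evaluates the joint distribution of a small hidden Markov model as a product of local factors (an initial-state factor, transition factors, and emission factors). The two states, the two observation symbols, and all probability values are invented purely for illustration and are not taken from the tutorial material.

```python
# Hypothetical two-state HMM; every name and number here is an illustrative assumption.
initial = {"speech": 0.6, "music": 0.4}            # p(z_1)
transition = {                                     # p(z_t | z_{t-1})
    "speech": {"speech": 0.7, "music": 0.3},
    "music": {"speech": 0.2, "music": 0.8},
}
emission = {                                       # p(x_t | z_t)
    "speech": {"low": 0.1, "high": 0.9},
    "music": {"low": 0.6, "high": 0.4},
}


def joint_probability(states, observations):
    """Joint probability of a state path and an observation sequence.

    Uses the HMM factorization
    p(z, x) = p(z_1) p(x_1 | z_1) * prod_{t>1} p(z_t | z_{t-1}) p(x_t | z_t),
    so each factor depends on at most two variables.
    """
    if not states or len(states) != len(observations):
        raise ValueError("states and observations must be non-empty and equally long")
    prob = initial[states[0]] * emission[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= transition[prev][cur] * emission[cur][obs]
    return prob


if __name__ == "__main__":
    print(joint_probability(["speech", "speech", "music"], ["high", "high", "low"]))
```

Because every factor is a normalized (conditional) distribution, the joint probabilities of all state paths and observation sequences of a given length sum to one, which is a quick sanity check on such a model.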

The purpose of this tutorial is twofold: introducing graphical models as a new framework of machine learning, and demonstrating the applications of graphical models in the multimedia content processing domain. This tutorial is intended for researchers in the multimedia content analysis and understanding area as well as professionals working in related fields. Fundamentals of both graphical models and multimedia content analysis will be covered in this tutorial, and there are no prerequisites for the audience. The tutorial is designed to present a refreshing, broad perspective on graphical models, and in-depth examples of their application in multimedia content analysis. It will be valuable to experts in the constituent technologies, such as multimedia indexing and search and content-based analysis, who are looking to broaden their knowledge beyond their current areas of expertise. Specifically, the audience will: 1) Understand the basics of graph theory and graphical models; 2) Learn special graphical models, including Hidden Markov Models, Markov Random Fields, and Conditional Random Fields; 3) Get familiar with the general approaches in multimedia content analysis systems; 4) Study a few examples of how to apply graphical models in multimedia content analysis tasks, for example, video event detection, video sequence matching, image labeling, etc.; and 5) Be able to determine which topics in multimedia content analysis are of interest to them for further study.


This tutorial is half a day long and consists of two parts: Part I – Introduction of Basic Theory, and Part II – Advances and Applications in Multimedia Content Analysis. Detailed outlines for each part follow.

Part I: Introduction of Basic Theory

  • Bayesian Methods
  • Graphical Models
  • Hidden Markov Models (HMM)
  • Markov Random Field (MRF)
  • Conditional Random Field (CRF)

Part II: Advances and Applications in Multimedia Content Analysis

  • Video Temporal Dynamics
  • Identification of Digital Video based on Shot-level Sequence Matching
  • Independent Component Analysis Mixture Hidden Markov Models (ICAMHMM)
  • Video Event Detection Using ICAMHMM
  • ICA Mixture Hidden Conditional Random Field Model (ICAMHCRF) for Sports Event Classification
  • Gaussian Mixture Conditional Random Field Modeling for Indoor Image Labeling
  • Laplacian Mixture Conditional Random Field Model for Image Labeling
  • Future Research Directions

Intended audience: introductory and intermediate level, specifically graduate students and researchers interested in graphical models and their applications in multimedia content analysis.


Xiao-Ping Zhang received the B.S. and Ph.D. degrees from Tsinghua University, in 1992 and 1996, respectively, both in electronic engineering. He received the M.B.A. degree in finance and economics with Honors from the University of Chicago Booth School of Business, Chicago, IL. Since fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, where he is now Professor and Director of the Communication and Signal Processing Applications Laboratory (CASPAL). Prior to joining Ryerson, from 1996 to 1998, he was a Postdoctoral Fellow at the University of Texas, San Antonio, and then at the Beckman Institute, the University of Illinois at Urbana-Champaign. He held research and teaching positions with the Communication Research Laboratory, McMaster University, Canada, in 1999. From 1999 to 2000, he was a Senior DSP Engineer with SAM Technology, Inc., San Francisco, CA, and a consultant with the San Francisco Brain Research Institute. His research interests include signal processing for communications, multimedia retrieval and video content analysis, computational intelligence, and applications in bioinformatics, finance, and marketing. He is a frequent consultant for biotech companies and investment firms. Dr. Zhang is a senior member of IEEE, a registered Professional Engineer in Ontario, Canada, and a member of the Beta Gamma Sigma Honor Society. He was the Publicity Co-Chair for ICME’06 and Program Co-Chair for ICIC’05. He has served as guest editor for the Multimedia Tools and Applications Journal. He is currently an Associate Editor for IEEE Signal Processing Letters.

Zhu Liu received the B.S. and M.S. degrees in Electronic Engineering from Tsinghua University, Beijing, China, in 1994 and 1996, respectively, and the Ph.D. degree in Electrical Engineering from Polytechnic University, Brooklyn, NY (now part of New York University), in 2001. He joined AT&T Labs - Research, Middletown, NJ, in 2000, and is currently a Principal Member of Technical Staff in the Video and Multimedia Technologies and Services Research Department. He is an adjunct professor in the Electrical Engineering Department of Columbia University. His research interests include multimedia content analysis, multimedia databases, video search, pattern recognition, machine learning, and natural language understanding. He holds 13 U.S. patents and has published more than 60 papers in international conferences and journals. Dr. Liu is a senior member of IEEE and a member of ACM. Dr. Liu and his colleagues won the best demonstration award at the Consumer Communication & Networking Conference 2007. He is on the editorial board of the IEEE Transactions on Multimedia and the Peer-to-Peer Networking and Applications Journal. He has served as guest editor for the IEEE Transactions on Multimedia, the Multimedia Tools and Applications Journal, and the International Journal of Semantic Computing. He has also served on the organizing and technical committees of many IEEE international conferences.


Tutorial 8: Multimedia Tagging: Past, Present and Future

By Jialie Shen, Meng Wang, Shuicheng Yan, and Xian-Sheng Hua
Singapore Management University, Singapore
National University of Singapore, Singapore
Microsoft Research Asia, China


As the size of online media collections grows rapidly, multimedia information retrieval (MIR) becomes an increasingly critical technique for effective multimedia document search and management. To facilitate these processes, it is essential to annotate the multimedia objects with comprehensive textual information. Consequently, multimedia tagging has recently been actively studied by many different communities (e.g., multimedia computing, information retrieval, machine learning and computer vision). Meanwhile, many commercial web systems (e.g., YouTube and Flickr) have successfully applied tags and the related techniques to assist users in discovering, exploring and sharing media content in a convenient and flexible way.

The half-day tutorial comprehensively summarizes the research along this direction and provides a good balance between theoretical methodologies and real systems (including several industrial approaches). We plan to (i) introduce why tags and tagging schemes are important for accurate and scalable MIR; (ii) examine current commercial systems and research prototypes, focusing on comparing the advantages and the disadvantages of the various strategies and schemes for different types of media documents (e.g., image, video and audio); (iii) review key technical challenges in building tagging systems and explore how tagging techniques can be used to facilitate different kinds of retrieval tasks at large scale, and (iv) review a few promising research directions and explore potential solutions.


1. Introduction and Overview (25 mins)
    1.1 What is multimedia tagging?
    1.2 Why multimedia tagging?
    1.3 Manual vs. Automatic Tagging
2. Image tagging (45 mins)
    2.1 Learning-based tagging
        2.1.1 Feature extraction scheme and image representation
        2.1.2 Machine learning models specifically designed for image tagging
        2.1.3 Applications on large scale image search
    2.2 Search-based tagging
        2.2.1 Scalable search; indexing/hashing
        2.2.2 Research test-bed construction
3. Beyond image: music and video tagging (45 mins)
    3.1 How music and video tagging is different from image tagging
        3.1.1 Temporal media data modeling
        3.1.2 Complexity of content
    3.2 TRECVID experience
    3.3 Beyond TRECVID: exploring structure information for music and video tagging
        3.3.1 Advanced feature extraction and combination
        3.3.2 Tagging model design
    3.4 System benchmarking
4. Assistive multimedia tagging (45 mins)
    4.1 Tagging with intelligent data organization & selection
    4.2 Tag recommendation
    4.3 Tag processing - refinement and information complementation
    4.4 Personalization and applications on personal media information management
5. Summarization (20 mins)
    5.1 Future trend of multimedia tagging and its applications
    5.2 Open discussion

Brief Biography

Jialie Shen: Dr. Shen is an Assistant Professor in Information Systems and Lee Foundation Fellow, School of Information Systems, Singapore Management University (SMU), Singapore. Before moving to SMU, he worked as a faculty member at UNSW, Sydney and as a researcher at the University of Glasgow for a few years. Dr. Shen’s main research interests include information retrieval, economic-aware media analysis, and multimedia systems. His recent work has been published or is forthcoming in leading journals and international conferences including ACM SIGIR, ACM Multimedia, ACM SIGMOD, CVPR, ICDE, WWW, IEEE TCSVT, IEEE TMM, Multimedia Systems, ACM TOIT and ACM TOIS. He has also actively served, or is serving, as session chair, PC member and reviewer for a large number of leading international conferences.

Meng Wang: Dr. Wang is currently a research staff member at the National University of Singapore. Previously he worked as an associate researcher at Microsoft Research Asia and as a research scientist at a start-up in the Bay Area. Dr. Wang's research interests include multimedia content analysis, tagging, search, and large-scale computing. Dr. Wang has authored about 80 technical papers in these areas. He is an associate editor of Information Sciences, an associate editor of Neurocomputing, and a guest editor of special issues of Multimedia Systems Journal, Multimedia Tools and Applications, and Journal of Visual Communication and Image Representation. He received the Best Paper Award in two consecutive years, at the ACM International Conference on Multimedia 2009 and 2010, and the Best Paper Award at the International Multimedia Modeling Conference 2010.

Shuicheng Yan: Dr. Yan is currently an Assistant Professor in the Department of Electrical and Computer Engineering at National University of Singapore, and the founding lead of the Learning and Vision Research Group. Dr. Yan's research areas include computer vision, multimedia and machine learning, and he has authored or co-authored over 200 technical papers on a wide range of research topics. He is an associate editor of IEEE Transactions on Circuits and Systems for Video Technology, and has served as guest editor of special issues of TMM and CVIU. He received the Best Paper Awards from ACM MM’10, ICME’10 and ICIMCS'09, the winner prize of the classification task in PASCAL VOC'10, the honorable mention prize of the detection task in PASCAL VOC'10, and the 2010 TCSVT Best Associate Editor (BAE) Award, and he is a co-author of papers that received the best student paper awards of PREMIA'09 and PREMIA'11.

Xian-Sheng Hua: Dr. Hua is now a Principal Research and Development Lead for Bing Multimedia Search with Microsoft. He is responsible for driving a team to design and deliver thought-leading media understanding and indexing features. Before joining Bing in 2011, Dr. Hua was a Lead Researcher with Microsoft Research Asia. During that time, his research interests were in the areas of multimedia search, advertising, understanding, and mining, as well as pattern recognition and machine learning. He has authored or co-authored more than 180 publications in these areas and has more than 60 filed patents or pending applications. Dr. Hua received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. He serves as an Associate Editor of IEEE Transactions on Multimedia, Associate Editor of ACM Transactions on Intelligent Systems and Technology, Editorial Board Member of Advances in Multimedia and Multimedia Tools and Applications, and editor of Scholarpedia (Multimedia Category). Dr. Hua won the Best Paper Award and Best Demonstration Award at ACM Multimedia 2007, the Best Poster Award at the 2008 IEEE International Workshop on Multimedia Signal Processing, the Best Student Paper Award at the ACM Conference on Information and Knowledge Management 2009, and the Best Paper Award at the International Conference on MultiMedia Modeling 2010. He also won the 2008 MIT Technology Review TR35 Young Innovator Award for his outstanding contributions to video search.


Tutorial 9: Eye-tracking methodology and applications to images and video

By Harish Katti and Mohan Kankanhalli
National University of Singapore, Singapore


This tutorial introduces eye-tracking as an exciting, non-intrusive method of capturing user attention during interaction with digital images and videos. The tutorial will focus on the following aspects:
1. What is eye-tracking and what does it offer for multimedia researchers?
2. Basic understanding of human anatomy, low and high level visual cognition, mathematical techniques underlying popular eye-tracking hardware and software.
3. Introduction to experiment design, analysis and visualization and application scenarios.

The tutorial will consist of two parts of roughly equal duration. The first half will introduce eye-tracking methodology and its potential use in experiments and applications involving human interaction with images and videos. The second half will focus on hands-on experiments and data analysis using state-of-the-art eye-tracking hardware and software.


Our motivation stems from the following: (a) Huge volumes of image and video content are generated as a result of human experiences and interaction with the environment. We frequently encounter video and image collections containing millions of video clips on YouTube and billions of images on repositories such as Flickr or Picasa. It becomes useful and necessary to automate the process of understanding and storing this content to enable subsequent uses such as indexing, retrieval and query processing, and re-purposing for devices with different form factors. (b) Personalization in human-media interaction is an important and promising direction of research. Having access to individual preferences and behavioral patterns is a key component of such a system. Eye-gaze can play a valuable role in this context. Recent developments in hardware technology have made it possible to seamlessly integrate eye-tracking into laptops and mobile devices. This opens up a plethora of possibilities for research and for applications and frameworks that can respond to users' visual attention strategies. (c) Visual content design, such as in advertising, often employs techniques that guide user attention to produce visual impact and elements of surprise and emotion. Eye-gaze has been used as a tool to evaluate different choices of visual elements and their placement. (d) Affective analysis of images and videos is an ongoing and challenging area in multimedia research. We show how eye-gaze and accompanying pupillary dilation information can aid affective analysis. Recent efforts on the use of pupillary dilation for affective analysis of videos will be presented.


The first half will cover the following topics:
1. Defining the role of eye-gaze for an experimental study or application and preparation of stimulus set for image or video based experiment.
2. Eye-tracking hardware choices, ranging from low-cost open-source setups to desktop-based and mobile head-mounted options.
3. Typical eye-gaze analysis and visualization methods for image and video based experiments.
4. Interactive, eye-gaze contingent systems.

The second half will consist of a hands-on session covering the following topics:
1. Aspects of analysis and visualization for image and video stimuli. This will be introduced with customized code accompanying the NUSEF dataset, as well as commercial software from SMI.
2. Eye-tracking experiments with image and video stimuli using commercial trackers from SMI and also low-cost open-source tracking.


The tutorial is aimed at researchers who want to understand the basics of eye-tracking and its potential applications to images and video. Participants can be from diverse backgrounds such as psychology, engineering and computer science, and will be exposed to methods that are complementary to their basic training. A basic knowledge of programming and statistics will help, but is not a compulsory prerequisite, as the tutorial material will be self-contained and introductory in nature. We can accommodate 15-20 participants due to hardware constraints. SMI has kindly agreed to sponsor the commercial eye-tracking hardware and up to 5 merit-based scholarships of 100 dollars each.


The tutorial will be spread over two halves of about 1.5 hours each. The first half will consist of a series of half-hour lectures addressing the following questions:
1. What is eye-tracking and what does it offer for multimedia researchers?
2. Basic understanding of human anatomy, low and high level visual cognition, mathematical techniques underlying popular eye-tracking hardware and software.
3. Introduction to experiment design, analysis and visualization and application scenarios.

The second half will consist of hands-on sessions where the participants will be split up into groups of 3-4 people. Each group will be assigned to a hardware setup and expected to conduct a small eye-tracking study or visualization exercise. The speakers will guide the groups through the exercises, which will help the participants gain practical experience with eye-tracking experiments.


We will provide background reading and introductory material to selected participants well in advance. The participants will be provided with additional reading material during the tutorial, and this will include how-tos and FAQs for the hands-on sessions. We also intend to provide an overview and survey of different free and commercial eye-tracking setups, datasets and related software to help interested participants start on eye-tracking research.


Harish Katti received the B.Engg. degree in Computer Science and Engineering from Karnatak University and the M.Tech degree in Bio-Medical Engineering from IIT Bombay. He is currently a PhD candidate in the Department of Computer Science, School of Computing, National University of Singapore. He has worked in the area of multimedia systems at Sasken Communications Pvt Ltd and Emuzed India Pvt Ltd. His current research interests are in visual perception and applications of eye-tracking methodology to media applications.

Mohan Kankanhalli obtained his BTech (Electrical Engineering) from the Indian Institute of Technology, Kharagpur and his MS/PhD (Computer and Systems Engineering) from the Rensselaer Polytechnic Institute. He is a Professor at the School of Computing at the National University of Singapore. He is on the editorial boards of several journals including the ACM Transactions on Multimedia Computing, Communications, and Applications, Multimedia Systems Journal and Multimedia Tools and Applications. His current research interests are in Multimedia Systems (content processing, retrieval) and Multimedia Security (surveillance, privacy and digital rights management).


ACM Multimedia 2011

Nov 28th - Dec 1st, 2011 Scottsdale, Arizona, USA
