Visions and Views

Gang Hua, Microsoft Research

The Continuing Reinvention of Content-Based Retrieval: Multimedia Is Not Dead

William I. Grosky and Terry L. Ruas, University of Michigan-Dearborn

IEEE MultiMedia magazine has existed for 23 years, and one of us (William Grosky) served as its second Editor in Chief from 1998 to 2001. The January–March 1998 issue featured a "Web Insight" column, "Multimedia Is Dead,"1 in which the author considered the term "multimedia" not as a noun describing a field of study but as an adjective, a descriptor of personal computing. Nineteen years later, the subtitle of this Visions and Views piece is a play on that title, and we try to convince you that multimedia, and its nascent relationship with semiotics, is very much alive. We're very aware of how the multimedia field developed, struggling to justify itself—a struggle that was one of the underlying reasons for the founding of this publication. In looking back at the field's development, we'd like to offer some observations pertaining to its future, focusing on one of multimedia's most important constituents, content-based retrieval.

1070-986X/17/$33.00 © 2017 IEEE

Content-Based Retrieval

Content-based retrieval is a main area of the multimedia landscape. Furthermore, it cross-cuts many other fields, including computer vision, signal processing, network science, database management systems, information retrieval, the Internet of Things (IoT), natural language processing, and various aspects of artificial intelligence (AI). The results of this cross-cutting are more than just the sum of the individual parts, however. Multimedia researchers often combine new techniques from these enabling technological areas to produce working prototype software that's unique and useful and that includes ancillary modules from other areas of multimedia, such as security and privacy.

In the modern age, virtually all research areas cross-cut with each other. Even mathematics, which obviously holds a special place in scientific development, is influenced by problems in more practical areas. So, being cross-cutting should not imply that our field is any less deep than any other. Each area of research depends on the progress of many others.

As in any new field, however, there were times in our early days when we had to justify ourselves and show that, yes, multimedia is different from computer vision, even though we borrow many techniques from that area. We also had to justify our opinions as to the efficacy of the tools we used from other fields, such as the standard database management systems of the early 1990s. Over the objections of researchers in the database community, many of us believed that these tools were wanting in our environment and that improvements were necessary.2

In the early days of our community (the late 1980s and early 1990s), when the idea of intelligently managing visual information was becoming quite exciting, text was downplayed, and audio was largely ignored as being in the purview of speech processing, an already established field of study. Other modalities of multimedia were rarely thought about, if at all. However, in a 1992 article,3 one of us (Grosky), along with Rajiv Mehrotra, made the following statement: "As is becoming increasingly apparent, moreover, the experience gained from this view of what an image database should be will generalize to other modalities, such as voice and touch, and will usher in the more general field of study of what we call sensor-based data management."

Signal/Sensor-Based Retrieval

This prediction seems more real than ever and continues to grow stronger. More and more, different modalities are being considered within the purview of content-based retrieval.4

Published by the IEEE Computer Society

In the recent book Text Data Management and Analysis, ChengXiang Zhai and Sean Massung even include a chapter subheading entitled "Text vs. Non-text Data: Humans as Subjective Sensors."5

Basically, various categories of signals are the underlying bedrock of communications of all sorts. Each signal is a spatial/temporal variation of some quantity (such as amplitude), and each modality of multimedia consists of a set of related and synchronized signals. To process them intelligently, one should, of course, know the nature of the modality in which one is immersed. Most multimedia objects are spatio-temporal simulacra of the real world. This supports our view that the next grand challenge for our community will be understanding and formally modeling the flow of life around us, over many modalities and scales. As technology advances, the nature of these simulacra will evolve as well, becoming more detailed and revealing more information concerning the nature of reality. Currently, IoT is the state-of-the-art organizational approach to constructing complex representations of the flow of life around us. Various, perhaps pervasive, sensors, working collectively, will broadcast to us representations of real events in real time. It will be our task to continuously extract the semantics of these representations and possibly react to them by injecting response actions into the mix to ensure some desired outcome.

January–March 2017

Pragmatics-Based Content-Based Retrieval

Linguistics is divided into three broad areas: syntax, semantics, and pragmatics.6 Some relations between these areas can be seen by observing that a formally ungrammatical sentence is usually hard to understand, unless there is a community or culture of understanding for this ostensible violation of standard syntax. Grammar is related to syntax, meaning is related to semantics, and community/culture is related to pragmatics. Most non-linguists are familiar with syntax and semantics, so let us expand here on the notion of pragmatics. In a nutshell, pragmatics studies context and how it affects semantics. Context is sometimes culturally, socially, and historically based. For example, pragmatics would encompass the speaker's intent, body language, and penchant for sarcasm, as well as other signs, usually culturally based, such as the speaker's type of clothing, which could influence a statement's meaning.

Generic signal/sensor-based retrieval should also use syntax-, semantics-, and pragmatics-based approaches. If we are to understand and model the flow of life around us, this will be a necessity. By definition, most images and videos of the real world are grammatical. Our community has successfully developed various approaches to decode the semantics of these artifacts, or at least the dominant semantics, as image snippets (bags of visual words) are more polysemous7 than text. The development of techniques that use contextual information is in its infancy, however. Artistic media, such as painting, sculpture, performance art, and movies, bring along more contextual baggage than other sorts of media. With the expansion of the data horizon, through the ever-increasing use of metadata, we can certainly put all media on more equal footing. As Donald Rumsfeld would say, pragmatics is largely a "known unknown"8 in our community. Its further study is going to involve semiotics9 (the study of signs and symbols), which the physical science community has largely ignored but which fields such as psychology and marketing use quite often.

Expanding the Data Horizon

The expansion of the data horizon results from more metadata becoming available through advanced apps. All modern smartphones contain an internal GPS, used to geotag photos taken by the device. Smartphone inertial sensors can also be used for real-time biofeedback information.10 The type of contextual information provided can certainly influence the type of processing necessary for various tasks, such as entity identification in photos.

Suppose you examine a video of a person exiting a building. The person enters an automobile and drives away. To identify this person, you could perform facial recognition over existing facial databases. However, you might also be able to extract the license plate number and use DMV license plate data to identify the vehicle's owner, which would narrow down the possibilities for the driver's identity. You could also use the geo-coordinates of the exited building to narrow the possibilities further.

To minimize the cost of solving some content-based problem, a task similar to query optimization in database management systems could be performed: search all feasible subsets of data/metadata that could be used to solve the problem. Then, given the nature of the simulacrum of the real world available to you, along with the directly available metadata (that is, metadata produced by the sensors at hand that produced the given simulacrum), as well as the cost of using external metadata (that is, metadata not produced by those sensors but that might be available to you over the Web), you could determine how to solve the problem at minimum cost. The number and types of external URIs needed to solve a problem could be used to construct a hierarchy of difficulty for the universe of problems.
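The subset-search idea above can be sketched as a tiny, query-optimizer-style brute force. Everything concrete here is hypothetical: the evidence-source names, their costs, and the "narrowing power" model of when a subset suffices are invented stand-ins for a real cost model.

```python
from itertools import combinations

# Hypothetical evidence sources for the driver-identification example:
# (name, cost of use, fraction by which each narrows the candidate set).
SOURCES = [
    ("face_db", 5.0, 0.6),
    ("dmv_plates", 2.0, 0.5),
    ("geo_coords", 1.0, 0.3),
]

def sufficient(subset, target):
    """A subset is feasible if its combined narrowing power reaches the target
    (assuming, for illustration, that sources narrow independently)."""
    remaining = 1.0
    for _, _, power in subset:
        remaining *= (1.0 - power)
    return (1.0 - remaining) >= target

def cheapest_feasible(sources, target=0.75):
    """Enumerate all subsets and return (cost, names) of the cheapest feasible one."""
    best = None
    for r in range(1, len(sources) + 1):
        for subset in combinations(sources, r):
            if sufficient(subset, target):
                cost = sum(c for _, c, _ in subset)
                if best is None or cost < best[0]:
                    best = (cost, [name for name, _, _ in subset])
    return best

print(cheapest_feasible(SOURCES))  # (7.0, ['face_db', 'dmv_plates'])
```

The exponential enumeration is fine for a handful of sources; a real optimizer would prune the search, exactly as database query optimizers do.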

Examining the Data

Here, we examine the types of research conducted in the multimedia arena over the past 11 years to determine whether the results are consistent with our observations regarding the evolution of content-based retrieval. To perform such an evaluation, we first considered all papers published as part of the ACM International Conference on Multimedia (ACM MM) from 2005 to 2015. This extraction was performed using the bibliographic database Scopus (www.elsevier.com/solutions/scopus), which is owned and maintained by Elsevier. Currently, Scopus holds more than 60 million records, dating from 1960 to the present. To make sure all indexed records would be considered in our study, we extracted each set of publications by its respective conference ISBN. We obtained approximately 2,872 items, each represented in RIS format. From this set, we preprocessed the entire corpus (data cleaning and normalization) so we could perform the following:

- frequency tracking of keywords over the 11-year span; and
- most-common-topics analysis (represented by a group of common terms concatenated into a single sentence summarizing a general idea).
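The extraction and counting steps above can be sketched as follows. The two-record RIS snippet and the cluster definition are invented for illustration; only the standard RIS tags (`PY` for year, `KW` for keyword, `ER` for end of record) are real.

```python
from collections import defaultdict

def parse_ris(text):
    """Extract (year, keyword) pairs from RIS records.

    RIS lines have a two-letter tag, two spaces, a hyphen-space, then the value.
    """
    pairs, year = [], None
    for line in text.splitlines():
        tag, value = line[:2], line[6:].strip().lower()
        if tag == "PY":
            year = int(value[:4])
        elif tag == "KW" and year is not None:
            pairs.append((year, value))
        elif tag == "ER":          # end of record
            year = None
    return pairs

def cluster_counts(pairs, cluster):
    """Per-year frequency of a cluster: sum the counts of its member terms."""
    counts = defaultdict(int)
    for year, kw in pairs:
        if kw in cluster:
            counts[year] += 1
    return dict(counts)

ris = """TY  - CONF
PY  - 2007
KW  - Context
KW  - Semantics
ER  -
TY  - CONF
PY  - 2007
KW  - Contextual information
ER  -
"""
pairs = parse_ris(ris)
print(cluster_counts(pairs, {"context", "contextual information"}))
# {2007: 2}
```

Lowercasing during parsing is a minimal stand-in for the data cleaning and normalization mentioned above; summing member-term counts per year is exactly how a cluster's trend line is produced.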

When tracking keywords, we found the overall frequency of a cluster of related keywords and keyphrases by adding the individual frequencies of the cluster's components. We then plotted these frequencies, as illustrated in Figure 1. When analyzing the topics, we used the tool MALLET (MAchine Learning for LanguagE Toolkit; http://mallet.cs.umass.edu), a Java package with sophisticated tools for document classification, developed by Andrew McCallum. We set both the number of topics and the number of words characterizing each topic to 25.

Tracking Keywords

We chose 12 clusters of keywords and keyphrases, based on their pertinence to what we perceive as the necessary evolution of the field of content-based retrieval. These 12 clusters are captured by the following terms: aesthetics, affective computing, causality, context, cyberphysical systems, events, geo-tagging, human engineering, natural language processing, semantics, social sensors, and spatio-temporal. Because some clusters (such as aesthetics, context, events, and semantics) have a high number of occurrences at some point in their trend line, plotting all of them in the same view would not accurately show the behavior of the others. Consequently, we divided the 12 clusters into two groups: low-occurrence terms (Figure 1a) and mid-to-high-occurrence terms (Figure 1b). In Figure 1a, you will notice that "causality," "cyberphysical systems," and "social sensors" had no occurrences until 2015. The other keywords/keyphrases are on an upward trajectory, except for "geo-tagging," which is on a level trajectory. In Figure 1b, each keyword/keyphrase is on an upward trajectory, except for "aesthetics." The upward trajectories indicate their increasing popularity.

Analyzing Topics

We wondered whether we would get a similar plot from considering topic models as we had for the keywords.
So, we performed a simple experiment on the keyword "context." For each year, we examined all topics (among the top 25) containing the word "context." For example, for 2007, there were two topics containing the word "context":

- Topic 3: search, semantics, optimization, semantic, applications, complexity, context, automated, embedded, projection, interoperability, topic, latency, structure, tensor, compounds, answering, evolution, structures, perception, cameras, dominance, feature, scale, communities.
- Topic 6: algorithms, fusion, similarity, representation, random, markov, language, propagation, support, space, fields, contextual, processes, speaker, path, adaptive, availability, perceptual, mean, utility-based, patient, finite, perspective, measure, route.

For each of these topics, MALLET computed the number of words from that topic that occurred in each document and summed up all these occurrences. For Topic 3, there were two documents having six occurrences of words in that topic, four documents having five occurrences, four documents having four occurrences, seven documents having three occurrences, 19 documents having two occurrences, and 41 documents having one occurrence, producing 148 total occurrences. Similarly, there were 188 occurrences from Topic 6. Thus, the total was 336 occurrences of words from Topics 3 and 6 in our 2007 set.

Figure 1. Popular keywords in ACM Multimedia Conference papers from 2005–2015: (a) plots with relatively small y-axis values (affective computing, causality, cyberphysical systems, geo-tagging, human engineering, natural language processing, social sensors, and spatio-temporal) and (b) plots with relatively large y-axis values (aesthetics, context, events, and semantics).

We extracted the plot of the occurrences of the keyword "context" based on the keywords from Figure 1b; the result is displayed in Figure 2, along with the plot based on the topics. Although the two plots have different scales, their shapes are quite similar. This is no surprise, however, as the words occurring in a topic generally co-occur in each document. In case you're wondering about the keywords for all years (2005–2015) and the words contained in the 25 topics for those years, Figures 3 and 4 present word clouds for them, respectively. We have just scratched the surface of a data-driven approach to determining the evolution of content-based retrieval, and we look forward to more detailed investigations.
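The per-topic counting that MALLET performed can be reproduced directly: for each document, count how many of its tokens belong to the topic's word list, then sum over documents. The two "documents" below are invented; only the counting scheme follows the text.

```python
def topic_occurrences(topic_words, documents):
    """Per-document and total occurrences of a topic's words in a document set."""
    vocab = set(topic_words)
    per_doc = [sum(1 for tok in doc.lower().split() if tok in vocab)
               for doc in documents]
    return per_doc, sum(per_doc)

# A few words from Topic 3 and two made-up documents.
topic3 = ["search", "semantics", "context", "structure", "feature"]
docs = [
    "a search structure for context aware retrieval",
    "semantics of multimedia events",
]
per_doc, total = topic_occurrences(topic3, docs)
print(per_doc, total)  # [3, 1] 4
```

Summing such totals over every topic containing "context" for a given year yields the orange curve in Figure 2 (for example, 148 + 188 = 336 for 2007 in the real data).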

Figure 2. Keyword counts (blue) and topic counts (orange) of the word "context." Although the scales are different, the shapes are quite similar. This is no surprise, as the words occurring in a topic generally co-occur in each document.

Figure 3. A word cloud for all the keywords from 2005 through 2015.

Figure 4. A word cloud for the 25 topics for each year from 2005 through 2015.

Increased Awareness of the Field's Evolution

In The Structure of Scientific Revolutions, Thomas Kuhn argued for an episodic theory of scientific progress, consisting of periods of normal science, where the standard accumulation of facts is the norm, punctuated by rare periods where a paradigm shift occurs.11 For any reigning scientific theory, the number of anomalous facts inconsistent with that theory increases over time. For a while, the old theory can be slightly modified to explain these anomalies. After a while, though, the anomalies grow larger, and younger scientists get bolder and try to develop new explanations. Eventually, the old guard dies off, and the younger crew causes a new reigning scientific theory to appear. A good example of this is the progression from the Ptolemaic to the Copernican view of the solar system.

In general, if people are comfortable in their position in life and work, scientists included, they are resistant to change, some more than others. However, if we had a bird's-eye view of our field, including its best practices, the evolution of its different strands, and informed predictions of where the field is going, we might be more skilled at placing ourselves in its research landscape and might be more amenable to change. Surveys can, of course, offer a good bird's-eye view. However, good surveys are quite rare and do not go very deep into a field's possible near-term futures based on an informed, data-driven analysis.

Multimedia researchers and their publications can be represented as a graph/web/network, which can be formally studied using the tools of network science. Because the number of scientific productions (extended abstracts, papers, journals, and so on) has increased drastically over the past few decades, the manual evaluation and categorization of these artifacts has become quite complex and time consuming. In this context, scientometrics12 presents itself as a good alternative for science evaluation, using well-defined metrics. This field of study provides models and methodologies to properly evaluate, in a practical way, large sets of research publications and scientists over time.

We hope we have convinced you that the field of content-based retrieval is quite rich and has endless possibilities. We're not worried about the future of our field, as we see numerous green shoots already planted, and we anxiously await their future growth so we can perform more detailed investigations. MM

References

1. S. Reisman, "Multimedia Is Dead," IEEE MultiMedia, vol. 5, no. 1, 1998, pp. 4–5.
2. R.C. Jain, "NSF Workshop on Visual Information Management Systems: Workshop Report," Proc. SPIE 1908, Storage and Retrieval for Image and Video Databases, 1993; http://dx.doi.org/10.1117/12.143650.
3. W.I. Grosky and R. Mehrotra, "Image Database Management," Advances in Computers, vol. 34, M. Yovits, ed., Academic Press, 1992, p. 238.
4. S. Poria et al., "Fusing Audio, Visual and Textual Clues for Sentiment Analysis," Neurocomputing, Jan. 2016, pp. 50–59.
5. C.X. Zhai and S. Massung, Text Data Management and Analysis, ACM Books and Morgan & Claypool Publishers, 2016, p. 244.
6. N. Indurkhya and F.J. Damerau, eds., Handbook of Natural Language Processing, 2nd edition, CRC Press, 2010.
7. G.L. Dillon, "Art and the Semiotics of Images: Three Questions about Visual Meaning," Univ. of Washington, July 1999; http://faculty.washington.edu/dillon/rhethtml/signifiers/sigsave.html.
8. "DoD News Briefing—Secretary Rumsfeld and Gen. Myers," United States Department of Defense, 12 Feb. 2002; http://archive.defense.gov/Transcripts/Transcript.aspx?TranscriptID=2636.
9. D. Chandler, Semiotics, 2nd edition, Routledge, 2007.
10. A. Kos, S. Tomazic, and A. Umek, "Suitability of Smartphone Inertial Sensors for Real-Time Biofeedback Applications," Sensors, vol. 16, no. 3, 2016, p. 301; doi: 10.3390/s16030301.
11. T. Kuhn, The Structure of Scientific Revolutions, 4th edition, University of Chicago Press, 2012.
12. A.F.J. van Raan, "Scientometrics: State-of-the-Art," Scientometrics, vol. 38, no. 1, 1997, pp. 205–218; doi: 10.1007/BF02461131.

William I. Grosky is professor and chair of the Department of Computer and Information Science at the University of Michigan-Dearborn. Contact him at [email protected].

Terry L. Ruas is a PhD student in the Information Systems Engineering program at the University of Michigan-Dearborn. Contact him at [email protected].
