Multimed Tools Appl DOI 10.1007/s11042-013-1547-0

TrendsSummary: a platform for retrieving and summarizing trendy multimedia contents Daehoon Kim & Daeyong Kim & Sanghoon Jun & Seungmin Rho & Eenjun Hwang

© Springer Science+Business Media New York 2013

Abstract With the flood and popularity of various multimedia contents on the Internet, searching for appropriate contents and representing them effectively has become essential to user satisfaction. So far, many content recommendation systems have been proposed for this purpose. A popular approach is to select hot or popular contents for recommendation using some popularity metric. Recently, various social network services (SNSs) such as Facebook and Twitter have become a widespread social phenomenon owing to the smartphone boom. Considering their popularity and user participation, SNSs can be a good source for finding social interests or trends. In this study, we propose a platform called TrendsSummary for retrieving trendy multimedia contents and summarizing them. To identify trendy multimedia contents, we select candidate keywords from raw data collected from Twitter using a syntactic feature-based filtering method. Then, we merge various keyword variants based on several heuristics. Next, we select trend keywords and their related keywords from the merged candidate keywords based on term frequency and expand them semantically by referencing portal sites such as Wikipedia and Google. Based on the expanded trend keywords, we collect four types of relevant multimedia contents (TV programs, videos, news articles, and images) from various websites. The most appropriate media type for the trend keywords is determined based on a naïve Bayes classifier. After classification, appropriate contents are selected from among the contents of the selected media type. Finally, both trend keywords and their related multimedia contents are displayed for effective browsing. We implemented a prototype system and experimentally demonstrated that our scheme provides satisfactory results.

Keywords Twitter · Trends · Multimedia contents recommendation · Summarization · Naïve Bayes classifier · TreeMap

D. Kim · D. Kim · S. Jun · E. Hwang (corresponding author)
School of Electrical Engineering, Korea University, Seoul, South Korea

S. Rho
Department of Multimedia, Sungkyul University, Anyang-si, South Korea

1 Introduction

Recent years have witnessed tremendous developments in the field of information technology (IT) and a great increase in the pervasiveness of the Internet. With these changes, news organizations and broadcasting companies have been ceaselessly producing a huge amount of news articles and TV programs, respectively. In particular, with the recent smartphone boom, user-generated content (UGC) has come to constitute a significant portion of the total volume of digital content. Accordingly, there is a strong and urgent need to provide or recommend appropriate digital contents to users. Many content recommendation systems have already been developed for this purpose. In such systems, a frequently used approach is to select hot or popular contents for recommendation based on some popularity metric.

At the same time, the smartphone boom has also made various social network services (SNSs) such as Facebook and Twitter a widespread social phenomenon. For instance, Twitter is an online social networking and microblogging service that enables its users to post and share text-based messages of up to 140 characters, known as "tweets." It has gained global popularity, with over 500 million registered users (as of 2012) generating over 340 million tweets daily. It is therefore often described as "the SMS of the Internet" [30]. Although one tweet may contain at most 140 characters, the sheer number of tweets generated daily could collectively offer important clues about public opinions and current trends.

In this study, we propose a platform called TrendsSummary for retrieving trendy multimedia contents and summarizing them. To identify trendy multimedia contents, we first select candidate keywords from the raw data collected from Twitter using a syntactic feature-based filtering method. Then, we merge various keyword variants based on several heuristics. Next, we select trend keywords and their related keywords from the merged candidate keywords based on term frequency. Usually, the number of these trend keywords is very small and hence their semantic coverage is very limited. To alleviate this, we expand them semantically by referencing portal sites such as Wikipedia [31] and Google [11]. Based on the expanded trend keywords, we collect four types of related multimedia contents (TV programs, videos, news articles, and images) from various websites. The most appropriate media type is determined for the trend keywords based on a naïve Bayes classifier. After classification, appropriate contents are selected from among the contents of the selected media type. Finally, both the trend keywords and their related multimedia contents are represented on the screen using TreeMap [28] for effective browsing.

In summary, the main contributions of this study are as follows:

1) Based on the trend keywords and their related keywords detected automatically from user tweets, we show how to decide the most suitable media type based on the naïve Bayes classifier.

2) In addition to the basic keyword matching for selecting the media type, we show how to select the most suitable contents from many candidate contents by using various characteristics of each media type, such as time line, hits, and relative similarity.

3) Based on the TreeMap algorithm, we show how to represent both the trend keywords and the related multimedia contents effectively. Through this interface, users can see at a glance the relative popularity of trend keywords and their most appropriate digital contents.

The remainder of this paper is organized as follows. In section 2, we discuss the background of this study and related works. In section 3, we describe our multimedia content recommendation and summarization scheme in detail. In section 4, we present experimental results. In section 5, we briefly discuss the conclusions and outline future plans.

2 Related work

Thus far, many studies have focused on effective content recommendation and summarization. When recommendations are based on social popularity, it is necessary to first identify social trends. SNSs, owing to their popularity and user participation, are considered a good source for finding such social trends.

2.1 SNS analysis

Compared to traditional documents, SNS messages possess several different characteristics from the viewpoint of text analysis and mining. Generally, a single SNS message is short, possibly containing only a few short sentences; however, the number of messages generated daily is enormous. Furthermore, owing to the size limitation of messages, they frequently contain acronyms and other types of abbreviations. Thus, SNS messages may need to be handled differently from traditional documents. Nevertheless, many studies have treated SNS messages as traditional text documents and applied traditional text modeling. Yokomoto et al. applied an LDA-based document model to the task of labeling blog posts with Wikipedia entries [32]; they collected Wikipedia entries and found that their LDA parameters strongly depend on the distribution of keywords across all search results of blog posts. Si et al. proposed a scalable and real-time method for tag recommendation [29]; they developed a model of documents, words, and tags using the tagLDA model, which extends the LDA model by adding a tag variable. Quercia et al. focused on understanding how well a fairly new version of topic modeling works in the specific context of Twitter, which is emerging as an increasingly useful source of informative textual data [25]. However, none of these approaches is suitable for detecting trend keywords from a large number of tweets in a short time because the LDA model requires a large amount of computation time and numerous parameter adjustments.

On the other hand, some studies have considered SNS messages as streaming data or time sequence data and tried to extract trends from them. Mathioudakis et al. proposed TwitterMonitor for the detection of trends over Twitter streams [20]; TwitterMonitor first identifies emerging topics on Twitter in real time and provides meaningful analytics for the accurate description of each topic, and it discovers topic trends by detecting bursty single tags. However, it is difficult to obtain information about various events using only single tags. To overcome this problem, Alvanaki et al. presented the enBlogue system, an approach for automatically detecting emergent topics by detecting shifts in tag correlations as they dynamically arise [1, 2]. In addition, several studies have focused on extracting bursty keywords from text streams that arrive continuously over time. Fang et al. proposed a bursty keyword discovery scheme using the co-occurrence and time information of words [9]; they generated pairs of co-occurring words by applying OpenNLP and extracted bursty keywords
by analyzing word clusters within a specified time range. Kleinberg proposed a text data mining method for detecting topics that suddenly increase in frequency [16]. The approach uses an infinite-state automaton to model a text stream in a manner analogous to queuing theory models and creates a hierarchical structure of the set of bursts for the overall streams; however, the method usually suffers from the difficulty of extracting temporal characteristics, and it requires a significant amount of time.

2.2 Multimedia content recommendation

Content recommendation methods have been studied in various fields. Chen et al. proposed a music recommendation system [5]; they analyzed the user access history to derive profiles of interests and behaviors for user grouping and performed music recommendation based on the favorite degrees of music groups by user groups. Cano et al. presented a system called MusicSurfer for realizing interactive collections of music [4]; a user could navigate through a collection of music based on descriptions related to instrumentation, rhythm, and harmony extracted from music audio signals. Eck et al. suggested a scheme for generating social tags of music for recommendation [8]; they automatically predicted social tags by using audio feature extraction and supervised learning for mapping audio features onto social tags; the generated social tags can predict similar artists or songs as recommendations.

Several recommendation methods have also been developed for news articles. Liu et al. proposed a hybrid system based on Web history for personalized news recommendation [19]; they conducted a large-scale analysis of anonymous user click logs and predicted users' current news interests using a Bayesian framework developed for log analysis; they also combined a content-based recommendation mechanism that used user profiles with collaborative filtering to generate personalized news recommendations. Francisci et al. suggested a Web-based personalized news recommendation system [7]; they used the learning-to-rank approach and support vector machines to rank news interests using Twitter logs and to generate user profiles; their system could also predict the degree of interest in news based on the generated profiles and the social neighborhood of users. Phelan et al. developed a social news service called Buzzer [24] that ranked personal RSS subscriptions based on Twitter conversations; this was achieved by a content-based approach for mining trending terms from the Twitter timeline and friend subscriptions.

Finally, many studies have focused on recommendations for TV programs. Lai used a user profiling method to recommend TV programs [18]; the proposed cloud-based system applies the KNN algorithm to a large amount of user preferences in order to generate profile data. Several studies have also focused on automatically evaluating TV programs to recommend and rank recommended results. Wakamiya et al. suggested a TV evaluation method based on the relation between user opinions on Twitter and TV programs for personalized recommendation [27].

2.3 Content summarization

Content summarization involves creating a summary of contents for efficient searching and recommendation. Kuo et al. proposed a system for summarizing Web search results in tag cloud form [17]; they eliminated stop words from the abstracts returned by the query and applied Porter's Stemming Algorithm to use stemmed words as tags in the tag cloud; they also computed the frequency of the tags to represent their relative weight. Shen et al. presented a conditional random-field-based framework for document summarization [26]; they treated summarization as a sequence labeling task and viewed each document as a sequence of sentences; the summarization procedure labeled the sentences to extract the core sentences. Otterbacher et al. proposed a method for transforming online contents for mobile devices [23]; they employed a hierarchical structure based on the relative importance of sentences within the document; the most representative sentences were initially delivered to mobile devices, and users could drill down to view sentences at deeper levels of the hierarchy if they wanted more information about specific topics. Gong et al. applied two generic text summarization methods (the standard IR method and latent semantic analysis) for ranking sentence relevance and generating a summary [10]; both methods select sentences that are highly ranked and different from each other to create a summary with wider coverage of the document's main content and less redundancy. Barzilay et al. proposed a sentence fusion method for summarizing news within multiple documents [3]; they used a novel text-to-text generation technique that involves bottom-up local multisequence alignment to identify phrases conveying similar information and statistical generation to combine common phrases into a sentence for synthesizing common information across documents.

However, all of these methods provide a summary of only specific fields, and most of them require a significant amount of time. Furthermore, these methods usually create an abstract consisting of sentences that help users understand the main idea of the collected contents; owing to the length of such an abstract, the user likely cannot easily grasp the key points of each multimedia content in a large set.

3 Multimedia content recommendation and summarization

In this section, we describe the overall architecture of our prototype system and some of the implementation details. Our system consists of three major components: Trends Analyzer, Media Type Determiner, and Contents Selector & Summarizer. The Trends Analyzer receives raw tweet data from the Twitter server and detects trends and their related keywords. By using these trend keyword sets, the Media Type Determiner decides which media type is the most effective for the trend of a certain keyword set. Then, for the selected media type, the Contents Selector & Summarizer chooses the most suitable contents and summarizes them. Through these steps, we can provide a variety of trend-related information to users. Figure 1 shows the overall architecture of our proposed system.

3.1 Trend analyzer

To detect trends and their related keywords from the raw tweet data, we first select candidate trend keywords from tweets by performing simple filtering based on some syntactic rules. Then, we calculate the co-occurrence of candidate keywords in tweets for finding related keywords. Finally, we decide the trend keywords and their related keywords based on term frequency and expand them semantically by referencing portal sites including Wikipedia and Google.

Owing to the small screen of most smartphones, many tweets contain various word variants such as abbreviations and usage errors such as typing and spacing errors. Such variants should be merged so that trend keywords are detected more accurately. Table 1 shows the word variant types we considered and their examples, which were observed in user tweets between 01/03/2013 and 01/06/2013. We merged variants into their full keyword and accumulated their term frequencies under those of the full keyword. These steps are described in greater detail in [13].
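The merging step can be illustrated with a short Python sketch. It assumes that the variant-to-full-keyword mapping is already known (the heuristics that discover it are described in [13] and are not reproduced here); the table `VARIANT_TO_FULL`, the function `merge_variants`, and all counts are hypothetical.

```python
from collections import Counter

# Hypothetical lookup in the spirit of Table 1: variant -> full keyword.
VARIANT_TO_FULL = {
    "NYE": "NEW YEARS EVE",
    "NYC": "NEW YORK CITY",
    "VEGAS": "LAS VEGAS",
    "HIP POP": "HIP HOP",
}


def merge_variants(term_counts, variant_to_full):
    """Fold each variant's term frequency into that of its full keyword."""
    merged = Counter()
    for term, count in term_counts.items():
        key = term.upper()
        merged[variant_to_full.get(key, key)] += count
    return merged


# Made-up term frequencies for illustration only.
raw_counts = {"NYE": 40, "NEW YEARS EVE": 25, "Vegas": 7, "LAS VEGAS": 3, "NFL": 12}
merged_counts = merge_variants(raw_counts, VARIANT_TO_FULL)
print(merged_counts.most_common(3))
```

After merging, the ranking by term frequency reflects the full keyword rather than being split across its spellings.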
Fig. 1 Overall system architecture. Raw tweet data from the Twitter server pass through the Trends Analyzer (select candidate trend keywords; merge keyword variations; find related keywords using co-occurrence; expand keywords semantically), the Media Type Determiner (collect related multimedia contents; determine user preferred media type), and the Contents Selector & Summarizer (calculate score of each content for the preferred media type; select the most appropriate content), producing a summary of trend keywords and multimedia contents. Collected contents are kept in the Trendy Multimedia Contents Repository (TV program, news article, video, image).

3.2 Media type determiner: basic model

In our model for determining the appropriate media type for trend keywords based on a naïve Bayes classifier, we first find the candidate contents using a simple keyword matching algorithm. Then, we collect four types of relevant multimedia contents including TV programs from mc2xml [21], videos from YouTube [33], news articles from Google News, and images from Google Images. By using these contents as a training set, we classify the most suitable media type for each trend keyword set using the Media Type Determiner. Let W be a set of trends and their related keywords, as obtained in section 3.1. Then, W can be restated as follows.

$$W = \{\, k_{\mathrm{Trend}},\ K_{\mathrm{Twitter}},\ K_{\mathrm{Google}},\ K_{\mathrm{Wiki}} \,\} \tag{1}$$

Here, k_Trend is a main trend keyword and K_Twitter is a set of related keywords detected from Twitter raw data. K_Google and K_Wiki are keyword sets selected from Google and Wikipedia, respectively. Every keyword except the main trend keyword represents a set of words, and therefore, it may contain multiple words. M is a set of media type classes. We can formally define four media types as follows.

$$M = \{\, m_{\mathrm{TV}},\ m_{\mathrm{Video}},\ m_{\mathrm{News}},\ m_{\mathrm{Image}} \,\} \tag{2}$$

Table 1 Keyword variants

Variant type | Full keyword | Variant
Acronym | NEW YEARS EVE | NYE
Acronym | CALL OF DUTY | COD
Acronym | NEW YORK CITY | NYC
Acronym | DESIGNATED DRUNK DRIVER | DDD
Reduction | CHRIS BROWN | CHRIS
Reduction | WATERLOO ROAD | WATERLOO
Reduction | STEVEN GERRARD | GERRARD
Reduction | LAS VEGAS | VEGAS
Typo | HIP HOP | HIP POP
Typo | COLLIN KLEIN | COLIN KLEIN
Spacing | KSTATE | K_STATE
Spacing | KEVIN_PRINCE BOATENG | KEVINPRINCE BOATENG

For a keyword set W, the most suitable media type is defined as the maximum a posteriori (MAP) estimate M_MAP in Eq. (3):

$$M_{\mathrm{MAP}} = \operatorname*{argmax}_{m \in M} P(m \mid W) \tag{3}$$

M_MAP can be expanded to Eq. (4) using Bayes's rule. The denominator P(W) does not depend on m and can therefore be dropped from the maximization:

$$M_{\mathrm{MAP}} = \operatorname*{argmax}_{m \in M} \frac{P(W \mid m)\, P(m)}{P(W)} = \operatorname*{argmax}_{m \in M} P(W \mid m)\, P(m) = \operatorname*{argmax}_{m \in M} P(m) \prod_{w \in W} P(w \mid m) \tag{4}$$

A naïve Bayes classifier is a simple probabilistic classifier based on the application of Bayes's theorem under strong independence assumptions. Despite their naïve design and apparently oversimplified assumptions, naïve Bayes classifiers have worked quite well in many complex real-world situations [22]. In our model, P(W|m) P(m) is simplified to P(m) Π_{w∈W} P(w|m) by the naïve conditional independence assumption: although the keyword set W includes multiple keywords, we assume that they are independent of each other given the media type. Accordingly, we can calculate these probabilities quickly using simple multiplications. Because we deal with a large amount of SNS data, this characteristic is well suited to our proposed scheme. P(m) and P(w|m) in Eq. (4) can be obtained easily by calculating the term frequencies as shown in Eq. (5).

$$P(m) = \frac{n(C_m)}{n(C_M)}, \qquad P(w \mid m) = \frac{\mathrm{cnt}(w, m)}{\sum_{w' \in V} \mathrm{cnt}(w', m)} \tag{5}$$

The function n(C_m) indicates the number of contents of media type m, n(C_M) is the number of contents of all media types, and cnt(w, m) is the number of occurrences of keyword w in the contents of media type m. V is the set of keywords appearing in the contents of a media type. However, if the contents of a certain media type do not contain a specific keyword, the probability of this keyword will be 0. Thus, we apply an additive smoothing technique to Eq. (5) to obtain Eq. (6).

$$\hat{P}(w \mid m) = \frac{\mathrm{cnt}(w, m) + 1}{\sum_{w' \in V} \bigl(\mathrm{cnt}(w', m) + 1\bigr)} = \frac{\mathrm{cnt}(w, m) + 1}{\sum_{w' \in V} \mathrm{cnt}(w', m) + |V|} \tag{6}$$

Here, |V| indicates the number of unique keywords in the entire contents of the current media type. If the number of words in the document is large, the comparison may not be successful owing to the very low probability values obtained by repetitive multiplication. Therefore, we replace the product of probabilities with a sum of log-probabilities, as given in Eq. (7).

$$M_{\mathrm{NB}} = \operatorname*{argmax}_{m \in M} \Bigl[ \log P(m) + \sum_{w \in W} \log \hat{P}(w \mid m) \Bigr] \tag{7}$$
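Eqs. (5) to (7) can be illustrated with a minimal Python sketch of a multinomial naïve Bayes classifier with add-one smoothing. It is not the prototype's implementation: the function names and the toy training contents are hypothetical, and the sketch uses one vocabulary shared by all media types, which is the textbook choice and an assumption here.

```python
import math
from collections import Counter


def train(contents_by_type):
    """contents_by_type maps a media type to a list of keyword lists (one per content)."""
    total = sum(len(docs) for docs in contents_by_type.values())
    priors, counts = {}, {}
    for m, docs in contents_by_type.items():
        priors[m] = len(docs) / total                        # P(m), Eq. (5)
        counts[m] = Counter(w for doc in docs for w in doc)  # cnt(w, m)
    return priors, counts


def log_score(keywords, m, priors, counts, vocab_size):
    """log P(m) + sum over w of log P^(w|m), with add-one smoothing (Eqs. 6 and 7)."""
    denom = sum(counts[m].values()) + vocab_size
    score = math.log(priors[m])
    for w in keywords:
        score += math.log((counts[m][w] + 1) / denom)
    return score


def classify(keywords, priors, counts):
    """Return the media type with the highest log score (the argmax of Eq. 7)."""
    vocab = set().union(*(c.keys() for c in counts.values()))
    return max(priors, key=lambda m: log_score(keywords, m, priors, counts, len(vocab)))


# Toy training contents, made up for illustration.
toy = {
    "news": [["messi", "injury", "barcelona"], ["korea", "army"]],
    "video": [["niall", "horan", "facts"]],
    "image": [["messi", "contract"]],
}
priors, counts = train(toy)
print(classify(["messi", "injury"], priors, counts))
```

Working in log space keeps the scores comparable even when the keyword set is long, which is the motivation given for Eq. (7).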

If we use these equations for keyword matching, it is possible to calculate, for a keyword set, a probability indicating how suitable each media type is for that keyword set.

3.3 Media type determiner: advanced model

The basic Media Type Determiner model can determine the most suitable media type for a certain keyword set. Despite its simplicity, this basic model gives the same result for all users under the same conditions. That is, this model would be unsuitable for users who have a preference for a specific media type. Such user preference can be considered in the selection by adjusting the probability of the preferred media type. Furthermore, by using the characteristics of related keywords that are extracted from Google and Wikipedia, our model can be used more flexibly. To increase the probability of the media type that a user prefers, the probability P(m) in Eq. (5) is changed as follows.

$$\hat{P}(m) = \frac{n(C_m)}{n(C_M)} \cdot \alpha_m \tag{8}$$

Here, α_m is a positive constant value for each media type that is predetermined by the user preference; its default value is 1. When a user selects a certain media type as his/her preferred media type, the probability of the media type is increased by increasing its constant value. On the other hand, if a user does not prefer a particular media type, the probability of the media type is decreased by decreasing its constant value α_m. Through this adjustment, we can consider the user preference during the selection of the media type.

During keyword detection, we looked up Google and Wikipedia for the semantic expansion of trend keywords because the properties of Google/Wikipedia keywords differ from those of Twitter trend keywords. Tweets usually contain the latest and hottest keywords, and therefore, such keywords are few in number and limited in semantic coverage. For example, during the 2012 London Olympic Games, there were a few Olympics-related trend keywords on Twitter. At the same time, Wikipedia and Google contain static and general data such as population, geography, history, and demography. It is almost impossible that such data will be the hottest or latest keyword on Twitter. We can give such keywords more weight in the selection of the media type for a variety of purposes. For this purpose, the probability P̂(w|m) is recalculated as follows:

$$\hat{P}(w \mid m) = \begin{cases} \dfrac{\mathrm{cnt}(w, m) + 1}{\sum_{w' \in V} \mathrm{cnt}(w', m) + |V|} \cdot \beta, & \text{if } w \in W_i \\[2ex] \dfrac{\mathrm{cnt}(w, m) + 1}{\sum_{w' \in V} \mathrm{cnt}(w', m) + |V|}, & \text{if } w \notin W_i \end{cases} \tag{9}$$

Here, we consider a set of keywords W_i that are of interest to the user. These keywords are either trend keywords from Twitter or related keywords from Google/Wikipedia. The constant β is a positive value assigned to each keyword for adjusting its probability. Hence, when a keyword belongs to the set W_i, this will affect the media type classification and content selection by Eq. (9).

3.4 Contents selector

We used the Media Type Determiner to decide the most suitable media type for a set of a trend and its related keywords. We then use the Contents Selector to select the most appropriate contents from a variety of multimedia contents of the selected media type. In this study, we consider four media types: TV program, video, news article, and image. Because each media type has different characteristics, we need a different rating method for each media type. The suitability of certain content c for recommendation is calculated by the following equation.

$$S(c) = \sum_{w \in W} \log \hat{P}(w \mid m) + \log g_m \tag{10}$$

S(c) is calculated based on the smoothed probabilities P̂(w|m). We can measure the relevance between the specific keyword set and the content using this equation. Here, g_m is used to consider the different characteristics of each media type. For instance, all news articles were written in the past. On the other hand, we can obtain information about past and future TV programs using Electronic Program Guide (EPG) data. The importance of contents for these two media types depends on their creation time. For example, let us assume that we have two news articles that have the same keywords but different creation times. In this case, the recently created article will likely contain the latest and hottest information, and therefore, it would be more important than the older article. Thus, in the case of these two media types, we use the following formula to consider the time difference information.

$$g_{\mathrm{TV/News}} = 1 - \frac{\min\bigl(7,\ \lvert D_{\mathrm{current}} - D_{\mathrm{written}} \rvert\bigr)}{7} \tag{11}$$

Here, D_current and D_written indicate the current date and content creation date, respectively. If the two dates are the same, the value of this factor will be 1, which is the maximum value. On the other hand, if the two dates differ by 7 days or more, the value of this factor will be 0, which is the minimum value. For example, content created 3 days ago receives g = 1 − 3/7 ≈ 0.57. Therefore, this method ensures that new contents are selected more frequently than old contents.

$$g_{\mathrm{Video}} = \frac{n(h_{\mathrm{current}})}{n(h_{\mathrm{max}})} \tag{12}$$

For video contents, we consider the hits value of each video as its popularity and use Eq. (12) for calculating the relative hits of each video compared to the maximum hits of the candidate media contents. Here, n(h_current) and n(h_max) respectively indicate the hits value of the current content and the maximum hits value of the collected videos.

$$g_{\mathrm{Image}} = \frac{n(P_{\mathrm{matched\_rep}})}{n(P_{\mathrm{rep}})} \tag{13}$$

Finally, for image contents, we use the representative point method proposed in our previous studies [14, 15]. Previously, we proposed an efficient multi-object recognition scheme based on the interest points of objects and their feature descriptors. In this scheme, we first define a set of object types of interest and collect their sample images for training the learning machine. For each sample image of an object type, we detect interest points and construct their feature descriptors using SURF. Next, we perform a statistical analysis of the local features to select representative points among them. Intuitively, the representative points of an object are those interest points that best characterize the object. In Eq. (13), n(P_rep) indicates the total number of representative points and n(P_matched_rep) is the number of matched representative points. By using the candidate image contents as a training set, we extract the representative points. Then, for each candidate image, we count how many representative points exist in the image and use the ratio of this value to the total number of representative points as an index for the selection of image content. Informally speaking, image content that has more representative points is more likely to be selected. Through these steps, the most appropriate multimedia content will be selected for a trend keyword set.

3.5 Contents summarizer

In this section, we describe our method for displaying trend keywords and their multimedia contents on the screen effectively. As shown in Table 2, each main trend keyword has a score that is based on its term frequency. It is reasonable to consider that a trend keyword with a higher score is more important than the other keywords, and therefore, it should have greater visibility on the screen. For this purpose, we use the TreeMap algorithm. This algorithm represents multiple rectangles in proportion to their values. For each trend keyword, the TreeMap allocates one rectangle whose size is proportional to its score and fills the rectangle with the selected multimedia content and related keywords from Google and Wikipedia. Depending on the rectangle size, the user can easily guess how trendy its content is. Furthermore, the full content can be browsed by clicking the corresponding box.
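The proportional layout can be illustrated with a short Python sketch. It uses a simple recursive binary split, which is one of several treemap layouts and is an assumption here, not necessarily the variant of [28]; the function name `treemap` and the 100 × 60 canvas are hypothetical, while the three scores are those listed in Table 2.

```python
def treemap(items, x, y, w, h):
    """Lay out (label, score) pairs inside the rectangle (x, y, w, h).

    Returns a list of (label, x, y, w, h) whose areas are proportional to the
    scores. Assumes a non-empty list with strictly positive scores.
    """
    if len(items) == 1:
        return [(items[0][0], x, y, w, h)]
    total = sum(score for _, score in items)
    # Put items into the first group until it holds about half of the total;
    # the last item always stays in the second group so neither group is empty.
    running, split = 0.0, 1
    for i, (_, score) in enumerate(items[:-1], start=1):
        running += score
        split = i
        if running >= total / 2:
            break
    first, second = items[:split], items[split:]
    frac = sum(score for _, score in first) / total
    if w >= h:  # cut across the longer side to avoid thin strips
        return (treemap(first, x, y, w * frac, h)
                + treemap(second, x + w * frac, y, w * (1 - frac), h))
    return (treemap(first, x, y, w, h * frac)
            + treemap(second, x, y + h * frac, w, h * (1 - frac)))


trends = [("ANDY MURRAY", 631), ("ED REED", 243), ("911 VICTIMS", 229)]
layout = treemap(trends, 0, 0, 100, 60)
for label, rx, ry, rw, rh in layout:
    print(f"{label}: x={rx:.1f} y={ry:.1f} w={rw:.1f} h={rh:.1f}")
```

Each rectangle's area is the canvas area times the keyword's share of the total score, so the most frequent trend keyword occupies the largest box.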

4 Results and discussion

We implemented a prototype system based on our scheme using MathWorks MATLAB R2011b, and we evaluated its performance through various experiments on a desktop PC with an Intel Core i7-2600K 3.4 GHz processor, 8 GB of RAM, and the Windows 7 Enterprise SP1 operating system. For the experiments, we collected tweets from the Twitter server using the Twitter Streaming API.

4.1 Comparison with Google trends

In the first experiment, we evaluated our trend keyword detection scheme by comparing its result with that of Google Trends, which shows several hot search keywords daily for some countries [12]. From 03/22/2013 to 04/04/2013, we collected 111 trend keywords using Google Trends, with an average of 7.93 keywords daily. For the same period, we detected 418 trend keywords using our scheme. We found 35 Google Trends keywords in our detected keywords list, for an overlap of 31.5 %. The overlap is modest, which we attribute to differences in usage purpose and in the characteristics of the users. Furthermore, we found that our scheme finds new trend keywords that cannot be found through Google or Wikipedia.

4.2 Trend and related keywords with scores

In the second experiment, we detected the trend and its related keywords from user tweets. For this purpose, we analyzed 255,064 tweets on 9/11/2012 between 10 and 11 AM, the results of which are shown in Table 2. Table 2 shows some trend keywords with their corresponding scores as well as some related keywords detected from Twitter, Google, and Wikipedia. As mentioned before, the properties of words that can be detected from Google/Wikipedia differ from those of words detected from Twitter. During the analysis period, Andy Murray, a famous tennis player, defeated Novak Djokovic to win the U.S. Open title. We could obtain the latest information related to this event from Twitter, but only general information from the other two sources. On the other hand, Twitter contained no related keywords for some trend keywords such as "911 VICTIMS."

Table 2 Trends and related keywords

Trend keyword/Score | Twitter | Google | Wikipedia
ANDY MURRAY/631 | GRAND SLAM, NOVAK DJOKOVIC, US OPEN | girlfriend, height, tennis, father, hockey | Andy Murray, Andy Murray (ice hockey), Andy Murray career statistics, Andrew Murray
ED REED/243 | NFL, INT | retiring, stats, brother, hit, contract | Ed Reed, Edward K. Reedy
911 VICTIMS/229 | (none) | stories, cantor fitzgerald, 911 videos, 911 survivors, 911 memorial | Victim Compensation Fund, Kenneth Feinberg, John Ashcroft, Special Master, September 11th Fund

4.3 Selection of media types

The constant value α_m significantly influences the recommendation results.
As mentioned before, when the user prefers a certain media type, our scheme increases the corresponding constant value. In this experiment, we measured the effect of αNews values from 0 to 2 using 1,161,052 user tweets collected from 4/1/2013 to 4/3/2013, with the α values of the other media types fixed. Table 3 shows the selection ratio of each media type for each αNews value; the values in each column sum to 1. The table shows that the selection ratio of news articles increases with αNews, confirming that user preferences can be reflected by changing the αm values.

Table 3 Effects of changing αm

| Media type | αNews = 0 | αNews = 0.5 | αNews = 1 | αNews = 1.5 | αNews = 2 |
| --- | --- | --- | --- | --- | --- |
| News articles | 0 (0/50) | 0.02 (1/50) | 0.08 (4/50) | 0.16 (8/50) | 0.26 (13/50) |
| TV programs | 0.42 (21/50) | 0.4 (20/50) | 0.4 (20/50) | 0.36 (18/50) | 0.34 (17/50) |
| Movies | 0.28 (14/50) | 0.28 (14/50) | 0.22 (11/50) | 0.18 (9/50) | 0.12 (6/50) |
| Images | 0.3 (15/50) | 0.3 (15/50) | 0.3 (15/50) | 0.3 (15/50) | 0.28 (14/50) |
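The role of αm in media type selection can be illustrated with a minimal sketch. It assumes that αm acts as a multiplicative weight on a per-type classifier score and that the type with the largest weighted score is selected; the exact formula is defined earlier in the paper and is not reproduced here. The function name `select_media_type` and the posterior values are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: a preference constant alpha_m rescales the score of
# each media type before the best type is chosen. This is an assumption
# about how alpha_m enters the decision, not the paper's exact formula.

def select_media_type(posteriors, alphas):
    """Return the media type with the largest alpha-weighted score.

    posteriors: dict mapping media type -> classifier score
    alphas:     dict mapping media type -> preference constant
                (types missing from the dict default to 1.0)
    """
    weighted = {m: alphas.get(m, 1.0) * p for m, p in posteriors.items()}
    return max(weighted, key=weighted.get)


# Made-up classifier scores for a single trend keyword (not from the paper).
posteriors = {"news": 0.40, "tv": 0.25, "movie": 0.15, "image": 0.20}

for alpha_news in (0.0, 0.5, 1.0, 1.5, 2.0):
    alphas = {"news": alpha_news, "tv": 1.0, "movie": 1.0, "image": 1.0}
    print(alpha_news, select_media_type(posteriors, alphas))
```

Under this assumption, setting αNews to 0 removes news articles from consideration entirely, which is consistent with the 0/50 entry in the first column of Table 3, and larger αNews values let news articles win for more keywords.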

Table 4 Effects of changing β

| Trend keyword | βTwitter = 0 | βTwitter = 1 | βTwitter = 2 |
| --- | --- | --- | --- |
| LIONEL MESSI | Image: Lionel Messi agrees new contract to stay at Barcelona until 2018 … | Image: Lionel Messi agrees new contract to stay at Barcelona until 2018 … | News: Lionel Messi injury: Barcelona star could return against PSG |
| CARDINALS | Program: Fox 5 Sports Extra | Image: Up-Date On Louisville Cardinal Basketball Player Kevin Ware … | Image: Up-Date On Louisville Cardinal Basketball Player Kevin Ware … |
| NIALL | Movie: Niall Horan - Facts | Movie: Niall Horan - Facts | Movie: Niall - I’m just trying to unwind |
| NORTH KOREA ARMY | Image: A look back at North Korea - Ct Post | News: North Korea could inflict significant damage - ArmyTimes.com | News: North Korea could inflict significant damage - ArmyTimes.com |
| JOHN CENA | Image: john cena - hello my name is hasib | Image: classified \| Dull and Boring | Image: classified \| Dull and Boring … |

4.4 Selection of media contents
The constant value β also significantly influences the selection of contents. As mentioned before, the properties of words that can be selected from Google/Wikipedia differ from those of words that can be selected from Twitter. Thus, if we change the β value of a certain keyword set (Twitter or Google/Wikipedia), we may obtain different multimedia contents. We used 930,592 user tweets from 4/1/2013 to 4/8/2013 for this experiment. Table 4 shows the effect of changing the βTwitter value. Because tweets usually contain only the latest and hottest keywords, Twitter-derived keywords are limited in number and coverage. The results in the table confirm that a higher βTwitter value leads to the selection of more recent contents. For example, Lionel Messi, a famous football player of F.C. Barcelona, was injured in the Champions League quarter-final game on 4/3/2013. With a higher βTwitter value, we obtained news articles about this injury. However, with a lower βTwitter value, we obtained other news articles written on 12/18/2012.
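The effect of β can be sketched in the same hedged way. The sketch below assumes that each related-keyword set (Twitter or Google/Wikipedia) contributes keyword hits to a content score, weighted by that set's β value; this is an illustrative assumption, not the paper's scoring formula. The function names and the related-keyword lists are hypothetical; only the two candidate titles are taken from Table 4.

```python
# Hypothetical sketch: beta weights the contribution of each keyword source
# when scoring candidate contents. The keyword lists are invented.

def score_content(title, keywords_by_source, betas):
    """Sum beta-weighted counts of source-specific keywords found in a title."""
    text = title.lower()
    score = 0.0
    for source, keywords in keywords_by_source.items():
        hits = sum(1 for kw in keywords if kw.lower() in text)
        score += betas.get(source, 1.0) * hits
    return score


def pick_content(candidates, keywords_by_source, betas):
    """Return the candidate title with the highest beta-weighted score."""
    return max(candidates,
               key=lambda c: score_content(c, keywords_by_source, betas))


# Candidate titles for the trend keyword LIONEL MESSI (from Table 4).
candidates = [
    "Lionel Messi agrees new contract to stay at Barcelona until 2018",
    "Lionel Messi injury: Barcelona star could return against PSG",
]

# Invented related keywords: recent ones from Twitter, general ones from
# the portal sites (Google/Wikipedia).
keywords_by_source = {
    "twitter": ["injury", "PSG"],
    "portal": ["Barcelona", "contract", "2018", "stay"],
}

for beta_twitter in (0.0, 1.0, 2.0):
    betas = {"twitter": beta_twitter, "portal": 1.0}
    print(beta_twitter, pick_content(candidates, keywords_by_source, betas))
```

With these invented keyword lists, the contract story is picked for βTwitter = 0 and 1 and the injury story for βTwitter = 2, mirroring the LIONEL MESSI row of Table 4. The match depends entirely on the assumed keywords and should not be read as a reproduction of the experiment.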

Fig. 2 User interface for browsing trend with content


4.5 User interface for summarizing trends
In this study, we developed a Web user interface for summarizing current trends. To represent the popularity of each trend, we adopted the TreeMap algorithm. We used Cowie’s Js-Treemap [6], a JavaScript implementation that handles XML-type data. Both a trend and its contents are represented as boxes in the TreeMap based on their popularity. The contents of each trend include its representative image and related tag information. Related Web pages or media are shown when the user clicks a trend in the TreeMap. Figure 2 shows the prototype implementation, which is available at http://mil.korea.ac.kr/TrendsSummary. The site works properly with Google Chrome, Mozilla Firefox, Apple Safari, and other browsers.
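To make the idea of popularity-proportional boxes concrete, the following is a minimal sketch of a one-level slice layout in the spirit of the tree-map idea [28]. It is not the layout used by Js-Treemap, whose algorithm may differ. The function name `slice_layout`, the canvas size, and the popularity scores are hypothetical.

```python
# Hypothetical sketch of a one-level "slice" treemap layout: a rectangle is
# cut into strips whose areas are proportional to the item weights.

def slice_layout(items, x, y, width, height, side_by_side=True):
    """Split a rectangle into strips proportional to item weights.

    items: list of (label, weight) pairs with positive weights.
    Returns a list of (label, x, y, w, h) rectangles.
    """
    total = sum(weight for _, weight in items)
    rects = []
    offset = 0.0
    for label, weight in items:
        share = weight / total
        if side_by_side:
            # Strips are placed left to right and share the full height.
            rects.append((label, x + offset, y, width * share, height))
            offset += width * share
        else:
            # Strips are stacked top to bottom and share the full width.
            rects.append((label, x, y + offset, width, height * share))
            offset += height * share
    return rects


# Made-up popularity scores for three trend keywords.
trends = [("LIONEL MESSI", 50), ("NIALL", 30), ("JOHN CENA", 20)]

for label, rx, ry, rw, rh in slice_layout(trends, 0, 0, 800, 400):
    print(f"{label}: x={rx:.0f}, y={ry:.0f}, w={rw:.0f}, h={rh:.0f}")
```

A nested treemap can be obtained by applying the same function to the contents of each trend inside that trend's rectangle, alternating the `side_by_side` flag at each level.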

5 Conclusions
In this study, we have proposed a platform called TrendsSummary for retrieving trendy multimedia contents and summarizing them. To retrieve trendy multimedia contents, we first detected trend keywords and their related keywords by analyzing the raw data collected from Twitter using the Twitter Streaming API. Then, we expanded them semantically by looking up portal sites such as Wikipedia and Google. Based on the expanded trend keywords, we collected four different types of multimedia contents from various websites. The most appropriate media type for each trend keyword was determined using a naïve Bayes classifier. Then, the most appropriate content was selected from among the contents of the selected media type. Finally, both the trend keywords and their multimedia contents were represented on the screen using the TreeMap algorithm for effective browsing. We implemented a prototype system and reported some of the experimental results. Future work involves extending our platform to package multiple media types and their trendy multimedia contents more effectively and concisely. We also wish to further investigate the user interface so that users can browse diverse multimedia contents more flexibly and intuitively.

Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2012627) and by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the C-ITRC (Convergence Information Technology Research Center) support program (NIPA-2013-H0301-13-3006) supervised by the NIPA (National IT Industry Promotion Agency).

References
1. Alvanaki F, Michel S, Ramamritham K, Weikum G (2012) See what’s enBlogue: real-time emergent topic identification in social media. In: Proceedings of the 15th International Conference on Extending Database Technology, New York, NY, USA. pp. 336–347
2. Alvanaki F, Sebastian M, Ramamritham K, Weikum G (2011) EnBlogue: emergent topic detection in web 2.0 streams. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. New York, NY, USA. pp. 1271–1274
3. Barzilay R, McKeown KR (2005) Sentence fusion for multidocument news summarization. Comput Linguist 31(3):297–328
4. Cano P, Koppenberger M, Wack N (2005) Content-based music audio recommendation. In: Proceedings of the 13th Annual ACM International Conference on Multimedia. New York, NY, USA, pp. 211–212
5. Chen HC, Chen ALP (2005) A music recommendation system based on music and user grouping. J Intell Inf Syst 24(2–3):113–132
6. Cowie M, Js-Treemap, http://js-treemap.sourceforge.net
7. De Francisci Morales G, Gionis A, Lucchese C (2012) From chatter to headlines: harnessing the real-time web for personalized news recommendation. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. New York, NY, USA, pp. 153–162
8. Eck D, Lamere P, Bertin-Mahieux T, Green S (2007) Automatic Generation of Social Tags for Music Recommendation. In: Neural Information Processing Systems Conference (NIPS) 20. MIT Press, Cambridge, MA, p M70
9. Fang F, Nargis P, Anindya D, Kaushik D, Debra V (2011) Detecting Twitter Trends in Real-Time. In: Proceedings of 21st Annual Workshop on Information Technologies and Systems (WITS). pp. 49–54
10. Gong Y (2001) Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA, pp. 19–25
11. Google, http://www.google.com
12. Google Trends, http://www.google.com/trends
13. Kim D, Kim D, Rho S, Hwang E (2013) Detecting trend and bursty keywords using characteristics of Twitter stream data. International Journal of Smart Home 7(1):209–220
14. Kim D, Rho S, Hwang E (2012) Local feature-based multi-object recognition scheme for surveillance. Engineering Applications of Artificial Intelligence 25(7):1373–1380
15. Kim D, Rho S, Hwang E (2013) Multi-camera-based security log management scheme for smart surveillance. Security and Communication Networks. doi:10.1002/sec.735 (published online)
16. Kleinberg J (2002) Bursty and hierarchical structure in streams. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA. pp. 91–101
17. Kuo BYL, Hentrich T, Good BM, Wilkinson MD (2007) Tag clouds for summarizing web search results. In: Proceedings of the 16th International Conference on World Wide Web. New York, NY, USA. pp. 1203–1204
18. Lai CF, Chang JH, Hu CC, Huang YM, Chao HC (2011) CPRS: a cloud-based program recommendation system for digital TV platforms. Future Generation Computer Systems 27(6):823–835
19. Liu J, Dolan P, Pedersen ER (2010) Personalized news recommendation based on click behavior. In: Proceedings of the 15th International Conference on Intelligent User Interfaces, New York, NY, USA. pp. 31–40
20. Mathioudakis M, Koudas N (2010) TwitterMonitor: trend detection over the twitter stream. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, New York, NY, USA. pp. 1155–1158
21. mc2xml, http://mc2xml.hosterbox.net
22. Naive Bayes classifier, https://en.wikipedia.org/wiki/Naive_Bayes_classifier
23. Otterbacher J, Radev D, Kareem O (2008) Hierarchical summarization for delivering information to mobile devices. Inf Process Manage 44(2):931–947
24. Phelan O, McCarthy K, Bennett M, Smyth B (2011) Terms of a feather: content-based news recommendation and discovery using twitter. In: Proceedings of the 33rd European Conference on Advances in Information Retrieval, Berlin, Heidelberg. pp. 448–459
25. Quercia D, Askham H, Crowcroft J (2012) TweetLDA: supervised topic classification and link prediction in Twitter. In: Proceedings of the 3rd Annual ACM Web Science Conference, New York, NY, USA. pp. 247–250
26. Shen D, Sun JT, Li H, Yang Q, Chen Z (2007) Document summarization using conditional random fields. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA. pp. 2862–2867
27. Shin C, Woo W (2009) Socially aware TV program recommender for multiple viewers. IEEE Transactions on Consumer Electronics 55(2):927–932
28. Shneiderman B (1992) Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans Graph 11(1):92–99
29. Si X, Sun M (2009) Tag-LDA for scalable real-time tag recommendation. Journal of Computational Information Systems 6(1):23–31
30. Twitter, http://en.wikipedia.org/wiki/Twitter
31. Wikipedia, http://www.wikipedia.org
32. Yokomoto D, Makita K, Suzuki H, Koike D, Utsuro T, Kawada Y, Fukuhara T (2012) LDA-based topic modeling in labeling blog posts with Wikipedia entries. In: Wang H, Zou L, Huang G, He J, Pang C, Zhang HL, Zhao D, Yi Z (eds) Web Technologies and Applications. Springer, Berlin Heidelberg, pp 114–124
33. YouTube, http://www.youtube.com


Daehoon Kim received his B.E. degree in Electrical Engineering from Korea University, Seoul, Korea, in 2006. He also received his MS in Electrical Engineering from Korea University, Seoul, Korea in 2008. Currently, he is a Ph.D. candidate at the Electrical Engineering faculty in the School of Electrical Engineering, Korea University, Korea. His current research interests include database, multimedia information retrieval, image processing, and ubiquitous computing.

Daeyong Kim received his B.E. degree in Electrical Engineering from Tsinghua University, Beijing, China, in 2011. Currently, he is an M.S. candidate at the Electrical Engineering faculty in the School of Electrical Engineering, Korea University, Korea. His current research interests include data mining, text processing, and big data processing.

Sanghoon Jun received his B.S. degree in Electrical Engineering from Korea University, Korea, in 2008. Currently, he is a Ph.D. candidate at the Electrical Engineering faculty in the School of Electrical Engineering. His current research interests include music retrieval and recommendation, multimedia systems, machine learning, emotional computing, and Web applications.

Seungmin Rho received his M.S. and Ph.D. degrees in Computer Science from Ajou University, Korea, in 2003 and 2008, respectively. In 2008–2009, he was a Postdoctoral Research Fellow at the Computer Music Lab of the School of Computer Science at Carnegie Mellon University. He worked as a Research Professor at the School of Electrical Engineering at Korea University during 2009–2011. He is currently on the faculty of Baekseok University, Cheonan, Korea. His research interests include database, music retrieval, multimedia systems, machine learning, knowledge management, and intelligent agent technologies. He has been a reviewer for Multimedia Tools and Applications (Springer), Journal of Systems and Software, and Information Science (Elsevier), and a Program Committee member for over 20 international conferences. He has published 40 papers in journals and book chapters and 45 in international conferences and workshops. He was listed in Who’s Who in the World in 2007–2009.

Eenjun Hwang received his B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1988 and 1990, respectively; and his Ph.D. degree in Computer Science from the University of Maryland, College Park, in 1998. From September 1999 to August 2004, he was with the Graduate School of Information and Communication, Ajou University, Suwon, Korea. Currently, he is a member of the faculty at the School of Electrical Engineering, Korea University, Seoul, Korea. His current research interests include database, multimedia systems, information retrieval, XML, and Web applications.