An Information Recommendation Agent on Microblogging Service

Taketoshi Ushiama and Tomoya Eguchi

Faculty of Design, Kyushu University, 4-9-1 Shiobaru, Minami-ku, Fukuoka 815-8540, Japan
{ushiama@, [email protected].}design.kyushu-u.ac.jp
http://www.design.kyushu-u.ac.jp/~ushiama/
Abstract. In this paper, we introduce an agent that recommends information to a user on Twitter, one of the most popular microblogging services. To recommend suitable information, it is important to extract the user's interests automatically and accurately, and to collect new information that interests and attracts the user. The agent introduced in this paper automatically extracts a user's interests from the tweets on the user's timeline and, from the same tweets, finds web sites that are likely to provide new information of interest to the user. The agent selects information to recommend from those web sites and posts it on the user's timeline. Experimental results show that our agent is able to recommend suitable information to Twitter users in a natural manner.

Keywords: Recommendation, microblogging service, Twitter, content-based filtering, automatic profile composition
1 Introduction
Today, a large amount of information can be found on the Web. Its volume is so large that it is difficult for people to find the information that interests or attracts them. Against this background, systems that help people obtain useful and interesting information are becoming increasingly important. In the last few years, microblogging services such as Twitter have become very popular [1]. Users of microblogging services post short messages of up to about one or two hundred characters. In this paper, we introduce a method for recommending information by means of Twitter. Twitter has some characteristic features that traditional blogging services do not have. Firstly, the length of a message posted on Twitter is limited to 140 characters, so users can post messages casually. The messages posted by a user may therefore contain a large number and variety of hints that are useful for inferring the user's interests. Secondly, Twitter provides a function for gathering the messages posted by people who are expected to provide information useful to a user. Such messages are listed in a timeline format, where messages flow in order of their posting date and time. Users
do not expect all messages on their timelines to be valuable to them. We think that the timeline format is suitable for recommendation, because users would tolerate recommended information on their timelines even when it is of relatively little use. In this paper, we propose a method to recommend information that a Twitter user would be interested in, based on his/her previously posted tweets. Using this method, users can obtain personalized, attractive information through their daily tweeting, without giving any explicit queries. This paper consists of the following sections. Section 2 outlines the approach of our method. Section 3 describes the details of the method. Section 4 shows some experimental results and discusses the effectiveness of the method. Section 5 gives conclusions and future directions.
2 Approach
Generally, recommendation methods can be categorized into two types: the content-based filtering approach and the collaborative filtering approach [2]. In the content-based filtering approach [3], a profile of a user is generated by analyzing the items that the user has selected. The profile represents the interests of the user. A recommendation system searches for information or items that match the profile and recommends them. Various techniques have been proposed for composing user profiles. One popular approach is based on the vector-space model [4]. This approach uses a vector to represent a user profile, and another vector to represent the features of each information item that is a candidate for recommendation. Recommendation systems based on the vector-space model filter information items according to the similarity between a user profile and an information item: if an information item is similar to the user profile, it is recommended to the user. In the collaborative filtering approach [5–7], recommendation systems do not have to use the features of the candidate information items. The basic idea of collaborative filtering is to recommend to a user A the information items that a user B has rated highly, if the profile of user A is similar to the profile of user B. In practice, recommendation systems calculate the similarity of every pair of users based on each user's evaluations of information items, and recommend to a user one or more information items that are rated highly by other users who can be considered similar to him/her. Content-based filtering techniques are difficult to apply to items, such as movies or shopping products, from which computers cannot easily extract high-level semantic features automatically.
This is because content-based filtering requires an adequate feature vector that represents the high-level semantic features of each recommendation item. Collaborative filtering techniques can be applied to such items because they do not use the features of the recommendation items directly but use users' evaluations of them. On the other hand,
collaborative filtering techniques have some drawbacks. One of them is that they cannot recommend new items that have not yet been evaluated by any user. Furthermore, the precision of recommendation may be low if the number of user evaluations of the recommendation items is small. The objective of the method proposed in this paper is to recommend information items that might interest or attract a user from among the information items newly uploaded to the Web. We cannot assume that such information items have been evaluated by enough users for effective recommendation by means of collaborative filtering techniques. Therefore, we employ a content-based filtering technique for our objective. When using content-based filtering techniques, it is important to compose an adequate profile of a user. Several techniques have been proposed for this purpose. One major approach composes user profiles from keywords given explicitly by users, and another is based on users' behavior, such as their access histories of web pages. The keyword-based approach imposes an extra burden on users, and we cannot always obtain adequate profiles from them. Moreover, even if a user once composes an adequate profile, the user has to update it manually when new interesting topics appear on the Web or when the user becomes interested in different topics. This is a further burden on the user. When users have to make an extra effort to update their profiles, the profiles tend to remain inadequate, which lowers the precision of recommendation. The objective of our system is to recommend, to a user who routinely uses microblogging services, information items that would interest or attract the user, without requiring an explicit specification of his/her profile. The system extracts the interests of the user automatically by analyzing the tweets that appear in the timeline of the user.
Furthermore, the system extracts references to web sites from the tweets. We consider the extracted references to be candidates for good information sources, where information items valuable to the user will be uploaded. The system monitors the referenced web sites for updates, and the updates become candidate information items for recommendation. The system checks whether each information item is valuable to the user based on the extracted interests, and decides which information items to recommend. Finally, the system posts the recommended information items on the timeline of the user. This method enables users to obtain information items that would interest or attract them in a natural manner, without any explicit operation for specifying their interests or preferences.
3 Information Recommendation Agent
This section introduces an agent for recommending information items on a microblogging service. The agent consists mainly of two components: a module for extracting the interests of a user, and a module for gathering and selecting information items to recommend to the user based on those interests. Fig. 1 shows the data flow of our recommendation system. Tweets posted by a user are fetched by his/her recommendation agent. The interest extraction module analyzes the tweets and extracts the user's interests. Then the recommendation module gathers information items from web sites and selects the items to recommend based on the extracted interests. The selected items are formatted as tweets and posted on the agent's timeline, which the user follows. The agent's tweets, which carry the recommended information, therefore appear on the user's timeline, where the user can read them.

Fig. 1. Data flow of the recommendation system

3.1 Extracting User's Interests
Tweets and User's Interests. Our method tries to extract the interests of a user from the tweets that the user has posted previously. This approach rests on the assumption that the tweets of a user reflect his/her interests. On Twitter, a user reports, as tweets, events that happen around him/her or simple personal opinions. The user does not specify who will read the tweets; most tweets can be read by anyone who wants to. Twitter provides a function for receiving the tweets posted by a target person. This function is called follow. On Twitter, users try to gather information items that interest and attract them by following
other users who are expected to provide such information items. Because of this mechanism, many tweets are posted without expecting a reply from other people. A user can stop following another user whenever he/she wants to. Because of these features, the initiative in communication on Twitter lies with the readers. Therefore, Twitter users can easily post their candid opinions as tweets without worrying about how readers will react. Given the above features of Twitter, we consider that the tweets of a user reflect the interests of the user. Therefore, in our method, the system extracts the interests of a user by analyzing tweets and recommends information items based on the extracted interests.

Extracting User's Interests from Tweets. Our method for extracting the interests of a user from tweets consists of the following steps:

1. obtaining tweets of users,
2. morphological analysis on the tweets,
3. selecting keywords, and
4. expanding keywords.
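Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the regular-expression tokenizer and the small stop-word list are stand-ins, suitable only for English text, for the morphological analyzer and noun filter that the agent uses, and the names `extract_candidates`, `count_candidates` and `STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for a part-of-speech (noun) filter.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "in", "on", "for", "at", "my", "i"}

# URLs are removed before tokenization so their fragments are not counted.
URL_PATTERN = re.compile(r"https?://\S+")
# Alphabetic word tokens, allowing internal apostrophes and hyphens.
WORD_PATTERN = re.compile(r"[a-z][a-z'-]*")


def extract_candidates(tweet: str) -> list[str]:
    """Return the keyword candidates of one tweet (stand-in for noun extraction)."""
    text = URL_PATTERN.sub(" ", tweet.lower())
    return [w for w in WORD_PATTERN.findall(text) if w not in STOP_WORDS]


def count_candidates(timeline: list[str]) -> Counter:
    """Count candidate occurrences over all tweets on a timeline."""
    counts = Counter()
    for tweet in timeline:
        counts.update(extract_candidates(tweet))
    return counts


timeline = [
    "Reading a paper on recommendation http://example.com/paper",
    "The recommendation agent is running",
]
counts = count_candidates(timeline)
```

For Japanese tweets, a real morphological analyzer with part-of-speech tags would replace `extract_candidates`; the counting step would stay the same.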
Firstly, the agent collects all tweets from the timeline of the user to whom information is to be recommended. Two types of tweets exist in the timeline: the tweets posted by the user and the tweets posted by others whom the user follows. Then the agent decomposes the collected tweets into words by morphological analysis and extracts the nouns. These nouns are candidates for the keywords that represent the interests of the user. Each extracted noun is assigned a weight. In document retrieval, the TF-IDF method has conventionally been used for calculating the weight of a term in a document within a document set [8]. The TF-IDF value of a term t in a document d is defined by the following formula:

\[
\mathrm{tfidf}(t, d) = \frac{ft_{d,t}}{\sum_{i} ft_{d,i}} \cdot \log \frac{|D|}{fd_{t}}, \tag{1}
\]
where $ft_{d,t}$ is the number of occurrences of the term t in the document d, D is the target set of documents, and $fd_{t}$ is the number of documents in D that contain the term t. The TF-IDF method is designed on the assumption that a term which occurs frequently in a document represents an important concept in that document, and that a term which is contained in few documents represents a characteristic concept in the document set. For our objective, the interests of a user must be extracted from the tweets on the timeline of the user. The TF-IDF method cannot be applied directly to our purpose, because it represents the importance of a term in a single document within a document set, whereas the timeline of a user contains many tweets (documents). Therefore, we introduce a novel weighting scheme called TF-IUF. The TF-IUF value of a term t for a user u is defined by the following formula:

\[
\mathrm{tfiuf}(t, u) = \frac{\sum_{tw \in TW_u} ft_{tw,t}}{\sum_{i} \sum_{tw \in TW_u} ft_{tw,i}} \cdot \log \frac{|U|}{fu_{t}}, \tag{2}
\]
where $TW_u$ is the set of tweets on the timeline of the user u, U is the set of users, and $fu_{t}$ is the number of users whose tweets contain the term t. We employ the TF-IUF value to weight the importance of a term for a user, and consider that it reflects the degree of the user's interest in the topic represented by the term. The TF-IUF method is designed on the assumption that a term used frequently in the tweets on the timeline of a user concerns a topic he/she is interested in, and that a term which seldom appears on the timelines of others represents a characteristic interest of the user.

Keywords that represent the interests of a user are selected based on the weights of terms. The top n terms in the list ordered by weight are treated as the keywords, where n is a threshold given by the user. Some of the selected keywords might be too specialized to represent an aspect of the user's interests. Therefore, each keyword is expanded in order to recommend information in the areas that the user is interested in. For the keyword expansion, we use the Yahoo web search API (footnote 1), which provides a function that returns the category of a term.

3.2 Recommending Information
The agent recommends information to a user based on the extracted interests of the user. The information recommendation consists of the following steps:

1. finding information sources (web sites),
2. obtaining information items from the information sources,
3. selecting information items to be recommended,
4. composing recommendation messages, and
5. posting the messages as tweets for the user.
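Step 1 depends on finding URLs in tweets and on detecting whether a referenced page advertises a feed. A minimal sketch follows. It assumes that a site announces its feed with a `<link rel="alternate" type="application/rss+xml">` element in its HTML, which is a common convention that the paper does not specify. The names `extract_urls`, `FeedLinkParser` and `find_feeds` are hypothetical, and fetching the page over the network is omitted.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(tweets):
    """Collect the distinct URLs mentioned in a list of tweets, in order."""
    seen = []
    for tweet in tweets:
        for url in URL_PATTERN.findall(tweet):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            if url not in seen:
                seen.append(url)
    return seen


class FeedLinkParser(HTMLParser):
    """Collects RSS feed URLs advertised by <link rel="alternate"> elements."""

    FEED_TYPES = {"application/rss+xml"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if "alternate" in rel and attr.get("type", "").lower() in self.FEED_TYPES:
            if attr.get("href"):
                # Resolve relative feed links against the page URL.
                self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return the feed URLs advertised in an HTML page (empty list if none)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.feeds


page = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/feed.xml"></head><body></body></html>')
urls = extract_urls(["Nice article http://example.com/a.",
                     "See http://example.com/a and http://example.org/b"])
feeds = find_feeds(page, "http://example.com/blog/post")
```

A site whose page yields an empty list from `find_feeds` would simply be skipped as an information source in this sketch.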
In order to recommend information to a user, the agent has to find information sources that are likely to provide information items attractive to the user. To find such information sources, our method utilizes the URLs that appear in tweets on the timeline of the user. Many Twitter users include one or more URLs in their tweets for reference when they want to introduce interesting topics found on the Web and to give their opinions about them. The timeline of a user contains the tweets of the other users followed by the user. Some of those tweets also contain URLs of pages on topics that may interest the user. Therefore, the agent extracts the URLs on the timeline of the user and utilizes them as clues for finding information sources. The agent accesses the extracted URLs and checks whether each web site provides RSS feeds. RSS is a mechanism for notifying users of new information items on a web site. Users can obtain summaries and simple metadata of new information items from RSS feeds. When a web site provides RSS feeds, the agent periodically fetches the RSS feed and obtains new information items, which are candidates for recommendation.

Footnote 1: http://developer.yahoo.co.jp/webapi/search/websearch/v2/websearch.html

The agent examines whether each of the obtained information items should be recommended. Firstly, an information item is decomposed into nouns by morphological analysis. Then the feature vector of the information item, called the information vector, is composed based on the keyword vector of the user. When a keyword vector k is given, the information vector of an information item d is defined as follows:

\[
iv_d = (\mathrm{tfidf}(k_1, d), \cdots, \mathrm{tfidf}(k_n, d)), \tag{3}
\]
where $k_i$ is the i-th element of the keyword vector k. Whether an information item should be recommended is decided by calculating the similarity between its information vector and the interest vector of the user. The interest vector represents the interests of a user. The interest vector of a user u is defined by the following formula:

\[
it_u = (\mathrm{tfiuf}(k_1, u), \cdots, \mathrm{tfiuf}(k_n, u)), \tag{4}
\]
where $k_i$ is the i-th element of the keyword vector k. The similarity of an information vector $iv_d$ and an interest vector $it_u$ is defined as follows:

\[
\mathrm{sim}(iv_d, it_u) = \frac{iv_d \cdot it_u}{|iv_d| \, |it_u|}. \tag{5}
\]
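Equations (1) to (5), together with the top-n keyword selection of Section 3.1, can be sketched in a few functions. This is an illustrative sketch, not the prototype's implementation. It assumes that documents and tweets are already lists of terms, that `timelines` maps each user to the tweets on his/her timeline, and that $fu_t$ is counted over those same timelines. All function names, the toy data and the threshold value `0.5` are hypothetical.

```python
import math


def tfidf(term, doc, docs):
    """Eq. (1): doc is a list of terms, docs is the document set D."""
    fd = sum(1 for d in docs if term in d)  # documents containing the term
    if not doc or fd == 0:
        return 0.0
    return doc.count(term) / len(doc) * math.log(len(docs) / fd)


def tfiuf(term, user, timelines):
    """Eq. (2): timelines maps each user to a list of tweets (lists of terms)."""
    tweets = timelines[user]
    total = sum(len(tw) for tw in tweets)  # all term occurrences in TW_u
    fu = sum(1 for tws in timelines.values() if any(term in tw for tw in tws))
    if total == 0 or fu == 0:
        return 0.0
    tf = sum(tw.count(term) for tw in tweets) / total
    return tf * math.log(len(timelines) / fu)


def select_keywords(user, timelines, n):
    """Top-n terms on the user's timeline ordered by TF-IUF weight."""
    terms = {t for tw in timelines[user] for t in tw}
    return sorted(terms, key=lambda t: (-tfiuf(t, user, timelines), t))[:n]


def cosine(a, b):
    """Eq. (5); returns 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def similarity(item, items, user, timelines, keywords):
    """Similarity between an item's information vector and the interest vector."""
    iv = [tfidf(k, item, items) for k in keywords]      # Eq. (3)
    it = [tfiuf(k, user, timelines) for k in keywords]  # Eq. (4)
    return cosine(iv, it)


# Toy data (hypothetical): three users and two candidate information items.
timelines = {
    "alice": [["jazz", "concert"], ["jazz", "guitar"]],
    "bob": [["soccer", "match"], ["concert", "ticket"]],
    "carol": [["soccer", "news"]],
}
items = [["jazz", "festival", "jazz"], ["soccer", "score"]]

keywords = select_keywords("alice", timelines, 2)
THRESHOLD = 0.5  # arbitrary value chosen for this illustration
recommended = [d for d in items
               if similarity(d, items, "alice", timelines, keywords) > THRESHOLD]
```

With this toy data only the first item shares a keyword with the interest vector, so only it passes the threshold. Items that contain none of the keywords produce a zero information vector, which `cosine` maps to a similarity of 0.0 rather than dividing by zero.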
If the similarity is greater than the threshold t, the information item is recommended. This similarity is calculated for every information item to decide whether the item should be recommended. Then a tweet is generated to announce each recommended information item. The tweet consists of the title of the item and its URL, which are obtained from the RSS feed. When the URL is too long, it is shortened by a URL shortening service. The generated tweet is posted on Twitter from the account of the agent and is listed on the timeline of the user. If the user is interested in the tweet, he/she can click the URL in it, and the web page of the recommended information item is displayed.
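The tweet composition step can be sketched as below. The 140-character limit comes from the Introduction. Truncating an overlong title with an ellipsis is an assumption of this sketch, since the paper only mentions shortening the URL, and the name `compose_tweet` is hypothetical. The call to a URL shortening service is left to the caller.

```python
MAX_TWEET_LENGTH = 140  # message length limit stated in the Introduction


def compose_tweet(title, url, limit=MAX_TWEET_LENGTH):
    """Build "<title> <url>", truncating the title so the tweet fits the limit."""
    room = limit - len(url) - 1  # one character for the separating space
    if room < 4:
        # Not enough space for even a truncated title: the caller should
        # shorten the URL first (for example with a URL shortening service).
        raise ValueError("URL is too long for a tweet; shorten it first")
    if len(title) > room:
        # Assumption of this sketch: cut the title and mark the cut with "...".
        title = title[: room - 3].rstrip() + "..."
    return f"{title} {url}"


short = compose_tweet("New jazz festival announced", "http://example.com/a")
long_tweet = compose_tweet("x" * 200, "http://example.com/a")
```

Keeping the URL intact and cutting only the title preserves the link that the user clicks to reach the recommended item.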
4 Evaluation
In order to evaluate the effectiveness of the proposed method, we implemented a prototype of the agent and conducted user studies.

4.1 Prototype System
A prototype of the information recommendation agent was implemented on a personal computer running Windows 7 with an Intel Core 2 Duo E4500 CPU and 4 GB of memory. The agent program is written in the Perl programming language using the Twitter REST API and the Twitter Streaming API (footnote 2). The agent has its own account on Twitter. Thus, a user can receive recommended information from the agent just by following the account. When the agent is followed by the user, the agent extracts the interests of the user from his/her timeline and then recommends information items to the user as tweets. The agent updates the interests of the user whenever a new tweet is posted. The agent recommends information once a day.

Footnote 2: http://dev.twitter.com/doc

4.2 Experimental Results

In order to evaluate the effectiveness of the proposed method, we conducted two types of user studies. One evaluates whether the proposed interest extraction method is able to extract keywords that reflect the interests of a user. The other evaluates whether the proposed information recommendation method is able to find and select information items that interest or attract a user.

Extraction of User's Interests. We had the agent extract the interests of each experimental subject from his/her timeline and select the top 15 keywords based on the TF-IUF scheme, and then asked the subject to rate each keyword on a scale of 1 to 5 according to how well it reflected his/her interests. There were seven subjects. Fig. 2 shows the results of this study. The average score was 3.5.

Fig. 2. Results of evaluation for extracted keywords
Fig. 3. Results of evaluation for recommended information items
Recommendation of Information. We had the agent find and select the top 10 information items to recommend to each experimental subject, based on the URLs and interests extracted from the subject's timeline, and then asked the subject to rate each recommended item on a scale of 1 to 5 according to how much it interested or attracted him/her. There were seven subjects. Fig. 3 shows the results of this study. The average score was 3.9.
4.3 Discussion
Of the extracted keywords, 59.2% were given positive grades (very good or fairly good) by the subjects. Many of the keywords with low grades are names of friends of the subjects. Excluding such names would improve the precision of interest extraction. However, such names are seldom contained in the information items that are candidates for recommendation, so we think that the effect of this noise on recommendation is small. Of the recommended information items, 72.5% were given positive grades (very good or fairly good). This ratio is greater than the ratio for the extracted keywords. This suggests that even if some unsuitable keywords are selected, their effect on recommendation performance is limited. For some of the recommended items that were given negative grades, the subjects could not understand the details of their contents from the tweets that introduced those items. This is because the technique for composing recommendation tweets is too simple: a tweet contains only the title and URL of the recommended
information item. We think that more elaborate techniques for composing recommendation tweets would improve the performance of recommendation.
5 Conclusion
In this paper, we introduced an information recommendation agent for Twitter users. The agent observes the tweets of a user and of the users he/she follows, and extracts the user's interests together with web sites that are likely to provide information items attractive to the user. We designed the TF-IUF method for extracting a user's interests. This method weights a term from the viewpoint of the user's interests. The agent accesses the web sites periodically and selects information items to recommend to the user. Then the agent posts tweets about the recommended information items on the timeline of the user. A user can start receiving recommendations from our prototype agent simply by following its Twitter account. The results of the user studies on the prototype show that our method is effective for recommendation. We plan to improve the precision of recommendation based on user feedback, such as clicks on recommended URLs. Moreover, we would like to extract the interests of communities based on the following networks on Twitter.
References

1. Jansen, B.J., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as electronic word of mouth. J. Am. Soc. Inf. Sci. Technol. 60 (November 2009) 2169–2188
2. Hanani, U., Shapira, B., Shoval, P.: Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction 11 (August 2001) 203–259
3. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22 (December 2000) 1349–1380
4. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18 (November 1975) 613–620
5. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Adv. in Artif. Intell. 2009 (January 2009) 4:2–4:2
6. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In: Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina, ACM (1994) 175–186
7. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating word of mouth. In: Proceedings of the SIGCHI conference on Human factors in computing systems. CHI '95, New York, NY, USA, ACM Press/Addison-Wesley Publishing Co. (1995) 210–217
8. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR '98, New York, NY, USA, ACM (1998) 275–281