Blog Content Based Recommendation Framework Using WordNet and Multiple Ontologies
Santosh Kumar Ray
Shailendra Singh
Department of Computer Science and Engineering BIT, Mesra, Muscat Oman
[email protected]
Technology Planning Group Samsung India Software Limited Noida, India
[email protected]
Abstract— Social networking portals such as Twitter, Facebook, and LinkedIn are becoming more popular by the day, and many more such portals are attracting users by catering to their specific needs. Users write blogs on these websites on a variety of topics according to their own and other users' interests. As a result, a large amount of interesting information is scattered across the Web, and it could be structured in a meaningful way to provide better services. This paper focuses on categorizing blog content into ten in-demand themes (Technology, Entertainment, News, Business, Health, Sports, Tourism, Widgets, Vehicles, and Products) so that information can be retrieved effectively from the categorized content. A user can also submit a specific query to retrieve information from blogs. We propose a WordNet and multiple ontologies based theme expansion approach and a concept combination based ranking algorithm for a blog content based recommendation framework that takes the original themes of blog content as input and recommends conceptually related expanded themes. The distinctive point of this research is its use of a rough set based concept combination approach to categorize retrieved results for the in-demand themes as well as for user-specific preferences. This kind of categorization should be effective for retrieving meaningful and conceptually related blog information written by a large number of users using different vocabularies. We experimented with the contents of top blogs related to each theme and obtained promising results (approximately 70% categorization precision).
Keywords: Social networking, Recommender system, Twitter, Facebook, LinkedIn, Query expansion, Rough sets, Ontologies, WordNet, and Swoogle.
I. INTRODUCTION
For the last two decades, the World Wide Web has been the chief source of information for everyone, from general users to prolific researchers. Although the number of web pages on the World Wide Web has crossed the trillion mark [4], the content of a web page is usually static and is not updated regularly. With the growth of World Wide Web based applications, the Web 2.0 framework was introduced for a variety of applications such as blogging, online gaming, social networking, knowledge sharing, and chat rooms. A large number of users use such services, share massive amounts of information with each other, and are becoming dependent on these services to fulfill their daily needs. Social networking discussions cover almost every domain, from the general to the very specific; [13] lists more than 150 popular social networking websites on a variety of topics. Well-known social networking websites such as Facebook [3], Orkut [6], LinkedIn [5], and Twitter [17] are popular with users ranging from school children to qualified professionals. These websites allow users to build relationships by joining one or more groups or communities and give them the freedom to write blogs on subjects of their choice. In a typical social networking website, Internet users are invited by existing members to join communities, groups, and people related to their interests. A user is free to explore and join such communities, and there is no limit on expanding one's social network: one can join multiple communities, groups, and people to get diversified information on different topics, read content related to one's interests, and contribute to blogs.
At present, social networking websites do not share information across websites, and consequently the information scattered over different topics cannot be processed effectively for further analysis. Secondly, users of social networking websites keep looking for other users who have similar interests. However, different users may describe their areas of interest using different terms, synonymous words, and conceptually related terms, so meaningful information may not be retrieved because of the gap between the syntax and the semantics of the terms that define their interests.
Another important issue is that the information needed by a user may not be available in his/her network communities but may be available in other networks or on the World Wide Web. These are the major drawbacks of present social networking approaches. In designing an efficient social network, the Semantic Web plays an important role in exchanging conceptual information. The Semantic Web represents World Wide Web data as a mesh linked in such a way that it can be easily processed by machines on a global scale; it can be thought of as a globally linked database. The Semantic Web is also growing very fast and provides several advanced tools that can be used effectively in social networking domains.
In this paper, we present a recommendation framework that continuously scans the blogs on social networking sites, specifically Twitter, Facebook, and LinkedIn, and classifies them into themes according to their contents. We consider ten general user oriented themes, namely Technology, Entertainment, News, Business, Health, Sports, Tourism, Widgets, Vehicles, and Products, to classify blog contents; a user may also add new themes to retrieve user-specific information from the blogs. The proposed framework takes these ten in-demand themes as input keywords and expands them conceptually with the help of WordNet [14] and multiple ontologies retrieved from the ontology search engine Swoogle [16]. This reduces the gap between the syntax and the semantics of the terms used by other connected users who have similar interests. The other contribution of the paper is a novel approach for matching relevant blog content with the expanded themes and ranking the retrieved results for each theme using rough set based concept combination. With the proposed framework, the user can get regular theme updates from more than one blog with a single click.
The rest of the paper is organized as follows: Section 2 briefly describes related work. Section 3 discusses the architecture of the proposed blog content based recommendation framework. Section 4 proposes the query expansion algorithm. Section 5 presents the results, and Section 6 concludes with the main contributions of the paper and future research work.
II. RELATED WORK
Social networking was introduced in 2003 and is rapidly becoming popular. Social networking services are now used extensively by Internet users all over the world, which has resulted in the accumulation of a huge amount of information on these websites. Another important feature of social networking websites like Facebook [3], Orkut [6], LinkedIn [5], and Twitter [17] is blogging, where a user can write on themes related to his/her interests. The social networking websites discussed in [11], [2], and [1] use a tagging approach to improve search mechanisms as well as personalized recommendations. However, tagging of any kind of information, particularly user interests, may be done by different users using different vocabularies, so a static tagging approach is not useful for retrieving relevant information held by other users. Conceptually expanding the user input (query or question) may therefore solve the term mismatch problem in building an efficient social networking recommender system. Semantic web tools such as ontologies and WordNet [14] have been a preferred choice of researchers proposing query expansion methods. Sabrina et al. [10] have combined a Web ontology and WordNet to expand the user query. Ray et al. [8] have proposed a query expansion framework using multiple domain ontologies along with WordNet to incorporate domain knowledge and natural language rules. Semantic tools have been used with commercial search engines too: Ioannis et al. [7] have combined Google with WordNet for word sense disambiguation. These methods may be very effective for query expansion in web-based question answering systems, but they perform poorly on a highly unstructured social networking corpus. In this paper, we apply the query expansion method described in [8] to expand queries and the concept combination based ranking method [9] to categorize blog content efficiently in a meaningful way.
III. ARCHITECTURE OF PROPOSED BLOG CONTENT BASED RECOMMENDATION FRAMEWORK
In this section, we describe the architecture of the proposed blog content based recommendation framework. The framework has three major modules, as discussed below.
A. Expansion of Themes using WordNet and Multiple Ontologies
Users write blogs on a large number of varying themes. Some themes are useful to a large number of users, while others attract a very limited number. We therefore consider only popular themes that address a large mass of users, while giving each user the flexibility to add his/her specific themes beyond these predefined ones. The proposed recommender system has ten predefined themes, described as follows:
• Technology: All aspects of technology advancement, research, etc.
• Entertainment: Movies, music, gossip, astrology, jewelry, etc.
• Sports: All news about sports and their stars.
• News: General news and country specific news.
• Business: General business news as well as country specific business news.
• Health: Health tips, alternative medicine, yoga, acupuncture, etc.
• Tourism: Country specific tourist places and their history.
• Application Widgets: New applications developed on Android, iPhone, Samsung bada, as well as on other open source platforms.
• Products: New consumer electronics products.
• Vehicles: Newly launched vehicles, country specific.
This module expands each of these basic themes using WordNet and multiple ontologies and applies the expanded themes to blogs written on social websites such as Twitter, Facebook, and LinkedIn.
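As a rough illustration of what an expanded theme or user phrase might look like once it is turned into a retrieval query, the sketch below joins each concept with its related terms using OR and joins the concepts with AND. The function name `build_expanded_query` and the hard-coded term lists are assumptions made for this example; in the framework the related terms would come from WordNet and Swoogle, which are not called here. The phrase is the "Nokia latest mobile phone" example used later in the paper.

```python
def build_expanded_query(expanded):
    """Build a Boolean query string from (concept, related_terms) pairs.

    Each concept becomes one OR-clause; clauses are joined with AND.
    The related terms are assumed to be already retrieved elsewhere.
    """
    clauses = []
    for concept, related in expanded:
        terms = [concept]
        for term in related:
            # Skip case-insensitive duplicates of terms already in the clause.
            if term.lower() not in (t.lower() for t in terms):
                terms.append(term)
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


# Hypothetical, hand-written expansion lists for illustration only.
query = build_expanded_query([
    ("Nokia", []),
    ("latest", ["recent"]),
    ("mobile phone", ["cellular phone", "cell phone"]),
])
print(query)
```

Keeping one clause per original concept means a blog must mention every concept, in some vocabulary, to match, which is the behavior the conjunctive form is meant to capture.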
Fig. 1 Blog Content Based Recommendation Framework
B. Extraction of Blog Content and Ranking
This module periodically retrieves blogs and analyzes their contents. Depending on its content, a blog may fall under more than one theme. Each blog is assigned a weight according to the keywords present in it, and the retrieved results for each theme are ranked using the rough set based concept combination approach. Depending on their degree of closeness to the themes, the blogs are tagged with the relevant themes. The matching and tagging of blog contents is performed on a regular basis.
C. User Specific Retrieval of Blog Content
In this module, a user can add themes related to his/her interests. The keywords in the user's interest are expanded, and the expanded keywords along with the original keywords are used to retrieve relevant content from blogs.
In the next section, we explain the methodology for expanding themes and categorizing blog content, both for the listed themes and for individual user-specific queries.
IV. WORDNET AND MULTIPLE ONTOLOGIES BASED THEME EXPANSION AND ROUGH SET RANKING METHOD
This section provides a detailed description of the proposed theme expansion and rough set based ranking algorithms. In Algorithm 1, we propose a multiple ontologies and WordNet [8] based theme expansion method to add related terms to the respective themes. The method takes the concepts described by the user as input and finds the most relevant senses for the key concepts using WordNet. To disambiguate the correct sense, the method further uses multiple domain ontologies retrieved from the Swoogle semantic web search engine. When the key concepts entered by the user are fed into Swoogle, it returns ontology classes describing those concepts. We compute the semantic distance of the key concepts from each retrieved ontology class, its superclass, and its subclasses, and consider the class with the lowest semantic distance for further processing. The complete algorithm is given as follows:
Algorithm 1: Theme_Expansion_MultipleOntologies
Input: User's question considered as query (Q)
Output: Expanded query (QE)
Step 1: Let T be a set of quadruples defined as T = {⟨C, O, W, R⟩}, where C denotes a user concept, O represents an ontology for the concept C, W represents the weight of ontology O, and R represents one of the semantic relations retrieved from WordNet. Initially, T is empty.
Step 2: The user enters a set of concepts Q = {C1, C2, ..., Ck}.
Step 3: The user assigns weights W1, W2, ..., Wk to the concepts C1, C2, ..., Ck on a scale of 1-10. Concepts with higher weights are considered more important.
Step 4: Search Swoogle for combinations of the concepts using a term dropping strategy. The query for Swoogle is expressed in Conjunctive Normal Form. All ontologies describing a concept combination are put into one group. Let us assume n ontology groups, defined as OG1, OG2, ..., OGn.
Step 5: Let WNc1, WNc2, ..., WNck be the domain sets in WordNet for concepts C1, C2, ..., Ck. Elements of WNci are denoted by a couple (S, R), where S is a synonym set for concept Ci and R represents the relation in WordNet that connects Ci to S.
Step 6: for (i = n; i > 0; i--) do the following for each ontology of group OGi:
T = T ∪ (⟨…⟩, ∀x ∈ (Oij ∩ WNc1)) ∪ … ∪ (⟨…⟩, ∀x ∈ (Oij ∩ WNck))
(where Oij is the jth ontology of ontology group OGi).
Step 7: If T is empty, then for each Ci add one sense from every relation available in WordNet to T. We select the most frequently occurring sense of the word and assign zero weight to the ontologies.
Step 8: QE = (C1 OR O11 OR O12 ... OR O1m) AND (C2 OR O21 OR O22 ... OR O2n) AND ... AND (Ck OR Ok1 OR Ok2 ... OR Okr), where Oij is the common ontology for concept Ci found in the previous steps.
For example, suppose the user enters the phrase “Nokia latest mobile phone”. The keywords “Nokia”, “latest”, and “mobile phone” are entered into WordNet and Swoogle. WordNet returns several synonyms for these concepts, such as “recent” for “latest” and “cellular telephone, cellular phone, cellphone, cell, mobile phone, radiotelephone, radiophone, wireless telephone” for “mobile phone”. The list is further narrowed by the Swoogle semantic web search engine, and the final expanded keywords are (Nokia) AND (Latest OR Recent) AND (Mobile phone OR Cellular phone OR cell phone).
A. Concept Extraction and Rough Set based Blog Document Ranking Algorithm
Rough set theory, proposed by Z. Pawlak [15] in 1982, is an extension of classical set theory that deals with vagueness and uncertainty in data sets. It has been successfully applied to classification and knowledge discovery from incomplete knowledge bases. In this section, we describe the rough set based blog document ranking algorithm [9] used in the proposed system. The blog document ranking method takes the expanded set of terms from the query expansion module, ranks concept combinations using the Concept_Extraction algorithm (Algorithm 2), and then assigns weights to concept combinations and classifies the blog into one of the themes using the Blog Document_Ranking algorithm (Algorithm 3), as given below.
Algorithm 2: Concept_Extraction(Q, D)
Input: User query (Q) and set of blog documents (D)
Output: Ranked concepts list (Gr)
Step 1: Extract key concepts C1, C2, …, Cn from the query.
Step 2: Expand the query using the query expansion algorithm [8]. The resulting query is C1 ∪ C2 ∪ ... ∪ Cn, where Ci = Ci1 ∪ Ci2 ∪ ... ∪ Cik and Cij indicates the jth semantically related word to concept Ci.
Step 3: Let G = C1 × C2 × ... × Cn, where × indicates the Cartesian product.
Step 4: Define an information system I = (U, A, V, f), where U = {Di | Di ∈ D}, A = {Gi | Gi ∈ G}, V is the domain of values of Gi, and f is an information function (U, A) → V such that:
$$ f(D_i, G_i) = \begin{cases} 0 & \text{if any of the concepts in } G_i \text{ is not present in } D_i \\ 1 & \text{if all concepts in } G_i \text{ are present in } D_i \end{cases} $$
Step 5: Use (1) to compute the “Knowledge Quantity” (KQ) of Gi:
$$ KQ_i = m\,(n - m) \qquad (1) $$
where n and m represent the cardinality of D and the number of blog documents in which concept group Gi occurs, respectively.
Step 6: Repeat Step 5 for all Gi.
Step 7: Sort G according to “Knowledge Quantity” and return Gr (sorted G).
Step 8: END
Algorithm 3: Rough Set based Blog Document_Ranking (Q, D)
Input: User query (Q) and set of blog documents (D)
Output: Ranked blog documents list (Dr)
Step 1: Run the Concept_Extraction (Q, D) algorithm to get a ranked list of concept groups.
Step 2: For each blog document Di ∈ D and concept group Gj, compute the blog document score (Wi1) using (2):
$$ W_{i1} = \left( 1 + \sum_{1 \le j \le p \ \text{and} \ G_j \subset D_i} \frac{p - r_j}{p} \right) W_0 \qquad (2) $$
where p is the cardinality of the set G (Step 3 of Algorithm Concept_Extraction) and rj is the rank of Gj obtained in Step 1. W0 is the initial weight assigned to each blog document.
Step 3: For each blog document Di ∈ D and concept group Gj, re-compute the blog document score (Wi2) using equation (3):
$$ W_{i2} = W_{i1} + k_1 W_{i1} \sum_{t \subseteq G_j \ \text{and} \ t \ \text{is in one sentence}} \frac{a_{ts}}{b_j} + k_2 W_{i1} \sum_{t \subseteq G_j \ \text{and} \ t \ \text{is in one subtitle}} \frac{a_{th}}{b_j} + k_3 W_{i1} \sum_{t \subseteq G_j \ \text{and} \ t \ \text{is in title}} \frac{a_{tt}}{b_j} \qquad (3) $$
Here k1, k2, and k3 are constants indicating the weights assigned to the occurrence of a concept combination in sentences, sub-titles, and titles within the blog documents. ats, ath, and att are the cardinalities of subset t in sentences, sub-titles, and the title, respectively. bj is the cardinality of Gj.
Step 4: Rank the blog document set according to the blog document scores obtained in Step 3.
Step 5: Put the blog in the theme with the highest rank.
Step 6: END
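The first part of the two algorithms above (concept groups, Knowledge Quantity, and the initial score of equation (2)) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each blog document is assumed to be already reduced to a set of keywords, concept groups are sorted by descending Knowledge Quantity (the paper does not state the sort direction), ranks start at 1, and the refinement of equation (3) is omitted. All function names and the toy data are hypothetical.

```python
from itertools import product


def concept_groups(expanded_concepts):
    """Cartesian product of the expanded term lists (Algorithm 2, Step 3)."""
    return [tuple(group) for group in product(*expanded_concepts)]


def contains(doc_terms, group):
    """Information function f: 1 if every term of the group is in the document."""
    return int(all(term in doc_terms for term in group))


def rank_groups(groups, docs):
    """Sort concept groups by Knowledge Quantity KQ = m * (n - m), equation (1)."""
    n = len(docs)

    def knowledge_quantity(group):
        m = sum(contains(doc, group) for doc in docs)
        return m * (n - m)

    # Assumption: a higher Knowledge Quantity means a better (smaller) rank.
    return sorted(groups, key=knowledge_quantity, reverse=True)


def initial_scores(ranked_groups, docs, w0=1.0):
    """Initial document score W_i1 from equation (2), with 1-based ranks r_j."""
    p = len(ranked_groups)
    scores = []
    for doc in docs:
        bonus = sum(
            (p - rank) / p
            for rank, group in enumerate(ranked_groups, start=1)
            if contains(doc, group)
        )
        scores.append((1 + bonus) * w0)
    return scores


# Toy data: two expanded concepts and four keyword sets standing in for blogs.
expanded = [["nokia"], ["latest", "recent"]]
docs = [
    {"nokia", "latest", "phone"},
    {"nokia", "recent"},
    {"football"},
    {"nokia", "latest"},
]
ranked = rank_groups(concept_groups(expanded), docs)
scores = initial_scores(ranked, docs)
print(ranked, scores)
```

Note that with this reading of equation (2) the lowest-ranked group contributes nothing, since (p − r_j)/p is zero when r_j = p; documents are then separated further only by the sentence, sub-title, and title terms of equation (3).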
The underlying intuition behind the Concept_Extraction and Document_Ranking algorithms is that a blog document is more relevant if it contains a combination of concepts together rather than individual concepts. Secondly, the algorithm first considers the most descriptive concepts of the blog document, which are generally used in its title or sub-titles, and accordingly assigns higher weight to blogs that contain keywords in titles or sub-titles.
V. EXPERIMENTS AND RESULTS
To test the effectiveness of the proposed theme expansion and ranking methods in the social networking context, we retrieved over 300 blog documents from the social networking websites Facebook, LinkedIn, and Twitter. Each of the basic themes originally had 30 blog documents, and these documents were converted into sets of keywords. Separately, the keywords extracted from the user query and the themes were expanded using the theme expansion algorithm (Algorithm 1). The contents of the retrieved blogs were then analyzed for the presence of the keywords extracted from the user input and their related expanded words, using the rough set based concept combination and ranking methods. Each blog was assigned a score with respect to each of the 10 themes and was tagged with the theme for which it had the highest score. We then computed the precision of our categorization of blog documents, which was approximately 70%. Snapshots of blog documents, along with the marked keywords from the final categorized themes, are shown in Fig. 2(a) and Fig. 2(b).
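The tagging and evaluation step described above can be illustrated as follows. This is a hedged sketch: the paper does not define how precision was computed, so it is assumed here to be the fraction of blogs whose assigned theme matches a manually chosen theme. The per-theme scores and the reference labels below are invented toy values, not the experimental data.

```python
def assign_theme(theme_scores):
    """Return the theme with the highest score for one blog document."""
    return max(theme_scores, key=theme_scores.get)


def precision(predicted, reference):
    """Fraction of documents whose predicted theme equals the reference theme."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(predicted)


# Hypothetical per-theme scores for three blogs (only two themes shown).
blog_scores = [
    {"Technology": 2.5, "Sports": 1.0},
    {"Technology": 1.0, "Sports": 1.5},
    {"Technology": 1.2, "Sports": 1.1},
]
reference = ["Technology", "Sports", "Sports"]  # hypothetical manual labels

predicted = [assign_theme(scores) for scores in blog_scores]
print(predicted, precision(predicted, reference))
```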
Fig. 2(a) Blog document retrieved from Facebook and tagged as entertainment blog
Fig. 2(b) Blog document retrieved from LinkedIn and tagged as technology blog
VI. CONCLUSION AND FUTURE WORK
The social networking domain is growing rapidly: a large number of users join daily, and thousands of users benefit from sharing information on different topics through blog features. In this paper, we have proposed an architecture for a blog content based recommendation framework using WordNet and multiple ontologies. We have proposed a theme expansion algorithm that uses WordNet and multiple ontologies to expand the initial keywords conceptually and retrieves relevant blog content using the expanded keywords. Further, the proposed system analyzes the relevance of the retrieved results with respect to the user's initial input. The system also uses a distinctive hybrid approach based on the concept combination algorithm and rough set theory to present highly relevant blog documents to users. The proposed algorithms were tested on a collection of 300 blog documents related to different themes, and the results were found to be promising. This kind of blog content analysis system could be developed for mobile phone and IPTV users to deliver relevant updates efficiently. In future, we will extend the current framework to include more blogs as well as more themes to assist social networking users in a convenient way.
REFERENCES
[1] C. Marlow, M. Naaman, D. Boyd, and A. Davis, “Position Paper, Tagging, Taxonomy, Flickr, Article, To Read”, In Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HT’06), August 2006.
[2] D. Zhou, J. Bian, S. Zheng, H. Zha, and C. L. Giles, “Exploring social annotations for information retrieval”, In Proceedings of the International World Wide Web Conference (WWW2008), April 2008.
[3] Facebook, website: www.facebook.com.
[4] J. Alpert and N. Hajaj, “We Knew the Web was Big.....”, website: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, 2008.
[5] LinkedIn, website: www.linkedIn.com.
[6] Orkut, website: http://www.orkut.com
[7] P. Ioannis, G. Klapaftis, S. Manandhar, “Google & WordNet based Word Sense Disambiguation”, In Proceedings of the 22nd ICML Workshop on Learning & Extending Ontologies, Bonn, Germany, 2005.
[8] S. K. Ray, S. Singh, and B. P. Joshi, “Exploring Multiple Ontologies and WordNet Framework to Expand Query for Question Answering Systems”, In Proceedings of the First International Conference on Intelligent Human Computer Interaction, IIIT Allahabad, India, January 21-24, 2009, http://hci.iiita.ac.in/ihci2009/.
[9] S. Singh, S. K. Ray, and B. P. Joshi, “Rough Set Based Concept Extraction Paradigm for Document Ranking”, In Proceedings of the 6th Atlantic Web Intelligence Conference (AWIC 09), Advances in Soft Computing, Springer, Charles University, Prague, Czech Republic, September 9-11, http://arg.vsb.cz/awic2009/Default.aspx.
[10] T. Sabrina, A. Rosni, T. Enyakong, “Extending ontology tree using NLP techniques”, In Proceedings of the National Conference on Research & Development in Computer Science REDECS 2001, Selangor, Malaysia, 2001.
[11] W. Choochaiwattana and M. B. Spring, “Applying Social Annotations to Retrieve and Re-rank Web Resources”, In Proceedings of the International Conference on Information Management and Engineering, pp. 215-219, IEEE Computer Society, 2009.
[12] Wikipedia encyclopedia, website: http://en.wikipedia.org/
[13] Wikipedia List of Social Networking websites, http://en.wikipedia.org/wiki/List_of_social_networking_websites
[14] WordNet, website: http://wordnet.princeton.edu
[15] Z. Pawlak, “Rough Sets”, International Journal of Computer and Information Science, 11(5), pp. 341-356, 1982.
[16] Swoogle, the semantic web search engine, website: swoogle.umbc.edu/
[17] Twitter, website: http://twitter.com