Rough Set based Social Networking Framework to Retrieve User-Centric Information
Santosh Kumar Ray (1) and Shailendra Singh (2)
(1) Birla Institute of Technology, Muscat, Oman
(2) Samsung India Software Centre, Noida, India
{[email protected], [email protected]}
Abstract. Social networking has become a necessity for the current generation because it is useful in several ways, such as finding people around the world who share the user's interests and gathering information on different topics. A social network holds abundant information on different domains, contributed by a variety of users, but it is very difficult to find information that matches a particular user's preferences. It is also quite possible that relevant information is held, in different forms, by other users connected to the same network. In this paper, we propose a computationally efficient rough set based method for ranking documents. The proposed method first expands the user query using WordNet and domain ontologies and then retrieves documents containing relevant information. The distinctive feature of the proposed algorithm is that it emphasizes concept combinations, based on concept presence and position, rather than term frequencies when retrieving relevant information. We experimented with a set of standard questions collected from TREC, Wordbook, and WorldFactBook, and retrieved documents using Google and our proposed method. We observed a significant improvement in the ranking of the retrieved documents.
Keywords: Rough sets, Document Ranking, Concept Extraction, Social Domain Networking.
1 Introduction
Today, the World Wide Web is growing very fast. A recently published article [1] reports that the number of web pages on the Internet crossed 1 trillion in 2008, up from only 200 billion in 2006 as reported in [18]. With the growth of World Wide Web based applications, the more advanced Web 2.0 framework was introduced for a variety of applications such as blogging, online gaming, social networking, knowledge sharing, and chat rooms. Social networking touches almost every domain, from the general to the specific.
[17] lists more than 150 popular social networking websites on a variety of topics. Well-known social networking websites such as Orkut [10], Facebook [5], and LinkedIn [8] are becoming essential for users ranging from school children to qualified professionals. These websites allow users to build relationships with one another by joining one or more groups or communities. In a typical social networking website, Internet users are invited by existing members to join communities, groups, and people related to their interests. Users are free to explore such communities and join them. There is also no limit on expanding one's social network: a user can join as many communities, groups, and people as desired to get diverse information on different topics. At present, social networking websites do not share information across websites and, as a result, information scattered over different topics cannot be processed together for effective use. Another important point is that the information needed by a user is sometimes not available in the user's own network communities but may be available in other networks or retrievable from the World Wide Web. In building an efficient social network, the Semantic Web plays an important role by allowing information to be exchanged at the conceptual level. The Semantic Web represents World Wide Web data as a mesh, linked in such a way that it can be easily processed by machines on a global scale. In this paper, we present a document retrieval system that takes the user's question as input and first generates concept based expanded questions that are close to the original question. These expanded questions reduce the gap between the syntax and the semantics of the terms used by other users having similar interests.
The proposed document retrieval system then retrieves relevant information from other networks and the World Wide Web using the expanded questions and ranks the retrieved results according to their relevance to the user. The rest of the paper is organized as follows: Section 2 describes related research work, while Section 3 provides details of the proposed social network architecture. Section 4 explains the proposed rough set based document ranking algorithms. Section 5 presents our observations and results. The last section states our conclusions and future directions.
2 Related Work
Social networking was introduced in 2003 and is rapidly becoming popular. Social networking services are now used extensively by Internet users all over the world, which has resulted in the accumulation of a huge amount of information on these websites. The available social networking websites, as discussed in [3], [21], and [9], use a tagging approach to improve search mechanisms as well as personalized recommendations. However, tags for any kind of information, particularly for user interests, may be assigned by different users using different vocabularies. Consequently, the tagging approach is not useful for retrieving relevant information held by other users. Therefore, conceptually expanded user input may
solve the term mismatch problem in building an efficient document retrieval system in the social networking domain. Semantic Web tools such as ontologies and WordNet [19] have been a preferred choice of researchers proposing input expansion methods. We have also used a combination of ontologies and WordNet to solve the term mismatch problem in document retrieval. A number of document ranking models have been proposed, such as the extended Boolean model [13], the vector space model [7], and the relevance model [4]. These models depend largely on query term frequency, document length, and similar statistics to rank documents. They are computationally fast. However, they ignore the linguistic features and the semantics of both the query and the documents, which adversely affects their retrieval performance. [16] and [12] propose conceptual models that map a set of words and phrases to concepts and exploit their conceptual structures for retrieval. [15] proposes an ontology hierarchy based approach for automatic topic identification, which can be further extended to automatic text categorization. These models are more complicated but retrieve more precise information than the statistical models. However, they cannot handle imprecise information, which is necessary to fulfill users' needs. Therefore, rough set based methods [14] [2] were proposed for document classification to handle imprecise and vague information. [6] proposes automatic classification of WWW bookmarks based on rough sets, while [20] proposes an extension of the document frequency metric using rough sets. Both use the indiscernibility relation for text categorization. In this paper, we propose a document ranking method that extends their work.
3 Social Networking based Information Retrieval System Architecture
This section describes the architecture of the proposed personalized question answering recommender system. The proposed system is based on the hypothesis that the problem of correlating the syntax and the semantics of the terms that users in the social networking domain employ to define their interests can be solved by incorporating conceptualized information from the Semantic Web. The proposed social domain document retrieval system (Fig. 1) takes the user's interest as input, extracts the important terms, and then finds the semantically related concepts using the query expansion module described in [11]. These conceptually related terms, along with the user input, are passed to the document retrieval phase. The document retrieval phase searches for documents relevant to the user's interests and presents a list of documents in order of relevance using the rough set based ranking algorithm. The algorithm for document retrieval is described in the next section.
Fig. 1. Architecture of Social Networking based Information Retrieval System
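The flow described in this section (expand the input, retrieve documents for every expanded query, then rank the pooled results) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the function name `retrieve_for_user` and the three callables `expand`, `search`, and `rank` are hypothetical placeholders for the query expansion module of [11], the underlying search engine, and the ranking algorithm of Section 4.

```python
from typing import Callable, Iterable


def retrieve_for_user(
    user_input: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str], Iterable[str]],
    rank: Callable[[str, list[str]], list[str]],
) -> list[str]:
    """Hypothetical pipeline: query expansion -> retrieval -> ranking."""
    # Query expansion: the original input plus conceptually related variants.
    queries = [user_input, *expand(user_input)]

    # Document retrieval: pool the results of every query, dropping
    # duplicates while keeping the first-seen order.
    pooled: dict[str, None] = {}
    for query in queries:
        for document in search(query):
            pooled.setdefault(document, None)

    # Ranking: order the pooled documents by relevance to the original input.
    return rank(user_input, list(pooled))
```

Passing the three stages as parameters keeps the sketch independent of any particular search engine or ontology, which are not specified at this level of the architecture.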
4 Rough Set based Document Ranking Method
In this section, we propose a rough set based document ranking algorithm. The proposed method does not use term frequencies to rank the retrieved documents, because algorithms based on term frequencies tend to be biased towards longer documents. The proposed algorithm expands the user input, selects the relevant features from the set of documents returned by search engines, and ranks the extracted concept combinations according to their relevance to the user's input. Finally, the algorithm re-ranks the documents based on the position of the concept combinations within them.
4.1 Concept Combination Ranking Algorithm
In this section, we propose an algorithm that uses the indiscernibility relation of rough set theory to rank concept combinations. The basic idea builds on the algorithm discussed in [20], which uses document frequency to extract the important features from a set of documents and categorizes the documents on the basis of their features (terms). We extend that algorithm to rank concept combinations. The underlying intuition is that a document is more relevant if it contains a combination of concepts together than if it contains the individual concepts in isolation. Let us assume that the user input contains concepts $C_1, C_2, \ldots, C_n$ and that the input is expanded using the algorithm proposed in [11]. The key concepts in the expanded set are then grouped into concept combinations using the Cartesian product and ranked according to the knowledge quantity they contain. The complete algorithm for ranking the concept combinations is described below.

Algorithm: Concept_Extraction(Q, D)
Input: User input (Q) and set of documents (D)
Output: Ranked concepts list (Gr)
Step 1: Extract key concepts $C_1, C_2, \ldots, C_n$ from the input.
Step 2: Expand the input using the expansion algorithm [11]. The resulting set is $C_1 \cup C_2 \cup \ldots \cup C_n$, where $C_i = C_{i1} \cup C_{i2} \cup \ldots \cup C_{ik}$ and $C_{ij}$ denotes the j-th word semantically related to concept $C_i$.
Step 3: Let $G = C_1 \times C_2 \times \ldots \times C_n$, where $\times$ denotes the Cartesian product.
Step 4: Define an information system $I = (U, A, V, f)$, where $U = \{D_i \mid D_i \in D\}$, $A = \{G_i \mid G_i \in G\}$, $V$ is the domain of values of $G_i$, and $f$ is an information function $(U, A) \to V$ such that:
$$f(D_i, G_i) = \begin{cases} 0 & \text{if any of the concepts in } G_i \text{ is not present in } D_i \\ 1 & \text{if all concepts in } G_i \text{ are present in } D_i \end{cases}$$
Step 5: Determine the "Knowledge Quantity" (KQ) of $G_i$ using equation (1):
$$KQ_i = m(n - m) \qquad (1)$$
where n and m represent the cardinality of D and the number of documents in which concept group $G_i$ occurs, respectively.
Step 6: Repeat Step 5 for all $G_i$.
Step 7: Sort G according to "Knowledge Quantity" and return Gr (sorted G).
Step 8: END
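The steps of Concept_Extraction can be illustrated with a short Python sketch. It is a simplified reading of the algorithm rather than the authors' code, and it makes several assumptions that the text does not state: the expanded concepts are supplied directly as lists of related words (standing in for the expansion algorithm of [11]), a concept counts as present when it occurs as a case-insensitive substring of the document, and groups are sorted in descending order of knowledge quantity. The function names are hypothetical.

```python
from itertools import product


def contains(document: str, concept: str) -> bool:
    # Assumption: presence is a case-insensitive substring match.
    return concept.lower() in document.lower()


def concept_extraction(expanded_concepts, documents):
    """Rank concept groups by knowledge quantity KQ = m * (n - m).

    expanded_concepts: one list of related words per key concept (C_1..C_n).
    documents: list of document texts (the set D).
    Returns a list of (group, KQ) pairs, highest KQ first (assumed order).
    """
    n = len(documents)
    # Step 3: every concept group is one element of the Cartesian product.
    groups = list(product(*expanded_concepts))

    scored = []
    for group in groups:
        # Step 4: f(D, G) = 1 only if all concepts of the group occur in D;
        # m counts the documents for which f equals 1.
        m = sum(all(contains(doc, c) for c in group) for doc in documents)
        # Step 5: knowledge quantity from equation (1).
        scored.append((group, m * (n - m)))

    # Step 7: sort by knowledge quantity (descending order is an assumption).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "The capital of France hosts many museums.",
        "Lyon is a city in France.",
        "Muscat is the capital of Oman.",
    ]
    for group, kq in concept_extraction([["capital", "city"], ["france"]], docs):
        print(group, kq)
```

With these four sample documents, the group ("capital", "france") occurs in two of them and gets KQ = 2 × 2 = 4, while ("city", "france") occurs in one and gets KQ = 1 × 3 = 3. Because m(n − m) is zero when a group occurs in no document or in every document, such groups fall to the bottom of the ranking: they do not help to tell documents apart.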
4.2 Document Ranking Algorithm
The proposed document ranking algorithm takes the ranked concept combinations produced in Section 4.1 and searches the document set for them. First, the algorithm gives weight to the most descriptive concepts of a document, namely those used in its title or subtitles. Second, it treats sentences that contain a larger number of concepts as more relevant. Algorithm Document_Ranking describes the proposed procedure.

Algorithm: Document_Ranking(Q, D)
Input: User query (Q) and set of documents (D)
Output: Ranked documents list (Dr)
Step 1: Run Concept_Extraction(Q, D) to get the ranked list of concept groups.
Step 2: For each document $D_i \in D$ and concept group $G_j$, compute the document score ($W_{i1}$) using equation (2):
$$W_{i1} = \left(1 + \sum_{1 \le j \le p \text{ and } G_j \subset D_i} \frac{p - r_j}{p}\right) W_0 \qquad (2)$$
where p is the cardinality of the set G (Step 3 in Algorithm Concept_Extraction) and $r_j$ is the rank of $G_j$ obtained in Step 1. $W_0$ is the initial weight assigned to each document.
Step 3: For each document $D_i \in D$ and concept group $G_j$, re-compute the document score ($W_{i2}$) using equation (3):
$$W_{i2} = W_{i1} + k_1 W_{i1} \sum_{t \subseteq G_j \text{ and } t \text{ is in one sentence}} \frac{a_{ts}}{b_j} + k_2 W_{i1} \sum_{t \subseteq G_j \text{ and } t \text{ is in one subtitle}} \frac{a_{th}}{b_j} + k_3 W_{i1} \sum_{t \subseteq G_j \text{ and } t \text{ is in title}} \frac{a_{tt}}{b_j} \qquad (3)$$
Here $k_1$, $k_2$, and $k_3$ are constants indicating the weights assigned to occurrences of a concept combination in sentences, subtitles, and titles within the documents. $a_{ts}$, $a_{th}$, and $a_{tt}$ are the cardinalities of the subset t in sentences, subtitles, and the title, respectively. $b_j$ is the cardinality of $G_j$.
Step 4: Rank the document set according to the scores obtained in Step 3.
Step 5: END
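As an illustration of the two scoring passes, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions: W0 multiplies the whole bracketed term of the first-pass score; a concept is present when it occurs as a case-insensitive substring; the subset t is taken to be the part of a concept group found in a single sentence, subtitle, or title; and a_t is the size of that part. The `Document` class, the function names, and the values of k1, k2, k3 are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical document structure: a title, subtitles, body sentences.
    title: str
    subtitles: list = field(default_factory=list)
    sentences: list = field(default_factory=list)

    def full_text(self) -> str:
        return " ".join([self.title, *self.subtitles, *self.sentences])


def present(concept: str, text: str) -> bool:
    # Assumption: presence is a case-insensitive substring match.
    return concept.lower() in text.lower()


def first_pass_score(doc, ranked_groups, w0=1.0):
    """W_i1: reward every concept group fully contained in the document."""
    p = len(ranked_groups)
    text = doc.full_text()
    bonus = 0.0
    for rank, group in enumerate(ranked_groups, start=1):
        if all(present(c, text) for c in group):
            bonus += (p - rank) / p  # better-ranked groups contribute more
    return (1.0 + bonus) * w0  # assumed grouping of W0


def position_term(group, units):
    """Sum of a_t / b_j over text units (sentences, subtitles, or title)."""
    b = len(group)
    return sum(sum(present(c, unit) for c in group) / b for unit in units)


def second_pass_score(doc, ranked_groups, w1, k1=0.1, k2=0.2, k3=0.3):
    """W_i2: add position-dependent bonuses on top of W_i1."""
    in_sentences = sum(position_term(g, doc.sentences) for g in ranked_groups)
    in_subtitles = sum(position_term(g, doc.subtitles) for g in ranked_groups)
    in_title = sum(position_term(g, [doc.title]) for g in ranked_groups)
    return w1 + k1 * w1 * in_sentences + k2 * w1 * in_subtitles + k3 * w1 * in_title


def rank_documents(docs, ranked_groups):
    """Return (score, document) pairs sorted by the final score W_i2."""
    scored = []
    for doc in docs:
        w1 = first_pass_score(doc, ranked_groups)
        scored.append((second_pass_score(doc, ranked_groups, w1), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    groups = [("capital", "france"), ("city", "france")]  # already ranked
    docs = [
        Document("Travel notes", [], ["France has many cities."]),
        Document("Capital of France", [],
                 ["Paris is the capital of France.", "It is a large city."]),
    ]
    for score, doc in rank_documents(docs, groups):
        print(f"{score:.3f}  {doc.title}")
```

In this toy run the document titled "Capital of France" is ranked first, because it contains both concept groups in full and repeats the top-ranked group in its title. Other readings of the subset t (for example, enumerating every subset of a group) would change the absolute scores.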
5 Experiments and Results
To test the effectiveness of the proposed algorithm, we chose a set of 50 questions from the social networking domain. All of these questions were expanded using the query expansion algorithm [11]. The expanded questions were fed into Google, and we downloaded 25 documents for each of the 50 questions separately. The retrieved documents were re-ranked using the Document_Ranking algorithm. The average number of documents containing correct answers in the top 10 documents increased from 3.56 to 4.48, an improvement of about 26% in document retrieval.

Table 2. Comparative performance analysis
S.N | Performance Parameters | With Google | With Proposed Approach
----+------------------------+-------------+-----------------------
1 | No. of questions whose answers were present in at least 5 documents (out of first 10 documents) | 17 | 25
2 | Average no. of documents containing correct answers (out of first 10 documents) | 3.56 | 4.48
3 | Number of questions with answer in the first document | 22 | 23
4 | Average rank of the document containing first correct answer | 2.78 | 2.44
We also observed an increased number of correct answers in the top-ranked documents. With Google, there were 17 questions whose answers were present in at least 5 of the top 10 documents; with the proposed algorithm, this count increased to 25. These results show that the Document_Ranking algorithm brings more relevant documents to higher ranks. We have summarized these results in Table 2.

[Fig. 1 Number of documents containing correct answers. Panel title: "Doc. containing answers"; x-axis: Question number; y-axis: Number of documents; series: total documents containing answer (out of 25), before (out of 10), after (out of 10).]
[Fig. 2 Precision graph. x-axis: Question number; y-axis: Precision; series: precision before, precision after.]
[Fig. 3 Documents' rank containing first correct answer. Panel title: "First Answer"; x-axis: Question Number; y-axis: Doc Number; series: before, after.]
Results for the experimental questions are shown in Fig. 1. As seen from the figure, the number of documents containing correct answers is higher than in the original retrieval. Thus, our algorithm helps the information retrieval system improve its precision, which is shown more explicitly in Fig. 2. Fig. 2 can be derived from Fig. 1 by using the formula for precision calculation. Further, Fig. 3 shows the rank of the first document containing a correct answer. This rank was the same for Google and our algorithm in 28% of cases (mostly rank 1, where there is no scope for improvement). In 46% of cases, the algorithm improved the rank of the first document containing a correct answer, while the rank declined for 26% of the questions. Thus, Fig. 3 indicates that the algorithm improves the rank of relevant documents in more cases than it worsens it.
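The precision formula itself is not spelled out in the text. A minimal sketch, assuming the usual precision-at-k definition over the first 10 documents (documents containing a correct answer divided by 10), is given below; the function names are hypothetical, and the only inputs are the averages reported in Section 5.

```python
def precision_at_k(relevant_in_top_k: float, k: int = 10) -> float:
    """Fraction of the top-k documents that contain a correct answer."""
    if k <= 0:
        raise ValueError("k must be positive")
    return relevant_in_top_k / k


def relative_change(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


if __name__ == "__main__":
    # Averages reported for the 50 questions (top 10 documents).
    before = precision_at_k(3.56)
    after = precision_at_k(4.48)
    print(f"precision before: {before:.3f}, after: {after:.3f}")
    print(f"relative improvement: {relative_change(before, after):.1%}")
```

Under this assumption the average precision rises from 0.356 to 0.448, a relative improvement of about 25.8%.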
6 Conclusion and Future Scope
The social networking domain is growing rapidly because a large number of users join daily, and thousands of users benefit by sharing information on different matters. In this paper, we have presented two algorithms to rank documents conceptually. The first algorithm ranks the concept combinations of the documents, which is useful for finding more conceptually relevant answers. The second algorithm ranks the retrieved documents using the position of concept combinations, which improves the precision of the information retrieval system. Although the method uses modern semantic tools such as rough sets and ontologies, it is simple and computationally efficient. We experimented with 1250 documents from social networking domains, retrieved using Google, to judge the effectiveness of the proposed method.
References
[1] Alpert, J., Hajaj, N.: We Knew the Web was Big... Website: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html (2008)
[2] Bao, Y., Aoyama, S., Yamada, K., Ishii, N., Du, X.: A Rough Set Based Hybrid Method to Text Categorization. In: Second International Conference on Web Information Systems Engineering (WISE'01), vol. 1, pp. 254--261. IEEE Computer Society, Washington, DC, USA (2001)
[3] Choochaiwattana, W., Spring, M.B.: Applying Social Annotations to Retrieve and Rerank Web Resources. In: Proceedings of the International Conference on Information Management and Engineering, pp. 215--219. IEEE Computer Society (2009)
[4] Crestani, F., Lalmas, M., Rijsbergen, J., Campbell, L.: Is This Document Relevant? ...Probably. A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys, 30 (4), 528--552 (1998)
[5] Facebook, website: www.facebook.com
[6] Jensen, R., Shen, Q.: A Rough Set-Aided System for Sorting WWW Bookmarks. In: First Asia-Pacific Conference on Web Intelligence: Research and Development, LNCS, vol. 2198, pp. 95--105. Springer-Verlag, London, UK (2001)
[7] Lee, D. L., Chuang, H., Seamons, K.: Document Ranking and the Vector Space Model. IEEE Software, 14 (2), 67--75 (1997)
[8] LinkedIn, website: www.linkedin.com
[9] Marlow, C., Naaman, M., Boyd, D., Davis, A.: Position Paper, tagging, Taxonomy, Flickr, Article, To Read. In: Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HT'06), August 2006
[10] Orkut, website: http://www.orkut.com
[11] Ray, S. K., Singh, S., Joshi, B. P.: Question Answering Systems Performance Evaluation – To Construct an Effective Conceptual Query Based on Ontologies and WordNet. In: Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives, Rome, Italy, December 15-17, 2008, CEUR Workshop Proceedings, ISSN 1613-0073 (2008)
[12] Rocha, C., Schwabe, D., Poggi de Aragão, M.: A Hybrid Approach for Searching in the Semantic Web. In: 13th International Conference on World Wide Web, pp. 374--383. ACM, New York, USA (2004)
[13] Salton, G., Fox, E. A., Wu, H.: Extended Boolean Information Retrieval. Communications of the ACM, 26 (11), 1022--1036 (1983)
[14] Singh, S., Dey, L.: A Rough-Fuzzy Document Grading System for Customized Text Information Retrieval. Information Processing and Management: an International Journal, 41 (2), 195--216 (2005)
[15] Tiun, S., Abdullah, R., Kong, T. E.: Automatic Topic Identification using Ontology Hierarchy. In: Second International Conference on Computational Linguistics and Intelligent Text Processing, LNCS, vol. 2004, pp. 444--453. Springer-Verlag, London, UK (2001)
[16] Vallet, D., Fernández, M., Castells, P.: An Ontology-Based Information Retrieval Model. In: Gómez-Pérez, A., Euzenat, J. (eds.) 2nd European Semantic Web Conference (ESWC 2005), LNCS, vol. 3532, pp. 455--470. Springer, Heidelberg (2005)
[17] Wikipedia List of Social Networking, website: http://en.wikipedia.org/wiki/List_of_social_networking_websites
[18] Wirken, D.: The Google Goal Of Indexing 100 Billion Web Pages. Website: www.sitepronews.com/archives/2006/sep/20.html (2006)
[19] WordNet, website: http://wordnet.princeton.edu
[20] Xu, Y., Wang, B., Li, J.T., Jing, H.: An Extended Document Frequency Metric for Feature Selection in Text Categorization. Information Retrieval Technology, LNCS, vol. 4993, pp. 71--82. Springer Berlin/Heidelberg (2008)
[21] Zhou, D., Bian, J., Zheng, S., Zha, H., Giles, C.L.: Exploring social annotations for information retrieval. In: Proceedings of International World Wide Web Conference (WWW2008), April 2008