Minufiya J. of Electronic Engineering Research (MJEER), Vol. 25, No. 2, July 2016.
Validating Ranking in Web Documents Using Normalized Social Media Information
Ahmed S. Hashwa*, Nawal A. El-Fishawy* and Sherin M. Youssef**
* Dept. of Computer Science and Eng., Faculty of Electronic Engineering, Menoufia University ** Dept. of Computer Eng., Arab Academy for Science and Technology & Maritime Transport, Egypt.
(Received: 28-December-2015 - Accepted: 15-February-2016)
Abstract
Social Information Retrieval (SIR) is a relatively new domain that uses social information to enhance information retrieval processes. Social behavior can indicate how interesting search results are to users, so social interactions over Web 2.0 are used here to enhance the ranking of web results in response to a query. A dataset from the Open Directory Project (ODP) is used to show the improvement in ranking. We propose the use of normalization and social service weights to achieve better performance. The proposed framework gets data from various types of social information sources (social bookmarking, social news, social networks and discovery engines). Data is parsed into fields, and significant values are used in the ranking process. Precision and Mean Average Precision (MAP) are used to evaluate results. Simulation results show better ranking with the proposed model.
1. Introduction
The exponential growth of the World Wide Web (WWW) [1] leaves users confused when they search for information. The existence of search engines has made that problem much easier to handle: users can now use search engines to find information. Over time, many search engines were developed, such as Google, Bing, Yahoo, Ask, AOL Search and more [2]. Their task is to find the user's desired information. Unfortunately, most of them use the content of these documents
mainly to match a query, and the link structure between the documents to rank the results, among other unpublished proprietary factors that are not public for researchers. Lately, as the web of interaction, also known as Web 2.0, has developed rapidly, many services have appeared that give people the power to express their feedback about any web resource in the form of comments, tags and ratings. The interaction of people with these resources via social media gives valuable information about those resources, which is known as the "wisdom of the crowd" [3-4]. Social media include social networks (Facebook [5], Twitter [6], Google+ [7], LinkedIn [8], etc.), social bookmarking and discovery systems (Delicious [9], StumbleUpon [10], etc.) and social news sites (Reddit [11]). Unfortunately, major search engines like Google do not yet make direct use of this valuable information, although some studies have used social annotations in search results [12]. The gap between social information generated by social networks and Information Retrieval (IR) needs to be bridged to increase the quality of the results requested by a user. The main tracks to enhance the IR process and reduce the amount of irrelevant documents are: (i) rewriting the query using extra knowledge, i.e. expansion of the user query; (ii) adding more information about the document from other sources, i.e. document expansion; (iii) improvement of the IR model, i.e. the way documents and queries are represented and matched to quantify their similarities; (iv) post-filtering or re-ranking of the retrieved documents (based on the user profile, context or some document-related information). Many studies [6-9] have tried to bridge this gap using two general approaches: (i) a personalized approach that models the user (extracting a profile or a group of keywords representing topics of interest to a user) and customizes the results for him [6, 7]; (ii) a non-personalized approach that uses social information in general to enhance information retrieval, as in [8, 9]. Our proposed model uses a non-personalized approach to IR. The model depends on the social information made publicly available by the social media used
at the ranking phase to enhance the quality of ranking. Ranking is done using a social score calculated for webpages from all available social media information. The aim of our proposed model is to achieve the following: 1) re-ranking results using the available social information in a non-personalized approach; 2) an evaluation study of the proposed approach and a comparison with the closest works on a large public dataset. The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 describes our IR model. In Section 4, we describe our dataset, present the evaluation methodology and metrics, and discuss the results. Finally, we conclude the paper and propose future work.
2. Related Work
Many researchers have used social information to enhance information retrieval in various ways. In [15], the authors tried to combine a link-based ranking metric (PageRank) with a metric derived from social bookmarking data (SBRank). They also used metadata from social bookmarking to enable additional search capabilities such as temporal search. This increases the precision of a standard link-based search by incorporating popularity estimates from the aggregated data of bookmarking users. Their approach was non-personalized. Their framework improved results but depended on only two sources for analysis. Their evaluation used analytical studies of social bookmarks as well as a comparative analysis between PageRank and SBRank. Another non-personalized approach appeared in [16], where the authors used two factors for ranking pages: SocialSimRank (SSR), which calculates the similarity between social annotations and web queries, and SocialPageRank (SPR), which captures the popularity of web pages. They used Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) to evaluate their results. However, both previous approaches depend on social bookmarking only, although many other sources of social data are available. Recently, the authors of [17] used social signals from different sources to enhance search. Their dataset was collected from IMDb and contains mainly movie resources, whereas our dataset is a general dataset from ODP. Their experimental
evaluation reveals that incorporating social properties within a textual search model enhances the quality of the search results. In [18], the authors used ODP as a reference and used social information from different sources both to retrieve more relevant search results and to re-rank results. Their work also used a dataset from ODP. Better search results were obtained by adding social information from Delicious and then boosting this field at query time. Using social information for re-ranking did not perform as well.
A brief description of the social websites used:
• Delicious: Delicious is a social bookmarking web service for storing, sharing, and discovering web bookmarks/resources. Delicious uses a non-hierarchical categorization system in which users can tag each of their bookmarks/resources with freely chosen index terms (generating a kind of folksonomy). Founded in 2003, its publicly available data has been used in many studies such as [13, 14, 16, 19].
• Reddit: Reddit is an entertainment, social networking and news website where registered community members can submit content, such as text posts or direct links. Registered users can then vote submissions "up" or "down" to organize the posts and determine their position on the site's pages. Content entries are organized by areas of interest called "subreddits".
• Facebook: Facebook is a well-known online social networking service. It allows its users to like, share and comment on any web resource and any posts created by users.
• Twitter: Twitter is an online social networking service that enables users to send and read short 140-character messages called "tweets". Registered users can read and post tweets, while unregistered users can only read them. Users access Twitter through the website interface, SMS, or a mobile device app. Users are allowed to embed URLs within their tweets.
• Google+: Google+ is a social network and social layer for Google services that allows sharing, liking and commenting on web resources or other posts.
• StumbleUpon: StumbleUpon is a discovery engine (a form of web search engine) that finds and recommends web content to its users. Its features allow users to discover and rate web pages, photos and videos that are personalized to their tastes and interests using peer-sourcing and social-networking principles.
• LinkedIn: LinkedIn is a business-oriented social networking service. It is mainly used for professional networking. Users have the ability to comment on, like and share posts of other users, which may include URLs of web resources.
3. Proposed IR Model
3.1. Problem Definition
As a result of the information revolution, users cannot easily obtain high-quality information on their own. Here we can see the great role of search engines, which help users retrieve the information they need. Search engines such as Google, Yahoo and others are useful in finding the user's desired information, but most of them mainly use the content of documents to match a query or user need, and the link structure between documents to rank the results, among other unpublished proprietary factors. As stated in [3], information generated by the crowds is of value and needs to be considered when ranking web pages for users, which is what we consider in our paper.
3.2. Proposal Overview
Figure 1 presents the search engine framework as in [18], which uses social information from diverse sources so that it is included in the ranking model. It shows the main components of the proposed model, which are surrounded by the dashed polyline inside the search system.

[Figure 1: Search engine framework with the proposed social ranking components.]
~="'- ~ _.. _,. = ..,. r = 1
y .
Y1f ;: -· -
-
~
~
}' .:n'I .•
--==1
~
ii".,. .
(5)
~
where $m$ is the number of social services used, $x_i$ is the score calculated for social service $i$, and $w_i$ is the weight given to each social service, equal to that service's coverage percentage of the dataset given in Table 1. To calculate the social service score $x_i$, Eq. (6) is used:
$$x_i = \frac{\sum_{j=1}^{n} s_j \, c_j}{\sum_{j=1}^{n} s_j} \qquad (6)$$
where $n$ is the number of social components used, $c_j$ is the value of social service component $j$, and $s_j$ is the weight given to each social service component (such as a like or share on Facebook), chosen as in Table 2.
Table 1: Social information coverage of the used dataset

Social Information Source | Covered % | Non-Covered %
Facebook                  | 80.944    | 19.056
Twitter                   | 55.378    | 44.611
Delicious                 | 52.621    | 47.379
StumbleUpon               | 50.523    | 49.477
GooglePlus                | 34.597    | 65.403
LinkedIn                  | 28.868    | 71.132
Reddit                    |  4.833    | 95.167
Table 2: Social service components and their weights

(The scanned table is only partially recoverable. The services covered are Facebook, Twitter, Delicious, StumbleUpon, GooglePlus, LinkedIn and Reddit; their components include likes count, share count, posts count, tags total count, user lists, user likes, score and total sum, with integer weights between 1 and 4.)
The weights in Table 2 are estimated by the authors to represent the strength of a user's belief that the content he/she shares is important (for example, liking a page is a mild endorsement, but sharing it means you believe it is worth letting others view its content).
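To make the scoring concrete, the following is a minimal C# sketch of Eqs. (5) and (6). The service names, component values and weights below are illustrative placeholders only, not the exact values used in the experiments.

```csharp
using System;
using System.Linq;

class SocialScorer
{
    // Eq. (6): score of one social service is the weighted mean of its
    // component values c_j with component weights s_j (Table 2).
    static double ServiceScore(double[] c, double[] s)
    {
        return c.Zip(s, (cj, sj) => sj * cj).Sum() / s.Sum();
    }

    // Eq. (5): overall social score is the mean of the per-service scores x_i
    // weighted by each service's dataset coverage w_i (Table 1).
    static double SocialScore(double[] x, double[] w)
    {
        return x.Zip(w, (xi, wi) => wi * xi).Sum() / w.Sum();
    }

    static void Main()
    {
        // Hypothetical components: Facebook likes and shares (shares weighted
        // higher), and Twitter posts count only.
        double facebook = ServiceScore(new[] { 120.0, 30.0 }, new[] { 1.0, 3.0 });
        double twitter = ServiceScore(new[] { 45.0 }, new[] { 1.0 });

        // Service weights taken as the coverage percentages from Table 1.
        double score = SocialScore(new[] { facebook, twitter }, new[] { 80.944, 55.378 });
        Console.WriteLine($"Social score: {score:F3}");
    }
}
```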
In addition, to address the normalization issue, another method is used to calculate the social score. For the results of a certain query $q_r$, the normalized social service score $x_n$ is calculated by normalizing each component by the maximum of that component over all results. This is illustrated by the following equation:
$$x_n = \frac{\sum_{j=1}^{n} s_j \left( \dfrac{c_j}{\max_{q_r}(c_j)} \right)}{\sum_{j=1}^{n} s_j} \qquad (7)$$
The components of the social services and their weights are listed in Table 2.
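A minimal sketch of Eq. (7), assuming every result of a query exposes the same ordered list of component values and that the per-component maxima over the query's result set have been computed first:

```csharp
using System.Linq;

static class NormalizedScorer
{
    // Eq. (7): each component c_j is normalized by its maximum over all results
    // of the query before taking the weighted mean with component weights s_j.
    public static double NormalizedScore(double[] c, double[] maxPerComponent, double[] s)
    {
        double numerator = 0.0;
        for (int j = 0; j < c.Length; j++)
        {
            // Guard against components that are zero for every result of the query.
            double max = maxPerComponent[j] > 0 ? maxPerComponent[j] : 1.0;
            numerator += s[j] * (c[j] / max);
        }
        return numerator / s.Sum();
    }
}
```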
C. Combining textual and social rank
Finally, re-ranking is done using only social information as well as a combination of the textual rank ($r_t$) and the social rank ($r_s$), resulting in a new aggregated rank ($r_{t,s}$). This is accomplished through rank aggregation using Weighted Borda-Fuse (WBF), where $\gamma$ is an empirical value in the range $[0, 1]$, chosen in Eq. (8) as in [13].
$$r_{t,s} = (1 - \gamma) \, r_t + \gamma \, r_s \qquad (8)$$
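A minimal sketch of the fusion in Eq. (8); the default gamma value here is an assumption for illustration, since the exact value is obscured in the source:

```csharp
static class RankFusion
{
    // Eq. (8): Weighted Borda-Fuse style linear combination of the textual
    // rank r_t and the social rank r_s. gamma is an empirical value in [0, 1]
    // (the 0.5 default is assumed for illustration).
    public static double CombinedRank(double rt, double rs, double gamma = 0.5)
    {
        return (1.0 - gamma) * rt + gamma * rs;
    }
}
```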
3.3. Description of the proposed model
Our model is basically the same as in [18], which is constructed mainly to use social information on the web to enhance web retrieval. The process is divided into background and foreground parts, and the construction of each of them is described in the following figures. Figure 2 shows the sub-processes performed in the background with no interaction with the user. The process starts by obtaining social information: the URLs of cached web documents are used to request and retrieve all possible data from different sources on the web. The retrieved information is then stored in a database for later use by the system. To extract and use this data, which is usually found in either JSON (JavaScript Object Notation) format or RSS/XML (Really Simple Syndication/Extensible Markup Language) format, parsers are used to split the data into usable fields in the database. Documents are then indexed by including data from Delicious (from the database) as well as the document itself in different fields of the index.
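As an illustration of the parsing step, here is a minimal C# sketch that extracts fields from a JSON payload; the ShareCounts type, the url/likes/shares keys and the use of Newtonsoft.Json are all assumptions, not the schema of any real social API.

```csharp
using Newtonsoft.Json.Linq;

// Hypothetical container for parsed social fields stored in the database.
class ShareCounts
{
    public string Url;
    public long Likes;
    public long Shares;
}

static class SocialParser
{
    // Parse a JSON payload from a social service into usable database fields.
    public static ShareCounts Parse(string json)
    {
        JObject o = JObject.Parse(json);
        return new ShareCounts
        {
            Url = (string)o["url"],
            Likes = (long?)o["likes"] ?? 0,
            Shares = (long?)o["shares"] ?? 0
        };
    }
}
```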
[Figure 2: Process of the background part of the model. Starting from the cached web documents: (1) get social information from the web; (2) store the social information in a database for later use; (3) parse the social information; (4) index the documents, including Delicious information in the index (document expansion); (5) calculate a social rank for each document and store it in the database.]
[Figure 3: Process of the foreground part of the model. (1) Receive the query from the user and search the index, producing a result set; (2) re-rank the result set using the social information from the database, producing the re-ranked result set.]
All indexed documents are tokenized into terms and stemmed using the Porter stemmer. In addition, a social rank is calculated using the social information retrieved for every document and then stored in the database. In Figure 3, the user submits the query to the system; the system searches the index for the information need and retrieves the result set, which is then re-ranked using the social information stored in the database.
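For reference, a minimal analyzer sketch in the spirit of the paper's Lucene.Net setup; the paper does not state the exact classes or Lucene.Net version used, so this assumes Lucene.Net 3.x:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Assumed analysis chain: tokenize, lowercase, then apply the Porter stemmer.
class PorterAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream stream = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        stream = new PorterStemFilter(stream);
        return stream;
    }
}
```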
4. Dataset and Evaluation Methodology
4.1 Dataset description
The dataset used in this work as a document collection is the same as in [18]. It was collected from the "DMOZ" Open Directory Project (ODP) [22] during March 2014. ODP is the biggest, most comprehensive human-maintained directory of the web. This high-quality and free web taxonomy resource has been used in a number of previous studies for different purposes, such as topic extraction in [13, 23], finding related pages in [19], and as a reference model in [16]. In this paper, ODP is used as a reference model in the same way as in [16]: category paths were used as queries and their corresponding URLs as the ground truth. A collection of 3960 randomly selected categories was gathered as queries, of which only 2439 were actually used and the rest were ignored. Queries were filtered as follows (a code sketch after the list illustrates some of these rules):
1. Queries whose relevant URLs had no social information and/or no longer existed on the web were ignored.
2. Because ODP categorizes some URLs alphabetically, queries that had a capital alphabetic letter as the last category were removed, and if such a letter appeared within the query, the letter alone was removed.
3. Any duplicated words in the query were removed.
4. To ensure measurable quality, all queries must have at least 10 relevant results.
5. Finally, all non-English queries were removed.
A total of 79275 URLs were used in the experiment. All of them have social information from at least one of the used social information sources and have a web document with a title and HTML content. The web documents and all social information were downloaded using a custom C#
application. Table 1 shows the coverage percentage of social information for the dataset URLs used in the experiment.
4.2 Proposal Evaluation Methodology
To evaluate the effectiveness of using social information to enhance the information retrieval process, a framework was built using C# and Lucene.Net (the .NET port of the open-source search engine Lucene, originally written in Java). The evaluation was executed in the numbered stages shown in Figure 4.
[Figure 4: Proposal evaluation framework. Recoverable components: import ODP data; extract queries, URL dataset and reference; get social services info from the web (documents cached); parse Delicious raw social info; indexing (Lucene index storage); social rank calculator (database storage); execute queries against the index; fuse social info; process results; evaluate against the reference, producing the final comparison result.]
The evaluation framework presented in Figure 4 is numbered to show the sequence used in the evaluation process. Phase 1 takes as input data from ODP as categories and subcategories, with each category containing a set of URLs relevant to that category. Queries, URLs and the reference dataset were extracted from the ODP categories. Then, in phase 2, social services information is queried from the web using the URL dataset and stored in a database; in addition, the document itself is downloaded and cached. Parsing all social information into a usable form is done in phase 3 (only data from Delicious is used in the indexing process). To calculate social ranks, phase 5 uses the parsed social information and stores the results in the social-ranks database. Phase 4 is responsible for indexing the web documents
cached in phase 2 and for adding social information from Delicious as fields in a Lucene document. Queries are executed using the information from Delicious together with the document content, title and URL, giving Delicious more importance than the other fields in Lucene. In phase 6, only the first 100 results for each query are stored for the final phase, where they are re-ranked by combining the social score and the textual (Lucene) score using Eq. (8), with the social score calculated using Eq. (5). Weights and normalization are used in six methods, as stated in Table 3. Finally, in phase 7, evaluation metrics are calculated from the results and the final comparison result is extracted.

Table 3: Ranking methods

No. | Method                                        | Social Service Weight (w_i) | Component Weight (s_j) | Components Normalized?
1   | Weighted Mean Rank                            | 1                           | Table 2                | No, uses Eq. (6)
2   | Mean Rank                                     | 1                           | 1                      | No, uses Eq. (6)
3   | Normalized Weighted Mean Rank                 | 1                           | Table 2                | Yes, using Eq. (7)
4   | Normalized Mean Rank                          | 1                           | 1                      | Yes, using Eq. (7)
5   | Social Services Normalized Weighted Mean Rank | Table 1                     | Table 2                | Yes, using Eq. (7)
6   | Social Services Normalized Mean Rank          | Table 1                     | 1                      | Yes, using Eq. (7)
4.3 Evaluation Metrics
To evaluate the system, two metrics are used: Precision and Mean Average Precision (MAP), as defined in [24].
4.3.1 Precision
Precision (P) is the fraction of retrieved documents that are relevant, expressed here as a percentage:

$$P = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} \qquad (9)$$
For ranked results, Precision@k is used to represent precision when inspecting only the first k items in the results, i.e. the fraction of relevant items among the first k items retrieved.
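A minimal sketch of Precision@k, assuming relevance judgments are available as a set of relevant document identifiers:

```csharp
using System.Collections.Generic;
using System.Linq;

static class PrecisionMetric
{
    // Precision@k: fraction of the first k retrieved documents that are relevant.
    public static double PrecisionAtK(IList<string> ranked, ISet<string> relevant, int k)
    {
        return ranked.Take(k).Count(doc => relevant.Contains(doc)) / (double)k;
    }
}
```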
4.3.2 Mean Average Precision (MAP)
The most standard measure among the TREC (Text REtrieval Conference) community is MAP, which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability. MAP is represented here as a percentage. For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need $q_j \in Q$ is $\{d_1, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result down to document $d_k$, then

$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk}) \qquad (10)$$
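A sketch of Eq. (10), building on the Precision@k helper above; the per-query inputs (ranked lists and relevant-document sets) are assumed to come from the evaluation database:

```csharp
using System.Collections.Generic;
using System.Linq;

static class MapMetric
{
    // Average Precision for one query: mean of the precision values measured
    // at the rank of each retrieved relevant document (inner sum of Eq. 10).
    public static double AveragePrecision(IList<string> ranked, ISet<string> relevant)
    {
        if (relevant.Count == 0) return 0.0;
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < ranked.Count; i++)
        {
            if (relevant.Contains(ranked[i]))
            {
                hits++;
                sum += hits / (double)(i + 1); // precision at this relevant hit
            }
        }
        return sum / relevant.Count;
    }

    // MAP: mean of Average Precision over all queries (outer sum of Eq. 10).
    public static double MeanAveragePrecision(
        IEnumerable<(IList<string> Ranked, ISet<string> Relevant)> queries)
    {
        var aps = queries.Select(q => AveragePrecision(q.Ranked, q.Relevant)).ToList();
        return aps.Count == 0 ? 0.0 : aps.Average();
    }
}
```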
4.4 Evaluation Results and Discussion
Evaluation was performed on a Windows PC with the following specifications: CPU: Intel Core i7 3.9 GHz quad core with hyper-threading enabled; RAM: 32 GB DDR3. Microsoft SQL Server was used to store all kinds of information except the index, which was stored using Lucene.Net as independent custom files. Code and libraries were written in the C# programming language, and parallelism was used to improve performance and minimize experiment time. In this section, the results from the evaluation phase (No. 7 in Figure 4) are presented and discussed.
The following legend applies to all subsequent charts: SocialServicesNormalizedWeightedMean, SocialServicesNormalizedMean, NormalizedMean, NormalizedWeightedMean, WeightedMean, Mean.
In Figure 5, using normalization enhances results over the non-normalized methods, yet using social service weights slightly reduces results, which means some relevant results were pushed down the ranked list. This indicates that some relevant results in the top 10 were associated with services of lower weights (of less coverage). In comparison, for the MAP@10 results in Figure 7, the four normalized methods are better than the non-normalized methods; as for the normalized social-services-weighted ranking methods, results are almost identical to the methods that did not use social service weights. Figure 6 shows that the values for normalized results with no social weights are the lowest. This indicates that normalization pushed more relevant results down the top 20, but using social service weights improved results, even surpassing the non-normalized methods. Unfortunately, the social-services-components-weighted version of the social-services-weighted-normalized method has lower values than the method without social service component weights.
[Figure 5: Precision@10 (%) for the six ranking methods; recoverable axis values range from about 19.60 to 19.90, with data labels including 19.787, 19.795, 19.844 and 19.856.]