
(IJIDCS) International Journal on Internet and Distributed Computing Systems. Vol: 1 No: 1, 2011

Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search

Y. Syed Mudhasir
Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25
[email protected]

J. Deepika
Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25
[email protected]

S. Sendhilkumar
Department of Information Science & Technology, College of Engineering, Anna University, Chennai-25
[email protected]

G. S. Mahalakshmi
Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25
[email protected]

Abstract— Users of the World Wide Web rely on search engines to retrieve information, as search engines play a vital role in finding information on the web. However, the performance of a web search is greatly degraded when search results are flooded with redundant information, i.e., near-duplicates. Such near-duplicates hold back other promising results from the users. Many of these near-duplicates come from untrusted websites and/or authors who host information on the web. Such near-duplicates may be eliminated by means of provenance. This paper therefore proposes a novel approach to identify such near-duplicates based on provenance. In this approach, a provenance model is built from the web pages returned as search results by an existing search engine. The proposed model combines both content-based and trust-based factors to classify the results as original or near-duplicate.

Keywords— Web search, Near-duplicates, Provenance, Semantics, Trustworthiness

I. INTRODUCTION
Finding information on the Internet has become a day-to-day task for billions of Internet users. Hence it has become very important that users get the best results for their queries. However, any web search environment faces challenges in providing the user with the most relevant, useful and trustworthy results, as mentioned below:
• The lack of semantics in the web
• The enormous number of near-duplicate documents
• The lack of emphasis on the trustworthiness of documents
There are also many other factors that affect the performance of a web search. Several approaches have been proposed, and research is still ongoing, to optimize web search. These include semantic analysis of the web to provide relevant results as in [1], optimizing indexing functions to improve the storage and retrieval of web documents [2], and optimizing the ranking function to place the best documents at the top of the results as in [3]. However, the efficiency of these approaches depends upon the amount of data available on the Internet. Information on the WWW is enormous and redundant; hence the problem of near-duplicate documents arises. The following subsections A and B briefly discuss the concepts of near-duplicates detection and provenance.

A. Near-Duplicates Detection
Near-duplicate documents can be identified by scanning the content of every document: when two documents have identical content they are regarded as duplicates, and files that bear small dissimilarities, so that they are not identified as exact duplicates of each other but are identical to a remarkable extent, are known as near-duplicates. Some examples of near-duplicate documents from [4] are:
• Documents with a few different words – the most widespread form of near-duplicates
• Documents with the same content but different formatting – for instance, the same text set in different fonts, bold type or italics
• Documents with the same content but with typographical errors
• Plagiarized documents and documents with different versions
• Documents with the same content but different file types – for instance, Microsoft Word and PDF
• Documents providing the same information, written by the same author, and published in more than one domain
There are several existing approaches based on syntactic comparison, on URLs and on semantic comparison. This paper suggests an effective way of identifying and eliminating near-duplicates using provenance, which helps in comparing documents on the basis of provenance factors.

B. Provenance
One of the causes of the growing number of near-duplicates on the web is the ease with which data on the web can be accessed, together with the lack of semantics in near-duplicates detection techniques. It has also become extremely difficult to judge the trustworthiness of such web documents when different versions/formats of the same content exist. Hence, there is a need to bring semantics, i.e. meaningful comparison, into near-duplicates detection with the help of the 6W factors: Who (has authored the document), What (is the content of the document), When (it was made available), Where (it is available), Why (the purpose of the document), and How (in what format it has been published / how it has been maintained) [5]. This information can also be used to calculate the trustworthiness of each document: a quantitative measure of how reliable an arbitrary piece of data is can be determined from its provenance information, and the same information is useful for selecting a representative during the near-duplicate elimination process. Existing approaches to near-duplicates detection and elimination do not give much importance to the trustworthiness of the content of documents retrieved through web search. Thus, provenance-based factors may be used for near-duplicates detection and elimination, providing the user with the most trustworthy results.

II. RELATED WORK
A. Detection and Elimination of Near-Duplicates
There are many works on near-duplicates detection and elimination in the literature. In general, these works may be broadly classified, as shown in Fig. 1, into syntactic, URL based and semantic approaches.

Syntactic

‘Shingling’

‘Signature’

URL Based

Semantics

‘sentence-wise’ Similarity Fuzziness based ‘pair-wise similarity’

‘Semantic Graphs’

Fig.1 Near-duplicates detection techniques

1) Syntactical Approaches: One of the earliest approaches, by Broder et al. [6], proposed a technique for estimating the degree of similarity among pairs of documents, known as shingling. It does not rely on any linguistic knowledge other than the ability to tokenize documents into a list of words, i.e., it is purely syntactic. All sequences of adjacent words (shingles) are extracted; if two documents contain the same set of shingles they are considered equivalent and can be termed near-duplicates. The problem of finding text-based document similarity was also investigated in [7], where a new similarity measure was proposed to compute the pair-wise similarity of documents using a given series of terms of the words in the documents. A kappa measure was also developed for computing the similarity of documents, and an ordered weighted averaging operator was then used to aggregate the similarity measures between a set of documents [7].
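For illustration, the shingling idea can be sketched as follows; this is a minimal Python sketch, and the shingle width of four words and the 0.9 threshold are assumptions for illustration rather than values prescribed in [6].

```python
# Minimal sketch of w-shingling with the Jaccard coefficient as the
# resemblance measure. Shingle width and threshold are illustrative only.

def shingles(text: str, w: int = 4) -> set:
    """Return the set of all sequences of w adjacent words (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}

def resemblance(doc_a: str, doc_b: str, w: int = 4) -> float:
    """Jaccard coefficient of the two documents' shingle sets."""
    a, b = shingles(doc_a, w), shingles(doc_b, w)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Documents whose resemblance exceeds a chosen threshold are flagged as
# near-duplicates of one another.
NEAR_DUPLICATE_THRESHOLD = 0.9
```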


Reference [8] shows another approach in which a similarity measure can be acquired by comparing the exterior tokens of sentences, while a relevance measure can be obtained only by comparing the interior meaning of the sentences. A method is described to explore the quantified conceptual relations of word pairs using the definition of a lexical item, and a practical approach is proposed to measure inter-sentence relevance.
An approach based on the 'signature' concept, as in [9], suggested a method of descriptive words for defining near-duplicates of documents, based on choosing N words from the index to determine a signature for each document. Any search engine based on an inverted index can apply this method, and any two documents with similar signatures are termed near-duplicates. When the shingle-based method and the signature method were compared, the signature method was more efficient in the presence of an inverted index.
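A minimal sketch of this descriptive-words signature idea is given below; selecting the N most frequent non-stop-word terms as the signature, and the particular values of N and the overlap threshold, are simplifying assumptions rather than the exact procedure of [9].

```python
# Minimal sketch of a descriptive-words 'signature': each document is reduced
# to a small set of characteristic index terms, and documents with strongly
# overlapping signatures are treated as near-duplicate candidates.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def signature(text: str, n: int = 8) -> frozenset:
    """Pick the n most frequent non-stop-word terms as the document signature."""
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    return frozenset(term for term, _ in Counter(terms).most_common(n))

def similar_signatures(sig_a: frozenset, sig_b: frozenset, min_overlap: int = 6) -> bool:
    """Documents whose signatures share at least min_overlap terms are near-duplicates."""
    return len(sig_a & sig_b) >= min_overlap
```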


As a result, the above syntactic approaches carry out only a text based comparison; they do not involve URLs or any link structure techniques in identifying near-duplicates. The following subsection discusses the impact of URL based approaches on near-duplicates detection.
2) URL Based Approaches: A novel algorithm, DustBuster, for uncovering DUST (Different URLs with Similar Text) was intended to discover rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST efficiently from previous crawl logs or web server logs instead of probing page contents. The information about DUST allows search engines to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank [10]. Reference [11] shows another approach in which the detection process is divided into three steps: 1) removal according to URLs, i.e. pages with the same URL are removed from the initial set of pages so that the same page is not downloaded repeatedly because of repeated links; 2) removal of miscellaneous information, i.e. the pages are pre-processed to strip navigation information, advertising, HTML tags and other miscellaneous content, and the text is extracted to obtain a set of texts; 3) detection with the DDW algorithm, which is used to detect similar pages. The combination of such URL based approaches with syntactic approaches is still not sufficient, as they bring no semantics to the identification of near-duplicates. The following subsection briefly discusses a few semantic based approaches.
3) Semantic Approaches: A method for plagiarism detection using a fuzzy semantic-based string similarity approach was proposed in [12]. The algorithm was developed in four main stages. The first is pre-processing, which includes tokenization, stemming and stop-word removal. The second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents; this stage entails the computation of a fuzzy degree of similarity that ranges between two edges, 0 for completely different sentences and 1 for exactly identical sentences, and two sentences are marked as similar (i.e. plagiarized) if their fuzzy similarity score is above a certain threshold. The last stage is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections [12]. Recognizing that two Semantic Web documents or graphs are similar, and characterizing their differences, is useful in many tasks, including retrieval, updating, version control and knowledge base editing. A number of text based similarity metrics that characterize the relation between Semantic Web graphs are discussed in [13], and metrics are evaluated for three specific cases of similarity: similarity in the classes and properties used while differing only in literal content, difference only in base-URI, and a versioning relationship. Such techniques are inadequate because the emphasis is on providing relevant content rather than on the trust, originality or authenticity of the documents. Provenance is therefore likely to play an important role in near-duplicates detection.

B. Provenance
Research on provenance is being carried out in which importance is given to the trustworthiness of the content. In general, these works on provenance techniques may be classified, as shown in Fig. 2, into workflow oriented, network oriented, trustworthiness and collecting provenance.

Fig. 2 Provenance techniques: workflow oriented, network oriented, trustworthiness, collecting provenance, and provenance graphs

An approach based on the Collaborative Planning Application (CPA) aims to help users organize information, potentially at a variety of security levels, in the style of a blog/wiki, and data provenance is a natural fit to the CPA. Labels consist of access control labels, which are lists of groups of users with read and write access, and provenance labels, which comprise the list of ProvAction labels that have affected the labeled data. As with most wiki software, a log of how, when, and by whom pages are edited is of interest. To implement data provenance, every public function that modifies the state of the wiki is wrapped so that the appropriate label is updated. Provenance policies cover creating, modifying, deleting, restoring, and relabeling blocks [14].
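The wrapping of state-modifying operations described above can be sketched roughly as follows; the decorator, field names and action vocabulary here are illustrative assumptions and not the CPA implementation of [14].

```python
# Minimal sketch: every public function that modifies a wiki block is wrapped
# so that the block's provenance label records which action was taken, by
# whom, and when.
import functools
from datetime import datetime, timezone

def records_provenance(action: str):
    """Decorator that appends a provenance entry to the modified block's label."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(block: dict, user: str, *args, **kwargs):
            result = fn(block, user, *args, **kwargs)
            block.setdefault("provenance_label", []).append(
                {"action": action, "user": user,
                 "time": datetime.now(timezone.utc).isoformat()})
            return result
        return inner
    return wrap

@records_provenance("modify")
def edit_block(block: dict, user: str, new_text: str) -> dict:
    block["text"] = new_text
    return block
```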

1) Trustworthiness: Another line of research on provenance is knowledge provenance, which determines the validity and origin of web information by modeling and maintaining information sources and information dependencies. It constructs a trust judgment model for knowledge provenance. Trust judgment can be made using the following factors: (1) the trustworthiness of the information creator can be used to represent the trustworthiness of the information created; (2) trust can be placed in how a trusted individual behaves, and this type of trust is intransitive; (3) trust can be placed in what a trusted friend believes to be true in a field, and this type of trust is transitive and can propagate through social networks; (4) trust in an organization in a field can be transferred to a professional member of that organization. The emphasis is on the trustworthiness of the content and on measures to estimate that trustworthiness by means of social networking [15].
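For illustration, factors (1), (3) and (4) above might be combined as in the following sketch; the per-hop decay and the multiplicative combination rule are assumptions introduced for illustration only and are not part of [15].

```python
# Minimal sketch of trust judgment: a document inherits the trustworthiness of
# its creator (factor 1), and trust in that creator may be established
# transitively through a chain of referrals in a social network or through a
# trusted organization (factors 3 and 4), attenuated at each hop.

def propagated_trust(referral_chain: list, decay: float = 0.8) -> float:
    """Combine trust values along a referral chain, attenuating each hop."""
    trust = 1.0
    for hop, t in enumerate(referral_chain):
        trust *= t * (decay ** hop)
    return trust

def document_trust(creator_trust: float, referral_chain: list = ()) -> float:
    """Factor (1): the created information is only as trustworthy as its creator."""
    chain = list(referral_chain)
    return creator_trust * (propagated_trust(chain) if chain else 1.0)

# Example: a document by an author trusted at 0.9, reached via one referral of trust 0.7.
score = document_trust(0.9, [0.7])
```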

2) Collecting Provenance Information: Recording provenance information is a fundamental topic of provenance research, as discussed in [16]. While traditional provenance research usually addresses the creation of data, this provenance model also represents data access in the context of Web data. A system that applies this provenance model generates provenance graphs for data items. Some pieces of provenance information can be recorded by the system itself; for other pieces the system relies on metadata provided by third parties. Recordable provenance information and metadata-reliant provenance information are therefore properly distinguished. Provenance information common to all provenance elements of this type includes the access time and the access method, and a provenance element may also describe the creation and the expiration date. Provenance-relevant metadata is either directly attached to a data item or its host document, or it is available as additional data on the Web; examples of attached metadata are RDF statements about the RDF graph that contains them, and the author and creation date of blog entries.
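To make the distinction concrete, a provenance element of the kind described in [16] might be represented as below; the class and field names are assumptions introduced for illustration, not the cited model's vocabulary.

```python
# Minimal sketch of a provenance element for a Web data item, separating
# information the accessing system can record itself (access time, access
# method, source URL) from metadata-reliant information supplied by third
# parties (creator, creation and expiration dates, attached metadata such as
# RDF statements or blog-entry author/date).
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ProvenanceElement:
    # Recordable by the accessing system itself
    access_time: datetime
    access_method: str
    source_url: str
    # Metadata-reliant, provided by third parties and therefore less certain
    creator: Optional[str] = None
    created: Optional[datetime] = None
    expires: Optional[datetime] = None
    attached_metadata: dict = field(default_factory=dict)
```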


Therefore, a provenance based near-duplicates detection and elimination process helps to retain the original, i.e. the most trustworthy, documents and to eliminate the other replicas, avoiding the overhead caused by near-duplicate documents in the search results, which plays an important role in any web search environment.

III. WEB PROVENANCE BASED DETECTION AND ELIMINATION OF NEAR-DUPLICATES
The entire process of web provenance based near-duplicates detection and elimination is represented in the architecture shown in Fig. 3. The architecture comprises the following components: (i) data collection, (ii) preprocessing, (iii) document term matrix (DTM) construction, (iv) provenance matrix (PM) construction, (v) database, (vi) singular value decomposition, (vii) document clustering based on similarity scores, (viii) filtering, and (ix) re-ranking based on trustworthiness values.

Fig. 3 Architecture of web provenance based near-duplicates detection and elimination (blocks: data collection, preprocessing, document term matrix construction, provenance matrix construction, database, provenance records, singular value decomposition, document comparison, filtering of near-duplicates, trustworthiness calculation, and re-ranking; outputs: filtered documents and near-duplicates)
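A rough sketch of the content-based part of this pipeline, covering components (iii), (vi) and (vii), is given below; the choice of scikit-learn, the number of SVD components and the 0.95 similarity threshold are assumptions for illustration and not the implementation used in this work.

```python
# Minimal sketch: build a document-term matrix, reduce it with singular value
# decomposition, and treat pairs of documents that are highly similar in the
# reduced space as near-duplicate candidates for the later filtering and
# re-ranking stages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_candidates(documents, n_components: int = 50, threshold: float = 0.95):
    dtm = CountVectorizer(stop_words="english").fit_transform(documents)   # (iii) DTM
    svd = TruncatedSVD(n_components=min(n_components, dtm.shape[1] - 1))    # (vi) SVD
    reduced = svd.fit_transform(dtm)
    similarity = cosine_similarity(reduced)                                 # (vii) similarity scores
    return [(i, j) for i in range(len(documents))
                   for j in range(i + 1, len(documents))
                   if similarity[i, j] >= threshold]
```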

