Identifying Equivalent URLs using URL Signatures

Lay-Ki Soon, Sang Ho Lee
School of Information Technology, Soongsil University, Seoul, Korea
[email protected], [email protected]

Abstract

In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to enhance the standard URL normalization by incorporating the semantically meaningful metadata of the Web pages. The metadata taken into account are the body texts of the Web pages, which can be extracted during HTML parsing. Given a URL which has undergone the standard normalization mechanism, we construct its URL signature by hashing or fingerprinting the body text of the associated Web page using Message-Digest algorithm 5. URLs which share identical signatures are considered equivalent in our scheme. The experimental results show that our proposed method helps to further reduce redundant Web information retrieval by 34.57% in comparison with the standard URL normalization mechanism.
1. Introduction

Due to the overwhelming size of the World Wide Web (WWW), finding and retrieving relevant information from the Web has become inevitably challenging. To support effective searching and retrieval of the required information from the Web, search engines periodically crawl and index the WWW. In a typical crawling process, a set of seed uniform resource locators (URLs) is provided to initiate the crawling [9]. Upon fetching the Web pages, HTML parsing is conducted to extract the URLs or any other intended data contained in the Web pages. Before adding the unvisited URLs to its frontier, the crawler performs URL normalization to identify equivalent URLs. URLs are considered equivalent if they point to the same Web pages. As such, URL normalization is essential to avoid fetching the same Web pages more than once [7, 9]. Nevertheless, the standard URL normalization mechanism normalizes URLs syntactically, without referring to the Web pages linked by the URLs. In this paper, we propose to enhance the URL normalization process by considering the metadata of the corresponding Web pages. In addition to normalizing the URLs syntactically, each URL is represented by a URL signature, which is constructed using the body text of the corresponding Web page. URL signatures are generated for URLs which have undergone the standard URL normalization process. The experimental results indicate that our proposed method manages to improve the process of identifying equivalent URLs by 34.57% compared to the standard URL normalization mechanism. This paper is organized as follows: some related works are discussed in Section 2. Section 3 explains the standard URL normalization mechanism. URL signatures and the flow of their application in addition to the standard URL normalization are presented in Sections 4 and 5, respectively. Section 6 describes our experiment as well as the results. This paper is finally concluded in Section 7.
2. Related Works

The standard URL normalization mechanism has been studied and extended by [5] and [7]. Lee et al. extended the standard URL normalization mechanism with three additional steps: changing the path component of the URL into lower case, eliminating the trailing slash symbol in a non-empty path component, and eliminating the default pages [7]. The default pages considered are default.html, default.htm, index.html and index.htm. In [5], four evaluation metrics were defined: URL consistency, URL applying rate, URL reduction rate and true positive rate. The elimination of the trailing slash symbol was also proposed to further extend the standard URL normalization mechanism. Our proposed method is able to identify equivalent URLs even without eliminating the default pages from the URLs, as further discussed in Section 6.
Another related work was done by Schonfeld et al., who studied the patterns of how URLs are constructed within a particular website, followed by constructing DUST (different URLs with similar text) rules [1]. For instance, the DUST rule “.co.il/story_” → “.co.il/story?id=” will replace “.co.il/story_” in all the URLs with “.co.il/story?id=”. Since each website may observe different formatting in its URLs, the DUST rules must be mined separately for each website. As of June 2008, Netcraft charted 174 million websites available on the WWW [8]. As such, it is practically infeasible to mine DUST rules from each website individually. Different from Schonfeld et al.’s approach, our proposed method is not site-dependent and can be applied to all URLs without any additional cost. To avoid crawling duplicated Web pages, certain crawlers store the digests of the Web page contents in an index [3]. When a Web page is crawled, its digest is checked against the index or the frontier of the Web crawler. If the digest already exists, then the Web page will not be crawled. In our case, instead of hashing the complete Web page contents to generate the digest or fingerprint, we use only the textual contents which are not embraced by any HTML tags.
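Applying a mined DUST rule amounts to a plain string substitution over the URL. A minimal sketch using the example rule above (the host name in the usage example is made up for illustration):

```python
def apply_dust_rule(url: str, pattern: str, replacement: str) -> str:
    # Rewrite every occurrence of the rule's pattern in the URL.
    return url.replace(pattern, replacement)

# The example rule from the text: ".co.il/story_" -> ".co.il/story?id="
canonical = apply_dust_rule("http://news.co.il/story_1234",
                            ".co.il/story_", ".co.il/story?id=")
# canonical == "http://news.co.il/story?id=1234"
```

The site-dependence criticized in the text is visible here: the rule strings encode one website's URL conventions, so a separate rule set must be mined for every site.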
3. Standard URL Normalization

The steps in the standard URL normalization mechanism are listed in Table 1; they can be categorized into syntax-based, scheme-based and protocol-based [2]. As we can see, the standard URL normalization focuses on normalizing the URLs syntactically. There are five components of a URL, as illustrated in Figure 1 [2]. Each step in Table 1 focuses on specific components of the URLs. The fragment component of a URL denotes indirect identification of a secondary resource by reference to a primary resource. For example, there may be links in a Web page which link to different fragments within the same Web page. Since the fragment component identifies the different fragments or resources within the same Web page or the same URL, it is not considered in the standard URL normalization mechanism [2].

http://www.example.com/folder/exist?name=sky#head
  scheme:    http
  authority: www.example.com
  path:      /folder/exist
  query:     name=sky
  fragment:  head

Figure 1: Components of URL
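The five components of Figure 1 can be picked apart with Python's standard library, for illustration:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com/folder/exist?name=sky#head")
# The five components shown in Figure 1:
assert parts.scheme == "http"
assert parts.netloc == "www.example.com"   # the authority component
assert parts.path == "/folder/exist"
assert parts.query == "name=sky"
assert parts.fragment == "head"
```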
As aforementioned, URLs are extracted from the Web pages during HTML parsing. Before adding these unvisited URLs into the frontier, the extracted URLs will be normalized following the steps specified in the standard URL normalization mechanism. After performing the standard URL normalization, URLs which are syntactically identical will not be added to the frontier, in order to avoid crawling and fetching the same Web pages.

Table 1: Steps in the Standard URL Normalization

Syntax-Based Normalization
i. Case normalization – convert all letters in the scheme and authority components to lower case.
ii. Percent-encoding normalization – decode any percent-encoded octet that corresponds to an unreserved character, such as %2D for hyphen and %5F for underscore.
iii. Path segment normalization – remove dot-segments, such as ‘.’ and ‘..’, from the path component.

Scheme-Based Normalization
i. Add a trailing ‘/’ after the authority component of the URL.
ii. Remove the default port number, such as 80 for the http scheme.
iii. Truncate the fragment of the URL.

Protocol-Based Normalization
i. Only appropriate when the results of accessing the resources are equivalent.
ii. For example, http://example.com/data is directed to http://example.com/data/ by the http origin server.
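The syntax- and scheme-based steps of Table 1 can be sketched as follows. This is a simplified illustration in Python (the authors' implementation is in Java); it covers only the steps listed above, not full RFC 3986 normalization, and the dot-segment removal is deliberately minimal:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def _decode_unreserved(s: str) -> str:
    # Step (ii): decode percent-encoded octets only when they correspond to
    # an unreserved character (ALPHA / DIGIT / "-" / "." / "_" / "~").
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if (ch.isascii() and ch.isalnum()) or ch in "-._~" else m.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, s)

def normalize(url: str) -> str:
    parts = urlsplit(url)
    # Step (i): case normalization of the scheme and authority components.
    scheme, netloc = parts.scheme.lower(), parts.netloc.lower()
    # Scheme-based step (ii): remove the default port (80 for http).
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    path = _decode_unreserved(parts.path)
    # Step (iii): remove dot-segments '.' and '..' from the path.
    segments = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if len(segments) > 1:
                segments.pop()
            continue
        segments.append(seg)
    path = "/".join(segments) or "/"   # scheme-based step (i): trailing '/'
    # Scheme-based step (iii): truncate the fragment component.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

print(normalize("HTTP://WWW.Example.COM:80/%7Euser/./a/../index.html#top"))
# http://www.example.com/~user/index.html
```

Note how the percent-encoded ‘%7E’ is restored to ‘~’; Section 6 observes that exactly this step accounts for most of MMU's syntactically equivalent URLs.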
4. URL Signature

By convention, after undergoing the standard URL normalization process, URLs which are syntactically identical are deemed equivalent and thus get eliminated [9]. However, there are many syntactically different and yet equivalent URLs, which point to the same Web pages. For example, two pairs of equivalent URLs which have been identified by our proposed URL signatures are http://www.cnn.com/TECH/, equivalent to http://www.cnn.com/technology/, and http://www.weather.com/jobs/, equivalent to http://www.weather.com/careers/. These two pairs would not be identified by the standard URL normalization mechanism since they are syntactically different. Such syntactically different but equivalent URLs have motivated us to propose URL signatures, which incorporate semantically meaningful metadata of the
Web pages linked by the URLs. The metadata considered are the body texts of the Web pages. In our proposed method, body text is defined as the textual data appearing within the content of a Web page that is not embraced by any HTML tags. Body texts can be easily extracted during HTML parsing in a crawling process. In fact, this proposed method is part of our effort to enhance focused crawling, which aims to crawl Web pages of certain topics; as such, analysis of the Web page contents is essential in focused crawling. Note that the extracted body texts are considered as metadata describing the Web pages, as body texts by no means represent the complete contents of the Web pages; there are many other types of data carried in Web contents nowadays, such as scripting, images, hyperlinks, video as well as audio data. Once the body text bi of URL ui is extracted, bi is hashed into 32 hexadecimal characters hash(bi) using Message-Digest algorithm 5 (MD5), forming the URL signature sig(ui) for ui. In other words, instead of representing each URL with hash(ui′), where ui′ denotes ui after the standard URL normalization [3], we propose to use hash(bi) as sig(ui) to represent ui. MD5 has been widely used to generate message digests for secure transmission [10]. Besides, digests or fingerprints of files are also generated using MD5, mainly for checking the integrity of files. We use MD5 as the hashing function to generate our URL signatures because it is sensitive to even a small change in the input. Figure 2 shows the basic steps of the MD5 algorithm; the detailed algorithm can be found in [10]. In short, given a set of URLs U, we extract the body texts B of the associated Web pages, followed by generating a URL signature sig(ui) for each URL ui, where sig(ui) = hash(bi), ui ∈ U and bi ∈ B. URLs with the same signatures are considered equivalent and are eliminated to avoid redundant crawling and fetching of the same Web pages.
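The construction of sig(ui) = hash(bi) can be illustrated with a short sketch. Here Python's html.parser and hashlib stand in for the Java classes used in the actual implementation, and the sample page is made up:

```python
import hashlib
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects the textual data that is not embraced by any HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def body_text(self):
        return " ".join(self.chunks)

def url_signature(page_html: str) -> str:
    # sig(u) = hash(b): the MD5 digest of the body text, 32 hex characters.
    extractor = BodyTextExtractor()
    extractor.feed(page_html)
    return hashlib.md5(extractor.body_text().encode("utf-8")).hexdigest()

sig = url_signature("<html><body><h1>News</h1><p>Hello world</p></body></html>")
assert len(sig) == 32   # an MD5 digest is always 32 hexadecimal characters
```

Because MD5 is sensitive to even a single-character change, pages whose body texts differ at all receive different signatures.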
5. The Flow of Our Proposed Method

Figure 3 shows the flow of applying URL signatures in addition to the standard URL normalization, which contains five main steps. To obtain a steady set of URLs for our experiment, we have selected only the regular URLs Ureg, which appear in all the crawling sessions. Much benefit can be gained if equivalent URLs from Ureg can be identified and eliminated, avoiding redundant Web pages in future crawling.
Step 1: Append padding bits. The original message is padded by appending a single ‘1’ bit followed by ‘0’ bits until its length is 64 bits short of a multiple of 512.
Step 2: Append length. A 64-bit representation of the length of the original message in bits is appended to the padded message. The 64-bit length is broken into two 32-bit words; the low-order word is appended first, followed by the high-order word.
Step 3: Initialize MD buffer. The 128-bit buffer of MD5 is initialized as follows:
  • The buffer is divided into 4 words (32 bits each), named A, B, C, and D.
  • Word A is initialized to 0x67452301.
  • Word B is initialized to 0xEFCDAB89.
  • Word C is initialized to 0x98BADCFE.
  • Word D is initialized to 0x10325476.
Step 4: Process message in 512-bit blocks. For each input block, 4 rounds of operation, each with 16 operations, are performed. The details of this main step can be found in [10].
Step 5: Produce output. The contents of buffer words A, B, C and D are returned, low-order byte first.

Figure 2: Basic Steps of the MD5 Algorithm

Given Ureg, we apply all the steps of the standard URL normalization listed in Table 1. Subsequently, we omit all the syntactically identical URLs in order to obtain a set of syntactically unique URLs in Ustd. After fetching all the Web pages of the URLs from Ustd in the second step, we de-tag the Web pages and extract the body texts accordingly. We de-tag the Web pages using ParserDelegator from javax.swing.text.html.parser. In the subsequent step, the extracted body texts are MD5-hashed using the MessageDigest class provided by java.security. The generated 32-hexadecimal-character message digests, or fingerprints, form the URL signatures for all the URLs in Ustd. Finally, the URL signatures are compared and URLs which share the same signatures are eliminated to form Ufin.
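The final comparison step can be sketched as follows, assuming the body texts have already been extracted in the de-tagging step (the URLs and body texts below are illustrative only):

```python
import hashlib

def form_ufin(url_body_pairs):
    """Keep only the first URL seen for each distinct signature; the
    surviving URLs correspond to Ufin in the flow of Figure 3."""
    first_seen = {}                      # signature -> first URL carrying it
    for url, body in url_body_pairs:
        sig = hashlib.md5(body.encode("utf-8")).hexdigest()
        first_seen.setdefault(sig, url)  # later duplicates are dropped
    return list(first_seen.values())

u_fin = form_ufin([
    ("http://www.example.com/TECH/",       "science news"),
    ("http://www.example.com/technology/", "science news"),  # same body text
    ("http://www.example.com/jobs/",       "careers page"),
])
# Only one URL of the equivalent pair survives, so u_fin has two entries.
```

Keeping signatures in a dictionary means each URL is checked against all previously seen signatures in constant expected time, which matches the digest-index approach discussed in Section 2.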
Ureg → (1) Standard URL Normalization → Ustd → (2) Fetch Web Pages → Web Pages → (3) De-tag Web Pages & Extract Body Texts → Body Texts → (4) Calculate MD5 Message Digest → URL Signatures → (5) Compare sig(ui), ui ∈ Ustd → Ufin

Figure 3: Flow Diagram of Our Proposed Method

Owing to the relatively small dataset, we verify the identified equivalent URLs manually in this paper. To evaluate the effectiveness of our URL signatures, we apply one of the metrics proposed by Lee et al. [5], which is the URL reduction rate:

URL Reduction Rate = (|Ustd| − |Ufin|) / |Ustd|    (1)

Since Ustd contains all the URLs which have undergone the standard URL normalization while Ufin contains the remaining URLs with unique URL signatures, the value returned by the URL reduction rate shows the probability of having equivalent URLs among the set of syntactically unique URLs in Ustd. In addition, we also tabulate our results in a contingency table as shown in Table 2. Having the contingency table, we further analyze the results by using the following metrics [4]:

Sensitivity = TP / Positive    (2)
Specificity = TN / Negative    (3)
Precision = TP / (TP + FP)    (4)
Accuracy = Sensitivity × (Positive / (Positive + Negative)) + Specificity × (Negative / (Positive + Negative))    (5)

Table 2. Description of Contingency Table

                        Prediction
  Actual      Positive               Negative
  Positive    True Positive (TP)     False Negative (FN)
  Negative    False Positive (FP)    True Negative (TN)

Positive indicates the number of URLs which are equivalent, whereas negative indicates the number of non-equivalent URLs in our dataset. True positive denotes the number of equivalent URLs which are identified or predicted as equivalent by our URL signatures, and false negative otherwise. Likewise, true negative carries the number of non-equivalent URLs which are predicted as non-equivalent. As such, sensitivity and specificity evaluate the performance of our proposed URL signatures in terms of identifying actual equivalent and non-equivalent URLs respectively. On the other hand, precision shows the percentage of correct predictions in identifying equivalent URLs, while accuracy presents the performance of our URL signatures as a whole.

6. Experiment and Results

6.1. Experiment Setup and Dataset
We have used Web Data Extractor [11] to obtain the lists of URLs from five websites. These websites have been adopted considering their diverse natures and different geographical locations. They were crawled every two days, starting from 9 April until 16 July 2008, amounting to 50 crawling sessions. For the purpose of this experiment, we have configured Web Data Extractor to collect a maximum of 5000 URLs from each website, where all URLs contain the complete root domain. For instance, all URLs crawled from CNN contain http://www.cnn.com/. URLs which appear in all the crawling sessions are selected for Ureg. Since the file which stores the list of URLs crawled from CNN on 7 May 2008 is corrupted, URLs which appear in the other 49 crawling sessions are selected as the regular URLs for CNN.
Table 3. Experimental Dataset

Website                  | Company / Nature / Location                                        | Average URLs per Crawling | Number of Regular URLs | Percentage of Regular URLs
http://www.arirang.co.kr | Arirang / Broadcasting / Korea                                     | 4353 | 694  | 15.94%
http://www.bt.com        | British Telecom / Telecommunications / United Kingdom              | 4806 | 601  | 12.51%
http://www.cnn.com       | Cable News Network / News Cable Television Network / United States | 3913 | 424  | 10.83%
http://www.mmu.edu.my    | Multimedia University / Education / Malaysia                       | 1257 | 584  | 46.47%
http://www.weather.com   | The Weather Channel / Weather Forecasts / United States            | 4004 | 1528 | 38.16%
Table 3 lists the information about our dataset. As we can see, there are 3831 URLs in Ureg, which is around 20.90% of the complete dataset. Having collected the URLs for all five websites in Ureg, we then proceeded to Step 1 of our proposed method, as shown in Figure 3.
6.2. Experimental Results and Discussion

After undergoing the standard URL normalization, 101 URLs in Ureg have been found syntactically identical and thus excluded from Ustd. Having fetched the Web pages of all the 3730 URLs in Ustd, we have then de-tagged the Web pages in order to extract the body texts. Subsequently, URL signatures are generated for all URLs in Ustd by MD5-hashing the extracted body texts. After comparing the URL signatures of all the URLs in Ustd, we have managed to identify 1287 URLs with the same URL signatures (see Table 7). In other words, only 2443 URLs remain in Ufin. Table 4 lists some of the equivalent URLs identified by URL signatures within these five websites. Rows 3, 5, 6, 7 and 9 show that URL signatures are able to identify equivalent URLs which contain the default pages, without omitting the default pages as proposed by [7]. Besides, rows 4 and 8 indicate that these are the default Web pages in the particular directories, although their URLs do not contain the typical names for default pages, such as index.html. Likewise, the first and second pairs of equivalent URLs demonstrate that “&id=&page=1” and “&page=1” denote the corresponding default Web pages. Obviously, the standard URL normalization mechanism would not be able to detect the equivalence listed in Table 4, since all of these URLs are syntactically different.

Table 4. Examples of Equivalent URLs

1.  • http://www.arirang.co.kr/Blog/Arirang_Town.asp?code=Bl1
    • http://www.arirang.co.kr/Blog/Arirang_Town.asp?code=Bl1&id=&page=1
2.  • http://www.arirang.co.kr/News/News_List.asp?code=Ne5
    • http://www.arirang.co.kr/News/News_List.asp?code=Ne5&page=1
3.  • http://www.bt.com/
    • http://www.bt.com/index.jsp
4.  • http://www.bt.com/business/
    • http://www.bt.com/business/home/
5.  • http://www.cnn.com/CNN/Programs/american.morning
    • http://www.cnn.com/CNN/Programs/american.morning/index.html
6.  • http://www.cnn.com/TECH/science/archive/
    • http://www.cnn.com/TECH/science/archive/index.html
7.  • http://www.mmu.edu.my/~fom/
    • http://www.mmu.edu.my/~fom/index.html
8.  • http://www.mmu.edu.my/~dev/
    • http://www.mmu.edu.my/~dev/aboutus.htm
9.  • http://www.weather.com/activities/homeandgarden/home/mosquito/
    • http://www.weather.com/activities/homeandgarden/home/mosquito/index.html
10. • http://www.weather.com/achesandpains/
    • http://www.weather.com/activities/health/achesandpains/
Figure 4. Percentage of Equivalent and Non-Equivalent URLs for each Website
(per website: unique URLs / syntactically identical URLs found by the standard URL normalization / equivalent URLs detected by URL signatures)

  Arirang:  81.84% / 0.00%  / 18.16%
  BT:       17.30% / 0.17%  / 82.53%
  CNN:      88.68% / 0.24%  / 11.08%
  MMU:      73.29% / 16.61% / 10.10%
  Weather:  63.29% / 0.13%  / 36.58%

Figure 4 illustrates the percentage of equivalent URLs found in these five websites. The first percentage for every website denotes the percentage of unique URLs, the second denotes the syntactically identical URLs found after the standard URL normalization, while the last percentage indicates the percentage of equivalent URLs detected by our URL signatures. All websites except MMU observe a higher reduction of equivalent URLs by using URL signatures compared to the standard URL normalization. Most of the URLs in MMU have been found equivalent after the standard normalization because they contain percent-encoded unreserved characters. For instance, ‘~’ is stated as ‘%7E’ in some of its URLs. As shown in Figure 4, CNN has the least equivalent URLs compared to the rest. Note that BT has the most equivalent URLs because 484 of the regular URLs in its Ustd linked to Web pages which are not found in its Web server. Therefore, if URL signatures are applied, we may reduce up to 80.67% (484 of 600 URLs) of the crawling of these unavailable Web pages.

Table 5. Contingency Table of Standard URL Normalization

                    Prediction
  Actual      Positive   Negative   Total
  Positive    101        1086       1187
  Negative    0          2644       2644
  Total       101        3730       3831
Tables 5 and 6 show the overall contingency tables of the standard URL normalization technique and our proposed method respectively, while Table 7 compares the effectiveness of our URL signatures with the standard URL normalization technique using the evaluation metrics. As we can see from Table 5, out of 3831 URLs in Ureg, 3730 of them are syntactically different. Out of 1187 equivalent URLs in Ureg, only 101 equivalent URLs are reduced from Ureg in order to form Ustd. The remaining 1086 URLs, which are syntactically different, form Ustd together with the other 2644 non-equivalent URLs in Uneg. Nevertheless, note that the standard URL normalization records 0 false positives because only syntactically identical URLs are considered to be equivalent. Likewise, in Table 6, 2644 out of 3831 URLs in Ureg have unique URL signatures and are grouped in Uneg. However, the remaining 1086 equivalent URLs which are not identified by the standard URL normalization technique (the 1086 false negatives in Table 5) have identical URL signatures. Hence, these 1086 URLs are predicted to be equivalent by our proposed method. However, there are 201 non-equivalent URLs mistakenly predicted as equivalent by our method.

Table 6. Contingency Table of Our Proposed Method

                    Prediction
  Actual      Positive   Negative   Total
  Positive    1187       0          1187
  Negative    201        2443       2644
  Total       1388       2443       3831
Table 7. The Evaluation Metrics

Metrics              | Standard URL Normalization | URL Signatures
URL Reduction Rate   | 0.03                       | 0.36
Sensitivity          | 8.51%                      | 100%
Specificity          | 100%                       | 92.40%
Precision            | 100%                       | 85.52%
Accuracy             | 71.65%                     | 94.75%
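The figures reported for our proposed method in Table 7 can be reproduced directly from its contingency table (TP = 1187, FN = 0, FP = 201, TN = 2443); a quick check in Python:

```python
def metrics(tp, fn, fp, tn):
    positive, negative = tp + fn, fp + tn          # actual equivalents / non-equivalents
    total = positive + negative
    sensitivity = tp / positive                    # equation (2)
    specificity = tn / negative                    # equation (3)
    precision = tp / (tp + fp)                     # equation (4)
    accuracy = (sensitivity * positive / total     # equation (5); this reduces
                + specificity * negative / total)  # to (tp + tn) / total
    return sensitivity, specificity, precision, accuracy

sens, spec, prec, acc = metrics(tp=1187, fn=0, fp=201, tn=2443)
# sens = 1.0 (100%), spec ≈ 0.9240, prec ≈ 0.8552, acc ≈ 0.9475
```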
Table 8. Results comparison between the standard URL normalization mechanism and URL signatures

Comparison / Websites                                                            | Arirang | BT  | CNN | MMU | Weather | Total URLs | Reduced URLs | %
Regular URLs                                                                     | 694     | 601 | 424 | 584 | 1528    | 3831       | –            | –
Reduction by the standard URL normalization                                      | 694     | 600 | 423 | 487 | 1526    | 3730       | 101          | 2.64%
Reduction by using URL signatures in addition to the standard URL normalization  | 568     | 104 | 376 | 428 | 967     | 2443       | 1388         | 37.21%

The URL reduction rate in Table 7 shows that there is a probability of 0.33 (0.36 − 0.03) for each URL in Ustd to have the same URL signature as its peer URLs in Ustd. The 100% sensitivity of our proposed method shows that all equivalent URLs in Upos are identified by the URL signatures. However, only 85.52% (1187) of the 1388 URLs predicted to be equivalent by URL signatures are indeed equivalent. On the other hand, there are 201, or 7.6% (100% − 92.40%), of the non-equivalent URLs in Uneg which share the same URL signatures. This is mainly due to the fact that our signatures are constructed using only the body texts of the Web pages, without considering other types of Web contents. The standard URL normalization technique records higher specificity and precision compared to our proposed method as it records 0 false positives. The main reason is that URLs are only considered equivalent if they are syntactically identical in the standard URL normalization technique. Since the main objective of using URL signatures is to identify equivalent URLs, our proposed method is preferable as it records 100% sensitivity compared to only 8.51% by the standard URL normalization. Overall, our proposed method shows a promising accuracy of 94.75% compared to only 71.65% by the standard URL normalization technique. Table 8 further compares the results of using our proposed URL signatures in addition to the standard URL normalization mechanism. As we can see, when URL signatures are constructed and referred to on top of the standard URL normalization, another 1287 (1388 − 101) equivalent URLs have been identified, contributing to 34.57% (37.21% − 2.64%) of additional reduction in redundant crawling and retrieval of the same Web pages.
7. Conclusion and Future Works

In contrast to the conventional way of representing URLs using the hashed value of URLs after the standard URL normalization, we proposed to represent URLs using URL signatures. The URL signature for each URL is the MD5-hashed body text of the corresponding Web page, where body text is the textual data which is not embraced by any HTML tags within that particular Web page. In other words, we enhance the standard URL normalization mechanism by generating and utilizing URL signatures to further identify syntactically different and yet equivalent URLs. Compared with the standard URL normalization mechanism, our experimental results demonstrate that the proposed URL signatures managed to further reduce equivalent URLs by 34.57%, raising the overall reduction rate from 2.64% to 37.21%. To study the performance of our proposed method on a larger scale, which includes crawling more websites, we will apply our method in our Web crawler [6]. Besides body texts, we also plan to explore the possibility of incorporating other metadata of Web pages, such as page size, to dynamically construct the URL signatures. Since the Web changes rapidly, we are also interested in deriving the optimum frequency of re-constructing URL signatures for the URLs.
Acknowledgement

This work was supported by the Korea Research Foundation funded by the Korean Government (MOEHRD) (KRF-2006-005-J03803).
8. References [1] Bar-Yossef, Z., Keidar, I., Schonfeld, U., “Do Not Crawl in the DUST: Different URLs with Similar Text”, in the Proceedings of the International World Wide Web Conference (WWW 2007), May 2007, pp. 111 – 120.
[2] Berners-Lee, T., Fielding, R., Masinter, L., “Uniform Resource Identifier (URI): Generic Syntax”, available at http://gbiv.com/protocols/uri/rfc/rfc3986.html. [3] Chakrabarti, S., Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, Elsevier, San Francisco, CA, 2003. [4] Han, J., Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Elsevier, San Francisco, CA, 2006. [5] Kim, S. J., Jeong, H. S., and Lee, S. H., “Reliable Evaluations of URL Normalization”, in Proceedings of the 2006 International Conference on Computational Science and its Applications (ICCSA), Glasgow, May 2006, pp. 609 – 617. [6] Kim, S. J. and Lee, S. H., “Implementation of a Web Robot and Statistics on the Korean Web”, in Proceedings of the Second International Conference on Human.Society@Internet (HIS), Seoul, June 2003, pp. 341 – 350.
[7] Lee, S. H., Kim, S. J., Hong, S. H., “On URL Normalization”, in Proceedings of the 2005 International Conference on Computational Science and its Applications (ICCSA), Singapore, May 2005, pp. 1076 – 1085. [8] Netcraft June 2008 Web Server Survey, available at: http://news.netcraft.com/archives/web_server_survey.ht ml [9] Pant, G., Srinivasan, P., Menczer, F., “Crawling the Web”, Web Dynamics 2004, pp. 153 - 178. [10] The MD5 Message-Digest Algorithm, available at: http://tools.ietf.org/html/rfc1321 [11] Web Data Extractor, available at: http://www.webextractor.com/