Aiding Web Crawlers: Projecting Web Page Last Modification

Adeel Anjum
Ecole Polytechnique, University of Nantes
Rue Christian Pauc - BP 50609 - 44306 Nantes cedex 3 - France
Email: [email protected]

Adnan Anjum
NUST
Islamabad, Pakistan
Email: [email protected]

Abstract—Due to the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization using a crawling approach in order to maintain an archive of the entire Web. One of the most important requirements of a Web crawler, especially when it is used for Web archiving, is to be aware of the date (and time) of last modification of a Web page. Knowing this has several advantages, most notably: i) an up-to-date version of a Web page can be presented to the end user; ii) the crawl rate can be adjusted, which allows future retrieval of a Web page's version at a given date, or the computation of its refresh rate. The typical way of obtaining this modification information, namely the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page, in presenting users with up-to-date information, and in making the future versioning of a Web page more precise.

Index Terms—World Wide Web, Web crawlers, HTTP response headers, Web archive.

I. INTRODUCTION

Web archiving is the process of collecting portions of the Web and storing them in an archive for future reference, especially for the research community, historians and the general public. The accuracy of this collection largely depends on the scheme employed for this purpose, i.e., under what settings the Web pages are crawled and what the parameters are for choosing a Web page to store in the archive. Each crawl setting makes use of the HTTP response headers sent by a Web server in order to decide whether or not to choose a Web page for preservation in an archive. It is thus important to identify how old a Web page is, in order to avoid relying on outdated information. Consequently, our proposal for detecting the last modification date of a Web page ultimately relies on HTTP response headers.

A. HTTP Metadata

HTTP 1.1, the main protocol used by Web clients and servers to exchange information, offers various features for timestamping Web pages. Below we present a brief overview of the HTTP metadata that can be used for timestamping.

HTTP headers are sent by the server before the HTML content, and are only seen and interpreted by the browser and any intermediate caches. The most common timestamp-related response headers sent by the server are the Last-Modified: and E-Tag: header fields. They are supposed to be accurate and to change only when the document itself changes. The Last-Modified: header indicates the last modification date of the Web page. The client can use this information to cache the Web page and later supply that date in an If-Modified-Since: request header. Such a request is normally treated as a conditional GET, the document being returned only if its Last-Modified: date is more recent than the one sent in If-Modified-Since:. Otherwise, a 304 (Not Modified) status is returned, and the client is expected to use the cached Web page. HTTP 1.1 introduces a new kind of response header called the entity tag, or E-Tag:. ETags are the comparison mechanism employed by Web servers and browsers to determine whether the component in the browser's cache is the same as the one on the server (an "entity", in other words, is a "component": an image, script, stylesheet, etc.). An E-Tag: is basically a string that uniquely identifies a specific version of a component. Clausen [1] empirically studied the reliability of the E-Tag: and Last-Modified: response headers on a set of a few million Danish Web pages. He shows that unnecessary downloads can be avoided by downloading a Web page only if the Etag is missing or if the Last-Modified: response header indicates a change; the results indicate 63% accuracy in predicting non-change. Given that, in this particular set of experiments, the majority of Web servers ran some version of the Apache Web server [2], it would be interesting to check whether these results are due to the inherent behavior of this software. Moreover, a large-scale experiment on a more recent set of Web pages would also be of interest.
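To make the conditional GET mechanism concrete, here is a minimal Java sketch (our illustration, not part of the system described in this paper; the URL is a placeholder) that re-downloads a page only when the server does not answer 304 (Not Modified):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalGet {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.org/page.html");    // placeholder URL

            // First request: remember the validators the server gives us.
            HttpURLConnection first = (HttpURLConnection) url.openConnection();
            long lastModified = first.getLastModified();          // 0 if absent
            String etag = first.getHeaderField("ETag");           // null if absent
            first.getInputStream().close();

            // Later request: send the validators back; 304 means "use the cached copy".
            HttpURLConnection second = (HttpURLConnection) url.openConnection();
            if (lastModified > 0) second.setIfModifiedSince(lastModified);
            if (etag != null) second.setRequestProperty("If-None-Match", etag);
            int code = second.getResponseCode();
            if (code == HttpURLConnection.HTTP_NOT_MODIFIED) {
                System.out.println("Not modified since " + new java.util.Date(lastModified));
            } else {
                System.out.println("Changed (or validators not honoured), status " + code);
            }
        }
    }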

The Expires: HTTP header is the main means of controlling caches, as it indicates for how long a specific representation stays fresh. The client may refresh its version of the document after this deadline. Expires: headers are supported by nearly every caching software or hardware, and most Web servers provide the liberty of setting this response header in several ways. Although the Expires: header is useful, it has a few limitations. Firstly, since a date is involved, the clocks of the cache and of the Web server must be synchronized; any discrepancy may lead to incorrect timing information. Secondly, Expires: is easy to forget: if one does not update an Expires: response header before it passes, every request will be sent to the server, thereby increasing load and latency. HTTP 1.1 also provides Cache-Control: response headers, which give more control to the user and address the limitations of Expires:. Cache-Control: response headers include:
• max-age=[seconds] — gives the amount of time the page will be considered fresh.
• s-maxage=[seconds] — similar to max-age, except that it is only used in shared (e.g., proxy) caches.
In some specific and controlled environments (e.g., intranets), it might be useful to use these pieces of information to estimate the refresh rate of a Web page.
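For instance, a client-side heuristic for deriving a freshness deadline from these two headers could look as follows; this is a minimal sketch under our own assumptions (placeholder URL, no handling of other directives), not part of the paper's implementation:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Freshness {
        // Epoch-millisecond time until which the cached copy may be considered fresh,
        // preferring Cache-Control: max-age over Expires:, as HTTP 1.1 caches do.
        static long freshUntil(HttpURLConnection conn) {
            String cacheControl = conn.getHeaderField("Cache-Control");
            if (cacheControl != null) {
                for (String directive : cacheControl.split(",")) {
                    directive = directive.trim();
                    if (directive.startsWith("max-age=")) {
                        long maxAge = Long.parseLong(directive.substring("max-age=".length()));
                        long base = conn.getDate() != 0 ? conn.getDate() : System.currentTimeMillis();
                        return base + maxAge * 1000L;
                    }
                }
            }
            return conn.getExpiration();   // Expires: header as epoch ms, 0 if absent
        }

        public static void main(String[] args) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://example.org/").openConnection();  // placeholder URL
            System.out.println("Fresh until (epoch ms): " + freshUntil(conn));
        }
    }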

B. Content and Semantic timestamping

Content management systems (CMS) and Web authors often provide reliable, human-readable information about a Web page's last modification date in its contents. This piece of information can be a global timestamp (typically found in the footer of the Web page, next to keywords such as Last-Modified) or a set of timestamps for particular elements of that Web page (for instance, blog posts, a shoutbox, latest news, etc.), in which case the page timestamp can be taken as the maximum of the individual timestamps. We can extract such information by matching keywords relating to dates (e.g., "Last modification date", "Last modified on", "Last updated on", etc.) or by using entity recognizers for dates (built from regular expressions). Though the information extracted in this way looks quite interesting, it is sometimes incomplete and most of the time it carries no timezone. Heuristics for adding the timezone to a timestamp, given the language or other geographical information about the Web site, should nevertheless be relatively precise. Finally, it is unclear whether such information is always trustworthy: some CMSs do not consider a modification in a subpart of a Web page as a change. In addition to these human-readable timestamps, the HTML contents can also be searched for machine-readable timestamp information. One such case is the use of meta-data in the form of <meta> tags, one particular profile of such meta-data being Dublin Core [3], whose Modified term indicates the last modification date of a Web page. This possibility is, however, only occasionally used by content management systems and Web authors. Semantic timestamping may also include looking at other document formats pointed to by a Web page, such as PDF or Microsoft Office documents, which include both creation and modification dates in their meta-data. These are usually reliable pieces of information, as long as the local computer on which they were modified has a reliable time source. Image, sound and video files also usually carry timestamping meta-data, but they may not be a reliable source of information. For example, the EXIF (Exchangeable Image File) meta-information that comes with JPEG images includes a timestamp that gives the capture date of the picture, which may have nothing to do with its publication date. Sometimes, external semantic contents (e.g., RSS feeds) can be used to date an HTML Web page: one can map an RSS feed containing blog entries to the corresponding Web page in order to timestamp individual items. Another possibility is to use Sitemaps [4]. Sitemaps describe the organization of a Web site in order to improve its indexing by search engines; they may contain not only timestamps for Web pages but also hints about their change rates (e.g., hourly, monthly).
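As a small illustration of the machine-readable case, the following sketch (our own regular expression, covering only a couple of common spellings of the Dublin Core term, and assuming the name attribute precedes content) extracts such a <meta> modification date:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MetaTimestamp {
        // Matches e.g. <meta name="DCTERMS.modified" content="2009-08-03">
        // or name="DC.date.modified"; assumes name appears before content.
        private static final Pattern DC_MODIFIED = Pattern.compile(
                "<meta[^>]+name\\s*=\\s*[\"'](?:DCTERMS\\.modified|DC\\.date\\.modified)[\"'][^>]*" +
                "content\\s*=\\s*[\"']([^\"']+)[\"']",
                Pattern.CASE_INSENSITIVE);

        // Returns the raw content of the Dublin Core "modified" meta tag, or null if absent.
        static String dublinCoreModified(String html) {
            Matcher m = DC_MODIFIED.matcher(html);
            return m.find() ? m.group(1) : null;
        }

        public static void main(String[] args) {
            String html = "<html><head>"
                    + "<meta name=\"DCTERMS.modified\" content=\"2009-08-03\">"
                    + "</head><body>...</body></html>";
            System.out.println(dublinCoreModified(html));   // prints 2009-08-03
        }
    }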

C. Neighborhood technique

Timestamping an information resource is an important task. Nunes et al. [5] propose a technique for estimating the last modification date of a Web page which complements the traditional approaches based on HTTP headers by also looking at a document's neighborhood. The neighborhood of a document, as defined in that paper, consists of its incoming links, outgoing links and media files (e.g., PDF files, image files, sound files, etc.). Though the precision of timestamping a Web page using this technique may not be very high, it can be used to estimate the last modification date of a Web page when no other information is available.
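A minimal sketch of the averaging idea behind this estimator is given below (our simplification; as described in Section II, values equal to the crawl date, the current date, or missing values are additionally filtered out in the actual implementation):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    public class NeighborhoodDating {
        // Averages the Last-Modified timestamps (epoch milliseconds) of a page's neighbors
        // (outgoing links and embedded assets); returns null when no neighbor is dated.
        static Date estimateFromNeighbors(List<Long> neighborLastModified) {
            List<Long> usable = new ArrayList<>();
            for (Long t : neighborLastModified) {
                if (t != null && t > 0) {      // drop neighbors without a usable timestamp
                    usable.add(t);
                }
            }
            if (usable.isEmpty()) return null;
            long sum = 0;
            for (long t : usable) sum += t;
            return new Date(sum / usable.size());
        }
    }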

D. Other approaches

When no other technique is applicable, there is always the possibility of comparing a Web page with its previous versions (either by repeatedly crawling the page or by using an existing Web archive) in order to detect change (indeed, more frequent crawling results in finer change detection). One might find it useful to combine these pieces of information in order to estimate the refresh rate of a Web page; this approach is based on the work done in [6]. Ideally, comparing a hash of the Web page with its stored hash would do the trick, but in practice some insignificant differences may appear in a Web page (e.g., ads or a tip of the day). We can instead compare the distribution of n-grams using shingling techniques [7] in order to compare two versions of a Web page, or fully compute the edit distance to the previous version, as done in [6]. Our approach extends the idea presented in [6]. A detailed survey of techniques that can be used for timestamping is given in [8].
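For the exact-match case, a minimal sketch using the standard Java MessageDigest API is shown below (the page bodies are placeholders); it is precisely the insignificant differences mentioned above that motivate the shingling-based comparison used later in our experiments:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class BodyHash {
        // Hex-encoded MD5 of a page body; comparing it to the previously stored digest
        // detects any byte-level change (but also flags insignificant ones, e.g. rotating ads).
        static String md5Hex(String body) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            String previous = md5Hex("<html>old body</html>");   // would come from the archive
            String current  = md5Hex("<html>new body</html>");   // freshly crawled
            System.out.println(previous.equals(current) ? "unchanged" : "changed");
        }
    }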

II. METHODOLOGY AND IMPLEMENTATION

In order to implement and test the selected techniques, around 500 GB of archived files were arranged with the help of the European Archive (http://www.europarchive.org/). The data has been extracted from the vast Internet cache crawled by the European Archive. It contains a variety of data that is useful to various research groups; for example, data collected from stock exchanges around the world would be very useful for machine-learning research. In this section, we describe the usage of the archived files and the series of tasks conducted in order to implement the solution.

A. Data Collection

The data consists of an Internet cache crawled by the European Archive over a period of a few years. It covers a variety of data useful for many fields of research, not limited to computer science. The total size of the archived files was around 500 GB, and it consists of ARC files (see http://archive.org/web/researcher/ArcFileFormat.php for the format).

B. Implementation

The main idea of our implementation is to read an ARC file for its contents, and Java provides such functionality in an easy way. We used a Java API for this purpose, which offers some useful features, such as a red-black tree implementation, that also improve the efficiency of the overall process. The application has the following phases:
• ARC Reading
• Data Extraction and Processing
• Output
a) ARC Reading: This phase operates on ARC files in the order of their listing in their source directory. Every time the algorithm encounters an ARC file, information about that file (e.g., its name) is stored in a hashtable. This is important for keeping a history of the ARC files read.
b) Data Extraction and Processing: This phase is dedicated to extracting the link structure and the supplementary data. Along with the other information, this phase also extracts the data for each page, which contains not only the response headers but also the raw data (HTML, XML or XHTML). This raw data is then used for processing.
∙ HTTP Metadata: The HTTP response headers are read in order to extract the Last-Modified: and E-Tag: of the current Web page.
∙ Contents Stamping: The contents of the page are processed for: 1) the extraction of keywords such as Modified, Last-Modified, Updated On, Last Modification Date, Last Modified, Updated-On, Last-Modification-Date, Last-Modification or Last Modification, in order to extract any date and time information; 2) the extraction of text that looks like a date. Such dates can be found in comments, blogs, shoutboxes, etc. For the purpose of extracting this information, regular expressions have been used extensively; a minimal sketch of this extraction is given after this list. Several widely used date-time formats are taken into consideration (e.g., May 03, 2009; 03 May 2009; 03-05-2009; 03/05/2009; 03.05.2009; 2009-05-03; 05/03/2009).

We need to focus on both date and time extraction, but obviously in some cases only the date is available. A proper way to represent it is as an interval covering the whole range of possible timestamps (first possible timestamp to last possible timestamp); e.g., "May 3, 2007" might represent the interval from "May 3, 2007 00:00:00 GMT-12" to "May 3, 2007 23:59:59 GMT+12", if GMT-12 and GMT+12 are valid timezones. Having this interval instead of a precise timestamp is not necessarily a problem; it is still useful for getting a general idea of the refresh frequency of a page, or for presenting the archive to a user. Using the above two techniques, we extract all the dates from the raw HTML and put them in a pool of accepted dates, provided they are not equal to the crawl date. The pool may still contain multiple dates; among those, we accept the one closest to the current date.
∙ Semantic Stamping: As discussed before, we can timestamp HTML pages using semantic contents (RSS feeds, document files such as PDF or Microsoft Office files, images, sound files, etc.). The contents of the HTML file are searched for any semantic contents pointed to by that page. This is done by extracting the Last-Modified: response header for each such file; among the list of all the dates extracted through this method, we accept the one closest to the current date.
∙ Neighborhood Stamping: The idea of using neighbors to date Web pages is discussed in [5]. For higher accuracy, we considered two types of neighbors, outgoing links and assets, because the correlation in this case is much higher. First, the Last-Modified: response header is extracted for each outgoing link and the average of these dates is calculated (keeping in mind that if the value of this date is the current date, the archive date, or null, it is not acceptable). Second, Last-Modified: is extracted for all the assets, with the same constraints as above, and the average of these dates is calculated. Finally, we take the average of both dates and accept the resulting date, still keeping the above constraints in mind.
∙ Multiversion timestamping: Archived files contain multiple versions of a Web page, depending on the number of crawls made to that page. We can estimate the last modification date of a Web page by comparing the versions of the same page archived during successive crawls. The idea behind this technique is to compare two versions of a page in the archive: by comparing the contents of these two pages (e.g., by the shingling technique [7]), we can check whether the contents have changed between two successive crawls. If so, the last modification date of that page lies between the two crawl dates. If no change is detected between two successive crawls, the last modification date must be before the earlier version's crawl date. We can increase the granularity by minimizing the time between two successive crawls: the smaller the interval, the greater the precision of the estimation.
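To make the date-extraction step above concrete, here is a minimal sketch (our illustration; the two patterns shown cover only two of the formats listed, whereas the actual implementation uses a larger set of regular expressions):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.Locale;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DateExtractor {
        // Two example patterns: "May 03, 2009" and "2009-05-03"; the full implementation covers more.
        private static final Pattern[] PATTERNS = {
                Pattern.compile("\\b[A-Z][a-z]+ \\d{1,2},\\s?\\d{4}\\b"),
                Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b")
        };
        private static final String[] FORMATS = { "MMM dd, yyyy", "yyyy-MM-dd" };

        // Returns every date-like string in the raw HTML that parses successfully.
        static List<Date> extractDates(String html) {
            List<Date> dates = new ArrayList<>();
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(html);
                SimpleDateFormat fmt = new SimpleDateFormat(FORMATS[i], Locale.ENGLISH);
                fmt.setLenient(false);
                while (m.find()) {
                    try {
                        dates.add(fmt.parse(m.group()));
                    } catch (ParseException ignored) {
                        // matched text was not a real date; skip it
                    }
                }
            }
            return dates;
        }

        public static void main(String[] args) {
            System.out.println(extractDates("Last updated on May 03, 2009 and again on 2009-08-24."));
        }
    }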

c) Output: The output of the above steps is a summary file which reports the average accuracy of each technique used. For each technique, we compute the accuracy by counting the number of pages that can be correctly timestamped using that technique and dividing it by the total number of HTML pages on which the technique was applied. The higher the percentage, the more accurate the technique.

C. Experiments using a live crawler

In theory, the use of the Last-Modified: and E-Tag: response headers should give exact modification information for Web pages, thereby allowing the download of exactly those pages that have changed, and only the headers of the other pages. In practice, however, many servers send no change indicators, or change indicators that are not consistent with changes in the actual content [1]. Our initial data contains 12,033 URLs from the Yahoo! Directory, extracted using software (Directory Extractor) provided by Yahoo!. These experiments essentially revisit the work done in [1] with a new set of data. In our experiments, we crawled the links extracted from the Yahoo! directory every other night for a period of 20 days. On every crawl, we saved the Last-Modified: and E-Tag: response headers and the contents of the HTML page. Once the data was collected, it was compared to find changes in Etags and datestamps. We chose the same settings as in [1].

III. RESULTS AND DISCUSSION

Experiments were conducted on two different sets of data. The first dataset comprises Web archives obtained from the European Archive [9], containing around 500 GB of data; the second dataset is composed of links (URLs) extracted from the Yahoo! directory service. The same sets of experiments were conducted on both datasets in order to determine the best possible technique to date Web pages. This section describes the outcome of these experiments in detail.

A. Terminology

To be of use for estimating a relevant timestamp of a Web page, a strategy must have both good coverage and good effectiveness. Coverage indicates how many Web pages can be dealt with by a particular strategy: it is the percentage of URLs to which that strategy can be applied. Effectiveness indicates, among the URLs to which a particular strategy can be applied (i.e., its coverage), how many Web pages could be correctly timestamped: it is the percentage of URLs on which the strategy succeeds among those on which it can be applied.

B. Experiments on the real Web

1) Dataset characterization: Our initial data contains 12,033 URLs from the Yahoo! Directory. This sample was obtained through Yahoo!'s Random Link (YRL) service using the open-source software Directory Extractor. Table Ia shows the distribution of top-level domains (TLD) in the sample, while Table Ib lists the top domains obtained.

(a) TLD distribution

TLD      No. of pages      %
.com     6,888             57.24
.org     1,601             13.31
.uk        619              5.14
.net       580              4.82
.edu       342              2.84
.nl        145              1.21
.de        139              1.16
.cz        130              1.08
.it        115              0.96
.au        108              0.90
.ca         98              0.81
.dk         87              0.72
.ru         77              0.64
.jp         64              0.53
.fr         57              0.47
.lt         51              0.42
.gov        49              0.41

(b) Top domains in sample

Domain           No. of pages      %
yahoo.com        58                0.48
tripod.com       52                0.43
wikipedia.org    41                0.34
google.com       21                0.17
msn.com          10                0.08
aol.com           8                0.07
geocities.com     7                0.06

Table I: TLD Distribution

The initial sample contained 77 duplicate entries, which were removed from the set of URLs. It is important to clarify that this dataset cannot be seen as a random sample of the Yahoo! search engine.

2) Dataset manipulation: In our experiments, we crawled the dataset of URLs every night over a period of 20 days. For each page, we preserved the date, the Etag, the size, and an MD5 sum of the body of that page. The contents of each page were stored in a SQL Server database, and an MD5 checksum of the body was calculated using the Java MD5 API. The headers, the size of the body and the MD5 sum of the body were also stored in the database for further processing. Once the harvesting was done, the processed data was compared to find changes in Etags, datestamps and MD5 sums. For comparison purposes, we also calculated a tri-gram shingling coefficient as a similarity measure between the contents of two consecutive downloads of a Web page. This was done by implementing, in Java, the n-gram shingling technique proposed in [7]. The idea of shingling is to quantify the similarity between two documents: the similarity of two documents A and B is a number between 0 and 1 such that, when it is close to 1, the documents are likely to be nearly identical. In n-gram shingling, each document is viewed as a sequence of words and is first lexically analyzed into a canonical sequence of tokens; this canonical form ignores minor details such as formatting, HTML commands, and capitalization. We then associate with every document D a set S(D, w) of subsequences of tokens; a contiguous subsequence contained in D is called a shingle. For a given shingle size w (3 in our case), the similarity is defined as

s(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

where |A| denotes the size of the set A. In our experiments, we performed tri-gram shingling and the similarity threshold was set to 0.5, i.e., if the coefficient is greater than or equal to 0.5, the contents of two consecutive downloads of a Web page are considered unchanged; otherwise, the coefficient indicates a change. The threshold of 0.5 gives the flexibility to ignore spurious change predictions, e.g., header information that may not be identical on every download, or non-significant differences appearing in an otherwise unchanged Web page (ads, a tip of the day, the timestamps themselves, etc.). We checked for changes between two successive downloads, and for each such download we recorded whether the contents (not the headers) had changed since the last successful download (by comparing the MD5 sums of the page bodies for exact matching, and by the shingling coefficient), whether datestamps and Etags were present, and whether or not they had changed. Not all servers sent out Etags or datestamps on every visit. As we are interested in forecasting content change, we considered a change between having an Etag or datestamp and not having one to be the same as receiving two different Etags or datestamps, which is another important reason for the shingling threshold of 0.5 and the shingle size of 3.

3) Results: Table II shows all the crawls made and their respective attributes. The first crawl contains some duplicate records, but they have a negligible effect on the overall result. The sixth crawl was not included in the final results because of a substantial deviation from the expected results.

a) Validation of estimators for real Web data: We carried out 20 harvests of the dataset obtained from the Yahoo! directory service. A total of 240,660 entries were found when processing the downloaded data, with an average body size of 1,629 bytes. Since we intend to verify changes in the contents of Web pages, we consider only consecutive downloads, i.e., pages that were also downloaded in the previous run. Table III shows the consecutive downloads along with those whose contents have changed. As can be seen, up to 89% of the downloads are unnecessary and could be avoided with an accurate predictor of content change.

Total pages    Consecutive downloads    Changed (shingling)    Unchanged (shingling)
240,660        229,828                  25,115 (11%)           204,713 (89%)

Table III: Consecutive downloads

Table IV indicates the number of pages in which datestamps and Etags are present, and how well they predict changes in the contents of a Web page. Of the downloaded pages, only 90,016 (40%) have Etags. Datestamps are more common, with 223,226 pages (97%) having one. The results for both indicators are encouraging, missing only a minor and negligible amount of changed pages; the Etag misses just 7%. Table IV also reports the outcome of the experiments aimed at estimating the best technique for timestamping Web pages. As can be seen, the strategies using image, sound and video files, keyword extraction, and outgoing links show a rather high percentage of mispredicted changes: static contents such as images, sounds and video files do not always change when the contents of the Web page change. For each of the sets described above, an average Last-Modified value was calculated. As defined before, requests returning the current date were considered invalid and were not included in the average. In the end, for each URL, we had its own Last-Modified value and the average Last-Modified value of each set.
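To make the comparison procedure of this subsection concrete, the following sketch (our simplified tokenization; the actual implementation follows [7]) computes the tri-gram shingling coefficient between two page bodies:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Shingling {
        // Builds the set of w-word shingles of a document after a crude canonicalization
        // (tags removed, lower-cased, split on whitespace).
        static Set<String> shingles(String html, int w) {
            String[] tokens = html.replaceAll("<[^>]*>", " ")
                                  .toLowerCase()
                                  .trim()
                                  .split("\\s+");
            Set<String> result = new HashSet<>();
            for (int i = 0; i + w <= tokens.length; i++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + w)));
            }
            return result;
        }

        // Jaccard coefficient |S(A) ∩ S(B)| / |S(A) ∪ S(B)| on tri-gram shingles.
        static double similarity(String a, String b) {
            Set<String> sa = shingles(a, 3), sb = shingles(b, 3);
            Set<String> inter = new HashSet<>(sa);
            inter.retainAll(sb);
            Set<String> union = new HashSet<>(sa);
            union.addAll(sb);
            return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            String v1 = "<p>Tip of the day: drink water. Main article text here.</p>";
            String v2 = "<p>Tip of the day: take a walk. Main article text here.</p>";
            // a coefficient >= 0.5 is treated as "unchanged" in our experiments
            System.out.println(similarity(v1, v2));
        }
    }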

C. Experiments on Web archives

Around 500 GB of archive (.arc) files were generously provided by the European Archive [9] for the purpose of this research. The same sets of experiments were conducted on these archives in order to estimate the quality of our strategies on data from Web archives. Table V gives general information about these experiments.

ARCs read      Size processed (bytes)                  HTML pages processed
12,338         2,128,877,382,300 (approx. 2,000 GB)    45,654,176

Table V: Processed Web archives

Table VI gives the results of the experiments conducted on these archives. Due to the complex nature of the data available in Web archives, we only present the coverage measure of our strategies. As can be seen from Table IV and Table VI, the results from both datasets are almost the same, except for the outgoing links strategy and the RSS feeds strategy; this is due to the nature of the data present in Web archives. The outgoing links strategy shows a rather higher coverage for real Web data (Table IV) than for Web archives (Table VI). This is due to the presence of URLs for which the server does not provide any modification information. For Web archive processing, we have a coverage of just 11% for RSS feeds. This is because the Web archives in our dataset were captured when RSS feeds were not yet very popular, and thus few files made use of them. For real Web data, the coverage as well as the effectiveness of RSS feeds are as expected, because our dataset contains only front pages, where the likelihood of a feed is a little higher. The section below discusses the results in more detail.

D. General Discussion

∙ HTTP Metadata

In the case of Web archives, the Last-Modified: response header has shown high variation and we have a coverage of just 63%. This is because of the presence of URLs with query strings or fragments in them.

Crawl      Crawl date    URLs parsed    Last modified    % Last modified    Etags    % Etags
Crawl1     3-Aug-09      12,714         12,351           97.14              5,162    40.60
Crawl2     4-Aug-09      11,411         11,067           96.99              4,720    41.36
Crawl3     5-Aug-09      11,428         11,079           96.95              4,730    41.39
Crawl4     6-Aug-09      11,405         11,053           96.91              4,717    41.36
Crawl5     7-Aug-09      11,293         10,950           96.96              4,669    41.34
Crawl7     9-Aug-09      11,441         11,091           96.94              4,727    41.32
Crawl8     10-Aug-09     11,428         11,087           97.02              4,718    41.28
Crawl9     11-Aug-09     11,432         11,074           96.87              4,706    41.17
Crawl10    12-Aug-09     11,431         11,096           97.07              4,715    41.25
Crawl11    13-Aug-09     11,418         11,074           96.99              4,716    41.30
Crawl12    14-Aug-09     11,416         11,067           96.94              4,718    41.33
Crawl13    15-Aug-09     11,408         11,062           96.97              4,726    41.43
Crawl14    16-Aug-09     11,413         11,060           96.91              4,724    41.39
Crawl15    19-Aug-09     11,414         11,086           97.13              4,726    41.41
Crawl16    20-Aug-09     11,433         11,083           96.94              4,727    41.35
Crawl17    21-Aug-09     11,410         11,050           96.84              4,717    41.34
Crawl18    22-Aug-09     11,254         10,915           96.99              4,674    41.53
Crawl19    23-Aug-09     11,403         11,049           96.90              4,711    41.31
Crawl20    24-Aug-09     11,402         11,058           96.98              4,713    41.33

Table II: Crawl list

#     Estimator                        Exists in consec. downloads    Mispredicts change    Mispredicts non-change
1     Datestamp                        223,226 (97%)                  33 (0.01%)            65,668 (29%)
2     Etags                            90,016 (40%)                   17,139 (7%)           10,267 (5%)
3     Image, sound and video files     144,792 (63%)                  62,054 (27%)          27,580 (12%)
4     Document Files                   117,213 (51%)                  48,264 (21%)          39,071 (17%)
5     RSS Feeds                        94,229 (41%)                   43,668 (19%)          6,895 (3%)
6     Keyword Extraction               78,141 (34%)                   85,037 (37%)          11,492 (5%)
7     Date Extraction                  108,019 (47%)                  34,475 (15%)          66,651 (29%)
8     Image Files + Docs Files         170,073 (74%)                  39,071 (17%)          16,088 (7%)
9     Outgoing Links                   117,213 (51%)                  55,159 (24%)          48,264 (21%)
10    Script files + CSS Files         172,371 (75%)                  20,685 (9%)           16,088 (7%)
11    Neighborhood Technique           190,757 (83%)                  16,088 (7%)           13,790 (6%)

(The neighborhood technique combines the techniques at rows 8, 9 and 10.)

Table IV: Effectiveness of estimators for real Web data

#     Estimator                           Total pages found    Total pages stamped    Coverage (%)
1     Last Modified Headers               28,762,131           28,798,371             63
2     Image, Sound and Video files        30,761,783           30,761,783             67
3     Document Files                      19,631,295           19,731,295             43
4     Image Files + Docs Files            33,405,160           33,405,160             73
5     Outgoing Links                      28,017,967           28,017,967             61
6     RSS Feeds                           5,190,879            5,190,879              11
7     Keyword Extraction                  16,994,723           16,604,423             36
8     Date Extraction                     24,297,829           23,283,629             51
9     JavaScript files + CSS Files        32,724,913           32,724,913             72
10    Neighborhood Technique (4+5+9)      37,340,550           37,340,550             82

Table VI: Coverage of estimators for Web archives

One such example is "http://livingtogether-competition.britishcouncil.org/photos/email_to_a_friend/photos/large/135/2367/0/1/1/null". Servers do not return any modification information for such URLs, which reduces the coverage of this strategy when processing archives. For real Web data, the Last-Modified: response header provides a very high coverage. This is due to the fact that our dataset contains mostly the front pages of Web sites.

∙ Content and Semantic timestamping

As can be seen, in both the archive and the real Web results, image, sound and video files provide rather high coverage and effectiveness for timestamping Web pages. The files considered were image files (including all popular image formats on the Web, i.e., gif, jpeg, tiff, png, bmp, pcd, pbm, psd, xcf), sound files (mostly Shockwave files, .swf) and video files (fla, swf, asf, wmv, avi, mov, mpg, mp4, rm, flv). Precisely, these files include everything that can be found in the src attribute of an image tag and every file that can be included via an embed tag. We searched for the media formats mentioned above. Document files normally provide a more reliable source for timestamping Web pages, given that the local computer on which they were modified has a reliable time source. Document files include MS Office files, PDF (Portable Document Format) files, and compressed files (.zip, .gzip, .rar, etc.). Normally these files are found in anchor tags, and in our experiments we separated them from outgoing links by filtering on certain document types. RSS (Really Simple Syndication) feeds provide a somewhat higher coverage than expected. This is due to the fact that our dataset only deals with front pages. RSS feed information is normally available in a link tag whose type attribute is application/rss+xml or application/atom+xml and whose href attribute points to the RSS file, which is an XML file. As discussed earlier, it is possible to map an RSS feed containing blog entries to the corresponding Web page, in order to date individual items. We parsed each RSS feed file to look for the URL to be timestamped in the item list of the feed. If we find that URL in the list, we can say that the URL is successfully stamped by that RSS feed. There may be more than one RSS feed file; in that case, we searched for the URL in all of them. The keyword extraction strategy has a rather large coverage in both cases. This is because the keywords used for extracting dates are fairly common in many pages. This strategy is implemented by scanning the whole document, using regular expressions, for certain keywords that could indicate a date (e.g., Modified, Last-Modified, Updated On, Last Modification Date, Last Modified, Updated-On, Last-Modification-Date, Last-Modification or Last Modification). If we are able to parse a valid date from the resulting text, we can assume that the Web page can be stamped using this technique.
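As an illustration of the RSS matching described above, a minimal sketch (element names follow RSS 2.0; method and variable names are ours) could look like this:

    import java.io.File;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssTimestamp {
        // Looks for `pageUrl` among the <item> entries of an RSS 2.0 feed and,
        // if found, returns the item's <pubDate>; otherwise returns null.
        static Date dateFromFeed(File rssFile, String pageUrl) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .parse(rssFile);
            SimpleDateFormat rfc822 =
                    new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.ENGLISH);
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList links = item.getElementsByTagName("link");
                NodeList dates = item.getElementsByTagName("pubDate");
                if (links.getLength() > 0 && dates.getLength() > 0
                        && pageUrl.equals(links.item(0).getTextContent().trim())) {
                    return rfc822.parse(dates.item(0).getTextContent().trim());
                }
            }
            return null;   // the page is not listed in this feed
        }
    }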

Estimator                           Approximate efficiency
Last Modified Headers               3 sec
Image, sound and video files        45 sec
Document Files                      50 sec
Image Files + Docs Files            60 sec
Outgoing Links                      40 sec
RSS Feeds                           50 sec
Keyword Extraction                  60 sec
Date Searching                      60 sec
JavaScript files + CSS Files        20 sec
Neighborhood Technique              100 sec

Table VII: Estimators efficiency

The date extraction strategy shows somewhat higher coverage and effectiveness than the keyword extraction strategy. This is because all the dates extracted by the keyword extraction strategy are also covered by the date extraction strategy. This strategy is implemented by searching for dates in the body of a Web page with the help of regular expressions; the resulting dates are validated with the same procedure as in the keyword extraction strategy. Script and CSS files include any client-side scripting files, such as JavaScript or VBScript, and CSS files. As shown in the results, this strategy also provides relatively high coverage and effectiveness in both cases. This is due to the fact that such files are found in a very large number of Web pages and, once found, the server usually provides some HTTP metadata for them.

∙ Neighborhood technique

The neighbors of a Web page include its outgoing links, image files, sound files, video files, document files, script files and CSS files: everything that a Web page points to constitutes its neighbors. As its composition suggests, nearly all files that were dealt with separately in the previous strategies are taken into account collectively in this strategy. The results indicate that this technique produces the highest coverage and effectiveness in both the real Web and the Web archive experiments. This is because we use everything present in a Web page to compute its timestamping information, and from one neighbor or another we are usually able to obtain some information regarding the last modification date of that Web page.

E. Efficiency of estimators

Table VII gives the approximate average computation time of each strategy, i.e., the average time required to timestamp a Web page using that strategy. The HTTP timestamp provides the highest efficiency: we only need to read the response headers, and no further processing is required. The lower efficiency of the other estimators is due to the large number of items that need to be processed by them.

Note that there could be multiple JavaScript and CSS files to extract and process, which makes this estimator less efficient than the last-modification-header estimator.

IV. CONCLUSION

In this work, we have analyzed different timestamping techniques with the help of our datasets: real Web data obtained by crawling the Web repeatedly, and Web archives obtained from a third-party research organization. Several experiments were conducted in order to compute the coverage, effectiveness and efficiency of each of these techniques. According to the results of our experiments, if the size of a document is not very large (i.e., less than 2 MB), the neighborhood technique provides the best way of timestamping Web pages. Alternative recommended strategies with relatively high coverage, effectiveness and efficiency are script and CSS files, and image and document files. Summing up the results and discussion, we can conclude that the best ways to timestamp a Web page, in the order of their possible implementation, are:
∙ Etags, which, when present, are the most reliable and efficient estimators.
∙ The neighborhood technique (comprising outgoing links, document files, image files, sound and video files, script and CSS files).
∙ Script and CSS files.
∙ Image and document files.
As future work, an experimental study of timestamping effectiveness, going further than the previous study by Clausen [1], is strongly needed. This could be done by using a large existing crawl, collecting all kinds of timestamps, server information, etc., that can be found, and storing them in a database. The main problem is that there is hardly any ground truth to compare the estimates to, apart from the binary change-prediction test used in [1].

ACKNOWLEDGMENT

This is an independent sub-project carried out as part of the DBWeb team at Télécom ParisTech. The goal of the project is to identify the most appropriate technique for timestamping Web pages. For this purpose, a Web crawl was conducted on a large number of archived (ARC) files provided by the European Archive [9], a non-profit organization responsible for archiving the current Web. I would like to thank my professor, Pierre Senellart, for his constant guidance throughout this project. I would also like to thank the European Archive [9] for providing the required data in the shortest possible time.

REFERENCES

[1] L. Clausen, "Concerning Etags and datestamps," in 4th International Web Archiving Workshop (IWAW04), 2004.
[2] Netcraft, "November 2008 Web survey," November 2008. [Online]. Available: http://news.netcraft.com/archives
[3] Dublin Core Metadata Initiative, "DCMI metadata terms," January 2008.
[4] sitemaps.org, "Sitemaps XML format," February 2008. [Online]. Available: http://www.sitemaps.org/protocol.php

[5] S. Nunes, C. Ribeiro, and G. David, "Using neighbors to date Web documents," in Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management (WIDM), 2007, pp. 129–136.
[6] A. Jatowt, Y. Kawai, and K. Tanaka, "Detecting age of page content," in Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management (WIDM), 2007, pp. 137–144.
[7] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.
[8] M. Oita and P. Senellart, "Deriving dynamics of Web pages: A survey," in TWAW (Temporal Workshop on Web Archiving), February 2011. [Online]. Available: http://hal.inria.fr/inria-00588715
[9] European Archive. [Online]. Available: http://www.europarchive.org/