Chronica: A Temporal Web Search Engine
Deniz Efendioglu, University of San Francisco, [email protected]
Chris Faschetti, University of San Francisco, [email protected]
Terence Parr, University of San Francisco, [email protected]
ABSTRACT
Search engines regularly crawl the web, taking vast snapshots of site content. Because previous crawls are not archived, however, search results pertain only to a single, recent instant in time. Search engine users are unable to request pages discussing UK politics in 2001, for example. The Internet Archive, an organization dedicated to maintaining such snapshots of the Internet, provides access to many previous web crawls but lacks a search facility. Users of the “Wayback Machine” must provide a specific URL for which they want a list of snapshots organized by date. This short paper describes Chronica, a temporal search engine that indexes Internet Archive crawl data in order to provide search results spanning user-specified time ranges. Chronica can generate graphs showing query result hit counts across a given time span and even side-by-side comparisons of different query results. These graphs can be used to, among other things, track a term’s popularity over time for marketing or academic research purposes.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

Keywords
Search engine, temporal search, Search, Indexing, Crawling

1. INTRODUCTION
One of the largest benefits of the web is the rate at which content is updated and made available for viewing; an overlooked result of that process, however, is that content is most often simply replaced and lost to the majority of end users. That was the case until the Internet Archive (www.archive.org) was created, a “non-profit [organization] that was founded to build an ’Internet library’, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format.” Included among these collections is an ever-expanding archive of web snapshots, gathered by the Internet Archive’s in-house crawler Heritrix [3], an open-source Java crawler, in an effort to archive the web for future reference. At the time of publication, the Internet Archive had over 40 billion archived web pages available for browsing. Access to this archive is provided via the Internet Archive Wayback Machine, a web-based application that provides access to archived versions of websites via the original site URL. This dependence upon the original site URL limits the capabilities of such an archive. Chronica, the subject of this short paper, provides a typical search interface with the exception that the user may specify a date range for which search results are desired. Given a query, Chronica can also perform the search at monthly or yearly time intervals to produce a bar graph representing the number of documents found during each interval. Sociologists, historians, politicians, and marketing departments may find this a valuable tool.

2. TEMPORAL SEARCH
Via a web interface, a user can input any textual query, using either the simple or advanced search functionality as shown in Figure 1. The simple search is just that: a plain textual query passed directly to the Lucene index. The advanced search uses Lucene’s own query language, albeit internally, allowing the user to specify URLs, document types, and the like. Both include the ability to specify a date range for the query in order to filter the result set; without one, the entire index is searched. The index contains about 3/4 terabyte of data collected from certain web sites in the UK over the crawl period. For the results page itself, Chronica uses StringTemplate [4] to render search results, allowing the system to easily drop the result objects from a search into predefined HTML and XML templates. The search results page is one example of this: each result is rendered by its own template, and the results are combined in a JSP page to produce HTML output. If so configured, the results page also contains a link to an RSS feed containing the same data, rendered with a different set of templates.
Figure 1: Chronica’s general search interface

Copyright is held by the author/owner(s). ICWE’06, July 11-14, 2006, Palo Alto, California, USA. ACM 1-59593-352-2/06/0007.

3. GRAPHING
One interesting concept that arose during the development of Chronica was the notion of a term or phrase’s growth in popularity over time. Chronica was well suited to such a task, combining archived websites and documents with a keyword search that can be limited to specific date ranges; all it lacked was the ability to display these trends. The easiest way to display data such as the increase of a term over time is a graph. While tracking a single keyword over time is an interesting exercise and can be useful with enough data, it quickly became apparent that a more meaningful use is a side-by-side comparison between keywords, and so the “[VS]” operator was created. This operator delimits different terms or phrases for comparison: the query “java [VS] python” returns a graph with two data points for each time interval, one for “java” and another for “python”. Figure 2 illustrates the hit count plot from our UK data set for the comparative search “bush [vs] blair” using a line graph (the real site uses a bar graph, which was too large to fit properly in this paper format, so the raw data was replotted).
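The comparison described above can be illustrated with a minimal Python sketch. It is not Chronica's implementation, which queries a Lucene index over Internet Archive data; here a small in-memory list stands in for the index, and the names `SNAPSHOTS`, `split_comparison`, `monthly_hits`, and `compare` are hypothetical, introduced only for illustration. The sketch splits a query on the “[VS]” operator and counts, for each term, the documents that match within a date range, grouped by month.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical in-memory "archive": (capture date, page text) pairs.
# This list only stands in for a real search index; it is an assumption.
SNAPSHOTS = [
    (date(2001, 1, 15), "Java applets and Python scripts"),
    (date(2001, 1, 20), "Java servlets"),
    (date(2001, 2, 3), "Python tutorial"),
    (date(2001, 3, 9), "java and more java"),
]


def split_comparison(query):
    """Split a query on the [VS] operator (case-insensitive) into its terms."""
    parts = re.split(r"\s*\[vs\]\s*", query, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def monthly_hits(term, start, end, snapshots=SNAPSHOTS):
    """Count documents containing `term`, per (year, month), within [start, end]."""
    counts = Counter()
    needle = term.lower()
    for captured, text in snapshots:
        # Each matching document counts once, however often the term occurs.
        if start <= captured <= end and needle in text.lower():
            counts[(captured.year, captured.month)] += 1
    return counts


def compare(query, start, end, snapshots=SNAPSHOTS):
    """Return {term: {(year, month): hit count}} for each [VS]-delimited term."""
    return {
        term: dict(monthly_hits(term, start, end, snapshots))
        for term in split_comparison(query)
    }


if __name__ == "__main__":
    series = compare("java [VS] python", date(2001, 1, 1), date(2001, 2, 28))
    for term, counts in series.items():
        print(term, sorted(counts.items()))
```

Each term yields one series of per-interval hit counts, which is the data a bar or line graph of the comparison would plot. Substring matching is used here only to keep the sketch short; a real index would tokenize and rank documents.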
Figure 2: Graph of “bush [vs] blair” search hit count over time

4. RELATED WORK
A similar temporal search engine project was undertaken by Ryan Sheahan at the University of Kansas [1][2]. From a feature standpoint, Chronica and Sheahan’s system are similar: both allow for time-sensitive searching, as well as replay capability. One major difference, however, is that while Sheahan’s system “replays” an archived version of a page, it does not reproduce embedded content (such as images), nor does it rewrite external links on the page. However, Sheahan’s system does provide the ability to view the difference between two versions of the same page, archived at different dates. While this functionality is not perfect, it is certainly a step in the right direction towards exposing the power of temporal web archiving. With a suitable interface, a user could easily compare the changes of a page over time visually. Sheahan’s system also seems to focus more on page retrieval than on analytics, whereas Chronica, with its graphing functionality, allows for simple statistics and tracking of archived data. On the back end, Sheahan’s implementation involved a custom-built search index solution with a MySQL database to store references to the original documents. While his solution does in fact produce a usable temporal search engine, the simplicity of the implementation leaves much to be desired beyond proof of concept. As the size of his search index increased, so did the time required to search through it, in some cases taking upwards of 20 seconds. Fortunately, Lucene is a very fast and efficient search engine, providing a search index of not only minimal size but also speed rivaling many commercial search engines.

5. CONCLUSIONS
Chronica is a temporal web search engine that allows users to search Internet Archive ARC files representing snapshots of the web over time. Previously, it was not possible to search these files and the user had to know the precise URL to look up. Chronica’s indexer properly detects and collates page snapshots that have not changed over time to provide a single search result; versions of that URL over time are still available for viewing. Beyond the normal search capabilities, Chronica provides a graphing facility that shows the number of hits for a query over a time period. Further, multiple queries can be run at once and visually compared against each other; relative hit counts are much more useful than absolute counts because absolute counts are sensitive to the number of sites crawled. Chronica is available under the BSD license at chronica3.cs.usfca.edu.

6. ACKNOWLEDGMENTS
The authors would like to thank Rudd Stevens and Jason Endo for their help with the implementation of Chronica.
7. REFERENCES
[1] Sheahan, Ryan. Improving Query Retrieval Times in the Temporal Search Engine. Master’s thesis, University of Kansas, 2003.
[2] Temporal Search Engine website. http://www.ittc.ku.edu/temporal/
[3] Heritrix. http://crawler.archive.org
[4] Parr, Terence. Enforcing Strict Model-View Separation in Template Engines. In WWW2004 Conference Proceedings, p. 224, May 17-20, 2004, New York City.