Monitrix – A Monitoring and Reporting Dashboard for Web Archivists

Rainer Simon¹ and Andrew Jackson²

¹ AIT – Austrian Institute of Technology, Donau-City-Str. 1, 1220 Vienna, Austria
² The British Library, Boston Spa, Wetherby, West Yorkshire, LS23 7BQ, UK

Abstract. This demonstration paper introduces Monitrix, an upcoming monitoring and reporting tool for Web archivists. Monitrix works in conjunction with the Heritrix 3 Open Source Web crawler and provides real-time analytics about an ongoing crawl, as well as aggregated summary information about crawled hosts and URLs. In addition, Monitrix monitors the crawl for the occurrence of suspicious patterns that may indicate undesirable behavior, such as crawler traps or blocking hosts. Monitrix is developed in cooperation between the British Library’s UK Web Archive and the Austrian Institute of Technology, and is licensed under the terms of the Apache 2 Open Source license.

Keywords: Web Archiving, Quality Assurance, Analytics.

1 Introduction

One of the most challenging aspects of Web Archiving is quality assurance. Current harvesting technology does not deliver 100% replicas of Web resources, and a long list of known technical issues hampers the task of Web preservation [2]. One aspect is the complexity of Web content, which is a mix of different content types, embedded scripts and streaming media, as well as content that is dynamically generated (the “deep Web” [4]). Another aspect is the scale of the task: the .uk top-level domain alone, for example, consists of more than 10 million hosts. Practical limitations such as finite computing resources and time, as well as the limited legal mandates of libraries, impose restrictions on the size and scope of the crawl. This forces decisions on aspects such as: suitable starting points (“seeds”) for the crawl; criteria for curtailing the crawl of a particular host; the exclusion of specific hosts from those curtailing rules; or relevant and eligible domains outside the targeted domain which should nonetheless be included in the crawl, e.g. URL shorteners. (For a survey of practices in various Web archiving initiatives worldwide refer to [1].) To enable effective reporting, it is important to identify metrics by which the impact of such decisions can be measured, and to implement tools that capture these metrics automatically.
State-of-the-art crawler software such as Heritrix (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix), or archiving services such as Archive-It (http://www.archive-it.org/), offer a number of reports out of the box – e.g. on crawled seeds, hosts and URLs, downloaded file types, etc. However, these are rarely in a form that is suitable for wider dissemination to stakeholders outside the technical team. (E.g. Heritrix 3 provides reports in a bare-bones text format (https://webarchive.jira.com/wiki/display/Heritrix/Reports), while Archive-It offers lists of comma-separated values for further processing in spreadsheet applications (https://webarchive.jira.com/wiki/display/ARIH/Listing+of+Reports).) Additionally, it is often desirable to generate derivative metrics from the original raw numbers (such as histograms of specific properties), and to produce graphical visualizations in order to make the data useful enough to answer fundamental questions (“are we making best use of our resources?”, “are we collecting what we want to collect?” [3]), identify gaps and issues early, and inform estimates of the current rate of completion.

In this TPDL 2013 demonstration, we introduce Monitrix, an Open Source Web crawl reporting tool currently being developed by the Austrian Institute of Technology under the guidance of the British Library’s Web Archiving team. Monitrix works in conjunction with the Heritrix 3 Web crawler, and generates crawl analytics which it makes available in graphical form in a browser-based user interface. Monitrix is licensed under the terms of the Apache 2 License, and available at https://github.com/ukwa/monitrix.

2 System Overview

Monitrix is a self-contained Web application which tracks the logs of one or more Heritrix 3 Web crawler instances. Based on the log data, it provides (i) summary views which expose aggregate information about different aspects of the crawl as a whole, (ii) alerts that flag up suspicious behavior observed during the crawl, and (iii) search functionality which allows a detailed inspection of specific hosts or crawled URLs.
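The paper does not detail the ingest mechanics. As a rough illustration of the kind of input Monitrix consumes, the following minimal sketch splits a Heritrix 3 crawl.log line into a few of the fields used for aggregation, assuming the standard whitespace-delimited crawl.log layout (timestamp, status code, size, URL, discovery path, referrer, content type, worker thread, fetch timestamps, digest, source tag, annotations); the class and field names are illustrative only, not Monitrix’s actual data model.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: splitting one Heritrix 3 crawl.log line into a few fields. */
public class CrawlLogLine {

    public final String timestamp;          // e.g. 2013-03-01T12:00:00.123Z
    public final int statusCode;            // fetch status code (negative values signal errors)
    public final long size;                 // document size in bytes ("-" if unknown)
    public final String url;                // downloaded URI
    public final String contentType;        // reported MIME type
    public final List<String> annotations;  // e.g. virus-scan or crawl-cap annotations

    private CrawlLogLine(String timestamp, int statusCode, long size, String url,
                         String contentType, List<String> annotations) {
        this.timestamp = timestamp;
        this.statusCode = statusCode;
        this.size = size;
        this.url = url;
        this.contentType = contentType;
        this.annotations = annotations;
    }

    /** Parses one whitespace-delimited crawl.log line (standard Heritrix 3 field order assumed). */
    public static CrawlLogLine parse(String line) {
        String[] f = line.trim().split("\\s+");
        // Assumed field order: 0 timestamp, 1 status, 2 size, 3 URL, 4 discovery path,
        // 5 referrer, 6 content type, 7 worker thread, 8 fetch timestamp+duration,
        // 9 digest, 10 source tag, 11 annotations (comma-separated, "-" if none).
        long size = "-".equals(f[2]) ? 0 : Long.parseLong(f[2]);
        List<String> annotations = Arrays.asList();
        if (f.length > 11 && !"-".equals(f[11])) {
            annotations = Arrays.asList(f[11].split(","));
        }
        return new CrawlLogLine(f[0], Integer.parseInt(f[1]), size, f[3], f[6], annotations);
    }

    public static void main(String[] args) {
        String example = "2013-03-01T12:00:00.123Z 200 21876 http://example.co.uk/ "
                + "LLE http://example.co.uk/index.html text/html #042 "
                + "20130301120000012+345 sha1:QZ3UF6PT... - -";
        CrawlLogLine entry = parse(example);
        System.out.println(entry.url + " -> " + entry.statusCode + ", " + entry.size + " bytes");
    }
}
```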

2.1 Summary Views

Fig. 1. Monitrix screenshots

In addition to global status information (e.g. time since start or last activity, total no. of URLs crawled, momentary values for the no. of URLs crawled or MBytes downloaded per minute, etc.), Monitrix provides three distinct summary views:

The crawl timeline (Fig. 1, left) shows timeseries graphs for the total number of URLs crawled, the total download volume, the total number of new hosts visited, and the total number of hosts completed over time, either as standard or cumulative graphs.

The hosts summary (Fig. 1, right) shows the number of successfully crawled hosts and their distribution across top-level domains, lists the hosts that have reached the crawl cap (i.e. the maximum allowed download limit), and provides a number of histograms to help identify slow or otherwise problematic hosts: e.g. for average delay, number of HTTP retries, or percentage of requests blocked due to robots.txt rules. The user can “drill down” into the histograms to inspect the hosts represented by a particular bar by clicking on it.

The virus log summarizes the results of the virus check (which is usually conducted as part of the crawl).
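The paper does not describe how the histograms are computed. The following is a minimal sketch of one plausible approach, bucketing hosts by their average fetch delay into fixed-width bins and remembering which hosts fall into each bar so that a click can drill down to them; the bucket width, class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: bucketing hosts by average fetch delay for a drill-down histogram. */
public class DelayHistogram {

    private final long bucketWidthMillis;
    // Bucket start (ms) -> hosts whose average delay falls into that bucket.
    private final Map<Long, List<String>> buckets = new TreeMap<>();

    public DelayHistogram(long bucketWidthMillis) {
        this.bucketWidthMillis = bucketWidthMillis;
    }

    public void addHost(String hostname, long avgDelayMillis) {
        long bucketStart = (avgDelayMillis / bucketWidthMillis) * bucketWidthMillis;
        buckets.computeIfAbsent(bucketStart, k -> new ArrayList<>()).add(hostname);
    }

    /** Bar heights for rendering: bucket start -> number of hosts. */
    public Map<Long, Integer> barHeights() {
        Map<Long, Integer> heights = new TreeMap<>();
        buckets.forEach((start, hosts) -> heights.put(start, hosts.size()));
        return heights;
    }

    /** Drill-down: the hosts represented by the clicked bar. */
    public List<String> hostsInBucket(long bucketStart) {
        return buckets.getOrDefault(bucketStart, new ArrayList<>());
    }

    public static void main(String[] args) {
        DelayHistogram histogram = new DelayHistogram(500);  // 500 ms bucket width (assumed)
        histogram.addHost("example.co.uk", 320);
        histogram.addHost("slow-host.co.uk", 4200);
        histogram.addHost("another.co.uk", 450);
        System.out.println(histogram.barHeights());      // {0=2, 4000=1}
        System.out.println(histogram.hostsInBucket(0));  // [example.co.uk, another.co.uk]
    }
}
```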

2.2 Alerts

In order to gather requirements prior to the development of Monitrix, a workshop was conducted with experienced Web crawl coordinators from different institutions. As one part of this workshop, participants discussed heuristics which could be used to detect crawler traps and other undesirable crawler behaviour, e.g.:

– hosts with many sub-domains (500+) are likely either to be “spammy” or to be sites with virtual sub-domains (e.g. blogs).
– very long URLs – i.e. where the number of path segments exceeds a certain threshold, or where there are multiple identical path segments – may be indicative of a problem.
– if a host is no longer serving any non-text resources, there is most likely a problem.

These heuristics have been implemented in Monitrix. They will trigger alerts which are then shown in the user interface. It is also foreseen that alerts will trigger e-mail notifications to the crawl operator at a later stage.
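As an illustration of how such a heuristic might be evaluated during ingest, here is a minimal sketch of the sub-domain check. It assumes the registered domain (e.g. derived via a public-suffix list) is already known for each hostname; apart from the 500+ figure quoted above, all class, method and message details are illustrative, not Monitrix’s actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: the "too many sub-domains" heuristic (500+ sub-domains per registered domain). */
public class SubdomainAlertCheck {

    private static final int SUBDOMAIN_THRESHOLD = 500;  // threshold named in the workshop heuristic

    // Registered domain (e.g. "example.co.uk") -> distinct sub-domains seen so far.
    private final Map<String, Set<String>> subdomains = new HashMap<>();

    /** Records a crawled hostname; returns an alert message once the threshold is crossed. */
    public String recordHost(String hostname, String registeredDomain) {
        Set<String> seen = subdomains.computeIfAbsent(registeredDomain, k -> new HashSet<>());
        boolean isNew = seen.add(hostname);
        if (isNew && seen.size() == SUBDOMAIN_THRESHOLD) {
            return "ALERT: " + registeredDomain + " has reached " + SUBDOMAIN_THRESHOLD
                    + " sub-domains - possibly spammy or using virtual sub-domains";
        }
        return null;  // no new alert for this host
    }

    public static void main(String[] args) {
        SubdomainAlertCheck check = new SubdomainAlertCheck();
        for (int i = 0; i < 600; i++) {
            String alert = check.recordHost("blog" + i + ".example.co.uk", "example.co.uk");
            if (alert != null) {
                System.out.println(alert);  // printed once, when the 500th sub-domain appears
            }
        }
    }
}
```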

2.3 Search

Monitrix’s search feature not only supports search by URL or hostname, but also offers advanced search options to search by specific properties. It is, for example, possible to retrieve hosts by their average fetch delay, or URLs that were logged with a specific crawl annotation, e.g. one indicating a virus infection. For host searches, Monitrix generates pages summarizing all information collected about a specific host. These pages include, among other parameters: the time spent crawling the host; a graph showing the number of URLs crawled over time at this host (which allows a quick visual appraisal of whether the host has likely been completed); and distribution pie charts of the HTTP response codes, the content types and the virus scan results observed at this host.
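The paper does not specify the underlying query mechanics. Since host statistics are kept in MongoDB (see Section 3), a property search could, for example, translate into a range query over a pre-aggregated host collection. The sketch below uses the classic MongoDB Java driver; the database, collection and field names are assumptions, not Monitrix’s actual schema.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

/** Sketch: a property search over pre-aggregated host records stored in MongoDB. */
public class HostSearchExample {

    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("monitrix");                         // database name is an assumption
        DBCollection hosts = db.getCollection("crawled_hosts");  // collection name is an assumption

        // Find hosts whose average fetch delay (field name assumed) is at least five seconds.
        DBObject query = new BasicDBObject("avg_fetch_delay_ms", new BasicDBObject("$gte", 5000));
        DBCursor cursor = hosts.find(query);
        try {
            while (cursor.hasNext()) {
                DBObject host = cursor.next();
                System.out.println(host.get("hostname") + ": "
                        + host.get("avg_fetch_delay_ms") + " ms");
            }
        } finally {
            cursor.close();
        }
        mongo.close();
    }
}
```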

3 Technical Architecture

The key technical challenge that Monitrix faces is the massive amount of data it needs to aggregate. The initial design has been dimensioned for a target size of 1 Terabyte of crawler log data – which corresponds to more than 3 billion log lines. In order to support reporting on such a scale with reasonable response times, Monitrix performs a number of pre-processing steps during ingest, and stores interim results (along with the raw log information) in a MongoDB NoSQL database. For example, all timeseries values (e.g. the number of URLs or hosts crawled over time, etc.) are pre-aggregated into a fixed base-resolution raster, which can later be resampled into live timeseries visualizations on the screen with relatively low processing overhead. Monitrix also maintains a number of aggregated records and indexes on crawled hosts and alert conditions in the database. The frontend is implemented as a standalone Java Web application, which is to be deployed alongside Heritrix. It also offers a JSON API, which enables loosely-coupled integration with external systems. Along with extensive testing and trial operation, additional work on integration options is expected to be the next step in our future work.
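As a rough sketch of the pre-aggregation idea (not the actual Monitrix implementation), the following bins crawled-URL counts into a fixed base-resolution raster at ingest time and resamples them into coarser buckets for display; the one-minute base resolution and all names are assumptions. In Monitrix itself, the resulting buckets would be persisted to MongoDB rather than held in memory.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: pre-aggregating crawled-URL counts into fixed-resolution buckets and resampling them. */
public class TimeseriesAggregator {

    private static final long BASE_RESOLUTION_MS = 60_000;  // one-minute base raster (assumed value)

    // Bucket start (epoch ms, aligned to the base resolution) -> URLs crawled in that interval.
    // In Monitrix these interim results would be written to MongoDB, not kept in memory.
    private final Map<Long, Long> urlCounts = new TreeMap<>();

    /** Called once per ingested log line. */
    public void recordUrlCrawled(long timestampMillis) {
        long bucket = (timestampMillis / BASE_RESOLUTION_MS) * BASE_RESOLUTION_MS;
        urlCounts.merge(bucket, 1L, Long::sum);
    }

    /** Resamples the base raster into coarser buckets for on-screen timeseries graphs. */
    public Map<Long, Long> resample(long targetResolutionMillis) {
        Map<Long, Long> resampled = new TreeMap<>();
        for (Map.Entry<Long, Long> e : urlCounts.entrySet()) {
            long coarse = (e.getKey() / targetResolutionMillis) * targetResolutionMillis;
            resampled.merge(coarse, e.getValue(), Long::sum);
        }
        return resampled;
    }

    public static void main(String[] args) {
        TimeseriesAggregator agg = new TimeseriesAggregator();
        long start = 1_362_139_200_000L;               // 2013-03-01T12:00:00Z
        for (int i = 0; i < 10_000; i++) {
            agg.recordUrlCrawled(start + i * 100);     // one URL every 100 ms for roughly 17 minutes
        }
        System.out.println(agg.resample(5 * 60_000));  // five-minute buckets for the on-screen graph
    }
}
```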

References

1. Gomes, D., Miranda, J., Costa, M.: A survey on web archiving initiatives. In: Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds.) TPDL 2011. LNCS, vol. 6966, pp. 408–420. Springer, Heidelberg (2011), http://dl.acm.org/citation.cfm?id=2042536.2042590
2. Hockx-Yu, H.: How good is good enough? Quality assurance of harvested web resources (October 2012), http://britishlibrary.typepad.co.uk/webarchive/2012/10/how-good-is-good-enough-quality-assurance-of-harvested-web-resources.html
3. ISO Technical Committee 46 Working Group on Statistics and Quality Indicators for Web Archiving: Statistics and Quality Indicators for Web Archiving – Draft Technical Report (November 2012), http://netpreserve.org/sites/default/files/resources/SO_TR_14873_E_2012-10-02_DRAFT.pdf
4. Olston, C., Najork, M.: Web crawling. Found. Trends Inf. Retr. 4(3), 175–246 (2010), http://dx.doi.org/10.1561/1500000017
