Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web
NLA
Gordon Mohr
March 28, 2012

Overview
• The tools:
  – Heritrix crawler
  – Wayback browse access
  – Lucene/Hadoop utilities: ‘JBs’ (indexing) and ‘TNH’ (searching)
• Example uses
  – CDL Web Archive Service
  – NetArchive Suite (DK, FR, AT)
  – archive.org worldwide archive & Archive-It

Internet Archive

• Established in 1996
• 501(c)(3) non-profit organization
• Over seven petabytes (compressed) of publicly accessible archival material
• Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
• Archiving books, texts, film, video, audio, images, software, educational content and

• > 175 billion captures (URL + datetime)
• > 2 petabytes compressed
• > 15 years (1996-)

• Collects anything accessible to the public
• Obeys ‘robots.txt’ restrictions
• Respects rightsholder/site-owner takedown requests

Web Archiving Partners

Heritrix – crawling

What is Heritrix?
Open-source, archival-quality, flexible, extensible, web-scale web crawling software
http://crawler.archive.org

Heritrix – major components
• Scope / DecideRules
  – URIs in or out
• Frontier
  – URI queues, queues of queues, seen-set
• Processors
  – Prep, Fetch, Extract, Write, Schedule, etc.
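The way these three components fit together can be pictured as a loop: the frontier hands out the next URI, a chain of processors fetches it, extracts links, and writes the result, and the scope decides which extracted links go back into the frontier. The following is a toy Python sketch of that loop, not Heritrix code. The names `in_scope`, `Frontier`, `crawl`, and `FAKE_WEB` are hypothetical, and fetching is stubbed with an in-memory dictionary so nothing touches the network.

```python
import re
from collections import OrderedDict, deque
from urllib.parse import urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="http://example.org/a">a</a> '
                           '<a href="http://other.test/">x</a>',
    "http://example.org/a": '<a href="http://example.org/">home</a>',
}


def in_scope(uri, allowed_hosts):
    """Scope / decide rule: accept only http URIs on an allowed host."""
    parts = urlparse(uri)
    return parts.scheme == "http" and parts.hostname in allowed_hosts


class Frontier:
    """One queue per host, a rotating queue of those queues, and a seen-set."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of pending URIs
        self.seen = set()

    def schedule(self, uri):
        if uri in self.seen:
            return False
        self.seen.add(uri)
        self.queues.setdefault(urlparse(uri).hostname, deque()).append(uri)
        return True

    def next_uri(self):
        if not self.queues:
            return None
        # Take the host at the front, then move it to the back if it still
        # has work, so no single host monopolizes the crawl.
        host, queue = self.queues.popitem(last=False)
        uri = queue.popleft()
        if queue:
            self.queues[host] = queue
        return uri


def crawl(seeds, allowed_hosts):
    frontier = Frontier()
    written = []
    for seed in seeds:
        frontier.schedule(seed)
    while (uri := frontier.next_uri()) is not None:
        body = FAKE_WEB.get(uri)                      # fetch (stubbed)
        if body is None:
            continue
        links = re.findall(r'href="([^"]+)"', body)   # extract
        written.append(uri)                           # write (stand-in for an ARC/WARC writer)
        for link in links:                            # schedule
            if in_scope(link, allowed_hosts):
                frontier.schedule(link)
    return written
```

Calling `crawl(["http://example.org/"], {"example.org"})` visits both pages on `example.org` once each and never schedules the out-of-scope link to `other.test`.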

Heritrix writes ARCs or WARCs
• Both: sequence of content blocks, each introduced by a small text header
• ARCs:
  – 1-line header
  – verbatim protocol response
• WARCs add:
  – multi-line header with extensible fields
  – new record types:
    • Request, Response, Resource
    • Metadata, Revisit, Conversion, Warcinfo, Continuation
  – ISO standardization (ISO 28500)
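To make "multi-line header with extensible fields" concrete, here is a minimal Python sketch that serializes one WARC-style record and reads its header back. It is an illustration under stated assumptions, not the Heritrix writer: the function names are hypothetical, only a handful of header fields are emitted, and real WARC files are normally gzip-compressed record by record, which this sketch omits.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, payload, content_type):
    """Serialize one record: version line, named header fields, blank line, block."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The content block is stored verbatim, followed by two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def parse_warc_record(record):
    """Split a single serialized record into (version, header fields, block)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return lines[0], fields, rest[:length]


# A verbatim protocol response, as it would be stored in a 'response' record.
http_response = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
                 b"<html>hi</html>")
record = build_warc_record("response", "http://example.org/",
                           http_response, "application/http; msgtype=response")
version, fields, block = parse_warc_record(record)
```

Because every field is a named `Key: value` line, new fields can be added without breaking readers that ignore unknown keys, which a fixed one-line ARC header cannot offer.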

Heritrix vs. other web copiers
• Powerful (but complicated) configuration
  – pluggable extractors, fetchers
• Not optimized for site/hostname-centric use
  – bulk content mixed together
• Content never ‘unrolled’ or rewritten
  – requires access tools (Wayback)
• Good options for giant crawls
  – millions of sites, 100s of TB

Wayback – browsing

What is Wayback?
Open-source, Java, modular, scalable, customizable web archive access tool
http://archive-access.sourceforge.net/projects/wayback

Wayback Features
• Starting with a URL:
  – See list of captures by date
  – See extension URLs (same site)
  – View a capture
• Once browsing (“replay”):
  – Browse web ‘as it was’
  – Best-match clickthroughs
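"Best-match clickthroughs" can be read as: when a followed link has no capture at exactly the requested moment, serve the capture nearest in time. The Python sketch below shows that selection under two assumptions that are not stated on the slide: captures are identified by sortable 14-digit `YYYYMMDDhhmmss` timestamps, and "best" means smallest absolute time difference. The function names are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(capture_timestamps, wanted):
    """Return the capture timestamp nearest in time to `wanted`.

    `capture_timestamps` must be a sorted list of 14-digit strings.
    """
    if not capture_timestamps:
        return None
    i = bisect_left(capture_timestamps, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = capture_timestamps[max(i - 1, 0): i + 1]
    target = parse_ts(wanted)
    return min(candidates, key=lambda ts: abs(parse_ts(ts) - target))


captures = ["20100101000000", "20120301120000", "20120401120000"]
best = closest_capture(captures, "20120328000000")
```

Here `best` is `"20120401120000"`, since that capture is about 4.5 days from the requested time while the March 1 capture is about 26.5 days away.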

Wayback: Modular Components
• Query User Interface
  – Calendar, Search Engine, XML
• Replay User Interface
  – Archival URL, Timeline, Proxy
• Resource Index
  – CDX, BDB, Remote, Aggregated
• Resource Store
  – Local ARC, HTTP 1.1 Remote ARC
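As an illustration of the "Archival URL" replay mode, the Python sketch below builds and parses URLs that embed a capture timestamp and the original URL in the path, and rewrites a page link so that it points back into the archive. The `<prefix>/<14-digit timestamp>/<original URL>` layout and the `archive.example` host are assumptions made for this example, and the helper names are hypothetical rather than Wayback's own API.

```python
import re
from urllib.parse import urljoin

# Assumed layout: <prefix>/<14-digit timestamp>/<original URL>
ARCHIVAL_URL = re.compile(r"^(?P<prefix>.*?/)(?P<timestamp>\d{14})/(?P<original>.+)$")


def to_archival_url(prefix, timestamp, original_url):
    """Embed the capture time and the original URL in an archive URL."""
    return f"{prefix.rstrip('/')}/{timestamp}/{original_url}"


def from_archival_url(url):
    """Recover (timestamp, original URL) from an archival URL."""
    match = ARCHIVAL_URL.match(url)
    if match is None:
        raise ValueError(f"not an archival URL: {url}")
    return match["timestamp"], match["original"]


def rewrite_link(prefix, timestamp, page_url, href):
    """Resolve `href` against the original page URL, then point it into the archive."""
    return to_archival_url(prefix, timestamp, urljoin(page_url, href))


PREFIX = "http://archive.example/wayback/"
page = to_archival_url(PREFIX, "20120328000000", "http://example.org/dir/page.html")
image = rewrite_link(PREFIX, "20120328000000",
                     "http://example.org/dir/page.html", "../img/logo.png")
```

Rewriting at replay time keeps the stored content untouched, at the cost of having to handle every way a page can reference another resource.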

Wayback vs. other access
• Many deployment configurations
• All ‘replay’ handled at browse-time
  – issues fixed in code or tolerated
• Many UI customizations


Wayback: Memento
• http://www.mementoweb.org/
• Collaboration
  – Los Alamos National Lab
  – Old Dominion University
  – Library of Congress
• APIs for ‘time dimension’
  – not just external archives
• API for Wayback
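In the Memento protocol, the "time dimension" is expressed by asking a server for a resource as it existed at a given moment, using an `Accept-Datetime` request header that carries an HTTP-style date. The Python sketch below only builds such a request and does not send it. The header name comes from the Memento specification rather than from this slide, and the `archive.example` endpoint and the prefix-plus-original-URL pattern are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_prefix, original_url, when):
    """Return (url, headers) asking for `original_url` as of `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return timegate_prefix + original_url, headers


url, headers = timegate_request(
    "http://archive.example/timegate/",      # hypothetical endpoint
    "http://example.org/",
    datetime(2012, 3, 28, tzinfo=timezone.utc),
)
```

For this input the header value is `Wed, 28 Mar 2012 00:00:00 GMT`. A server supporting the protocol would typically answer by pointing the client at the capture closest to that time.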

Formats
• ARC/WARC
• CDX
  – simple, flat file indexes
• WAT
  – web-capture specific metadata
  – Data exchange and analysis
  – Less than full WARC, more than CDX
  – JSON
  – Minimizes data exchange worries: copyright, privacy
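Because a CDX index is a sorted flat file with one capture per line, all captures of a URL can be found with a binary search and no database. The Python sketch below assumes a nine-field, space-separated layout and reversed-host URL keys. Real CDX files declare their own field order in a header line, so the `FIELDS` list here is an assumption, and the sample lines (digests, offsets, filenames) are made-up data.

```python
from bisect import bisect_left

# Assumed field layout for this sketch; real files declare theirs in a header line.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "offset", "filename"]


def parse_cdx_line(line):
    return dict(zip(FIELDS, line.split(" ")))


def lookup(sorted_lines, urlkey):
    """Binary-search sorted CDX lines for every capture of one URL key."""
    lo = bisect_left(sorted_lines, urlkey + " ")
    hi = bisect_left(sorted_lines, urlkey + "!")  # '!' sorts immediately after ' '
    return [parse_cdx_line(line) for line in sorted_lines[lo:hi]]


# Made-up sample index lines.
CDX = sorted([
    "org,example)/ 20100101000000 http://example.org/ text/html 200 DIGESTA - 0 a.warc.gz",
    "org,example)/ 20120328000000 http://example.org/ text/html 200 DIGESTB - 512 b.warc.gz",
    "org,example)/about 20120328000000 http://example.org/about text/html 200 DIGESTC - 900 b.warc.gz",
])
rows = lookup(CDX, "org,example)/")
```

Each returned row names the file and byte offset of the stored record, which is enough for an access tool to seek straight to the capture.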

Lucene/Hadoop-based utils: JBs (indexing), TNH (searching)

Lucene & Hadoop
• Open source
• Java
• Full-text indexing
• Bulk processing (Map-Reduce)
• Bulk storage (HDFS)
• Large ecosystem

Hadoop
• HDFS
  – Distributed storage
  – Durable, default 3x replication
  – Scalable: Yahoo! 60+PB HDFS
• MapReduce
  – Distributed computation, Java jobs
  – Hadoop distributes work across cluster
  – Tolerates & retries failures
• & more
  – ‘Pig’, ‘HBase’, ‘Mahout’, ‘Hue’
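The MapReduce model itself is small enough to simulate in one process. The Python sketch below counts captures per host with a map step, a group-by-key shuffle, and a reduce step. It is a stand-in for the programming model only and does not use the Hadoop API. The names `map_capture`, `reduce_counts`, and `run_job` are hypothetical, and the distribution across a cluster and retrying of failed tasks that Hadoop provides are absent.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_capture(capture):
    """Map: emit (host, 1) for one (url, timestamp) capture."""
    url, _timestamp = capture
    yield urlparse(url).hostname, 1


def reduce_counts(host, values):
    """Reduce: sum all counts emitted for one host."""
    yield host, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for map, shuffle/sort, and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    results = {}
    for key in sorted(groups):             # sort: reducers see keys in order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


captures = [
    ("http://example.org/", "20120328000000"),
    ("http://example.org/a", "20120328000100"),
    ("http://other.test/", "20120328000200"),
]
counts = run_job(captures, map_capture, reduce_counts)
```

Since each map call sees one record and each reduce call sees one key, a framework is free to spread both phases across many machines.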


JBs/TNH Background
• Lucene
  – Open-source Java full-text indexing
  – Popular, mature
• Nutch
  – Extensions to Lucene
  – For web content, access, scale
• Hadoop
  – Spun off from Nutch
  – Inspired by Google’s Map-Reduce

JBs/TNH
• Replaces the earlier ‘NutchWAX’
• JBs: utilities for bulk Lucene indexing
  – ARCs/WARCs
  – dates, duplicates
• TNH: OpenSearch service
  – efficient collapsing
  – query reformulation
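One plausible reading of "dates, duplicates" is that a page captured many times with identical content should be indexed once and carry the list of dates on which it was seen. The Python sketch below illustrates that idea with a toy inverted index. It is an assumption about the approach rather than the JBs implementation, it does not use Lucene, and the function names and sample captures are hypothetical.

```python
import re
from collections import defaultdict


def build_index(captures):
    """Index captures given as (url, date, digest, text) tuples.

    Returns (docs, postings): one document per (url, digest) pair holding
    every capture date, and a term -> document-key inverted index.
    """
    docs = {}
    postings = defaultdict(set)
    for url, date, digest, text in captures:
        key = (url, digest)
        if key in docs:
            docs[key]["dates"].append(date)   # duplicate content: record the date only
            continue
        docs[key] = {"url": url, "dates": [date]}
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(key)
    return docs, postings


def search(docs, postings, query):
    """Return (url, dates) for documents containing every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    keys = set.intersection(*(postings.get(term, set()) for term in terms))
    return sorted((docs[key]["url"], docs[key]["dates"]) for key in keys)


captures = [
    ("http://example.org/", "20100101", "d1", "Web archiving tools"),
    ("http://example.org/", "20120328", "d1", "Web archiving tools"),
    ("http://example.org/", "20120401", "d2", "Web crawling news"),
]
docs, postings = build_index(captures)
hits = search(docs, postings, "archiving tools")
```

Three captures produce two indexed documents, and the query returns a single hit listing both dates on which the unchanged page was captured.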

The Ecosystem
• Each tool stewarded at IA
  – Sponsorship by partners
  – Driven by projects-of-the-moment
• Use by many institutions
  – CDL Web Archive Service
  – NetArchive Suite
  – archive.org Wayback Machine, Archive-It

Thank You
Gordon Mohr
Internet Archive Web Group