Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web
NLA
Gordon Mohr
March 28, 2012

Overview • The tools: – Heritrix crawler – Wayback browse access – Lucene/Hadoop utilities: ‘JBs’ (indexing) and ‘TNH’ (searching)

• Example uses – CDL Web Archive Service – NetArchive Suite (DK, FR, AT) – archive.org worldwide archive & Archive-It

Internet Archive

• Established in 1996 • 501(c)(3) nonprofit organization • Over seven petabytes (compressed) of publicly accessible archival material • Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions • Archiving books, texts, film, video, audio, images, software, educational content and

• > 175 billion captures (URL+datetime) • > 2+ petabytes compressed • > 15 years (1996-)

• Collects anything accessible to public • Obeys ‘robots.txt’ restrictions • Respects rightsholder/site-owner takedown requests
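A minimal sketch of the ‘robots.txt’ check, using Python's standard-library parser rather than Heritrix's own Java implementation. The robots.txt text, the crawler name, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-crawler" is an assumed user-agent name, not a real crawler.
print(allowed(ROBOTS_TXT, "example-crawler", "http://example.org/index.html"))
print(allowed(ROBOTS_TXT, "example-crawler", "http://example.org/private/a.html"))
```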

Web Archiving Partners

Heritrix – crawling

What is Heritrix? Open-source Archival-quality Flexible Extensible Web-scale Web crawling software http://crawler.archive.org

Heritrix – major components • Scope / DecideRules – URIs in or out

• Frontier – URI queues, queues of queues, seen-set

• Processors – Prep, Fetch, Extract, Write, Schedule, etc.
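A toy sketch of how the three components fit together: a scope rule decides whether a URI is in or out, a frontier holds per-host queues plus a seen-set, and a loop stands in for the processor chain. It is not Heritrix code. All class and function names and the in-memory "web" are assumptions for illustration.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> outlinks (stands in for Fetch + Extract).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://a.example/about"],
}


def in_scope(url, allowed_hosts):
    """Scope rule: accept only URLs whose host is in allowed_hosts."""
    return urlsplit(url).hostname in allowed_hosts


class Frontier:
    """Per-host URL queues, a queue of ready hosts, and a seen-set."""

    def __init__(self):
        self.queues = {}      # host -> deque of pending URLs
        self.ready = deque()  # hosts that currently have pending URLs
        self.seen = set()     # every URL ever scheduled

    def schedule(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlsplit(url).hostname
        queue = self.queues.setdefault(host, deque())
        if not queue:
            self.ready.append(host)  # host goes from idle to ready
        queue.append(url)

    def next_url(self):
        if not self.ready:
            return None
        host = self.ready.popleft()
        url = self.queues[host].popleft()
        if self.queues[host]:
            self.ready.append(host)  # round-robin between hosts
        return url


def crawl(seeds, allowed_hosts):
    frontier = Frontier()
    for seed in seeds:
        frontier.schedule(seed)
    written = []
    while (url := frontier.next_url()) is not None:
        outlinks = FAKE_WEB.get(url, [])  # Fetch + Extract
        written.append(url)               # Write
        for link in outlinks:             # Schedule
            if in_scope(link, allowed_hosts):
                frontier.schedule(link)
    return written


print(crawl(["http://a.example/"], {"a.example"}))
```

The seen-set is what stops the crawl from looping between pages that link to each other.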

Heritrix writes ARCs or WARCs • Both: sequence of content blocks, each introduced by a small text header • ARCs: – 1-line header – verbatim protocol response

• WARCs add: – multi-line header with extensible fields – New record types: • Request, Response, Resource • Metadata, Revisit, Conversion, Warcinfo, Continuation

– ISO standardization
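A rough sketch of the WARC record layout described above: a version line, a multi-line header of named fields, a blank line, the content block, and a trailing blank-line pair. It builds and re-parses one uncompressed ‘response’ record by hand. The helper names and the sample payload are assumptions, and real files are normally written by the crawler (usually gzip-compressed per record).

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, block, content_type):
    """Serialize one WARC/1.0 record: header fields, blank line, content block."""
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


def parse_warc_record(data):
    """Split a single record into (version, header fields, content block)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])  # block size comes from the header
    return lines[0], fields, rest[:length]


# Hypothetical captured HTTP response, stored verbatim as the content block.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
record = build_warc_record(
    "response", "http://example.org/", http_response,
    "application/http; msgtype=response",
)
version, fields, block = parse_warc_record(record)
print(version, fields["WARC-Type"], fields["Content-Length"])
```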

Heritrix vs. other web copiers • Powerful (but complicated) config - pluggable extractors, fetchers • Not optimized for site/hostname-centric use - bulk content mixed together • Content never ‘unrolled’ or rewritten - requires access tools (Wayback) • Good options for giant crawls - millions of sites, 100s of TB

Wayback – browsing

What is Wayback? Open Source Java Modular Scalable Customizable Web Archive Access Tool http://archive-access.sourceforge.net/projects/wayback

Wayback Features • Starting with an URL: – See list of captures by date – See extension URLs (same site) – View a capture

• Once browsing (“replay”): – Browse web ‘as it was’ – Best-match clickthroughs
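One way to picture ‘best-match’ replay: given the capture dates for a URL, pick the capture closest to the date being browsed. The sketch below assumes 14-digit `YYYYMMDDhhmmss` timestamps and a `/web/<timestamp>/<url>` replay path. The capture list is made up, the path prefix depends on the deployment, and this is not Wayback's actual code.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def closest_capture(timestamps, requested):
    """Return the capture timestamp nearest to `requested`.

    `timestamps` must be sorted; 14-digit strings sort chronologically.
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, requested)
    candidates = timestamps[max(i - 1, 0):i + 1]  # neighbours on either side
    target = datetime.strptime(requested, FMT)
    return min(candidates, key=lambda ts: abs(datetime.strptime(ts, FMT) - target))


def archival_url(timestamp, url):
    """Build a replay path in the assumed /web/<timestamp>/<url> form."""
    return f"/web/{timestamp}/{url}"


# Hypothetical captures of one URL.
captures = ["20050101000000", "20080615120000", "20111231235959"]
best = closest_capture(captures, "20090101000000")
print(archival_url(best, "http://example.org/"))
```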

Wayback: Modular Components • Query User Interface – Calendar, Search Engine, XML

• Replay User Interface – Archival URL, Timeline, Proxy

• Resource Index – CDX, BDB, Remote, Aggregated

• Resource Store – Local ARC, HTTP 1.1 Remote ARC

Wayback vs. other access • Many deployment configurations • All ‘replay’ handled at browse-time - issues fixed in code or tolerated • Many UI customizations

Wayback: Memento • http://www.mementoweb.org/ • Collaboration – Los Alamos National Lab – Old Dominion University – Library of Congress

• APIs for ‘time dimension’ – not just external archives

• API for Wayback
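In Memento, the ‘time dimension’ is requested with an `Accept-Datetime` HTTP header carrying an HTTP-style date. A minimal sketch of building such a request without sending it follows. The TimeGate URL is a made-up placeholder, not a real Wayback endpoint.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_url, when):
    """Build (but do not send) a request asking for the capture nearest `when`.

    `when` must be a timezone-aware UTC datetime.
    """
    http_date = format_datetime(when, usegmt=True)
    return Request(timegate_url, headers={"Accept-Datetime": http_date})


# Hypothetical TimeGate address; a real deployment defines its own.
req = timegate_request(
    "http://archive.example/timegate/http://example.org/",
    datetime(2012, 3, 28, tzinfo=timezone.utc),
)
# urllib stores header names capitalized as "Accept-datetime".
print(req.get_header("Accept-datetime"))
```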

Formats • ARC/WARC • CDX – simple, flat file indexes

• WAT – web-capture specific metadata – Data exchange and analysis – Less than full WARC, more than CDX – JSON – Minimizes data exchange worries: copyright, privacy
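A CDX file is a space-delimited text index whose first line is a legend of single-letter field codes. The sketch below reads the legend instead of hard-coding a field order, since the order varies between deployments. The sample line and the friendly field names are illustrative assumptions.

```python
# Friendly names for common CDX legend letters (the names are a local choice).
FIELD_NAMES = {
    "N": "urlkey", "b": "timestamp", "a": "original", "m": "mimetype",
    "s": "status", "k": "digest", "r": "redirect", "M": "metatags",
    "S": "length", "V": "offset", "g": "filename",
}


def parse_cdx(lines):
    """Parse CDX text lines into dicts, using the legend on the first line."""
    it = iter(lines)
    legend = next(it).split()
    if not legend or legend[0] != "CDX":
        raise ValueError("first line is not a CDX legend")
    names = [FIELD_NAMES.get(code, code) for code in legend[1:]]
    return [dict(zip(names, line.split())) for line in it if line.strip()]


# Hypothetical two-line CDX file: legend plus one capture.
sample = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20120328000000 http://example.org/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example.warc.gz",
]
records = parse_cdx(sample)
print(records[0]["timestamp"], records[0]["filename"], records[0]["offset"])
```

The filename and offset are what an access tool needs to seek straight to one record inside a large ARC or WARC file.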

Lucene/Hadoop-based utils: JBs (indexing) TNH (searching)

Lucene & Hadoop Open Source Java Full-Text Indexing Bulk Processing (Map-Reduce) Bulk Storage (HDFS) Large ecosystem

Hadoop • HDFS – Distributed storage – Durable, default 3x replication – Scalable: Yahoo! 60+PB HDFS

• MapReduce – Distributed computation, Java jobs – Hadoop distributes work across cluster – Tolerates & retries failures

• & more – ‘Pig’, ‘HBase’, ‘Mahout’, ‘Hue’
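The MapReduce model can be sketched on a single machine: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step folds each group. The toy job below counts captures per host. It only illustrates the programming model and is not Hadoop code, and the capture list is made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_capture(capture):
    """Map: emit (host, 1) for each (url, timestamp) capture."""
    url, _timestamp = capture
    yield urlsplit(url).hostname, 1


def reduce_counts(host, values):
    """Reduce: sum the counts emitted for one host."""
    return host, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the framework: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical captures (URL + datetime).
captures = [
    ("http://a.example/", "20120101000000"),
    ("http://a.example/about", "20120102000000"),
    ("http://b.example/", "20120103000000"),
]
print(run_job(captures, map_capture, reduce_counts))
```

On a cluster, the framework runs the same mapper and reducer on many machines and retries the pieces that fail.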

JBs/TNH Background • Lucene – Open-source Java full-text indexing – Popular, mature

• Nutch – Extensions to Lucene – For web content, access, scale

• Hadoop – Spun off from Nutch – Inspired by Google’s Map-Reduce

JBs/TNH • Replaces an earlier ‘NutchWax’ • JBs: utilities for bulk Lucene indexing - ARCs/WARCs - dates, duplicates • TNH: OpenSearch service - efficient collapsing - query reformulation
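‘Collapsing’ in search results usually means keeping only the top few hits per site so that one large site does not fill the page. The sketch below shows that idea on a plain list of scored hits. It assumes collapsing by hostname and is not a description of how TNH implements it. The function name, hits, and scores are made up.

```python
from urllib.parse import urlsplit


def collapse_by_host(hits, max_per_host=1):
    """Keep at most max_per_host hits per hostname, best score first."""
    kept = []
    counts = {}
    for url, score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        host = urlsplit(url).hostname
        if counts.get(host, 0) < max_per_host:
            kept.append((url, score))
            counts[host] = counts.get(host, 0) + 1
    return kept


# Hypothetical search hits as (url, relevance score).
hits = [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://b.example/1", 0.7),
    ("http://a.example/3", 0.6),
]
print(collapse_by_host(hits))
```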

The Ecosystem • Each tool stewarded at IA – Sponsorship by partners – Driven by projects-of-the-moment

• Use by many institutions – CDL Web Archive Service – NetArchive Suite – archive.org Wayback Machine, Archive-It

Thank You Gordon Mohr Internet Archive Web Group
