
Julien Masanès, Andreas Rauber, Marc Spaniol (Eds.)

International Web Archiving Workshop IWAW 2010

10th International Workshop Vienna, Austria September 22-23, 2010 Proceedings

Preface

IWAW is an annual international workshop series that brings together both practitioners and researchers in the domain of Web archiving, Internet and new media preservation. The International Web Archiving Workshop (IWAW) series, organized since 2001, provides a cross-domain overview of active research and practice in all domains concerned with the acquisition, maintenance and preservation of digital objects for long-term access, with a particular focus on Web archiving and studies on the effective usage of this type of archive. It is also intended to provide a forum for interaction among librarians, archivists, and academic and industrial researchers interested in establishing effective methods and developing improved solutions in this field.

http://blogs.discovermagazine.com/cosmicvariance/?p=5353

Compared to the live page, in the archived version we replace the original player elements with a specific element that contains an archive-specific player, "http://collections.europarchive.org/media/player.swf", and the URL of the archived video:

"http://collection.europarchive.org/swr/20100601084708/rtmp://.../foobar-video.flv"
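As an illustration of this substitution (a minimal sketch only: the player URL is the one quoted above, but the element structure, attribute names and example video URL are assumptions, since the real rewrite is performed by the archive's access code):

```java
public class PlayerRewriter {
    private static final String ARCHIVE_PLAYER =
            "http://collections.europarchive.org/media/player.swf";

    /**
     * Builds the markup that embeds the archive player and points it at the
     * archived copy of a video. The "file" parameter name and the element
     * structure are hypothetical; real sites differ.
     */
    public static String archivedPlayerMarkup(String archivedVideoUrl) {
        return "<embed src=\"" + ARCHIVE_PLAYER + "\""
             + " flashvars=\"file=" + archivedVideoUrl + "\""
             + " type=\"application/x-shockwave-flash\"></embed>";
    }

    public static void main(String[] args) {
        // Hypothetical archived-video URL, for illustration only.
        System.out.println(archivedPlayerMarkup(
            "http://collection.europarchive.org/swr/20100601084708/rtmp://example.org/foobar-video.flv"));
    }
}
```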

Figure 2. Archived Web site – HTTP video.

Notice that the URL of the archived video file was transformed into an HTTP URL pointing to an FLV file in the archive. The replacement of the HTML elements is done on the fly, by specific methods implemented in the access code on the server; the WARC files always keep the original version of the pages. Finding a common pattern to correctly identify and replace the player containers is still challenging, since the structure and the attributes of the elements can differ from one Web site to another. In this section we focused on the example of capturing RTMP streaming videos, because the downloading process involves a specific streaming protocol. Nevertheless, the player replacement techniques implemented in the access code are used in the same manner for HTTP-downloaded videos.

3. VIDEO CAPTURE USING EXTERNAL DOWNLOADERS

As part of the new technologies for Web archiving developed in the LiWA project1, a specific module was designed to enhance the capturing capabilities of the crawler with regard to different multimedia content types (for an early attempt on this topic at EA see [Baly 2006]). The current version of Heritrix is mainly based on the HTTP/HTTPS protocol and cannot handle other content transfer protocols widely used for multimedia content (such as streaming). The LiWA Rich Media Capture module2 delegates the multimedia content retrieval to an external application (such as MPlayer3 or FLVStreamer4) that is able to handle a larger spectrum of transfer protocols. The module is constructed as an external plugin for Heritrix. Using this approach, the identification and retrieval of streams is completely de-coupled, allowing the use of more efficient tools to analyze video and audio content.

1 http://www.liwa-project.eu/
2 http://code.google.com/p/liwatechnologies/source/browse/rich-media-capture
3 http://www.mplayerhq.hu
4 http://savannah.nongnu.org/projects/flvstreamer/

At the same time, using external tools helps in reducing the burden on the crawling process.

3.1. Architecture

The module is composed of several sub-components that communicate through messages, using an open standard communication protocol called the Advanced Message Queuing Protocol (AMQP)5. The integration of the Rich Media Capture module with the crawler is shown in Figure 3 and the workflow of the messages can be summarized as follows. The plugin connected to Heritrix detects the URLs referencing streaming resources and constructs an AMQP message for each of them. This message is passed to a central Messaging Server, whose role is to de-couple Heritrix from the clustered streaming downloaders (i.e. the external downloading tools). The Messaging Server stores the URLs in queues and, when one of the streaming downloaders is available, sends it the next URL for processing; a minimal sketch of this hand-off is given after the list below. In the software architecture of the module we identify three distinct sub-modules:





● a first module, responsible for accessing the Messaging Server, starting new jobs, stopping them and sending alerts;
● a second module, used for stream identification and download (here an external tool is used, such as MPlayer);
● a third module, which repacks the downloaded stream into a format recognized by the access tools (WARC writer).
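As a rough illustration of the hand-off between the Heritrix plugin and the Messaging Server, the sketch below publishes a detected stream URL to an AMQP queue using the RabbitMQ Java client as one possible AMQP implementation; the queue name, host and message format are assumptions, not the actual LiWA configuration.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class StreamUrlPublisher {
    // Queue name and broker host are assumptions, not the LiWA configuration.
    private static final String QUEUE = "liwa.stream.urls";

    public static void publish(String streamUrl) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("messaging-server.example.org"); // the central Messaging Server
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        try {
            // Durable queue so pending stream URLs survive a broker restart.
            channel.queueDeclare(QUEUE, true, false, false, null);
            channel.basicPublish("", QUEUE, null, streamUrl.getBytes(StandardCharsets.UTF_8));
        } finally {
            channel.close();
            connection.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // A URL the Heritrix plugin might have detected (hypothetical).
        publish("rtmp://streaming.example.org/vod/foobar-video.flv");
    }
}
```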

After a successful capture, the last step consists in wrapping the captured stream into a WARC file, which is afterwards moved to the final storage.

3.2. Optimizations

The main issues emerging from the initial tests were related to the synchronization between the crawler and the external capture module. In the case of a large video collection hosted on a Web site, a sequential download of each video would take longer than the crawling of the text pages, so the crawler would have to wait for the external module to finish the video downloads. The video capture process can be sped up by multiplying the number of downloaders; on the other hand, this parallelization is limited by the maximum bandwidth available at the streaming server.

Another solution for managing video downloaders is to completely de-couple the video capture module from the crawler and launch it in the post-processing phase. That implies replacing the crawler plugin with a log reader and an independent manager for the video downloaders. The advantages of this approach (used at EA, for instance) are:
● a global view on the total number of video URIs;
● a better management of the resources (number of video downloaders sharing the bandwidth).
The main drawback of this method is related to the incoherencies that might appear between the crawl time of the Web site and the video capture in the post-processing phase:
● some video content might disappear (during a delay of one or two days);
● the video download is blocked waiting for the end of the crawl.
Therefore, when managing the video downloading there is a trade-off between shortening the time for the complete download, handling errors (for video content served by slow servers), and optimizing the total bandwidth used by multiple downloaders.

4. CONCLUSION AND PERSPECTIVES

Figure 3. Streaming capture module interacting with the crawler.

When available, a streaming downloader connects to the Messaging Server to request a new streaming URL to capture. Upon receiving the new URL, an initial analysis is done in order to detect some parameters, among others the type and the duration of the stream. If the stream is live, a fixed, configurable duration may be chosen instead. After a successful identification the actual download starts: the control module generates a job which is passed to MPlayer, along with safeguards to ensure that the download will not take longer than the initial estimate.

5 http://www.amqp.org/confluence/display/AMQP/Advanced+Message+Queuing+Protocol
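The duration safeguard can be as simple as a bounded wait on the external process. The sketch below is a minimal illustration, not the LiWA control module: it launches MPlayer with its -dumpstream/-dumpfile options and aborts the job if it exceeds the estimated duration plus an assumed safety margin.

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

public class StreamDownloadJob {

    /**
     * Dumps a stream to a local file with MPlayer and kills the process if it
     * runs longer than the estimated duration plus a safety margin. The
     * option names are MPlayer's; the margin value is an assumption.
     */
    public static boolean capture(String streamUrl, File target, long estimatedSeconds)
            throws Exception {
        long timeoutSeconds = estimatedSeconds + 120; // safeguard margin (assumed)
        Process p = new ProcessBuilder(
                "mplayer", "-dumpstream", "-dumpfile", target.getPath(), streamUrl)
                .inheritIO()
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // the download must not exceed the estimate
            return false;
        }
        return p.exitValue() == 0 && target.length() > 0;
    }
}
```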

As can be seen, it is difficult to design a general solution for dealing with all the Web sites hosting video content. Based on the general methods presented in this paper, the harvesting technique should be adapted to each particular case; the crawl engineering effort needed to adapt the tools generally depends on the complexity of the Web site. The remaining work in this area falls under three main areas:
● scaling video capture, likely by decoupling it from the crawl and better handling the numerous errors and interruptions that video servers in general, and streaming servers in particular, generate;
● improving the automatic detection of obfuscated links, and following them with specific rules (allowing off-domain links, detection in file formats other than HTML, etc.);
● developing a generic access and presentation method. This entails detecting player characteristics automatically so as to replace them in a generic manner, managing better access and serving options for large files, etc.

5. ACKNOWLEDGEMENT

This work is partly funded by the European Commission under LiWA (IST 216267).

6. REFERENCES

[Baly 2006] Baly, N. and Sauvin, F. (2006). Archiving Streaming Media on the Web, Proof of Concept and First Results. International Web Archiving Workshop (IWAW 06), Alicante, Spain.

Active Preservation of web sites

Robert Sharpe
Tessella
26 The Quadrant, Abingdon Science Park, Abingdon, UK OX14 3YS
+44-1235-555511

[email protected]

ABSTRACT

This paper describes an automated preservation framework called "Active Preservation" that has been designed to work well with linked information objects such as those found in web sites. It follows the now well-described three-step process of characterization, preservation planning and validated migration. Characterization is split into two parts: physical characterization of digital objects and conceptual characterization of information objects. Physical characterization includes the ability to identify and validate the format of a file (which means that all of that format's inherent format properties are known to apply to the file). In addition, it can perform "instance format property" extraction (i.e. the detection of technical properties that apply to that particular file). Finally, it performs embedded object extraction to detect any digital objects contained within the file. The tool to use for each of these steps can depend on the output from the previous steps.

1. INTRODUCTION Increasingly national libraries, national archives and other organizations are archiving snapshots of web sites with the aim of ensuring that these otherwise ephemeral resources are preserved for the long-term benefit of mankind. While this is a relatively new discipline, significant advances have been made. In particular the International Internet Preservation Consortium1 have developed a number of tools to assist with the process of harvesting (e.g., the development of the Heritrix crawler2 and the WARC standard, now ISO28500:2009) and access (e.g., the development of the NutchWAX search engine3). The IIPC also has a working group looking at preservation. However, since a web site can contain almost any type of file, this complex issue is tied to the general problem of the preservation of all forms of digital content. Digital preservation is also an emerging discipline. It is underpinned by the OAIS reference model (ISO 14721:2003) that has helped define:

Conceptual characterization allows the identification of a network of linked information objects (“components”) and the measurement of their “significant characteristics”. Preservation planning allows at risk files to be identified and these to be aggregated to create a list of linked information objects that are thus at risk. The list of at risk “components” within these can thus also be identified and a plan created that determines the optimum tool to use to migrate this component and lists the significant properties to validate post-migration.



The use of information packages (logical combinations of metadata and content files) for use in submission (called a submission information package or SIP), archival storage (an archival information package or AIP) and dissemination (a dissemination information package or DIP).



The nature of these information packages. Importantly, the OAIS standard makes a clear distinction between information objects (the conceptual entities with which an end user like a human wants to interact) and digital objects (the physical artifacts that manifest this information object in a particular set of technology).



The necessary functional entities needed by a long-term repository namely ingest, access, storage, data management, administration and, crucially, “preservation planning”.

Finally, migration can then occur following the plan, including not just running the required tool but characterizing the output physically and conceptually, allowing the comparison of before and after states. Automation is achieved through the use of pre-entered, human-editable but machine-readable policy information stored in a Technical Registry. The approach has been used to successfully migrate large web sites and has been shown in practice to detect unexpected errors in migration tools.

Categories and Subject Descriptors
H.3.6 [Library Automation]: Large text archives

General Terms
Algorithms, Management, Measurement, Experimentation

Keywords
Digital preservation, Web Archive, Migration

1 http://netpreserve.org
2 http://crawler.archive.org/
3 http://archive-access.sourceforge.net/projects/nutchwax/
4 http://www.planets-project.eu/

The exact nature of the "preservation planning" functions has been discussed at length, but a common approach, as described in the EU-funded Planets project4, is to split the problem into:



• Physical characterization of the SIPs received, in particular to determine if anything is obsolete or at immediate risk of obsolescence;

• Preservation planning of what to do with the obsolete material;

• Preservation action to perform whatever actions were deemed necessary.

Importantly, a feedback loop is included. Part of the planning process is to determine the "significant properties" that should be invariant (potentially within a tolerance) under future actions (e.g., migration) [1]. These properties can then be measured before and after such a migration and compared to determine whether the significant characteristics5 have been preserved.

Despite the development of such theoretical frameworks, most examples have been based around simple cases (e.g., single documents or single images), and there has been relatively little work on the preservation of whole web sites. However, in 2009, Strodl et al. [2] reported an approach based on the use of existing physical characterization tools and command-line migration tools to migrate web sites. It used the Planets project's preservation planning workflow [3] and, in particular, the Plato software to perform manual preservation planning. Migrated output was then injected back into a WARC file for long-term storage.
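A minimal sketch of such a before/after comparison, assuming property values are held as strings and numeric properties carry an allowed tolerance; the property names and the tolerance handling are illustrative rather than the mechanism of any particular tool:

```java
import java.util.Map;

public class SignificantPropertyCheck {
    /**
     * Compares the characteristics measured before and after a migration.
     * A numeric property passes if it is within the allowed tolerance;
     * any other property must match exactly. The tolerance map is an
     * illustrative stand-in for registry-held policy.
     */
    public static boolean preserved(Map<String, String> before,
                                    Map<String, String> after,
                                    Map<String, Double> tolerance) {
        for (Map.Entry<String, String> e : before.entrySet()) {
            String property = e.getKey();
            String expected = e.getValue();
            String actual = after.get(property);
            if (actual == null) {
                return false; // property disappeared during migration
            }
            Double tol = tolerance.get(property);
            if (tol != null) {
                double diff = Math.abs(Double.parseDouble(expected) - Double.parseDouble(actual));
                if (diff > tol) return false;
            } else if (!expected.equals(actual)) {
                return false;
            }
        }
        return true;
    }
}
```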

2. ACTIVE PRESERVATION

In this paper an approach to the automated preservation of web sites is introduced. This is called the "Active Preservation" approach [4]. It follows the three-fold approach described above of characterization, preservation planning and preservation action. However, there are some key additional features:

• Characterization is split into the characterization of digital objects (the physical files) and the characterization of the information objects these files represent. The former involves detection of a file's format, validation of this format (where possible), extraction of key technical properties and, if appropriate, the detection of embedded objects within a file and their iterative characterization. By contrast, conceptual characterization involves obtaining a structural view of the constituent technology-independent "components" of each information object. This involves detecting the presence of these "components" (e.g., each web page, each image etc.) and then measuring the "significant properties" of each of them.

• Once migration has been performed, comparison of a web site takes place component by component. It is important that this takes place at the information object level, not at the digital object level, since information objects should be preserved by migration while digital objects are not necessarily. Hence, the approach verifies that each component is (i) still present, (ii) still linked to other components in the same way and (iii) that its significant characteristics are, within any allowed tolerance, maintained.

• The process is fully automated. This is achieved by describing policy at all three stages (characterization, preservation planning and migration) in a machine-readable way and holding this policy in a Technical Registry. This is described more below. This automation leads to some need for simplifications and the consequences of these simplifications will be discussed.

5 Following [1], we will use the word "property" to mean something that can be measured and the word "characteristic" to mean a property/value combination.

3. METHODOLOGY

The approach has been demonstrated using Tessella's Safety Deposit Box (SDB) technology. This is a system that is designed to meet the generic digital preservation needs of institutions such as national libraries and national archives. As would be expected of a software system in a rapidly developing field, it is constantly developing, with an active user group that is shaping its future direction. The approach described in this paper used version 4.0 of this software. SDB is a commercial system but the approach could be applied similarly elsewhere. In addition, a number of tools have been used within the system (e.g., DROID for format identification). These tools have been tested and are believed to produce correct answers at least for formats that are commonly found in web sites. However, the utility of the approach is not specifically bound to any particular set of tools. The preservation functionality of the system is the main subject of the paper and is thus described in the following sections. This section briefly describes how the other OAIS functional entities are performed, in particular the way in which web sites are ingested into the system, stored within it and accessed from it.

3.1 Ingest of web sites

Before preserving web sites it is necessary to ingest them into the system. SDB contains an automated workflow system that allows multi-step ingests to occur. A typical ingest for a web site will consist of the following steps:

• Use Heritrix to crawl a set of web sites, saving the content files and creating a SIP. An SDB workflow step has been produced to do this, turning Heritrix's output into a SIP structure already understood by SDB. In this case, Heritrix has been used in mirror mode and thus content files are held in native web formats rather than using WARC. However, the results of the study could be reproduced using WARC with the appropriate modifications, by adopting a similar approach to Strodl et al. [3]. In this study a single logical information object is created for a single source URL and this is ingested in a single SIP, but the "Active Preservation" approach is not dependent on this structure.

• Virus-check this SIP before allowing it to proceed further. This usually involves two checks taking place either side of a configurable quarantine period (e.g., 28 days).



• A number of other quality assurance checks (e.g., ensuring all files described in the metadata are present and that all files present are described in the metadata, fixity checks etc.).

• The next step is characterization. This step is crucial for web site preservation and thus is described in more detail in the next section.

• Storage of the content files in one or more storage systems.

• Storage of the metadata in a specially designed database.

• Indexing of the content to allow fast future access.

Other steps are also included in production systems, such as synchronization with catalogue systems, conversion between metadata schemas etc., but these are not important for the subject being discussed in this paper.

An important feature for automation is that SDB is designed for scalability. It utilizes multi-threading and can be deployed across multiple servers. Although not explicitly covering web sites, benchmark tests have shown that a single entry-level server (SunFire X4140 2 x QuadCore 2.3GHz, 8GB RAM) can ingest 1.5TB (1500 1GB SIPs each consisting of a mixture of 100 10MB TIFF, PDF and JPEG files) each day. To further help with automation, the system allows automated steps to create warnings and the severity of these to be configured. Hence, certain events can be logged and the process continued while others will stop that particular workflow to await human intervention.

3.2 Storage of web sites

SDB utilizes a system of storage adaptors allowing storage to take place in a variety of storage systems, including local disk storage, hierarchical storage management (HSM) systems accessed via a virtual directory, content-addressable storage systems and cloud storage. AIPs can be optionally packaged (e.g., to include a metadata snapshot) and signed as required.




3.5 Administration

SDB allows the system to be administered and in particular allows reports to be run that can assess the impact of preservation actions. Where these are relevant they are described below.

4. TECHNICAL REGISTRY

Since the approach described here is an automated approach to preservation, it is necessary to have a method of describing policy in a machine-readable way. This is achieved through the use of a Technical Registry. SDB uses the Planets Core Registry (PCR), which is an enhancement of the UK National Archives' PRONOM system8, for this purpose.

4.1 Factual Information

This system allows factual information about a number of entities to be stored, including file formats, software, hardware, properties, tools, migration pathways etc.

4.2 Policy Information

This factual information is important reference material on which to base decisions. However, for the purposes of automation it is the Technical Registry's ability to hold machine-readable policy information that is important. This includes policy for physical characterization (which will be described in more detail in the next section), such as the following (a toy illustration of such a machine-readable lookup is given after this list):

• Which tools to use for format identification? (e.g., use DROID)

• Which tools to use for format validation for each format (if any)? (e.g., validate all PDF formats using Jhove)

• Which tools to use for technical property extraction for each format (if any)? (e.g., extract technical properties for GIF 1989a files using Jhove's GIF extraction module)

• Which technical properties should be measured for each format? (e.g., retain information on image height and width but don't hold bit depth information)

• Which tools to use for embedded object extraction for each format (if any)? (e.g., extract files from ZIP files using the … tool).
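The sketch below is that toy illustration: the format names, tool names and lookup methods are invented and do not reflect the Planets Core Registry's actual data model.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A toy stand-in for the Technical Registry's machine-readable policy:
 * it maps a format identifier to the tool chosen for each characterization
 * step. All entries and method names are illustrative only.
 */
public class PolicySketch {
    private final Map<String, String> validationTool = new HashMap<>();
    private final Map<String, String> propertyExtractionTool = new HashMap<>();

    public PolicySketch() {
        validationTool.put("PDF 1.4", "JHOVE PDF module");
        validationTool.put("GIF 1989a", "JHOVE GIF module");
        propertyExtractionTool.put("GIF 1989a", "JHOVE GIF module");
    }

    /** Returns the configured tool, or null when policy says "no validation". */
    public String validatorFor(String format) {
        return validationTool.get(format);
    }

    /** Returns the configured extractor, or null when none is configured. */
    public String propertyExtractorFor(String format) {
        return propertyExtractionTool.get(format);
    }
}
```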

3.3 Data management of web sites

SDB stores structural and descriptive metadata in a custom-built database schema. This uses Hibernate6 to allow database engine independence. In the tests described here, Oracle 11g was used as the database engine. Descriptive metadata can be held in any XML schema and is stored in CLOBs. The descriptive information is indexed using Solr7 to enable fast access; this is also used to create full-text indexes. Descriptive information can be edited using XML technology but, in production systems, this is often done using external catalogues with which SDB can be synchronized.

3.4 Access to web sites

SDB's access functionality has two features of particular use for the purposes of studying preservation:

• It has an access system, designed particularly for archivists and librarians, that allows these users to see all of the metadata held by the system, including all properties associated with each object and full audit trail records. This helps ensure that web site migration can be fully understood and, if necessary, manually examined.

• Web sites can also be downloaded to check their integrity, although it is not yet possible to render them in situ.

In addition, it can hold policy on which significant properties should be measured for each component type during conceptual characterization.

6 http://www.hibernate.org
7 http://lucene.apache.org/solr/
8 http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

During preservation planning the Technical Registry determines which formats (or format and physical property combinations) make files at risk. It can do this by simple choice or through a weighted risk-based system.


It can also determine the most appropriate migration pathway to use in different scenarios (e.g., one for preservation and a different one for presentation purposes).


Finally, it can also record which significant properties should be measured before and after migration and record any allowed tolerances in these measurements.

4.3 Properties

The Technical Registry makes an important distinction between types of properties:

Inherent format properties. These are properties of the format (e.g., its level of support) and so affect all files of that format.



Instance format properties. These are properties of files of a given format but can vary between different individual files. For example, the number of pages is an instance format property of all Microsoft Word files.



Other instance properties. These are file properties that are not format-dependent like file size or fixity values.



Component properties. These are technologyindependent properties of “components” like the word count of a document or the title of a web page.

Dappert and Farquhar [1] point out that different types of properties are significant for different aims. For example, when performing a media migration it is properties like file size and fixity values that are significant. However, as will be described in more detail below, the assertion of this paper is that when format migration occurs it is only the technology-independent component properties that are significant, since all other (technology-dependent) properties cannot be expected to survive a change in technology. For example, file size and fixity value properties are technology-dependent and will clearly not survive format migration, while a web page's title should be technology-independent and thus should persist. In some cases (e.g., image size) the property might be, in principle, technology-independent but might be most conveniently expressed in a technology-dependent way (e.g., in pixels). As will be described in the next section, one thing that helps to clarify this distinction is to distinguish the properties of the information object from the properties of the digital object that manifests it.

5. CHARACTERIZATION

Characterization of web sites can be split into two parts: physical characterization of the received digital objects (i.e. the content files) and conceptual characterization of the information objects (i.e. web pages or other "components" of the web site). Physical characterization determines the content file's format (and thus its inherent format properties) and will also measure the file's instance properties. In addition, it will attempt to identify the data object structure within each file (i.e. to detect the presence of an embedded object). Conceptual characterization identifies the information object structure (by detecting the existence of "components" and their links) and then measures the component properties.

5.1 Physical characterization

Physical characterization occurs on every file in a web site. It operates via a framework into which tools can be plugged. The choice of which tool to use for a particular part of the process is a policy decision and is thus controlled by the Technical Registry.

The first step is to identify the format of every file. Determining which tool to use requires a policy decision, so the characterization framework asks the Technical Registry. Currently this is configured to use DROID9. This tool uses known indicative byte sequences (signatures) and compares them to the bitstream of a file in order to determine its format. However, this identification step is not designed to be 100% definitive. For that, it is necessary to check that each file is formally valid against the specification of its purported format. This requires a format-specific tool. The policy for which tool to use for the formats identified by the first step is held in machine-readable form in the Technical Registry. In an ideal world a validation tool would exist for every format and the role of the initial identification tool would thus be reduced to simply telling the framework which validation tool to use. However, this is not the case, since validation tools do not exist for very many formats, and also because strict compliance with the specification is not necessary to enable preservation. Thus, for example, real-life HTML pages are rarely formally valid (e.g., with closed and correctly nested tags). Hence, the framework allows the rules to be tweaked to deal with real-life practicalities; for example, invalid HTML files are still ingested into the system. This validation step may also lead to an updated identification. For example, current DROID signatures are unable to distinguish between TIFF3, TIFF4, TIFF5 and TIFF6 files, while the validation tool used (a module from Jhove10) is capable of this.

The next step is to extract the instance properties from each file. There are two purposes here. The first is to capture those properties that might influence long-term preservation needs (e.g., if a PDF is encrypted), while the second is to measure properties that might become part of the properties of the conceptual components; this second purpose will be described in more detail below. Again the tool used is format-specific, so again the framework queries the Technical Registry to determine the appropriate policy to follow. In practice, many tools perform both format validation and property extraction, so these two steps can occur in parallel using the appropriate tool only once.

The final step in physical characterization is to attempt to detect the presence of embedded objects within files. This also takes place via a framework and queries the Technical Registry to determine the format-specific tool to use. It currently works with ZIP files but, by adding in the appropriate tool, it could similarly be applied to WARC files. Each embedded object that is detected is then characterized following the same process as that followed for each original file (i.e. it passes through format identification, format validation, property extraction and embedded object extraction). This means that the process is iterative until no further objects are left to be characterized.

9 http://sourceforge.net/projects/droid/
10 http://hul.harvard.edu/jhove/
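A minimal sketch of that iterative loop, with the tool choices and the object model reduced to hypothetical placeholders; the interfaces below are invented for illustration and are not SDB's API.

```java
import java.util.List;

/**
 * Skeleton of the iterative characterization loop: identify, validate,
 * extract properties, then recurse into any embedded objects. All types
 * and registry calls are hypothetical placeholders.
 */
public class CharacterizationSketch {

    interface Registry {
        String identificationTool();                 // e.g. "DROID"
        String validationTool(String format);        // null if none configured
        String propertyTool(String format);          // null if none configured
        String embeddedObjectTool(String format);    // null if none configured
    }

    interface DigitalObject {
        String identify(String tool);                      // returns the format
        boolean validate(String tool);
        void extractProperties(String tool);
        List<DigitalObject> extractEmbedded(String tool);  // empty if none
    }

    static void characterize(DigitalObject file, Registry registry) {
        String format = file.identify(registry.identificationTool());

        String validator = registry.validationTool(format);
        if (validator != null && !file.validate(validator)) {
            // Policy decides whether an invalid file is rejected or kept;
            // e.g. invalid HTML is still ingested.
        }

        String extractor = registry.propertyTool(format);
        if (extractor != null) {
            file.extractProperties(extractor);
        }

        String unpacker = registry.embeddedObjectTool(format);
        if (unpacker != null) {
            for (DigitalObject embedded : file.extractEmbedded(unpacker)) {
                characterize(embedded, registry); // iterate until nothing is left
            }
        }
    }
}
```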

5.2 Conceptual characterization

A web site is a high-level information object that comprises a network of linked lower-level information objects such as individual web pages, images and documents. Hence, the next stage in characterization is to attempt to identify this network of "components" and to measure not only their links to each other but also the technology-independent "significant properties" of each component. There is no fundamentally correct way in which an information object can be completely and fully described. For example, do we simply need to describe web pages or is it important to also describe elements within a web page, like a paragraph of text or a table? Whilst recognizing this, it is important to make progress, measure what it is practical to measure, and then add more sophistication later. Hence, the conceptual characterization framework builds a network made up of web pages, images and documents. It does this based on the original manifestation of the web site. In current technology this means looking for:

• HTML files (which indicate the existence of a web page)

• JPEG, GIF, BMP or PNG files (which indicate the existence of an image)

• PDF or Microsoft Word files (which indicate the existence of a downloadable document)

Clearly the HTML documents also indicate the linkages between components, so using information within each of these files it is possible to detect and record the existence of links between two web pages, between a web page and an image, and between a web page and a document. It is important to realize that the information being gathered here is technology-independent even though it is being gathered using today's technology (HTML). The conceptual characteristics being recorded state that, for example, there is a conceptual unit of information (of type web page) that is linked to another conceptual unit of information (of type image), even though this is actually being measured by reading an IMG tag of an HTML file. For each identified component, it is then possible to measure significant properties that are also technology-independent (although, again, they are measured in today's technology). For example, the title of a web page can be measured by recording the title element of each HTML page, and each image's height and width could be recorded similarly. The current framework only measures a few simple properties that prove the principle of the approach; ideally, better characterization tools will be produced that allow more detailed measurements to be made in the future. In this approach it is necessary to balance the needs of accuracy against the consequent need to hold vast quantities of information as property metadata. A practical example of this is preserving sufficient information to guarantee the color of images. One way of achieving this is to measure the color of each pixel in turn and record this, as has been demonstrated by the use of XCDL [5].

However, an alternative is to utilize a statistical method, creating a histogram of the spread of red, green and blue values over the pixels. This has been shown to be an effective way of detecting accidental color shift owing to migration failures [6] and is thus used within the "Active Preservation" framework.
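A minimal sketch of that kind of histogram check, assuming 8-bit RGB images loaded through the standard Java image API; the comparison measure and the difference threshold are illustrative assumptions rather than the method of [6].

```java
import java.awt.image.BufferedImage;

/**
 * Builds per-channel histograms (red, green, blue) and compares two images
 * by the normalized difference of their histograms. A large difference
 * suggests an accidental color shift during migration.
 */
public class ColorShiftCheck {

    static long[][] histogram(BufferedImage img) {
        long[][] h = new long[3][256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                h[0][(rgb >> 16) & 0xFF]++; // red
                h[1][(rgb >> 8) & 0xFF]++;  // green
                h[2][rgb & 0xFF]++;         // blue
            }
        }
        return h;
    }

    /** Returns a value in [0, 1]; 0 means identical histograms. */
    static double difference(BufferedImage before, BufferedImage after) {
        long[][] a = histogram(before);
        long[][] b = histogram(after);
        long totalA = (long) before.getWidth() * before.getHeight();
        long totalB = (long) after.getWidth() * after.getHeight();
        double diff = 0;
        for (int c = 0; c < 3; c++) {
            for (int v = 0; v < 256; v++) {
                diff += Math.abs(a[c][v] / (double) totalA - b[c][v] / (double) totalB);
            }
        }
        return diff / 6.0; // each channel contributes at most 2 to the sum
    }

    static boolean colorShiftSuspected(BufferedImage before, BufferedImage after) {
        return difference(before, after) > 0.05; // threshold is an assumption
    }
}
```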

6. PRESERVATION PLANNING

A thorough approach to preservation planning is to use the Planets project's preservation planning workflow [3] and, in particular, the Plato software to perform manual preservation planning. However, this process is very manual and usually only takes place once (or at infrequent intervals) to create a policy covering an entire collection. Such a policy can then be enacted by entering it into the Technical Registry and using the "Active Preservation" automated approach to ensure it is applied to existing and newly received information objects in that collection.

The first step in "Active Preservation" preservation planning is to determine the criteria for obsolescence. In its simplest form it is possible to simply list formats that need preservation action. However, it is also possible to assign risk weightings to inherent format properties such as the judged complexity of the format (e.g., high, medium or low) or the support level. In addition, it is possible to assign risk scores to instance format properties (i.e. properties that vary with each file instance of the format). These risk criteria can then be applied to SDB's metadata database to determine which files in the system are at risk. In order to prevent files that have already been migrated for a particular purpose being re-identified as in need of migration in subsequent migration workflows, files can be flagged as "inactive" once they have been migrated.

The next step is to determine the information objects that are thus at risk because one or more of the files used to manifest them in a particular manifestation is at risk. At the lowest level this is simple, since the link between low-level information objects and files is maintained in the database. However, migrating a low-level information object will affect its parent information object, and so on. Thus, it is important to ensure that migration takes place at a sufficiently high level to maintain the integrity of the network of information objects stored in the system. For this reason SDB classifies information objects into two types: collections and deliverable units. Collections can be hierarchical and each collection's immediate child can be another collection or a top-level deliverable unit. Deliverable units are also hierarchical but their children can only be other deliverable units. Only deliverable units have direct manifestations (each of which will consist of potentially many files11). If any file in any manifestation is at risk then that manifestation of the whole of the deliverable unit hierarchy to which it belongs is considered to be at risk. Note that this does not mean that additional unnecessary files will be migrated, but that post-migration validation will check the integrity of the structure of the entire hierarchical tree that this represents.

For each information object that is now on the list to be migrated, each component within it will be considered. If a component contains an at-risk file, it will be migrated. It is important to note that the atomic unit of migration is thus the component that was identified in characterization above. A simple example illustrates how this works. Suppose it is decided to migrate from one image format to another. Each image component will clearly be migrated and will (in current technology) usually involve a single file-to-single file migration step. However, in addition, the image files will also belong to web page components, so these components are also migrated. In this case the migration is not a format migration but an update of IMG tags in an HTML file, so that the migrated web site's integrity is maintained. For each component type that needs migration the system looks up the most appropriate migration pathway to use from the Technical Registry. This migration pathway tells the system not only which migration tool to use but also which component properties need to be measured after migration and compared to the values before migration (with any applicable tolerance) in order to verify that it has succeeded.

11 In fact the files are linked to manifestations via a "manifestation file" entity to allow multiple manifestations to share the same file, so that in the (usual) event of a partial migration of the files in a web site it is not necessary to hold duplicate information about files that are not changed. In fact, this approach also allows multiple deliverable units to share files, which allows efficient storage of web site snapshots.
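As a rough sketch of the planning step that turns at-risk formats into a migration list, with the data model reduced to a few hypothetical classes; SDB's real entities, database queries and risk weightings are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Toy model: a deliverable unit manifests a set of files; if any active
 * file's format is on the at-risk list, the whole manifestation is flagged
 * for migration. Class names and fields are illustrative only.
 */
public class PlanningSketch {

    static class FileRecord {
        String format;
        boolean active = true; // "inactive" files are skipped in later workflows
        FileRecord(String format) { this.format = format; }
    }

    static class DeliverableUnit {
        String name;
        List<FileRecord> manifestationFiles = new ArrayList<>();
        DeliverableUnit(String name) { this.name = name; }
    }

    static List<DeliverableUnit> atRisk(List<DeliverableUnit> units, Set<String> atRiskFormats) {
        List<DeliverableUnit> toMigrate = new ArrayList<>();
        for (DeliverableUnit unit : units) {
            for (FileRecord f : unit.manifestationFiles) {
                if (f.active && atRiskFormats.contains(f.format)) {
                    toMigrate.add(unit); // one at-risk file flags the whole manifestation
                    break;
                }
            }
        }
        return toMigrate;
    }
}
```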

7. MIGRATION

The last piece of the process is to perform the actual migration. The fact that the preservation planning process has already produced a list of components to transform and that, for each of these, there is a named tool to use means that this is a conceptually simple process. Each component is migrated in turn (or in parallel threads) using the appropriate tool. Then the outputs are all characterized and each component's characteristics are compared to those measured before migration. Note that this approach does not require tools to preserve the number of files or the split of information between files. However, it does require each conceptual component, its key characteristics and its links to other components to be preserved. For example, in the example in the last section, if the migration process did not update the IMG tags in the HTML file then in the migrated manifestation the link between the web page and the image would be broken and the migration would be known to have failed.

Once the new manifestation has been verified, it can then be ingested and stored. It is only necessary to store the new files, as the system allows a new manifestation to consist of a mixture of existing files (that were not affected by migration) and the new files.

It is also possible to perform migration in "test" mode, where the migrated content is deliberately not ingested. This allows new tools and techniques to be tried out on real content held in the archive (rather than just on artificial test data) without any risk that this will impact the integrity of the repository.

8. PRACTICE AND FUTURE STEPS

This approach has been used to demonstrate successful migration of web sites by deciding, for example, to alter image formats. To date, the approach has been used with web sites up to 20,000 files in size but there is no absolute limit to the size of the web sites that can be treated in this way. In addition, the framework has successfully detected, and thereby rejected, migrations where errors are introduced, by comparing the before and after significant characteristics. It has been found that this can occur when tools operate beyond their tested boundaries (e.g., in one case an image migration tool produced an incorrect color transform when asked to migrate a file larger than it had previously been tested on).

The approach is also very flexible since it allows new tools to be added to perform characterization, preservation planning or migration tasks as time goes by. These could either replace existing tools or add currently missing format coverage. Hence, the approach has proved itself to be successful but there are limitations and thus areas for further work. One limitation of this approach is the general problem that the characterization and migration tools needed for many formats do not exist. However, the use of a framework means it is easy to plug in new tools as they are produced. Some of the tools used in "Active Preservation" (in particular those used for the identification of components and the measurement of "significant properties") were created specifically for this work. However, in reality, these can also rely on existing technical characterization tools, since for many entities like images there is a one-to-one correspondence between a measurable instance format property and an equivalent technology-independent "significant property". A larger problem (and another general problem in digital preservation) is that there is no good agreement on what constitutes a valid "component" or a significant property [1]. In this study a pragmatic approach has been followed, largely based on properties that it is possible to measure, but moving forwards it is envisaged that more sophisticated characterization tools will lead to more informed choices being required. Another limitation is that the current implementation identifies components at the file level or above. This is an unwanted technological constraint caused by pragmatism. It should be possible to extend the concept to monitor the existence of things within files (e.g., a table or paragraph within a web page).

9. ACKNOWLEDGMENTS

The work presented in this paper has been performed by Tessella over a number of years with contributions from a number of projects. The single most important influence was from the National Archives of the United Kingdom, England and Wales' Seamless Flow Program and we would particularly like to acknowledge the contribution of Adrian Brown, the head of Digital Preservation there at the time. In addition, working with colleagues on the Planets project has been invaluable. The Planets project was supported by the European Community under the Information Society Technologies (IST) Program of the 6th FP for RTD – Project IST033789. The authors of this paper are solely responsible for the content of the paper.

10. REFERENCES

[1] Dappert, A. and Farquhar, A. Significance is in the eye of the stakeholder. Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, Corfu, Greece, 2009, Lecture Notes in Computer Science, pages 297-308.

[2] Strodl, S., Beran, P. and Rauber, A. Migrating content in WARC files. Proceedings of the 9th International Web Archiving Workshop (IWAW 2009), 2009, pages 43-49.

[3] Strodl, S., Becker, C., Neumayer, R. and Rauber, A. How to choose a digital preservation strategy: Evaluating a preservation planning procedure. Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries (JCDL '07), New York, NY, USA, 2007, pages 29-38. ACM.

[4] Sharpe, R. and Brown, A. Active Preservation. Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, Corfu, Greece, 2009, Lecture Notes in Computer Science, pages 465-468.

[5] Thaller, M. The eXtensible Characterisation Languages – XCL. Verlag Dr. Kovac, Hamburg, 2009.

[6] Sharpe, R., Henshaw, C. and Thompson, D. Managing "Visually Lossless" Compression with JPEG2000. Proceedings of Society for Imaging and Technology Archiving 2010, pages 107-112.

Terminology Evolution Module for Web Archives in the LiWA Context ∗

Nina Tahmasebi, Gideon Zenz, Tereza Iofciu, Thomas Risse
L3S Research Center, Appelstr. 9a / Appelstr. 4, Hannover, Germany
[email protected]

ABSTRACT

More and more national libraries and institutes are archiving the web as a part of the cultural heritage. As with all long-term archives, these archives contain text and language that evolves over time. This is particularly true for web archives, as content published online is highly dynamic and changes at a fast rate. Language evolution causes gaps between the terminology used for querying and the terminology stored in long-term archives. To ensure access and interpretability of these archives, language evolution must be found and handled in an automatic manner. In this paper we present the LiWA Terminology Evolution module, TeVo, which takes us one step closer to fully automatic detection of terminology evolution. TeVo consists of a pipeline for finding evolution in web archives based on the UIMA framework. The LiWA TeVo module consists of two main processing chains, the first for WARC file extraction and text processing and the second for finding terminology evolution. We also present the terminology evolution browser, the TeVo browser, which aids in exploring the evolution of terms present in archives.

Categories and Subject Descriptors
H.3.6 [Library Automation]: Large text archives; H.3.1 [Content Analysis and Indexing]: Linguistic processing

General Terms
Terminology Evolution, Semantics, Information Extraction

∗ This work is partly funded by the European Commission under LiWA (IST 216267).

1. INTRODUCTION

Preserving knowledge for future generations is a major reason for collecting all kinds of publications, web pages, etc. in archives. However, ensuring the archival of content is just the first step toward "full" content preservation. It also has to be guaranteed that content can be found and interpreted in the long run.

Currently the semantic accessibility of web content suffers due to changes in language over time, especially when considering time frames beyond ten years. Language changes are triggered by various factors including new insights, new political and cultural trends, new legal requirements or high-impact events. For example, consider the name of the city Saint Petersburg: the Russian city was founded in 1703 as "Sankt Piter Burh" and soon after renamed to "Saint Petersburg". From 1914 to 1924 it was named "Petrograd" and afterwards "Leningrad", then changed back to "Saint Petersburg" in 1991. Terminology evolution is not restricted to location names but refers to all added, changed or removed senses for a term. The work in this paper is done within the scope of the LiWA project1. In LiWA, short for Living Web Archives, the objective is to turn web archives from mere web page repositories into living web archives. LiWA aims at improving web archives by filtering out irrelevant content such as web spam [8], dealing with issues of temporal web archive coherence [19], and improving long-term usability. Within the LiWA project an abstract model, presented in [21] and [23], has been developed that allows the representation of terminology snapshots at different moments in time, extracted from large digital corpora. In order to apply the terminology evolution detection algorithms to web archives we have implemented a terminology extraction pipeline based on the Apache UIMA2 framework. The remaining paper is organized as follows. We begin by giving an overview of the architecture of the TeVo module in Section 2.

1 http://www.liwa-project.eu/
2 http://incubator.apache.org/uima/

Figure 1: TeVo architecture in LiWA. (The diagram shows the crawler and post-processing stage feeding WARC files and crawl statistics into the asynchronous terminology extraction pipeline of WARC extraction, POS tagger, lemmatizer and co-occurrence analysis, which writes to the TeVo DB; the curator then triggers the evolution detection pipeline of word sense discrimination, cluster tracking and evolution detection.)

The first part of our module, concerning terminology extraction, is explained in detail in Section 3. The second part, concerning word sense detection, is explained in Section 4. We present our visualization tool in Section 5. Experiments conducted with the module on an excerpt of a web archive are given in Section 6. We review related work in Section 7 and conclude the paper as well as discuss future work in Section 8.

Figure 2: Example of an ARC file and the extracted information.

2. ARCHITECTURE


The LiWA TeVo Module is split into terminology extraction and tracing of terminology evolution. It is a post-processing module and can be triggered once a crawl or a partial crawl is finished. As input the module takes WARC or ARC files created, e.g., by Heritrix. The terminology extraction pipeline is implemented using UIMA, as presented in Figure 1. Apache UIMA (originally developed by IBM) is a software framework for unstructured information management applications. The UIMA framework is very scalable and can analyze large amounts of unstructured information. Furthermore its modular design allows for easy extension and adoption for the TeVo module. The data exchange between pipeline components is done via the Common Analysis System (CAS) as described in [9]. The evolution detection pipeline is manually triggered by the archive curator based on crawl statistics gathered during terminology extraction. When enough data is extracted or the desired time frame is reached, the curator can start the evolution detection pipeline.

3. TERMINOLOGY EXTRACTION

For extracting terminology from web archives we have built a pipeline, as can be seen in the Terminology Extraction UIMA Pipeline in Figure 1, with the following UIMA components:

• WARC Extraction: Archive Collection Reader, using BoilerPipe [10] for extracting text from web documents.

• POS Tagger and Lemmatizer: Natural Language Processing using DKPro [13] UIMA components.

• Cooccurrence Analysis: AnnotationsToDB, writing the terminology and document metadata to a MySQL database, the TeVo DB.

When the processing finishes and the extracted terminology is indexed, terminology co-occurrence graphs can be created for different time intervals. The extraction pipeline can be called several times before the curator initiates the evolution detection pipeline.

3.1 Collection Reader

The first task in our system is to extract and annotate the text from web archives, which have the ARC or WARC format3. In order to iterate through the web pages crawled and stored in an archive we integrated the archive reading tools from the Heritrix Java API4 from the Internet Archive. For each archive file we retrieve the URL, the crawl date, the content type and the encoding from the archive header, as shown in Figure 2. For the archive files with content type text/html we retrieve the HTML content and then, based on the encoding, we extract the textual content using the BoilerPipe Java API [10]. One of the major issues when dealing with web data is extracting the text from the HTML document. By using simple HTML stripping tools, it is inevitable to also extract unwanted text, like table headers and advertisement tags. By using the BoilerPipe library algorithms we better extract the main textual content of a web page. Extracting content is very fast (milliseconds) and no global or site-level information is required. For each text file from the archive we create a document with the URI annotated as document id and the text as ArticleText. As additional document metadata we annotate the crawl date and the crawl title as given by the archive name.

3 http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
4 http://crawler.archive.org/
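As a small illustration of this extraction step, the sketch below runs BoilerPipe's article extractor over the HTML payload of one record; obtaining the record itself (URL, crawl date, mime type, raw bytes) with the Heritrix archive reading tools is not shown, and the tiny HTML string is purely hypothetical.

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

/**
 * Extracts the main textual content from the HTML payload of an archive
 * record using BoilerPipe. Iterating over the ARC/WARC records themselves
 * with the Heritrix archive reading tools is omitted here.
 */
public class BoilerPipeStep {

    public static String mainText(String html) throws Exception {
        // ArticleExtractor keeps the main article text and tends to drop
        // navigation, table headers, advertisements, etc.
        return ArticleExtractor.INSTANCE.getText(html);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div id='nav'>Home | About</div>"
                    + "<p>It involves forgiveness and a readiness to accept change.</p>"
                    + "</body></html>";
        System.out.println(mainText(html));
    }
}
```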

3.2 NLP Annotator

In the next step we annotate the article text using the components of DKPro (Darmstadt Knowledge Repository)5, a collection of UIMA-based components for NLP tasks. We process the text of each article using two annotators from DKPro: the BreakIteratorSegmenter for sentence splitting and tokenization, and the TreeTaggerPosLemmaTT4J annotator for part-of-speech tagging and lemmatization. The POS tagger and lemmatizer from DKPro are wrappers for TreeTagger [17].

Figure 3: TeVo DB Schema

3.3 Terminology Indexing

In order to make the terminology extracted from the archives available for further analysis, we implemented a database index for the found documents and terms. For detecting terminology evolution, the curvature clustering algorithm presented in [7] works on lemmas of nouns co-occurring within "and" and "or" clauses, or in lists separated by commas. In the AnnotationsToDB annotator we insert metadata about each analyzed document, i.e. its URL, the crawling date and the crawl name, into the database (see the diagram in Figure 3). Additionally we insert all the annotated sentences with the information regarding sentence position. As a start, we are interested in the evolution of nouns. Therefore we only insert lemmas of nouns occurring in documents into the database, along with their position information (beginning and end offset). All remaining annotated tokens other than nouns and conjunctions are omitted; these would make the preprocessing expensive both in terms of storage and running time. When building the co-occurrence graphs we consider nouns appearing in the same list, so we have to tackle the problem of articles and cardinals appearing before a noun. For example, in the sentence "It involves forgiveness and a readiness to accept ...", the nouns forgiveness and readiness appear in an "and" clause. Inserting the original offsets of the two nouns would hinder us from retrieving the relation between them using a simple select based on position. To overcome this limitation, for all nouns preceded by articles (a, the, ...) or cardinals (one, 20, ...) we use the beginning offset of the preceding token as the beginning offset.

One issue that frequently appears when dealing with archives generated by regular crawls is duplicate data. Pages that are not changed between two crawls will be duplicated in the archive. At the terminology level, duplicate data may negatively bias the terminology analysis [23]. To overcome this issue, when a document is found with a URL which already exists in the index, we check whether the two documents have the same set of lemmas. We consider two documents to be the same, or to have non-significant changes, if the overlap between the two lemma sets is higher than a threshold. The threshold can be configured; based on initial experiments we used a threshold of 80%.

Web pages can be static or dynamically generated. While static pages have a unique fixed URL, dynamic ones are generated based on the POST or GET parameters of the HTTP request. Thus, within the same crawl there can be different pages with the same URL. For this situation, we first check that the pages have distinct lemma sets as in the previous scenario. If we can conclude that the pages indeed have different content, we append a random number to the URL of the second page before inserting it as the document identifier in the database. Finally the co-occurrence graph is stored in the TeVo DB and can be fetched by the evolution detection module.

5 http://www.ukp.tu-darmstadt.de/research/projects/dkpro/
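A minimal sketch of that duplicate test, assuming the two lemma sets have already been loaded from the index; the 80% value is the threshold quoted above, while normalizing the overlap by the smaller set is an assumption, since the exact measure is not fixed here.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Treats two documents as duplicates (or as having non-significant changes)
 * when their lemma sets overlap above a configurable threshold. How the
 * overlap is normalized is an assumption; relative to the smaller set is
 * one plausible choice.
 */
public class DuplicateCheck {

    public static boolean sameDocument(Set<String> lemmasA, Set<String> lemmasB,
                                       double threshold) {
        if (lemmasA.isEmpty() || lemmasB.isEmpty()) {
            return lemmasA.equals(lemmasB);
        }
        Set<String> common = new HashSet<>(lemmasA);
        common.retainAll(lemmasB);
        double overlap = common.size() / (double) Math.min(lemmasA.size(), lemmasB.size());
        return overlap > threshold; // e.g. 0.8 as used in the initial experiments
    }
}
```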

4. TERMINOLOGY EVOLUTION

After finishing the terminology indexing step, we can start extracting word senses and tracking evolution. In the first step we fetch a co-occurrence graph from the TeVo DB. We keep all co-occurrences and consider the graph as an unweighted graph. We then extract word senses by clustering the co-occurrence graph. The clustering algorithm is based on the clustering coefficient of each term in the graph, i.e., the interconnectedness of the neighbors of a term. Terms that have a highly interconnected neighborhood are likely to represent stable terms, while terms with a sparsely connected neighborhood are likely to be ambiguous [7]. By removing the terms with a low clustering coefficient, the graph falls apart into coherent subgraphs (clusters) which we interpret as word senses. Previous works [6, 7, 14] have used a clustering coefficient threshold of 0.5, which provides stable word senses.
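A minimal sketch of this word sense discrimination step on an unweighted, undirected co-occurrence graph stored as adjacency sets; the graph representation and the clean-up of dangling edges are assumptions, only the clustering coefficient and the threshold follow the description above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CurvatureClusteringSketch {

    /** Fraction of the possible edges among a term's neighbors that exist. */
    static double clusteringCoefficient(String term, Map<String, Set<String>> graph) {
        Set<String> neighbors = graph.get(term);
        int k = neighbors.size();
        if (k < 2) {
            return 1.0; // degenerate neighborhoods are treated as stable here
        }
        int links = 0;
        for (String a : neighbors) {
            Set<String> adjacentToA = graph.getOrDefault(a, Collections.emptySet());
            for (String b : neighbors) {
                if (a.compareTo(b) < 0 && adjacentToA.contains(b)) {
                    links++;
                }
            }
        }
        return 2.0 * links / (k * (k - 1));
    }

    /**
     * Removes terms whose coefficient (computed on the original graph) is
     * below the threshold, e.g. 0.3 here or 0.5 in earlier work, so the
     * graph falls apart into sense clusters.
     */
    static void removeAmbiguousTerms(Map<String, Set<String>> graph, double threshold) {
        Set<String> toRemove = new HashSet<>();
        for (String term : graph.keySet()) {
            if (clusteringCoefficient(term, graph) < threshold) {
                toRemove.add(term);
            }
        }
        graph.keySet().removeAll(toRemove);
        for (Set<String> neighbors : graph.values()) {
            neighbors.removeAll(toRemove); // drop dangling references
        }
    }
}
```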

Figure 4: Subgraph from the .gov.uk archive from 2006; each node also shows its clustering coefficient in the local graph. The graph shows two clusters, both corresponding to means of transportation.

In our experiments we use a clustering coefficient threshold of 0.3, which has been shown to provide a larger number of word senses as well as a higher probability of detecting evolution. It should be noted that the extracted word senses represent the data available in the collection from which the co-occurrence information has been gathered. Figure 4 shows a small part of the graph created for the last quarter of 2006 from the .gov.uk crawls. The term car has a low clustering coefficient (0.24) since it has many neighbors which are not connected to each other. When car is removed, the graph falls apart into two smaller subgraphs. In order to make these clusters also capture ambiguous words, the removed terms are added to the clusters in which they have neighbors. The result is (bicycle, bike, car, motorbike, scooter, wheelchair) and (car, vehicle, caravan, trailer, tractor, lorry, number). Each term in one cluster is closer in meaning to the other words in that cluster than to words from the other cluster. Both clusters correspond to means of transportation, but differ in that the first cluster corresponds to smaller, mostly two-wheeled vehicles, while the second cluster corresponds to larger vehicles with four or more wheels. Once we have created clusters, i.e., word senses, for each period in time, we can compare these clusters to see if there has been any evolution. Consider the cluster C1 = (bicycle, bike, car, motorbike, scooter, wheelchair) from 2006. In 2007 we find the exact same cluster, most likely indicating that the web pages from 2006 stayed unchanged in 2007. In 2008, we find a cluster C2 = (motorcycle, moped, scooter, car, motorbike). Because of the high overlap between the clusters, e.g., car, motorbike and scooter, we can conclude that they are highly related. Still we see some shift: in C1, the words bicycle, bike and wheelchair are representatives of unmotorized means of transportation, while in C2 only motorized means of transportation are present. Because of the short time span we cannot draw any conclusions about real terminology evolution, but we can say that there has been a shift in usage in the archive. In 2003-2005 we cannot find a cluster related to motorbike. This is likely a consequence of data selection and comes from using a random sample of each crawl.
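A compact sketch of this sense-extraction step, assuming an unweighted co-occurrence graph represented with the networkx library (illustrative code, not the actual LiWA module; the 0.3 threshold is the value used above):

    import networkx as nx

    def extract_senses(graph, min_cc=0.3):
        """Split a co-occurrence graph into sense clusters.

        Terms whose local clustering coefficient is below min_cc are removed,
        the remaining graph falls apart into connected components (the clusters),
        and the removed terms are then added back to every cluster in which they
        have at least one neighbor.
        """
        cc = nx.clustering(graph)
        ambiguous = [term for term, c in cc.items() if c < min_cc]
        pruned = graph.copy()
        pruned.remove_nodes_from(ambiguous)

        clusters = [set(comp) for comp in nx.connected_components(pruned)
                    if len(comp) > 1]
        for term in ambiguous:
            for cluster in clusters:
                if any(neigh in cluster for neigh in graph.neighbors(term)):
                    cluster.add(term)
        return clusters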

Figure 5: User interface showing the clusters for the term motorbike.

The current LiWA tracking technology uses Jaccard similarity to compare clusters. This means that for two clusters we look at the ratio between the number of overlapping terms and the total number of distinct terms contained in both clusters. The similarity score of two clusters lies between 0 and 1: a similarity of 1 indicates that the two clusters are exactly the same, and a similarity of 0 indicates that they have no terms in common. For C1 and C2 the Jaccard similarity is 3/8. We consider two clusters with a similarity higher than α to represent the same word sense. In the example above, C1 would keep its meaning even if one word were removed from one archive to the next, e.g., C1′ = (bicycle, bike, car, motorbike, scooter) and C1 can be considered to represent the same sense. When two clusters have a similarity below β we consider the clusters to have no relation. Clusters with a similarity above β but below α are candidates for evolution.
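A small sketch of this comparison; the α and β values below are only placeholders, since their concrete values are configuration parameters and are not fixed here:

    def jaccard(c1, c2):
        c1, c2 = set(c1), set(c2)
        return len(c1 & c2) / len(c1 | c2)

    def relate(c1, c2, alpha=0.7, beta=0.2):      # alpha/beta: illustrative values only
        sim = jaccard(c1, c2)
        if sim > alpha:
            return "same sense"
        if sim < beta:
            return "no relation"
        return "evolution candidate"

    C1 = {"bicycle", "bike", "car", "motorbike", "scooter", "wheelchair"}
    C2 = {"motorcycle", "moped", "scooter", "car", "motorbike"}
    print(jaccard(C1, C2))                        # 0.375, i.e. 3/8 as in the example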

Year   Cluster members
1867   yard, terrace, flight
1892   hurdle race, flight, year, steeplechase
1927   flight, england, london, ontario
1938   length, flight, spin, pace
1957   flight, speed, direction, spin, pace
1973   flight, riding, sailing, vino, free skiing
1980   flight, visa, free board, week, pocket money, home
1984   flight, swimming pool, transfer, accommodation

Table 1: Selected clusters and cluster members for the term 'flight' from The Times Archive.

Because available web archives span a relatively short period of time, true terminology evolution is difficult to find in them. Therefore, as an example of cluster evolution, we show clusters from The Times Archive [11]. The archive spans from 1785 to 1985 and is split into yearly sub-collections, processed according to [22]. In Table 1 we see selected clusters for the term flight. Among the displayed clusters it is clear that flight has several senses and that clusters with the same sense are grouped together in time. Between 1867 and 1894 there are 5 clusters (only two of them displayed here) that all refer to hurdle races. Between 1938 and 1957 the clusters refer to cricket; the terms in these clusters describe the ball. Starting from 1973 the clusters correspond to the modern sense of flight as a means of travel, especially for holidays. The introduction of, among others, pocket money, visa and accommodation differentiates the later clusters from the earlier ones. The cluster from 1927 also refers to a flight, but not necessarily in the holiday sense.

Figure 6: Graphical representation of the 2006 motorbike cluster.

5. VISUALIZATION OF EVOLUTION

In order to make the results of the terminology evolution process accessible to end users, we devised a web-based user interface which allows exploring the evolution of a given term. As a running example we use the term motorbike, which is present in clusters from the 2006-2008 .gov.uk crawls (see Section 6). After the user specifies the term of interest, here motorbike, we show all clusters representing this term over time by displaying, on a time line, the term with the highest clustering coefficient of each cluster (right side of Figure 5). Furthermore, we give the term frequency distribution of the term over time (left side of Figure 5). By assessing the term frequency distribution, possibly combined with a changing cluster representative as seen on the right side, the user can infer whether a significant change in word usage happened at a given point in time. To get a deeper understanding of the context in a given year, the user can click on a cluster representative. As shown in Figure 6, all cluster members are then displayed along with their connections. The TeVo visualization browser enables the user to get a quick overview of what happens to a term over time. First of all, the raw (or normalized) term frequencies over time can give an indication of an event, or of evolution, for a term. If the term motorbike spikes in frequency in one year, it is worth investigating that term further. In addition to the term frequencies, the clusters help in capturing term context. Assume that in 2007 a music group named Motorbike is started. Then an additional cluster would appear, containing terms such as music, concerts, CD, release etc. From this, the user would get an indication of an added (or, inversely, a removed) word sense. If there has been more subtle evolution within a cluster, the user can click on that cluster and inspect the other cluster members. In Figure 6, based on the connections between the cluster members, i.e., the fact that wheelchair is connected to only one other term in the cluster, the user can deduce that the term wheelchair is less relevant to the motorbike cluster. The TeVo visualization browser thus saves the user the time of finding and reading web pages about motorbike from different periods of the archive to get this overview.

Figure 7: Number of relations shown for the samples from 2003-2008, together with the number of unique terms in these relations.

6. EXPERIMENTS

Our experiments are conducted on sample archives from the .gov.uk crawls available at the European Archive (http://www.europarchive.org). Archives from December of each year are chosen and processed. The result is a set of samples of varying size, for which we present details below. Firstly, it is clear that the number of relations extracted from the yearly samples varies heavily because of the nature of the archive. As web archives can contain multimedia files, images, videos etc., it is difficult to predict the amount of text in such a sample. This limits the control over the amount of text extracted and indexed. Furthermore, even if we can control the amount of text that is processed, if the crawl is too wide the extracted relations become sparse and the amount of useful information in each co-occurrence graph varies heavily. In Figure 7 we see the number of relations as well as how many unique terms were present in the graphs. In 2003 each term has an average of 2.7 relations in the resulting graph; in 2006 and 2007, each term has an average of roughly 5.2 relations. A major factor behind the observed variation is the diversity of the data in the crawl: the more diverse the data chosen for processing, the fewer relations there are per term. This varying behavior also shows in the number of clusters extracted from the co-occurrence graphs (Figure 8). In 2003 there is one cluster for every 36th relation, while in 2006 and 2007 there is one cluster for every 700th relation, and in 2008 one cluster for every 470th relation. This is most likely a result of the sparseness of the crawls: too many topics result in a graph with many relations that do not form triangles, which in turn yields many terms with a low clustering coefficient. These terms are removed during clustering and do not contribute to creating clusters.

Figure 8: Number of clusters shown for the samples from 2003-2008, together with the quality of these clusters.

We measure the quality of the clusters by the correspondence of clusters to WordNet synsets [15], i.e., the fraction of clusters which correspond to a word sense. For this purpose we evaluate all clusters with more than one term known to WordNet; on average 68% of the clusters are used for the evaluation. Figure 8 shows the number of clusters per year as well as the quality of the clusters. In 2003, where we have the highest number of clusters, we have a fairly low cluster quality: only 3 out of 4 clusters correspond to a word sense. In 2006 and 2007, on the other hand, we have fewer clusters with a high quality, where 9 out of 10 clusters correspond to a word sense. In 2008 the quality is again lower. We see that the results of the word sense discrimination algorithm are very irregular and highly dependent on the underlying archive. If the documents in the archive contain high-quality, descriptive text, then the clusters have a high quality. If, on the other hand, the documents in the archive contain a large amount of spam and advertisements, incorrectly written English etc., then good quality clusters are rarer. The remaining clusters, i.e., clusters that do not represent word senses or are not considered in the evaluation, are not necessarily semantically unrelated; in many cases they simply do not correspond to word senses. As an example, in 2003 we have many clusters which contain names of people and names of documents, forms etc.: (t.ereau, m-b.delisle, n.kopp, a.dorandeu, f.chretien, h.adle-biassette, f.gray) are all authors of a paper about Creutzfeldt-Jakob disease, and (sa105, sa104f, sa107, sa103, sa106, sa104, sa103l) are all tax return forms.
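One possible approximation of this WordNet-based quality check, using NLTK's WordNet interface; the exact matching criterion used in the evaluation may differ, and the rule below (a cluster counts as a word sense if at least half of its WordNet-known terms share a synset or direct hypernym with another cluster member) is only an illustrative assumption:

    from nltk.corpus import wordnet as wn

    def corresponds_to_sense(cluster, min_share=0.5):
        known = [t for t in cluster if wn.synsets(t, pos=wn.NOUN)]
        if len(known) < 2:
            return None                      # cluster not evaluated, as in the paper
        votes = 0
        for term in known:
            related = set()
            for syn in wn.synsets(term, pos=wn.NOUN):
                related.update(lemma.name() for lemma in syn.lemmas())
                for hyper in syn.hypernyms():
                    related.update(lemma.name() for lemma in hyper.lemmas())
            if any(other in related for other in known if other != term):
                votes += 1
        return votes / len(known) >= min_share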

6.1 Performance Analysis for Extraction

The run-time performance of the annotators, measured over a 93.5 MB compressed web archive, is shown in Table 2. The uncompressed archive size is 194.6 MB; it contains 9743 documents, out of which 7654 are of text/html content type and in English. All the annotators required on average less than a second per document. Most of the time is used by the AnnotationsToDB annotator, which transfers the data to the database. As we do not need the produced CAS files, we do not have to write them to disk, so the time needed by the CAS Writer can be deducted from the processing time.

Component                Time (ms)   s/doc     %
ARC Reader               59905       0.00783   1.09%
BreakIteratorSegmenter   19577       0.00256   0.36%
TreeTaggerPosLemma       106055      0.01386   1.93%
AnnotationsToDB          5218886     0.68185   95.03%
CAS Writer               85732       0.01120   1.56%
Entire Pipeline          5508361     0.71967   100%

Table 2: Annotator processing times for 7654 documents.

Given the amount of data a typical crawl consists of, the performance of the annotators still leaves room for improvement. Completing the terminology extraction process for the .gov.uk crawl used in our experiments took 14 days. The sizes of the 2006-2008 crawls are presented in Table 3. One option to speed up the processing is to index only the lemmas that are needed in the next step, i.e., lemmas with a preceding or succeeding conjunction. While this restriction saves processing time, it makes the data unusable for other co-occurrence filters, such as sentence-level or window-level co-occurrence.

Year   Number of archives   Average size   Total size
2006   231                  93.07 MB       21.5 GB
2007   3087                 93.75 MB       289.4 GB
2008   1250                 95.36 MB       119.2 GB

Table 3: .gov.uk crawls.

We create co-occurrence graphs by selecting lemmas that appear to the left and right of the found conjunctions. From a 3 GB snippet of the .gov.uk crawl of 2006, 59,844 co-occurrences were extracted, most of which had a frequency of 1. This indicates that, in order to produce meaningful co-occurrence graphs that do not exhibit the same variance as seen in Figures 7 and 8, terabytes of data are needed.
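A simplified sketch of how such co-occurrence pairs can be formed from tagged tokens; the real selection operates on the position columns of the TeVo DB, and treating every run of nouns separated only by conjunctions, commas, articles or cardinals as one list is a simplification made here for illustration:

    from collections import Counter
    from itertools import combinations

    CONJUNCTIONS = {"and", "or", ","}

    def cooccurrence_pairs(tagged_sentence):
        """tagged_sentence: list of (lemma, pos) tuples; returns a Counter of noun pairs."""
        pairs = Counter()
        run = []                                  # nouns of the current enumeration
        for lemma, pos in tagged_sentence:
            if pos.startswith("NN"):
                run.append(lemma.lower())
            elif lemma.lower() in CONJUNCTIONS or pos in ("DT", "CD"):
                continue                          # conjunctions, articles, cardinals keep the run open
            else:
                if len(run) > 1:
                    pairs.update(frozenset(p) for p in combinations(run, 2))
                run = []
        if len(run) > 1:
            pairs.update(frozenset(p) for p in combinations(run, 2))
        return pairs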

7. RELATED WORK

For finding word senses in an automatic way, i.e., word sense discrimination, several methods based on co-occurrence analysis and clustering have been proposed [4, 16, 18]. Taking semantic structures into account improves the discrimination quality. Dorow et al. [6, 7] show that co-occurrences of nouns in lists contain valuable information about the meaning of words. A graph is constructed whose nodes are nouns and noun phrases, with an edge between two nodes if the corresponding nouns are separated by "and", "or" or commas in the collection. The graph is clustered based on the clustering coefficient of each node, and the resulting clusters contain semantically related terms representing word senses. Another approach to word sense discovery focuses on pattern discovery, such as the one presented in [4].

In [15] a clustering algorithm called Clustering by Committee is presented; it produces clusters of words that can be considered synonymous. An evaluation method is also proposed, in which the discovered word senses are assessed using WordNet [12]. The output of word sense discrimination is normally a set of terms describing the senses found in the collection. This grouping of terms is derived from clustering, and we refer to such an automatically found sense as a cluster. Clustering techniques can be divided into hard and soft clustering algorithms. In hard clustering, an element can only appear in one cluster, while soft clustering allows each element to appear in several clusters. Due to the polysemous nature of words, soft clustering is the most appropriate choice for word sense discrimination. Temporal aspects in information retrieval come in different flavors, such as dealing with temporal information within documents, with temporally versioned documents, or with the temporal evolution of terminologies extracted from documents. According to our analysis, not much work has been done on the problem of terminology evolution. Abecker et al. [1] show how the medical vocabulary evolved in the MEDLINE system. McCray investigates the evolution of the MeSH ontology [2]; in that study, psychiatric and psychological terms are manually analyzed and their evolution is studied over 45 years. Terminology evolution can also be observed in other domains. For example, in computer science the Faceted DBLP (http://dblp.l3s.de/) allows analyzing the evolution of given keywords at different times based on the Semantic GrowBag approach [5]. However, all these approaches assess the evolution manually, and the results cannot directly be used by information retrieval systems. Automatic detection of cluster evolution can aid in automatically detecting terminology evolution and has been a well-studied field in recent years. One approach for modeling and tracking cluster transitions is presented in the MONIC framework [20], in which internal as well as external cluster transitions are monitored. The disadvantages of the method are that the algorithm assumes a hard clustering and that each cluster is considered as a set of elements, without respect to the links between the elements of the cluster. In a network of lexical co-occurrences the links can be valuable, since the connections between terms give useful information about the sense being represented. In [14], a way to detect evolution is presented which also considers the edge structure among cluster members.

To our knowledge only one previous work has been published in the area of terminology evolution [3]. Using language from the past, the aim there is to find good reformulations of queries posed in current language. A term from a query can be reformulated with a similar term if the terms in the resulting query are also coherent and popular. Terms are considered similar if they co-occur with similar terms in their respective collections. Our approach advances on this by using word senses to find similar terms, rather than pure co-occurrence information. Furthermore, our approach provides more detailed knowledge about the evolution, such as the time intervals for which reformulations are valid.

8. CONCLUSIONS AND FUTURE WORK

In this paper we presented the LiWA Terminology Evolution module, TeVo, which takes us one step closer to fully automatic evolution detection for a long-term archive. TeVo can be integrated effectively as a Heritrix post-processing module. We focused on extracting the terminology needed for noun evolution detection and on overcoming the challenges that appear when dealing with web archives. We also introduced a tool which helps visualizing knowledge from, and discovering properties of, a given archive.

We conducted experiments on selections of a large real-world dataset, which gave us first insights into applying our terminology evolution algorithms to web archives. As our experiments show, the variance in the size and quality of the co-occurrence graphs was significant. In order to overcome these issues we need to improve the data selection or significantly increase the amount of data used. It is also necessary to investigate further how to properly handle duplicate data and to determine the consequences of keeping or removing such data. As future work, we intend to optimize the indexing component to make it more applicable to web-scale data. We also want to use the TreeTaggerChunker from DKPro (http://www.ukp.tu-darmstadt.de/research/projects/dkpro/) to annotate n-grams for detecting noun phrases. Additionally, instead of only taking cluster evolution into account, we want to specifically detect term evolution; this will allow us to determine, e.g., that automobile evolved into car. Furthermore, we intend to use the TeVo browser to display parts of the co-occurrence graph for each term, which will help users gain even more detailed insights about a term in a given time period. The modules of the TeVo architecture as well as the TeVo browser will be released as open source before the end of the LiWA project (http://code.google.com/p/liwa-technologies/).

9. ACKNOWLEDGEMENTS

We would like to thank Times Newspapers Limited for providing the archive of The Times for our research.

10. REFERENCES

[1] A. Abecker and L. Stojanovic. Ontology evolution: Medline case study. In Proceedings of Wirtschaftsinformatik 2005: eEconomy, eGovernment, eSociety, pages 1291-1308, 2005.
[2] A. McCray. Taxonomic change as a reflection of progress in a scientific discipline. www.l3s.de/web/upload/talk/mccray-talk.pdf.
[3] K. Berberich, S. Bedathur, M. Sozio, and G. Weikum. Bridging the terminology gap in web archive search. In WebDB, 2009.
[4] D. Davidov and A. Rappoport. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In ACL '06: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 297-304, Sydney, Australia, 2006.
[5] J. Diederich and W. T. Balke. The Semantic GrowBag algorithm: Automatically deriving categorization systems. In ECDL, volume 4675 of Lecture Notes in Computer Science, pages 1-13. Springer, 2007.


[6] B. Dorow. A Graph Model for Words and their Meanings. PhD thesis, University of Stuttgart, 2007.
[7] B. Dorow, J.-P. Eckmann, and D. Sergi. Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In Workshop MEANING-2005, 2004.
[8] M. Erdélyi, A. A. Benczúr, J. Masanès, and D. Siklósi. Web spam filtering in internet archives. In AIRWeb, pages 17-20, 2009.
[9] T. Götz and O. Suhre. Design and implementation of the UIMA Common Analysis System. IBM Systems Journal, 43(3):476-489, 2004.
[10] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In WSDM '10: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441-450, New York, NY, USA, 2010. ACM.
[11] Times Newspapers Ltd. The Times Archive. http://archive.timesonline.co.uk/tol/archive/.
[12] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.
[13] C. Müller, T. Zesch, M.-C. Müller, D. Bernhard, K. Ignatova, I. Gurevych, and M. Mühlhäuser. Flexible UIMA components for information retrieval research. In Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP', pages 24-27, Marrakech, Morocco, May 2008.
[14] G. Palla, A.-L. Barabási, and T. Vicsek. Quantifying social group evolution. Nature, 446(7136):664-667, April 2007.
[15] P. Pantel and D. Lin. Discovering word senses from text. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 613-619, Edmonton, Alberta, Canada, 2002. ACM.
[16] T. Pedersen and R. Bruce. Distinguishing word senses in untagged text. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 197-207, Providence, RI, 1997.
[17] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html.
[18] H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.
[19] M. Spaniol, A. Mazeika, D. Denev, and G. Weikum. "Catch me if you can": Visual analysis of coherence defects in web archiving. In Proceedings of the 9th International Web Archiving Workshop, in conjunction with ECDL 2009, Corfu, Greece, 2009.
[20] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. MONIC: Modeling and monitoring cluster transitions. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 706-711, New York, NY, USA, 2006. ACM.
[21] N. Tahmasebi, T. Iofciu, T. Risse, C. Niederée, and W. Siberski. Terminology evolution in web archiving: Open issues. In 8th International Web Archiving Workshop, Aarhus, Denmark, September 2008. http://iwaw.net/08/IWAW2008-Tahmasebi.pdf.
[22] N. Tahmasebi, K. Niklas, T. Theuerkauf, and T. Risse. Using word sense discrimination on historic document collections. In JCDL '10: Proceedings of the 10th ACM/IEEE-CS Joint Conference on Digital Libraries, Gold Coast, Australia, 2010. ACM.
[23] N. Tahmasebi, S. Ramesh, and T. Risse. First results on detecting term evolutions. In Proceedings of the 9th International Web Archiving Workshop, in conjunction with ECDL 2009, Corfu, Greece, 2009.

Author Index

Costa, Miguel, 9
Denev, Dimitar, 24
Grotke, Abbie, 17
Iofciu, Tereza, 55
Jones, Gina, 17
Masanès, Julien, 42
Mazeika, Arturas, 24
Oita, Marilena, 31
Pop, Radu, 42
Risse, Thomas, 55
Senellart, Pierre, 31
Sharpe, Robert, 48
Silva, Mário J., 9
Spaniol, Marc, 24
Tahmasebi, Nina, 55
Vasile, Gabriel, 42
Weikum, Gerhard, 24
Zenz, Gideon, 55