
Application Independent Metadata Generation

Jürgen Belizki, Stefania Costache and Wolfgang Nejdl
L3S Research Center, University of Hanover
Hanover, Germany
{belizki, costache, nejdl}@l3s.de

ABSTRACT

To efficiently support personal ways of desktop usage, we have to unleash the power of implicit metadata, thus giving local data a well-defined meaning. To achieve this, contextual information across heterogeneous media types, file formats, and applications should be annotated and linked. In this paper we present a lightweight system which monitors the file structure and automatically generates semantic metadata based on the user's activities. We underpin the utility of the extracted metadata by showing how it can be leveraged to enhance conventional full-text desktop search.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management; H.3.4 [Information Storage and Retrieval]: Systems and Software; K.8.3 [Personal Computing]: Management/Maintenance; K.8.m [Personal Computing]: Miscellaneous

General Terms: Algorithms, Design

Keywords: Contextualized metadata

1. INTRODUCTION

In recent years, the average amount of data stored on users' hard drives has grown rapidly. No wonder that we sometimes cannot find a document we know we saved somewhere. Desktop search applications, which index the data on a PC, have emerged to deal with this problem. Still, these tools lack ranking facilities like PageRank [8], which revolutionized web search. Users may therefore be forced to go through the whole list of search results to pick the right one. The main problem with ranking on the desktop is that valuable links between documents either do not exist or are lost during use. For example, an email attachment is no longer related to the subject of the message or to its sender as soon as it is stored as a file on the PC. In a realistic scenario, where the user has forgotten who sent her particular documents by email, the simple full-text index employed by current desktop search engines cannot even find the information she is looking for.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CAMA’06, November 10, 2006, Arlington, Virginia, USA. Copyright 2006 ACM 1-59593-524-X/06/0011 ...$5.00.

In [11] the authors showed the human tendency to associate things with certain contexts. This information can be analyzed in order to establish meaningful connections between initially isolated resources. Desktop search is in fact "a search into our past" and should hence exploit the wealth of implicit and explicit data that is separately available in diverse tools, such as email, folder hierarchies and file-specific metadata, to improve its searching and ranking capabilities. This will bring desktop search much closer to the performance of its web counterparts.

To keep these relations, we represent the user's personal semantic model in an application-independent way as RDF statements, according to an ontology which incorporates the essential parts of all the desktop contexts. The incrementally stored RDF data can be a property of a single resource ("author" for an email, "title" for a PDF article) or can build more complex connections between resources (an email "has attachment" a file).

This work proposes such an automatic solution, which takes into account the user activity under which each physical resource was created or used, in order to handle the large amounts of data on our desktops. The implemented prototype focuses on three main working contexts, which are detailed in the next section: electronic mail, files and folder hierarchies, and research publications, which we differentiate from other, less structured files. We show how appropriate metadata generators extract or infer the corresponding context information, which can be used by a search engine along with a full-text index of our documents.

The next section gives an overview of the system's general architecture. It starts with a description of jNotify, the core component of our prototype, and goes on to the different contexts that can be exploited by the metadata generators coupled with jNotify. Section 3 shows how these metadata generator modules describe contexts by means of appropriate ontologies and association rules. Section 4 classifies existing approaches according to how they use metadata to augment search. Finally, Section 5 concludes with a look at some open questions and further improvements.

2. GENERAL ARCHITECTURE

2.1 jNotify

The main characteristic of our desktop search architecture is on-the-fly metadata generation and indexing, triggered by modification events that are generated when the file system changes. jNotify is such an event listener: it continuously watches the user's home folder and provides an API, through callbacks, for all monitorable changes. It is implemented in Java and relies on the Inotify notification facility [7], which has been part of the Linux kernel (used in our prototype) since version 2.6.13. Events are fired whenever a new file is copied to the hard disk or stored by the web browser, when a file is deleted, created or modified, when a new message in the local mailbox is read, etc. Inotify recognizes up to 15 different events, summarized in Fig. 1.

Figure 1: Inotify Events.

jNotify uses the Java Native Interface (JNI) to take advantage of the platform-specific Inotify: glue code written in C exposes the native functions needed to control Inotify from Java. However, Inotify registers just one watch per file to oversee access to it. A directory in UNIX systems is actually also a file, with a special attribute denoting it as a directory, that contains a list of file names and "pointers" to these files on the disk. A watch on a registered directory therefore applies only to the files and subdirectories directly inside it. The subdirectories are not registered themselves unless they are read or created after the watch was initialized. For this reason, given a directory to observe (for example the user's home folder), jNotify descends into it and recursively registers watches on all accessible subdirectories.

If some application touches one of the registered files or directories, jNotify passes the file's absolute path, as reported by the kernel, to the metadata generator that corresponds to the action and to the file's media type or format. Thus, RDF annotations are created upon CREATE events, updated upon CLOSE_WRITE and MOVE events, and removed upon DELETE events. For instance, when the user renames a directory, she triggers an update of the contextual data derived from the path of every file inside it, as described in Section 3. jNotify runs in the background without any significant decrease in overall system performance and keeps the personal semantic model up to date independently of the actual semantic search tool, e.g. Beagle++ [2], which can then make use of the produced metadata in the way described above.
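The two mechanisms described in this subsection, recursive watch registration and event-to-action dispatch, can be illustrated with a small sketch. It is not the jNotify implementation, which is written in Java and talks to Inotify through JNI. The sketch only collects the directories that would each need a watch and applies the create/update/remove mapping to an in-memory dictionary. The names EVENT_ACTIONS, collect_watch_dirs and dispatch, and the dictionary standing in for the RDF store, are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical mapping from the event names used in the text to the
# annotation action they trigger (create, update, or remove).
EVENT_ACTIONS = {
    "CREATE": "create",
    "CLOSE_WRITE": "update",
    "MOVE": "update",
    "DELETE": "remove",
}


def collect_watch_dirs(root):
    """Return root and every accessible subdirectory below it.

    Each returned directory would need its own watch, because a watch on
    a directory does not cover the contents of its subdirectories.
    """
    watch_dirs = []
    # os.walk silently skips directories it cannot list (onerror=None).
    for dirpath, _dirnames, _filenames in os.walk(root):
        watch_dirs.append(dirpath)
    return watch_dirs


def dispatch(event, path, annotations):
    """Apply the action for one event to an in-memory annotation store.

    The dictionary is only a stand-in for an RDF repository.
    """
    action = EVENT_ACTIONS.get(event)
    if action is None:
        return None  # event is not relevant for metadata generation
    abs_path = os.path.abspath(path)
    if action == "remove":
        annotations.pop(abs_path, None)
    else:
        # A real system would call a metadata generator chosen by the
        # file's media type here; the sketch only records the event.
        annotations[abs_path] = {"last_event": event}
    return action
```

In a real listener, collect_watch_dirs would be replaced by one watch registration per directory, and dispatch would be called from the event callback with the path reported by the kernel.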

2.2 Usage Contexts for Metadata Generation

For our system to take advantage of the kernel's notification functionality, we identified three important usage contexts for desktop items, all of which can eventually be reduced to files. In the following paragraphs we introduce each of them with a small scenario in which ordinary full-text search fails, but additional context metadata provides the missing ingredients to find the document we are looking for.

Email context: Existing search algorithms clearly drop a great deal of useful information present in emails. For example, one email might contain a question describing the object one is looking for, and another email in the same thread might include the answer to that question in the form of an attached document. As already mentioned, email attachments lose all provenance information as soon as they are stored, even though emails usually include additional information about their attachments, such as the sender, the subject and remarks in the body. We might discuss a paper with a colleague during a brainstorming session, and afterwards send her the electronic version via email, together with a few helpful comments. After a while, our colleague might not remember details about the paper itself, but rather recall with whom she discussed it, or which question was raised in the discussion and included as a comment. It would be helpful to find the stored paper not only based on its content, but also associatively, based on that context.

The basic aspects relevant to the email context are the date when an email was sent or accessed, as well as its subject and body text. The status of an email can be described as seen/unseen or read/unread. The property "reply to" represents email thread information, and the "has attachment" property describes a 1:n relation between an email and its attachment(s). The "sender" field gives information about the person, which can be associated with a social networking trust scheme, thus providing valuable input for assessing the quality of the email according to the reputation of its sender.

File hierarchies: Despite the effort the user invests to structure her data in folders, this sophisticated classification is barely utilized by search algorithms. For example, pictures taken during a trip to Germany are probably saved in a directory named after the city or the region, like "Lower Saxony" or "Hanover". However, our user might have no time to rename each image, so the file names are the ones assigned by the camera (for example "DSC00728.JPG"). When the user forgets the directory name, no ordinary search can retrieve her pictures, as the only word she remembers, "Germany", appears neither in the file names nor in the directory structure. It would certainly be useful if an enhanced desktop search for "pictures germany" retrieved her folder containing the Hanover pictures.

Our context metadata for files include the main file properties, such as path and the dates of access and creation. File types can be inferred automatically and provide useful information as well (in our case, the file is of type "JPEG image data"). We also keep track of the whole file path, including the directory structure.

Publications context: Research is one of the occupations where the need for contextual metadata is very high. The most illustrative example is the scientific publication itself: we might remember the general topic of a paper and the person who sent it to us by email, but not its title. Which other papers did we download or discuss via email at that time, and how good are they (based on a ranking measure or on their number of citations)? These questions arise rather often in a research environment and have to be answered by an appropriate search infrastructure.

Publications represent a specific type of file, with additional information inherent to a scientific article, which comprises its "author", "conference", "year", "title" and "cites". Additionally, we store the paper's CiteSeer ID (if any). The publication context can be connected to the email context if we communicate with an author or if we save a publication from an email attachment. Of course, since each publication is stored as a file, it is also connected to the file context, and thus to the file-specific information associated with it (e.g., path, number of accesses, etc.).

Generated annotations are defined in the form of an ontology, depicted in Fig. 2, which integrates the metadata relevant to each context outlined above. Additionally, the lower left corner of the overview shows another context, for visited web pages stored in the browser cache, which was explored in our previous work [3]. However, the dedicated metadata generation module for it does not rely on jNotify and is therefore outside the focus of the present work.

Figure 2: Contextual Ontology for the Semantic Desktop.
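To make the email context concrete, the following sketch turns one message into subject-predicate-object statements and then finds attachments by sender, which a plain full-text index over the stored file could not do. It is an illustration under stated assumptions, not the prototype's generator: it uses Python's standard email package instead of JavaMail, plain tuples instead of RDF, and the sample message, the identifier "email:1", and the function names email_triples and attachments_from are all hypothetical. The property names "sender", "subject" and "has attachment" follow the text; "date" is a shorthand for the sent date.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message in message/rfc822 form with one attachment.
RAW_MESSAGE = """\
From: Alice Example <alice@example.org>
To: bob@example.org
Subject: Paper from our brainstorming session
Date: Fri, 10 Nov 2006 09:30:00 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Here is the paper we discussed.
--XYZ
Content-Type: application/pdf
Content-Disposition: attachment; filename="paper.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQ=
--XYZ--
"""


def email_triples(raw, email_id):
    """Extract (subject, predicate, object) statements from one email."""
    msg = message_from_string(raw)
    _name, address = parseaddr(msg.get("From", ""))
    triples = [
        (email_id, "sender", address),
        (email_id, "subject", msg.get("Subject", "")),
        (email_id, "date", msg.get("Date", "")),
    ]
    # walk() visits the message itself and every MIME part; only parts
    # that carry a file name are treated as attachments.
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            triples.append((email_id, "has attachment", filename))
    return triples


def attachments_from(triples, sender):
    """Return the attachments of all emails sent by the given address."""
    emails = {s for s, p, o in triples if p == "sender" and o == sender}
    return [o for s, p, o in triples if p == "has attachment" and s in emails]
```

Because the sender and the attachment are both linked to the same email resource, a query for the sender's address reaches the stored file even when the file's own content never mentions that person.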

3. METADATA GENERATORS

Depending on the type of the file and of the event, the appropriate metadata generators, building upon our own RDFS ontologies merged into one as depicted in Fig. 2, either extract available metadata directly (e.g. email sender, subject, body) or infer and instantly materialize them using appropriate association rules, possibly with some additional background knowledge (e.g. WordNet). Annotations are exported to RDF using the Jena toolkit¹ and stored in the Sesame RDF Repository². Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF and RDFS, including a rule-based inference engine which we use to implement our association rules.

Email Metadata Generator: Our current email prototype is built on top of the JavaMail API³. It reads incoming emails, independently of which client wrote them⁴, into a separate class derived from the Message class defined in JavaMail, since the Message class already provides helpful methods such as getTo, getRecipients, getSubject and getSentDate. Further metadata are produced when attachments are stored in the file system.

File Metadata Generator: For the "picture example" from the last section we need to consider file type and path information. Furthermore, we need to be able to go beyond simple keyword search. The rules allow us to take into account explicit compositional information (Hanover "is part of" Germany), hierarchical information (hackee "is a" (species of) squirrel), and synonym information (picture "is a synonym of" image). Hence, to capture all meaningful information available from the name of the file itself and from the folders in the file path, these names are extended with holonyms, hypernyms, meronyms, hyponyms and synonyms provided by WordNet (through the JWNL API⁵), a lexical reference system for the English language, thus enriching the file's context.

As future work, we intend to extract file-specific information. For example, many image formats provide additional metadata, such as exposure information. Another possible improvement for this generator is to use additional background knowledge about seasons etc., as well as to let the user manually add more annotations to files or directories. We could then search for the pictures we took during the last winter in Germany, or during a special event in our life, like a birthday.

Publication Metadata Generator: For each PDF file identified as a probable publication, the generator parses the title and tries to match it against an entry in the CiteSeer publications database. If it finds such a paper, it builds an RDF annotation containing information from the database about the title of the paper, the authors, the publication year, the conference, the papers which cite this one, and other CiteSeer references to publications. All annotation files corresponding to papers are merged in order to construct the RDF graph of the publications existing on one's desktop. Our system also supports BibTeX files, commonly used to store bibliographic data associated with a LaTeX file. The BibTeX Metadata Generator directly generates instances of Publication and Person from the data stored in .bib files.

¹ http://jena.sourceforge.net/
² http://www.openrdf.org/
³ http://java.sun.com/products/javamail/
⁴ It handles only the maildir type of email storage, where there is one file with the MIME type "message/rfc822" per email. We are currently working on supporting mbox, used by e.g. Thunderbird, which stores each directory with all its emails as one file. In addition, another Inotify-independent module was implemented to periodically poll metadata from an IMAP server.
⁵ http://sourceforge.net/projects/jwordnet/
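The path expansion performed by the File Metadata Generator can be sketched for the "pictures germany" scenario as follows. This is only an illustration: the prototype uses WordNet through JWNL and rules in Jena, whereas the sketch uses two tiny hand-written tables, PART_OF and SYNONYMS, as hypothetical stand-ins for those lookups. The functions path_terms, expand and matches are likewise assumed names.

```python
import posixpath

# Hypothetical stand-ins for WordNet lookups; a real system would query
# a lexical database instead of these hand-written tables.
PART_OF = {"hanover": ["lower saxony"], "lower saxony": ["germany"]}
SYNONYMS = {"pictures": ["images", "photos"]}


def path_terms(path):
    """Split a file path into lowercase terms, one per path component.

    The extension of the final component (the file name) is dropped.
    """
    parts = [p for p in path.split("/") if p]
    if parts:
        parts[-1] = posixpath.splitext(parts[-1])[0]
    return [p.lower() for p in parts]


def expand(terms):
    """Add synonyms and transitive 'is part of' holonyms to the terms."""
    expanded = set(terms)
    queue = list(terms)
    while queue:
        term = queue.pop()
        for related in PART_OF.get(term, []) + SYNONYMS.get(term, []):
            if related not in expanded:
                expanded.add(related)
                queue.append(related)
    return expanded


def matches(query, path):
    """True if every query word is among the expanded terms of the path."""
    expanded = expand(path_terms(path))
    return all(word in expanded for word in query.lower().split())
```

For a path such as "/home/user/Pictures/Hanover/DSC00728.JPG", the word "germany" is absent from the raw path terms but is reached through the chain Hanover, Lower Saxony, Germany, so matches("pictures germany", path) succeeds where a keyword match on the path alone would fail. Materializing the expanded terms when the file is created or moved, rather than at query time, corresponds to the "infer and instantly materialize" approach described above.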

4. RELATED WORK ON PERSONAL INFORMATION MANAGEMENT

Aiming at a unified view of the resources involved in a certain activity, our approach builds upon the idea of a semantic desktop [4]. There, the authors envision that the next step in communication is a desktop application based on the Semantic Web, which could draw connections between all the types of data people exchange. For example, an entry in an agenda would be correlated with the author of an article or with the context associated with an email. Altogether, the entire information existing in a social network would be connected to each desktop. Such a structure would then help people organize and find information, thanks to the metadata brought into the system.

The Fenfire project [6] proposes a solution to interlink any kind of information on one's desktop, for instance a birthday with the person's name and the articles she wrote. The idea is to move from the current file structure to a structure that allows people to organize their data closer to reality and to their needs, and in which comments and annotations are possible for any file.

Haystack [9] pursues goals similar to those of Fenfire. One important focus is on working with the information itself, not with the programs it is usually associated with. For example, a single application should be enough to see both a document and the email address of the person who sent it.

A third project building an information management environment for the desktop is Gnowsis [10]. The main idea behind applications in this environment is the use of a central information server which allows users to administer and directly access all the information on their computer (for example the author of a file, her email address, etc.). Gnowsis envisions appropriate ontologies at four levels. The first is used on the server, as it needs custom formats for its internal operation data and for its configuration files. The second covers each application and the data stored by it. For example, in Outlook Express the types of data that can be found are emails, contacts and appointments. On the third level there are public ontologies, created by others to describe people, projects or documents (e.g. Dublin Core or FOAF). On the uppermost level, the user can create user-specific ontologies to fit her needs. For each level, only general architectural information is given, without specific details or examples of the proposed ontologies.

Facilitating search for information the user has already seen is the main goal of the Stuff I've Seen (SIS) system, presented in [5]. Because the user has already seen the information, contextual cues such as time, author, thumbnails and previews can be used to search for and present it. However, [5] mainly focuses on experiments investigating the general usefulness of this approach, without presenting more technical details.

Instead of searching through directories or performing keyword search on the desktop, Semex [1] offers search and browsing by association. To enable this, Semex automatically constructs a database of objects and of the associations between them, extracted from multiple types of data sources. Similarly to our system, Semex provides a single logical view of one's personal information, hiding the boundaries that exist between data residing in disparate applications.

5. CONCLUSIONS AND FURTHER WORK

We described an application-independent method of generating metadata that is based only on the basic notification functionality provided by the operating system: whenever an event occurs, the appropriate methods are called and metadata is generated fully automatically. We also described the metadata generators that provide useful additional data and the scenarios in which they are successfully applied. As future work, we plan to enrich our metadata generators with more embedded or attached file-specific information, such as GPS coordinates and the time when a picture was taken, or exposure information that could place an event at a time of day. We believe this kind of additional data would make the user's task of finding information easier and more efficient.

6. ACKNOWLEDGMENTS

We would like to thank Christian Kohlschütter, who implemented the basic part of jNotify.

7. REFERENCES

[1] Y. Cai, X. L. Dong, A. Y. Halevy, J. M. Liu, and J. Madhavan. Personal information management with semex. In SIGMOD Conference, pages 921–923, 2005.
[2] P.-A. Chirita, S. Costache, W. Nejdl, and R. Paiu. Beagle++: Semantically enhanced searching and ranking on the desktop. In ESWC, pages 348–362, 2006.
[3] P.-A. Chirita, R. Gavriloaie, S. Ghita, W. Nejdl, and R. Paiu. Activity based metadata for semantic desktop search. In ESWC, pages 439–454, 2005.
[4] S. Decker and M. Frank. The social semantic desktop. Technical report, DERI, 2004.
[5] S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff i've seen: a system for personal information retrieval and re-use. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 72–79, New York, NY, USA, 2003. ACM Press.
[6] B. Fallenstein. Fentwine: A navigational rdf browser and editor. In Proc. of 1st Workshop on Friend of a Friend, Social Networking and the Semantic Web, 2004.
[7] R. Love. Kernel korner: intro to inotify. Linux J., 2005(139):8, 2005.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[9] D. A. Quan and R. Karger. How to make a semantic web browser. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 255–265, New York, NY, USA, 2004. ACM Press.
[10] L. Sauermann. The gnowsis semantic desktop for information integration. In Wissensmanagement, pages 39–42, 2005.
[11] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: a study of orienteering behavior in directed search. In CHI '04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 415–422, New York, NY, USA, 2004. ACM Press.
