Generating resource descriptions from metadata to support relevance assessments in retrieval

Alison Cawsey, Diana Bental & Bruce Eddy
Department of Computing and Electrical Engineering
Heriot-Watt University
Riccarton, Edinburgh, Scotland EH14 7AS
{alison,diana,ceebde1}@cee.hw.ac.uk

Patrick McAndrew
Institute for Educational Technology
The Open University
Walton Hall, Milton Keynes, England, MK7 6AA
[email protected]

Abstract

We describe methods for presenting descriptions of online resources that help the user assess their likely relevance without having to download them. These descriptions are based on metadata describing a resource. Two approaches are explored. The first uses current XML-based standards and tools (XSLT and RDF) to offer tailored tabular presentations of selected metadata. The second uses natural language generation techniques to create concise textual descriptions. Both approaches tailor descriptions to user interests using a simple user profile based on stereotypes.

Introduction

Searching for relevant online documents, whether multimedia or text, is an interactive process involving relevance judgements by the user as well as the search engine. A search engine may retrieve a list of documents which are rated as relevant to the query, but it is then up to the user to examine this imperfect list to find those that are most likely to be genuinely useful given their information need. These may then be downloaded, and a further judgement made as to whether the query needs revising in the light of the results.

One obstacle in this process is that the information provided by the search engine about each resource is often not sufficient to allow the user to assess its possible relevance prior to download; often just the title and a few lines of text from the resource are given, or a thumbnail image. A consequence of this is that user time and bandwidth are wasted in downloading resources which turn out to be irrelevant to the user’s information need. The problem is especially acute where the resources consist primarily of multimedia components, with consequent poor descriptions and high download times.

There is already some work addressing this problem. Summarisation techniques may be used for text resources, with query-directed summaries describing document content in a way that depends on the user’s query (e.g., Sanderson, 1998). While promising, the summaries produced are limited to what can be extracted from the text, and thus ignore information about the resource which may be external to it. Generating genuine summaries (as contrasted with concatenated extracted fragments of the document) also requires domain-specific knowledge, in order to robustly extract the key information from a document. And as the summarisation methods work by extracting text fragments, they are not useful for multimedia resources where the text represents only a small part of the information content of the resource.

Another approach that is currently being pursued to support better search and retrieval is the use of rich metadata. Metadata is data about the resource, such as the topic, author, and date of last modification (e.g., see Dempsey & Heery, 1998). There are a number of recent initiatives concerned with how metadata should be represented, and what metadata elements (for example, title) should be included in a formal description of a resource (e.g., Dublin Core¹). We are particularly interested in educational resources, where there has been much work establishing a core set of useful metadata elements to enable both teachers and learners to find resources². The GEM (Gateway to Educational Materials) element set, for example, includes the education-specific elements audience, duration, essential resources, grade and pedagogy, as well as standard elements from the Dublin Core.

This metadata, as well as allowing for improved search, provides a source of structured data which should allow improved summary descriptions of a resource to be given to the user prior to download. The restricted metadata element sets and controlled vocabularies used make the task of generating quality descriptions more tractable (compared with summarisation techniques). We are interested in using this metadata to provide tailored descriptions so that the user may make a quick, informed decision about a document’s relevance. We believe that providing good descriptions from metadata (making it visible) will both improve metadata-based search services and motivate faster adoption of metadata standards.

The paper presents two approaches to producing such descriptions. In the first we show how current XML-based standards and tools may be exploited in producing tailored descriptions in a tabular format. We use XSLT (eXtensible Stylesheet Language Transformations) to present summary descriptions to the user given metadata information represented using RDF (Resource Description Framework). Both of these are recent web technology standards relating to XML (eXtensible Markup Language), endorsed by the W3C (World Wide Web Consortium).
While XSLT provides an elegant solution for the production of fairly simple descriptions, it is limited in power when more complex transformations are required. If we want to create quality natural language descriptions from the semi-structured metadata, rather than simple tables, we must look to techniques for natural language generation (see for example Reiter & Dale, 1997). The second half of the paper introduces the possibilities afforded by this approach, and preliminary work on using aggregation techniques to produce concise and coherent descriptions.

Using XSLT to Create Tailored Resource Descriptions

Background: XML and Metadata Standards

The eXtensible Markup Language (XML) is a language for marking up documents so their structure is made explicit. Extra information is added to a text to make the role of each part explicit. XML is a metalanguage based on the more complex SGML, but allowing easier delivery of documents over the World Wide Web. Like SGML it allows authors to define their own set of markup tags to use (e.g., "author" and "title"); this contrasts with HTML, which has a fixed tag set.

An XML document contains no information on how it should be displayed. Display is managed through use of a style sheet. A style sheet, for example, could specify that the title element should be displayed in a large bold font. The eXtensible Style Sheet Language (XSL) is a flexible style sheet language for use with XML documents. It consists of two parts, one concerned with transformation, XSLT (Clarke, 1999), and one with formatting objects. Currently it is the transformation part that is most developed. At its simplest XSLT allows mail-merge-like facilities, substituting elements from the XML document into some template. But XSLT may also be viewed as a general tree transformation language allowing one XML document tree to be transformed into another.

1. www.purl.org/DC/
2. See www.imsproject.org and www.geminfo.org for two initiatives concerned with educational metadata.

XML is used as the syntax for the Resource Description Framework (RDF), a W3C standard for expressing metadata. RDF allows for richly structured descriptions, and exploits XML’s namespace mechanism to make clear the metadata ‘vocabulary’ being used (Ianella, 1998; Lassila & Swick, 1999).

Generating Tabular Descriptions from RDF Metadata using XSLT

We are interested in how we can best present metadata to the user. As RDF is an emerging standard for representing metadata, we initially considered metadata in RDF format, and explored the use of XSLT to present this data in different ways. Figure 1 gives an example of a simple RDF metadata record. The first two lines state that it is an XML document, and that a particular stylesheet should be used to present the data. This is followed by the main RDF record. The XML namespace declarations (xmlns) define abbreviations that can be used within the main record to uniquely identify particular metadata fields (e.g., dc:title is a title field, as defined by the Dublin Core programme). The main metadata description (starting rdf:Description) is a hierarchical structure specifying, in this simple example, the resource’s title, creator, date and subject, with more detailed structured information given on the creator. While much more complex structures are possible in RDF (with an underlying data model based on directed graphs), this example should be sufficient to illustrate the use of XSLT.

[RDF markup not recoverable. Surviving field values: title "Generating resource .."; creator "Alison Cawsey", [email protected]; date "2000-01-01"; subject "Metadata, Retrieval"]

Figure 1: Example RDF Metadata

XSLT is a transformation language that takes one (tree-structured) XML document and transforms it into another XML document. It is currently most commonly used to transform XML into HTML for delivery to current browsers (HTML, with some restrictions, is valid XML). XSLT is currently supported by Internet Explorer, and will be supported in future versions of Netscape. Presenting data effectively with XSLT is a natural and elegant solution, and it also means that more of the processing can be done straightforwardly on the client side using standard technology.

An XSLT stylesheet consists of a number of templates, each with a pattern that may match parts of the input tree. The body of a template may consist of a mixture of literal XML structure for the output tree and XSLT instructions. Figure 2 illustrates a simple example, suitable for presenting RDF data as a nested table³. When the root node of the input tree (/) is matched, some HTML is output, and templates are applied to the main RDF description. RDF descriptions result in HTML table elements being created, while metadata fields such as dc:subject result in a table row containing the name of the metadata field and its value. (Similar templates should be defined for all the other possible metadata elements, such as dc:title, vcard:fn, etc.) As the values of metadata elements may themselves be structured (another description), applying these templates may result either in a simple table row, or in a row with a nested table within it. The result of applying the stylesheet given in figure 2 to the RDF in figure 1 is sketched in figure 3.
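The recursion that these templates express can be sketched outside XSLT. The following Python fragment is only an illustration of the traversal, not the stylesheet of figure 2: the sample record is hypothetical (loosely modelled on the description of figure 1, with a placeholder e-mail value, a placeholder vCard namespace URI, and an assumed vcard:email property), and the function names are ours.

```python
import xml.etree.ElementTree as ET
from html import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical record for illustration; the e-mail value, the vcard:email
# property and the vCard namespace URI are placeholders, not from the paper.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:vcard="http://example.org/vcard#">
  <rdf:Description>
    <dc:title>Generating resource ..</dc:title>
    <dc:creator>
      <rdf:Description>
        <vcard:fn>Alison Cawsey</vcard:fn>
        <vcard:email>someone@example.org</vcard:email>
      </rdf:Description>
    </dc:creator>
    <dc:date>2000-01-01</dc:date>
    <dc:subject>Metadata, Retrieval</dc:subject>
  </rdf:Description>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def description_to_table(desc):
    """Turn an rdf:Description into a table with one row per property."""
    rows = []
    for prop in desc:
        nested = prop.find(f"{{{RDF_NS}}}Description")
        if nested is not None:
            # Structured value: recurse, giving a table nested inside the row.
            value = description_to_table(nested)
        else:
            value = escape((prop.text or "").strip())
        name = escape(local_name(prop.tag))
        rows.append(f"<tr><td>{name}</td><td>{value}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"


def render(rdf_xml):
    """Root rule: emit the page wrapper, then process the main description."""
    root = ET.fromstring(rdf_xml)
    desc = root.find(f"{{{RDF_NS}}}Description")
    return ("<html><body><h1>Resource Description</h1>"
            + description_to_table(desc) + "</body></html>")
```

Calling render(SAMPLE) yields an outer table with four rows, the creator row holding a second table. In XSLT the same effect comes from template matching rather than explicit recursion, which is what makes the stylesheet version compact.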

[Stylesheet markup not recoverable. Surviving literal text: page heading "Resource Description"; row label "Subject"]

Figure 2: XSLT Stylesheet Fragment for RDF

It is easy to write a stylesheet that will only present selected metadata; we simply omit templates for the metadata fields of no interest. We can extend this idea to create presentations tailored to the individual user and their query: if we derive from a user profile (and the query) a specification of which metadata fields are relevant to them, we can dynamically create a stylesheet specific to that situation. This is not a typical use of stylesheets (which are usually fixed, with one stylesheet associated with a document) but is technically not a problem. The reference to the stylesheet in the RDF document ("displayrdf.xsl" in the example in figure 1) can be to a simple CGI program that creates the stylesheet dynamically according to user preferences. Some of these more complex uses of XSLT, and a discussion of limitations, are given in (Cawsey, 2000), with demonstrations available at http://www.cee.hw.ac.uk/~mirador/demos.html.
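As a rough sketch of the dynamic case, the Python function below writes out stylesheet text containing row templates only for the fields a profile has selected. It is not the CGI program referred to above: the function name, the field-to-label mapping, and the Dublin Core namespace URI are assumptions, and a catch-all empty template is added so that unselected elements produce no output under XSLT 1.0's built-in rules.

```python
from xml.sax.saxutils import escape

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def build_stylesheet(fields):
    """Return XSLT 1.0 text with one row template per selected field.

    `fields` maps a qualified element name such as 'dc:title' to the label
    shown in the table. Only the rdf: and dc: prefixes are declared here;
    other vocabularies would need their own xmlns declarations.
    """
    parts = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}"',
        '    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
        '    xmlns:dc="http://purl.org/dc/elements/1.1/">',
        '  <xsl:template match="/">',
        '    <html><body>',
        '      <xsl:apply-templates select="rdf:RDF/rdf:Description"/>',
        '    </body></html>',
        '  </xsl:template>',
        '  <xsl:template match="rdf:Description">',
        '    <table><xsl:apply-templates select="*"/></table>',
        '  </xsl:template>',
    ]
    for qname, label in fields.items():
        parts += [
            f'  <xsl:template match="{qname}">',
            f'    <tr><td>{escape(label)}</td>'
            '<td><xsl:apply-templates/></td></tr>',
            '  </xsl:template>',
        ]
    # Lowest-priority rule: any element without its own template is dropped.
    parts.append('  <xsl:template match="*"/>')
    parts.append('</xsl:stylesheet>')
    return "\n".join(parts)
```

For example, build_stylesheet({"dc:title": "Title", "dc:subject": "Subject"}) gives a stylesheet that shows only the title and subject rows. Generating the text per request keeps the transformation itself on the client, at the cost of one small server-side program.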

3. This example XSLT fragment is based on the current XSLT specification, as used by the XT implementation. Small changes are required to run under Internet Explorer 5, which uses an older specification.

Title      Generating Resource…
Creator    Name     Alison Cawsey
           Email    [email protected]
Date       2000-01-01
Subject    Metadata, Retrieval

Figure 3: Simple Tabular Presentation of Metadata

Generating Natural Language Descriptions

Although tables present specified metadata quite clearly, there are advantages in creating natural language descriptions of the data. Consider the following example:

Title      Astronomy
Author     John Smith
Subject    Science
Type       Lesson Plan
Grades     6-9

This data may be given more concisely in the single sentence ‘Astronomy is a science lesson plan by John Smith for pupils in grades 6-9.’ Creating natural language descriptions also opens the possibility of presenting information and opinions that go beyond the literal data, comparing and contrasting two resources, commenting on suitability, and so on. Simple natural language descriptions are also accessible to people who find tables hard to interpret.

If metadata is simple and consistent (with the same metadata fields being used across resources), and little or no tailoring or selection is required, such textual descriptions may be created using template-based approaches, filling in the blanks in a fixed structure. XSLT is entirely suitable for this, and provides a little more power than most mail-merge and database reporting tools.

However, if different resources have different associated metadata elements, or different users require quite different information about resources, then template-based approaches start to break down. The task then becomes one of creating clear and concise sentences given an arbitrary subset of metadata elements (e.g., author, date, and grade). This is a natural language generation problem: how do we create coherent, concise texts given some structured data (and possibly a specified communicative goal)?

The particular area of natural language generation that appears most critical to success is aggregation: the creation of complex sentences by combining simpler structures (Shaw, 1998; Reape & Mellish, 1999). We have identified a number of types of aggregation that can be used in our application, focusing initially on syntactic aggregation. Within syntactic aggregation, we can perform the following operations:

• Adjective grouping: two adjectives which modify the same item are joined with ‘and’, e.g., ‘Astronomy’ is a free resource. ‘Astronomy’ is a highly rated resource. -> ‘Astronomy’ is a free and highly rated resource.
• Co-ordination: sentences with the same subject are joined with ‘and’, e.g., ‘Astronomy’ is for grades 3-6. ‘Astronomy’ is about space science. -> ‘Astronomy’ is for grades 3-6 and is about space science.
• Embedding, e.g., ‘Astronomy’ is a lesson plan. ‘Astronomy’ is about science. ->
    Adjectival: ‘Astronomy’ is a science lesson plan.
    Prepositional phrase: ‘Astronomy’ is a lesson plan on science.
    Relative clause: ‘Astronomy’, which is on science, is a lesson plan.
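The three operations can be illustrated with a deliberately simplified Python sketch. It works on strings rather than on the syntactic structures a real generator would manipulate, all function names are ours, and it assumes noun phrases that take the article ‘a’.

```python
def adjective_grouping(subject, adjectives, noun):
    """Join adjectives modifying the same noun with 'and'."""
    return f"{subject} is a {' and '.join(adjectives)} {noun}."


def coordination(subject, predicates):
    """Join predicates that share a subject with 'and'."""
    return f"{subject} {' and '.join(predicates)}."


def embed_adjectival(subject, noun, topic):
    """Realise the topic as a premodifier of the noun."""
    return f"{subject} is a {topic} {noun}."


def embed_prepositional(subject, noun, topic):
    """Realise the topic as a prepositional phrase after the noun."""
    return f"{subject} is a {noun} on {topic}."


def embed_relative(subject, noun, topic):
    """Realise the topic as a non-restrictive relative clause."""
    return f"{subject}, which is on {topic}, is a {noun}."
```

For instance, coordination("‘Astronomy’", ["is for grades 3-6", "is about space science"]) reproduces the co-ordination example above. String patterns of this kind are enough to show the effect of each operation, but they cannot check agreement or choose articles, which is one reason to work with explicit grammatical structures instead.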

We define a grammar specifying allowed aggregated sentence structures, and constraints on which metadata elements may be realised and combined using which aggregation methods. Then, given an arbitrary subset of metadata elements (from the types specified in our constraints) we may generate all the legal aggregated sentences. Some examples are given in figure 4, based on metadata from GEM (Gateway to Educational Materials).

Set1: (title, type, subject)
The resource is called Constellations. It is a lesson plan and on science.
The resource is called Constellations. It is a science lesson plan.
The resource is called Constellations. It is a lesson plan on science.
Constellations, which is a lesson plan, is on science.
Constellations, which is on science, is a lesson plan.
Constellations is a lesson plan and on science.
Constellations is a science lesson plan.
Constellations is a lesson plan. It is on science.
Constellations is a lesson plan on science.
Constellations is on science. It is a lesson plan.

Set2: (title, audience1, type, subject)
Constellations, which is a lesson plan for students, is on science.
Constellations, which is a science lesson plan, is for students.
Constellations, which is a lesson plan, is for students and is on science.
Constellations, which is a lesson plan, is on science and is for students.
…
Constellations is a science lesson plan for students.
Constellations is a lesson plan for students and is on science.
Constellations is on science and is a lesson plan for students.
Constellations is a science lesson plan for students.

Set3: (audience2, grade)
The resource, which is a tool for teaching professionals, is for grades 3-6.
The resource is for teaching professionals and for grades 3-6.
…
The resource is a tool for teaching professionals. It is for grades 3-6.

Figure 4: Example aggregated texts generated for three different sets of metadata elements.
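The idea of enumerating every sentence licensed by a set of constraints can be sketched as follows. This is not the grammar used to produce figure 4: the constraint table, the phrase patterns and the function name are hypothetical, and only single-sentence forms with the title as subject are covered.

```python
from itertools import product

# Hypothetical constraint table: the realisations each element may take.
REALISATIONS = {
    "subject": ("premodifier", "postmodifier", "coordinated"),
    "audience": ("postmodifier", "coordinated"),
}
# Hypothetical phrase patterns for non-premodifier realisations.
PHRASE = {"subject": "on {}", "audience": "for {}"}


def generate(title, rtype, extras):
    """Return every sentence allowed by REALISATIONS for the given elements."""
    names = [n for n in REALISATIONS if n in extras]
    sentences = []
    # One candidate sentence per combination of realisation choices.
    for choice in product(*(REALISATIONS[n] for n in names)):
        pre, post, coord = [], [], []
        for name, how in zip(names, choice):
            value = extras[name]
            if how == "premodifier":
                pre.append(value)
            elif how == "postmodifier":
                post.append(PHRASE[name].format(value))
            else:
                coord.append("is " + PHRASE[name].format(value))
        noun_phrase = " ".join(pre + [rtype] + post)
        clauses = [f"is a {noun_phrase}"] + coord
        sentences.append(f"{title} {' and '.join(clauses)}.")
    return sentences
```

With title "Constellations", type "lesson plan" and extras {"subject": "science", "audience": "students"} this yields six candidates, among them "Constellations is a science lesson plan for students." Exhaustive enumeration of this sort makes over-generation easy to see, since every syntactically permitted combination appears regardless of how natural it reads.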

Although the sentences thus generated are concise, and syntactically acceptable, many of them are odd, leading to false implicatures. We now plan to investigate how best to select from the syntactically acceptable forms, and how semantic and terminological information adds additional constraints on acceptable combinations.

Tailoring Resource Descriptions

Whether textual or tabular descriptions are produced, we want these to be tailored to the user’s needs and interests, and to the specific query. We currently allow users to set up user profiles which specify the degree to which the user is interested in the different possible metadata fields, on a four-point scale. For example, for the metadata elements educational level or price, the user profile will contain information on whether this is something that is important to the user, interesting, dull, or of no interest.

Users may find that the effort of specifying a detailed profile is not worth the benefit of more concise or relevant descriptions. We therefore also specify a number of stereotypes for typical categories of user (e.g., teacher, student, parent). The user may select an appropriate stereotype, and refine it by adjusting values where desired (Rich, 1999).

In order to create a basis for different user models, we presented different search scenarios (e.g., a teacher searching for classroom material, or a researcher wishing to do an overview of web-based resources) to a small group of users and asked them which items of metadata they found most useful in each scenario. Our group of users was small, but it included teachers, lecturers and university researchers who were able to give detailed comments on their needs in the different scenarios based on their own experience.

These initial stereotypes can also be easily refined following actual use of the system. Individual users may adapt their profile based on these stereotypes, adjusting elements that don’t apply to them. These adjusted profiles give us the information needed to create better stereotypes that are representative of actual users. We also allow users to specify a preference for textual or tabular presentation of the information.
Then, following a search, we identify which metadata elements to present, based on the user profile and search results, and present that data in the appropriate form. As our language generation system is still very much a prototype, textual descriptions are limited to a small set of metadata elements which the system has the knowledge to be able to express. Currently we make little use of the query when tailoring descriptions. However, we recognise that the terms (and potentially metadata fields) identified in the query should influence the description produced. Some existing systems (such as GEM) emphasise query terms in bold, or similar. We will ensure that query terms are visible and emphasised in a description, whether by linguistic or paralinguistic means.
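A minimal Python sketch of stereotype-based profiles and field selection is given below. The four-point scale follows the description above, but the stereotype contents, the field names, and the rule that only ‘interesting’ or ‘important’ fields are displayed are assumptions made for illustration.

```python
from enum import IntEnum


class Interest(IntEnum):
    """The four-point interest scale."""
    NONE = 0
    DULL = 1
    INTERESTING = 2
    IMPORTANT = 3


# Hypothetical stereotype values, for illustration only.
STEREOTYPES = {
    "teacher": {
        "title": Interest.IMPORTANT,
        "grade": Interest.IMPORTANT,
        "pedagogy": Interest.INTERESTING,
        "price": Interest.DULL,
        "creator": Interest.NONE,
    },
    "parent": {
        "title": Interest.IMPORTANT,
        "price": Interest.IMPORTANT,
        "grade": Interest.INTERESTING,
    },
}


def make_profile(stereotype, overrides=None):
    """Copy a stereotype and apply the user's own adjustments."""
    profile = dict(STEREOTYPES[stereotype])
    profile.update(overrides or {})
    return profile


def select_fields(profile, record, threshold=Interest.INTERESTING):
    """Pick fields present in the record that meet the interest threshold.

    The result is ordered by decreasing interest; ties keep record order.
    Fields absent from the profile are treated as being of no interest.
    """
    chosen = [f for f in record if profile.get(f, Interest.NONE) >= threshold]
    return sorted(chosen, key=lambda f: -profile[f])
```

A teacher who also cares about cost could call make_profile("teacher", {"price": Interest.IMPORTANT}); select_fields would then include the price field wherever a record supplies it. The selected field list is the kind of specification from which either a tabular or a textual presentation could then be produced.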

Future Work: Describing Sets of Resources

A search engine will typically return a large set of resources to be examined further by the user, possibly ordered according to assessed relevance to the query. We believe that the user will be able to evaluate the potential relevance of each item more efficiently if they can see a description covering a set of items, rather than having each resource description treated separately. If we work with sets of resources we can:

• Compare and contrast resources, based on relevance to the query and other properties (Resource 1 is similar, but is rated more highly..)
• Provide more concise descriptions, making use of aggregation methods that apply over several resources (e.g., Resource 1 is published by XYZ. Resource 2 is published by XYZ. -> Resource 1 and Resource 2 are published by XYZ).

We are currently extending and developing the aggregation methods used for single resources to cover sets of resources, and will consider issues of contrast and emphasis. In this work we find that XSLT is not powerful enough for the sorts of reasoning involved, but once we have established the core techniques we may go back to see whether they can be partially re-engineered using XSLT.
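Aggregation over several resources can be illustrated by grouping resources that share a field value and emitting one sentence per group. The Python sketch below is an assumption about how this might look rather than a description of an implemented method; the function names and the data layout are ours.

```python
from collections import defaultdict


def join_names(names):
    """Format a list of names as 'A', 'A and B' or 'A, B and C'."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def aggregate_by(resources, field, verb_phrase):
    """Return one sentence per distinct value of `field`.

    `resources` is a list of (name, metadata dict) pairs. Resources that
    lack the field are skipped. Groups appear in order of first occurrence.
    """
    groups = defaultdict(list)
    for name, metadata in resources:
        if field in metadata:
            groups[metadata[field]].append(name)
    sentences = []
    for value, names in groups.items():
        verb = "is" if len(names) == 1 else "are"
        sentences.append(f"{join_names(names)} {verb} {verb_phrase} {value}.")
    return sentences
```

Given two resources published by XYZ, aggregate_by(resources, "publisher", "published by") returns the single sentence "Resource 1 and Resource 2 are published by XYZ." Grouping by value is the simplest case; comparison and contrast between resources would need reasoning about how values differ, not just whether they match.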

Conclusion

We have explored a number of methods for creating descriptions which help users quickly assess the likely relevance of a resource. These descriptions may be tailored to include those elements that a particular user regards as most important. Using current web technology (XSLT) allows tailored tabular presentations of metadata in XML format, but creating concise textual descriptions requires more complex reasoning. We have outlined our current approach for creating text descriptions, and identified limitations and future directions. Once we have completed this development, we will be able to evaluate the approach; we will look at precision/recall figures for users using different systems (e.g., the proportion of actually relevant documents identified by the user as relevant given a description), as well as measuring time taken to reach a judgement, and subjective preferences.

References

Rich, E. (1999) Users are individuals: individualising user models, International Journal of Human Computer Studies 51, 323-338.
Cawsey, A. (2000) Presenting tailored resource descriptions: Will XSLT do the job? In Proceedings of the 9th International World Wide Web Conference.
Reiter, E., & Dale, R. (1997) Building applied natural language generation systems. Natural Language Engineering 3(1), 57-87.
Clarke, J. (1999) XSL Transformations (XSLT) Version 1.0, W3C Recommendation, URL: http://www.w3.org/TR/xslt, Nov 1999.
Dempsey, L., & Heery, R. (1998) Metadata: a current view of practice and issues, Journal of Documentation, 54(2), 145-172.
Sanderson, M. (1998) Accurate user directed summarization from existing tools, In Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM 98), pp 45-51.
Shaw, J. (1998) Clause Aggregation using Linguistic Knowledge, Proceedings of the 9th International Workshop on Natural Language Generation, pp 138-147.
Ianella, R. (1998) An Idiots Guide to the Resource Description Framework, The New Review of Information Networking, 4.
Lassila, O. & Swick, R.R. (1999) Resource Description Framework (RDF): Model and Syntax Specification, http://www.w3.org/TR/REC-rdf-syntax
Reape, M. & Mellish, C. (1999) Just what is Aggregation Anyway? In Proceedings of the European Workshop on Natural Language Generation, Toulouse, France, May 1999.
