Identifying Inference Rules for Automatic Metadata Generation from Pre-existing Metadata of Related Resources
Merkourios Margaritopoulos, Isabella Kotini, Athanasios Manitsaris and Ioannis Mavridis¹
Abstract Manual indexing of learning resources according to metadata standards is a laborious task. Automatic metadata generation is a developing research field with diverse approaches, and its main advantage is the work saved by not having to create metadata manually. This paper introduces a methodology for automatic metadata generation that exploits the relations between the resources to be described, and presents examples and empirical data from applying the methodology to the LOM (Learning Object Metadata) standard. The methodology consists of consecutive steps that identify inference rules for automatically generating a resource’s metadata from the pre-existing metadata of its related resources.
1 Introduction
Metadata is structured data that describes the characteristics of a resource and is usually defined as “data about data” ([7], [15]). IEEE in [16] refines this definition as “information about an object”. Metadata has much in common with the cataloguing that takes place in libraries, museums and archives. Moreover, metadata remains independent of the objects it describes and can store information not present in the described object (such as usage rights or third-party annotations). In the educational domain these objects are called learning objects ([9], [15], [16]). As the population of learning objects is increasing exponentially, while particular learning needs are developing equally rapidly, a lack of information or metadata about the objects restricts the ability to locate, use and manage them. The adoption of standard structures for the interoperable description of learning objects has been a major step towards the definition of a common indexing background.

The LOM standard ([16]) ambitiously defines almost 80 elements for the description and management of learning objects. Yet not only the number but also the diversity of the elements in this metadata specification has created implementation difficulties. Such a dense metadata set becomes by itself a source of trouble to potential indexers, since using the metadata set in its entirety is a complex and resource-demanding task. In [17] this problem is called the “metadata bottleneck”. Several studies on the use of LOM metadata have been carried out, providing interesting conclusions on the way the elements are used by indexers ([11], [20], [21]).

A solution to the problem of manual indexing is the use of automatic metadata generation techniques ([12], [13]). Automatic metadata generation exploits several sources from which metadata values can emerge. These sources of information are known as document content analysis, document context analysis and composite documents structure ([3]). Among these three sources, composite documents structure is a highly interesting approach that applies efficiently to collections of objects. Objects connected to each other by some kind of relation together form a whole, and it is therefore possible that several of their metadata elements influence each other. As [8] notes, metadata generation through “related content” is a method of metadata propagation parallel to basic object-oriented modelling concepts like inheritance; hence the authors strongly encourage the research community to tackle this issue.

This paper is organised as follows: Section 2 presents the motivation for the proposed methodology, based on related work on automatic metadata generation by exploiting relations between resources. The focus is on educational resources. Section 3 introduces a methodology for the identification of inference rules that automatically generate metadata values of a resource based on pre-existing metadata of its related resources. Section 4 provides examples and empirical data for the practical application of the proposed methodology to the LOM metadata schema. Finally, Section 5 draws a conclusion and presents plans for future work.

¹ Department of Applied Informatics, University of Macedonia, 156 Egnatia Street 54006, Thessaloniki, Greece, {mermar, ikotin, manits, mavridis}@uom.gr
2 Motivation from Related Work
Several research efforts on automatic metadata generation through related resources have produced interesting results ([1], [2], [6], [10], [14], [19]). A study of these efforts, which take advantage of already described related resources to produce new metadata descriptions of a resource, reveals that the set of inference rules used is a core element of each approach, since it is the means of determining the metadata values. Each of the above efforts incorporates a set of logic rules whose application (usually by means of a rule or inference engine) produces implicit propositions for the values of metadata of related resources. Defining logic rules is an intellectual task that has to take into account the semantics of the relations and the metadata. The inference rules proposed by the researchers usually coincide (there is relative uniformity in the kinds of rules defined: inheritance, aggregation, etc.). Sometimes, however, they diverge because of differences in the perceived semantics of the relations and the metadata elements. In this regard, most of the rules used by the researchers are heuristic rules: they are not mathematical propositions that hold globally, but results of experience. For example, in [20] the educational context of an object is considered to be the same as that of an object being part of the first one, while in [2] such an inference is not adopted.

The above discussion makes apparent the need for a generic, common methodological framework that defines a guided process for identifying the complete set of inference rules suggesting the metadata of a resource based on the metadata of its related resources. The research work presented in this paper stems from the observation that the process of defining such rules must follow a well-formed theoretical construction based on the semantics of the relations connecting the resources. For this reason, a generic methodology for identifying inference rules, based on existing relations between resources, is introduced. The methodology aims at enriching the existing relations by identifying new (implicit) ones and at using existing metadata to compute the influenced metadata of related resources. The methodology can be applied regardless of the metadata schema and the semantics of the relations used.
3 Generic Methodology for Identifying Inference Rules
Since the following presentation of the generic methodology focuses on indexing learning objects using LOM, LOM relations are taken into consideration. The relations defined by LOM are adopted directly from the Dublin Core metadata set ([4], [5]). The semantics of the Dublin Core relations suggest a certain semantic perspective on the relations between general resources and documents, which mainly serves the administrative needs of librarians ([10], [22]). In addition, the way Dublin Core relations are defined cannot properly serve the needs of an educational environment in which learning objects described with the LOM standard are used. Thus, a slight modification to the semantics of these relations is a necessary first step before applying the proposed methodology. The approach presented in this article has been greatly influenced by the definition of the semantics for the six pairs of LOM relations proposed by [10]. Identifying inference rules that generate metadata values of a resource by exploiting a net of existing relations between resources is a process of four consecutive steps, each of which is elaborated in the following subsections.
3.1 Step 1: Locating Connection Features

In most cases, the semantics of a relation connecting two resources is defined in free text describing the concept of relating two resources with this relation, or is implied by the meaning of the verb or noun used to name the relation, with no further explanation. Thus, adopting logic inferences about interconnected properties of the resources is a demanding intellectual task. To come up with such inferences, one has to locate the interrelated properties of the resources connected by a relation, that is, the properties that specify this connection on the basis of similarities or differences. In the rest of this paper these properties are called “connection features”. Connection features may be stated explicitly in the definition of the semantics of a relation; in other cases they may be implied. For example, the definition of the semantics of the Dublin Core relation “IsVersionOf” ([4], [5]) clearly highlights the connection features “Format” and “Creator” (the related resources have the same format and the same creator), whereas one can presume the connection feature “Topic area” (different versions of a resource belong to the same topic area). Apart from relations referring to semantic characteristics of the resources they connect, structural relations (part – whole relations) are also included in the definition of connection features (“part” or “subset”, “whole” or “superset” connection features).

To represent a set of relations and the connection features of each visually, and so assist the extraction of logic inferences and the identification of the inference rules, the use of a 2-dimensional table is proposed in [18]. The rows of the table are the relations under consideration, while its columns contain their connection features. The connection features depicted in the table are defined as common properties of the two related resources, as well as properties that influence each other in a specific way. Thus, the interrelation of the connection features of a relation does not necessarily take the form of equality; a relation may define a certain type of differentiation. For the relation “IsLessSpecificThan” in [10], the connection feature “Level of details” is defined, since the values of this property for the two resources connected by this relation influence each other in a specific way (one is lower than the other).
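The table of relations and connection features can be pictured as a small data structure. The following Python fragment is a minimal sketch, not the table from [18]: its rows and columns are limited to the examples named in this paper, and the cell labels ("same", "lower", "structural") as well as the helper names are assumptions introduced here for illustration.

```python
# Hypothetical connection-feature table: rows are relations, columns are
# connection features. Only relations named in the text are listed; the
# cell labels are illustrative assumptions, not part of LOM or Dublin Core.
CONNECTION_FEATURES = {
    "IsVersionOf": {"Format": "same", "Creator": "same", "Topic area": "same"},
    "IsFormatOf": {"Intellectual content": "same"},
    "IsPartOf": {"Part": "structural"},
    "HasPart": {"Whole": "structural"},
    "IsLessSpecificThan": {"Level of details": "lower"},
}


def relations_using(feature):
    """Return the relations (rows) that have an entry in the given column."""
    return sorted(
        relation
        for relation, features in CONNECTION_FEATURES.items()
        if feature in features
    )


def as_table(table):
    """Return (column names, rows), with empty cells as empty strings."""
    columns = sorted({f for features in table.values() for f in features})
    rows = [
        (relation, [features.get(column, "") for column in columns])
        for relation, features in table.items()
    ]
    return columns, rows
```

Keeping the table as a mapping from relation to features makes both lookups cheap: which features a relation carries, and which relations share a feature.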
3.2 Step 2: Creating Inference Rules for New Implicit Relations

The relations that connect the members of a set of resources can be enriched by applying certain inference rules that generate new implicit relations from the already existing ones. Such a process involves two substeps:

• Firstly, a number of rules for the generation of new relations are created by exploiting the mathematical properties of the relations. Since the relations in question are all binary relations (having two arguments), they have, as the case may be, common properties of binary relations from set theory, such as having an inverse, symmetry, transitivity, etc. For example, transitivity of the LOM relation “IsPartOf” leads to the rule: «If resource a “IsPartOf” resource b and resource b “IsPartOf” resource c, then resource a “IsPartOf” resource c».

• Secondly, one takes advantage of the connection features defined in the previous step to create rules that yield new relations by means of “relation transfer”. Relation transfer states that if resource a is connected to resource b with relation σ, then it is also connected with the same relation to all other resources connected to b through a certain connection feature (the relation is transferred to them). For example (using LOM relations), the relation “IsReferencedBy” connecting resource a to resource b can be transferred to all resources sharing the same “Intellectual content” with b (such as all resources related to each other with the relation “IsFormatOf”). The resulting rule is: «If a “IsReferencedBy” b and b “IsFormatOf” c then a “IsReferencedBy” c».

Propositions of this sort can be depicted by adding the symbol “v” to the table produced in the previous step ([18]). The symbol “v” depicts the proposition «resource a is connected to resource b with relation σ, as well as to any other resource connected to b through the connection feature appearing in the respective column of the table» and facilitates the formulation of the respective inference rule of relation transfer.
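Both substeps amount to computing a closure over a set of relation triples. The Python sketch below illustrates this under stated assumptions: the property tables `INVERSE`, `TRANSITIVE` and `TRANSFER` and the function `derive_relations` are hypothetical names, they cover only relations mentioned in this paper, and treating “HasFormat” as a transfer path alongside “IsFormatOf” is an assumption made here.

```python
# Assumed property tables, limited to relations named in the text.
INVERSE = {
    "IsPartOf": "HasPart", "HasPart": "IsPartOf",
    "IsFormatOf": "HasFormat", "HasFormat": "IsFormatOf",
    "IsReferencedBy": "References", "References": "IsReferencedBy",
}
TRANSITIVE = {"IsPartOf"}
# relation -> relations across which it is transferred, because they
# share a connection feature (here "Intellectual content").
TRANSFER = {"IsReferencedBy": {"IsFormatOf", "HasFormat"}}


def derive_relations(explicit):
    """Return the explicit triples plus all implicit ones.

    Each triple is (subject, relation, object). Rules are applied
    repeatedly until no new triple appears (a fixed point).
    """
    facts = set(explicit)
    while True:
        new = set()
        for a, rel, b in facts:
            # Substep 1a: inverse relations.
            if rel in INVERSE:
                new.add((b, INVERSE[rel], a))
            for b2, rel2, c in facts:
                if b2 != b or c == a:
                    continue
                # Substep 1b: transitivity.
                if rel == rel2 and rel in TRANSITIVE:
                    new.add((a, rel, c))
                # Substep 2: relation transfer.
                if rel2 in TRANSFER.get(rel, ()):
                    new.add((a, rel, c))
        if new <= facts:
            return facts
        facts |= new


parts = derive_relations({("a", "IsPartOf", "b"), ("b", "IsPartOf", "c")})
refs = derive_relations({("a", "IsReferencedBy", "b"), ("b", "IsFormatOf", "c")})
```

Here `parts` contains the transitive triple `("a", "IsPartOf", "c")` and `refs` contains the transferred triple `("a", "IsReferencedBy", "c")`, matching the two rules quoted above. A production system would more likely delegate this to a rule or inference engine.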
3.3 Step 3: Mapping Connection Features to Metadata Elements

The connection features, regarded as properties of resources, can be mapped to certain metadata elements of the schema used for describing the resources. The interrelation of the connection features of two resources (through the relation that connects them) is translated into the interrelation of their respective metadata elements. Thus, considering the LOM metadata schema ([16]), the connection feature “Intellectual content” is mapped to metadata elements that are affected by the content of the resources (such as “1.4 General.Description” and “1.5 General.Keyword”). Consequently, resources connected to each other with a relation that uses this connection feature (such as “IsFormatOf”, which prescribes that the connected resources have the same intellectual content) will have their respective metadata elements interrelated in the same way as their connection features (i.e. equal). Furthermore, the connection feature “Whole” can be mapped to LOM elements expressing properties of learning objects that can influence the same properties of an object that contains them, like “1.3 General.Language”, “1.5 General.Keyword”, “5.9 Educational.Typical Learning Time”, “6.1 Rights.Cost”, etc. Similarly, the connection feature “Understanding” can be mapped to LOM metadata elements that can be influenced by the notion of users understanding the learning objects, which are “5.6 Educational.Context”, “5.7 Educational.Typical Age Range” and “5.11 Educational.Language”.
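The mapping of this step can be written down as a lookup from connection features to LOM elements. The Python sketch below copies only the feature and element names given as examples in this subsection; the dictionary names, the function `elements_influenced_by` and the small relation-to-feature table are assumptions made for illustration.

```python
# Connection feature -> LOM elements, as named in the examples of this
# subsection (the full mapping is not reproduced here).
FEATURE_TO_ELEMENTS = {
    "Intellectual content": [
        "1.4 General.Description",
        "1.5 General.Keyword",
    ],
    "Whole": [
        "1.3 General.Language",
        "1.5 General.Keyword",
        "5.9 Educational.Typical Learning Time",
        "6.1 Rights.Cost",
    ],
    "Understanding": [
        "5.6 Educational.Context",
        "5.7 Educational.Typical Age Range",
        "5.11 Educational.Language",
    ],
}

# Relation -> connection features (an assumed excerpt of the Step 1 table).
RELATION_FEATURES = {
    "IsFormatOf": ["Intellectual content"],
    "HasPart": ["Whole"],
    "Requires": ["Understanding"],
}


def elements_influenced_by(relation):
    """List the LOM elements for which this relation can trigger a rule."""
    elements = []
    for feature in RELATION_FEATURES.get(relation, []):
        for element in FEATURE_TO_ELEMENTS.get(feature, []):
            if element not in elements:
                elements.append(element)
    return elements
```

Each (relation, element) pair returned this way is a candidate conditions part for one inference rule; the actions part still has to be chosen by hand.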
3.4 Step 4: Specifying the Influence Type of the Metadata Elements’ Values

An inference rule generating the value of a metadata element uses, as its conditions part, the relations whose connection features correspond to the element. For each one of these relations a single rule is created. For example, for the metadata element “1.3 General.Language” two rules are created, having as their conditions part the relations “HasPart” and “IsPartOf” (as the connection features “Whole” and “Part” were mapped to this metadata element, since it is not hard to conclude that the value of “1.3 General.Language” of a learning object can influence the value of this element of a learning object that either contains or is part of the first one). The actions part of every rule specifies the exact type of influence on the metadata element’s values. The value of a metadata element may be influenced in one of the following three ways:

• Inclusion of metadata values from related resources, according to which a resource’s metadata element values (for elements with cardinality greater than 1) are added to (included in) the values of the same metadata element of a related resource. A resource can include metadata values from its parts as a result of a whole relation (the inclusion relations of the resources are transferred to their metadata elements).

• Computation of a metadata value from metadata values of related resources. The metadata element value of a resource is the result of a mathematical or logic expression (which has to be specified) over metadata element values of related resources.

• Restriction of the range of values of a metadata element, according to which the range of values of a metadata element of a resource is not the complete value space defined by the specification of the standard, but a proper subset of it computed from the values of the same metadata element of related resources (the exact expression has to be specified).

The first two types of influence automatically generate metadata values, while the third one facilitates the task of manual indexing. Thus, the rules take the form: «If resource a is related to resource b with relation σ, then the value of the metadata element m of a is determined by the value of metadata element m of b according to one of the three types of influence defined above». Considering the examples stated above, in order to obtain the actions part of the rules, one has to specify the exact type of influence on the metadata element values. Thus, if learning object a contains (“HasPart”) learning object b, the value of the metadata element “1.3 General.Language” of b will be included (“inclusion of metadata values” type of influence) in the values of the same metadata element of a. Following the same logic, if a learning object “IsPartOf” two or more learning objects, then the range of values of its language will be restricted (“restriction of the range of values” type of influence) to the intersection of the values of the languages of its supersets.
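The three types of influence can be sketched as three small functions. The Python fragment below is a minimal illustration under assumptions: the function names are hypothetical, metadata values are modelled as plain lists and sets, and the language codes and keyword are invented sample values rather than data from the paper.

```python
def include_values(own_values, related_values):
    """Inclusion: add the related resources' values to the resource's own.

    Applies to multi-valued elements; order is kept, duplicates are skipped.
    """
    merged = list(own_values)
    for values in related_values:
        for value in values:
            if value not in merged:
                merged.append(value)
    return merged


def compute_value(related_values, expression):
    """Computation: apply a rule-specific expression to the related values."""
    return expression(related_values)


def restrict_range(value_space, related_values):
    """Restriction: keep only values shared by all related resources."""
    allowed = set(value_space)
    for values in related_values:
        allowed &= set(values)
    return allowed


# a "HasPart" b: the languages of b are included in those of a.
languages_of_a = include_values(["en"], [["en", "el"]])

# An object that "IsPartOf" two supersets: its language is restricted to
# the intersection of the supersets' languages.
allowed = restrict_range({"en", "el", "fr"}, [["en", "el"], ["en", "fr"]])

# Equality as the simplest computation: copy the single related value.
keywords = compute_value([["calculus"]], lambda related: related[0])
```

Only the first two functions produce a value to store; `restrict_range` returns a narrowed set of candidates that an indexer would still choose from, in line with the distinction drawn above.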
4 Application Example
Applying the proposed methodology to LOM yielded more than 80 inference rules. A brief description follows of the reasoning used to obtain a small sample of these rules, which affect the values of certain LOM metadata fields:

• Metadata field “1.5 General.Keyword”: The connection feature “Intellectual content” of the relation “IsFormatOf” can be mapped to this element (keywords are determined by the content of an object). The value of this metadata element of a learning object is directly connected to the value of this element of a related object with which it shares the same intellectual content. The influence type of the element’s value is “computation of metadata value” (equality). In free text, the rule takes the form “learning objects that differ only in their format (they have the same content) will have the same keywords”.

• Metadata field “1.8 General.Aggregation Level”: It maps to the connection feature “Whole” (“Superset”) of the relation “HasPart”, in the sense that the value of this metadata element of a learning object is affected by the value of the same element of objects that are parts of the object to be indexed. Ignoring any formal notation, the rule may be expressed as “the aggregation level of the superset will be greater by 1 than the maximum aggregation level of its parts” (since a collection of learning objects of some aggregation level constitutes a learning object with aggregation level increased by 1). The element’s value is influenced through “computation of metadata value”.

• Metadata field “4.1 Technical.Format”: This element maps to the connection feature “Format” of the relation “IsVersionOf”. “Computation of metadata value” (equality) is the influence type of the element’s value. These metadata elements of the related objects are equal, since “different versions of a learning object maintain the same technical format”.

• Metadata field “5.4 Educational.Semantic Density”: It maps to the connection feature “Intellectual content” of the relation “IsFormatOf” (the semantic density of a learning object is a property of its content). The element’s value is influenced through “computation of metadata value” (equality). Thus, the rule, in free text, is formulated as “learning objects that differ only in their format (they have the same content) will have the same semantic density”.
• Metadata field “5.7 Educational.Typical Age Range”: It maps to the connection feature “Understanding” of the relation “Requires”, in the sense that the value of this metadata element of the object to be indexed can be affected by the value of the same element of the objects required in order for the first one to be understood. The influence on the element’s value is of type “restriction of the range of values”. The rule, in simple words, is expressed as “if a learning object requires others, then the typical age range of its intended user will be greater than the maximum typical age range of the objects it requires”.

• Metadata field “5.11 Educational.Language”: It maps to the connection feature “Understanding” of the relation “IsRequiredBy”, in the sense that the value of this metadata element of the object to be indexed can be affected by the value of the same element of the object that requires it in order to be understood. “Computation of metadata value” (equality) is the type of influence on the element’s value. Simply put, the rule takes the form “if a learning object is required by another one, then the human language used by the typical intended user of this object will be the same as the corresponding language of the object that requires it”.

• Metadata field “6.1 Rights.Cost”: This element maps to the connection feature “Part” (“Subset”) of the relation “IsPartOf”, in the sense that the value of this metadata element of a learning object which is part of others can be affected by the value of this element of the objects containing it. The element’s value is influenced through “computation of metadata value” (equality on condition). Without any formal notation, the rule can be expressed as “if a learning object is part of other ones, then it has no cost if at least one of the objects containing it has no cost”.
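Two of the sample rules above involve a computation that is more than plain equality, and they can be sketched directly in Python. The function names below are hypothetical, the inputs are invented sample values, and returning `None` when the cost rule does not fire is a design choice made here, not something the paper specifies.

```python
def aggregation_level_from_parts(part_levels):
    """Rule for "1.8 General.Aggregation Level" (relation "HasPart").

    The superset's level is one more than the maximum level of its parts.
    LOM defines levels 1 to 4, so a real implementation would also have
    to decide what to do when the result exceeds 4; that is not handled.
    """
    return max(part_levels) + 1


def cost_from_containers(container_costs):
    """Rule for "6.1 Rights.Cost" (relation "IsPartOf").

    Costs use the LOM vocabulary "yes" / "no". The part has no cost if at
    least one containing object has no cost; otherwise the rule does not
    fire and None is returned (an assumption of this sketch).
    """
    if "no" in container_costs:
        return "no"
    return None


superset_level = aggregation_level_from_parts([1, 2, 1])
part_cost = cost_from_containers(["yes", "no"])
undetermined = cost_from_containers(["yes", "yes"])
```

The cost rule is an example of “equality on condition”: it copies a value only in the one case where the inference is safe and otherwise leaves the field for the indexer.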
To demonstrate the effectiveness of the created rules in the automatic generation of metadata of related learning objects, and thus show the practical applicability of the proposed approach, an application scenario is introduced involving four learning objects (a, b, c and d). The LOM descriptions of three of them (a, b and c) are already stored in a database, while the fourth one (d) is to be indexed. More specifically, the learning objects are defined as follows: a is a Microsoft Word file containing a differential calculus exercise; b is a PostScript file that contains the entire manual of calculus theory; c is a web page containing the part of calculus theory that concerns differential calculus; d is a web page containing the calculus theory taken from the contents of learning object b. The relations connecting objects a, b and c (as already registered in the database) are: “a IsReferencedBy b”, “a IsReferencedBy c”, “b References a”, “c References a”. To take advantage of the created inference rules so as to automatically index learning object d, the indexer has to provide the relations connecting d to the already indexed learning objects, and thus manually fills out the relations “d IsFormatOf b” and “d HasPart c”. Considering inverse relations and relation transfer, the implicit relations automatically produced are: “b HasFormat d”, “c IsPartOf d”, “a IsReferencedBy d” (as a result of relation transfer from “a IsReferencedBy b” and “b HasFormat d”) and “d References a”.

As a result of relation “d IsFormatOf b”, applying specific rules of computation of metadata value (in this case, equality), the metadata fields of d inherited from b are: “1.2 General.Title”, “1.4 General.Description”, “1.5 General.Keyword”, “1.6 General.Coverage”, “5.2 Educational.Learning Resource Type”, “5.5 Educational.Intended End User Role”, “5.6 Educational.Context” and all fields of category “9 Classification” (provided the purpose of classification is related to the content of the learning objects). As a result of relation “d HasPart c”, applying either specific rules of inclusion of metadata values or specific rules of computation of metadata value, the metadata fields of d whose values are partly influenced by the same metadata values of c are: “1.3 General.Language”, “1.8 General.Aggregation Level”, “2.2 Lifecycle.Status”, “2.3 Lifecycle.Contribute”, “4.1 Technical.Format”, “4.4 Technical.Requirement”, “5.1 Educational.Learning Resource Type”, “5.8 Educational.Difficulty”, “5.9 Educational.Typical Learning Time”, “5.11 Educational.Language”, “6.1 Rights.Cost”, “6.2 Rights.Copyright and Other Restrictions”, “8 Annotation”. The outcome of this process is that a considerable number of initially empty fields of a learning object that has not been indexed are automatically populated with values computed from the values of the respective metadata fields of already indexed related learning objects.
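The population of d’s record in this scenario can be sketched as a rule table applied to the relations the indexer entered. The Python fragment below is an illustration only: the rule table covers a handful of the fields listed above, the names `RULES` and `populate` are hypothetical, and every metadata value in `records` is invented sample data, not taken from the paper or from a real repository.

```python
def copy_value(current, related):
    """Computation by equality: take over the related object's value."""
    return related


def include(current, related):
    """Inclusion: add the related object's values to those already there."""
    merged = list(current or [])
    merged += [value for value in related if value not in merged]
    return merged


def level_above(current, related):
    """Computation: one level above the highest-level part seen so far."""
    return max(current or 0, related + 1)


# relation -> {field: influence function}; a small excerpt of the fields
# the text lists for "IsFormatOf" and "HasPart".
RULES = {
    "IsFormatOf": {
        "1.2 General.Title": copy_value,
        "1.4 General.Description": copy_value,
        "1.5 General.Keyword": copy_value,
    },
    "HasPart": {
        "1.3 General.Language": include,
        "1.8 General.Aggregation Level": level_above,
    },
}


def populate(relations, records):
    """Build a partial record from (relation, related object) pairs."""
    target = {}
    for relation, other in relations:
        source = records[other]
        for field, influence in RULES.get(relation, {}).items():
            if field in source:
                target[field] = influence(target.get(field), source[field])
    return target


# Invented sample records for the already indexed objects b and c.
records = {
    "b": {
        "1.2 General.Title": "Calculus manual",
        "1.4 General.Description": "Complete manual of calculus theory",
        "1.5 General.Keyword": ["calculus"],
        "4.1 Technical.Format": "application/postscript",
    },
    "c": {
        "1.3 General.Language": ["en"],
        "1.8 General.Aggregation Level": 1,
    },
}

record_d = populate([("IsFormatOf", "b"), ("HasPart", "c")], records)
```

With these sample values, `record_d` takes its title, description and keywords from b, its language from c, and an aggregation level one above c’s. The technical format of b is not copied, because “IsFormatOf” relates objects that differ in format.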
5 Conclusion – Future work
This paper introduced a step-by-step methodology to guide the process of identifying a set of inference rules for generating metadata of a resource based on the metadata of its related resources. The presentation of the methodology is accompanied by examples of its practical application to the LOM metadata schema, demonstrating the effectiveness of the proposed approach in the automatic generation of metadata. The process is integrated in a theoretical construction whose foundation is the semantics of the relations connecting the resources to be indexed. At present, work is ongoing to apply the entire set of created rules by means of a software module that carries out the automatic generation of metadata and stores them in a LOM native XML database. The whole project is being developed as an Integrated Metadata Management System (IMMS) that incorporates the abilities for editing, storing, retrieving and automatically generating new metadata.
References
1. Bourda Y, Doan B-L, Kekhia W (2002) A semi-automatic tool for the indexation of learning objects. In Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2002 (pp. 190-191). Chesapeake, VA: AACE.
2. Brase J, Painter M, Nejdl W (2003) Completion axioms for learning object metadata - Towards a formal description of LOM. In 3rd international conference on advanced learning technologies (ICALT). Athens, Greece.
3. Cardinaels K, Meire M, Duval E (2005) Automating metadata generation: the simple indexing interface. In Proceedings of the 14th international conference on World Wide Web, Chiba, Japan.
4. Dublin Core Metadata Initiative (2002) DCMI Metadata Terms. http://dublincore.org/documents/dcmi-terms/. Accessed 15 March 2007.
5. Dublin Core Metadata Initiative (2005) Using Dublin Core – The Elements. http://dublincore.org/documents/usageguide/elements.shtml. Accessed 15 March 2007.
6. Doan B-L, Bourda Y (2005) Defining several ontologies to enhance the expressive power of queries. In volume 143 of CEUR, workshop Proceedings, on Interoperability of web-based Educational Systems, held in conjunction with WWW’05 conference, Chiba, Japan, Technical University of Aachen (RWTH).
7. Duval E (2001) Metadata Standards: What, Who & Why? Journal of Universal Computer Science, Vol. 7, no 7, 2001, pp. 591-601.
8. Duval E, Hodgins W (2004) Making metadata go away - hiding everything but the benefits. In DC-2004: Proceedings of the International Conference on Dublin Core and Metadata Applications, pp. 29–35.
9. E-Learning Consortium (2003) Making sense of learning specifications and standards: A decision’s maker’s guide to their adoption (2e). The Masie Centre. http://www.masie.com/standards/s3-2nd-edition.pdf. Accessed 15 March 2007.
10. Engelhardt M et al (2006) Reasoning about eLearning Multimedia Objects. In J. Van Ossenbruggen, G. Stamou, R. Troncy, V. Tzouvaras (Ed.) Proc. of WWW 2006, Intern. Workshop on Semantic Web Annotations for Multimedia (SWAMM).
11. Friesen N (2004) International LOM Survey: Report (Draft). http://dlist.sir.arizona.edu/403/01/LOM_Survey_Report2.doc. Accessed 15 March 2007.
12. Greenberg J, Spurgin K, Crystal A (2005) AMeGA (Automatic Metadata Generation Applications) Project, University of North Carolina.
13. Greenberg J, Spurgin K, Crystal A (2006) Functionalities for Automatic-Metadata Generation Applications: A Survey of Metadata Experts’ Opinions. International Journal of Metadata, Semantics and Ontologies. Vol. 1, No. 1, 2006.
14. Hatala M, Richards G (2003) Value-added metatagging: Ontology and rule based methods for smarter metadata. In RuleML, pp. 65–80.
15. Horton W, Horton K (2003) E-learning Tools and Technologies. Indianapolis: Wiley Publishing.
16. IEEE. 1484.12.1 (2002) Draft Standard for Learning Object Metadata. Learning Technology Standards Committee of the IEEE. http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf. Accessed 15 March 2007.
17. Liddy E D et al (2002) Automatic metadata generation & evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research.
18. Margaritopoulos M, Manitsaris A, Mavridis I (2007) On the Identification of Inference Rules for Automatic Metadata Generation. In Proceedings of the 2nd International Conference on Metadata and Semantics Research (CD-ROM), Ionian Academy, Corfu, Greece.
19. Motelet O (2005) Relation-based heuristic diffusion framework for LOM generation. In Proceedings of 12th International Conference on Artificial Intelligence in Education AIED 2005 - Young Researcher Track, Amsterdam, Holland.
20. Najjar J, Ternier S, Duval E (2003) The Actual Use of Metadata in ARIADNE: an Empirical Analysis. In Proceedings of ARIADNE Conference 2003.
21. Sicilia M A et al (2005) Complete metadata records in learning object repositories: some evidence and requirements. International Journal of Learning Technology, 1(4), pp. 411-424.
22. Steinmetz R, Seeberg C (2003) Meta-Information for Multimedia eLearning. Computer science in perspective, Springer-Verlag New York, Inc.