Automatic Generation of Metadata for Learning Objects

Paramjeet Singh Saini
Dept. of Information and Communication Technology, University of Trento, Italy
[email protected]

Marco Ronchetti
Dept. of Information and Communication Technology, University of Trento, Italy
[email protected]

Diego Sona
SRA Division, ITC-IRST, Trento, Italy
[email protected]

Abstract
Proper reuse of learning objects depends on both the amount and the quality of the attached semantic metadata, such as "learning objective", "related concept", etc. Manually expressing such metadata is a time-consuming and expensive task. Here we present an approach based on a probabilistic model which, through the automatic classification of learning resources into a given taxonomic organization of the knowledge, associates ontological metadata to the learning resources.

1. Introduction
One of the most effective solutions to aid the information retrieval process in the e-learning domain is to organize the learning resources into domain-specific repositories, frequently referred to as Learning Object Repositories (LORs). These LORs are collections of learning resources, or Learning Objects (LOs), which are generally accessible and searchable through the Web. Ariadne (http://www.ariadne-eu.org/), EdNA (http://www.edna.edu.au/edna/page1.html), and Merlot (http://www.merlot.org/) are some examples of existing LORs. LORs are organized at a low level of granularity: Web pages, presentations and, more in general, multimedia objects. Learning Management Systems (LMSs), on the contrary, try to organize LOs at a higher level of granularity, i.e. into "courses". The problem is that existing techniques for the aggregation and visualization of LOR resources assume that LOs are equipped with sufficient metadata (see e.g. [8]).
An efficient reuse and visualization of LOs is enabled by these techniques only if the metadata encoded in the meta-tags provide a sufficient description of the LOs. A poor description negatively affects the possibility of reusing LOs in an LMS. Clearly, a good and homogeneous description of LOs requires manual intervention by experts who associate these meta-tags to the resources. This operation can be extremely time consuming and expensive. For this reason it is quite hard to find LORs covering wide domains, and moreover, such LORs are mostly still in an emerging stage. Hence, there is a need for a system that helps processing LOs by automatically assigning a value to the meta-tags (such as "learning objectives", "related topics", "pedagogical purpose", "time required to learn", "main concept", etc.) in order to reduce the human intervention.

In this paper we present an approach that automatically associates most of the above-mentioned metadata to LOs. The first ingredient of such a model is an exhaustive description of the learning domain. More precisely, to express such descriptions we need to agree on a common vocabulary that specifies what we mean. Here is where the notion of ontology comes in. An ontology abstracts the essence of the concepts, and allows one to distinguish various kinds of items and to define the relationships among them. Equipped with this high-level representation of the learning domain, the system can map LOs to the suitable concepts in the ontology, and consequently it can associate a number of meta-tags to the LOs.

For the current implementation of the system we decided to represent the domain knowledge with (concept) taxonomies, which are a simplified version of an ontology where only hierarchical relationships are used. Taxonomies are trees, where each node identifies a concept and the edges connecting the nodes describe the relationships between the concepts. Moreover, in
the current version, LOs are just textual documents. Hence, the task here is to automatically classify documents into a given taxonomy.

Document classification has been addressed many times within the Information Retrieval and Machine Learning communities (see e.g. [13, 12, 5, 2]). All the proposed models, however, are based on supervised training strategies, where classifiers are trained with a set of labeled training data. This means that a manual intervention is required to acquire class labels for a proper set of training data. To avoid this hand-labeling, the idea is to classify LOs on the given taxonomy without any labeled example. This means that only unlabeled objects are available. We can, however, use the prior knowledge about the ontology, which is provided in terms of both the keywords (and possibly the descriptions) associated to each concept, and the topology of the classes. We refer to this task as bootstrapping [7]. This task, which can be seen as unsupervised, is hard, and no classification scheme can give results as good as in a supervised training framework. In any case, our aim is to provide evidence that this approach can be helpful in relieving the expert from part of the work of assigning meta-tags to LOs.

To solve this classification problem we used a probabilistic clustering approach. In particular, we used the Expectation Maximization (EM) algorithm with a proper initialization of the parameters, taking advantage of the prior knowledge available in the taxonomy [14]. We compared the results against a much simpler Naive Bayes (NB) classifier tailored for the problem at hand. The experiments were made on a LOR, focused on the computer science domain, that we created.

Sections 2 and 3 describe the taxonomic representation of an e-learning domain and the task, respectively. Section 4 introduces the classification model, and Section 5 provides a description of the experimental settings with some results. Finally, Sections 6 and 7 discuss related work and draw some conclusions and future directions.
2. Taxonomic Description of a Learning Domain
As previously outlined, we adopt the notion of a taxonomy adjoined with a uniform set of metadata used to generate the descriptions associated to LOs. Clearly, different taxonomies are determined for different learning domains. In particular, the creation of a taxonomy should be based on two principles that are difficult to achieve. First, the taxonomy should cover the target domain in an exhaustive way. Second, the taxonomic
description of the domain should be accepted by a broad community. To satisfy these principles, our approach is to extract the taxonomies from the proposals of "knowledge organizations" that are produced by standards associations and large communities.

[Figure 1. A taxonomic description: a tree whose Area node (Discrete Structure, with a Description) contains a Unit node (Basic Logic, with a Learning Objective and a Time Required), which in turn contains the Topic nodes Truth Table, Proposition Logic and Predicate Logic.]

For example, to create a taxonomy for computer science knowledge we took the ACM Computing Curricula 2001 for Computer Science (CCCS) [3], which defines a suite of courses. To create the taxonomy we extracted the "body of knowledge" from CCCS and arranged it in a form that allows some elementary reasoning (see [11]). In particular, the taxonomy has been built identifying three layers of knowledge representation: topics, units and areas (see Figure 1). Topics are the finest-grain elements (the most specific concepts), which are collected into units. Units are in turn organized into areas. An important point to observe in Figure 1 is that the body of knowledge associated to the taxonomy is now quite detailed. All nodes have some keywords describing their meaning; area nodes are associated with short descriptions, and unit nodes are associated with an expected learning time and a learning objective.
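The three-layer structure just described can be captured with a small data structure. The following is an illustrative sketch only: the node fields and the concept names are taken from Figure 1, not from the actual CCCS body of knowledge, and the helper names are our own.

```python
# Toy sketch of the three-layer taxonomy (area -> unit -> topic).
# Field and concept names are illustrative, mirroring Figure 1.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                        # "area", "unit", or "topic"
    keywords: list = field(default_factory=list)
    children: list = field(default_factory=list)
    time_required: str = None        # unit nodes only
    learning_objective: str = None   # unit nodes only
    description: str = None          # area nodes only

def metadata_for(area, unit, topic):
    """Derive Table-1-style metadata for a LO classified into `topic`."""
    related = [t.name for t in unit.children if t is not topic]
    return {
        "Main Discipline": area.name,
        "Sub Discipline": unit.name,
        "Main Concept": topic.name,
        "Expected Time": unit.time_required,
        "Learning Objectives": unit.learning_objective,
        "Related Topics": related,
    }
```

Once a LO is mapped to a topic node, the remaining metadata fields follow mechanically from the node's ancestors and siblings, which is what makes the classification step the only hard part of the pipeline.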
3. The Task
Our goal is to map, or classify, a given set of LOs to the most appropriate topic in a given taxonomy (i.e., into a leaf node of the tree). This classification is then used to automatically associate some metadata to the LOs themselves. Specifically, the system is presented with a taxonomy and a collection of LOs. Each node in the taxonomy is labeled with a few lexical terms, or keywords, describing the concept itself. The system performs a classification of the LOs exploiting the prior knowledge encoded in the taxonomy, and then associates the proper metadata to the LOs. For example, suppose that a LO is classified as "proposition logic" in Figure 1: the metadata associated to the LO would be as shown in Table 1.
Table 1. Associated metadata.
  Main Discipline:      Discrete Structure
  Sub Discipline:       Basic Logic
  Main Concept:         Proposition Logic
  Expected Time:        10 hours (for the unit)
  Learning Objectives:  Overview of the main discipline
  Related Topics:       Truth Table, Predicate Logic

4. The Classification Process
The Naive Bayes (NB) classifier (see [16, 4]) is a simple and frequently adopted probabilistic model, used to classify objects into predefined categories. Usually the parameters of the classifier are learnt from a set of labeled examples, and then used to classify new objects. In particular, the classification of a new document d_i is performed by computing, for all classes c_j ∈ C, the document-conditioned class probability as follows:

  P(c_j | d_i) ≈ P(c_j) ∏_{w_k ∈ d_i} P(w_k | c_j)   (1)

where P(w_k | c_j) is the class-conditioned probability of word w_k, and P(c_j) is the class prior probability. These two probabilities can then be computed as follows:

  P(w_k | c_j) = ∑_{d_t ∈ D} N(w_k, d_t) P(c_j | d_t)  /  ∑_{w_s ∈ V} ∑_{d_t ∈ D} N(w_s, d_t) P(c_j | d_t)   (2)

and

  P(c_j) = ( ∑_{d_t ∈ D} P(c_j | d_t) ) / |D|   (3)

where N(w_k, d_t) is the frequency of word w_k in document d_t, V is the vocabulary, D is the set of all documents, and |D| is the total number of documents in the corpus. Clearly, these probabilities are estimated using the statistics of the labeled document corpus D. Notice that, for a labeled document d_t ∈ D, the posterior probability P(c_j | d_t) is either 1 or 0 according to whether d_t belongs to class c_j or not. Hence Equation (2) just computes the normalized frequency of the terms in the given class.

In our case, however, there are no such labeled examples. Hence, we need to determine a good initialization for both the class-conditioned word probabilities P(w_k | c_j) and the class prior probability P(c_j). Even if the prior knowledge is quite poor, we can exploit it to initialize the NB model. In particular, we can exploit the keywords in the descriptions of the nodes to initialize the class-conditioned word probabilities as follows:

  P(w_k | c_j) = 0.9 if word w_k labels node c_j, 0.1 otherwise   (4)

Moreover, a good way to estimate the class prior probabilities, not having any knowledge of the real distributions, is to assume a uniform distribution of the data over all classes:

  P(c_j) = 1 / |C|   (5)

where |C| is the total number of classes.

This approach, however, provides minimal benefits in the classification process, for two reasons. First, classification can result in a high rejection rate: since the classifier only uses the node labels, for many documents it can be difficult to disambiguate between different classes. Second, the accuracy is low, because the small number of keywords at the nodes gives insufficient information, leading to a high error rate.

The described NB approach only uses part of the available knowledge. There is still a corpus of unlabeled documents that can be used to further increase the amount of knowledge involved in the process. A way to also exploit this knowledge is to adopt an Expectation Maximization (EM) approach, a well-known probabilistic clustering algorithm. The main drawback of this kind of algorithm is that, after clustering the data, it is quite hard or even impossible to understand how to link the obtained clusters to the desired classes. This means that under these conditions we cannot train a classifier for our taxonomies using EM. On the other hand, starting from a set of unlabeled examples, EM organizes the data by similarity; hence, with a good initialization of the EM parameters we can hope to obtain a good clustering of the data. In our experiments we observed that, using Equations (4) and (5) as starting seeds for EM, we maintain a linkage between the classes and the clusters during learning. Hence, once the model is learnt, we can use the estimated class-conditioned word probabilities P(w_k | c_j) to perform classification of the documents using Equation (1).
Figure 2. Outline of the algorithm:
  1. Initialize the classifier parameters using Eqs. (4) and (5).
  2. E-step: classify the documents using Eq. (1).
  3. M-step: compute new parameters using Eqs. (2) and (3).
  4. Iterate the E-step and M-step until the model converges.
  5. Output: a Bayesian classifier using Eq. (1).

At each iteration, the EM algorithm uses the classification results on the entire data set to re-estimate the classifier parameters used at the following iteration (see Figure 2).
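The bootstrapping procedure outlined in Figure 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the Laplace-style smoothing term in the M-step and all data used below are our own additions, not part of the paper's setup; the seed values 0.9/0.1 and the uniform prior follow Eqs. (4) and (5).

```python
# Minimal sketch of keyword-seeded NB + EM bootstrapping (Figure 2).
# The smoothing term is an illustrative assumption to avoid zero probabilities.
import numpy as np

def seed_params(vocab, class_keywords):
    """Eqs. (4)-(5): 0.9 for keyword entries, 0.1 elsewhere (row-normalized);
    uniform class priors."""
    C, V = len(class_keywords), len(vocab)
    word_given_class = np.full((C, V), 0.1)
    for j, keywords in enumerate(class_keywords):
        for w in keywords:
            word_given_class[j, vocab[w]] = 0.9
    word_given_class /= word_given_class.sum(axis=1, keepdims=True)
    return word_given_class, np.full(C, 1.0 / C)

def e_step(counts, word_given_class, prior):
    """Eq. (1) in log space: posterior P(c_j | d_i) for every document."""
    log_post = np.log(prior) + counts @ np.log(word_given_class).T
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def m_step(counts, post, smooth=1e-2):
    """Eqs. (2)-(3): re-estimate parameters from the soft assignments."""
    expected = post.T @ counts + smooth               # expected word counts per class
    word_given_class = expected / expected.sum(axis=1, keepdims=True)
    prior = post.sum(axis=0) / post.shape[0]
    return word_given_class, prior

def em_bootstrap(counts, vocab, class_keywords, iters=20):
    """Iterate E and M steps from the keyword seeds; return a class per document."""
    wgc, prior = seed_params(vocab, class_keywords)
    for _ in range(iters):
        post = e_step(counts, wgc, prior)
        wgc, prior = m_step(counts, post)
    return e_step(counts, wgc, prior).argmax(axis=1)
```

Given a document-term count matrix and one or two keywords per class, `em_bootstrap` returns a class index for each unlabeled document; because the keyword seeds anchor the clusters to the taxonomy nodes, the cluster-to-class linkage is preserved throughout training.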
5. Experimental Evaluation
To evaluate the model in a real-world scenario, we selected two sub-taxonomies from the ACM Computing Curricula: Intelligent Systems (IS) and Net Centric Computation (NCC). Then, we manually created the LORs, collecting the resources from the Web and classifying them into the two given taxonomies. LOs were collected by making simple queries to the Google search engine. These queries were constructed using the descriptions associated to the classes in the taxonomies. The collected resources were then filtered to discard unfitting examples (like research papers, books, etc.). Observe that a third LOR was created by simply joining the two collected LORs. This was done in order to see whether the model scales with the dimension of the taxonomies. Table 2 shows some data statistics for the created LORs. Starting from the taxonomies and the data, a set of terms for each taxonomy was selected as vocabulary. In particular, the vocabulary was determined separately for each of the three taxonomies. Finally, observe that on average there are little more than two unique keywords describing each class. Remember that these are the keywords used to initialize NB and EM. The evaluation of the proposed model was done using the Micro-F1 measure [15]. This is a standard information retrieval measure that combines precision (i.e., the ratio between the number of correctly classified documents and the number of classified documents) and recall (i.e., the ratio between the number of correctly classified documents and the number of documents that should be classified).
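As a concrete reading of these two definitions, the sketch below computes the measure for a classifier that may abstain on some documents. The convention of marking an abstention with `reject` (here `None`) is our illustrative assumption, not the paper's protocol.

```python
# Hypothetical sketch of the Micro-F1 computation described above.
# Predictions equal to `reject` model documents the classifier refuses to classify.
def micro_f1(true_labels, pred_labels, reject=None):
    classified = [(t, p) for t, p in zip(true_labels, pred_labels) if p != reject]
    correct = sum(1 for t, p in classified if t == p)
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(true_labels)   # denominator: all documents to classify
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that with abstentions precision and recall differ, which is why the harmonic mean is informative here; without abstentions the two ratios (and hence F1) coincide with plain accuracy.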
5.1. Results
Table 2 also shows the classification results of the two proposed models on the three LORs. These results reveal that, even without labeled examples to use as a supervised training set, the models have a fairly good capacity to correctly classify the LOs.

Table 2. Data statistics and model evaluation.
  LORs     Docs/Classes   Vocabulary   Keywords   NB        EM
  IS       236/21         600          41         52.72%    69.49%
  NCC      176/20         550          50         31.81%    51.70%
  IS+NCC   412/41         600          91         43.41%    60.67%
In all cases, NB initialized using only the node labels is a fairly good baseline classifier. If this initialization is followed by the unsupervised EM training algorithm, which also uses the unlabeled examples, the classifier becomes much better. The surprise here is that, contrary to a common conception, clustering algorithms can maintain the adherence of the resulting clusters to their initial meaning (the classes) if properly initialized. The reason why EM outperforms NB is that it finds more robust parameters using both the node descriptions (keywords) and the content of the documents. This results in both a reduced standard deviation of the model parameters and a reduced sensitivity to the keywords. Another interesting result is that the quality of the classifier for the third LOR, constructed by joining IS and NCC, does not degrade. This shows that the model can scale up with the number of classes.
6. Related Work
In the e-learning domain very little research has been done in the field of automatic metadata generation. Some existing approaches that deal with automatic metadata generation focus on adding more metadata to LOs that are already equipped with some amount of meta-knowledge (see [10]). Our approach, on the contrary, deals with raw LOs that do not have any sort of meta-knowledge; hence, the type of metadata generated by our approach is a first step toward making such raw LOs usable in practice.

To avoid manual intervention in manufacturing training samples, many models have been proposed by the information retrieval community. The approach presented in [2] is quite different in its methodology, but similar in its objective, i.e., to overcome the burden of preliminary labeling. Their main idea is to automatically train the classifier through a Web corpus, collected by making queries, which still requires some degree of manual intervention. The model proposed in [1] first clusters documents into categories, then performs feature selection. Users are then asked to choose some representative words, which are used for performing automatic classification. This model reduces the manual intervention to some extent. In [9] the estimates are improved by differentiating the words in the hierarchy according to their level of specificity. In this way, inner nodes play an important role in improving the estimates. This approach needs only a small number of training examples.
7. Conclusion and Future Work
In this paper we presented an approach for the automatic generation of metadata for LOs. The main ingredients of our model are the ontology and the classifier. The classifier automatically classifies the LOs according to the concepts in the given ontology. Ontological metadata are then automatically attached to the classified LOs. The proposed approach is easy to implement and reduces the amount of manual intervention needed to integrate the LOs with metadata.

Notice that there is a further kind of prior information encoded in a taxonomy, namely the topology of the classes, i.e., their relationships. These relationships could be exploited to improve the robustness of the classifier parameters [6]. We plan to investigate such models to improve the accuracy. Moreover, since automatic classification is not perfect, the process of creating a LOR needs validation. In any case, checking a classification is a less labor-intensive process than performing the classification. Finally, since the given models also allow a number of labeled examples to enter the training process together with the unlabeled LOs, the quality of the models can be improved using the feedback of the expert.
References
[1] B. Liu, X. Li, W. S. Lee, and P. Yu, "Text Classification by Labeling Words", Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2004.
[2] C. C. Huang, S. L. Chuang, and L. F. Chien, "LiveClassifier: creating hierarchical text classifiers through web corpora", Proceedings of the 13th International Conference on World Wide Web, pages 184-192, 2004.
[3] Computing Curricula 2001 Computer Science, The Joint Task Force on Computing Curricula: IEEE Computer Society - Association for Computing Machinery, 2001.
[4] D. D. Lewis, "Naive Bayes at forty: The independence assumption in information retrieval", 10th European Conference on Machine Learning, 1998.
[5] D. Koller and M. Sahami, "Hierarchically classifying documents using very few words", Proceedings of the 14th International Conference on Machine Learning (ICML), pages 170-178, 1997.
[6] D. Sona, S. Veeramachaneni, P. Avesani, and N. Polettini, "Clustering with propagation for hierarchical document classification", ECML Workshop on Statistical Approaches to Web Mining (SWAM), pages 50-61, 2004.
[7] G. Adami, P. Avesani, and D. Sona, "Clustering documents into a web directory for bootstrapping a supervised classification", Data & Knowledge Engineering, Elsevier, 54:301-325, 2005.
[8] J. Klerkx, M. Meire, S. Ternier, K. Verbert, and E. Duval, "Information visualization: Towards an extensible framework for accessing learning object repositories", ED-MEDIA, pages 4281-4288, 2005.
[9] K. Toutanova, F. Chen, K. Popat, and T. Hofmann, "Text classification in a hierarchical mixture model for small training sets", Proceedings of the Tenth International Conference on Information and Knowledge Management, pages 105-113, 2001.
[10] K. Cardinaels, M. Meire, and E. Duval, "Automating Metadata Generation: the Simple Indexing Interface", WWW-2005 International World Wide Web Conference, 2005.
[11] M. Ronchetti and P. Saini, "Ontology-based metadata for e-learning in the computer science domain", IADIS e-Society Conference, 2003.
[12] M. Ruiz and P. Srinivasan, "Hierarchical text categorization using neural networks", Information Retrieval, pages 87-118, 2002.
[13] M. Ceci and D. Malerba, "Web-pages classification into a hierarchy of categories", 25th European Conference on Information Retrieval, 2003.
[14] P. Saini, D. Sona, S. Veeramachaneni, and M. Ronchetti, "Making e-learning better through machine learning", International Conference on Methods and Technologies for Learning, 2004.
[15] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[16] T. Mitchell, Machine Learning, McGraw Hill, 1997.