Exploring Characterizations of Learning Object Repositories Using Data Mining Techniques

Alejandra Segura1, Christian Vidal1, Victor Menendez2, Alfredo Zapata2, and Manuel Prieto3

1 Univ. del Bio-Bio, Avda. Collao 1202, Concepción, Chile
2 Univ. Autónoma de Yucatán. Periférico Norte, 13615, 97110 Mérida, Yucatán, México
3 Univ. de Castilla-La Mancha. ESI. Po. de la Universidad 4, 13071 Ciudad Real, Spain

[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. Learning object repositories provide a platform for the sharing of Web-based educational resources. As these repositories evolve independently, it is difficult for users to have a clear picture of the kind of contents they give access to. Metadata can be used to automatically extract a characterization of these resources by using machine learning techniques. This paper presents an exploratory study carried out on the contents of four public repositories that uses clustering and association rule mining algorithms to extract characterizations of repository contents. The results of the analysis include potential relationships between different attributes of learning objects that may be useful to gain an understanding of the kind of resources available and eventually to develop search mechanisms that consider repository descriptions as a criterion in federated search.

Keywords: Learning objects, metadata, data mining, learning object repositories, association rules, clustering.
F. Sartori, M.Á. Sicilia, and N. Manouselis (Eds.): MTSR 2009, CCIS 46, pp. 215–225, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

Learning object repositories (LORs) provide a platform for the open sharing of learning resources. Repositories that store only metadata about resources available elsewhere on the Web act as filters of those resources by providing metadata-based search on learning objects (ADL, 2002). Further, some of these repositories nowadays implement flexible mechanisms for the search and selection of resources based on metadata (Broisin et al., 2005, McLean and Lynch, 2003). However, users approaching these repositories do not have a clear a priori view of the kind of resources stored and to what extent they fit their interests or preferences. This gap might be filled by extracting characterizations of learning objects obtained from the analysis of their metadata. Such characterizations could also be used to enhance search or serve as descriptions of the content bases of repositories, useful both for humans and for software applications.

Machine learning techniques can be used to automatically extract characterizations of learning object collections. Indeed, the application of data mining (DM) techniques to the domain of e-learning has become more frequent in recent years (Romero and Ventura, 2007, Romero et al., 2008). Data mining in e-learning is mainly oriented to analyzing students' behavior, outcomes and interests in their interaction with learning technology and learning resources. Here we focus on analyzing the metadata stored inside learning object repositories (Duval et al., 2001). For this study we used the metadata base of the AGORA recommender system (Prieto, 2008) and other existing LORs. Clustering and association rule mining were selected as the two main techniques for a first exploration of the characterization of LORs. Concretely, the study reported here was aimed at providing some preliminary insight into the following questions:
(a) Which characteristics are common to the LO metadata stored in different repositories?
(b) Are there any LO groups which have similar characteristics in their metadata?
(c) From the instructional point of view, which relations between the LO metadata are the most significant?

The paper is structured as follows. Section 2 details methodological issues and describes the technical issues related to data gathering. Section 3 then discusses the usefulness of the results for different areas of LO management and design, especially for the AGORA system. Finally, the main conclusions of our research work and the future work are highlighted.
2 Methodology

The process of knowledge discovery essentially consists of pre-processing, application of data mining techniques and post-processing stages. Here the methods described in (Romero et al., 2008) were used as the guiding framework.

2.1 Criteria for LOR Selection

A significant number of LORs have been developed in recent years (UNESCO, 2009) and there are some studies that analyze their characteristics (Neven and Duval, 2002). The following criteria, which affect the homogeneity of metadata and query interfaces, were used to define the scope of our study:
1. Repositories providing query interfaces based on SQI.
2. Repositories conforming to the IEEE-LOM standard (IEEE-LOM, 2002).

As a consequence, our testing group was reduced to the LORs listed in (ARIADNE, 2009), from which three were selected: ARIADNE (Duval et al., 2001), MACE (Stefaner et al., 2007) and LACLO-FLOR (LACLO, 2009). Their metadata was extracted and also contrasted with metadata in the AGORA repository (Prieto, 2008).

These learning object repositories use the IEEE-LOM standard (IEEE-LOM, 2002) as a standardized metadata format. The IEEE-LOM standard defines 80 fields that are organized in 9 hierarchical categories (Al-Khalifa and Davis, 2006). The study focuses primarily on the educational metadata category.

A current trend in repositories is to publish services for access to their resources. This option promotes interoperability between different applications, LMS systems and other LORs. The Simple Query Interface (SQI) (Simon et al., 2005) is a proposed specification for that purpose. SQI (CEN/ISSS, 2005, Ternier et al., 2008) is defined as a set of methods related to a universal interoperability layer for educational networks.
2.2 Data Collection

The open source tool SQITester was used for gathering the required metadata. It allowed us to consume the web services provided by the repositories according to SQI (Ortiz Baíllo et al., 2008). SQITester provides access to some of the most important LO repositories. It allows querying according to a set of specifications for each repository (for example, query language, session ID, result format, etc.) and retrieves the metadata of the LOs that match these queries. The results are presented in an XML-compatible format.

Queries in the ARIADNE, LACLO and MACE repositories covered four areas: computing, mathematics, literature and biology, in both Spanish and English. A known problem in LORs is that most of them store incomplete metadata records (Sicilia et al., 2005). To obtain the greatest number and diversity of LOs in terms of their pedagogical metadata, specific queries were run against the MACE repository. MACE is the only one of these repositories in which specific metadata fields can be queried, using the PLQL 1 language (Ternier et al., 2008).

2.3 Preprocessing

The preprocessing consisted of the following activities:
a) Error removal. Errors resulting from the interchange between formats were manually removed.
b) Standardization of the metadata fields provided by each repository.
c) Standardization of field values according to the IEEE-LOM vocabularies.
d) Elimination of duplicates.
e) Transformation from a comma-delimited format to the ARFF format (Attribute-Relation File Format) that is directly used by the WEKA data mining tool (Witten and Frank, 2005).

Five data sets were obtained as a result of preprocessing (see footnote 1): one for each repository studied and an additional one that integrates all of them. In total, 763 records were processed: 200 from AGORA, 246 from MACE, 179 from ARIADNE and 138 from LACLO.
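Preprocessing steps c) to e) can be illustrated with a minimal Python sketch. It is not the procedure or tooling used in the study: the `normalise` rule (lower-casing and joining words with underscores), the function names and the three sample records are assumptions made only for illustration. The sketch removes exact duplicates and writes nominal attributes in ARFF syntax, using `?` for missing values.

```python
import csv
import io


def normalise(value):
    """Lower-case a raw value and join words with underscores; empty -> '?'."""
    value = value.strip().lower().replace(" ", "_")
    return value if value else "?"  # '?' marks a missing value in ARFF


def csv_to_arff(csv_text, relation="learning_objects"):
    """Convert comma-delimited metadata records into ARFF text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    attributes = reader.fieldnames
    seen, rows = set(), []
    for record in reader:
        row = tuple(normalise(record[name] or "") for name in attributes)
        if row not in seen:  # step d): drop exact duplicates
            seen.add(row)
            rows.append(row)

    lines = ["@relation " + relation, ""]
    for index, name in enumerate(attributes):
        # Nominal attribute: list every observed value except the missing marker.
        values = sorted({row[index] for row in rows} - {"?"})
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Made-up sample records, not taken from the study's data set.
SAMPLE = """difficulty,interactivity_type,cat_format
Medium,Expositive,pdf
medium,expositive,pdf
Very Easy,,html
"""

arff_text = csv_to_arff(SAMPLE)
print(arff_text)
```

The first two sample rows collapse into one record after normalisation, which is the kind of duplicate that step d) targets.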
2.4 Application of Data Mining Techniques

Clustering and predictive association techniques were applied to the data sets just described. Concretely, the K-means clustering algorithm (MacQueen, 1967) was applied to each repository separately and then to the data coming from all the repositories. The Apriori association algorithm (Agrawal et al., 1993) was likewise applied to each repository separately and then to the merged data set. Since the study focuses on metadata relationships from an educational point of view, processing was done primarily with data from category 5 (Educational), in addition to IEEE-LOM elements 1.7 (structure), 1.8 (aggregation level), 4.1 (format) and the repository to which each object belongs.
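Because all the attributes considered are nominal, a k-means style procedure has to work with modes and value mismatches rather than means and numeric distances. The sketch below shows that idea (often called k-modes) in plain Python. It is an illustrative stand-in, not the WEKA implementation used in the study. The function names, the seed and the six toy records are assumptions.

```python
import random
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mode_centroid(records):
    """Most frequent value of every attribute across the records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def k_modes(records, k, iterations=20, seed=0):
    """Cluster nominal records; returns (centroids, assignment list)."""
    rng = random.Random(seed)
    # Start from k distinct records chosen at random.
    centroids = rng.sample(sorted(set(records)), k)
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: mismatches(record, centroids[c]))
            for record in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = mode_centroid(members)
    return centroids, assignment


# Toy records: (cat_format, interactivity_type, cat_learn_res_type).
RECORDS = [
    ("flash", "active", "exe"),
    ("flash", "active", "exe"),
    ("flash", "active", "sim"),
    ("pdf", "expositive", "narr"),
    ("pdf", "expositive", "narr"),
    ("html", "expositive", "narr"),
]

centroids, assignment = k_modes(RECORDS, k=2)
print(centroids, assignment)
```

Each centroid is itself a record of attribute values, which is why the cluster tables in this section can be read as a "typical" learning object per cluster.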
1 The data set used in this study is available at http://www.kaambal.com/agora/
Fig. 1. Data distribution for each analyzed repository versus attributes. The letter of each graph corresponds to an attribute described in Table 1.
When performing data analysis, it was observed that not all the attributes were filled, with AGORA and MACE being the most complete. Figure 1 shows the data distribution for each repository compared with each attribute used in the processing. The detail of each attribute is shown in Table 1.

Table 1. LOM-Metadata elements used

a) aggregation_level
   Description: The functional granularity of this learning object.
   Values: one, two, three, four
b) structure
   Description: Underlying organizational structure of this learning object.
   Values: atomic, linear, hierarchical, networked
c) cat_format
   Description: Learning Objects formats shortcuts. Technical datatype(s) of (all the components of) this learning object.
   Values: html, img, pdf, rtf, ps, zip, doc, java, swf, xml, xls, mpeg, txt, eps, and their combinations
d) difficulty
   Description: How hard it is to work with or through this learning object for the typical intended target audience.
   Values: very easy, easy, medium, difficult, very difficult
e) interactivity_level
   Description: Interactivity in this context refers to the degree to which the learner can influence the aspect or behavior of the learning object.
   Values: very low, low, medium, high, very high
f) interactivity_type
   Description: Predominant mode of learning supported by this learning object.
   Values: expositive, active
g) cat_learn_res_type
   Description: Learning Objects types shortcuts. Specific kind of learning object. The most dominant kind shall be first.
   Values: exe, sld, lec, fig, narr, tab, self, rea and their combinations
h) semantic_density
   Description: The degree of conciseness of a learning object.
   Values: very low, low, medium, high, very high
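The completeness observation above (not all attributes are filled in every repository) can be quantified per repository and attribute. The following Python sketch shows one way to do so. The function name, the set of markers treated as missing and the four sample records are assumptions for illustration and are not values from the study.

```python
from collections import defaultdict

# Markers treated as "not filled" (an assumption of this sketch).
MISSING = {"", "?", None}


def completeness(records, attributes):
    """Fraction of non-empty values per repository and attribute."""
    filled = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for record in records:
        repo = record["repository"]
        totals[repo] += 1
        for name in attributes:
            if record.get(name) not in MISSING:
                filled[repo][name] += 1
    return {
        repo: {name: filled[repo][name] / totals[repo] for name in attributes}
        for repo in totals
    }


# Made-up records for illustration only.
RECORDS = [
    {"repository": "AGORA", "difficulty": "medium", "semantic_density": "high"},
    {"repository": "AGORA", "difficulty": "easy", "semantic_density": "medium"},
    {"repository": "LACLO", "difficulty": "", "semantic_density": "medium"},
    {"repository": "LACLO", "difficulty": "?", "semantic_density": ""},
]

report = completeness(RECORDS, ["difficulty", "semantic_density"])
for repo, shares in report.items():
    print(repo, shares)
```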
2.4.1 Clustering

The Simple K-means algorithm was applied to the data set that integrates all the repositories. Four clusters were used and the results are shown in Table 2. Evaluating the clusters obtained against the attribute repository gave an error of 37.0904%, the lowest among the algorithms tested.

Table 2. Clustering results for MACE, AGORA, ARIADNE and LACLO

Attribute            Full Data    Cluster 0    Cluster 1    Cluster 2    Cluster 3
Clustered instances  763 (100%)   227 (30%)    368 (48%)    97 (13%)     71 (9%)
aggregation_level    one          one          one          one          one
structure            atomic       atomic       atomic       atomic       atomic
cat_format           pdf          ppt          pdf          flash        html
difficulty           medium       medium       medium       medium       medium
interactivity_level  very_low     very_low     very_low     high         very_low
interactivity_type   expositive   expositive   expositive   active       expositive
cat_learn_res_type   narr         lec          narr         exe          narr
semantic_density     medium       medium       medium       high         medium
Correctly classified instances are related to the repositories as follows: Cluster 0 = LACLO, Cluster 1 = MACE, Cluster 2 = AGORA and Cluster 3 = ARIADNE. Figure 2 shows, for each cluster, the grouped instances (crosses) and the incorrectly classified instances (squares).
Fig. 2. Visualization of the grouping (cross) and incorrectly classified (square) instances
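The error figure reported for the clustering comes from comparing cluster membership with the attribute repository. A simplified version of that kind of evaluation is sketched below in Python. Each cluster is labelled with its majority repository and every instance from another repository counts as an error. This majority-vote rule is an assumption of the sketch and may differ from how the tool used in the study maps clusters to classes. The eight toy instances are made up.

```python
from collections import Counter, defaultdict


def classes_to_clusters_error(cluster_ids, class_labels):
    """Label each cluster with its majority class; return (mapping, error rate)."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    wrong = sum(
        1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label
    )
    return mapping, wrong / len(cluster_ids)


# Toy example: cluster assigned to each instance and its source repository.
CLUSTERS = [0, 0, 0, 1, 1, 2, 2, 2]
REPOSITORIES = ["LACLO", "LACLO", "MACE", "MACE", "MACE", "AGORA", "AGORA", "ARIADNE"]

mapping, error = classes_to_clusters_error(CLUSTERS, REPOSITORIES)
print(mapping, error)
```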
Due to the higher completeness of metadata records in the AGORA repository, the results of applying clustering to it alone are also presented. Applying the Simple K-means algorithm yielded 3 clusters, as shown in Table 3.
Table 3. Results of clustering in AGORA

Attribute            Full Data    Cluster 0    Cluster 1    Cluster 2
Clustered instances  200 (100%)   99 (50%)     58 (29%)     43 (22%)
aggregation_level    one          one          two          one
structure            atomic       atomic       atomic       atomic
cat_format           flash        flash        ppt          pdf
cat_context          high         high         high         high
difficulty           medium       medium       medium       easy
interactivity_level  very_low     high         low          very_low
interactivity_type   expositive   active       expositive   expositive
cat_learn_res_type   sld          exe          sld          rea
semantic_density     medium       high         medium       medium
Table 4. Example rules for each repository

Repository  Nº  rel    Antecedent => Consequence
MACE        24  0.937  cat_learn_res_type = exe_sim_que_dia_fig_gra_ind_sld_tab_narr => difficulty = medium ; interactivity_type = expositive
ARIADNE      5  0.959  cat_format = html => interactivity_type = expositive
LACLO        4  0.982  cat_learn_res_type = fig => cat_format = jpg_gif_pjpeg_bmp
AGORA       12  0.993  cat_format = ppt ; cat_learn_res_type = sld ; semantic_density = medium => interactivity_type = expositive
AGORA       19  0.993  cat_format = ppt ; difficulty = medium ; cat_learn_res_type = sld => interactivity_type = expositive
AGORA       70  0.986  cat_format = ppt ; cat_learn_res_type = sld => interactivity_type = expositive
MACE        27  0.937  cat_format = html ; cat_learn_res_type = narr => interactivity_type = expositive
MACE        76  0.868  cat_format = html_img ; cat_learn_res_type = narr => interactivity_type = expositive
ARIADNE      2  0.988  cat_format = html ; difficulty = easy => interactivity_type = expositive
ARIADNE     33  0.322  cat_format = html ; interactivity_type = expositive => difficulty = easy
ARIADNE     16  0.797  difficulty = easy ; interactivity_type = expositive => cat_format = html
MACE        17  0.953  cat_learn_res_type = que => interactivity_type = active
MACE        77  0.868  cat_format = html_img ; cat_learn_res_type = exe_dia => interactivity_type = active
AGORA       64  0.988  structure = atomic ; interactivity_type = expositive => cat_format = pdf ; interactivity_level = low
AGORA       32  0.993  cat_learn_res_type = sld => cat_format = ppt ; interactivity_level = low ; interactivity_type = expositive
The clusters obtained for AGORA (Table 3) can be described in terms of the objects they group, as follows:
• Cluster 0: Objects that are more active and highly interactive for the learner. These are mainly resources of type exercise. They also have a high semantic density and a high complexity level.
• Cluster 1: Expositive objects with a low interactivity level. These resources are mainly slides with a medium level of both complexity and semantic density.
• Cluster 2: Expositive objects with very low interactivity. They are mainly resources of type “reading” that are easy to use and have medium semantic density.

2.4.2 Associations

The Predictive Apriori algorithm was applied separately to each repository and then to the full data set. This generated a set of 100 rules, but the analysis that follows was restricted to those with a reliability (rel) greater than 90%. The rules generated for each repository were analyzed considering common rules, redundant rules, rules that reinforce the existing knowledge of the relations among metadata, unexpected rules, interesting rules, and finally questionable rules (from an educational point of view). Table 4 provides some examples of the rules extracted. The repository name, identification number and reliability of each rule are included. Some examples of the analysis of these rules follow:
• The utility of rule 24 (MACE) is difficult to evaluate, since there is no obvious relation between a resource type and its interactivity and difficulty. Learning resources were grouped by learning resource type, and this group is wide, ranging from exercises to tables or narratives.
• Rule 5 (ARIADNE) is an example of a simple rule (with a single antecedent and consequent) with a high level of reliability, but it is not useful, since it appears most frequently in repositories with many empty fields.
• Rule 4 (LACLO) appears to be a useful rule, but it shows an obvious or known relation between metadata values.
• There are redundant rules and they can be reduced. Rule 70 is a generalization of rules 12 and 19 (AGORA), with some information loss. The analysis of predictive attributes confirmed that interactivity type is predicted by the attributes “resource type” (cat_learn_res_type) and “format” (cat_format).
• Rule 27 is an example of a generalization of rule 76 in MACE.
• Rules 2, 33 and 16 from ARIADNE are similar. They combine the same attribute values in the antecedent and consequent, which can be interchanged.
• Finally, rules 17 and 77 (MACE) seem useful. They confirm some evidence on metadata relations, for example, that resources of type exercise or questionnaire have the active interactivity type. Other examples are rules obtained from the AGORA repository. Rule 64 shows that resources with expositive interactivity type and atomic structure are mainly related to PDF format and low interactivity level. Rule 32 shows that the slide resource type is related to PPT format, low interactivity level and expositive interactivity type.

2.4.3 Applying Association in Integrated Repositories

The association rules obtained from the analysis of all repositories together were more interesting, as they have increased explanatory scope (see Table 5).
Table 5. Example rules in integrated repositories

Nº  rel    Antecedent => Consequence
51  0.972  structure = atomic ; cat_format = flash ; interactivity_type = active ; cat_learn_res_type = exe => aggregation_level = uno
73  0.962  structure = atomic ; difficulty = easy ; interactivity_level = medium ; cat_learn_res_type = exe => aggregation_level = uno
42  0.981  structure = linear ; cat_learn_res_type = sld ; cat_format = flash => aggregation_level = dos
11  0.990  cat_format = ppt ; cat_learn_res_type = sld ; semantic_density = medium => interactivity_type = expositive
14  0.989  cat_format = ppt ; difficulty = medium ; cat_learn_res_type = sld => interactivity_type = expositive
33  0.985  cat_format = flash ; interactivity_level = high ; cat_learn_res_type = sld => semantic_density = high
• Most of the rules obtained were validated using the selection of predictive attributes. For example, the attributes “structure” and “learning resource type” are predictive attributes of “aggregation level” (see rules 42, 51 and 73). These rules are consistent with the principles raised in LOM-ES (LOM-ES, 2008), which establish that there must be a relation between resource type and aggregation level.
• In turn, the categories “format” and “learning resource type” are predictive attributes of “interactivity type” (see rules 11 and 14).
• The attributes “format”, “interactivity level” and “learning resource type” are predictive attributes of “semantic density”. Rule 33 shows this relation.

One interesting aspect of this study is whether the attribute “repository” is a relevant element for a possible classification. Association rules with a high reliability level relate it to attributes such as “structure”, “format” and “learning resource type”. Examples of these are rules 68, 2, 1 and 84 in Table 6.

Table 6. Example rules with the attribute repository

Nº  Reliability  Antecedent => Consequence
68  0.985        structure = atomic ; difficulty = medium ; semantic_density = very_high => repository = AGORA
 2  0.994        cat_learn_res_type = lec => repository = LACLO
 1  0.994        cat_learn_res_type = narr => repository = MACE
84  0.980        cat_format = html_img ; interactivity_level = medium => difficulty = medium ; repository = MACE
It is important to mention that about 70 rules have the attribute “repository” in the consequent. More than half of these rules are related to the AGORA repository. This might be attributed to the higher completeness of metadata in this repository compared with metadata from LACLO and ARIADNE.
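Two checks recur in the rule analysis of Sections 2.4.2 and 2.4.3: how often a rule holds on the data, and whether one rule is a generalization of another. The Python sketch below illustrates both. It uses plain confidence as a simpler stand-in for the reliability (rel) reported by the mining tool, which is a different measure. The function names and the four sample records are assumptions. The three rules are the AGORA rules 12, 19 and 70 from Table 4.

```python
def holds(items, record):
    """True when every attribute=value pair in items matches the record."""
    return all(record.get(name) == value for name, value in items.items())


def confidence(rule, records):
    """Share of records matching the antecedent that also match the consequent."""
    antecedent, consequent = rule
    covered = [r for r in records if holds(antecedent, r)]
    if not covered:
        return 0.0
    return sum(holds(consequent, r) for r in covered) / len(covered)


def generalises(general, specific):
    """Same consequent and a strictly smaller antecedent."""
    (g_ante, g_cons), (s_ante, s_cons) = general, specific
    return g_cons == s_cons and g_ante.items() < s_ante.items()


EXPOSITIVE = {"interactivity_type": "expositive"}
RULE_12 = ({"cat_format": "ppt", "cat_learn_res_type": "sld",
            "semantic_density": "medium"}, EXPOSITIVE)
RULE_19 = ({"cat_format": "ppt", "difficulty": "medium",
            "cat_learn_res_type": "sld"}, EXPOSITIVE)
RULE_70 = ({"cat_format": "ppt", "cat_learn_res_type": "sld"}, EXPOSITIVE)

# Made-up records, used only to exercise confidence().
RECORDS = [
    {"cat_format": "ppt", "cat_learn_res_type": "sld", "interactivity_type": "expositive"},
    {"cat_format": "ppt", "cat_learn_res_type": "sld", "interactivity_type": "expositive"},
    {"cat_format": "ppt", "cat_learn_res_type": "sld", "interactivity_type": "active"},
    {"cat_format": "pdf", "cat_learn_res_type": "rea", "interactivity_type": "expositive"},
]

print(generalises(RULE_70, RULE_12), generalises(RULE_70, RULE_19))
print(confidence(RULE_70, RECORDS))
```

A redundancy filter along these lines would keep the general rule and drop the more specific ones only when the loss in reliability is acceptable, since generalization comes with some information loss.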
3 Potential Applications

The results of studies like the one presented here can be applied for several purposes in educational technology, including the following:
• Learning object search based on instructional criteria. Applying clustering to the AGORA repository produced groups of objects based on some characteristics of the learning resource type. Instructional design methods (Reigeluth, 1999) requiring some given type can be matched with these groups. It is also possible to build classifiers with the results of clustering, so that new learning objects can be classified automatically into the relevant groups identified.
• Meta-search strategies in learning object repositories. Meta-search typically broadcasts a query to several repositories, without considering which of them is more appropriate for the user. Characterizations of repositories based on mined models can be used to direct searches to the “more relevant” repositories for each user (or, alternatively, the results of some repositories can be assigned an increased weight), for some given preferences.
• Automatic metadata generation. Metadata generation requires predictive models that help in automatically filling some fields based on the inspection of the available information on the resources.

Figure 3 shows the predictive relationships generated in the repositories.
Fig. 3. Associations between metadata fields, as extracted from the study
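The first application listed above, classifying a new learning object into the groups found by clustering, can be sketched with the AGORA centroids of Table 3. The Python sketch assigns a record to the cluster whose centroid agrees with it on the most fields. This nearest-centroid rule, the function name and the sample record are assumptions for illustration. The paper does not prescribe a particular classifier.

```python
ATTRIBUTES = [
    "aggregation_level", "structure", "cat_format", "cat_context", "difficulty",
    "interactivity_level", "interactivity_type", "cat_learn_res_type",
    "semantic_density",
]

# Cluster centroids copied from Table 3 (AGORA).
CENTROIDS = {
    0: dict(zip(ATTRIBUTES, ["one", "atomic", "flash", "high", "medium",
                             "high", "active", "exe", "high"])),
    1: dict(zip(ATTRIBUTES, ["two", "atomic", "ppt", "high", "medium",
                             "low", "expositive", "sld", "medium"])),
    2: dict(zip(ATTRIBUTES, ["one", "atomic", "pdf", "high", "easy",
                             "very_low", "expositive", "rea", "medium"])),
}


def nearest_cluster(record, centroids):
    """Return the cluster id whose centroid matches the record on most fields."""
    def agreement(cluster):
        centroid = centroids[cluster]
        return sum(1 for name, value in record.items() if centroid.get(name) == value)

    # Ties are resolved in favour of the lowest cluster id.
    return max(sorted(centroids), key=agreement)


# Hypothetical new learning object with only some fields filled in.
NEW_OBJECT = {
    "cat_format": "pdf",
    "interactivity_type": "expositive",
    "difficulty": "easy",
    "cat_learn_res_type": "rea",
}

cluster = nearest_cluster(NEW_OBJECT, CENTROIDS)
print(cluster)
```

Only the fields present in the new record are compared, so the same function can be applied to incomplete metadata records.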
Metadata generation could be used to suggest values to users editing metadata records, or as a “best effort” approach for incomplete metadata bases. For example, attributes “format” and “resource type” are predictive attributes of “difficulty”.
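Suggesting values from mined rules can be sketched as follows in Python. The three rules are copied from Table 5. The `suggest` function, its `min_rel` threshold and the sample record are assumptions for illustration, not part of the AGORA system. A rule fires when its whole antecedent matches the record, and only empty fields receive a suggestion, taken from the most reliable rule that fires.

```python
# (Nº, rel, antecedent, consequence), copied from Table 5.
RULES = [
    (11, 0.990,
     {"cat_format": "ppt", "cat_learn_res_type": "sld", "semantic_density": "medium"},
     {"interactivity_type": "expositive"}),
    (14, 0.989,
     {"cat_format": "ppt", "difficulty": "medium", "cat_learn_res_type": "sld"},
     {"interactivity_type": "expositive"}),
    (51, 0.972,
     {"structure": "atomic", "cat_format": "flash", "interactivity_type": "active",
      "cat_learn_res_type": "exe"},
     {"aggregation_level": "uno"}),
]


def suggest(record, rules, min_rel=0.9):
    """Map each empty field to (value, rel, rule number) from the best matching rule."""
    suggestions = {}
    for number, rel, antecedent, consequence in sorted(rules, key=lambda r: -r[1]):
        if rel < min_rel:
            continue
        if all(record.get(name) == value for name, value in antecedent.items()):
            for field, value in consequence.items():
                # Never overwrite a value the editor already provided.
                if not record.get(field) and field not in suggestions:
                    suggestions[field] = (value, rel, number)
    return suggestions


# Hypothetical incomplete record being edited.
RECORD = {
    "cat_format": "ppt",
    "cat_learn_res_type": "sld",
    "semantic_density": "medium",
    "difficulty": "medium",
}

suggestions = suggest(RECORD, RULES)
print(suggestions)
```

Returning the rule number and reliability alongside the value lets an editing interface show why a value is being proposed.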
4 Conclusions

This paper has described an exploratory study on four learning object repositories that used data mining techniques to extract characterizations of the repositories from the processing of metadata records. Incomplete metadata records and deviations from the vocabularies in the IEEE LOM standard are two main limitations of this approach, but they could be addressed by providing some simple metadata quality filters inside the repositories.

Two main techniques were applied: clustering and association rule mining. The application of clustering analysis resulted in the identification of three relevant groups of learning objects, which can be roughly characterized by their interactivity level. These characterizations can be applied to filtering search results to subsets of learning resources given some metadata preferences. Association rule extraction resulted in several relationships between metadata elements that have the potential to be useful as characterizations of learning resource bases. These relationships are candidates for automated metadata generation algorithms.

Future work will expand the study reported here to cover a larger number of repositories and a more heterogeneous learning object base. This will eventually allow us to contrast or improve the rules learned and expand the features that define the groups generated. The results of this study are planned to be integrated in the AGORA project and repository, serving as a basis for automated metadata generation and meta-search.
Acknowledgments

This work is partially supported by MECESUP UBB 0305 project, Chile; A/016625/08 AECID project, Spain; “Metodologías para la producción colaborativa de objetos de aprendizaje” project, SINED-ANUIES, México; YUC 2006-C05-65811 project, FOMIX, CONACYT, México.
References

ADL, Emerging and Enabling Technologies for the design of Learning Object Repositories Report (2002), http://xml.coverpages.org/ADLRepositoryTIR.pdf (accessed April 2009)
Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, Washington, DC, United States. ACM, New York (1993)
Al-Khalifa, H.S., Davis, H.C.: The evolution of metadata from standards to semantics in E-learning applications. In: Proceedings of the seventeenth conference on Hypertext and hypermedia, Odense, Denmark. ACM, New York (2006)
ARIADNE, SQI Implementation Registry (2009), http://ariadne.cs.kuleuven.be/SqiInterop/free/SQIImplementationsRegistry.jsp (accessed 15 January 2009)
Broisin, J., Philippe, V., Meire, M., Duval, E.: Bridging the Gap between Learning Management Systems and Learning Object Repositories: Exploiting Learning Context Information. In: Advanced Industrial Conference on Telecommunications/Service Assurance with Partial and Intermittent Resources Conference/E-Learning on Telecommunications Workshop (2005)
CEN/ISSS: A Simple Query Interface Specification for Learning Repositories (CEN Workshop Agreement #15454). Brussels, Belgium (2005)
Duval, E., Forte, E., Cardinaels, K., Verhoeven, B., Durm, R.V., Hendrikx, K., Forte, M.W., Ebel, N., Macowicz, M., Warkentyne, K., Haenni, F.: The Ariadne knowledge pool system. Commun. ACM 44, 72–78 (2001)
IEEE-LOM. Draft Standard for Learning Object Metadata. IEEE P1484.12.1 (2002)
LACLO, Comunidad Latinoamericana de Objetos de Aprendizaje (2009), http://www.laclo.org/ (accessed April 2009)
LOM-ES. Perfil de Aplicación LOM-ES V.1.0 G. G.-S. 36/AENOR (2008), available at http://www.educa.madrid.org/cms_tools/files/ac98a893-c209-497a-a4f1-93791fb0a643/lom-es_v1.pdf
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, California (1967)
McLean, N., Lynch, C.: Interoperability between Information and Learning Environments: Bridging the Gaps (2003), http://www.imsglobal.org/DLims_white_paper_publicdraft_1.pdf (accessed April 2009)
Neven, F., Duval, E.: Reusable learning objects: a survey of LOM-based repositories. In: Proceedings of the tenth ACM international conference on Multimedia, Juan-les-Pins, France. ACM, New York (2002)
Ortiz Baíllo, A., Tortosa, S.O., Martínez Herráiz, J.J., Hilera González, J.R., Barchino Plata, R.: Estandarización de los Sistemas de Búsqueda Federada: SQI como Interfaz de Búsqueda. In: X Simposio Internacional de Informática Educativa (SIIE), Salamanca, España (2008)
Prieto, M., Menéndez, V., Segura, A., Vidal, C.: A Recommender System Architecture for Instructional Engineering. In: Lytras, M.D., Carroll, J.M., Damiani, E., Tennyson, R.D. (eds.) WSKS 2008. LNCS (LNAI), vol. 5288, pp. 314–321. Springer, Heidelberg (2008)
Reigeluth, C.M.: Instructional-Design Theories and Models: A New Paradigm of Instructional Theory. Lawrence Erlbaum Assoc., Mahwah (1999)
Romero, C., Ventura, S.: Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications 33, 135–146 (2007)
Romero, C., Ventura, S., García, E.: Data mining in course management systems: Moodle case study and tutorial. Computers and Education 51, 368–384 (2008)
Sicilia, M.A., Garcia, E., Pages, C., Martinez, J.J., Gutierrez, J.M.: Complete metadata records in learning object repositories: some evidence and requirements. International Journal of Learning Technology 1, 411–424 (2005)
Simon, B., Massart, D., Van Assche, F., Ternier, S., Duval, E., Brantner, S., Olmedilla, D., Miklos, Z.: A Simple Query Interface for Interoperable Learning Repositories. In: Saito, N., Simon, B., Olmedilla, D. (eds.) Proceedings of the 1st Workshop on Interoperability of Web-based Educational Systems, Chiba, Japan, CEUR (2005)
Stefaner, M., Vecchia, E.D., Condotta, M., Wolpers, M., Specht, M., Apelt, S., Duval, E.: MACE - Enriching architectural learning objects for experience multiplication. LNCS (LNAI & LNBI). Springer, Heidelberg (2007)
Ternier, S., Massart, D., Campi, A., Guinea, S., Ceri, S., Duval, E.: Interoperability for searching learning object repositories: The ProLearn query language. D-Lib Magazine 14 (2008)
UNESCO, O.E.R., Open Educational Resources, useful resources/repositories (2009), http://oerwiki.iiepunesco.org/index.php?title=OER_useful_resources/Repositories (accessed April 2009)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, New Zealand (2005)