Software tool for automatic population of MLO-AD ontology from accreditation documents

Segedinac Milan*, Savić Goran*, Konjović Zora*, Surla Dušan**
* University of Novi Sad/Faculty of Technical Sciences, Department of Computing and Automation, Novi Sad, Republic of Serbia
** University of Novi Sad/Faculty of Sciences, Department of Mathematics and Informatics, Novi Sad, Republic of Serbia
(milansegedinac, savicg, ftn_zora, surla)@uns.ac.rs

Abstract: Internationalization of higher education requires curriculum communication between educational institutions, as well as access to information about learning opportunities and/or achieved outcomes for other users such as students, prospective employers and administrative institutions. To establish interoperability in the field, the European Union has adopted the metadata model MLO-AD (Metadata for Learning Opportunities – Advertising) as the standard CWA 15903:2008. The software tool presented in this paper is a part of a comprehensive software platform supporting international curriculum communication, which relies on the MLO-AD metadata standard and the Semantic Web. The tool is aimed at automated population of the MLO-AD ontology from the accreditation documentation of higher education institutions.

I. INTRODUCTION

Curriculum internationalization [1] requires that curriculum management be achievable at an international level, but in a way that does not diminish the importance of local curriculum specificities and diversities [2]. One approach to the development of an internationalized curriculum management system is proposed in [3]. This system is based upon an international standard for representing learning opportunities, MLO-AD (CWA 15903:2008), and Semantic Web technologies. The architecture of the system proposed in [3], with the SPARQL endpoint as its central component, allows updating and indexing the ontology in addition to reasoning over it. To be applicable in a system that uses Semantic Web technologies, MLO-AD, which is originally a Dublin Core Application Profile, was represented by an ontology [4]. This ontology can be deployed at the SPARQL endpoint of the system for internationalized curriculum management. The architecture offers users direct access to the ontology through SPARQL queries, as well as a Web application that provides a user-friendly interface for browsing, searching and updating the ontology. Local curriculum storages are integrated into the system in two ways: by exporting the ontology into native formats (for example, generating PDF documents from the ontology) and by populating the ontology with information from the local curriculum storages.

This paper contributes to the development of the internationalized curriculum management system by proposing a software tool for automatic population of the MLO-AD ontology from accreditation documents.

II. ONTOLOGY POPULATION

Ontology population is the process of inserting individuals and relation instances into an existing ontology. During ontology population, the structure of the ontology stays intact [5]. Ontologies can be populated manually, semi-automatically or automatically [6].

When an ontology is populated manually, domain experts directly insert the individuals and relation instances into the ontology. The main shortcoming of this approach is that it is time-consuming and labor-intensive.

The semi-automatic approach to ontology population is based upon the idea that the system should give the user full control over the process while automatically offering the most probable suggestions [7]. Some semi-automatic ontology population systems create document annotations and store them in ontologies. In such an approach, user intervention is not necessary when adding relationships for instances that already exist in the ontology. Otherwise, users must add them themselves [6].

Automatic ontology population does not engage users in any way when the individuals and relation instances are identified and inserted into the ontology. It requires the initial ontology that is being populated, an instance extraction engine and the corpus from which the instances are extracted [5]. The instance extraction engine identifies individuals and the relations among them in the corpus. The list of identified individuals and relation instances is then inserted into the ontology. A typical instance extraction engine deploys lookup text extraction or, in image analysis, prototype recognition. This process can include machine learning methods, resulting in more general instance extraction.

III. SOFTWARE TOOL ARCHITECTURE

Most of the higher education institutions in the Republic of Serbia submit their accreditation documents in PDF format. One such document is shown in Figure 1.

Figure 1. An accreditation document

In order to extract the information from the accreditation documents, the documents need to be converted into a format that is suitable for machine processing. Therefore, OCR software is applied and the accreditation documents are transformed into HTML files, producing a machine-processable corpus. The main part of the research presented in this paper is the design and implementation of a software tool aimed at automatic population of the MLO ontology with the information extracted from the accreditation documents in HTML format. In accordance with the generalized architecture of ontology population systems [5], the software tool proposed in this paper consists of the following components:

• Instance value extractor component - a component for extracting instance values from HTML files,
• OWL model manager component - a component for accessing the input and output ontologies and inserting new instances into the output ontologies,
• Statement creator component - a component aimed at creating individuals and statements from the information extracted by the instance value extractor component and inserting them into the ontology through the OWL model manager component.

The component diagram of the proposed tool is shown in Figure 2.

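Figure 2. MLO ontology population tool – the system architecture (component diagram; elements: PDF files, OCR software, HTML files, Instance value extractor, Statement creator, OWL model manager, Initial ontologies, Output ontologies; interfaces: ExtractedValuesAccess, InferedInstancesAccess, OWLModelUpdated)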

The instance value extraction component uses HTML files. Since the accreditation documents are initially available in PDF format, one additional component is introduced, namely the OCR software that converts the PDF files into HTML files. The instance value extraction component should offer an extracted values access interface.

The ontology manipulation component (OWL model manager) handles the input and output ontologies. It is important to stress that the input ontologies must not be changed, meaning that the ontology manipulation component should only be able to read them. Output ontologies import the input ontologies and contain the instances generated during ontology population. Therefore, the ontology manipulation component should be able to create output ontologies, change them by inserting new individuals and relation instances, and read existing output ontologies if needed. This component should offer two interfaces:

• OWL model update interface - an interface that enables inserting new instances into the output ontologies,
• Inferred instances access interface - an interface that enables access to the input and output ontologies.
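The division of responsibilities described above can be illustrated with a minimal Java sketch. The interface names are taken from the labels in Figure 2, but every method signature, the in-memory classes and the representation of statements as plain strings are assumptions made for illustration only; they do not reproduce the tool's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Assumed shape of the interface offered by the instance value extractor.
interface ExtractedValuesAccess {
    List<String> getValues();
}

// Assumed shape of the interface for inserting statements into the output ontology.
interface OWLModelUpdated {
    void addStatement(String subject, String predicate, String object);
}

// Assumed shape of the interface for reading the input and output ontologies.
interface InferedInstancesAccess {
    List<String> listStatements();
}

// In-memory stand-in for the OWL model manager: the input ontology is
// read-only, and new statements are written only to the output ontology.
class InMemoryModelManager implements OWLModelUpdated, InferedInstancesAccess {
    private final List<String> input;
    private final List<String> output = new ArrayList<>();

    InMemoryModelManager(List<String> inputStatements) {
        this.input = Collections.unmodifiableList(new ArrayList<>(inputStatements));
    }

    @Override
    public void addStatement(String subject, String predicate, String object) {
        output.add(subject + " " + predicate + " " + object);
    }

    @Override
    public List<String> listStatements() {
        // The output ontology "imports" the input ontology, so both are visible.
        List<String> all = new ArrayList<>(input);
        all.addAll(output);
        return all;
    }

    List<String> inputStatements() {
        return input;
    }
}

// Stand-in for the statement creator: it links the extractor to the model manager.
class StatementCreatorSketch {
    static void populate(ExtractedValuesAccess extractor, OWLModelUpdated model,
                         String predicate) {
        int i = 1;
        for (String value : extractor.getValues()) {
            // "ex:individualN" is a hypothetical naming scheme for new individuals.
            model.addStatement("ex:individual" + i++, predicate, "\"" + value + "\"");
        }
    }
}
```

Wrapping the input statements in an unmodifiable list is one simple way to enforce the requirement that input ontologies are only read, while the statement creator sees nothing but the two interfaces it depends on.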

The statement creation component is a link between the instance value extraction component and the ontology manipulation component. It uses the interfaces of both of these components.

IV. MLO ONTOLOGY POPULATION PROCESS

The MLO ontology population process consists of cycles. Some of the ontologies populated in earlier cycles are used as input ontologies in subsequent ones. The activities of the MLO ontology population are shown in Figure 3.

Figure 3. MLO ontology population process (activity diagram; cycles: lecturers ontology population, course units ontology population, degree programmes ontology population; activities: create controlled vocabulary of lecturers, create course units ontology, create degree programmes ontology; objects: FOAF ontology, MLO ECTS ontology, lecturers controlled vocabulary, course units ontology, degree programmes ontology)
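The cyclic structure of the process, in which ontologies populated in earlier cycles serve as inputs to later ones, can be sketched in Java as follows. The class names, the string labels for ontologies and the dependency check are illustrative assumptions; the order of the cycles and their inputs follow the description of the process in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One population cycle: reads some ontologies and produces one new ontology.
class PopulationCycle {
    final String name;
    final List<String> inputs;
    final String output;

    PopulationCycle(String name, List<String> inputs, String output) {
        this.name = name;
        this.inputs = inputs;
        this.output = output;
    }
}

class PopulationProcess {
    // The three cycles in the order described in this section (labels are illustrative).
    static List<PopulationCycle> cycles() {
        return Arrays.asList(
            new PopulationCycle("lecturers",
                Arrays.asList("FOAF"), "lecturers"),
            new PopulationCycle("course units",
                Arrays.asList("MLO ECTS", "lecturers"), "course units"),
            new PopulationCycle("degree programmes",
                Arrays.asList("MLO ECTS", "course units"), "degree programmes"));
    }

    // Runs the cycles in order, checking that every input is already available.
    static List<String> run(Set<String> initialOntologies) {
        Set<String> available = new HashSet<>(initialOntologies);
        List<String> produced = new ArrayList<>();
        for (PopulationCycle cycle : cycles()) {
            for (String input : cycle.inputs) {
                if (!available.contains(input)) {
                    throw new IllegalStateException(
                        "Cycle '" + cycle.name + "' is missing input: " + input);
                }
            }
            // A real tool would extract values and insert instances at this point.
            available.add(cycle.output);
            produced.add(cycle.output);
        }
        return produced;
    }

    public static void main(String[] args) {
        System.out.println(run(new HashSet<>(Arrays.asList("FOAF", "MLO ECTS"))));
    }
}
```

Making the inputs of each cycle explicit means that a missing initial ontology is reported before any later cycle runs on incomplete data.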

The MLO ontology population process starts with the population of the lecturers ontology. The data on the lecturers are represented using the FOAF ontology (http://www.foaf-project.org/), which is therefore the input ontology in this cycle. In this paper, the OWL implementation of the FOAF ontology available at http://www.mindswap.org/2003/owl/foaf is used for that purpose. This activity involves analyzing all HTML files that represent courses obtained from the accreditation documents, creating individuals that represent lecturers according to the FOAF ontology, and inserting those individuals into the lecturers ontology. Since the lecturers in the accreditation documents are represented by their names and surnames, only the foaf:firstName and foaf:surname properties are set.

After the lecturers ontology is populated, the next cycle can start. In this cycle the courses ontology is populated. In addition to the populated lecturers ontology, this cycle uses the set of ontologies for representing metadata for learning opportunities proposed in [4]. The ontology is populated with instances that represent educational courses. For each course represented in the accreditation documents, two instances are created:

• An instance of mload:LearningOpportunitySpecification, which represents an abstract description of a learning opportunity, including the relevant data common to all the implementations of the learning opportunity;
• An instance of mload:LearningOpportunityInstance, which represents the data relevant to a concrete implementation of a learning opportunity.

For these instances, the appropriate individual from the lecturers ontology is set as the value of the mlo:hasLecturer property. In addition, the values extracted from the HTML files are set for the following properties: ects:CourseUnitCode, ects:CourseUnitTitle, mlo:Credits, okv:CourseUnitStatus, okv:hasCourseUnitNumberOfClasses, mlo:Objective, ects:CourseUnitOutcome, ects:CourseUnitTeachingMethods, ects:CourseUnitContent, ects:CourseUnitAssessmentMethods and ects:CourseUnitRecommendedReading.

After the ontology of educational courses is created, the next cycle, the degree programme ontology population, can take place. In this cycle, the ontology of educational courses and the MLO ontologies are used as input ontologies. For each degree programme, an instance of mload:LearningOpportunityInstance and an instance of mload:LearningOpportunitySpecification are created. For them, the data extracted from the HTML files are set for the following properties: ects:DegreeProgrammeTitle, ects:InstitutionName, kvs:DegreeProgrammeScientificField, kvs:DegreeProgrammeScientificDiscipline, kvs:DegreeProgrammeStudiesType, mlo:DegreeProgrammeCredit, kvs:DegreeProgrammeProfessionalTitle, mlo:DegreeProgrammeDuration, kvs:DegreeProgrammeRealizationStartYear, kvs:DegreeProgrammeNumberOfStudents, kvs:DegreeProgrammePlanedNumberOfStudents, kvs:DegreeProgrammeAcceptanceDate, mlo:DegreeProgrammeLanguageOfInstruction, kvs:DegreeProgrammeAcreditationYear, mlo:Url, kvs:DegreeProrammeStructureDescription, ects:DegreeProgrammeEducationAndProfessionalGoals, kvs:DegreeProgrammeEducationalObjective, kvs:DegreeProgrammeCompetency and kvs:DegreeProgrammeCurriculum.

V. SOFTWARE TOOL IMPLEMENTATION

In this section, the implementation of the proposed MLO ontology population tool is presented. The tool is implemented in the Java programming language. For HTML file manipulation, the CyberNeko HTML Parser (http://sourceforge.net/projects/nekohtml/) is used. For ontology handling, Apache Jena, a Java framework for building Semantic Web applications (http://jena.apache.org/), is used. The system is implemented following the architecture proposed in this paper. The class diagram is presented in Figure 4. Because of the nature of this paper, only the illustrative classes are shown.
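As a rough illustration of the implementation idea, the following sketch extracts a course unit code from a well-structured HTML fragment with a regular expression and then sets the same property on both individuals created for a course (Section IV). It uses only the Java standard library. The HTML snippet, the regular expression, the class names RegexValueExtractor, CourseCodeExtractor and CourseUnitStatements, the naming of individuals and the representation of statements as strings are all assumptions made for this example; the actual tool relies on CyberNeko HTML Parser and Apache Jena, whose code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Abstract extractor: subclasses supply the regular expression to apply.
abstract class RegexValueExtractor {
    protected abstract Pattern pattern();

    // Returns the first capturing group of every match, trimmed.
    List<String> getValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern().matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }
}

// Hypothetical concrete extractor for a table row labelled "Course code".
class CourseCodeExtractor extends RegexValueExtractor {
    private static final Pattern P = Pattern.compile(
        "<td>\\s*Course code\\s*</td>\\s*<td>([^<]*)</td>",
        Pattern.CASE_INSENSITIVE);

    @Override
    protected Pattern pattern() {
        return P;
    }
}

class CourseUnitStatements {
    // Sets the same property on both individuals created for one course:
    // the specification individual and the instance individual.
    static List<String> create(String courseId, String property, String value) {
        List<String> statements = new ArrayList<>();
        for (String suffix : new String[] {"_specification", "_instance"}) {
            statements.add("ex:" + courseId + suffix + " " + property
                + " \"" + value + "\"");
        }
        return statements;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Course code</td><td> E2I01 </td></tr>";
        for (String code : new CourseCodeExtractor().getValues(html)) {
            create(code, "ects:CourseUnitCode", code).forEach(System.out::println);
        }
    }
}
```

Keeping the regular expression in the concrete subclass means that supporting another field only requires another small subclass, while the matching loop stays in one place.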

The class OWLModelManager models the OWL model manager component. The attribute model of type OntModel represents the OWL model. This class introduces a number of attributes and methods aimed at importing, loading and storing ontologies, as well as inserting instances into the ontologies.

The abstract class AbstractHTMLParser enables the extraction of instance values from the HTML files. Since the accreditation documents are well-structured, the information is extracted based upon the structure of the HTML files and by using regular expressions. For handling the structure of the HTML files, CyberNeko HTML Parser is used. If the documents were not well-structured, some form of text mining would be necessary. AbstractHTMLParser has a collection of regular expressions by which the information is extracted (represented by the attribute regex) and a file from which the information is extracted (represented by the attribute HTMLFile). This class has two private methods, getRegex and regexReplace, that implement the information extraction. The method getValues returns the list of extracted instance values. Instance value extraction is achieved by creating concrete subclasses of AbstractHTMLParser and implementing their getValues methods. For example, extracting course unit codes is achieved by implementing CourseUnitCodeParser.

The class AbstractStatementFactory, with its subclasses AbstractOntologyFragments and OntologyFragment, models the statement creator component. During ontology population it is often required to set properties for multiple individuals simultaneously. For example, when the ontology of educational courses is being populated, it is necessary to set the properties whose domains are the classes ects:CourseUnitSpecification and ects:CourseUnitInstance. The class OntologyFragment holds the individuals that make up a logical whole and form the fragment of the ontology that is currently being processed. Therefore this class has only one attribute, the collection individuals. The abstract class AbstractOntologyFragments enables the creation of all the instances. This class has a collection of ontology fragments for the ontology that is being populated. For example, when populating the ontology of educational courses, these are all the instances of ects:CourseUnitSpecification and ects:CourseUnitInstance.

VI. CONCLUSION

In this paper, a software tool for automatic population of the MLO-AD ontology is proposed. The software architecture of the tool is specified by a software component diagram. The process of MLO ontology population is then specified through an activity diagram and explained. Finally, a class diagram of the tool is presented together with the main implementation details.

Paper [8] proposed using the Semantic Web as a platform for curriculum management. Papers [9, 10, 11] propose mechanisms for automating curriculum management. In [11] the need for a flexible platform for curriculum management was identified. Paper [4] proposes a set of ontologies to be used in such a system. In [3] the architecture of a system for internationalized curriculum management was proposed. This paper continues that line of work, since it offers a practical tool that can save the time and human effort needed for an efficient internationalized curriculum management system.

Future work will be directed towards improving the current tool's features (e.g. introducing new input formats in addition to PDF, using text mining for concept and/or relation extraction) and extending it with new features for automated population of the other ontologies foreseen in the system for internationalized curriculum management.

Figure 4. MLO ontology population tool – class diagram (OWL model manager: OWLModelManager with attributes model, proxy, port, importedNameSpaces, OWLFileLocation, nameSpace and methods loadModel(), saveModel(), addStatement(Individual subject, String predicate, Individual object), getModel(); Statement creator: AbstractStatementFactory with ceateStatements(File HTMLFile, OntModel model), AbstractOntologyFragments with folderLocation and parseAll(), OntologyFragment with individuals, getIndividuals() and setIndividuals(), CourseUnits, CourseUnitCodeStatementFactory, CourseUnitTitleStatementFactory, CourseUnitLecturerStatementFactory, CourseUnitNumberOfClassesStatementFactory, CourseUnitStatusStatementFactory, CourseUnitContentStatementFactory, CourseUnitLearningOutcomeStatementFactory, CourseUnitObjectiveStatementFactory; Instance value extractor: AbstractHTMLParser with attributes regex and HTMLFile and methods getRegex(), regexReplace(), getValues(), CourseUnitCodeParser, CourseUnitTitleParser, CourseUnitLecturerParser, CourseUnitStatusParser, CourseUnitNumberOfClassesParser, CourseUnitObjectiveParser, CourseUnitLearningOutcoumeParser, CourseUnitContentParser)

ACKNOWLEDGMENT

This work is partly funded by the Grant No. III-47003 of the Ministry of Education and Science of the Republic of Serbia.

REFERENCES

[1] W. F. Pinar, "The internationalization of curriculum studies." Internationalization of curriculum studies. New York: Peter Lang (2003).
[2] J. A. Pacheco, "Curriculum Studies: what is the field today?" Journal of the American Association for the Advancement of Curriculum Studies, vol. 8, 2012.
[3] M. Segedinac, Z. Konjović, D. Surla, I. Kovačević, G. Savić. "Software platform for international curriculum communication in bologna process." XIX skup trendovi razvoja: "Univerzitet na tržištu...", Maribor, Pohorje, Slovenija, 18. - 21. 02. 2013.
[4] M. Segedinac, Z. Konjović, D. Surla, G. Savić. "An OWL representation of the MLO model." In Intelligent Systems and Informatics (SISY), 2012 IEEE 10th Jubilee International Symposium on, pp. 465-470. IEEE, 2012.
[5] G. Petasis, V. Karkaletsis, G. Paliouras, A. Krithara, E. Zavitsanos. "Ontology population and enrichment: State of the art." In Knowledge-driven multimedia information extraction and ontology evolution, pp. 134-166. Springer-Verlag, 2011.
[6] H. Alani, S. Kim, D. E. Millard, M. J. Weal, W. Hall, P. H. Lewis, N. R. Shadbolt. "Automatic ontology-based knowledge extraction from web documents." Intelligent Systems, IEEE, vol. 18, 2003, pp. 14-21.
[7] D. Celjuska, M. Vargas-Vera. "Ontosophie: A semi-automatic system for ontology population from text." In Proceedings of the 3rd International Conference on Natural Language Processing (ICON). 2004.
[8] M. Segedinac, G. Savić, Z. Konjović. "Knowledge Representation Framework for Curriculum Development." KEOD 2010 - Proceedings of the International Conference on Knowledge Engineering and Ontology Development, Valencia 2010.
[9] M. Segedinac, G. Savić, Z. Konjović. "Optimal counterexamples expectation based method for knowledge space construction." In Intelligent Systems and Informatics (SISY), 2010 8th International Symposium on, pp. 273-278. IEEE, 2010.
[10] G. Savić, M. Segedinac, Z. Konjović. "Automatic generation of ECourses based on explicit representation of instructional design." Computer Science and Information Systems/ComSIS, vol. 9, 2012, pp. 839-869.
[11] M. Segedinac, M. Segedinac, Z. Konjović, G. Savić. "A formal approach to organization of educational objectives." Psihologija, vol. 44, 2011, pp. 307-323.
