Article

A Semantic Model for Selective Knowledge Discovery over OAI-PMH Structured Resources

Arianna Becerril-García * and Eduardo Aguado-López

Dissemination of Science Research Group, School of Political and Social Sciences, Autonomous University of the State of Mexico, 50100 Toluca de Lerdo, Mexico; [email protected]
* Correspondence: [email protected]; Tel.: +52-722-555-7146

Received: 9 May 2018; Accepted: 7 June 2018; Published: 12 June 2018

Abstract: This work presents OntoOAI, a semantic model for the selective discovery of knowledge about resources structured with the OAI-PMH protocol, to verify the feasibility and account for the limitations of applying Semantic Web technologies to data sets for selective knowledge discovery, understood as the process of finding resources that were not explicitly requested by a user but are potentially useful based on their context. OntoOAI is tested with a combination of three sources of information: Redalyc.org, the portal of the Network of Journals of Latin America and the Caribbean, Spain, and Portugal; the institutional repository of Roskilde University (RUDAR); and DBpedia. Its application verifies that it is feasible to use semantic technologies to achieve selective knowledge discovery and gives a sample of the limitations of using OAI-PMH data for this purpose.

Keywords: ontologies; Semantic Web; OAI-PMH; knowledge discovery; awareness to context

1. Introduction

The World Wide Web (the Web), a gigantic community of documents, has evolved rapidly since Tim Berners-Lee introduced its concept to the world. Finding relevant material for learning, teaching, or research in the flow of existing resources and publications is becoming a major challenge for both students and scientists. In addition, the sharing of resources, resource metadata, and data across the Web is a central principle in scientific and educational contexts. Scientific collaboration has long been striving for wider reuse and sharing of knowledge and data [1].

This mass of information that constitutes the Web sometimes feels "a mile wide but an inch deep". How is a more integrated, consistent, and deeper Web experience built? [2]. This question is where semantics, the process of communicating information with sufficient meaning, comes in. It thus becomes possible to build intelligent applications that provide better understanding by identifying content in greater depth.

Moreover, the variety of information resources on the Web that are useful for a student, academic, teacher, or scientist is very broad, encompassing books, journal articles, reports, conference proceedings, theses, pre-prints, and data files, among others, all available through specialized portals, repositories, and databases, both commercial and non-commercial.

Since 2001, a movement has been consolidating that promotes free and unrestricted access to scientific content, especially when it has been publicly funded. This movement is called Open Access and was formalised through three statements: Budapest [3], Berlin [4], and Bethesda [5]. Repositories, portals, and journals that adhere to this movement adopt, as good practice, an interoperability protocol to exchange information and thus share communication rules and standards for structuring data.
ROARMAP, the registry of Open Access mandates, shows a growing curve in the global implementation of policies at the level of institutions and countries that encourage the adoption of Open Access. Similarly, OpenDOAR, the Directory of Open Access Repositories, shows a constant rise in the creation of repositories. The adoption of policies helps populate repositories, consolidating these platforms as the main disseminators of the knowledge generated by an institution or a country.

Information 2018, 9, 144; doi:10.3390/info9060144

www.mdpi.com/journal/information

Open Access platforms have adopted the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol as the de facto standard of communication and interoperability. The Registry of Open Access Repositories (ROAR) accounts for this, with thousands of active data providers displaying millions of articles on the Web. This shows the consolidation of repositories and the large amount of academic and scientific material available online under the philosophy of free access.

It is therefore relevant to dedicate efforts and apply new techniques such as Semantic Web technologies, which have proven their benefits in other sectors, in order to add value to the discovery of knowledge contained in these platforms, so that students, professors, and researchers obtain greater benefits when working with them.

The structuring of data with which these information resources are presented on the Web provides another semantic level in understanding their nature. Thus, software applications can use them for the benefit and improvement of information retrieval engines. Research discovery is facilitated through the navigation of content, and structured data lend themselves to a variety of applications that can use these data [6].

On the other hand, knowledge discovery (KD) is the most desirable end product of computing. Finding new phenomena or improving knowledge about them is second in value only to preserving the world and the environment [7].
Frawley, Piatetsky-Shapiro, and Matheus [8] define knowledge discovery as the non-trivial extraction of previously unknown and potentially useful implicit information from data. The challenge of extracting knowledge from data draws on statistical research, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions [9].

The OAI-PMH emerged with the goal of interoperability between platforms. In addition to being exchanged between systems, resources structured under this specification are used in search engines that allow users to query them. However, these data have untapped potential uses in applications that enable semantic knowledge discovery.

The Dublin Core Metadata Initiative (DCMI) supports the development of interoperability standards at different levels, among them a set of metadata for simple and generic descriptions, popularized as part of the OAI-PMH protocol specifications. The scope of the Dublin Core metadata specification is currently characterized in four levels of interoperability:

Level 1: Shared term definitions
Level 2: Formal semantic interoperability
Level 3: Description set syntactic interoperability
Level 4: Description set profile interoperability

The OAI-PMH protocol implements interoperability level 1 with a metadata set of 15 elements. The protocol also requires serialization in XML; it is therefore necessary to adapt these outputs to RDF/XML or some other RDF serialization format. This adaptation is a determining factor for the description of resources, since the Semantic Web relies on a generic scheme for the exchange of metadata between heterogeneous and distributed systems, on content description standards (ontologies, thematic maps, and thesauri, among others), and on a variety of protocols and standards for the exchange of information.

The overall objective of the approach presented here is to propose a semantic model for the selective discovery of knowledge about resources structured with the OAI-PMH, to verify the feasibility and give evidence of the limitations of applying Semantic Web technologies to these data sets in order to discover knowledge, a process understood as identifying resources that were not explicitly requested by a user but are potentially useful.

The OntoOAI model arises from the idea of allowing knowledge discovery through inference processes on OAI-PMH structured resources, considering a user profile to discover potentially useful information for the user. This article shows the complete model and an application case. The first findings of this model in its early stages were published in Becerril-García, Lozano-Espinosa, and Molina-Espinosa [10] and [11].

The results consist of a semantic model for selective knowledge discovery, a semantic application that enables the processing of data structured with OAI-PMH, the application of ontologies in the description and verification of the knowledge obtained from OAI-PMH resources, and inference testing mechanisms on the resultant dataset.

2. Materials and Methods

The application of Semantic Web technologies to data structured with OAI-PMH for selective knowledge discovery for a user has led to the development of a model named OntoOAI.

The following sequence of steps (Figure 1) represents the methodology used to develop the OntoOAI model:

1. Harvesting of resource metadata available through repositories that implement OAI-PMH.
2. Conversion of the results obtained from XML to RDF.
3. Enrichment of statements (obtained in Dublin Core) with additional information about authors (FOAF).
4. Validation of the resulting knowledge.
5. Triple storage.
6. Information retrieval based on inference, which provides the service to the end user. One may note that the required information input from a user interface matches the query and the contextual user information.

Figure 1. OntoOAI Methodology.
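The six methodology steps can be read as an ETL-style pipeline. The skeleton below is an illustrative sketch only; every class and method name is invented for this example and does not come from the OntoOAI implementation:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative skeleton of the OntoOAI methodology; each stage is a stub. */
public class OntoOaiPipeline {
    private final List<String> log = new ArrayList<>();

    String harvest(String baseUrl)       { log.add("harvest");  return "<oai/>"; }
    String toRdf(String xml)             { log.add("toRdf");    return "<rdf/>"; }
    String enrich(String rdf)            { log.add("enrich");   return rdf; }
    boolean validate(String rdf)         { log.add("validate"); return true; }
    void store(String rdf)               { log.add("store"); }
    String retrieve(String query, String userContext) { log.add("retrieve"); return "results"; }

    /** Runs the six stages in the order given in the methodology. */
    public List<String> run(String baseUrl, String query, String ctx) {
        String rdf = enrich(toRdf(harvest(baseUrl)));
        if (validate(rdf)) store(rdf);
        retrieve(query, ctx);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(new OntoOaiPipeline()
            .run("http://example.org/oai", "semantic web", "student-profile"));
    }
}
```

In a real implementation each stub would delegate to the components listed below (OAI-PMH harvester, XML_DC2RDF, DCFOAF), but the control flow is the same.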

The user context is an input to the information retrieval process for performing selective discovery. The developed software components are:


• OAI-PMH harvester
• XML_DC2RDF
• DCFOAF
• DCFOAF_merged.owl

This approach follows the logic of the extract, transform, and load (ETL) process: the information is collected and organised, later transformed into RDF, enriched and validated using ontologies, and loaded as triples to a data file. The methodology is designed to keep adding from 1 to n repositories, provided they are OAI-PMH compliant.

2.1. Sources of Information

By using a directory of repositories that implement OAI-PMH, such as OpenDOAR, ROAR, or OpenArchives, one can locate Web sites that meet this specification and are therefore candidates to provide information for OntoOAI. Additionally, it is important to analyse the "service providers" that, in turn, present the metadata via the protocol; using them significantly simplifies harvesting, given that a service provider already harvests a set of repositories. After selecting the sites that satisfy the protocol within these directories, it is important to locate the base URL of each, which is the input to the metadata harvesting process.

Moreover, the model includes the enrichment of metadata with additional information about the authors. This information must comply with the FOAF specification, and it can come from the social networks of researchers or from other information sources such as DBpedia. DBpedia is a collaborative effort to extract structured information from Wikipedia and Wikidata and make it available on the Web. DBpedia makes it possible to query Wikipedia in a more sophisticated way and to link different data sets to the ones obtained from Wikipedia. This initiative makes these data available through bases of triples in different formats. The DBpedia data set uses a multi-domain ontology derived from Wikipedia.

2.2. OAI-PMH Metadata Harvesting

The process of harvesting or collecting metadata consists of sending HTTP requests to data providers that implement OAI-PMH for one or more records corresponding to information resources.
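Concretely, a first ListRecords request and its follow-up requests differ only in their query parameters: per the OAI-PMH specification, follow-ups carry the resumptionToken returned in the previous response instead of the metadataPrefix. The sketch below shows only that request-building and token-extraction logic; the class name and base URL are illustrative, not taken from the harvester described next:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of OAI-PMH ListRecords request building and resumptionToken handling. */
public class OaiRequests {

    /** Build a ListRecords request; pass token = null for the first request of a harvest. */
    static String listRecordsUrl(String baseUrl, String token) {
        if (token == null)
            return baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc";
        // Follow-up requests carry only the resumptionToken, per the protocol.
        return baseUrl + "?verb=ListRecords&resumptionToken=" + token;
    }

    /** Pull the resumptionToken out of a response; null means the harvest is complete. */
    static String resumptionToken(String responseXml) {
        Matcher m = Pattern
            .compile("<resumptionToken[^>]*>([^<]+)</resumptionToken>")
            .matcher(responseXml);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(listRecordsUrl("http://example.org/oai", null));
        String batch = "<ListRecords><resumptionToken>0/100/oai_dc</resumptionToken></ListRecords>";
        System.out.println(listRecordsUrl("http://example.org/oai", resumptionToken(batch)));
    }
}
```

A full harvester loops on these two calls, fetching each URL over HTTP and writing each batch to disk until no token is returned.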
A Java application, called OAI_PMH_Harvester, was developed for this process. It works by receiving the response to the request sent to the repository:

[URLBase]?verb=ListRecords&metadataPrefix=oai_dc

The response is structured with the set of Dublin Core metadata and written to a file (ListRecords.xml), which is managed as a tree structure using the Xerces Java Parser library. The output is stored as a numerical sequence of files (ListRecords1.xml, ListRecords2.xml, ListRecords3.xml, and so on) to save each batch of metadata sent by the data providers. Some providers send batches of 20 records, others batches of 100; the configuration is particular to each repository. Thus, the harvester is able to receive varying amounts using the "resumptionToken" parameter. The harvester is also capable of performing harvests with the optional parameters of the "ListRecords" verb, such as SET, FROM, and UNTIL, to run queries filtered by set or by dates.

2.3. XML to RDF Conversion

XML_DC2RDF is the converter that takes the output of the harvester, represented as XML files with Dublin Core, and converts it to RDF. This conversion is necessary because, as confirmed in the Report on Metadata Models for Multimedia Content of the OMediaDis project (Open Platform for Content Management, Multichannel Distribution), many institutions provide access to their metadata repositories through OAI-PMH but do not make their resources accessible through dereferenceable URIs, which restricts meaning and causes access to remain restricted to metadata [12].

There are some developments available that allow the exploitation of structured data, providing them with characteristics typical of semantic applications, such as the tools tasked with converting to RDF, also known as RDFizers. These tools convert data from relational databases, CSV files, XML, and text, among others. Three of them were compared to verify the feasibility of implementation within OntoOAI: OAI2LOD Server, a development designed to present OAI-PMH metadata as Linked Data [13]; D2R [14]; and OAI2RDF [15]. However, it was decided to develop a software application to perform this task because, in addition to overcoming some drawbacks, it was more suitable for incorporating the harvesting process into the rest of the implementation.

This application is responsible for reading the metadata of the XML files obtained from OAI-PMH harvesting using the SAXBuilder API. SAXBuilder builds a JDOM document using a third-party SAX parser to handle the analysis tasks and uses an instance of SAXHandler to listen to SAX events and thus build the JDOM document content using JDOMFactory. JDOM is a Java representation of an XML document that provides a way to represent the document for easy and efficient reading, manipulation, and writing.

2.4. Integration of DC XML with FOAF

Author data are enriched to add co-author relationships and more information about each author. OntoOAI works with data sources that describe information about people when they are expressed with the FOAF specification. OntoOAI used DBpedia for this purpose, specifically the "PersonData" dataset. Thus, for each Dublin Core record obtained from the harvesting processes, information is added to the creator label corresponding to the author or co-authors of the resource.
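The conversion of Section 2.3 and the creator handling just described can be sketched together: a minimal converter that parses one oai_dc record and emits RDF/XML, typing each dc:creator as a foaf:Person. The class name and the sample record are invented for illustration and do not come from XML_DC2RDF:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative converter: one oai_dc record in, RDF/XML out. */
public class Dc2Rdf {
    static final String DC_ELEMENTS = "http://purl.org/dc/elements/1.1/";

    static String convert(String dcXml, String resourceUri) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(dcXml.getBytes(StandardCharsets.UTF_8)));
            // Output uses the dcterms namespace, as in the OntoOAI ontology work.
            StringBuilder rdf = new StringBuilder()
                .append("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n")
                .append("         xmlns:dc=\"http://purl.org/dc/terms/\"\n")
                .append("         xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n")
                .append("  <rdf:Description rdf:about=\"").append(resourceUri).append("\">\n");
            NodeList titles = doc.getElementsByTagNameNS(DC_ELEMENTS, "title");
            for (int i = 0; i < titles.getLength(); i++)
                rdf.append("    <dc:title>").append(titles.item(i).getTextContent())
                   .append("</dc:title>\n");
            // The creator element repeats once per author; each becomes a foaf:Person.
            NodeList creators = doc.getElementsByTagNameNS(DC_ELEMENTS, "creator");
            for (int i = 0; i < creators.getLength(); i++)
                rdf.append("    <dc:creator><foaf:Person><foaf:name>")
                   .append(creators.item(i).getTextContent())
                   .append("</foaf:name></foaf:Person></dc:creator>\n");
            return rdf.append("  </rdf:Description>\n</rdf:RDF>\n").toString();
        } catch (Exception e) {
            throw new RuntimeException("invalid oai_dc record", e);
        }
    }

    public static void main(String[] args) {
        String record =
            "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" " +
            "xmlns:dc=\"http://purl.org/dc/elements/1.1/\">" +
            "<dc:title>Sample article</dc:title>" +
            "<dc:creator>Doe, Jane</dc:creator><dc:creator>Roe, Richard</dc:creator>" +
            "</oai_dc:dc>";
        System.out.println(convert(record, "http://example.org/article/1"));
    }
}
```

A real converter would also map the remaining Dublin Core elements and reuse stable identifiers for authors rather than emitting anonymous foaf:Person nodes.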
The OntoOAI model expects that if an information resource has several authors, the creator label will be repeated, such that the same procedure is carried out for each author. Then, for each individual, the RDF type is specified, such as http://xmlns.com/foaf/0.1/Person, and the name (foaf:name), surname (foaf:givenName, foaf:surname), people with whom the individual has a relationship, in this case co-authorship (foaf:knows), interests (foaf:topic_interest), and date of birth (onto:birthDate) are also included. Finally, the creator metadata are obtained as follows:

The process of recognizing an author as an entity was performed using the full name of the author and exact matching. Needless to say, the problems of homonymy or polysemy (two authors sharing the same name) and synonymy (an author who publishes under different names) affect the results of this phase of the process. Therefore, there is an important area of opportunity in the unique identification of people on the web, particularly of researchers and authors of academic and scientific content. No doubt, this model can be improved if the repositories that present metadata under OAI-PMH make use of dereferenceable URIs, but this practice is not common among OAI-PMH data providers. The use of ORCID or a similar unique identifier that guarantees a URI for an author is a desirable requirement for repositories and should be exposed through OAI-PMH.

2.5. Ontological Engineering

The model verifies the resultant knowledge through an ontology that results from combining two existing ones: Dublin Core and FOAF. This process is performed by merging ontologies. Next, each ontology and the technique used for merging are described, as well as the information validation process for which this new ontology is created.
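To make the enrichment of Section 2.4 concrete, an enriched creator might look like the following Turtle fragment; the author names, URIs, and dates here are invented for illustration:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix onto: <http://dbpedia.org/ontology/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Illustrative enriched author record (all values are made up)
<http://example.org/author/jane-doe>
    a foaf:Person ;
    foaf:name "Jane Doe" ;
    foaf:givenName "Jane" ;
    foaf:surname "Doe" ;
    foaf:topic_interest "Semantic Web" ;
    onto:birthDate "1975-01-01"^^xsd:date ;
    foaf:knows <http://example.org/author/richard-roe> .  # co-author
```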


Dublin Core: the resources that feed the knowledge base are structured with DC metadata under the OAI-PMH protocol. For this work, the http://purl.org/dc/terms namespace is used.

Friend of a Friend: the second ontology in OntoOAI is FOAF, used to incorporate additional data on the authors of articles, books, and other academic resources. In combination, their social relations based on co-authorship can even be expressed to determine related resources. The FOAF namespace is used to represent information about people, such as names, birth dates, photographs, and especially other people to whom they are related in some way. It is particularly useful for representing social network data.

FOAF works under the spirit of the AAA (Anyone can say Anything about Any topic) principle. In this case, the subjects that anyone usually talks about are people. Other topics that are commonly related to what can be said about people and that are included in FOAF are organizations (to which they belong), projects (on which they work), documents (that they have created or that describe them), and many others [2].

The intention of merging the two ontologies, Dublin Core and FOAF, lies in generating a representation of the knowledge resulting from the integration of data from the OAI-PMH repositories (Dublin Core) and other sources of information that enrich the author data (FOAF), with the goal of building a model to verify the consistency of this integration. The proposed ontological model is illustrated in the following graph. With it, one can model the properties of an information resource (e.g., book, article) with its author or authors; for example, a researcher is an author of a publication (dc:creator) but is also a person (foaf:Person) with individual properties. Likewise, the relationship of authorship between a publication and a researcher is represented, along with the co-authorship of a researcher with one or more others with whom they write a publication (Figure 2).

Figure 2. Authorship and co-authorship relations in merging ontologies.

In general, operations with ontologies are non-trivial tasks that cannot be fully resolved automatically due to a variety of factors, such as insufficient specification of an ontology for finding similarities with another ontology. Therefore, they are usually performed manually or semi-automatically, where a tool helps to find possible relationships between elements of different ontologies, but the final confirmation of the relationship is left to an expert, who will make decisions based on the natural-language description of the elements of the ontology and common sense.

With the help of the Protégé tool and its "Refactor > Merge Ontologies" function, the merging of the two input ontologies was performed. As Ameen, Rahman, and Rani [16] mention, this automatic integration does not resolve the inconsistencies generated after the process. Therefore, the ontology output was subjected to further refinement following the steps of the merging algorithm proposed by the same authors (Figure 3).


Figure 3. OntoOAI ontology merger algorithm based on [16].

The automatic process in Protégé identified classes that were identical in name (rdfs:Class and Agent) and performed the merger. However, in the case of "BibliographicResource" and "Document", it did not perform a merger because the names did not match, and thus the identification had to be made explicit manually, as shown in Figure 4.

The class hierarchy of the resulting ontology is shown in Figure 5. In addition, it was necessary to identify other similarities to declare additional equivalencies and restrictions, as well as to resolve inconsistencies.

Similarly, it was necessary to analyze the object properties to identify semantic matches, for which it is also important to check and match ranges and domains. Regarding these properties, the Dublin Core creator property (http://purl.org/dc/terms/creator) was defined as similar to the FOAF maker property (http://xmlns.com/foaf/0.1/maker), as shown in Figure 6.

   

Figure 4. Declaration of "Document" and "BibliographicResource" equivalence.


Figure 5. Class hierarchy of the merged ontology.

Figure 6. Equivalency of the "Creator" and "Maker" properties.
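In OWL terms, the manual declarations shown in Figures 4 and 6 amount to equivalence axioms. A Turtle sketch of those two axioms (a fragment, not the full merged ontology):

```turtle
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# Classes merged manually because their names differ (Figure 4)
dcterms:BibliographicResource owl:equivalentClass foaf:Document .

# Object properties identified as semantic matches,
# after checking ranges and domains (Figure 6)
dcterms:creator owl:equivalentProperty foaf:maker .
```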

The resulting ontology, called "DCFOAF_merged.owl", is verified using a reasoner; given that the stated and inferred models are similar, the ontology is considered consistent. The metrics of the resulting ontology are shown in Table 1. Thus, this ontology verifies the consistency of the RDF/XML files obtained in the previous phases.

Table 1. Metrics of the resulting ontology, "DCFOAF_merged.owl".

Classes: 42
Object properties: 66
Data properties: 40



2.6. Storage

OntoOAI uses TDB of the Jena framework as a triple store. Once the data are collected, processed, enriched, and validated, they pass to the triple store. Upon completion of these processes, a set of RDF graphs, serialized as RDF/XML files, is obtained. Each file describes a predetermined number of information resources, which depends on the configuration of the source repository; as will be seen in the case study, each harvested repository provides a different number of resources per file.

Each of these resources is described by the 12 Dublin Core metadata elements which, when repeatable, can represent at least 12 elements per resource; these make up the properties of each resource, along with the properties included in the stages of conversion to RDF and the enrichment of authors. Each of these properties gives rise to a triple within TDB. Triple loading is performed using the Jena TDBLoader, which allows simultaneous bulk loading and the generation of an index. It performs bulk loading operations more efficiently than simple RDF reading in a TDB model.

To make this loading more efficient, shell scripts were programmed that run directly on TDB, a script for each harvested repository, with the number of loading statements corresponding to the number of collected files, as discussed in the case study. The following diagram shows the process that is followed (Figure 7).

Figure 7. Triple storage process.
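The per-repository loading scripts described above can be sketched as a small generator. This is a minimal illustration, assuming RDF/XML files named by repository and sequence number and Jena's tdbloader command-line tool on the PATH; the file layout and dataset location are hypothetical, and the exact invocation may differ between Jena versions.

```python
from pathlib import Path

def build_tdbloader_script(tdb_dir: str, rdf_files: list) -> str:
    """Generate a shell script that bulk-loads each harvested RDF/XML
    file into a TDB dataset, one loading statement per collected file."""
    lines = ["#!/bin/sh"]
    for rdf in rdf_files:
        # assumes Jena's tdbloader is available on the PATH
        lines.append(f"tdbloader --loc {tdb_dir} {rdf}")
    return "\n".join(lines) + "\n"

# Hypothetical example: three files harvested from one repository
files = [f"redalyc_{i:05d}.rdf" for i in range(1, 4)]
script = build_tdbloader_script("tdb/ontooai", files)
```

One script per repository keeps reload and retry granularity at the repository level, matching the batch design described in the text.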

2.7. Context Awareness

Context awareness is the ability of OntoOAI to perceive information about the user environment and provide results based on this information. The fundamental concept is to infer non-explicit situations and thus manifest intelligent behaviour. Therefore, a profile is associated with a user that consists of a set of features, some more dynamic than others, to represent the circumstances of academic, personal, and even social space-time. Thus, a student profile contains identification data, curriculum, period, and registered courses, among others.

For this work, the student representation model based on ontologies for intelligent tutoring systems for distance learning of Panagiotopoulos, Kalou, Pierrakeas, and Kameas [17] was adapted, using the following four classes:

• "Student", which identifies any student.
• "StudentCourseInformation", which includes information relevant to the educational process, such as modules for the program of study, school, homework, tests, and so on.
• "StudentCurrentActivity", which refers to the details of the academic activity of the course year.
• "StudentPersonalInformation", which is the static and permanent student information.


Information 2018, 9, 144


The ontology proposed by these authors was not located, and it did not exactly meet the needs of OntoOAI; accordingly, a new ontology called OntoOAIStudent was designed, based on the ontology proposed by those authors but with a new design.

The user context consists of the information known in advance and determined in a static manner based on the OntoOAIStudent model. The information comes from the school or school control offices, which could provide data on students for retrieving information relevant to their studies. The information is automatically imported from the school database to the OntoOAI database in an initial load through CSV files. This representation considers information independent of the application and does not include information resulting from the user's interaction with the application.

To create the ontology that describes the student, the four major classes of [17] are adopted, but with the following hierarchy of classes (Figure 8).

Figure 8. Class hierarchy of OntoOAIStudent ontology.

The user profile is grouped into four main classes (Figure 9), as follows:

1. "Estudiante", which represents a student.
2. "Cursos", which describes the classes that the student takes, the school where the student is enrolled, and the program and program level (BA, MA, PhD).
3. "ActividadActual", which describes the current period that the student is enrolled in, previous experience, course goals, module, and period.
4. "InformacionPersonal", which represents accessibility, demographics, and student motivation.

Finally, the user profile represented in the ontology is among the inference engine inputs described below. While the OntoOAI model can be useful in discovering knowledge for teachers, the user profile used to test the model focuses on the student. However, it is easily expandable to teachers, academics, or researchers and only requires changing the representation of the user profile to one suitable for these types of users.
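The four profile classes and the initial CSV load can be mirrored in a small data model. The sketch below is illustrative only, assuming a handful of fields per class (the actual OntoOAIStudent ontology defines more properties) and a hypothetical CSV layout for the export from the school offices.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Cursos:
    school: str
    program: str
    level: str          # BA, MA, or PhD

@dataclass
class ActividadActual:
    period: str
    module: str

@dataclass
class InformacionPersonal:
    language: str

@dataclass
class Estudiante:
    student_id: str
    name: str
    cursos: Cursos
    actividad: ActividadActual
    personal: InformacionPersonal

def load_profiles(csv_text: str) -> list:
    """Initial static load of user contexts from a school-office CSV export."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        Estudiante(
            student_id=r["id"],
            name=r["name"],
            cursos=Cursos(r["school"], r["program"], r["level"]),
            actividad=ActividadActual(r["period"], r["module"]),
            personal=InformacionPersonal(r["language"]),
        )
        for r in rows
    ]

# Hypothetical single-row export
sample = ("id,name,school,program,level,period,module,language\n"
          "s1,Ana,UAEM,Sociology,BA,2018A,Methods,es\n")
profiles = load_profiles(sample)
```

In the running system these records would be asserted as individuals of the OntoOAIStudent ontology rather than kept as plain objects.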


Figure 9. Graph of classes of the "OntoOAIStudent.owl" ontology.

2.8. Inference Engine and Knowledge Discovery

OntoOAI treats knowledge discovery as the nontrivial extraction of implicit, previously unknown, and potentially useful information, a concept defined by Frawley, Piatetsky-Shapiro, and Matheus [8]. Returning to the process of knowledge discovery defined by these authors, but considering that it is conducted through logical reasoning using inference techniques, we have the following:

Given a set of facts F (a knowledge base composed of triples obtained from the harvesting and integration), a language L (OWL), and some degree of certainty C (considered "true" for the set of triples inferred considering the previously defined user profile), a pattern is a statement S in L that describes relationships between a subset FS of F with a certainty c such that S is simpler (in some sense) than the enumeration of all the facts in FS. A pattern that is interesting (as it derives from facts) and sufficiently certain (because information is derived considering the user profile) is called knowledge.

Given the defined process of knowledge discovery, an application for performing it was designed. Its architecture and the interaction between the components are shown in the following figure.


It is important to remember that the required input information corresponds to the search terms of a user and to the context expressed in a profile, as explained in the previous section. Said input flows through a rule-based reasoner that processes a query on a database of triples using the ontology API of the Jena Framework of the Apache Software Foundation (Figure 10).

Figure 10. Architecture of the OntoOAI inference engine.

The Jena inference subsystem is designed to permit a range of inference engines or reasoners to be connected with Jena. Such engines are used to derive additional RDF statements from an RDF base, together with optional ontological information and the axioms and rules associated with the reasoner [18]. The Jena engine infers logical consequences from a set of explicitly determined facts or axioms and typically provides automatic support for reasoning tasks such as classification, debugging, and query [19].

For the purpose of experimenting with this development, the included OWL reasoner was used, which is an implementation based on the rules of OWL/Lite. The query process occurs in real time on data previously collected from batch processing, giving as a result the set of inferences corresponding to information resources, prepared as an input for the applications that are responsible for end-user visualization.

The inference subsystem is designed to derive a set of declarations based on the database of facts from the OAI-PMH resource harvesting process, the context information from the user, and ontological information.

The inference engine is based on the assumption that if an information resource is interesting for the user, then the resource's author is also relevant; consequently, other works whose author, subject, or title correspond to that author will also be of interest to the user. This way of linking resources was selected to run the inference engine because this information is included in the Dublin Core specification.
Additional information such as references or affiliations is not considered in the Dublin Core format of the OAI-PMH data providers. If it were, then this model could be expanded to include inferences based on more data.

The inference process, which follows a deductive reasoning modus ponens, or forward chaining, is based on the following syllogism:

Given the functions

f(a) = Ra, which defines the authorship or authorial contribution of a resource r written by an author a,

f(a) = Ta, which defines the participation of an author a in a resource t, where participation means that the author is the subject, title, or author of the resource,

f(b) = Tb, which defines the participation of an author b in a resource t, where participation means that the author is the subject, title, or author of the resource,

where:

R = {r | r ∈ R}, defined by the information resources contained in the knowledge base,

A = {a | a ∈ A}, defined by the authors or collaborators of information resources contained in the knowledge base,

T = {t | t ∈ T}, defined by the information resources contained in the knowledge base that were written by a certain group of authors,

B = {b | b ∈ B}, defined by the authors or collaborators of a given group of information resources contained in the knowledge base,

we have that:

If (the resource r is interesting for the user)
Then (the author a of resource r is interesting to the user).

If (the author a of resource r is interesting to the user)
Then (other works Ta whose author, subject, or title correspond to the author a are interesting to the user).

If (the works Ta whose author, subject, or title correspond to the author a are interesting to the user)
Then (the set of authors or collaborators B of resources Ta is interesting to the user).

If (the set of authors or collaborators B of resources Ta is interesting to the user)
Then (the set of resources Tb written by authors B is interesting to the user).

The sets of resources Ta and Tb represent the knowledge discovered according to the search and context:

Knowledge discovered = Ta ∪ Tb
|Ta ∪ Tb| = total resources discovered for the user

The discovered knowledge forms the output for the user, in a table format with metadata and links to the full text of each resource obtained. This output can also be visualised in other ways, such as network-type output and hierarchical visualisation, where one can see the relationships between resources, thereby enriching the user experience.
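The syllogism can be sketched as a small forward-chaining step over Dublin Core-style records. The record structure and field names below are illustrative assumptions, not the actual OntoOAI triple store or Jena rule syntax.

```python
def discover(seed, records):
    """Forward chaining: from a seed resource judged interesting,
    derive Ta (works linked to the seed's authors) and Tb (works by
    the collaborators found in Ta)."""
    def mentions(rec, author):
        # participation: the author appears as creator, subject, or in the title
        return (author in rec["creators"]
                or author in rec["subjects"]
                or author in rec["title"])

    authors_a = set(seed["creators"])
    ta = [r for r in records
          if r is not seed and any(mentions(r, a) for a in authors_a)]
    authors_b = {c for r in ta for c in r["creators"]}
    tb = [r for r in records
          if r not in ta and r is not seed
          and authors_b.intersection(r["creators"])]
    return ta, tb

# Hypothetical toy knowledge base
records = [
    {"title": "Complexity", "creators": ["Morin"], "subjects": []},
    {"title": "On Morin", "creators": ["Lopez"], "subjects": ["Morin"]},
    {"title": "Method", "creators": ["Lopez", "Diaz"], "subjects": []},
]
ta, tb = discover(records[0], records)
knowledge = ta + tb   # Knowledge discovered = Ta ∪ Tb
```

Here the second record enters Ta because it mentions the seed's author as a subject, and the third enters Tb because it shares a creator with Ta.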
This for output can also visualised other ways, such The forms the output the user, in abetable formatinwith metadata and to theto search and |Ta ∪ Tb| = Total resources discovered for the user as network output andresource hierarchical visualisation, where the relationships between links to the type full text of each obtained. This output canone alsocan be see visualised in other ways, such Knowledge Knowledge discovered discovered== Ta Ta ∪∪ Tb Tb resources, thereby enriching user experience. as network type output andthe hierarchical where seeformat the relationships between The discovered knowledge forms thevisualisation, output for the user, one in acan table with metadata and |Ta Tb| ==Total Total resources resources discovered discovered for for the the user user |Ta ∪ ∪ Tb| resources, the user experience. links to the thereby full text enriching of each resource obtained. This output can also be visualised in other ways, such The knowledge forms the output user, in aa table with and 2.9.network Batch Processing and Real-Time Processing as type output and hierarchical visualisation, where one see format the relationships between The discovered discovered knowledge forms the output for for the the user, incan table format with metadata metadata and links to the full text of each resource obtained. This output can also be visualised in other ways, such 2.9. Batch Processing and Real-Time Processing resources, thereby enriching the user experience. links to the full text of each resource obtained. 
This output can also be visualised in other ways, such The nature of the operation of OAI-PMH requires collecting data repositories in batch processes as type output and visualisation, where can see between as network network typeof output and hierarchical hierarchical visualisation, where one onedata canrepositories see the the relationships relationships between for theThe following reasons: nature the operation of OAI-PMH requires collecting in batch processes resources, thereby enriching the user experience. 2.9. Batch Processing and Real-Time Processing resources, thereby enriching the user experience. for the following reasons: (a) Metadata harvest time depends on the response times of individual repositories. Additionally, The nature of the operation ofProcessing OAI-PMH requires collecting data repositories in batch processes 2.9. Processing and Real-Time the total metadata collection time to times the operation moderepositories. of the harvester, either (a) Batch Metadata harvest depends on is thesubject response of individual Additionally, 2.9. Batch Processing andtime Real-Time Processing for the following reasons: sequential or parallel, for all repositories that aretotocollecting be harvested. mode, iteither will the total metadata collection time is subject the operation Thus, mode for of the thefirst harvester, The The nature nature of of the the operation operation of of OAI-PMH OAI-PMH requires requires collecting data data repositories repositories in in batch batch processes processes be the sum total of the response times of all repositories, and for the parallel mode, it depends sequential or parallel, for all repositories that are to be harvested. Thus, for the first mode, it will (a) Metadata harvest time depends on the response times of individual repositories. 
Additionally, for for the the following following reasons: reasons: on number of of repositories that are simultaneously and theharvester, sloweritresponse be the the sum total the response times of allharvested repositories, and for the parallel mode, depends the total metadata collection time is being subject to the operation mode of the either (a) Metadata harvest time on the response of Additionally, time of number each processing thread. on the of repositories that harvested simultaneously and the slower response or parallel, fordepends all repositories that are totimes be harvested. Thus,repositories. for the first mode, it will (a) sequential Metadata harvest time depends onare thebeing response times of individual individual repositories. Additionally, the total metadata collection time is subject to the operation mode of the harvester, either (b) be The availability of the repositories at the time oftometadata collection processes can itprevent a time of each processing thread.time thetotal sum total of response times all repositories, and for mode the parallel depends the metadata collection is of subject the operation of themode, harvester, either sequential orparallel, parallel, forallallrepositories repositories that to harvested. be harvested. forfirst the first mode, repository from located. Thus, a batch, itare is possible to perform attempts to reconnect toa (b) on The of repositories at the time of metadata collection processes can prevent theavailability number ofbeing repositories that arein being harvested simultaneously and slower response sequential or for that are to be Thus,Thus, for the mode, it will it will be the sum of the Thus, response of allpossible repositories, andparallel for the mode, parallel it the repository without impacting the in end time. repository from being atimes batch, it is tofor perform attempts to reconnect to time of each processing thread. be the sum total oftotal thelocated. 
response times of all repositories, and the it mode, depends depends on thewithout number of repositories that are harvested simultaneously and prevent the slower thethe repository impacting time. (b) The availability ofrepositories repositories atthe theend time of being metadata collection processes can a on number of that are being harvested simultaneously and the slower response Thus, batch processing and central storage of the results of the collection allow the handling of response time of each processing thread. repository from being located. Thus, in a batch, it is possible to perform attempts to reconnect to time of each processing thread. large Thus, volumes of data and the construction of applications and/or services based on integrated and batch processing and central storage of the results of the collection allow the handling of (b) availability at the without impacting the endtime time.of (b) the Therepository of repositories time of metadata metadata collection collection processes processes can can prevent prevent aa available data. The query is based on a set of parameters enteringbased an inference engineand to large volumes of data andprocess the construction of applications and/or services on integrated repository from from being being located. located. Thus, Thus, in in a batch, it is possible to perform perform attempts attempts to to reconnect reconnect to to return a result, itquery is performed in based real time. begins once is received and concludes available data. The process is on of aItset of parameters entering an inference engineofto Thus, batchand processing and central storage the results of the the input collection allow the handling the repository without impacting the end time. with thea result, sending of itthe output results. Inof this way, this once information engine data return is performed in real time. 
It begins theservices inputretrieval isbased received and uses concludes large volumes ofand data and the construction applications and/or on integrated and collected theThe harvesting program have been previously enriched, and storeddata in withThus, theby sending of the output results. Inonthis way, this information retrieval engine uses available data. query process is that based aof set ofresults parameters anallow inference engine to Thus, batch processing and central central storage of the results oftransformed, theentering collection allow the handling of batch processing and storage the of the collection the handling of alarge centralized manner. collected by the harvesting program been previously transformed, enriched, and stored in return a result, and it is performed inthat realhave time. It begins once the input is based received concludes large volumes of data and the construction ofapplications applications and/or services based onand integrated and volumes of data and the construction of and/or services on integrated and a centralized manner. with the sending of query the output results. In on thisaaway, information retrieval engine uses data available data. The set of ofthis parameters entering an inference inference engine to available data. process is based set parameters entering an engine to 3. Experimental Results collected by the and harvesting program in that have been previously enriched, andconcludes stored in return result, and performed in real real time. begins once oncetransformed, the input input is is received received and and concludes return aa result, itit isisperformed time. ItIt begins the 3.centralized Experimental Results awith manner. with the sending sending of the the output output results. results. 
In this way, way, this this information information retrieval retrieval engine engine uses uses data data the of Sources of Information collected by the harvesting program that have been previously transformed, enriched, and stored in of Information 3.aSources Experimental Resultsof experimentation and validation of the proposed model, two repositories centralized manner. For the purposes Information 2018, x FOR PEER by REVIEW of 22 TA=={t|t ∊ 9, T}, defined resources contained in the knowledge base that 13 were {a|a A}, defined bythe theinformation authors or collaborators of information resources contained in the

were located: Redalyc.org, the portal of the Network of Journals of proposed Latin America andtwo the repositories Caribbean, For the purposes of experimentation and validation of the model, Sources of Information 3. Experimental Results the portal of the Network of Journals of Latin America and the Caribbean, were located: Redalyc.org, For the purposes of experimentation and validation of the proposed model, two repositories
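The two harvester modes described in (a) can be contrasted with a back-of-the-envelope estimate: sequential time is the sum of response times, while parallel time approaches the makespan over worker threads. The worker count and per-repository times below are hypothetical.

```python
import heapq

def sequential_time(response_times):
    """Sequential mode: total time is the sum of all repository response times."""
    return sum(response_times)

def parallel_time(response_times, workers):
    """Parallel mode: approximate the makespan with a greedy longest-first
    assignment of repositories to harvesting threads."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(response_times, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + t)
    return max(loads)

times = [4.0, 2.0, 3.0, 1.0]   # hypothetical per-repository hours
seq = sequential_time(times)    # 10.0 hours in sequential mode
par = parallel_time(times, 2)   # 5.0 hours with two threads
```

The parallel figure is bounded below by the slowest single repository, which is why the text notes that the slowest response time of each processing thread dominates.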

Information 2018, 9, 144

14 of 22

collected by the harvesting program that have been previously transformed, enriched, and stored in a centralized manner. 3. Experimental Results Sources of Information For the purposes of experimentation and validation of the proposed model, two repositories were located: Redalyc.org, the portal of the Network of Journals of Latin America and the Caribbean, Spain, and Portugal [20], and the institutional repository of Roskilde University (RUDAR—Roskilde University Digital Archive), Denmark. Redalyc, available at http://www.redalyc.org, is a system of scientific information that currently maintains a collection of more than 500,000 full-text articles from more than one thousand peer-reviewed and open access scientific journals, edited by more than 600 institutions in Latin America. Redalyc also allows reading and downloading items and provides indicators for measuring scientific production for countries, institutions, journals, and authors. Redalyc was selected first because it is one of the largest and most representative sources of scientific content in Latin America and second because the authors of this work have participated in this project since its inception, and thus, the experience and full knowledge of this project is gathered for the implementation of OntoOAI. The RUDAR repository, at http://rudar.ruc.dk/, is the information system that collects, preserves, disseminates, and provides access to the intellectual and academic production of the University of Roskilde, a public university in Denmark founded in 1972. This repository mainly concentrates research products in digital form, including preprints, books, student reports, technical reports, articles, theses, conference proceedings, and images. RUDAR was selected for the implementation of OntoOAI considering that this repository appears in OpenDOAR and therefore is displayed as a site that implements the OAI-PMH protocol. 
Thus, just as RUDAR was selected, any other repository that complies with this minimum condition could have been selected.

URLBase Redalyc: http://oai.redalyc.org/redalyc/oai
URLBase RUDAR: http://diggy.ruc.dk:8080/dspace-oai/request

Using the "OAI_PMH_Harvester" tool developed for OntoOAI, metadata collection processes were performed for each of the two repositories. Table 2 shows the results. From Redalyc, 19,027 XML files were obtained, corresponding to 380,540 recovered items, given that each XML file contains 20 articles. In the case of RUDAR, 154 files were recovered with 100 resources each, for a total of 15,400, including books, theses, and other types of content.

Table 2. Results of the OAI-PMH harvesting process.

Repository   Total XML Files Obtained   Total Dublin Core Resources   Harvest Time   Harvest Size
Redalyc      19,027                     380,540                       45.6 h         762.9 MB
RUDAR        154                        15,400                        1.2 h          28.7 MB
Total        19,181                     395,940                       46.8 h         791.6 MB
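OAI-PMH delivers ListRecords responses in pages joined by resumptionToken values, so a harvester like the one used here is essentially a token-following loop. A minimal sketch of that logic, assuming the helper names are ours and the embedded response is illustrative rather than real repository output:

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_records(xml_text):
    """Extract the records and the resumptionToken from one ListRecords page."""
    root = ET.fromstring(xml_text)
    records = root.findall(f".//{OAI_NS}record")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return records, token

def harvest(base_url):
    """Yield records page by page, following resumptionToken until exhausted."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    while url:
        with urllib.request.urlopen(url) as resp:
            records, token = parse_list_records(resp.read())
        yield from records
        url = base_url + "?verb=ListRecords&resumptionToken=" + token if token else None

# Illustrative (not real) response page with one record and a continuation token:
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:redalyc.org:10504408</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(sample)
print(len(records), token)  # 1 page-2
```

Against the URLBase endpoints above, a run would then be `for rec in harvest("http://oai.redalyc.org/redalyc/oai"): ...`, with each page of 20 (Redalyc) or 100 (RUDAR) records saved to an XML file.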

One must remember that to raise the data to another semantic level, it is necessary to begin with the conversion of the XML files harvested from RUDAR and Redalyc to RDF. For this purpose, the converter developed for OntoOAI called XML_DC2RDF was used. Subsequently, the data from the authors are enriched to add more information that appears in data sources other than the harvested repositories (Figure 11). This enrichment is performed following the FOAF specification and by obtaining data from DBPedia, particularly the “PersonData” dataset, as described before.
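The conversion performed by XML_DC2RDF can be approximated as a mapping from each oai_dc element to one RDF statement. The sketch below (the helper name and the subject URI base are ours, chosen for illustration) emits N-Triples lines using only the standard library:

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def dc_record_to_ntriples(subject_uri, dc_xml):
    """Emit one N-Triples line per Dublin Core element found in the record."""
    lines = []
    for el in ET.fromstring(dc_xml):
        if el.tag.startswith(DC) and el.text:
            prop = "http://purl.org/dc/elements/1.1/" + el.tag[len(DC):]
            obj = el.text.strip().replace('"', '\\"')  # escape literal quotes
            lines.append(f'<{subject_uri}> <{prop}> "{obj}" .')
    return lines

record = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>The Emergence of Sense from Non-Sense</dc:title>
  <dc:creator>Edgar Morin</dc:creator>
</oai_dc:dc>"""

for t in dc_record_to_ntriples("http://example.org/oai:redalyc.org:10504408", record):
    print(t)
```

The real converter additionally validates the result against the merged ontology; this fragment only shows the element-to-triple mapping itself.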


An example of matches is presented below: the author "Edgar Morin", who has four items in Redalyc, had his data enriched with information from DBpedia corresponding to the data in the author's Wikipedia biography.

Figure 11. Example of enriched RDF.

If the participant OAI-PMH data providers exposed a URI for each author, this phase could be significantly improved. Harnessing initiatives such as the international researcher identifier provided by ORCID is important. It is noteworthy that ORCID and Redalyc are interoperable, although ORCID currently does not display author data for the Semantic Web like Wikipedia does. All resources harvested by OAI-PMH from RUDAR and Redalyc were subjected to this process of author enrichment, with a total runtime of 36 h 34 min. After the enrichment phase, each RDF obtained was subjected to a verification process using an application in Jena to ensure compliance with the data model described by the "DCFOAF_merged.owl" ontology described above.

The final results of this stage of enrichment and verification show that 13.8% of the resources harvested from the repositories had at least one author matching individuals in "PersonData", for a total of 60,927 authors with matches (Table 3). This result shows that there are resources associated with more than one author. Some items could not be included in the resulting RDF files because they failed verification against the ontology, as some resources omitted essential metadata, such as title or date, or were malformed.

Table 3. Results of the process of author enrichment.

Repository   Resources Described   Resources with Authors in DBPedia   Authors Localised
Redalyc      380,534               52,251 (13.7%)                      57,918
RUDAR        14,885                2550 (17.13%)                       3009
Total        395,419               54,801 (13.8%)                      60,927
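The matching step behind these figures can be sketched as a name lookup against "PersonData" that, on success, attaches FOAF and DBpedia properties to the author. The dictionary below is a hypothetical local stand-in for the full dataset; in OntoOAI the lookup runs against the harvested DBpedia "PersonData" snapshot:

```python
# Hypothetical local snapshot of DBpedia "PersonData" entries.
person_data = {
    "Edgar Morin": {
        "uri": "http://dbpedia.org/resource/Edgar_Morin",
        "birthDate": "1921-07-08",
    },
}

def enrich_authors(creators):
    """Return FOAF/DBpedia triples for every creator found in PersonData."""
    triples = []
    for name in creators:
        match = person_data.get(name.strip())
        if match is None:
            continue  # author absent from PersonData: resource stays un-enriched
        uri = match["uri"]
        triples.append((uri, "http://xmlns.com/foaf/0.1/name", name))
        triples.append((uri, "http://dbpedia.org/ontology/birthDate", match["birthDate"]))
    return triples

print(enrich_authors(["Edgar Morin", "Unknown Author"]))
```

The 13.8% match rate reflects how many harvested dc:creator values survive this kind of exact-name lookup; exposing author URIs at the providers, as noted above, would remove the need for name matching altogether.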

The loading of triplets to TDB was performed in bulk and took 8.97 h in total. The results showed 7,917,081 triplets stored in the triplet store. This knowledge base uses 522.54 GB of disk space. The resulting composition of the OntoOAI base is shown in Table 4.

Table 4. Results of loading to TDB.

Property                                          Triplets
http://xmlns.com/foaf/0.1/Person                  60,354
                                                  394,776
http://purl.org/dc/elements/1.1/source            379,966
http://www.w3.org/1999/02/22-rdf-syntax-ns#type   60,354
http://purl.org/dc/elements/1.1/date              394,775
http://xmlns.com/foaf/0.1/name                    60,354
http://purl.org/dc/elements/1.1/format            385,975
http://purl.org/dc/elements/1.1/description       361,577
http://purl.org/dc/terms/modified                 394,772
http://dbpedia.org/ontology/birthDate             60,354
http://purl.org/dc/elements/1.1/type              405,463
http://purl.org/dc/elements/1.1/title             394,776
http://purl.org/dc/elements/1.1/publisher         382,412
http://purl.org/dc/terms/isPartOf                 394,780
http://purl.org/dc/elements/1.1/subject           1,600,540
http://purl.org/dc/elements/1.1/rights            379,966
http://xmlns.com/foaf/0.1/surname                 60,354
http://xmlns.com/foaf/0.1/givenName               60,354
http://purl.org/dc/elements/1.1/relation          381,980
http://purl.org/dc/elements/1.1/creator           968,903
http://purl.org/dc/elements/1.1/language          394,650

The context awareness of OntoOAI is based on the profile of a user who describes their academic data according to the OntoOAIStudent ontology. To describe the context of the users in the case study, two profiles were defined: the first was identified with "itesm:A01210238", an identifier consisting of the school initials and the student registration number (Figure 12).

Figure 12. Profile of user itesm:A01210238.

The second profile (Figure 13) corresponds to a student of a German university identified with "univ:DL9510078".


Figure 13. Profile of user univ:DL9510078.
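In code, such a profile amounts to a handful of triples about the student. The property names below are hypothetical placeholders (the actual terms are defined by the OntoOAIStudent ontology shown in the figures); only the identifier scheme of school initials plus registration number is taken from the text:

```python
# Hypothetical namespace and property names standing in for OntoOAIStudent terms.
BASE = "http://example.org/ontooai-student#"

def build_profile(student_id, program, interests):
    """Represent a student profile as (subject, property, value) triples."""
    return [
        (student_id, BASE + "enrolledIn", program),
        *[(student_id, BASE + "interestedIn", topic) for topic in interests],
    ]

profile = build_profile("itesm:A01210238", "Sociology", ["complexity"])
print(len(profile))  # 2
```

As the Conclusions note, this profile is static: it is fixed at design time, so any change of program or interests requires rebuilding the triples rather than updating them at runtime.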

Once all previous steps were performed and the triplet store was ready for the reasoner, the previously defined user profiles were used, and a pair of exercises in selective knowledge discovery were documented.

The experiment shown in this section was performed considering the user with the profile "itesm:A01210238". A search for "complexity" was tested, with the term introduced as such by the user, which is where OntoOAI began to be tested.

First, it is important to remember that the search is context-sensitive, which means that if a student of computer science or biology performs the same search, the results will be different because "complexity" in computing is approached from a perspective different from the sociological one.

Now, the process performed to obtain this result will be explained. First is a search based on matches that yields results in a scientific paper called "The Emergence of Sense from Non-Sense" by the author Edgar Morin, a French philosopher and sociologist (the latter information was obtained from Wikipedia along with other data about the author), published in the journal Convergence in 2007 and whose themes (described as dc:subject) are "sense of the world", "non-sense", "Sociology", and "complexity".

The diagram in Figure 14 shows the reasoning behind the output displayed to the user. As previously explained, the resource "oai:redalyc.org:10504408", referring to the article called "The Emergence of Sense from Non-Sense", is located due to the similarities in the topics it treats, consistent with the theme of "complexity", and the fit with the user program expressed in the profile, in this case "sociology". The reasoning follows the syllogism explained below:

If (the resource "oai:redalyc.org:10504408" is interesting to the user)
Then (the author "Edgar Morin" is interesting to the user).
If (the author "Edgar Morin" is interesting to the user)
Then (other works whose author, subject, or title match "Edgar Morin" are interesting to the user).
If (works Ta are interesting to the user)
Then (the set of authors or contributors B of resources Ta is interesting to the user).
If (the set of authors B of resources Ta is interesting to the user)
Then (the set of resources Tb written by authors B is interesting to the user),
where:
r = "oai:redalyc.org:10504408",
a = "Edgar Morin",


Ta = {"oai:redalyc.org:8601934800", "oai:redalyc.org:27903809", "oai:redalyc.org:49615201", "oai:rudar.ruc.dk:1800/14983", "oai:rudar.ruc.dk:1800/26454"},
B = {"Michel Olsen", "Anders Sondergaard Skovbakke", "Matias Find", "Mads Christiansen", "Allan Dreyer Hansen"},
Tb = {"oai:rudar.ruc.dk:1800/12794", "oai:rudar.ruc.dk:1800/24877", "oai:rudar.ruc.dk:1800/16477", "oai:rudar.ruc.dk:1800/25805", "oai:rudar.ruc.dk:1800/25386", "oai:rudar.ruc.dk:1800/23930", "oai:rudar.ruc.dk:1800/15047", "oai:rudar.ruc.dk:1800/7734", "oai:rudar.ruc.dk:1800/18163", "oai:rudar.ruc.dk:1800/18070", "oai:rudar.ruc.dk:1800/18892", "oai:rudar.ruc.dk:1800/13812", "oai:rudar.ruc.dk:1800/24240", "oai:rudar.ruc.dk:1800/25873", "oai:rudar.ruc.dk:1800/15987", "oai:rudar.ruc.dk:1800/4230", "oai:rudar.ruc.dk:1800/4231", "oai:rudar.ruc.dk:1800/24002", "oai:rudar.ruc.dk:1800/6878", "oai:rudar.ruc.dk:1800/24581", "oai:rudar.ruc.dk:1800/17313", "oai:rudar.ruc.dk:1800/3831", "oai:rudar.ruc.dk:1800/3830", "oai:rudar.ruc.dk:1800/3970", "oai:rudar.ruc.dk:1800/6613", "oai:rudar.ruc.dk:1800/23967", "oai:rudar.ruc.dk:1800/14741", "oai:rudar.ruc.dk:1800/23613", "oai:rudar.ruc.dk:1800/14129", "oai:rudar.ruc.dk:1800/23282", "oai:rudar.ruc.dk:1800/17645", "oai:rudar.ruc.dk:1800/19221", "oai:rudar.ruc.dk:1800/16704", "oai:rudar.ruc.dk:1800/23015"}

Thus, the set of resources contained in Ta and Tb comprises the output to the user, i.e., the knowledge found according to the search and context:

Discovered knowledge = Ta ∪ Tb

For example, only resources discovered from the authors of the thesis "oai:rudar.ruc.dk:1800/26454" were considered, which totalled 34 after removing duplicates, because several resources were co-written by authors of that same group. For display purposes, articles by the author Michel Olsen, author of "oai:rudar.ruc.dk:1800/14983", were omitted, equivalent to 25 additional resources.


Figure 14. Selective knowledge discovery from the example.

While the exploration of the forms of visualisation of the obtained results is not proposed in this  work, it should be noted that given the present associations between resources, it is possible to take  advantage of graphs or hierarchical visualizations. Such forms of presenting information could allow  the user to explore different levels of the results of the inferences and to discriminate information  based on present relationships. An example is shown in Figure 15. 


Figure 15. Graph-type visualization of the example results.

4. Conclusions and Future Work

On the path towards the Semantic Web, sites are moving at very different rates. There are sites that do not incorporate any form of structure into the data that they present, even among the ones that use the most sophisticated and detailed formats to describe their information.

The protocol for harvesting OAI-PMH metadata is the result of a concern for making interoperable content available under the Open Access initiative. Thousands of repositories, which integrate intellectual, academic, and scientific production, participate in the Open Access movement and have adopted this specification to describe, display, and exchange information. However, the structuring of the OAI-PMH alone is not enough to take advantage of the benefits of the Semantic Web.

Thanks to the Semantic Web, software can process content, reason with it, combine it, and perform deductions logically to solve problems automatically. As Segaran, Evans, and Taylor [21] mention, building models of the world is not something new to system designers, but with semantic systems, these models must be expressed such that systems distributed throughout the Web can read and understand them and use them with precision.

The main contribution is to propose a semantic approach to selective knowledge discovery, enabling data structured with OAI-PMH to be processed by semantic technologies, which allows the execution of inference mechanisms on the dataset in addition to the use of ontology merging techniques for the verification of knowledge.

Furthermore, this model suggests several lines of future work. OntoOAI could be extended to take advantage of sources of Linked Open Data information as input, in addition to OAI-PMH content, and to apply knowledge discovery procedures, developing mechanisms for the filtering of academic or scientific information.


In addition, the proposal can be enhanced with the use of controlled vocabularies and/or multilingual ontologies to retrieve information in various languages. Other inference engines, such as Pellet, Racer, or FaCT, should also be tested for full OWL DL reasoning.

Regarding awareness to context, the type of information stored in the user profile is static, determined at design time, and cannot be updated or modified at runtime. In the sense that Eyharabide and Amandi (2012) highlight, next-generation user profiles should not just be a passive user data store but also an active component that can learn and update both the information and the type of information stored.

Finally, it is also considered as future work to add more metadata to the inference engine in order to expand the model's information retrieval capabilities, which at this moment are based just on co-authorship.

Author Contributions: A.B.-G. designed the methodology, developed the software components, performed the experiment and wrote the article. E.A.-L. conceived the database, analyzed the data and interpreted the results.

Funding: This research received no external funding.

Acknowledgments: We would like to show our gratitude to all Open Access journals that participate in the Redalyc project. Without the invaluable information they publish, this research would not have been possible.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Keßler, C.; d'Aquin, M.; Dietze, S. Linked Data for Science and Education. Semant. Web J. 2013, 4, 1–2.
2. Allemang, D.; Hendler, J. Semantic Web for the Working Ontologist, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2011; ISBN 978-0123859655.
3. Chan, L.; Cuplinskas, D.; Eisen, M.; Friend, F.; Genova, Y.; Guédon, J.-C.; Hagemann, M.; Harnad, S.; Johnson, R.; Kupryte, R.; et al. Budapest Open Access Initiative. 2002. Available online: http://www.soros.org/openaccess/read.shtml (accessed on 8 June 2018).
4. Max Planck Society. Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities. 2003. Available online: http://openaccess.mpg.de/Berlin-Declaration (accessed on 8 June 2018).
5. Brown, P.; Cabell, D.; Chakravarti, A.; Cohen, B.; Delamothe, T.; Eisen, M.; Grivell, L.; Guédon, J.-C.; Hawley, R.S.; Johnson, R.K.; et al. Declaración de Bethesda sobre Publicación de Acceso Abierto [Bethesda Statement on Open Access Publishing]. 2003. Available online: http://ictlogy.net/articles/bethesda_es.html (accessed on 8 June 2018).
6. Albert, P.; Holmes, K.; Börner, K.; Conlon, M. Research Discovery through Linked Open Data. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, Washington, DC, USA, 10–14 June 2012; ACM: New York, NY, USA, 2012; pp. 429–430.
7. Wiederhold, G. Foreword: On the barriers and future of knowledge discovery. In Advances in Knowledge Discovery and Data Mining; Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Eds.; AAAI Press: Menlo Park, CA, USA, 1996.
8. Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. Knowledge Discovery in Databases: An Overview. AI Mag. 1992, 13, 57–70.
9. IBM. Knowledge Discovery and Data Mining. Available online: http://researcher.watson.ibm.com/researcher/view_group.php?id=144 (accessed on 8 June 2018).
10. Becerril Garcia, A.; Lozano Espinosa, R.; Molina Espinosa, J.M. Semantic Approach to Context-Aware Resource Discovery over Scholarly Content Structured with OAI-PMH. Comput. Sist. 2016, 20, 127–142. [CrossRef]
11. Becerril Garcia, A.; Lozano Espinosa, R.; Molina Espinosa, J. Modelo para consultas semánticas sensibles al contexto sobre recursos educativos estructurados con OAI-PMH [A model for context-aware semantic queries over educational resources structured with OAI-PMH]. In Proceedings of the Encuentro Nacional de Ciencias de la Computación, ENC 2014, Oaxaca, Mexico, 3–5 November 2014; pp. 1–15.
12. Ministerio para la Ciencia e Innovación de España. Informe Modelos de Metadatos para Contenidos Multimedia [Report on Metadata Models for Multimedia Content]. Available online: http://omediadis.udl.cat/html/deliverables/215-Modelos_Metadatos_Contenidos_Multimedia/ (accessed on 10 November 2017).
13. Haslhofer, B.; Schandl, B. The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data. In Proceedings of the International Workshop on Linked Data on the Web (LDOW2008), Beijing, China, 22 April 2008.
14. Bizer, C.; Cyganiak, R. D2R Server—Publishing Relational Databases on the Semantic Web. 2006. Available online: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/Bizer-Cyganiak-D2R-Server-ISWC2006.pdf (accessed on 8 June 2018).
15. SIMILE. OAI2RDF. 2006. Available online: http://simile.mit.edu/repository/RDFizers/oai2rdf/ (accessed on 8 June 2018).
16. Ameen, A.; Rahman Khan, K.; Rani, B. Semi-Automatic Merging of Ontologies using Protégé. Int. J. Comput. Appl. 2014, 85, 35–42.
17. Panagiotopoulos, I.; Kalou, A.; Pierrakeas, C.; Kameas, A. An Ontology-Based Model for Student Representation in Intelligent Tutoring Systems for Distance Learning. In Artificial Intelligence Applications and Innovations; Lazaros Iliadis, I.M., Ed.; Springer: Halkidiki, Greece, 2012.
18. Apache Software Foundation. About Jena. Available online: http://jena.apache.org/about_jena/about.html (accessed on 8 June 2018).
19. Dentler, K.; Cornet, R.; Teije, A.; de Keizer, N. Comparison of Reasoners for Large Ontologies in the OWL 2 EL Profile. Semant. Web 2011, 2, 71–78.
20. Becerril-García, A.; Aguado-López, E.; Rogel-Salazar, R.; Garduño-Oropeza, G.; Zúñiga-Roca, M. De un modelo centrado en la revista a un modelo centrado en entidades: La publicación y producción científica en la nueva plataforma Redalyc.org [From a journal-centred model to an entity-centred model: scientific publication and production on the new Redalyc.org platform]. Aula Abierta 2012, 40, 53–64. (In Spanish)
21. Segaran, T.; Evans, C.; Taylor, J. Programming the Semantic Web, 1st ed.; O'Reilly: Sebastopol, CA, USA, 2009; p. 302.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
