International Journal of Information Management 30 (2010) 559–566
International Journal of Information Management journal homepage: www.elsevier.com/locate/ijinfomgt
Ontology management and evolution for business intelligence Alexander Mikroyannidis ∗ , Babis Theodoulidis Manchester Business School, University of Manchester, Manchester M13 9SS, UK
Article info
Keywords: Information management Business intelligence Ontology Ontology management Ontology evolution
Abstract The amount of heterogeneous data available to organizations today has made information management a complicated yet crucial task, since this data can be a valuable asset for business intelligence. Ontologies can act as a semantically rich knowledge base in systems that specialize in information management. The present work investigates the potential of ontologies in supporting the information lifecycle within a corporate environment for business intelligence. The paper demonstrates the use of Heraclitus II, a framework that employs ontology management and evolution in the context of information management systems. The capabilities of the framework in facilitating information management and business intelligence are evaluated through a real-life case study from the life sciences industry. © 2009 Elsevier Ltd. All rights reserved.
1. Introduction The rate of growth in the amount of information available nowadays within a corporate environment poses major difficulties as well as challenges in decision making. Business intelligence (BI) consists of a collection of techniques and tools, aiming at providing businesses with the necessary support for decision making. Examples of simple BI services that already exist are various search and filtering services, as well as various content providers and aggregators that deliver semi-custom information bundles to particular users. On a more sophisticated level, information management (IM) can assist a manager in monitoring specific organizations, technologies, or areas of research, as well as being able to analyze primary data in order to draw conclusions at the level of the company’s competition, sector or industry. Ontologies are a key enabling technology for IM, as they offer information a common representation and semantics. They constitute “a shared and common understanding of a domain that can be communicated between people and application systems” (Davies, Fensel, & Harmelen, 2003). An ontology comprises a formal description of a certain domain, by defining the ontology objects (or entities) that characterise the domain, namely concepts (or classes), as well as their instances and relations. Ontologies express information
∗ Corresponding author. Current address: Knowledge Media Institute, The Open University, Milton Keynes MK7 6AA, UK. E-mail addresses:
[email protected] (A. Mikroyannidis),
[email protected] (B. Theodoulidis). 0268-4012/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.ijinfomgt.2009.10.002
in a machine-processable form, thus allowing for its efficient manipulation by software agents. They are commonly represented with the use of XML-based languages, such as RDFS (www.w3.org/TR/rdf-schema) and OWL (www.w3.org/TR/owl-features). Ontologies provide IM systems with a semantically rich knowledge base for the interpretation of unstructured content. Based on the semantics encoded within ontologies, information can be extracted from natural language texts and, on a further level of processing, knowledge can be discovered that will assist BI. Nevertheless, the way ontologies are usually managed within IM systems is unsophisticated and disregards important factors. Ontology layering or integration is rarely used and the dynamic aspect of ontologies, which requires appropriate evolution mechanisms, is often neglected. Overall, the potential of ontologies in IM and BI has yet to be fully realized and put to practical use. The Heraclitus II framework (Mikroyannidis & Theodoulidis, 2006) considers ontologies as a semantically rich knowledge base for information management and proposes a methodology for the management and evolution of this knowledge base. In this paper, we demonstrate the application of Heraclitus II to a real-life case study related to BI in the life sciences industry. We evaluate the framework by examining the challenges posed by the case study and how these are met by Heraclitus II. The remainder of this paper is organized as follows: Section 2 examines existing approaches in IM, as well as ontology management and evolution. Section 3 provides a brief overview of the Heraclitus II framework. In Section 4, the application of the framework to the case study is presented and in Section 5 the framework is evaluated. Finally, the paper is concluded and some plans for future work are provided.
2. Related work Information management (IM) can be defined as “the process of managing information as a strategic resource for improving organizational performance” (Chaffey & Wood, 2005). Modern IM systems have to deal with unstructured data of considerable volume, combining methodologies from numerous disciplines, such as Information Extraction, Natural Language Processing, Information Retrieval, and Data Mining. GATE (general architecture for text engineering) (Cunningham, Maynard, Bontcheva, & Tablan, 2002) is an architecture for language engineering, containing a suite of tools for language processing and information extraction. The GATE data flow consists of a pipeline of Processing Resources (lemmatisers, part-of-speech taggers, etc.) that run in series and utilize Language Resources (ontologies, lexicons, corpora, etc.). GATE provides a common object-oriented model of ontologies and a unified API for their use by Processing Resources (Bontcheva, Tablan, Maynard, & Cunningham, 2004). Ontologies of different formats, such as RDF(S) and OWL, need to be converted into this model so that they can be used within the GATE pipeline. Ellogon (Petasis, Karkaletsis, Paliouras, Androutsopoulos, & Spyropoulos, 2002) is a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon is based on a modular architecture that allows reuse of its sub-systems in order to facilitate the creation of applications targeting specific linguistic needs. APIs for creating processing components are provided for a wide range of programming languages, including C, C++, Java, Tcl, Perl and Python. Although ontologies are not part of the standard Ellogon architecture, they can be added through external components that use the Ellogon APIs. The unstructured information management architecture (UIMA) (Ferrucci & Lally, 2004) is a software architecture for developing and deploying unstructured information management (UIM) applications.
A UIM application can be generally characterized as a software system that analyzes large volumes of unstructured information in order to discover, organize, and deliver relevant knowledge to the end user. UIM applications utilize a variety of technologies including statistical and rule-based natural language processing, information retrieval, machine learning, ontologies, and automated reasoning. Whereas the aforementioned systems can be mostly regarded as generic architectures for the deployment of IM components, the Parmenides framework (Mikroyannidis, Theodoulidis, & Persidis, 2006) aims to provide managers with specialized tools that aid decision making, via a customized analysis of the desired market. Ontologies play a crucial role in the Parmenides architecture. The knowledge base used for document processing and knowledge discovery is a collection of ontologies. In particular, these ontologies are utilized by the information extraction engine for semantic annotation of documents, as well as by knowledge discovery tools for data pre-processing tasks. The approaches adopted in most IM systems regarding ontology management are quite simplistic. Even when multiple ontologies are used, they are not integrated; each is managed separately. In addition, the evolution of ontologies for alignment with changing business requirements is usually overlooked. Some of these limitations are addressed by systems specializing in ontology management and evolution, such as Sesame, OMM, OntoView, PROMPT, and KAON. Sesame (Broekstra, Kampman, & Harmelen, 2002) is a generic architecture for the storage and querying of RDF and RDFS ontologies. Sesame can be coupled with a variety of repositories for the storage of ontologies, including relational databases, RDF triple stores, or remote storage services on the web. RQL, a declarative query language, is used for querying RDF data at a semantic
level. The ontology middleware module (OMM) (Kiryakov, Simov, & Ognyanov, 2002) has been built on top of Sesame to handle version control and change management. Sesame and OMM have been integrated to formulate the ontology management system of the On-To-Knowledge project (www.ontoknowledge.org). OntoView (Klein, 2004) is a web-based environment for ontology version control. It was initially designed to compare RDF ontologies, but was later extended to include other representations as well. OntoView maintains not only the transformations between different versions of ontologies, but also the conceptual relations between concepts in these versions, thus making them interoperable. PROMPT (Noy & Musen, 2003) is a suite of tools for ontology versioning and integration that have been implemented as extensions for the Protégé ontology editor (http://protege.stanford.edu). These include iPROMPT, an interactive ontology-merging tool, AnchorPROMPT, a graph-based tool for finding similarities between ontologies, PROMPTDiff, a tool for finding differences between two versions of the same ontology, and PROMPTFactor, a tool for extracting parts of an ontology. These tools share user interface components, heuristics, data structures, and provide information for one another. The Protégé ontology editor provides overall capabilities for managing multiple ontologies. The KArlsruhe ONtology and Semantic Web tool suite (KAON) (Haase et al., 2004) provides a number of open source software modules for the engineering, management, and evolution of ontologies. KAON modules include Description Logic Programs that are responsible for ontology reasoning, the KAON Server, which provides a generic infrastructure to facilitate engineering of ontology-based applications, and KAONtoEdit which enables OntoEdit (www.ontoknowledge.org/tools/ontoedit.shtml) to use the KAON API in order to load, modify and store KAON ontology models. 
The survey of existing ontology management and evolution methodologies has revealed certain shortcomings. First of all, ontology integration is poorly supported and layering is rarely employed. In addition, there is no ontology model that captures and enhances the temporal information contained in ontologies. Current approaches in ontology modelling do not offer sufficient temporal semantics for the evolution history of ontologies. Finally, key issues for ontology evolution, such as consistency preservation and change propagation, are often neglected. The Heraclitus II framework attempts to address these shortcomings in the context of IM and BI.
3. The Heraclitus II framework Fig. 1 shows the ontology layering architecture adopted in Heraclitus II. The lower layers represent more generic and all-purpose ontologies, while the upper layers are customized for certain uses within an IM system. When traversing the layers from bottom to top, each layer reuses and extends the previous ones. In addition, whenever a layer extends the ones below it (e.g. with the insertion of new concepts), these extensions are propagated to the lower layers. Each layer is maintained by a different group of ontology authors, depending on the expertise that each layer requires. The integration of the ontology pyramid layers is achieved with the use of ontology mapping between ontologies belonging to the same layer (intra-layer), or different ones (inter-layer). The Lexical Ontology layer contains domain-independent ontologies of a purely lexicographical nature. This layer handles lexicographical issues, such as multilingualism. An example of such an ontology is the widely adopted WordNet (Fellbaum, 1998). Modelling of a certain domain is the main characteristic of the Domain Ontology layer. The ontologies of the gene ontology (GO) project
Fig. 1. The Heraclitus II ontology pyramid.
(The Gene Ontology Consortium, 2000), as well as the foundational model of anatomy (FMA) (Cornelius Rosse, 2003) are examples of domain ontologies in bioinformatics. A vital aspect of an IM system is the set of sources from which it collects unstructured or structured data. Depending on the domain, these sources can be news portals, corporate databases, scientific publications, etc. The Data Source Ontology layer specializes in the organization of information in these data sources. A common ontology of this type is the web site ontology of the Heraclitus framework (Mikroyannidis & Theodoulidis, 2007). Finally, on top of the pyramid lies the Application Ontology layer, containing software development ontologies that represent the software organization of an IM system. This type of ontology allows for the interconnection between software structures and ontological data in order to facilitate the process of ontology-driven software development (Knublauch, 2004). Ontology evolution in Heraclitus II is bitemporal, taking place over valid and transaction time. The valid time of a fact is defined as the time when that fact is true in the modelled reality. The transaction time of a fact is defined as the time when that fact is current in the knowledge base and may be retrieved. Bitemporal evolution allows for retro-active as well as pro-active changes to be captured and represented in the knowledge base. A retro-active change occurs when a fact that is entered in the knowledge base at a certain transaction time has been valid in the real world before this transaction time. On the other hand, when the valid time of a fact is later than its transaction time, a pro-active change is captured in the knowledge base. The Heraclitus II temporal semantics employed for ontology evolution are based on the TAU Temporal Information Management framework (Kakoudakis & Theodoulidis, 2001; Theodoulidis et al., 1998).
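The distinction between retro-active and pro-active changes can be illustrated with a minimal Python sketch. This is not the Heraclitus II or TAU implementation: the `BitemporalFact` structure, the `classify_change` function, the "current" label for the case where both times coincide, and the example dates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BitemporalFact:
    """A fact stamped with both time dimensions (hypothetical structure)."""
    statement: str
    valid_from: date        # when the fact became true in the modelled reality
    transaction_time: date  # when the fact was entered in the knowledge base


def classify_change(fact: BitemporalFact) -> str:
    """Compare valid time with transaction time to classify a change."""
    if fact.valid_from < fact.transaction_time:
        return "retro-active"   # already valid before it was recorded
    if fact.valid_from > fact.transaction_time:
        return "pro-active"     # recorded before it becomes valid
    return "current"            # assumed label: both times coincide


# Illustrative facts with made-up dates
recorded_late = BitemporalFact("A uses product P", date(2000, 1, 1), date(2004, 9, 27))
recorded_early = BitemporalFact("B launches product Q", date(2005, 6, 1), date(2005, 3, 1))

print(classify_change(recorded_late))   # retro-active
print(classify_change(recorded_early))  # pro-active
```

Keeping both timestamps on every fact, rather than overwriting values, is what allows the two kinds of change to be told apart afterwards.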
Consistency preservation is another goal of ontology evolution in Heraclitus II. This is performed at two levels, structural and semantic, resolving inconsistencies that arise when the structure or semantics of an ontology become invalid because of a change. In addition, a change taking place in a particular ontology is propagated internally (inside the changed ontology), as well as externally (to dependent ontologies via intra- and inter-layer mappings), so that simultaneous evolution of all ontology layers is achieved.
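External change propagation can be pictured as a traversal of the mapping graph. The sketch below is only an illustration under stated assumptions: the `MAPPINGS` graph, its ontology names and the direction of its edges are hypothetical, and the breadth-first traversal is a generic technique rather than the propagation algorithm of Heraclitus II.

```python
from collections import deque
from typing import Dict, List

# Hypothetical mapping graph: each ontology points to the ontologies that
# depend on it through intra- or inter-layer mappings.
MAPPINGS: Dict[str, List[str]] = {
    "lexical": ["domain", "data_source"],
    "data_source": ["domain"],
    "domain": ["application"],
    "application": [],
}


def affected_ontologies(changed: str, mappings: Dict[str, List[str]]) -> List[str]:
    """Return every ontology reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order: List[str] = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in mappings.get(current, []):
            if dependent not in seen:  # visit each ontology only once
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


print(affected_ontologies("data_source", MAPPINGS))
```

Tracking visited ontologies keeps the traversal finite even if mappings form a cycle, which can happen when two ontologies are mapped to each other.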
4. Framework application The Heraclitus II framework was applied in three steps. First, each ontology layer was constructed according to the requirements of the case study. Following ontology construction, appropriate links between the layers were created in order to define the integration of the different layers and complete the construction of the Heraclitus II ontology pyramid. Finally, a number of ontology evolution scenarios were implemented based on the case study requirements.
For the purposes of building and maintaining the case study ontologies, the Protégé ontology editor was used as our principal software system, on top of which we deployed plugins to extend its functionality. In particular, the Protégé-OWL plugin (http://protege.stanford.edu/plugins/owl) was used to add support for the development of ontologies in OWL. The OntoLing plugin (http://ai-nlp.info.uniroma2.it/software/OntoLing) was employed for the automatic enrichment of the case study ontologies with linguistic data. In order to visualise various parts of the ontology pyramid, we used the OntoViz plugin (http://protegewiki.stanford.edu/index.php/OntoViz).
4.1. Case study description Biovista (www.biovista.com) is an SME established in 1996. It is based in Charlottesville, USA and Athens, Greece. It specialises in BI products and services for the life science industry. These products and services are targeted primarily at managers in charge of business development and investment or strategic level issues who wish to make decisions on industry and company developments or simply to monitor these on an ongoing basis. Biovista’s interest lies in business news, which is mainly provided by web sources, such as Business Wire (www.businesswire.com). These news items contain important facts, such as who is the CEO of a particular company, as well as more complex events, such as collaborations between companies or new product releases. Biovista wants to identify such facts and events and store them in a knowledge base for further reasoning, in order to analyze and monitor industry developments. The services offered by Biovista focus on assessing the quality of management and collaborations of a company, reviewing its potential clients and discovering trends in the stock market. In particular, answers to the following questions are provided:
• Is company X cheap? What is its position in the stock market?
• What is the collaboration profile of company X?
• What is the position of company X regarding its product pipeline?
• Who are (potential) clients of company X?
• How mature is the product line of company X?
• What is the quality of the management of company X?
4.2. Ontology pyramid construction The applications used in the Biovista case study are mainly associated with information extraction tasks from web data. Therefore, the application ontology that was constructed for this case study represents the organization of the software components of a typical information extraction application. It is based on the taxonomy of information extraction software components described
in Cunningham (2000). It is broad enough to be used not only in this particular case study, but also in any case study that requires this type of data processing. The data source ontology was based on the organization of articles of the BioSpace news portal (www.biospace.com), a leading provider of web-based resources and information to the life science industry. BioSpace acts as an aggregator by gathering news items from various sources, such as Business Wire (www.businesswire.com) and Market Wire (www.marketwire.com). These web portals serve Biovista as the primary sources for the extraction of information about the life sciences industry. The domain ontology for the Biovista case study includes concepts, relations and instances from the life sciences industry. The business events that Biovista is interested in are represented as concepts in the domain ontology and are the following:
• Personnel events: The instances of this concept are events referring to the personnel of life sciences organizations, such as appointments, promotions, staff increases/reductions, etc.
• Agreements: This concept regards various types of agreements between organizations of the life sciences industry, such as alliances, license agreements, mergers, acquisitions, etc.
• Clinical events: This concept concerns various phases of clinical experiments of products that are being developed.
• Product events: This concept mainly regards the launches and sales/purchases of products.
The lexical ontology used in this case study was WordNet (Fellbaum, 1998), together with the OntoLing plugin for the automatic linguistic enrichment of the ontology pyramid. During the process of linguistic enrichment, relations were discovered between ontology objects of the pyramid and linguistic structures of WordNet. As a result, synonyms and definitions were added to the concepts of the ontology pyramid.
4.3. Ontology pyramid integration The case study ontologies were mapped to each other in order to be integrated into the Heraclitus II ontology pyramid. Fig. 2 shows an example of inter-layer ontology mappings between the domain and the data source ontologies. The namespace prefixes for the domain and data source ontologies are do and dso, respectively. As shown in Fig. 2, the Clinical Development and Product News classes of the data source ontology have been mapped to classes of the domain ontology. Each story referring to a clinical development or a product announcement is associated with the product that the news story is about. Therefore, the Clinical Development and Product News classes are mapped to the Product class of the domain ontology, via the dso:developing product and dso:product relations respectively. A clinical development story also contains the stage of the product, expressed via the dso:development stage relation between the Clinical Development class and the Product stage class of the domain ontology.
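The inter-layer mappings described for Fig. 2 can be written down as simple triples. The following Python sketch is an illustration only: the `Mapping` type, the helper functions and the identifier spellings (camel-case class names and underscores in relation names) are assumptions, since the paper gives only the class and relation labels and the do and dso prefixes.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    """One mapping between two ontology classes (hypothetical structure)."""
    source_class: str  # class in the data source ontology (prefix dso)
    relation: str      # relation that carries the mapping
    target_class: str  # class in the domain ontology (prefix do)


# The three mappings described in the text; identifier spellings are assumed
MAPPINGS: List[Mapping] = [
    Mapping("dso:ClinicalDevelopment", "dso:developing_product", "do:Product"),
    Mapping("dso:ProductNews", "dso:product", "do:Product"),
    Mapping("dso:ClinicalDevelopment", "dso:development_stage", "do:ProductStage"),
]


def layer_of(qualified_name: str) -> str:
    """Return the namespace prefix, used here as a stand-in for the layer."""
    return qualified_name.split(":", 1)[0]


def is_inter_layer(mapping: Mapping) -> bool:
    """A mapping is inter-layer when its two classes carry different prefixes."""
    return layer_of(mapping.source_class) != layer_of(mapping.target_class)


def targets_of(source_class: str) -> List[str]:
    """List the domain classes that a data source class is mapped to."""
    return sorted({m.target_class for m in MAPPINGS if m.source_class == source_class})


print(targets_of("dso:ClinicalDevelopment"))
print(all(is_inter_layer(m) for m in MAPPINGS))
```

Treating the prefix as the layer identifier is a simplification that only works when each layer holds a single ontology, as in this example.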
4.4. Ontology pyramid evolution During this phase, the Heraclitus II evolution process was performed on the integrated ontology pyramid of the Biovista case study. A typical scenario from the Biovista BI process was used. According to this scenario, news items from various web sources, including Business Wire, are used to enrich Biovista’s knowledge base and allow monitoring of various business developments. In the context of the Heraclitus II framework, these enrichments were translated into insertions of new instances in the data source ontology. The following example demonstrates how evolution of the knowledge base was performed with regard to product developments specifically. As shown in Fig. 2, the Product News class of the data source ontology is mapped to the Product class of the domain ontology. Consequently, the insertion of Product News instances in the data source ontology had to be propagated to the domain ontology. More specifically, the corresponding instances of the Product class received an update in their do:acquired by, do:bought by, do:developed by, and do:product stage properties. Table 1 shows a series of changes regarding insertion of Product News instances and their impact on an instance of the Product class. In particular, we gathered news items regarding the product line of Misys Healthcare Systems (www.misyshealthcare.com). Misys Healthcare Systems is among the top healthcare IT companies in North America in the area of clinical products and web-based technologies for sharing patient data across distributed medical care settings. A key product of Misys is Misys CPR (Computerized Patient Record), which is represented in the domain ontology by an instance of the Product class, labelled Misys CPR. The first column of Table 1 contains the transaction time of the captured change (instance insertion) and the second column displays the title of the inserted instance. The third column shows the changes that were propagated to the domain ontology and the Misys CPR instance specifically. The last column shows the valid time period for the propagated changes. On 29/07/2003, Misys announces the acquisition of Misys CPR, formerly marketed as Patient1, from Per-Se Technologies Inc. As a result, the Patient1 instance in the domain ontology is renamed to Misys CPR and its acquired by and developed by properties are set to Misys Healthcare Systems. The valid time period for these changes is [29/07/2003, now).
On 27/08/2003, Misys CPR is bought by the New York City Health & Hospitals Corporation, causing the addition of a bought by property in the Misys CPR instance. Subsequent insertions of news item instances about Misys CPR sales enrich the Misys CPR instance with more bought by properties. Retro-active changes are recorded at transaction times 19/08/2004, 27/09/2004, 29/10/2004, and 25/05/2006. For example, on 19/08/2004, a news item about the University Health Network contains the information that the latter has been using Misys CPR since 1988, when it was still developed by Per-Se Technologies Inc. This causes a new bought by property to be inserted and set to be retro-actively valid from 1988. Similarly, on 27/09/2004 Arnot Ogden Medical Center is mentioned in a news item as a 4-year user of Misys CPR, thus causing the addition of a bought by property with valid time [2000, now). Fig. 3 provides a graphical representation of the evolution of the Misys CPR instance over valid and transaction time. The shaded 3D rectangles represent the lifespan of the properties belonging to the Misys CPR instance. The lifespan of each property starts at the corresponding transaction and valid time, marked with dashed lines on the transaction and valid time axes. No lifespan ends at a particular time point on either time axis; each extends to the present. In order to keep the complexity of the diagram to a minimum, we have included only the properties that were mentioned in the text.
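The propagation of Product News insertions to the Misys CPR instance can be sketched in a few lines of Python. This is a simplified illustration, not the Heraclitus II implementation: the `PropertyValue` and `ProductInstance` classes, the `propagate_product_news` function and the underscore spelling of property names are assumptions, and only three of the changes listed in Table 1 are replayed. Dates are kept as strings because some valid times are given only as a year.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PropertyValue:
    """A property value with an open-ended valid period [valid_from, now)."""
    name: str
    value: str
    valid_from: str        # start of the valid time period
    transaction_time: str  # date on which the change was captured


@dataclass
class ProductInstance:
    """An instance of the domain ontology Product class (hypothetical model)."""
    label: str
    properties: List[PropertyValue] = field(default_factory=list)


def propagate_product_news(product: ProductInstance, transaction_time: str,
                           changes: List[Tuple[str, str]], valid_from: str) -> None:
    """Apply the changes implied by one inserted Product News instance."""
    for name, value in changes:
        if name == "do:product_name":
            product.label = value  # a rename also updates the instance label
        product.properties.append(
            PropertyValue(name, value, valid_from, transaction_time))


product = ProductInstance("Patient1")
propagate_product_news(product, "29/07/2003", [
    ("do:product_name", "Misys CPR"),
    ("do:acquired_by", "Misys Healthcare Systems"),
    ("do:developed_by", "Misys Healthcare Systems"),
], valid_from="29/07/2003")
propagate_product_news(product, "27/08/2003", [
    ("do:bought_by", "New York City Health & Hospitals Corporation"),
], valid_from="27/08/2003")
# Retro-active change: captured in 2004 but valid since 1988
propagate_product_news(product, "19/08/2004", [
    ("do:bought_by", "University Health Network"),
], valid_from="1988")

buyers = [p.value for p in product.properties if p.name == "do:bought_by"]
print(product.label, buyers)
```

Appending a new stamped value for every change, instead of overwriting the previous one, keeps the full evolution history of the instance available for later queries.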
5. Framework evaluation This section discusses the evaluation of Heraclitus II based on the outcomes of its application on the Biovista case study. More specifically, the impact on the case study in terms of ontology construction, collaborative ontology management, as well as ontology evolution and its relation with BI is examined.
Fig. 2. Mapping of the Clinical Development and Product News classes of the data source ontology to classes of the domain ontology.
5.1. Ontology construction A fundamental problem of using ontologies in case studies where users with different backgrounds are involved is to establish an agreed version that is accepted by all users. Each party involved, based on its expertise and requirements, has a different conceptualization of the problem in question. The multi-layered architecture of Heraclitus II addresses this issue by assigning to
Table 1 Change propagation over two time dimensions for the properties of the Misys CPR instance.
| Transaction time | Data source instance title | Propagated changes | Valid time |
| --- | --- | --- | --- |
| 29/07/2003 | Misys Healthcare Systems Completes Asset Purchase of Patient1® Product Line from Per-Se Technologies, Inc. | do:product name = Misys CPR; do:acquired by = Misys Healthcare Systems; do:developed by = Misys Healthcare Systems | [29/07/2003, now) |
| 27/08/2003 | Misys Healthcare Systems Signs Agreement with New York City Health and Hospitals Corporation | do:bought by = New York City Health & Hospitals Corporation | [27/08/2003, now) |
| 23/09/2003 | North Bronx Healthcare Network Integrates Speech Recognition with Misys CPR™ System | do:bought by = North Bronx Healthcare Network | [23/09/2003, now) |
| 22/10/2003 | POH Medical Center Selects Misys Clinical Information Suite to Improve Enterprise-wide Patient Care and Financial Operations | do:bought by = POH Medical Center | [22/10/2003, now) |
| 18/05/2004 | Pascack Valley Hospital Chooses Misys Healthcare Systems to Advance to Next Level of CPR and CPOE Sophistication | do:bought by = Pascack Valley Hospital | [18/05/2004, now) |
| 19/08/2004 | Physicians at University Health Network Enter Medication Orders Online to Enhance Patient Safety and CPOE Adoption | do:bought by = University Health Network | [1988, now) |
| 27/09/2004 | Arnot Health Selects Misys EMR Solution; First Misys Optimum Client to Connect Hospital and Physician Office | do:bought by = Arnot Ogden Medical Center | [2000, now) |
| 29/10/2004 | The Kingdom of Saudi Arabia National Guard Health Affairs Goes Live with Misys CPR Electronic Patient Record System | do:bought by = Saudi Arabia National Guard Health Affairs (NGHA) | [29/09/2004, now) |
| 29/09/2005 | Riverview Hospital Chooses Misys Healthcare Systems to Improve Patient Care and Increase Efficiency | do:bought by = Riverview Hospital | [29/09/2005, now) |
| 19/12/2005 | Brockville General Hospital Selects Misys Healthcare Systems to Deploy Computerized Clinical Information Network | do:bought by = Brockville General Hospital (BGH) | [19/12/2005, now) |
| 25/05/2006 | St. John’s Regional Medical Center And Pleasant Valley Hospital Recommit To Misys CPR™ With Plans To Add Computerized Physician Order Entry (CPOE) Capability | do:bought by = St. John’s Regional Medical Center and St. John’s Pleasant Valley Hospital | [1993, now) |
| 15/12/2006 | Misys Healthcare Systems To Deliver Enterprise-Wide Electronic Health Records (EHR) For Daughters Of Charity Health System | do:bought by = Daughters of Charity Health System (DCHS) | [15/12/2006, now) |
| 24/01/2007 | Misys Healthcare Systems Expands Partnership With Princeton Healthcare System | do:bought by = Princeton HealthCare System (PHCS) | [24/01/2007, now) |
Fig. 3. Graphical representation of the bitemporal evolution of the Misys CPR instance.
each user group a distinct role in ontology construction and maintenance. In this way, the ontology construction task is made easier, since each user group focuses on its area of interest. There is an issue, nevertheless, in ontology construction that Heraclitus II has not fully addressed. This issue is inherent in ontology modelling and concerns the reluctance of users to dedicate effort to ontology construction. The top-down modelling approach that ontologies propose is usually foreign to how experts view their domain of expertise. The hierarchical categorization of ontologies can pose serious limitations in modelling a certain domain. The fact that ontologies are usually controlled centrally by experts makes their adoption by a wide user-base difficult. Furthermore, in order to fully cover all the aspects of a domain and create ontologies and metadata for use in logic reasoning, the ontology author is required to dedicate a considerable amount of effort. Heraclitus II proposes evolution as the answer to this problem; even if the starting ontology is not complete, it can be improved over time through an efficient evolution process. Another problem that came up when performing ontology construction for Heraclitus II in the Biovista case study is related to the structure of the ontology pyramid. The main question concerning the ontology authors when building the Heraclitus II pyramid is “In which layer should an ontology object be placed?” The answer is usually quite straightforward as each layer has a distinct role in the pyramid. However, there are cases when ontological structures can belong to more than one layer. This can cause confusion with regard to where one layer ends and where the next one begins. In these cases, the ontology authors have to reach an agreement regarding where the ontological structures in question should be initially placed and use inter-layer mappings to couple the involved layers.
This initial placement can be later revised and, if needed, corrected during the evolution process.
5.2. Collaborative ontology management Collaboration between different parties in the process of constructing and maintaining ontologies is an important success factor for modern IM systems. As the knowledge bases of these systems grow in size and diversity, the need for a larger and more diverse base of ontology authors increases. A number of essential tasks that an environment for collaborative ontology management should support are described in Bao, Hu, Caragea, Reecy, & Honavar (2006). We will use these as metrics in order to evaluate the collaborative perspective of Heraclitus II.
• Knowledge integration: A fundamental task in a collaborative environment is the integration of contributions from multiple participants. Heraclitus II provides a multi-layer architecture that is constructed and managed by diverse parties. Reusability and integration are supported through ontology mapping.
• Concurrency management: Different ontology authors need to be able to work on different parts of the knowledge base simultaneously. When the same part of the knowledge base is edited concurrently by more than one author, conflicts can arise. Heraclitus II does not provide the means to resolve these conflicts in real time. Various technologies could be used to address this issue, such as CVS (The Gene Ontology Consortium, 2000), Wiki (Auer, Dietzold, & Riechert, 2006; Schaffert, 2006), or peer-to-peer based solutions (Becker, Eklund, & Roberts, 2005; Xexeo et al., 2004).
• Consistency maintenance: Parts of the knowledge base developed by different authors may be inconsistent with each other, since an ontology usually reflects the point of view of each author. The mechanisms for structural and semantic consistency preservation as well as change propagation provided by Heraclitus II ensure that the knowledge base remains free of inconsistencies.
• Privilege management: In order to ensure the accuracy of the knowledge base, a collaborative environment needs to assign different levels of privileges to its users, based on their expertise, authority, and responsibility. Our framework implements a flat scheme regarding privilege management, by giving each user group equal privileges in their layer of responsibility.
• History maintenance: Collaborative environments should provide the means to recover from wrong or unintended changes to the knowledge base. All changes to the knowledge base should thus be recorded, so that the authorship of a change can be tracked and the loss of important information prevented. The bitemporal modelling of Heraclitus II retains all the necessary information to achieve this goal.
• Scalability: Long-term collaboration of diverse parties usually increases the size of knowledge bases; therefore, a collaborative environment has to scale to large ontologies. The Biovista case study has involved only medium-sized ontologies. Running the case study for a longer period, so that more information accumulates in the knowledge base, or applying the framework to larger-scale case studies will determine how Heraclitus II fares on this metric.
5.3. Ontology evolution and BI
A key aspect of the Biovista case study is its BI perspective. In order to facilitate the BI process, bitemporal evolution of the knowledge base has been employed. This has significantly enhanced the temporal information stored in the knowledge base and allows more advanced Information Retrieval (IR) and Knowledge Discovery (KD) tasks to be accomplished. Through these tasks, the Biovista management is able to identify trends in sectors of the life science industry that are of interest. Business reports can be compiled about the evolution of the profile of a company over time. For this task, temporal information can be used regarding the company’s products, personnel, and collaborations/agreements, combined with data from the stock market. This information can help the manager assess the potential of these companies and take the appropriate decisions. For example, let us assume that the manager wants to answer the question: “Is company Y a good target for acquisition by company X?” In order to answer this question, business logic dictates that the following indicators should be considered:
• Technology gaps in company X.
• Compatible technologies between companies X and Y.
• Common personnel network.
• Financial situation of company Y.
• Technology trends.
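This section does not spell out the data structures behind the bitemporal model. As a minimal sketch, assuming that the two time dimensions are the usual valid time (when a statement holds in the domain) and transaction time (when it was recorded in the knowledge base), a query over a company's technologies along both dimensions could look as follows. The class, field, and function names, as well as the sample data, are hypothetical and are not part of Heraclitus II.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class BitemporalFact:
    """One recorded version of 'company offers technology' (hypothetical model)."""
    company: str
    technology: str
    valid_from: date                      # valid time: start of validity in the domain
    valid_to: Optional[date]              # valid time: end of validity, None = still valid
    recorded_on: date                     # transaction time: when this version was stored
    superseded_on: Optional[date] = None  # transaction time: when a later version replaced it


def technologies_as_of(facts: List[BitemporalFact], company: str,
                       valid_at: date, known_at: date) -> List[str]:
    """Technologies of `company` that held at `valid_at`, as known on `known_at`."""
    result = set()
    for f in facts:
        if f.company != company:
            continue
        in_valid_time = f.valid_from <= valid_at and (
            f.valid_to is None or valid_at < f.valid_to)
        in_transaction_time = f.recorded_on <= known_at and (
            f.superseded_on is None or known_at < f.superseded_on)
        if in_valid_time and in_transaction_time:
            result.add(f.technology)
    return sorted(result)


# Illustrative data only: the second technology was first recorded as open-ended
# and later corrected, on 2008-01-15, to have ended on 2007-06-01.
facts = [
    BitemporalFact("X", "gene profiling", date(2005, 1, 1), None, date(2005, 3, 1)),
    BitemporalFact("X", "protein assay", date(2006, 1, 1), None,
                   date(2006, 2, 1), superseded_on=date(2008, 1, 15)),
    BitemporalFact("X", "protein assay", date(2006, 1, 1), date(2007, 6, 1),
                   date(2008, 1, 15)),
]

# Same valid-time question, asked before and after the correction was recorded.
before_correction = technologies_as_of(facts, "X", date(2007, 12, 1), date(2007, 12, 31))
after_correction = technologies_as_of(facts, "X", date(2007, 12, 1), date(2008, 6, 1))
```

Asking the same question with two different `known_at` dates shows how a later correction changes the answer without erasing what was believed earlier. The history maintenance metric of Section 5.2 relies on the same property.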
Based on the Heraclitus II bitemporal modelling, the above indicators can be enhanced with the addition of bitemporal information. More specifically, the manager can monitor the technologies provided by companies X and Y over two time dimensions, in order to estimate and compare their dynamics and discover technology trends. In addition, the relations between the personnel of the companies in question can be studied over time. Finally, in order to assess the current financial situation of the companies, as well as make short- and long-term predictions about their situation, the manager can associate the bitemporal history of the sales/purchases of the companies with stock market data.
6. Conclusions and future work
The Heraclitus II framework employs bitemporal modelling and a layering architecture for the management and evolution of ontologies within IM systems. A practical application of the framework on a case study from the life sciences industry has provided us with an insight into the problems and challenges related to BI. Heraclitus II provides solutions in ontology construction, integration, and maintenance. In addition, the proposed bitemporal evolution model aims at enhancing the temporal expressivity of ontologies, thus having a direct positive impact on BI.
Heraclitus II offers various opportunities for further research. One interesting direction to follow would be towards establishing OWL extensions based on the Heraclitus II bitemporal modelling. Work in this area so far (W3c, 2006) has proposed only basic temporal OWL structures, such as interval and duration. The Heraclitus II bitemporal ontology model can offer additional, more expressive structures, regarding the two time dimensions as well as the lifespan and history of ontology objects. Another addition to Heraclitus II that is worth pursuing is related to the collaborative aspects. In particular, offering the ability to manage the knowledge base concurrently, improving the user privilege scheme, and working on the scalability of the framework are important areas for further investigation. Exploring additional domains and applying the framework in more case studies can offer a better understanding of issues related to IM and BI. A potential area for further research is the construction and maintenance of the ontology pyramid through the incorporation of Social Web technologies, such as folksonomies. Heraclitus II could in this way function as a testbed toward building a Social Semantic Web (Mikroyannidis, 2007).
Acknowledgements
The authors would like to thank the CEO of Biovista, Dr. Andreas Persidis, for providing the case study and its supporting material, including requirements, datasets and helpful feedback on the paper.
References
Auer, S., Dietzold, S., & Riechert, T. (2006). OntoWiki – A tool for social, semantic collaboration. In 5th International Semantic Web Conference (ISWC 2006), Athens, GA, USA (pp. 736–749). Springer LNCS.
Bao, J., Hu, Z., Caragea, D., Reecy, J., & Honavar, V. G. (2006). A tool for collaborative construction of large biological ontologies. In 17th International Conference on Database and Expert Systems Applications (DEXA’06), Krakow, Poland (pp. 191–195).
Becker, P., Eklund, P., & Roberts, N. (2005). Peer-to-peer based ontology editing. In International Conference on Next Generation Web Services Practices (NWeSP 2005), Seoul, Korea (pp. 259–264).
Bontcheva, K., Tablan, V., Maynard, D., & Cunningham, H. (2004). Evolving GATE to meet new challenges in language engineering. Natural Language Engineering, 10(3/4), 349–373.
Broekstra, J., Kampman, A., & Harmelen, F. V. (2002).
Sesame: A generic architecture for storing and querying RDF and RDF schema. In J. Davies, D. Fensel, & F. V. Harmelen (Eds.), Towards the semantic web: Ontology-driven knowledge management. John Wiley and Sons Ltd.
Chaffey, D., & Wood, S. (2005). Business information management: Improving performance using information systems. FT Prentice Hall.
Rosse, C., & Mejino, J. L. V. (2003). A reference ontology for biomedical informatics: The foundational model of anatomy. Journal of Biomedical Informatics, 36, 478–500.
Cunningham, H. (2000). Software architecture for language engineering. PhD thesis, Department of Computer Science, University of Sheffield, Sheffield. http://gate.ac.uk/sale/thesis/.
Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In 40th Annual Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia (pp. 168–175).
Davies, J., Fensel, D., & Harmelen, F. V. (2003). Towards the semantic web: Ontology-driven knowledge management (1st ed.). John Wiley and Sons Ltd.
Fellbaum, C. (1998). WordNet: An electronic lexical database. The MIT Press.
Ferrucci, D., & Lally, A. (2004). UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3), 327–348.
Haase, P., Sure, Y., & Vrandecic, D. (2004). Ontology management and evolution – Survey, methods and prototypes. SEKT Deliverable D3.1.1, Institute AIFB, University of Karlsruhe, Karlsruhe, Germany. http://www.sekt-project.org/rd/deliverables/.
Kakoudakis, I., & Theodoulidis, B. (2001). TAU: Towards a unified temporal information management framework. Bulletin of the Italian Association for Artificial Intelligence (AI*IA), 1, 55–61.
Kiryakov, A., Simov, K. I., & Ognyanov, D. (2002). Ontology middleware: Analysis and design. On-To-Knowledge Project Deliverable, 38. http://www.ontoknowledge.org/downl/del38.pdf.
Klein, M. (2004). Change management for distributed ontologies. PhD thesis, Vrije Universiteit Amsterdam. http://www.cs.vu.nl/~mcaklein/thesis/.
Knublauch, H. (2004). Ontology-driven software development in the context of the semantic web: An example scenario with Protégé/OWL. In 1st International Workshop on the Model-Driven Semantic Web (MDSW2004), Monterey, California, USA.
Mikroyannidis, A. (2007). Toward a social semantic web. IEEE Computer, 113–115.
Mikroyannidis, A., & Theodoulidis, B. (2006). Heraclitus II: A framework for ontology management and evolution. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, China (pp. 514–521). IEEE Computer Society.
Mikroyannidis, A., & Theodoulidis, B. (2007). Heraclitus: A framework for semantic web adaptation. IEEE Internet Computing, 11(3), 45–52.
Mikroyannidis, A., Theodoulidis, B., & Persidis, A. (2006). PARMENIDES: Towards business intelligence discovery from Web Data. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, China (pp. 1057–1060). IEEE Computer Society.
Noy, N., & Musen, M. (2003). The PROMPT suite: Interactive tools for ontology merging and mapping. Technical report, SMI, Stanford University, CA, USA. http://smiweb.stanford.edu/auslese/smi-web/research/details.jsp?PubId=973.
Petasis, G., Karkaletsis, V., Paliouras, G., Androutsopoulos, I., & Spyropoulos, C. D. (2002). Ellogon: A new text engineering platform. In 3rd International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain (pp. 72–78).
Schaffert, S. (2006). IkeWiki: A semantic wiki for collaborative knowledge management. In 15th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE’06), Manchester, UK (pp. 388–396).
The Gene Ontology Consortium. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25–29.
Theodoulidis, B., Kakoudakis, I., Hanias, K., Pilafitzis, G., Svinterikou, M., & Kazantzidis, E. (1998). The TAU temporal object system. In 6th International Conference on Extending Database Technology (EDBT), Valencia, Spain.
W3c (2006). Time ontology in OWL. In J. R. Hobbs, & F. Pan (Eds.), W3C Working Draft. World Wide Web Consortium. http://www.w3.org/TR/owl-time/.
Xexeo, G., De Souza, J. M., Vivacqua, A., Miranda, B., Braga, B., Almentero, B. K., et al. (2004). Peer-to-peer collaborative editing of ontologies. In 8th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2004), Xiamen, China (pp. 186–190).
Alexander Mikroyannidis is a postdoctoral research fellow in the Knowledge Media Institute, Open University. His expertise is mainly in information and knowledge management with the use of Semantic and Social Web technologies, such as ontologies and folksonomies. He holds a PhD in Informatics from Manchester Business School, an MPhil in Computation from the University of Manchester Institute of Science & Technology (UMIST), and a BEng in Electrical and Computer Engineering from the University of Patras, Greece. He has contributed to the 5th, 6th, and 7th Framework Programmes of the European Community, through participation in ROLE (IST-2009-231396), DEMO-net (IST-2006-27219), CASPAR (IST-2005-33572), and PARMENIDES (IST-2001-39023).
Babis Theodoulidis is a senior lecturer in the Manchester Business School. His research interests are in information management, including spatiotemporal information management, information governance, and data/text mining. Theodoulidis has a Diploma in computer engineering and informatics from the University of Patras, Greece, an MSc in computer science from the University of Glasgow, UK, and a PhD in computation from the University of Manchester Institute of Science & Technology (UMIST). He has been involved extensively with the European Community research programs through Esprit I Tempora (P2469), Esprit III ORES (P7224), Esprit IV Chorochronos (FMRX960056), and FP7 Commius (IST-2008-213876), and he was the project manager for FP5 Parmenides (IST-2001-39023).