European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Third-Generation Data Mining: Towards Service-Oriented Knowledge Discovery SoKD’10
September 24, 2010 Barcelona, Spain
Editors
Melanie Hilario (University of Geneva, Switzerland)
Nada Lavrač, Vid Podpečan (Jožef Stefan Institute, Ljubljana, Slovenia)
Joost N. Kok (LIACS, Leiden University, The Netherlands)
Preface
It might seem paradoxical that third-generation data mining (DM) remains an open research issue more than a decade after it was first defined¹. First-generation data mining systems were individual research-driven tools for performing generic learning tasks such as classification or clustering. They were aimed mainly at data analysis experts whose technical know-how allowed them to do extensive data preprocessing and tool parameter-tuning. Second-generation DM systems gained in both diversity and scope: they not only offered a variety of tools for the learning task but also provided support for the full knowledge discovery process, in particular for data cleaning and data transformation prior to learning. These so-called DM suites remained, however, oriented towards the DM professional rather than the end user. The idea of third-generation DM systems, as defined in 1997, was to empower the end user by focusing on solutions rather than tool suites; domain-specific shells were wrapped around a core of DM tools, and graphical interfaces were designed to hide the intrinsic complexity of the underlying DM methods. Vertical DM systems have been developed for applications in data-intensive fields such as bioinformatics, banking and finance, e-commerce, telecommunications, or customer relationship management. However, driven by the unprecedented growth in the amount and diversity of available data, advances in data mining and related fields gradually led to a revised and more ambitious vision of third-generation DM systems. Knowledge discovery in databases, as it was understood in the 1990s, turned out to be just one subarea of a much broader field that now includes mining unstructured data in text and image collections, as well as semi-structured data from the rapidly expanding Web. With the increased heterogeneity of data types and formats, the limitations of attribute-value vectors and their associated propositional learning techniques were acknowledged, then overcome through the development of complex object representations and relational mining techniques. Outside the data mining community, other areas of computer science rose to the challenges of the data explosion. To scale up to tera-order data volumes, high-performance computers proved to be individually inadequate and had to be networked into grids in order to divide and conquer computationally intensive tasks. More recently, cloud computing allows for the distribution of data and computing load to a large number of distant computers, while doing away with the centralized hardware infrastructure of grid computing. The need to harness multiple computers for a given task gave rise to novel software paradigms, foremost of which is service-oriented computing.
¹ G. Piatetsky-Shapiro. Data mining and knowledge discovery: The third generation. In Foundations of Intelligent Systems: 10th International Symposium, 1997.
As its name suggests, service-oriented computing utilizes services as the basic constructs to enable composition of applications from software and other resources distributed across heterogeneous computing environments and communication networks. The service-oriented paradigm has induced a radical shift in our definition of third-generation data mining. The 1990s vision of a data mining tool suite encapsulated in a domain-specific shell gives way to a service-oriented architecture with functionality for identifying, accessing and orchestrating local and remote data/information resources and mining tools into a task-specific workflow. Thus the major challenge facing third-generation DM systems is the integration of these distributed and heterogeneous resources and software into a coherent and effective knowledge discovery process. Semantic Web research provides the key technologies needed to ensure interoperability of these services; for instance, the availability of widely accepted task and domain ontologies ensures common semantics for the annotation, search and retrieval of the relevant data/knowledge/software resources, thus enabling the construction of shareable and reusable knowledge discovery workflows. SoKD’10 is the third in a series of workshops that serve as the forum for ongoing research on service-oriented knowledge discovery. The papers selected for this edition can be grouped under three main topics. Three papers propose novel techniques for the construction, analysis and re-use of data mining workflows. A second group of two papers addresses the problem of building ontologies for knowledge discovery. Finally, two papers describe applications of service-oriented knowledge discovery in plant biology and predictive toxicology.
Geneva, Ljubljana, Leiden July 2010
Melanie Hilario Nada Lavrač Vid Podpečan Joost N. Kok
Workshop Organization
Workshop Chairs
Melanie Hilario (University of Geneva)
Nada Lavrač (Jožef Stefan Institute)
Vid Podpečan (Jožef Stefan Institute)
Joost N. Kok (Leiden University)

Program Committee
Abraham Bernstein (University of Zurich, Switzerland)
Michael Berthold (Konstanz University, Germany)
Hendrik Blockeel (Leuven University, Belgium)
Jeroen de Bruin (Leiden University, The Netherlands)
Werner Dubitzky (University of Ulster, UK)
Alexandros Kalousis (University of Geneva, Switzerland)
Igor Mozetič (Jožef Stefan Institute, Slovenia)
Filip Železný (Czech Technical University, Czechia)

Additional Reviewers
Agnieszka Ławrynowicz (Poznan University of Technology, Poland)
Yvan Saeys (Ghent University, Belgium)
Table of Contents
Data Mining Workflows: Creation, Analysis and Re-use
Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, Simon Fischer . . . . . . 1
Workflow Analysis Using Graph Kernels
Natalja Friesen, Stefan Rüping . . . . . . 13
Re-using Data Mining Workflows
Stefan Rüping, Dennis Wegener, Philipp Bremer . . . . . . 25

Ontologies for Knowledge Discovery
Exposé: An Ontology for Data Mining Experiments
Joaquin Vanschoren, Larisa Soldatova . . . . . . 31
Foundations of Frequent Concept Mining with Formal Ontologies
Agnieszka Ławrynowicz . . . . . . 45

Applications of Service-Oriented Knowledge Discovery
Workflow-based Information Retrieval to Model Plant Defence Response to Pathogen Attacks
Dragana Miljković, Claudiu Mihăilă, Vid Podpečan, Miha Grčar, Kristina Gruden, Tjaša Stare, Nada Lavrač . . . . . . 51
OpenTox: A Distributed REST Approach to Predictive Toxicology
Tobias Girschick, Fabian Buchwald, Barry Hardy, Stefan Kramer . . . . . . 61
Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation
Jörg-Uwe Kietz¹, Floarea Serban¹, Abraham Bernstein¹, and Simon Fischer²
¹ University of Zurich, Department of Informatics, Dynamic and Distributed Information Systems Group, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
{kietz|serban|bernstein}@ifi.uzh.ch
² Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany
[email protected]
Abstract. Knowledge Discovery in Databases (KDD) has grown considerably during the last years, but providing user support for constructing workflows is still problematic. The large number of operators available in current KDD systems makes it difficult for a user to successfully solve her task. Also, workflows can easily reach a huge number of operators (hundreds), and parts of the workflows are applied several times. Therefore, it becomes hard for the user to construct them manually. In addition, workflows are not checked for correctness before execution. Hence, it frequently happens that the execution of the workflow stops with an error after several hours of runtime. In this paper³ we present a solution to these problems. We introduce a knowledge-based representation of Data Mining (DM) workflows as a basis for cooperative-interactive planning. Moreover, we discuss workflow templates, i.e. abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows. This new representation helps users to structure and handle workflows, as it constrains the number of operators that need to be considered. Finally, workflows can be grouped in templates, which fosters re-use and further simplifies DM workflow construction.
1 Introduction
One of the challenges of Knowledge Discovery in Databases (KDD) is assisting the users in creating and executing DM workflows.
³ This paper reports on work in progress. Refer to http://www.e-lico.eu/eProPlan to see the current state of the Data Mining ontology for WorkFlow planning (DMWF), the IDA-API, and the eProPlan Protégé plug-ins we built to model the DMWF. The RapidMiner IDA wizard will be part of a future release of RapidMiner; check http://www.rapidminer.com/ for it.
Existing KDD systems such as the commercial Clementine⁴ and Enterprise Miner⁵ or the open-source Weka⁶, MiningMart⁷, KNIME⁸ and RapidMiner⁹ support the user with nice graphical user interfaces, where operators can be dropped as nodes onto the working pane and the data flow is specified by connecting the operator nodes. This works very well as long as neither the workflow becomes too complicated nor the number of operators becomes too large. The number of operators in such systems, however, has been growing fast. All of them contain over 100 operators, and RapidMiner, which includes Weka, even over 600. It can be expected that the incorporation of text, image, and multimedia mining, as well as the transition from closed systems with a fixed set of operators to open systems that can also use Web services as operators (which is especially interesting for domain-specific data access and transformations), will further accelerate this rate of growth, resulting in total confusion for most users. Not only the number of operators, but also the size of the workflows is growing. Today’s workflows can easily contain hundreds of operators. Parts of the workflows are applied several times (e.g. the preprocessing sub-workflow has to be applied on training, testing, and application data), implying that the users either need to copy/paste or even to design a new sub-workflow¹⁰ several times. None of the systems maintains this “copy” relationship; it is left to the user to maintain the relationship in the light of changes. Another weak point is that workflows are not checked for correctness before execution: it frequently happens that the execution of the workflow stops with an error after several hours of runtime because of small syntactic incompatibilities between an operator and the data it should be applied on. To address these problems several authors [1, 12, 4, 13] propose the use of planning techniques to automatically build such workflows. However, all these approaches are limited in several ways. First, they only model a very small set of operations and they work on very short workflows (less than 10 operators). Second, none of them models operations that work on individual columns of a data set; they only model operations that process all columns of a data set equally together. Lastly, the approaches cannot scale to large numbers of operators and large workflows: their planning approaches will necessarily get lost in the too large space of “correct” (but nevertheless most often unwanted) solutions.
⁴ http://www.spss.com/software/modeling/modeler-pro/
⁵ http://www.sas.com/technologies/analytics/datamining/miner/
⁶ http://www.cs.waikato.ac.nz/ml/weka/
⁷ http://mmart.cs.uni-dortmund.de/
⁸ http://www.knime.org/
⁹ http://rapid-i.com/content/view/181/190/
¹⁰ Several operators must be exchanged and cannot be just reapplied. Consider for example training data (with labels) and application data (without labels). Label-directed operations like feature selection or discretization by entropy used on the training data cannot work on the application data. But even if there is a label, like on separate test data, redoing feature selection/discretization may result in selecting/building different features/bins. But to apply and test the model, exactly the same features/bins have to be selected/built.
-2-
In [6] we reused the idea of hierarchical task decomposition (from the manual support system CITRUS [11]) and knowledge available in Data Mining (e.g. CRISP-DM) for hierarchical task network (HTN) planning [9]. This significantly reduces the number of generated unwanted correct workflows. Unfortunately, since it covers only generic DM knowledge, it still does not capture the most important knowledge a DM engineer uses to judge workflows and models useful: understanding the meaning of the data¹¹. Formalizing the meaning of the data requires a large amount of domain knowledge. Eliciting all the possibly needed background information about the data from the user would probably be more demanding for her than designing useful workflows manually. Therefore, the completely automatic planning of useful workflows is not feasible. The approach of enumerating all correct workflows and then letting the user choose the useful one(s) will likely fail due to the large number of correct workflows (infinite, without a limit on the number of operations in the workflow). Only cooperative-interactive planning of workflows seems to be feasible. In this scenario the planner ensures the correctness of the state of planning and can propose a small number of possible intermediate refinements of the current plan to the user. The user can use her knowledge about the data to choose useful refinements, can make manual additions/corrections, and use the planner again for tasks that can be routinely solved without knowledge about the data. Furthermore, the planner can be used to generate all correct sub-workflows to optimize the workflow by experimentation. In this paper we present a knowledge-based representation of DM workflows, understandable to both planner and user, as the foundation for cooperative-interactive planning. To be able to represent the intermediate states of planning, we generalize this to “workflow templates”, i.e. abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows (or sub-workflow templates). Our workflows follow the structure of a Data Mining Ontology for Workflows (DMWF). It has a hierarchical structure consisting of a task/method decomposition into tasks, methods or operators. Therefore, workflows can be grouped based on the structure decomposition and can be simplified by using abstract nodes. This new representation helps the users since, akin to structured programming, the number of elements (operators, tasks, and methods) of a workflow actively under consideration is reduced significantly. Furthermore, this approach allows grouping certain sequences of operators as templates to be reused later. All this simplifies and improves the design of a DM workflow, reducing the time needed to construct workflows and decreasing the workflow’s size. This paper is organized as follows: Section 2 describes workflows and their representation as well as workflow templates, Section 3 shows the advantages of workflow templates, Section 4 presents the current state and future steps, and finally Section 5 concludes our paper.
¹¹ Consider a binary attribute “address invalid”: just by looking at the data it is almost impossible to infer that it does not make sense to send advertisement to people with this flag set at the moment. In fact, they may have responded to previous advertisements very well.
-3-
2 DM Workflow
DM workflows generally represent a set of DM operators, which are executed and applied on data or models. In most DM tools users only work with operators and set their parameters (values). Data is implicit, hidden in the connectors: the user provides the data and applies the operators, and after each step new data is produced. In our approach we distinguish between all the components of the DM workflow: operators, data, and parameters. To enable the system and user to cooperatively design workflows, we developed a formalization of DM workflows in terms of an ontology. To be able to define a DM workflow we first need to describe the DMWF ontology, since workflows are stored and represented in DMWF format. This ontology encodes rules from the KDD domain on how to solve DM tasks, such as the CRISP-DM [2] steps, in the form of concepts and relations (TBox – terminology). The DMWF has several classes that describe the DM world: IOObjects, MetaData, Operators, Goals, Tasks and Methods. The most important ones are shown in Table 1.

Class | Description | Examples
IOObject | Input and output used by operators | Data, Model, Report
MetaData | Characteristics of the IOObjects | Attribute, AttributeType, DataColumn, DataFormat
Operator | DM operators | DataTableProcessing, ModelProcessing, Modeling, MethodEvaluation
Goal | A DM goal that the user could solve | DescriptiveModelling, PatternDiscovery, PredictiveModelling, RetrievalByContent
Task | A task is used to achieve a goal | CleanMV, CategorialToScalar, DiscretizeAll, PredictTarget
Method | A method is used to solve a task | CategorialToScalarRecursive, CleanMVRecursive, DiscretizeAllRecursive, DoPrediction

Table 1: Main classes from the DMWF ontology¹²
The classes from the DMWF ontology are connected through properties, as shown in Table 2.

Properties | Domain | Range | Description
uses (usesData, usesModel) | Operator | IOObject | defines input for an operator
produces (producesData, producesModel) | Operator | IOObject | defines output for an operator
parameter | Operator | MetaData | defines other parameters for operators
simpleParameter | Operator | data type | defines other parameters for operators
solvedBy | Task | Method | A task is solved by a method
worksOn (inputData, outputData) | TaskMethod | IOObject | The IOObject elements the Task or Method works on
worksWith | TaskMethod | MetaData | The MetaData elements the Task or Method worksWith
decomposedTo | Method | Operator/Task | A Method is decomposed into a set of steps

Table 2: Main roles from the DMWF ontology

¹² Later on we use usesProp, producesProp, simpleParamProp, etc. to denote the subproperties of uses, produces, simpleParameter, etc.

The parameters of operators as well as some basic characteristics of data are values (integer, double, string, etc.) in terms of data properties,
-4-
e.g. number of records for each data table, number of missing values for each column, mean value and standard deviation for each scalar column, number of different values for nominal columns, etc. Having them modeled in the ontology enables the planner to use them for planning.
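The kind of per-table and per-column characteristics listed above is cheap to compute directly from a data table. As a rough illustration only (a pandas-based helper of our own, not part of eProPlan or the DMWF), it could look like this:

```python
# Sketch: collect the table- and column-level meta data mentioned above.
import pandas as pd

def column_metadata(df: pd.DataFrame) -> dict:
    meta = {"number_of_records": len(df), "columns": {}}
    for col in df.columns:
        s = df[col]
        info = {"missing_values": int(s.isna().sum()),
                "distinct_values": int(s.nunique())}
        if pd.api.types.is_numeric_dtype(s):   # scalar column
            info["mean"] = float(s.mean())
            info["std"] = float(s.std())
        meta["columns"][col] = info
    return meta

print(column_metadata(pd.DataFrame({"age": [23, None, 41], "job": ["a", "b", "a"]})))
```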
2.1 What is a workflow?
In our approach a workflow constitutes an instantiation of the DM classes; more precisely, it is a set of ontological individuals (ABox – assertions). It is mainly composed of several basic operators, which can be executed or applied with the given parameters. The workflow follows the structure illustrated in Fig. 1. A workflow consists of several operator applications (instances of operators) as well as their inputs and outputs – instances of IOObject, simple parameters (values which can have different data types like integer, string, etc.), or parameters – instances of MetaData. The flow itself is rather implicit: it is represented by shared IOObjects used and produced by Operators. The reasoner can ensure that every IOObject has only one producer and that every IOObject is either given as input to the workflow or produced before it can be used.

Operator[usesProp₁ {1,1} ⇒ IOObject, …, usesPropₙ {1,1} ⇒ IOObject,
producesProp₁ {1,1} ⇒ IOObject, …, producesPropₙ {1,1} ⇒ IOObject,
parameterProp₁ {1,1} ⇒ MetaData, …, parameterPropₙ {1,1} ⇒ MetaData,
simpleParamProp₁ {1,1} ⇒ dataType, …, simpleParamPropₙ {1,1} ⇒ dataType].

Fig. 1: TBox for operator applications and workflows

Fig. 2 illustrates an example of a real workflow. It is not a linear sequence, since models are shared between subprocesses, so the workflow produced is a DAG (Directed Acyclic Graph). The workflow consists of two subprocesses, training and testing, which share the models. We have a set of basic operator individuals (FillMissingValues1, DiscretizeAll1, etc.) which use individuals of IOObject (TrainingData, TestData, DataTable1, etc.) as input and produce individuals of IOObject (PreprocessingModel1, Model1, etc.) as output. The example does not display the parameters and simple parameters of operators, but each operator could have several such parameters.
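The following toy sketch (plain Python, our own encoding rather than the actual OWL ABox) mirrors this representation: operator applications linked by shared IOObjects, together with the two checks the reasoner performs — every IOObject has at most one producer, and everything used is either a workflow input or produced earlier. The wiring is a simplified fragment of Fig. 2, not the complete example.

```python
# Each step: (operator instance, IOObjects it uses, IOObjects it produces).
workflow_inputs = {"TrainingData", "TestData"}
steps = [
    ("FillMissingValues1", {"TrainingData"}, {"DataTable1", "PreprocessingModel1"}),
    ("DiscretizeAll1", {"DataTable1"}, {"DataTable2", "PreprocessingModel2"}),
    ("Modeling1", {"DataTable2"}, {"Model1"}),
    ("ApplyModel1", {"TestData", "Model1"}, {"DataTable5"}),
]

def check(workflow_inputs, steps):
    producers, available = {}, set(workflow_inputs)
    for op, uses, produces in steps:
        # every used object must already be available (input or produced earlier)
        assert uses <= available, f"{op} uses an object that is not yet available"
        for obj in produces:
            # every IOObject has at most one producer
            assert obj not in producers, f"{obj} has more than one producer"
            producers[obj] = op
        available |= produces
    return producers

print(check(workflow_inputs, steps))
```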
2.2 Workflow templates
Very often DM workflows have a large number of operators (hundreds); moreover, some sequences of operators may repeat and be executed several times in the same workflow. This becomes a real problem, since the users need to construct and maintain the workflows manually. To overcome this problem we introduce the notion of workflow templates. When the planner generates a workflow it follows a set of task/method decomposition rules encoded in the DMWF ontology. Every task has a set of methods able to solve it. The task solved by a method is called the head of the method.
-5-
Fig. 2: A basic workflow example (a DAG with training and testing subprocesses that share the preprocessing models and the learned model)
Each method is decomposed into a sequence of steps, which can be either tasks or operators, as shown in the specification in Fig. 3. The matching between the current and the next step is done based on operators’ conditions and effects as well as methods’ conditions and contributions, as described in [6]. Such a set of task/method decompositions works similarly to a context-free grammar: tasks are the non-terminal symbols of the grammar, operators are the terminal symbols (or alphabet), and the methods for a task are the grammar rules that specify how a non-terminal can be replaced by a sequence of (simpler) tasks and operators. In this analogy the workflows are words of the language specified by the task/method decomposition grammar. To be able to generate not only operator sequences but also operator DAGs¹⁴, it additionally contains a specification for passing parameter constraints between methods, tasks and operators¹⁵. In the decomposition process the properties of the method’s head (the task) or one of the steps can be bound to the same variable as the properties of other steps.

TaskMethod[worksOnProp₁ ⇒ IOObject, …, worksOnPropₙ ⇒ IOObject,
worksWithProp₁ ⇒ MetaData, …, worksWithPropₙ ⇒ MetaData]
{Task, Method} :: TaskMethod. Task[solvedBy ⇒ Method].
{step₁, …, stepₙ} :: decomposedTo. Method[step₁ ⇒ {Operator|Task}, …, stepₙ ⇒ {Operator|Task}].
Method.{head|stepᵢ}.prop = Method.{head|stepᵢ}.prop
prop := worksOnProp | worksWithProp | usesProp | producesProp | parameterProp | simpleParamProp

Fig. 3: TBox for task/method decomposition and parameter passing constraints

A workflow template represents the upper (abstract) nodes from the generated decomposition, which in fact are either tasks, methods or abstract operators. If we look at the example in Fig. 2, none of the nodes are basic operators. Indeed, they are all tasks serving as place-holders for several possible basic operators.

¹⁴ The planning process is still sequential, but the resulting structure may have a non-linear flow of objects.
¹⁵ Giving it the expressive power of a first-order logic Horn-clause grammar.
-6-
For example, DiscretizeAll has different discretization methods, as described in Section 3; therefore DiscretizeAll represents a task which can be solved by the DiscretizeAllAtOnce method. The method can have several steps, e.g. the first step is an abstract operator RM DiscretizeAll, which subsequently has several basic operators like RM Discretize All by Size and RM Discretize All by Frequency. The workflows are produced by an HTN planner [9] based on the DMWF ontology as background knowledge (domain) and on the goal and data description (problem). In fact, a workflow is equivalent to a generated plan. The planner generates only valid workflows, since it checks the preconditions of every operator present in the workflow; an operator’s effects establish the preconditions of the next operator in the workflow. In most of the existing DM tools the user can design a workflow, start executing it, and after some time discover that some operator was applied on data with missing values or on nominals while, in fact, it can handle only missing-value-free data and scalars. Our approach can avoid such annoying and time-consuming problems by using the conditions and effects of operators. An operator is applicable only when its preconditions are satisfied; therefore the generated workflows are semantically correct.
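To make the grammar analogy concrete, here is a toy sketch (our own encoding, not the DMWF itself; the underscores in the names are ours, the names follow the text above): tasks, methods and abstract operators act as non-terminals, executable operators as terminals, and every fully expanded sequence corresponds to one candidate sub-workflow.

```python
# symbol -> list of alternative decompositions (one per method/refinement),
# each decomposition being a sequence of steps
GRAMMAR = {
    "DiscretizeAll": [["DiscretizeAllAtOnce"]],
    "DiscretizeAllAtOnce": [["RM_DiscretizeAll"]],
    # abstract operator refined into executable RapidMiner operators
    "RM_DiscretizeAll": [["RM_Discretize_All_by_Size"],
                         ["RM_Discretize_All_by_Frequency"]],
}

def expand(symbol):
    """Yield all terminal (executable) operator sequences for a symbol."""
    if symbol not in GRAMMAR:            # terminal = executable operator
        yield [symbol]
        return
    for steps in GRAMMAR[symbol]:        # one alternative per method
        seqs = [[]]
        for step in steps:
            seqs = [s + tail for s in seqs for tail in expand(step)]
        yield from seqs

print(list(expand("DiscretizeAll")))
# [['RM_Discretize_All_by_Size'], ['RM_Discretize_All_by_Frequency']]
```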
3 Workflow Templates for auto-experimentation
To illustrate the usefulness of our approach, consider the following common scenario. Given a data table containing numerical data, a modelling algorithm should be applied that is not capable of processing numerical values, e.g., a simple decision tree induction algorithm. In order to still utilize this algorithm, attributes must first be discretized. To discretize a numerical attribute, its range of possible numerical values is partitioned, and each numerical value is replaced by the generated name of the partition it falls into. The data miner has multiple options to compute this partition, e.g., RapidMiner [8] contains five different algorithms to discretize data:
– Discretize by Binning. The numerical values are divided into k ranges of equal size. The resulting bins can be arbitrarily unbalanced.
– Discretize by Frequency. The numerical values are inserted into k bins divided at thresholds computed such that an equal number of examples is assigned to each bin. The ranges of the resulting bins may be arbitrarily unbalanced.
– Discretize by Entropy. Bin boundaries are chosen so as to minimize the entropy in the induced partitions. The entropy is computed with respect to the label attribute.
– Discretize by Size. Here, the user specifies the number of examples that should be assigned to each bin. Consequently, the number of bins will vary.
– Discretize by User Specification. Here, the user can manually specify the boundaries of the partition. This is typically only useful if meaningful boundaries are implied by the application domain.
-7-
Each of these operators has its advantages and disadvantages. However, there is no universal rule of thumb as to which of the options should be used depending on the characteristics or domain of the data. Still, some of the options can be excluded in some cases. For example, the entropy can only be computed if a nominal label exists. There are also soft rules, e.g., it is not advisable to choose any discretization algorithm with fixed partition boundaries if the attribute values are skewed; then one might end up with bins that contain only very few examples. Though no such rule of thumb exists, it is also evident that the choice of discretization operator can have a huge impact on the result of the data mining process. To support this statement, we have performed experiments on some standard data sets. We executed all combinations of the five discretization variants – Discretize by Binning with two and four bins, Discretize by Frequency with two and four bins, and Discretize by Entropy – on the four numerical attributes of the well-known UCI data set Iris. Following the discretization, a decision tree was generated and evaluated using a ten-fold cross-validation¹⁶. We can observe that the resulting accuracy varies significantly, between 64.0% and 94.7% (see Table 3). Notably, the best performance is not achieved by selecting a single method for all attributes, but by choosing a particular combination. This shows that finding the right combination can actually be worth the effort.

Dataset | #numerical attr. | #total attr. | min. accuracy | max. accuracy
Iris | 4 | 4 | 64.0% | 94.7%
Adult | 6 | 14 | 82.6% | 86.3%

Table 3: Optimizing the discretization method can be a huge gain for some tables, whereas it is negligible for others.

Consider the number of different combinations possible for k discretization operators and m numeric attributes. This makes for a total of k^m combinations. If we want to try i different values for the number of bins, we even have (k · i)^m different combinations. In the case of our above example, this makes for a total of 1296 combinations. Even knowing that the choice of discretization operator can make a huge difference, most data miners will not be willing to perform such a huge number of experiments. In principle, it is possible to execute all combinations in an automated fashion using standard RapidMiner operators. However, such a process must be custom-made for the data set at hand. Furthermore, discretization is only one out of numerous typical preprocessing steps. If we take into consideration other steps like the replacement of missing values, normalization, etc., the complexity of such a task grows beyond any reasonable bound. This is where workflow templates come into play. In a workflow template, it is merely specified that at some point in the workflow all attributes must be discretized, missing values be replaced or imputed, or a similar goal be achieved. The planner can then create a collection of plans satisfying these constraints.
¹⁶ The process used to generate these results is available on the myExperiment platform [3]: http://www.myexperiment.org/workflows/1344
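For readers who want a feel for this kind of auto-experimentation, the sketch below re-creates the idea with scikit-learn rather than RapidMiner. It is only an approximation of the experiment above: entropy-based discretization is not included, and equal-width/equal-frequency binning with 2 or 4 bins stand in for the RapidMiner operators, giving 4^4 = 256 rather than 1296 per-attribute combinations.

```python
# Enumerate per-attribute discretization combinations on Iris and evaluate
# each candidate "workflow" with a decision tree and ten-fold cross-validation.
from itertools import product

from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs = X.shape[1]

# candidate settings per attribute: (strategy, number of bins)
candidates = [("uniform", 2), ("uniform", 4), ("quantile", 2), ("quantile", 4)]

results = []
for combo in product(candidates, repeat=n_attrs):   # 4^4 = 256 combinations
    transformer = ColumnTransformer(
        [(f"disc{i}", KBinsDiscretizer(n_bins=b, encode="ordinal", strategy=s), [i])
         for i, (s, b) in enumerate(combo)])
    model = make_pipeline(transformer, DecisionTreeClassifier(random_state=0))
    acc = cross_val_score(model, X, y, cv=10).mean()
    results.append((acc, combo))

results.sort(reverse=True)
print("best:", results[0])
print("worst:", results[-1])
```

Even this toy version shows the spread between the best and worst combination; the full experiment with entropy discretization and the Adult data set follows the same loop, just with more candidates per attribute.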
-8-
Clearly, simply enumerating all plans only helps if there is enough computational power to try all possible combinations. Where this is not possible, the number of plans must be reduced. Several options exist:
– Where fixed rules of thumb like the two rules mentioned above exist, this is expressed in the ontological description of the operators. Thus, the search space can be reduced, and less promising plans can be excluded from the resulting collection of plans.
– The search space can be restricted by allowing only a subset of possible combinations. For example, we can force the planner to apply the same discretization operator to all attributes (but still allow any combination with other preprocessing steps).
– The ontology is enriched by resource consumption annotations describing the projected execution time and memory consumption of the individual operators. This can be used to rank the retrieved plans.
– Where none of the above rules exist, meta mining from systematic experimentation can help to rank plans and test their execution in a sensible order. This is ongoing work within the e-Lico project.
– Optimizing the discretization step does not necessarily yield such a huge gain as presented above for all data sets. We executed a similar optimization to the one presented above for the numerical attributes of the Adult data set. Here, the accuracy only varies between 82.6% and 86.3% (see Table 3). In hindsight, the reason for this is clear: whereas all of the attributes of the Iris data set are numerical, only 6 out of 14 attributes of the Adult data set are. Hence, the expected gain for Iris is much larger. A clever planner can spot this fact, removing possible plans where no large gain can be expected. Findings like these can also be supported by meta mining.
All these approaches help the data miner to optimize steps where this is promising, and to generate and execute the necessary processes to be evaluated.
4 Current state
The current state and some of the future development plans of our project are shown in Fig. 4. The system consists of a modeling environment called eProPlan (e-Lico Protégé-based Planner), in which the ontology that defines the behavior of the Intelligent Discovery Assistant (IDA) is modeled. eProPlan comprises several Protégé 4 plug-ins [7] that add the modeling of the operators with their conditions and effects and the task-method decomposition to the base ontology modeling. It allows analyzing workflow inputs and setting up the goals to be reached in the workflow.
-9-
Fig. 4: (a) The eProPlan architecture (modeling & testing, workflow generation, and reasoning & planning, connected through the IDA-API and the DMO); (b) the services of the planner (e.g. validate/explain/repair plan, expand task, retrieve plans, N plans for a task, N best plans, best method, best/applicable operators, apply operator).
It also adds a reasoner interface to our reasoner/planner, such that the applicability of operators to IO-Objects can be tested (i.e. the correct modeling of the condition of an operator), a single operator can be applied with an applicable parameter setting (i.e. the correct modeling of the effect of an operator can be tested), and the planner can be asked to generate a whole plan for a specified task (i.e. the task-method decomposition can be tested). Using eProPlan we modeled the DMWF ontology, which currently consists of 64 Modeling (DM) Operators, including supervised learning, clustering, and association rule generation, of which 53 are leaves, i.e. executable RapidMiner operators. We also have 78 executable Preprocessing Operators from RapidMiner and 30 abstract Groups categorizing them. We also have 5 Reporting (e.g. a data audit, ROC curve), 5 Model evaluation (e.g. cross-validation) and Model application operators from RapidMiner. The domain model which describes the IO-Objects of operators (i.e. data tables, models, reports, text collections, image collections) consists of 43 classes. With that, the DMWF is by far the largest collection of real operators modeled for any planner-IDA in the related work. A main innovation of our domain model over all previous planner-based IDAs is that we did not stop with the IO-Objects, but modeled their parts as well, i.e. we modeled the attributes and the relevant properties a data table consists of. With this model we are able to capture the conditions and effects of all these operators not only on the table level but also on the column level. This important improvement was illustrated with the example of discretization in the last section. On the Task/Method decomposition side we modeled a CRISP-DM top-level HTN. Its behavior can be modified by currently 15 (sub-)Goals that are used as further hints for the HTN planner. We also have several bottom-level tasks such as the DiscretizeAll described in the last section, e.g. for Missing Value imputation and Normalization. To access our planner IDA in a data mining environment we are currently developing an IDA-API (Intelligent Data Assistant – Application Programming Interface). The first version of the API will offer the “AI-Planner” services in Fig. 4(b), but we are also working to extend our planner with the case-based planner services shown there, and our partner is working to integrate the probabilistic planner services [5]. The integration of the API into RapidMiner as a wizard is displayed in Fig. 5, and it will be integrated into Taverna [10] as well.
Fig. 5: A screenshot of the IDA planner integrated as a Wizard into RapidMiner.
5 Conclusion and future work
In this paper we introduced a knowledge-based representation of DM workflows as a basis for cooperative-interactive workflow planning. Based on that, we presented the main contribution of this paper: the definition of workflow templates, i.e. abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows. We argued that these workflow templates serve very well as a common workspace for user and system to cooperatively design workflows. Due to their hierarchical task structure they help to make large workflows neat. We experimentally showed, using the example of discretization, that they help to optimize the performance of workflows by auto-experimentation. Future work will try to meta-learn from these workflow-optimization experiments, such that a probabilistic extension of the planner can rank the plans based on their expected success. We argued that knowledge about the content of the data (which cannot be extracted from the data) has a strong influence on the design of useful workflows. Therefore, previously designed workflows for similar data and goals likely contain an implicit encoding of this knowledge. This means an extension to case-based planning is a promising direction for future work as well. We expect workflow templates to help us in case adaptation as well, because they show what a sub-workflow is meant to achieve on the data.
Acknowledgements: This work is supported by the European Community 7th framework ICT-2007.4.4 (No 231519) “e-Lico: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science”.
References
1. A. Bernstein, F. Provost, and S. Hill. Towards Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification. IEEE Transactions on Knowledge and Data Engineering, 17(4):503–518, April 2005.
2. P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0: Step-by-step data mining guide. Technical report, The CRISP-DM Consortium, 2000.
3. D. De Roure, C. Goble, and R. Stevens. The design and realisation of the myExperiment virtual research environment for social sharing of workflows. In Future Generation Computer Systems 25, pages 561–567, 2009.
4. C. Diamantini, D. Potena, and E. Storti. KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms. In Service-oriented Knowledge Discovery (SoKD-09) Workshop at ECML/PKDD 2009, 2009.
5. M. Hilario, A. Kalousis, P. Nguyen, and A. Woznica. A data mining ontology for algorithm selection and meta-learning. In Service-oriented Knowledge Discovery (SoKD-09) Workshop at ECML/PKDD 2009, 2009.
6. J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer. Towards cooperative planning of data mining workflows. In Service-oriented Knowledge Discovery (SoKD-09) Workshop at ECML/PKDD 2009, 2009.
7. H. Knublauch, R. Fergerson, N. Noy, and M. Musen. The Protégé OWL plugin: An open development environment for semantic web applications. Lecture Notes in Computer Science, pages 229–243, 2004.
8. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. YALE: Rapid prototyping for complex data mining tasks. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–940. ACM, 2006.
9. D. Nau, T.-C. Au, O. Ilghami, U. Kuter, W. Murdock, D. Wu, and F. Yaman. SHOP2: An HTN planning system. JAIR, 20:379–404, 2003.
10. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, M. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 2004.
11. R. Wirth, C. Shearer, U. Grimmer, T. P. Reinartz, J. Schlösser, C. Breitner, R. Engels, and G. Lindner. Towards process-oriented tool support for knowledge discovery in databases. In PKDD ’97: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pages 243–253, London, UK, 1997. Springer-Verlag.
12. M. Žáková, P. Křemen, F. Železný, and N. Lavrač. Planning to learn with a knowledge discovery ontology. In Planning to Learn Workshop (PlanLearn 2008) at ICML 2008, 2008.
13. M. Žáková, V. Podpečan, F. Železný, and N. Lavrač. Advancing data mining workflow construction: A framework and cases using the Orange toolkit. In Service-oriented Knowledge Discovery (SoKD-09) Workshop at ECML/PKDD 2009, 2009.
Workflow Analysis using Graph Kernels
Natalja Friesen and Stefan Rüping
Fraunhofer IAIS, 53754 St. Augustin, Germany
{natalja.friesen,stefan.rueping}@iais.fraunhofer.de
WWW home page: http://www.iais.fraunhofer.de
Abstract. Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. Since the construction of a workflow for a specific task can become quite complex, efforts are currently underway to increase the re-use of workflows through the implementation of specialized workflow repositories. While existing methods to exploit the knowledge in these repositories usually consider workflows as an atomic entity, our work is based on the fact that workflows can naturally be viewed as graphs. Hence, in this paper we investigate the use of graph kernels for the problems of workflow discovery, workflow recommendation, and workflow pattern extraction, paying special attention to the typical situation of few labeled and many unlabeled workflows. To empirically demonstrate the feasibility of our approach we investigate a dataset of bioinformatics workflows retrieved from the website myexperiment.org. Key words: Workflow analysis, graph mining
1 Introduction
Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. A workflow is basically a description of the order in which a set of services have to be called, and with which input, in order to solve a given task. Since the construction of a workflow for a specific task can become quite complex, efforts are currently underway to increase the re-use of workflows through the implementation of specialized workflow repositories. Driven by specific applications, a large collection of workflow systems have been prototyped, such as Taverna [12] or Triana [15]. As high numbers of workflows can be generated and stored relatively easily, it becomes increasingly hard to keep an overview of the available workflows. Workflow repositories and websites such as myexperiment.org tackle this problem by offering the research community the possibility to publish and exchange complete workflows. An even higher degree of integration has been described in the idea of developing a Virtual Research Environment (VRE, [2]). Due to the complexity of managing a large repository of workflows, data mining approaches are needed to support the user in making good use of the knowledge that is encoded in these workflows. In order to improve the flexibility of a workflow system, a number of data mining tasks can be defined:
Workflow recommendation: Compute a ranking of the available workflows with respect to their interestingness to the user for a given task. As it is hard to formally model the user’s task and his interest in a workflow, one can also define the task of finding a measure of similarity on workflows. Given a (partial) workflow for the task the user is interested in, the most similar workflows are then recommended to the user.
Metadata extraction: Given a workflow (and possibly partial metadata), infer the metadata that describes the workflow best. As most approaches for searching and organizing workflows are based on descriptive metadata, this task can be seen as the automation of the extraction of workflow semantics.
Pattern extraction: Given a set of workflows, extract a set of sub-patterns that are characteristic for this set of workflows. A practical purpose of these patterns is to serve as building blocks for new workflows. In particular, given several sets of workflows, one can also define the task of extracting the most discriminative patterns, i.e. patterns that are characteristic for one group but not the others.
Workflow construction: Given a description of the task, automatically construct a workflow solving the task from scratch. An approach to workflow construction, based on cooperative planning, is proposed in [11]. However, this approach requires a detailed ontology of services [8], which in practice is often not available. Hence, we do not investigate this task in this paper.
In existing approaches to the retrieval and discovery of workflows, workflows are usually considered as an atomic entity, using workflow metadata such as its usage history, textual descriptions (in particular tags), or user-generated quality labels as descriptive attributes. While these approaches can deliver high-quality results, they are limited by the fact that all these attributes require either a high user effort to describe the workflow (to use text mining techniques), or a frequent use of each workflow by many different users (to mine for correlations). We restrict our investigations to the second approach, considering the case where a large collection of working workflows is available. In this paper we are interested in supporting the user in constructing the workflow and reducing the manual effort of workflow tagging. The reason for the focus on the early phases of workflow construction is that in practice it can be observed that users are often reluctant to put too much effort into describing a workflow; they are usually only interested in using the workflow system as a means to get their work done. A second aspect to be considered is that without proper means to discover existing workflows for re-use, it will be hard to receive enough usage information on a new workflow to start up a correlation-based recommendation in the first place. To address these problems, we have opted to investigate solutions to the previously described data mining tasks that can be applied in the common situation of many unlabeled workflows, using only the workflow description itself and no metadata. Our work is based on the fact that workflows can be viewed as graphs. We will demonstrate that by the use of graph kernels it is possible to effectively extract workflow semantics and use this knowledge for the problems of workflow recommendation and metadata extraction.
The purpose of this paper is to answer the following questions:
Q1: How good are graph kernels at performing the task of workflow recommendation without explicit user input? We will present an approach that is based on exploiting workflow similarity.
Q2: Can appropriate metadata about a workflow be extracted from the workflow itself? What can we infer about the semantics of a workflow and its key characteristics? In particular, we will investigate the task of tagging a workflow with a set of user-defined keywords.
Q3: How well does graph mining perform at a descriptive approach to workflow analysis, namely the extraction of meaningful graph patterns?
The remainder of the paper is structured as follows: Next, we will discuss related work in the area of workflow systems. In Section 3, we give a detailed discussion of the representation of workflows and the associated metadata. Section 4 will present the approach of using graph kernels for workflow analysis. The approach will be evaluated on four distinct learning tasks on a dataset of bioinformatics workflows retrieved from the website http://myexperiment.org in Section 5. Section 6 concludes.
2 Related Work
Since workflow systems are getting more complicated, the development of effective discovery techniques particularly for this field has been addressed by many researchers during the last years. Public repositories that enable sharing of workflows are widely used both in business and scientific communities. While first steps toward supporting the user have been made, there is still a need to improve the effectiveness of discovery methods and support the user in navigating the space of available workflows. A detailed overview of different approaches for workflow discovery is given by Goderis [4]. Most approaches are based on simple search functionalities and consider a workflow as an atomic entity. Searching over workflow annotations like titles and textual descriptions, or discovery on the basis of user profiles, belongs to the basic capabilities of repositories such as myExperiment [14], BioWep¹, Kepler² or commercial systems like Infosense and Pipeline Pilot. In [5] a detailed study about current practices in workflow sharing, re-use and retrieval is presented. To summarize, the need to take into account structural properties of workflows in the retrieval process was underlined by several users. The authors demonstrate that existing techniques are not sufficient and there is still a need for effective discovery tools. In [6] retrieval techniques and methods for ranking discovered workflows based on graph-subisomorphism matching are presented.
¹ http://bioinformatics.istge.it/biowep/
² https://kepler-project.org/
Coralles [1] proposes a method for calculating the structural similarity of two BPEL (Business Process Execution Language) workflows represented by graphs. It is based on error-correcting graph subisomorphism detection. Apart from workflow sharing and retrieval, the design of new workflows is an immense challenge to users of workflow systems. It is both time-consuming and error-prone, as there is a great diversity of choices regarding services, parameters, and their interconnections. It requires the researcher to have specific knowledge in both his research area and in the use of the workflow system. Consequently, it is preferable for a researcher to not start from scratch, but to receive assistance in the creation of a new workflow. A good way to implement this assistance is to reuse or re-purpose existing workflows or workflow patterns (i.e. more generic fragments of workflows). An example of workflow re-use is given in [7], where a workflow to identify genes involved in tolerance to Trypanosomiasis in East African cattle was reused successfully by another scientist to identify the biological pathways implicated in the ability of mice to expel the Trichuris muris parasite. In [7] it is argued that designing new workflows by reusing and re-purposing previous workflows or workflow patterns has the following advantages:
– Reduction of workflow authoring time
– Improved quality through shared workflow development
– Improved experimental provenance through reuse of established and validated workflows
– Avoidance of workflow redundancy
While there has been some research comparing workflow patterns in a number of commercially available workflow management systems [17] or identifying patterns that describe the behavior of business processes [18], to the best of our knowledge there exists no work to automatically extract patterns. A pattern mining method for business workflows based on the calculation of support values is presented in [16]. However, the set of patterns that was used was derived manually based on an extensive literature study.
3 Workflows
A workflow is a way to formalize and structure complex data analysis experiments. Scientific workflows can be described as a sequence of computation steps, together with predefined input and output, that arise in scientific problem solving. Such a definition of workflows enables sharing analysis knowledge within scientific communities in a convenient way. We consider the discovery of similar workflows in the context of a specific VRE called myExperiment [13]. MyExperiment has been developed to support sharing of scientific objects associated with an experiment. It is a collaborative environment where scientists can publish their workflows. Each stored workflow is created by a specific user, is associated with a workflow graph, and contains metadata and certain statistics such as the number of downloads or the average rating given by the users.
We split all available information about a workflow into four different groups: the workflow graph, textual data, user information, and workflow statistics. Next we will characterize each group in more detail.
Textual Data: Each workflow in myExperiment has a title and a description text and contains information about the creator and date of creation. Furthermore, the associated tags annotate the workflow with several keywords that facilitate searching for workflows and provide more precise results.
User Information: MyExperiment was also conceived as a social infrastructure for researchers. The social component is realized by the registration of users and allows them to create profiles with different kinds of personal information and details about their work and professional life. The members of myExperiment can form complex relationships with other members, such as creating or joining user groups or giving credit to others. All this information can be used in order to find groups of users having similar research interests or working in related projects. In the end, this type of information can be used to generate the well-known correlation-based recommendations of the type “users who liked this workflow also liked the following workflows...”.
Workflow Statistics: As statistic data we consider information that changes with time, such as the number of views or downloads or the average rating. Statistic data can be very useful for providing a user with a workflow he is likely to be interested in. As we do not have direct information about user preferences, some of the statistics data, e.g. the number of downloads or the rating, can be considered as a kind of quality measure.
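As a purely illustrative summary (our own record layout, not the myExperiment data model or API), the four information groups attached to one stored workflow could be collected like this:

```python
# One record per stored workflow, grouping graph, textual data,
# user information and statistics.
from dataclasses import dataclass, field

@dataclass
class WorkflowEntry:
    graph: dict                              # workflow graph (see Sec. 4.3)
    title: str = ""                          # textual data
    description: str = ""
    tags: list = field(default_factory=list)
    creator: str = ""                        # user information
    groups: list = field(default_factory=list)
    views: int = 0                           # workflow statistics
    downloads: int = 0
    avg_rating: float = 0.0

wf = WorkflowEntry(graph={}, title="BLAST search", tags=["bioinformatics", "BLAST"],
                   creator="user42", downloads=17, avg_rating=4.2)
print(wf.tags)
```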
4 A Graph Mining Approach to Workflow Analysis
The characterization of a workflow by metadata alone is challenging because none of these features gives an insight into the underlying sub-structures of the workflow. It is clear that users do not always create a new workflow from scratch, but most likely re-use old components and sub-workflows. Hence, knowledge of sub-structures is important information to characterize a workflow completely. The common approach to represent objects for a learning problem is to describe them as vectors in a feature space. However, when we handle objects that have important sub-structures, such as workflows, the design of a suitable feature space is not trivial. For this reason, we opt to follow a graph mining approach.
4.1 Frequent Subgraphs
Frequent subgraph discovery has received a lot of attention, since it has a wide range of application areas. Frequently occurring subgraphs in a large set of graphs can represent important motifs in the data. Given a set of graphs 𝒢, the support S(G) of a graph G is defined as the fraction of graphs in 𝒢 in which G occurs. The problem of finding frequent patterns is defined as follows: given a set of graphs 𝒢 and a minimum support S_min, we want to find all connected subgraphs that occur frequently enough (i.e. S(G) ≥ S_min) over the entire set of graphs. The output of the discovery process may contain a large number of such patterns.
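As a minimal illustration of the support definition (our own toy encoding, in which each workflow is reduced to the set of substructures it contains; real frequent-subgraph miners such as gSpan operate on the graphs directly):

```python
# Support = fraction of workflows in which a substructure occurs.
def frequent_patterns(workflows, s_min):
    counts = {}
    for substructures in workflows:
        for p in set(substructures):
            counts[p] = counts.get(p, 0) + 1
    n = len(workflows)
    return {p: c / n for p, c in counts.items() if c / n >= s_min}

wfs = [{"Blast->Parse", "Fetch->Blast"},
       {"Fetch->Blast", "Blast->Report"},
       {"Fetch->Blast"}]
print(frequent_patterns(wfs, s_min=0.6))   # {'Fetch->Blast': 1.0}
```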
4.2 Graph Kernels
Graph kernels, as originally proposed in [3,10], provide a general framework for handling graph data structures with kernel methods. Different approaches for defining graph kernels exist. A popular representation of graphs, used for example in protein modeling and drug screening, is based on cyclic patterns [9]. However, these kernels are not applicable to workflow data, as workflows are by definition acyclic (an edge between services A and B represents the relation "A must finish before B can start"). To adequately represent the decomposition of workflows into functional substructures, we follow a different approach: the set of graphs is searched for substructures (in this case paths) that occur in at least a given percentage (support) of all graphs. The feature vector is then composed of the weighted counts of such paths. The substructures are sequences of labeled vertices produced by graph traversal, and the length of a substructure is equal to the number of vertices in it. This family of kernels is called Label Sequence Kernels. The main differences among these kernels lie in how the graphs are traversed and how weights enter the kernel computation; according to the extracted substructures, they are kernels based on walks, trees or cycles. In our work we use the walk-based exponential kernels proposed by Gärtner et al. [3]. Since workflows are directed acyclic graphs, the hardness results of [3] no longer hold in our special case and we can actually enumerate all walks. This allows us to explicitly generate the feature space representation of the kernel by defining the attribute value for every substructure (walk). For each substructure s in the set of graphs, let k be the length of the substructure. Then the attribute λs is defined as

    λs = β^k / k!    (1)

if the graph contains the substructure s, and λs = 0 otherwise. Here β is a parameter that can be optimized, e.g. by cross-validation. A very important advantage of the graph kernel approach for the discovery task is that distinct substructures can provide insight into the specific behavior of the workflow.
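The explicit feature map of Eq. (1) can be sketched as follows; the graph encoding (successor lists plus node labels) and the function names are our own assumptions, not the authors' implementation:

```python
# Illustrative sketch of the explicit walk-based feature map of Eq. (1).
from math import factorial

def all_walks(successors, labels):
    """Enumerate every walk in a DAG as a tuple of node labels."""
    walks = set()

    def extend(node, walk):
        walk = walk + (labels[node],)
        walks.add(walk)
        for nxt in successors.get(node, []):
            extend(nxt, walk)

    for node in labels:
        extend(node, ())
    return walks

def feature_vector(successors, labels, vocabulary, beta=1.0):
    """Map one workflow graph to the weighted walk features lambda_s."""
    present = all_walks(successors, labels)
    return {s: (beta ** len(s)) / factorial(len(s)) if s in present else 0.0
            for s in vocabulary}

# Tiny example graph A -> B -> C; the kernel value between two workflows is
# the dot product of their feature vectors.
succ = {"A": ["B"], "B": ["C"], "C": []}
lab = {"A": "fetch_sequence", "B": "blast", "C": "parse_result"}
vocab = all_walks(succ, lab)
phi = feature_vector(succ, lab, vocab, beta=0.5)
```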
4.3 Graph Representation of Workflows
A workflow can be formalized as a directed acyclic labeled graph. The workflow graph has two kinds of nodes: regular nodes representing the computation operations, and nodes defining input/output data structures. A set of edges shows the information and control flow between the nodes. More formally, a workflow graph can be defined as a tuple W = (N, T), where:
– N = C ∪ I ∪ O is the set of nodes,
– C is a finite set of computation operations,
– I and O are finite sets of inputs and outputs, respectively,
– T ⊆ N × N is a finite set of transitions defining the control flow.
Labeled graphs contain an additional source of information, and there are several alternatives to obtain node labels. On the one hand, users often annotate single workflow components with a combination of words or abbreviations. On the other hand, each component within the workflow system has a signature and an identifier associated with it, e.g. in the WSDL format of a web service. User-created labels suffer from subjectivity and diversity, e.g. nodes representing the same computational operation can be labeled in very different ways. The first alternative again assumes some type of user input, so we opt to use the second alternative. An exemplary case where this choice makes a clear difference will be presented later in Section 5.2. Figure 1 shows an example of such a transformation for a Taverna workflow [12]: the left picture shows the user-annotated components, while the right picture presents the workflow graph at the next abstraction level. Obviously, the choice of the right abstraction level is crucial. In this paper, we use a handcrafted abstraction that was developed especially for the MyExperiment data. In general, the use of data mining ontologies [8] may be preferable.
Fig. 1. Transformation of Taverna workflow to the workflow graph.
Group  Size   Most frequent tags                              Description
1      30%    localworker, example, mygrid                    Workflows using local scripts
2      29%    bioinformatics, sequence, protein, BLAST,       Sequence similarity search using
              alignment, similarity, structure, search,       the BLAST algorithm
              retrieval
3      24%    benchmarks                                      Benchmark workflows
4      6.7%   AIDA, BioAID, text mining, bioassist, demo,     Text mining on biomedical texts using the
              biorange                                        AIDA toolbox and BioAID web services
5      6.3%   pathway, microarray, kegg                       Molecular pathway analysis using the Kyoto
                                                              Encyclopedia of Genes and Genomes (KEGG)

Table 1. Characterization of workflow groups derived by clustering.

5 Evaluation
In this section we illustrate the use of workflow structure, and of graph kernels in particular, for workflow discovery and pattern extraction. We evaluate the results on a real-world dataset of Taverna workflows. However, the same approach can be applied to other workflow systems, as long as the workflows can be transformed into a graph in a consistent way.

5.1 Dataset
For the purposes of this evaluation we used a corpus of 300 real-world bioinformatics workflows retrieved from myExperiment [13]. We chose to restrict ourselves to workflows created in the Taverna workbench [12] in order to simplify the handling of the workflow format. Since the application area of myExperiment is restricted to bioinformatics, it is likely that sets of similar workflows exist. The data contains no user feedback about the similarity of workflow pairs. Hence, we use semantic information to approximate workflow similarity: we make the assumption that workflows targeting the same tasks are similar, and under this assumption we use the cosine similarity of the vectors of tags assigned to the workflows as a proxy for the true similarity. An optimization over the number of clusters resulted in the five groups shown in Table 1. The tags indeed impose a clear structuring on the workflows, with few overlaps.
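The tag-based similarity proxy can be sketched as follows (our illustration; the exact tag preprocessing and clustering setup used in the paper may differ):

```python
# Sketch of the tag-based similarity proxy and clustering step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# One space-separated tag string per workflow (hypothetical examples).
workflow_tags = [
    "bioinformatics sequence BLAST alignment",
    "localworker example mygrid",
    "pathway microarray kegg",
]

X = CountVectorizer().fit_transform(workflow_tags)   # tag count vectors
S = cosine_similarity(X)                             # pairwise proxy similarity

# Cluster on the derived distance matrix; the number of clusters (five in the
# paper) would be chosen by optimizing a clustering criterion.
# Note: scikit-learn >= 1.2 uses `metric=`; older versions use `affinity=`.
clust = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                linkage="average")
labels = clust.fit_predict(1.0 - S)
```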
5.2 Workflow Recommendation
In this section, we address Question Q1: "How good are graph kernels at performing the task of workflow recommendation without explicit user input?" The goal is to retrieve workflows that are "close enough" to a user's context. To do this, we need to be able to compare the workflows available in existing VREs with the user's workflow. As a similarity measure we use the graph kernel from Section 4.2.
We compare our approach based on graph kernels to the following techniques representing the current state of the art [6]: matching of workflow graphs based on the size of the maximal common subgraph (MCS), and a method that considers a workflow as a bag of services. In addition to these techniques we also consider a standard text mining approach, whose main idea is to treat workflows simply as documents in XML format. The similarity of a workflow pair is then calculated as the cosine similarity between the respective word vectors. In our experiment we predict whether two workflows belong to the same cluster. Table 2 summarizes the average performance of a leave-one-out evaluation for the four approaches. It can be seen that graph kernels clearly outperform all other approaches in accuracy and recall. For precision, MCS performs best, however at the cost of a minimal recall. The precision of graph kernels ranks second and is close to the value of MCS.

Method            Accuracy      Precision     Recall
Graph Kernels     81.2 ± 10.0   71.9 ± 22.0   38.3 ± 21.1
MCS               73.9 ± 9.3    73.5 ± 24.7    4.8 ± 27.4
Bags of services  73.5 ± 10.3   15.5 ± 20.6    3.4 ± 30.1
Text Mining       77.8 ± 8.31   67.2 ± 21.5   31.2 ± 25.8

Table 2. Performance of workflow discovery.
We conclude that graph kernels are very promising for the task of workflow recommendation based only on graph structure, without explicit user input.

5.3 Workflow Tagging
We are now interested in Question Q2, the extraction of appropriate metadata from workflows. As a prototypical piece of metadata, we investigate user-defined tags. We selected 20 tags that occur in at least 3% of all workflows and use them as proxies for the real-world tasks that a workflow can perform. For each tag we would like to predict whether it describes a given workflow. To do that, we utilize graph kernels and test two algorithms: SVM and k-Nearest Neighbor. Table 3 shows the results of tag prediction evaluated by 2-fold cross-validation over the 20 keywords. It can be seen that an SVM with graph kernels can predict the selected tags with high AUC and precision, while a Nearest Neighbor approach using graph kernels to define the distance achieves a higher recall. We can conclude that the graph representation of a workflow contains enough information to predict appropriate metadata.
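A hedged sketch of such a tag-prediction experiment with a precomputed graph kernel, using synthetic stand-in data, could look as follows:

```python
# Sketch of tag prediction with a precomputed graph kernel; the kernel matrix
# K and the binary tag labels y are synthetic stand-ins here.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
Phi = rng.random((300, 50))            # stand-in for explicit walk features
K = Phi @ Phi.T                        # kernel matrix between all workflows
y = rng.integers(0, 2, size=300)       # does tag t describe workflow i?

# SVM on the precomputed kernel, 2-fold cross-validated AUC as in Table 3.
svm = SVC(kernel="precomputed")
auc = cross_val_score(svm, K, y, cv=2, scoring="roc_auc")
print(auc.mean())
# A nearest-neighbor baseline could analogously use a distance derived from K.
```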
5.4 Pattern extraction
Finally, we investigate question Q4, which deals with the more descriptive task of extracting meaningful patterns from sets of workflows that are helpful in the construction of new workflows.
Method             AUC           Precision     Recall
Nearest Neighbors  0.54 ± 0.18   0.51 ± 0.21   0.58 ± 0.19
SVM                0.85 ± 0.10   0.84 ± 0.24   0.38 ± 0.29

Table 3. Accuracy of workflow tagging based on graph kernels, averaged over all 20 tasks.
We address the issue of extracting patterns that are particularly important within a group of similar workflows in several steps. First, we use an SVM to build a classification model based on the graph kernels. This model separates the workflows belonging to a given group from the workflows of all other groups. Then we search for features with high weights, which the model considers important. We performed such pattern extraction for each workflow group in turn. A 10-fold cross-validation shows that this classification can be achieved with high accuracy, with values ranging between 81.3% and 94.7%, depending on the class. However, we are more interested in the most significant patterns, which we determine based on the weight assigned by the SVM (taking the standard deviation into account). Figure 2 shows an example of a workflow pattern and the same pattern inside a workflow in which it occurs. It was considered important for classifying workflows from group 2, which consists of workflows using the BLAST algorithm to calculate sequence similarity. The presented pattern is a sequence of components that are needed to run a BLAST service. This example shows that graph kernels can be used to extract useful patterns, which can then be recommended to the user during the creation of a new workflow.
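A minimal sketch of this weight-based pattern extraction, assuming the explicit walk features of Section 4.2 are available as a feature matrix (synthetic stand-ins below), is:

```python
# Sketch: train a linear SVM on explicit walk features and rank walks by weight.
import numpy as np
from sklearn.svm import LinearSVC

# X: rows are workflows, columns are walk features (lambda_s values);
# y: 1 if the workflow belongs to the target group, 0 otherwise.
rng = np.random.default_rng(0)
X = rng.random((300, 200))
y = rng.integers(0, 2, size=300)
walk_names = [f"walk_{i}" for i in range(X.shape[1])]   # hypothetical labels

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
weights = svm.coef_.ravel()
threshold = weights.mean() + weights.std()              # one possible cut-off
important = [walk_names[i] for i in np.argsort(weights)[::-1]
             if weights[i] > threshold]
print(important[:10])
```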
6 Conclusions
Workflow enacting systems have become a popular tool for the easy orchestration of complex data processing tasks. However, the design and management of workflows are complex tasks, and machine learning techniques have the potential to significantly simplify this work for the user. In this paper, we have discussed the usage of graph kernels for the analysis of workflow data. We argue that graph kernels are very useful in the practically important situation where no metadata is available. This is due to the fact that the graph kernel approach allows the decomposition of a workflow into its substructures to be taken into account, while allowing a flexible integration of the information contained in these substructures into several learning algorithms. We have evaluated the use of graph kernels in the fields of workflow similarity prediction, metadata extraction, and pattern extraction. A comparison of graph-based workflow analysis with metadata-based workflow analysis in the field of workflow quality modeling showed that metadata-based approaches outperform graph-based approaches in this application. However, it is important to recognize that the goal of the graph-based approach is not to replace the metadata-based approaches, but to serve as an extension when little or no metadata is available.
Fig. 2. Example of workflow graph.
The next step in our work will be to evaluate our approach in a more realistic scenario. Future research will investigate several alternatives for the creation of a workflow representation from a workflow graph in order to provide an appropriate representation at different levels of abstraction. One possibility is to obtain the labels of graph nodes from an ontology that describes the services and key components of a workflow, such as the one in [8].
References

1. Juan Carlos Corrales, Daniela Grigori, and Mokrane Bouzeghoub. BPEL processes matchmaking for service discovery. In Proc. CoopIS 2006, Lecture Notes in Computer Science 4275, pages 237–254. Springer, 2006.
2. M. Fraser. Virtual Research Environments: Overview and Activity. Ariadne, 2005.
3. Thomas Gaertner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, pages 129–143. Springer-Verlag, August 2003.
4. Antoon Goderis. Workflow re-use and discovery in bioinformatics. PhD thesis, School of Computer Science, The University of Manchester, 2008.
5. Antoon Goderis, Paul Fisher, Andrew Gibson, Franck Tanoh, Katy Wolstencroft, David De Roure, and Carole Goble. Benchmarking workflow discovery: a case study from bioinformatics. Concurr. Comput.: Pract. Exper., (16):2052–2069, 2009.
6. Antoon Goderis, Peter Li, and Carole Goble. Workflow discovery: the problem, a case study from e-science and a graph-based solution. In ICWS '06: Proceedings of the IEEE International Conference on Web Services, pages 312–319. IEEE Computer Society, 2006.
7. Antoon Goderis, Ulrike Sattler, Phillip Lord, and Carole Goble. Seven bottlenecks to workflow reuse and repurposing. The Semantic Web - ISWC 2005, pages 323–337, 2005.
8. Melanie Hilario, Alexandros Kalousis, Phong Nguyen, and Adam Woznica. A data mining ontology for algorithm selection and meta-learning. In Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, pages 76–87, 2009.
9. Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In KDD '04: Proc. of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167. ACM, 2004.
10. Hisashi Kashima and Teruo Koyanagi. Kernels for semi-structured data. In ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 291–298, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
11. Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, and Simon Fischer. Towards cooperative planning of data mining workflows. In Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, pages 1–12, September 2009.
12. T. Oinn, M.J. Addis, J. Ferris, D.J. Marvin, M. Senger, T. Carver, M. Greenwood, K. Glover, M.R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, June 2004.
13. David De Roure, Carole Goble, Jiten Bhagat, Don Cruickshank, Antoon Goderis, Danius Michaelides, and David Newman. myExperiment: Defining the social virtual research environment. In 4th IEEE International Conference on e-Science, pages 182–189. IEEE Press, December 2008.
14. Robert Stevens and David De Roure. The design and realisation of the myExperiment virtual research environment for social sharing of workflows. 2009.
15. Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. Workflows for e-Science: Scientific Workflows for Grids. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
16. Lucineia Thom, Cirano Iochpe, and Manfred Reichert. Workflow patterns for business process modeling. In Proc. of the CAiSE'06 Workshops - 8th Int'l Workshop on Business Process Modeling, Development, and Support (BPMDS'07), Trondheim, Norway, 2007.
17. W. M. P. Van Der Aalst, A. H. M. Ter Hofstede, B. Kiepuszewski, and A. P. Barros. Workflow patterns. Distrib. Parallel Databases, 14(1):5–51, 2003.
18. Stephen A. White. Business process trends. In Business Process Trends, 2004.
Re-using Data Mining Workflows

Stefan Rüping, Dennis Wegener, and Philipp Bremer

Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
http://www.iais.fraunhofer.de
Abstract. Setting up and reusing data mining processes is a complex task. Based on our experience from a project on the analysis of clinico-genomic data, we make the point that supporting setup and reuse by building large workflow repositories may not be realistic in practice. We describe an approach for automatically collecting workflow information and meta data, and introduce data mining patterns as an approach for formally describing the information necessary for workflow reuse.

Key words: Data Mining, Workflow Reuse, Data Mining Patterns
1 Introduction
Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. A workflow is basically a description of the order in which a set of services has to be called, and with which inputs, in order to solve a given task. Driven by specific applications, a large collection of workflow systems has been prototyped, such as Taverna1 or Triana2. The next generation of workflow systems is marked by workflow repositories such as MyExperiment.org, which tackle the problem of organizing workflows by offering the research community the possibility to publish, exchange and discuss individual workflows. However, the more powerful these environments become, the more important it is to guide the user in the complex task of constructing appropriate workflows. This is particularly true for workflows that encode data mining tasks, which are typically much more complex and change more frequently than workflows in business applications. In this paper, we are particularly interested in the question of reusing successful data mining applications. As the construction of a good data mining process invariably requires encoding a significant amount of domain knowledge, this is a process which cannot be fully automated. By reusing and adapting existing processes that have proven to be successful in practical use, we hope to be able to save much of this manual work in a new application and thereby increase the efficiency of setting up data mining workflows.
1 http://www.taverna.org.uk
2 http://www.trianacode.org
We report our experiences in designing a system targeted at supporting scientists, in this case bioinformaticians, with a workflow system for the analysis of clinico-genomic data. We will make the case that:
– For practical reasons it is already a difficult task to gather a non-trivial database of workflows which can form the basis of workflow reuse.
– In order to meaningfully reuse data mining workflows, a formal notation is needed that bridges the gap between a description of the workflows at the implementation level and a high-level textual description for the workflow designer.
The paper is structured as follows: In the next section, we introduce the ACGT project, in the context of which our work was developed. Section 3 describes an approach for automatically collecting workflow information and appropriate meta data. Section 4 presents data mining patterns which formally describe all information that is necessary for workflow reuse. Section 5 concludes.
2 The ACGT Project
The work in this paper is based on our experiences in the ACGT project3, which has the goal of implementing a secure, semantically enhanced end-to-end system in support of large multi-centric clinico-genomic trials. This means that it strives to integrate all steps from the collection and management of various kinds of data in a trial up to the statistical analysis by the researcher. In the current version, the various elements of the data mining environment can be integrated into complex analysis pipelines through the ACGT workflow editor and enactor. With respect to workflow reuse, we made the following experiences in setting up and running an initial version of the ACGT environment:
– The construction of data mining workflows is an inherently complex problem when it is based on input data with complex semantics, as is the case with clinical and genomic data.
– Because of the complex data dependencies, copy and paste is not an appropriate technique for workflow reuse.
– Standardization and reuse of approaches and algorithms works very well at the level of services, but not at the level of workflows. While it is relatively easy to select the right parameterization of a service, making the right connections and changes to a workflow template quickly becomes quite complex, so that users find it easier to construct a new workflow from scratch.
– Workflow reuse only occurs when the initial creator of a workflow describes the internal logic of the workflow in detail. However, most workflow creators avoid this effort because they simply want to "solve the task at hand".
In summary, the situation of having a large repository of workflows from which to choose the appropriate one, which is often assumed in existing approaches for workflow recommendation systems, may not be very realistic in practice.
3 http://eu-acgt.org
3 Collecting Workflow Information
To obtain information about the human creation of data mining workflows, it is necessary to design a system which collects realistic data mining workflows from the production cycle. We developed a system which collects data mining workflows based on plug-ins that were integrated into the data mining software used for production [1]. In particular, we developed plug-ins for Rapidminer, an open source data mining software, and Obwious, a self-developed text-mining tool. Every time the user executes a workflow, the workflow definition is sent to a repository and stored in a shared abstract representation. The shared abstract representation is necessary because we want to compare the different formats and types of workflows and to extract the interesting information from a wide range of workflows in order to obtain high diversity. As we want to observe not only the final version of a manually created workflow but also the whole chain of workflows created in the process of finding this final version, we also need a representation of this chain of workflows. We call the collection of connected workflows from the workflow life cycle that solve the same data mining problem on the same database a workflow design sequence. The shared abstract representation of the workflows is oriented solely on the CRISP phases and their common tasks, as described in [2]. Based on this we created the following seven classes: (1) data preparation: select data, (2) data preparation: clean data, (3) data preparation: construct data, (4) data preparation: integrate data, (5) data preparation: format data, (6) modeling, and (7) other. Of course, it would also be of interest to use more detailed structures, such as the data mining ontology presented in [3]. The operators of the data mining software that was used are classified using these classes and the workflows are transferred to the shared abstract representation. The abstract information itself records whether any operator of the first five classes (the operators performing data preparation tasks) is used in the workflow, and whether any changes to the operators themselves or to their parameter settings were made in comparison to the predecessor in the sequence. Furthermore, the representation notes which operators of the class Modeling are used and whether there are any changes to these operators or their parameter settings in comparison to the predecessor in the design sequence. An example of this representation is shown in Figure 1.
Fig. 1. Visualization of a workflow design sequence in the abstract representation
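As a purely hypothetical illustration of such a shared abstract representation (class, field and operator names are our own assumptions, not the system's actual data model):

```python
# Hypothetical sketch: tool operators are mapped to CRISP-based classes, and
# each workflow records what changed compared to its predecessor in the
# design sequence.
from dataclasses import dataclass, field

OPERATOR_CLASS = {                    # example mapping, our own assumption
    "ReadCSV": "data preparation: select data",
    "ReplaceMissingValues": "data preparation: clean data",
    "SVMLearner": "modeling",
}

@dataclass
class AbstractWorkflow:
    operators: dict = field(default_factory=dict)  # operator name -> parameters

    def used_classes(self):
        return {OPERATOR_CLASS.get(op, "other") for op in self.operators}

def changes(prev: "AbstractWorkflow", curr: "AbstractWorkflow") -> dict:
    """Operator and parameter changes w.r.t. the predecessor in the sequence."""
    return {
        "operator_change": set(prev.operators) != set(curr.operators),
        "parameter_change": any(
            prev.operators[op] != params
            for op, params in curr.operators.items() if op in prev.operators
        ),
    }

w1 = AbstractWorkflow({"ReadCSV": {"sep": ","}, "SVMLearner": {"C": 1.0}})
w2 = AbstractWorkflow({"ReadCSV": {"sep": ","}, "SVMLearner": {"C": 10.0}})
print(w1.used_classes(), changes(w1, w2))
```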
At the end of the first collection phase, which lasted six months, we had collected 2520 workflows in our database, created by 16 different users. These workflows were processed into 133 workflow design sequences. According to our assumption this would mean that there are about 33 real workflow design sequences in our database. There was an imbalance in the distribution of workflows and workflow design sequences over the two software sources: because of heavy usage and the early development state of Obwious, about 85% of the workflows and over 90% of the workflow design sequences were created using Rapidminer. Although the system needs to collect data for a much longer period, the derived data already contains some interesting information. Figure 2 shows that, in the workflow creation process, the adjustment and modification of the data preparation operators is as important as the adjustment and modification of the learner operators. This contradicts the common assumption that the focus should be set only on the modeling phase and the learner operators. The average length of a computed workflow design sequence is about 18 workflows. In summary, our study shows that a human workflow creator produces many rather similar workflows before arriving at the final version; these workflows differ mainly in the operators and parameters of the CRISP phases data preparation and modeling.

CRISP-phase        Change type          Absolute occurrences   Relative occurrences¹
Data preparation   Change               609                    24.17%
                   Parameter change     405                    16.07%
                   Sum of all changes   1014                   40.24%
Learner            Change               215                    8.53%
                   Parameter change     801                    31.79%
                   Sum of all changes   1016                   40.32%

¹ Relative to the absolute count of all 2520 workflows.

Fig. 2. Occurrences of changes in CRISP phases.
4 Data Mining Patterns
In the area of data mining there exist many scenarios where existing solutions are reusable, especially when no research on new algorithms is necessary. Many examples and ready-to-use algorithms are available as toolkits or services, which only have to be integrated. However, the reuse and integration of existing solutions is rarely, or only informally, done in practice due to a lack of formal support, which leads to a lot of unnecessary repetitive work. In the following we present our approach to the reuse of data mining workflows by formally encoding both the technical and the high-level semantics of these workflows. In this work, we aim at a formal representation of data mining processes to facilitate their semi-automatic reuse in business processes. As visualized in Fig. 3, the approach should bridge the gap between a high-level description of the process, as found in written documentation and scientific papers (which is too general to lead to an automation of work), and a fine-grained technical description in the form of an executable workflow (which is too specific to be re-used in slightly different cases).
Fig. 3. Different strategies of reusing data mining.
In [4] we presented a new process model for the easy reuse and integration of data mining in different business processes. The aim of this work was to allow for reusing existing data mining processes that have proven to be successful. Thus, we aimed at the development of a formal and concrete definition of the steps that are involved in the data mining process and of the steps that are necessary to reuse it in new business processes. In the following we briefly describe the steps that are necessary to allow for the reuse of data mining.

Our approach is based on CRISP [2]. The basic idea is that when a data mining solution is re-used, several parts of the CRISP process can be seen as pre-defined, and one only needs to execute those parts of CRISP in which the original and the re-used process differ. Hence, we define Data Mining Patterns to describe those parts that are pre-defined, and introduce a meta-process to model those steps of CRISP which need to be executed when re-using a pattern on a concrete data mining problem. Data Mining Patterns are defined such that the CRISP process (more correctly, those parts of CRISP that can be pre-defined) is the most general Data Mining Pattern, and such that we can derive a more specialized Data Mining Pattern from a more general one by replacing a task with a more specialized one (according to a hierarchy of tasks that we define).

CRISP is a standard process model for data mining which describes the life cycle of a data mining project in the following six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The CRISP model includes a four-level breakdown into phases, generic tasks, specialized tasks and process instances for specifying different levels of abstraction. In the end, data mining patterns correspond most closely to the process instance level of CRISP. In our approach we need to take into account that reuse may in some cases only be possible at a general or conceptual level. We allow for the specification of different levels of abstraction by the following hierarchy of tasks: conceptual (only a textual description is available), configurable (code is available but parameters need to be specified), and executable (code and parameters are specified).

The idea of our approach is to be able to describe all data mining processes. The description needs to be as detailed as is adequate for the given scenario. Thus, we consider the tasks of the CRISP process as the most general data mining pattern. Every concretion of this process for a specific application is also a data mining pattern. The generic CRISP tasks can be transformed into the following components:
– Check tasks in the pattern, e.g. checking if the data quality is acceptable;
– Configurable tasks in the pattern, e.g. setting a certain service parameter by hand;
– Executable tasks or gateways in the pattern, which can be executed without further specification;
– Tasks in the meta process that are independent of a certain pattern, e.g. checking if the business objective of the original data mining process and the new process are identical;
– Empty tasks, where the task is obsolete due to the pattern approach, e.g. producing a final report.
We defined a data mining pattern as follows: the pattern representing the extended CRISP model is a Data Mining Pattern, and each concretion of it according to the presented hierarchy is also a Data Mining Pattern.
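A hypothetical sketch of this task hierarchy and of deriving a more specialized pattern from a more general one (names and structure are our own illustration, not a formalism defined in the paper):

```python
# Hypothetical sketch of the conceptual -> configurable -> executable hierarchy.
from dataclasses import dataclass, field
from enum import Enum

class Level(Enum):
    CONCEPTUAL = 1    # only a textual description is available
    CONFIGURABLE = 2  # code exists, parameters still need to be specified
    EXECUTABLE = 3    # code and parameters are fully specified

@dataclass
class Task:
    name: str
    level: Level
    description: str = ""

@dataclass
class DataMiningPattern:
    tasks: list = field(default_factory=list)

    def specialize(self, task_name: str, specialized: Task) -> "DataMiningPattern":
        """Replace one task by a more specialized one, yielding a new pattern."""
        return DataMiningPattern(
            [specialized if t.name == task_name else t for t in self.tasks])

# The most general pattern: the pre-definable parts of the CRISP phases.
crisp = DataMiningPattern([Task("data preparation", Level.CONCEPTUAL),
                           Task("modeling", Level.CONCEPTUAL)])
reusable = crisp.specialize(
    "modeling", Task("modeling", Level.CONFIGURABLE, "SVM service, C not yet set"))
```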
5 Discussion and Future Work
Setting up and reusing data mining workflows is a complex task. When many dependencies on complex data exist, the situation found in workflow reuse is fundamentally different from the one found in reusing services. In this paper, we have given a short insight into the nature of this problem, based on our experience in a project dealing with the analysis of clinico-genomic data. We have proposed two approaches to improve the prospects for reusing workflows: the automated collection of a metadata-rich workflow repository, and the definition of data mining patterns to formally encode both the technical and the high-level semantics of workflows.
References

1. Bremer, P.: Erstellung einer Datenbasis von Workflowreihen aus realen Anwendungen (in German), Diploma Thesis, University of Bonn (2010)
2. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM consortium (2000)
3. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A Data Mining Ontology for Algorithm Selection and Meta-Learning. Proc. ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, pp. 76–87 (2009)
4. Wegener, D., Rüping, S.: On Reusing Data Mining in Business Processes - A Pattern-based Approach. BPM 2010 Workshops - Proceedings of the 1st International Workshop on Reuse in Business Process Management, Hoboken, NJ (2010)
Exposé: An Ontology for Data Mining Experiments

Joaquin Vanschoren1 and Larisa Soldatova2

1 Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium, [email protected]
2 Aberystwyth University, Llandinum Bldg, Penglais, SY23 3DB Aberystwyth, UK, [email protected]
Abstract. Research in machine learning and data mining can be sped up tremendously by moving empirical research results out of people's heads and labs, onto the network and into tools that help us structure and filter the information. This paper presents Exposé, an ontology to describe machine learning experiments in a standardized fashion and support a collaborative approach to the analysis of learning algorithms. Using a common vocabulary, data mining experiments and details of the used algorithms and datasets can be shared between individual researchers, software agents, and the community at large. It enables open repositories that collect and organize experiments by many researchers. As can be learned from recent developments in other sciences, such a free exchange and reuse of experiments requires a clear representation. We therefore focus on the design of an ontology to express and share experiment meta-data with the world.
1 Introduction
Research in machine learning is inherently empirical. Whether the goal is to develop better learning algorithms or to create appropriate data mining workflows for new sources of data, running the right experiments and correctly interpreting the results is crucial to build up a thorough understanding of learning processes. Running those experiments tends to be quite laborious. In the case of evaluating a new algorithm, pictured in Figure 1, one needs to search for datasets, preprocessing algorithms, (rival) learning algorithm implementations and scripts for algorithm performance estimation (e.g. cross-validation). Next, one needs to set up a wide range of experiments: datasets need to be preprocessed and algorithm parameters need to be varied, each of which requires much expertise. This easily amounts to a large range of experiments representing days, if not weeks, of work, while only averaged results will ever be published. Any other researcher willing to verify the published results or test additional hypotheses will have to start again from scratch, repeating the same experiments instead of simply reusing them.
Fig. 1. A typical experimental workflow in machine learning research.
1.1 Generalizability and Interpretability
Moreover, in order to ensure that results are generally valid, the empirical evaluation also needs to cover many different conditions. These include various parameter settings and various kinds of datasets, e.g. differing in size, skewness, noisiness, and various workflows of preprocessing techniques. Unfortunately, because of the amount of work involved in empirical evaluation, many studies will not explore these conditions thoroughly, limiting themselves to algorithm benchmarking. It has long been recognized that such studies are in fact only ‘case studies’ [1], and should be interpreted with caution. Sometimes, overly general conclusions can be drawn. In time series analysis research, many studies were shown to be biased toward the datasets being used, leading to contradictory results [16]. Moreover, it has been shown that the relative performance of learning algorithms depends heavily on the amount of sampled training data [23, 29], and is also easily dominated by the effect of parameter optimization and feature selection [14]. As such, there are good reasons to thoroughly explore different conditions, or at least to clearly state under which conditions certain conclusions may or may not hold. Otherwise, it is very hard for other researchers to correctly interpret the results, thus possibly creating a false sense of progress [11]: ...no method will be universally superior to other methods: relative superiority will depend on the type of data used in the comparisons, the particular data sets, the performance criterion and a host of other factors. [...] an apparent superiority in classification accuracy, obtained in laboratory conditions, may not translate to a superiority in real-world conditions...
1.2 A collaborative approach
In this paper, we advocate a much more dynamic, collaborative approach to experimentation, in which all experiment details can be freely shared in repositories (see the dashed arrow in Fig. 1), linked together with other studies, augmented with measurable properties of algorithms and datasets, and immediately reused by researchers all over the world. Any researcher creating empirical meta-data should thus be able to easily share it with others and in turn reuse any prior results of interest. Indeed, by reusing prior results we can avoid unnecessary repetition and speed up scientific research. This enables large-scale, very generalizable machine learning studies which would be prohibitively expensive to start from scratch. Moreover, by bringing the results of many studies together, we can obtain an increasingly detailed picture of learning algorithm behavior. If this meta-data is also properly organized, many questions about machine learning algorithms can be answered on the fly by simply writing a query to a database [29]. This also drastically facilitates meta-learning studies that analyze the stored empirical meta-data to find useful patterns in algorithm performance [28].

1.3 Ontologies
The use of such public experiment repositories is common practice in many other scientific disciplines. To streamline the sharing of experiment data, these disciplines created unambiguous description languages, based on a careful analysis of the concepts used within a domain and their relationships. This is formally represented in ontologies [5, 13]: machine-manipulable domain models in which each concept (class) is clearly described. They provide an unambiguous vocabulary that can be updated and extended by many researchers, thus harnessing the "collective intelligence" of the scientific community [10]. Moreover, they express scientific concepts and results in a formalized way that allows software agents to interpret them correctly, answer queries and automatically organize all results [25]. In this paper, we propose an ontology designed to adequately record machine learning experiments and workflows in a standardized fashion, so they can be shared, collected and reused. Section 2 first discusses the use of ontologies in other sciences to share experiment details and then covers previously proposed ontologies for data mining. Next, we present Exposé, a novel ontology for machine learning experimentation, in Section 3. Section 4 concludes.
2 Previous work

2.1 e-Sciences
Ontologies have proven very successful in bringing together the results of researchers all over the world. For instance, in astronomy, ontologies are used to build Virtual Observatories [7, 27], combining astronomical data from many different telescopes. Moreover, in bio-informatics, the Open Biomedical Ontology (OBO) Foundry3 defines a large set of consistent and complementary ontologies for various subfields, such as microarray data4, and genes and their products [2]. As such, they create an "open scientific culture where as much information as possible is moved out of people's heads and labs, onto the network and into tools that can help us structure and filter the information" [20]. Ironically, while machine learning and data mining have been very successful in speeding up scientific progress in these fields by discovering useful patterns in a myriad of collected experimental results, machine learning experiments themselves are currently not being documented and organized well enough to engender the same automatic discovery of insightful patterns that may speed up the design of new data mining algorithms or workflows.

2.2 Data mining ontologies
Recently, the design of ontologies for data mining has attracted quite a bit of attention, resulting in many ontologies for various goals.

OntoDM [22] is a general ontology for data mining with the aim of providing a unified framework for data mining research. It attempts to cover the full width of data mining research, containing high-level classes, such as data mining tasks and algorithms, and more specific classes related to certain subfields, such as constraints for constraint-based data mining.

EXPO [26] is a top-level ontology that models scientific experiments in general, so that empirical research can be uniformly expressed and automated. It covers classes such as hypotheses, (un)controlled variables, experimental designs and experimental equipment.

DAMON (DAta Mining ONtology) [4] is a taxonomy meant to offer domain experts a way to look up tasks, methods and software tools given a certain goal.

KDDONTO [8] is an OWL-DL ontology also built to discover suitable KD algorithms and to express workflows of KD processes. It covers the inputs and outputs of the algorithms and any pre- and postconditions for their use.

The KD ontology [31] describes planning-related information about datasets and KD algorithms. It is used in conjunction with an AI planning algorithm: pre- and postconditions of KD operators are converted into standard PDDL planning problems [18]. It is used in an extension of the Orange toolkit to automatically plan KD workflows [32].

The DMWF ontology [17] also describes all KD operators with their in- and outputs and pre- and postconditions, and is meant to be used in a KD support system that generates (partial) workflows, checks and repairs workflows built by users, and retrieves and adapts previous workflows.

DMOP, the Data Mining Ontology for Workflow Optimization [12], models the internal structure of learning algorithms and is explicitly designed to support algorithm selection. It covers classes such as the structure and parameters of predictive models, the involved cost functions and optimization strategies.

3 http://www.obofoundry.org/
4 http://www.mged.org/ontology
3 The Exposé ontology
In this section, we describe Exposé, an ontology for machine learning experimentation. It is meant to be used in conjunction with experiment databases (ExpDBs) [3, 29, 28]: databases designed to collect the details of these experiments and to intelligently organize them in online repositories, enabling fast and thorough analysis of a myriad of collected results. In this context, Exposé supports the accurate recording and exchange of data mining experiments and workflows. It has been 'translated' into an XML-based language, called ExpML, to describe experiment workflows and results in detail [30]. Moreover, it clearly defines the semantics of data mining experiments stored in the experiment database, so that a very wide range of questions on data mining algorithm performance can be answered through querying [29]. Many examples can be found in previous papers [29, 30]. Finally, although we currently use a relational database, Exposé will clearly be instrumental for RDF databases, allowing even more powerful queries. It thus supports reasoning with the data, meta-learning, data integration, and also enables logical consistency checks. For now, Exposé focuses on supervised classification on propositional datasets. It is also important to note that, while it has been influenced and adapted by many researchers, it is a straw-man proposal intended to instigate discussion and attract wider involvement from the data mining community. It is described in the OWL-DL ontology language [13] and can be downloaded from the experiment database website (http://expdb.cs.kuleuven.be). We first describe the design guidelines used to develop Exposé, then its top-level classes, and finally the parts covering experiments, experiment contexts, evaluation metrics, performance estimation techniques, datasets, and algorithms.
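As a hedged illustration (not part of the paper), a downloaded copy of the OWL file could be inspected with owlready2 as follows; the file path is a placeholder and the printed class names depend on the ontology version:

```python
# Sketch of inspecting an OWL-DL ontology such as Exposé with owlready2.
from owlready2 import get_ontology

ONTO_IRI = "file:///path/to/expose.owl"   # placeholder path to a local copy
onto = get_ontology(ONTO_IRI).load()

# List every class together with its named direct superclasses.
for cls in onto.classes():
    supers = [s.name for s in cls.is_a if hasattr(s, "name")]
    print(cls.name, "->", supers)
```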
3.1 Ontology design
In designing Exposé, we followed existing guidelines for ontology design [21, 15]:

Top-level ontologies. It is considered good practice to start from generally accepted classes and relationships (properties) [22]. We started from the Basic Formal Ontology (BFO)5, covering top-level scientific classes, and the OBO Relational Ontology (RO)6, offering a predefined set of properties.

Ontology reuse. If possible, other ontologies should be reused to build on prior knowledge and consensus. We directly reuse several general machine learning related classes from OntoDM [22], experimentation-related classes from EXPO [26], and classes related to internal algorithm mechanisms from DMOP [12]. We wish to integrate Exposé with existing ontologies, so that it will evolve with them as they are extended further.

Design patterns. Ontology design patterns7 are reusable, successful solutions to recurrent modeling problems. For instance, a learning algorithm can sometimes act as a base-learner for an ensemble learner. This is a case of an agent-role pattern, and a predefined property, 'realizes', is used to indicate which entities are able to fulfill a certain role.

Quality criteria. General criteria include clarity, consistency, extensibility and minimal commitment. These criteria are rather qualitative, and were only evaluated through discussions with other researchers.

5 http://www.ifomis.org/bfo
6 http://www.obofoundry.org/ro/
7 http://ontologydesignpatterns.org

Fig. 2. An overview of the top-level classes in the Exposé ontology.

3.2 Top-level View
Figure 2 shows the most important top-level classes and properties, many of which are inherited from the OntoDM ontology [22], which in turn reuses classes from OBI8 (i.e. planned process) and IAO9 (i.e. information content entity). The full arrows symbolize an 'is-a' property, meaning that the first class is a subclass of the second, and the dashed arrows symbolize other common properties. Double arrows indicate one-to-many properties; for instance, an algorithm application can have many parameter settings. The three most important categories of classes are information content entity, which covers datasets, models and abstract specifications of objects (e.g. algorithms), implementation, and planned process, a sequence of actions meant to achieve a certain goal. When describing experiments, this distinction is very important. For instance, the class 'C4.5' can mean the abstract algorithm, a specific implementation or an execution of that algorithm with specific parameter settings, and we want to distinguish between all three.

8 http://obi-ontology.org
9 http://code.google.com/p/information-artifact-ontology

Fig. 3. Experiments in the Exposé ontology.
As such, ambiguous classes such as 'learning algorithm' are broken up according to different interpretations (indicated by bold ellipses in Fig. 2): an abstract algorithm specification (e.g. in pseudo-code); a concrete algorithm implementation, i.e. code in a certain programming language with a version number; and a specific algorithm application, a deterministic function with fixed parameter settings, run on a specific machine with an actual input (a dataset) and output (a model), see also Fig. 3. The same distinction is used for other algorithms (for data preprocessing, evaluation or model refinement), mathematical functions (e.g. the kernel used in an SVM), and parameters, which can have different names in different implementations and different value settings in different applications. Algorithm and function applications are operators in a KD workflow, and can even be participants of another algorithm application (e.g. a kernel or a base-learner), i.e. they can be part of the inner workflow of an algorithm. Finally, there are also qualities, properties of a specific dataset or algorithm (see Figs. 6 and 7), and roles, indicating that an element assumes a (temporary) role in another process: an algorithm can act as a base-learner in an ensemble, a function can act as a distance function in a learning algorithm, and a dataset can be a training set in one experiment and a test set in the next.
3.3 Experiments
Figure 3 shows the ontological description of experiments, with the top-level classes from Fig. 2 drawn in filled double ellipses. Experiments are defined as workflows, which allows the description of many kinds of experiments. Some (composite) experiments can also consist of many smaller (singular) experiments, and can use a particular experiment design [19] to investigate the effects of various experimental variables, e.g. parameter settings.
Fig. 4. Workflow structure and an example experiment workflow.
- 38 -
Fig. 5. Learner evaluation measures in the Exposé ontology.
3.4 Experiment context
Although outside the scope of this paper, Exposé also models the context in which scientific investigations are conducted. Many of these classes are originally defined in the EXPO ontology [26]. They include authors, references to publications and the goal, hypotheses and conclusions of a study. It also defines (un)controlled or (in)dependent experimental variables, and various experimental designs [19] defining which values to assign to each of these variables.

3.5 Learner evaluation
To describe algorithm evaluations, Exposé currently covers 96 performance measures used in various learning tasks, some of which are shown in Fig. 5. In some tasks, all available data is used to build a model, and properties of that model are measured to evaluate it, e.g. the inter-cluster similarity in clustering. In binary classification, the predictions of the models are used, e.g. predictive accuracy, precision and recall. In multi-class problems, the same measures can be used by transforming the multi-class prediction into c binary predictions, and averaging the results over all classes, weighted by the number of examples in each class. Regression measures, e.g. the root mean squared error (RMSE), can also be used in classification by taking the difference between the actual and predicted class probabilities. Finally, graphical evaluation measures, such as precision-recall curves, ROC curves or cost curves, provide a much more detailed evaluation. Many definitions of these metrics exist, so it is important to define them clearly. Although not shown here, Exposé also covers several performance estimation algorithms, such as k-fold or 5x2 cross-validation, and statistical significance tests, such as the paired t-test (by resampling, 10-fold cross-validation or 5x2 cross-validation) [9] or tests on multiple datasets [6].

Fig. 6. Datasets in the Exposé ontology.
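A few of the measures named above can be computed as in the following sketch (ours), assuming class predictions and predicted probabilities are given:

```python
# Sketch of computing some evaluation measures with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, mean_squared_error)

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.3])    # predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)

print("predictive accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("AUROC:", roc_auc_score(y_true, y_prob))
# RMSE on class probabilities, as described above for regression measures.
print("RMSE:", mean_squared_error(y_true, y_prob) ** 0.5)
```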
3.6 Datasets
Figure 6 shows the most important classes used to describe datasets.

Specification. The data specification (in the top part of Fig. 6) describes the structure of a dataset. Some subclasses are graphs, sequences and sets of instances. The latter can have instances of various types, e.g. tuples, in which case it has a number of data features and data instances. For other types of data this specification will have to be extended. Finally, a dataset has descriptions, such as its name, version and download url, to make it easily retrievable.

Roles. A specific dataset can play different roles in different experiments (top of Fig. 6). For instance, it can be a training set in one evaluation and a test set in the next.

Data properties. As said before, we wish to link all empirical results to theoretical metadata, called properties, about the underlying datasets in order to perform meta-learning studies. These data properties are shown in the bottom half of Fig. 6, and may concern individual instances, individual features or the entire dataset. We define both feature properties, such as feature skewness or mutual information with the target feature, and general dataset properties, such as the number of attributes and landmarkers [24].
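A small sketch (ours) of computing some of these dataset properties, including a simple landmarker:

```python
# Sketch of simple, statistical, information-theoretic properties and a landmarker.
import numpy as np
from scipy.stats import skew, entropy
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

n_instances, n_features = X.shape                 # simple dataset properties
feature_skewness = skew(X, axis=0)                # statistical feature property
_, counts = np.unique(y, return_counts=True)
class_entropy = entropy(counts, base=2)           # information-theoretic property
# A landmarker: cross-validated accuracy of a fast, simple learner.
nb_landmarker = cross_val_score(GaussianNB(), X, y, cv=10).mean()
print(n_instances, n_features, class_entropy, nb_landmarker)
```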
3.7 Algorithms
Algorithms can perform very differently under different configurations and parameter settings, so we need a detailed vocabulary to describe them. Figure 7 shows how algorithms and their configurations are expressed in our ontology. From top to bottom, it shows a taxonomy of different types of algorithms, the different internal operators they use (e.g. kernel functions), the definition of algorithm implementations and applications (see Sect. 3.2) and algorithm properties (only two are shown).

Algorithm implementations. Algorithm implementations are described with all information needed to retrieve and use them, such as their name, version, url, and the library they belong to (if any). Moreover, they have implementations of algorithm parameters and can have qualities, e.g. their susceptibility to noise.

Algorithm composition. Some algorithms use other algorithms or mathematical functions, which can often be selected (or plugged in) by the user. These include base-learners in ensemble learners, distance functions in clustering and nearest neighbor algorithms, and kernels in kernel-based learning algorithms. Some algorithm implementations also use internal data processing algorithms, e.g. to remove missing values. In Exposé, any operator can be a participant of an algorithm application, combined in internal workflows with in- and outputs. Depending on the algorithm, operators can fulfill (realize) certain predefined roles (center of Fig. 7).

Algorithm mechanisms. Finally, to understand the performance differences between different types of algorithms, we need to look at the internal learning mechanisms on which they are built. These include the kind of models that are built (e.g. decision trees), how these models are optimized (e.g. the heuristic used, such as information gain) and the decision boundaries that are generated (e.g. axis-parallel, piecewise linear ones in the case of non-oblique decision trees). These classes, which extend the algorithm definitions through specific properties (e.g. has model structure), are defined in the DMOP ontology [12], so they won't be repeated here.
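As a hedged illustration of algorithm composition, the following scikit-learn sketch plugs one learner into another as a base-learner (this is our example, not an Exposé artefact):

```python
# Sketch: an ensemble algorithm application with a base-learner operator as
# participant (the "base learner" role in Fig. 7).
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Note: older scikit-learn versions use `base_estimator=` instead of `estimator=`.
ensemble = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=25, random_state=0)
print(cross_val_score(ensemble, X, y, cv=10).mean())
```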
Fig. 7. Algorithms and their configurations in the Exposé ontology. (Only the caption is reproduced here; the figure depicts, from top to bottom, the algorithm taxonomy, internal operator roles, algorithm implementations and applications, parameters, and algorithm properties.)

4 Conclusions
We have presented Exposé, an ontology for data mining experiments. It is complementary to other data mining ontologies such as OntoDM [22], EXPO [26], and DMOP [12], and covers data mining experiments in fine detail, including the experiment context, evaluation metrics, performance estimation techniques, datasets, and algorithms. It is used in conjunction with experiment databases (ExpDBs) [3, 29, 28] to engender a collaborative approach to empirical data mining research, in which experiment details can be freely shared in repositories, linked together with other studies, and immediately reused by researchers all over the world. Many illustrations of the use of Exposé to share, collect and query experimental meta-data can be found in prior work [3, 29, 30].
References
1. Aha, D.: Generalizing from case studies: A case study. Proceedings of the Ninth International Conference on Machine Learning pp. 1–10 (1992)
2. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000)
3. Blockeel, H., Vanschoren, J.: Experiment databases: Towards an improved experimental methodology in machine learning. Lecture Notes in Computer Science 4702, 6–17 (2007)
4. Cannataro, M., Comito, C.: A data mining ontology for grid programming. First International Workshop on Semantics in Peer-to-Peer and Grid Computing at WWW 2003 pp. 113–134 (2003)
5. Chandrasekaran, B., Josephson, J.: What are ontologies, and why do we need them? IEEE Intelligent Systems 14(1), 20–26 (1999)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
7. Derriere, S., Preite-Martinez, A., Richard, A.: UCDs and ontologies. ASP Conference Series 351, 449 (2006)
8. Diamantini, C., Potena, D., Storti, E.: KDDONTO: An ontology for discovery and composition of KDD algorithms. Proceedings of the 3rd Generation Data Mining Workshop at the 2009 European Conference on Machine Learning (2009)
9. Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998)
10. Goble, C., Corcho, O., Alper, P., Roure, D.D.: e-Science and the Semantic Web: A symbiotic relationship. Lecture Notes in Computer Science 4265, 1–12 (2006)
11. Hand, D.: Classifier technology and the illusion of progress. Statistical Science 21(1), 1–14 (2006)
12. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A data mining ontology for algorithm selection and meta-mining. Proceedings of the ECML/PKDD09 Workshop on 3rd Generation Data Mining (SoKD-09) pp. 76–87 (2009)
13. Horridge, M., Knublauch, H., Rector, A., Stevens, R., Wroe, C.: A practical guide to building OWL ontologies using Protégé 4 and CO-ODE tools. The University of Manchester (2009)
14. Hoste, V., Daelemans, W.: Comparing learning approaches to coreference resolution. There is more to it than bias. Proceedings of the Workshop on Meta-Learning (ICML-2005) pp. 20–27 (2005)
15. Karapiperis, S., Apostolou, D.: Consensus building in collaborative ontology engineering processes. Journal of Universal Knowledge Management 1(3), 199–216 (2006)
16. Keogh, E., Kasetty, S.: On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery 7(4), 349–371 (2003)
17. Kietz, J., Serban, F., Bernstein, A., Fischer, S.: Towards cooperative planning of data mining workflows. Proceedings of the Third Generation Data Mining Workshop at the 2009 European Conference on Machine Learning (ECML 2009) pp. 1–12 (2009)
18. Klusch, M., Gerber, A., Schmidt, M.: Semantic web service composition planning with OWLS-Xplan. Proceedings of the First International AAAI Fall Symposium on Agents and the Semantic Web (2005)
19. Kuehl, R.: Design of experiments: Statistical principles of research design and analysis. Duxbury Press (1999)
20. Nielsen, M.: The future of science: Building a better collective memory. APS Physics 17(10) (2008)
21. Noy, N., McGuinness, D.: Ontology development 101: A guide to creating your first ontology. Stanford University (2002)
22. Panov, P., Soldatova, L., Džeroski, S.: Towards an ontology of data mining investigations. Lecture Notes in Artificial Intelligence 5808, 257–271 (2009)
23. Perlich, C., Provost, F., Simonoff, J.: Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research 4, 211–255 (2003)
24. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.: Meta-learning by landmarking various learning algorithms. Proceedings of the Seventeenth International Conference on Machine Learning pp. 743–750 (2000)
25. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. Third International Workshop on OWL Experiences and Directions (OWLED 2007) (2007)
26. Soldatova, L., King, R.: An ontology of scientific experiments. Journal of the Royal Society Interface 3(11), 795–803 (2006)
27. Szalay, A., Gray, J.: The world-wide telescope. Science 293, 2037–2040 (2001)
28. Vanschoren, J., Blockeel, H., Pfahringer, B., Holmes, G.: Experiment databases: Creating a new platform for meta-learning research. Proceedings of the ICML/UAI/COLT Joint Planning to Learn Workshop (PlanLearn-08) pp. 10–15 (2008)
29. Vanschoren, J., Pfahringer, B., Holmes, G.: Learning from the past with experiment databases. Lecture Notes in Artificial Intelligence 5351, 485–492 (2008)
30. Vanschoren, J., Soldatova, L.: Collaborative meta-learning. Proceedings of the Third Planning to Learn Workshop at the 19th European Conference on Artificial Intelligence (2010)
31. Žáková, M., Křemen, P., Železný, F., Lavrač, N.: Planning to learn with a knowledge discovery ontology. Second Planning to Learn Workshop at the joint ICML/COLT/UAI Conference pp. 29–34 (2008)
32. Žáková, M., Podpečan, V., Železný, F., Lavrač, N.: Advancing data mining workflow construction: A framework and cases using the Orange toolkit. Proceedings of the SoKD-09 International Workshop on Third Generation Data Mining at ECML PKDD 2009 pp. 39–51 (2009)
Foundations of frequent concept mining with formal ontologies

Agnieszka Lawrynowicz

Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2, 60-965 Poznan, Poland
[email protected]
1 Introduction
With the increased availability of information published using standard Semantic Web languages, new approaches are needed to mine this growing resource of data. Since Semantic Web data is relational in nature, there has recently been a growing number of proposals adapting methods of Inductive Logic Programming (ILP) [1] to Semantic Web knowledge representation formalisms, most notably the Web Ontology Language OWL1 (grounded in description logics (DLs) [2]). One of the fundamental data mining tasks is the discovery of frequent patterns. Within the setting of ILP, frequent pattern mining was initially investigated for Datalog, in systems such as WARMR [3], FARMER [4] or c-armr [5]. Recent proposals have extended the scope of relational frequent pattern mining to operate on description logics or hybrid languages (combining Datalog with DL, or DL with some form of rules); examples are the system SPADA [6] and the approaches proposed in [7] and [8]. However, none of the current approaches that use DLs to mine frequent patterns target a peculiarity of the DL formalism, namely its variable-free notation, in representing patterns. This paper aims to fill this gap. The main contributions of the paper are summarized as follows: (a) a novel setting for the task of frequent pattern mining is introduced, coined frequent concept mining, in which patterns are (complex) concepts expressed in description logics (corresponding to OWL classes); (b) basic building blocks for this new setting are provided, such as a generality measure and a refinement operator.
2 Preliminaries
2.1 Representation and Inference
Description logics [2] are a family of knowledge representation languages (equipped with a model-theoretic semantics and reasoning services) that have been adopted as the theoretical foundation of the OWL language. The basic elements in DLs are atomic concepts (denoted by A) and atomic roles (denoted by R, S). Complex descriptions (denoted by C and D) are inductively built by using concept and role
1 http://www.w3.org/TR/owl-features
Table 1: Syntax and semantics of example DL constructors.

Constructor                       Syntax            Semantics
Universal concept                 ⊤                 ∆^I
Bottom concept                    ⊥                 ∅
Negation of arbitrary concepts    ¬C                ∆^I \ C^I
Intersection                      C ⊓ D             C^I ∩ D^I
Union                             C ⊔ D             C^I ∪ D^I
Value restriction                 ∀R.C              {a ∈ ∆^I | ∀b. (a, b) ∈ R^I → b ∈ C^I}
Full existential quantification   ∃R.C              {a ∈ ∆^I | ∃b. (a, b) ∈ R^I ∧ b ∈ C^I}
Datatype exists                   ∃T.u              {a ∈ ∆^I | ∃t. (a, t) ∈ T^I ∧ t ∈ u^D}
Nominals                          {a1, ..., an}     {a1^I, ..., an^I}
constructors. Semantics is defined by interpretations I = (∆^I, ·^I), where the non-empty set ∆^I is the domain of the interpretation and ·^I is an interpretation function which assigns to every atomic concept A a set A^I ⊆ ∆^I, and to every atomic role R a binary relation R^I ⊆ ∆^I × ∆^I. The interpretation function is extended to complex concept descriptions by the inductive definition presented in Tab. 1. A DL knowledge base, KB, is formally defined as KB = (T, A), where T is called a TBox and contains axioms describing how concepts and roles are related to each other, and A is called an ABox and contains assertions about individuals, such as C(a) (the individual a is an instance of the concept C) and R(a, b) (a is R-related to b). Moreover, DLs may also support reasoning with concrete datatypes such as strings or integers. A concrete domain D consists of a set ∆_D, the domain of D, and a set pred(D), the predicate names of D. Each predicate name P is associated with an arity n and an n-ary predicate P^D ⊆ (∆_D)^n. The abstract domain ∆^I and the concrete domain ∆_D are disjoint. A concrete role T is interpreted as a binary relation T^I ⊆ ∆^I × ∆_D. Example 1 provides a sample DL knowledge base that represents a part of the data mining domain, intended to be used in meta-mining, e.g. for algorithm selection (the example is based on the Data Mining Optimization ontology (DMOP) [9]).

Example 1 (Description logic KB).
T = { RecursivePartitioningAlgorithm ⊑ ClassificationAlgorithm, C4.5-Algorithm ⊑ RecursivePartitioningAlgorithm, BayesianAlgorithm ⊑ ClassificationAlgorithm, NaiveBayesAlgorithm ⊑ BayesianAlgorithm, NaiveBayesNormalAlgorithm ⊑ NaiveBayesAlgorithm, OperatorExecution ⊑ ∃executes.Operator, Operator ⊑ ∃implements.Algorithm, ⊤ ⊑ ∀hasInput⁻.OperatorExecution, ⊤ ⊑ ∀hasInput.(Data ⊔ Model), DataSet ⊑ Data }.
A = { OperatorExecution(Weka NaiveBayes–OpEx01), Operator(Weka NaiveBayes), executes(Weka NaiveBayes–OpEx01, Weka NaiveBayes), implements(Weka NaiveBayes, NaiveBayesNormal), NaiveBayesNormalAlgorithm(NaiveBayesNormal), hasInput(Weka NaiveBayes–OpEx01, Iris–DataSet), DataSet(Iris–DataSet),
OperatorExecution(Weka NaiveBayes–OpEx02), executes(Weka NaiveBayes–OpEx02, Weka NaiveBayes), hasParameterSetting(Weka NaiveBayes–OpEx02, Weka–NaiveBayes–OpEx02–D), OpParameterSetting(Weka–NaiveBayes–OpEx02–D), setsValueOf(Weka–NaiveBayes–OpEx02–D,Weka NaiveBayes–D), hasValue(Weka–NaiveBayes–OpEx02–D,false), hasParameterSetting(Weka NaiveBayes–OpEx02, Weka–NaiveBayes–OpEx02–K), OpParameterSetting(Weka–NaiveBayes–OpEx02–K), setsValueOf(Weka–NaiveBayes–OpEx02–K,Weka NaiveBayes–K), hasValue(Weka–NaiveBayes–OpEx02–K,false), OperatorExecution(Weka–J48–OpEx01),Operator(Weka J48), executes(Weka–J48–OpEx01, Weka J48), implements(Weka J48, C4.5), C4.5-Algorithm(C4.5) }.
The inference services, further referred to in the paper, are subsumption and retrieval. Given two concept descriptions C and D in a TBox T, C subsumes D (denoted by D ⊑ C) if and only if, for every interpretation I of T, it holds that D^I ⊆ C^I. C is equivalent to D (denoted by C ≡ D) if and only if C ⊑ D and D ⊑ C. The retrieval problem is, given an ABox A and a concept C, to find all individuals a such that A |= C(a).

2.2 Refinement operators for DL
Learning in DLs can be seen as a search in the space of concepts. In ILP it is common to impose an ordering on this search space and to apply refinement operators to traverse it [1]. Downward refinement operators construct specialisations of hypotheses (concepts, in this context). Let (S, ⪯) be a quasi-ordered space. Then a downward refinement operator ρ is a mapping from S to 2^S such that for any C ∈ S, C′ ∈ ρ(C) implies C′ ⪯ C. C′ is called a specialisation of C. For searching the space of DL concepts, a natural quasi-order is subsumption: if C subsumes D (D ⊑ C), then C covers all instances that are covered by D. In this work, subsumption is assumed as the generality measure between concepts. Further details concerning refinement operators proposed for description logics may be found in [10–13].
3 Frequent concept mining
In this section, the task of frequent concept mining is formally introduced.

3.1 The task
The definition of the task of frequent pattern discovery requires a specification of what is counted to calculate the pattern support. In the setting proposed in this paper, the support of a concept C is calculated relative to the number of instances of a user-specified concept of reference, the reference concept Ĉ, from which the search procedure starts (and which is being specialized).

Definition 1 (Support). Let C be a concept expressed using predicates from a DL knowledge base KB = (T, A), let memberset(C, KB) be a function that returns the set of all individuals a such that A |= C(a), and let Ĉ denote a reference concept, where C ⊑ Ĉ. The support of a pattern C with respect to the knowledge base KB is defined as the ratio between the number of instances of the concept C and the number of instances of the reference concept Ĉ in KB:

support(C, KB) = |memberset(C, KB)| / |memberset(Ĉ, KB)|
Having defined the support, it is now possible to formulate a definition of frequent concept discovery.

Definition 2 (Frequent concept discovery). Given
– a knowledge base KB represented in description logic,
– a set of patterns in the form of a concept C, where each C is subsumed by a reference concept Ĉ (C ⊑ Ĉ),
– a minimum support threshold minsup specified by the user,
and assuming that patterns with support s are frequent in KB if s ≥ minsup, the task of frequent pattern discovery is to find the set of frequent patterns.

Example 2. Let us consider the knowledge base KB from Example 1, and let us assume that Ĉ = OperatorExecution (in general, Ĉ can also be a complex concept, not necessarily a primitive one). There are 3 instances of Ĉ in KB. The following example patterns, refinements of OperatorExecution, could be generated:

C1 = OperatorExecution ⊓ ∃executes.Operator
C2 = OperatorExecution ⊓ ∃executes.(Operator ⊓ ∃implements.ClassificationAlgorithm)
C3 = OperatorExecution ⊓ ∃executes.(Operator ⊓ ∃implements.RecursivePartitioningAlgorithm)
C4 = OperatorExecution ⊓ ∃executes.{Weka NaiveBayes}
C5 = OperatorExecution ⊓ ∃hasParameterSetting.(OpParameterSetting ⊓ ∃setsValueOf.{Weka NaiveBayes–K} ⊓ ∃hasValue.false)
C6 = OperatorExecution ⊓ ∃hasInput.Data
The support values of the above patterns are as follows: s(C1) = 3/3, s(C2) = 3/3, s(C3) = 1/3, s(C4) = 2/3, s(C5) = 1/3, s(C6) = 1/3.
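A minimal sketch (ours, not from the paper) of how some of these supports can be recomputed: the relevant fragment of the ABox from Example 1 is encoded as plain Python dictionaries, the asserted class hierarchy is flattened by hand, and identifiers are simplified, so no DL reasoner is involved.

```python
# Toy, propositionalized fragment of the ABox from Example 1 (no DL reasoning involved).
operator_executions = {"OpEx01", "OpEx02", "J48-OpEx01"}
executes = {"OpEx01": "Weka_NaiveBayes", "OpEx02": "Weka_NaiveBayes", "J48-OpEx01": "Weka_J48"}
implements = {"Weka_NaiveBayes": "NaiveBayesNormalAlgorithm", "Weka_J48": "C4.5-Algorithm"}
# Asserted subclass chains, flattened into sets of superclasses.
superclasses = {
    "NaiveBayesNormalAlgorithm": {"NaiveBayesAlgorithm", "BayesianAlgorithm", "ClassificationAlgorithm"},
    "C4.5-Algorithm": {"RecursivePartitioningAlgorithm", "ClassificationAlgorithm"},
}

def support(member_test, reference=operator_executions):
    """support(C) = |instances of C| / |instances of the reference concept|."""
    return sum(1 for x in reference if member_test(x)) / len(reference)

c1 = lambda x: x in executes                                          # ∃executes.Operator
c3 = lambda x: "RecursivePartitioningAlgorithm" in superclasses[implements[executes[x]]]
c4 = lambda x: executes.get(x) == "Weka_NaiveBayes"                   # ∃executes.{Weka NaiveBayes}

print(support(c1), support(c3), support(c4))   # expected: 1.0, 0.333..., 0.666...
```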
3.2 Refinement operator
Depending on the language used, the number of specializations of a concept (ordered by subsumption) may be infinite. There is also a trade-off between the level of completeness of a refinement operator and its efficiency. Below, a refinement operator is introduced (inspired by [12]) that can generate the concepts listed in Example 2 and, as such, exploits the features of the DMOP ontology (which provides an intended use case for the presented approach).

Definition 3 (Downward refinement operator ρ). ρ = (ρ⊔, ρ⊓), where:
[ρ⊔] given a description in normal form D = D1 ⊔ ... ⊔ Dn:
(a) D′ ∈ ρ⊔(D) if D′ = D1 ⊔ ... ⊔ Di−1 ⊔ Di+1 ⊔ ... ⊔ Dn for some i, 1 ≤ i ≤ n,
(b) D′ ∈ ρ⊔(D) if D′ = D1 ⊔ ... ⊔ D′i ⊔ ... ⊔ Dn for some D′i ∈ ρ⊓(Di).
[ρ⊓] given a conjunctive description C = C1 ⊓ ... ⊓ Cm:
(a) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 is a primitive concept and KB |= Cm+1 ⊑ C,
(b) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = ∃R.Dm+1,
(c) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = ∃T.um+1,
(d) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = {a} and KB |= C(a),
(e) C′ ∈ ρ⊓(C) if C′ is obtained from C by replacing a conjunct Cj = ∃R.Dj, for some j ∈ {1, ..., m}, with C′j = ∃R′.Dj, where R′ ⊑ R,
(f) C′ ∈ ρ⊓(C) if C′ is obtained from C by replacing a conjunct Cj = ∃R.Dj, for some j ∈ {1, ..., m}, with C′j = ∃R.D′j, where D′j ∈ ρ⊔(Dj).
ρ⊔ either (a) drops one top-level disjunct or (b) replaces it with a downward refinement obtained with ρ⊓. ρ⊓ adds a new conjunct in the form of (a) an atomic description that is a subconcept of the refined concept, (b) an existential restriction involving an abstract role, (c) an existential restriction involving a concrete role, or (d) a nominal that is an instance of the refined concept; alternatively, it (e) replaces one conjunct with a refinement obtained by replacing the role in an existential restriction by one of its subroles, or (f) replaces one conjunct with a refinement obtained by specializing the concept in the range of an existential restriction with ρ⊔.

Reasoning in DLs is based on the open world assumption (OWA), which behaves differently from the closed world assumption (CWA) usually applied in data mining. For this reason, the proposed operator does not specialize concepts through the ∀ quantifier. Under the OWA, even if every instance in the KB of interest possessed a certain property, a reasoner could not deduce this, since it always assumes incomplete knowledge and the possible existence of a counterexample. This could be addressed, for example, by introducing an epistemic operator [2], but such a refinement rule could be costly.

The use of an expressive pattern language, together with the OWA (which places fewer constraints on the generated patterns), may result in a large pattern search space. Thus, further steps are necessary to prune the space explored by the operator. In ILP this is usually done by introducing a declarative bias (restrictions on the depth, width or language of patterns). One of the common problems in performing data mining with DLs is the usual lack of disjointness constraints, which results, e.g., in a huge number of concepts being tested as fillers of a given role. Hence, in addition to restrictions on concept depth and width, a declarative bias should make it possible to restrict the language of patterns beyond the constraints imposed by the DL axioms (e.g. to restrict the list of fillers of a particular role).
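As an illustration only, the sketch below implements a heavily simplified version of two of the conjunctive refinement rules, (a) and (b) of ρ⊓: concepts are plain frozensets of conjuncts, subsumption is replaced by a hand-coded hierarchy, and rule (a) adds subconcepts of an existing conjunct rather than of the whole concept. It is meant to convey the flavour of the operator, not its exact definition.

```python
# Toy sketch of conjunctive refinement rules (a) and (b) from Definition 3.
# Concepts are frozensets of conjuncts; an existential restriction is a tuple
# ("exists", role, filler). Subsumption is approximated by an asserted hierarchy.

subconcepts = {                       # direct subconcepts, standing in for KB |= C' ⊑ C
    "OperatorExecution": [],
    "Algorithm": ["ClassificationAlgorithm"],
    "ClassificationAlgorithm": ["RecursivePartitioningAlgorithm", "BayesianAlgorithm"],
}
roles = {"OperatorExecution": ["executes", "hasInput", "hasParameterSetting"]}

def rho_and(concept):
    """Return the one-step downward refinements of a conjunction of conjuncts."""
    refinements = []
    for conjunct in concept:
        if isinstance(conjunct, str):
            # rule (a): add a primitive subconcept of an existing conjunct
            for sub in subconcepts.get(conjunct, []):
                refinements.append(concept | {sub})
            # rule (b): add an existential restriction on a role applicable to the conjunct
            for role in roles.get(conjunct, []):
                refinements.append(concept | {("exists", role, "Thing")})
    return [r for r in refinements if r != concept]

start = frozenset({"OperatorExecution"})
for refined in rho_and(start):
    print(sorted(map(str, refined)))
```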
4 Conclusions and Future Work
To the best of our knowledge, this is the first proposal for mining frequent patterns expressed as concepts represented in description logics. The paper lays the foundations for this task and proposes first steps towards a solution. Future work will investigate a suitable declarative bias for the proposed setting and will devise an efficient algorithm, most likely employing parallelization. The primary motivation of this work is the future application of the proposed frequent concept mining in real-life scenarios, e.g. for ontology-based meta-learning.
References
1. Nienhuys-Cheng, S., de Wolf, R.: Foundations of Inductive Logic Programming. Volume 1228 of LNAI. Springer (1997)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P., eds.: The Description Logic Handbook. Cambridge University Press (2003)
3. Dehaspe, L., Toivonen, H.: Discovery of frequent Datalog patterns. Data Mining and Knowledge Discovery 3(1) (1999) 7–36
4. Nijssen, S., Kok, J.: Faster association rules for multiple relations. In: Proc. of the 17th Int. Joint Conference on Artificial Intelligence (IJCAI 2001). (2001) 891–897
5. de Raedt, L., Ramon, J.: Condensed representations for inductive logic programming. In: Proc. of the Ninth International Conference on Principles of Knowledge Representation and Reasoning (KR 2004). (2004) 438–446
6. Lisi, F., Malerba, D.: Inducing multi-level association rules from multiple relations. Machine Learning Journal 55(2) (2004) 175–210
7. Žáková, M., Železný, F., Garcia-Sedano, J.A., Tissot, C.M., Lavrač, N., Křemen, P., Molina, J.: Relational data mining applied to virtual engineering of product designs. In Muggleton, S., Otero, R.P., Tamaddoni-Nezhad, A., eds.: ILP. Volume 4455 of Lecture Notes in Computer Science, Springer (2006) 439–453
8. Józefowska, J., Lawrynowicz, A., Lukaszewski, T.: The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. Theory and Practice of Logic Programming 10(3) (2010) 251–289
9. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A Data Mining Ontology for algorithm selection and meta-learning. In: Proc. of the ECML/PKDD'09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09). (2009) 76–87
10. Kietz, J.U., Morik, K.: A polynomial approach to the constructive induction of structural knowledge. Machine Learning 14(2) (1994) 193–218
11. Iannone, L., Palmisano, I., Fanizzi, N.: An algorithm based on counterfactuals for concept learning in the Semantic Web. Appl. Intell. 26(2) (2007) 139–159
12. Fanizzi, N., d'Amato, C., Esposito, F.: DL-FOIL: Concept learning in Description Logics. In Železný, F., Lavrač, N., eds.: Proceedings of the 18th International Conference on Inductive Logic Programming, ILP 2008. Volume 5194 of LNAI. Springer, Prague, Czech Rep. (2008) 107–121
13. Lehmann, J.: DL-Learner: Learning concepts in description logics. Journal of Machine Learning Research (JMLR) 10 (2009) 2639–2642
Workflow-based Information Retrieval to Model Plant Defence Response to Pathogen Attacks

Dragana Miljković1, Claudiu Mihăilă3, Vid Podpečan1, Miha Grčar1, Kristina Gruden4, Tjaša Stare4, Nada Lavrač1,2

1 Jožef Stefan Institute, Ljubljana, Slovenia
2 University of Nova Gorica, Nova Gorica, Slovenia
3 Faculty of Computer Science, Al. I. Cuza University of Iași, Iași, Romania
4 Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
Abstract. The paper proposes a workflow-based approach to support the modelling of plant defence response to pathogen attacks. Currently, such models are built manually by merging expert knowledge, experimental results, and literature search. To this end, we have developed a methodology which supports the expert in the process of creation, curation, and evaluation of biological models by combining publicly available databases, natural language processing tools, and hand-crafted knowledge. The proposed solution has been implemented in the service-oriented workflow environment Orange4WS and evaluated using a manually developed Petri Net plant defence response model.
1 Introduction

Bioinformatics workflow management systems have been the subject of numerous research efforts in recent years. For example, Wikipedia lists 19 systems5 which are capable of executing some form of scientific workflows. Such systems offer numerous advantages in comparison with monolithic and problem-specific solutions. First, repeatability of experiments is easy, since the procedure (i.e. the corresponding workflow) and its parameters can be saved and reused. Second, if the tool is capable of using web services, this ensures a certain level of distributed computation and makes the system more reliable6 and independent. Third, as abstract representations of complex computational procedures, workflows are easy to understand and execute, even for non-experts. Finally, such systems typically offer easy access (e.g. by using web services) to large public databases such as PubMed, WOS, BioMart [5], the EMBL-EBI data resources7, etc.

5 http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
6 The reliability of web service-based solutions is debatable, but provided that there is a certain level of redundancy, such distributed systems are more reliable than single-source solutions [3].
7 http://www.ebi.ac.uk/Tools/webservices/

The topic of this paper is the defence response of plants to virus attacks, which has been investigated for a considerable time. However, individual research groups usually focus their experimental work on a subset of the entire defence system, while a model of the global defence response mechanism in plants is still to be developed. The motivation of biology experts to develop a more comprehensive model of the entire defence response is twofold. Firstly, it will provide a better understanding of the complex defence response mechanism in plants, which means highlighting connections in the network and understanding how these connections operate. Secondly, the prediction of experimental results through simulation will save time and indicate further research directions to biology experts. The development of a more comprehensive model of plant defence response for simulation purposes addresses three general research questions:
– what is the most appropriate formalism for representing the plant defence model,
– how to extract network structure; more precisely, how to retrieve relevant compounds and relations between them,
– how to determine network parameters such as initial compound values, speeds of the reactions, threshold values, etc.
Having studied different representation formalisms, we have decided to represent the model of the given biological network in the form of a graph. This paper addresses the second research question, i.e. the automated extraction of the graph structure through information retrieval and natural language processing techniques, with the emphasis on an implementation in a service-oriented workflow environment. We propose a workflow-based approach to support the modelling of plant defence response to pathogen attacks, and present an implementation of the proposed workflow in the service-oriented environment Orange4WS. The implementation combines open source natural language processing tools, data from publicly available databases, and hand-crafted knowledge. The evaluation of the approach is carried out using a manually crafted Petri net model which was developed by fusing expert knowledge and manual literature mining. The structure of the paper is as follows. Section 2 presents existing approaches to modelling plant defence response and discusses their advantages and shortcomings. Section 3 introduces our manually crafted Petri net model and proposes a workflow-based solution to assist the creation, curation, and evaluation of such models. Section 4 presents and evaluates the results of our work. Section 5 concludes the paper and proposes directions for further work.
2 Related work

Due to the complexity of the plant defence response mechanism, the challenge of building a general model for simulation purposes has not yet been fully addressed. Early attempts to accomplish numerical simulation by means of a Boolean formalism from experimental microarray data [4] have already indicated the complexity of defence response mechanisms and highlighted many crosstalk connections. Furthermore, many components mediating the beginning of the signalling pathway and the final response are missing. As the focus of interest of biology experts is now oriented towards what could be the bottlenecks in this response, such intermediate components are of particular interest. Other existing approaches, such as the MoVisPP tool [6], attempt to automatically retrieve information from databases and transfer the pathways into the Petri Net formalism. MoVisPP is an online tool which automatically produces Petri Net models from KEGG and BRENDA pathways. However, not all pathways are accessible, and the signalling pathways for plant defence response do not exist in databases. Tools for data extraction and graphical representation are also related to our work, as they are used to help experts understand the underlying biological principles. They can be roughly grouped according to their information sources: databases (Biomine [15], Cytoscape [16], ProteoLens [8], VisAnt [7], PATIKA [2]), databases and experimental data (ONDEX [9], BiologicalNetworks [1]), and literature (TexFlame [12]). More general approaches to the visualization of arbitrary textual data through triplets, such as [14], are also relevant. However, such general systems have to be adapted in order to be able to produce domain-specific models.
3 Approaches to modelling plant defence response

This section presents our manually crafted Petri Net model, built using the Cell Illustrator software [11]. We briefly describe the development cycle of the model and show some simulation results. The main part of the section discusses our workflow-based approach to assist the creation and curation of such biological models.
3.1 A Petri Net model of plant defence response

A Petri Net is a bipartite graph with two types of nodes: places and transitions. Standard Petri Net models are discrete and non-temporal, but their various extensions can represent both qualitative and quantitative models. The Cell Illustrator software implements the Hybrid Functional Petri Net extension, which was used in our study. In the Hybrid Functional Petri Net formalism, the speed of a transition depends on the amount of its input components, and both discrete and continuous places exist. Our manually crafted Petri Net model of plant defence response currently contains 52 substances and 41 reactions which, according to the Petri Net formalism, correspond to places and transitions, respectively. The model of the salicylic acid biosynthesis and signalling pathway, which is one of the key components of plant defence response, is shown in Figure 1. Early results of the simulation are already able to show the effects of the positive and negative feedback loops in the salicylic acid (SA) pathway, as shown in Figure 2. The red line represents the level of SA in the chloroplast, which is outside the positive feedback loop. The blue line represents the same component in the cytoplasm, which is inside the positive feedback loop.
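As a rough illustration of the hybrid functional idea that transition speeds depend on the amounts in their input places, the following toy sketch (ours) performs Euler-style continuous simulation steps; the place names, rate constants and network structure are invented for illustration and are not taken from the actual model.

```python
# Minimal sketch of a continuous Petri-net-style simulation step (Euler integration):
# each transition consumes from its input places and produces into its output places
# at a speed that depends on the current amounts in the input places.
places = {"SA_chloroplast": 1.0, "SA_cytoplasm": 0.0, "PR1": 0.0}   # invented initial marking

transitions = [
    # (inputs, outputs, rate constant); speed = k * product of input amounts
    ({"SA_chloroplast"}, {"SA_cytoplasm"}, 0.5),   # transport (toy)
    ({"SA_cytoplasm"}, {"PR1"}, 0.2),              # signalling towards PR1 expression (toy)
]

def step(places, transitions, dt=0.1):
    updates = {p: 0.0 for p in places}
    for inputs, outputs, k in transitions:
        speed = k
        for p in inputs:
            speed *= places[p]                     # speed depends on the input amounts
        for p in inputs:
            updates[p] -= speed * dt
        for p in outputs:
            updates[p] += speed * dt
    return {p: max(0.0, places[p] + updates[p]) for p in places}

for _ in range(5):
    places = step(places, transitions)
print(places)
```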
Fig. 1. A Petri Net model of salicylic acid biosynthesis and signaling pathway in plants. Relations in negative and positive feedback loop are colored red and green, respectively.
The peak of the blue line depicts the effect of the positive feedback loop, which rapidly increases the amount of SA. After reaching the peak, the trend of the blue line is negative, as the effect of the negative feedback loop prevails.
Fig. 2. Simulation results of the Petri Net model of salicylic acid pathway. The red line represents the level of SA in chloroplast that is out of the positive feedback loop. The blue line represents the same component in cytoplasm that is in the positive feedback loop.
The presented Petri Net model consists of two types of biological pathways: a metabolic part and a signalling part. The metabolic part is a cascade of reactions with small compounds as reactants, and it was manually obtained from the KEGG database. The signalling part is not available in databases and had to be obtained from the literature. The biology experts manually extracted the relevant information related to this pathway within a period of approximately two months. Bearing in mind that the salicylic acid pathway is only one out of three pathways that are involved in plant defence response, it is clear that a considerable amount of time would have to be invested if only a manual approach were employed.
3.2 Computer-assisted development of plant defence response models

The process of fusing expert knowledge and manually obtained information from the literature, as presented in the previous section, turns out to be time-consuming and non-systematic. Therefore, it is necessary to employ more automated methods of extracting relevant information. Our proposed solution is based on a service-oriented approach using scientific workflows. Web services offer a platform-independent implementation of processing components, which makes our solution more general, as it can be used in any service-oriented environment. Furthermore, by composing the developed web services into workflows, our approach offers reusability and repeatability, and can be easily extended with additional components.
Our implementation is based on Orange4WS, a service-oriented workflow environment which also offers tools for developing new services based on existing software libraries. For natural language processing we employed functions from the NLTK library [10], which were transformed into web services. Additionally, the GENIA tagger [17] for biological domains was used to perform part-of-speech tagging and shallow parsing. The data was extracted from PubMed and WOS using web-service enabled access. A workflow diagram for computer-assisted creation of plant defence models from textual data is shown in Figure 3. It is composed of the following elements:
1. a PubMed web service and WOS search to extract article data,
2. a PDF-to-text converter service, which is based on Poppler8, an open source PDF rendering library,
3. NLP web services based on NLTK: tokenizer, shallow parser (chunker), sentence splitter,
4. the GENIA tagger,
5. filtering components, e.g. contradiction removal, synonymity resolver, etc.
The idea underlying this research was to extract sets in the triplet form
{Subject, Predicate, Object} from biological texts which are freely available. The defence response related information is obtained by employing a vocabulary which we have manually developed for this specific field. Subject and Object are biological compounds such as proteins, genes or small compounds, and their names and synonyms are built into the vocabulary, whereas Predicate represents the relation or interaction between the compounds. We have defined four types of reactions, i.e. activation, inhibition, binding and degradation, and the synonyms for these reactions are also included in the vocabulary. An example of such a triplet is shown below:
{PAD4 protein, activates, EDS5 gene}

Such triplets, if automatically found in text and visualized in a graph, can assist the development and finalization of the plant defence response Petri Net model for simulation purposes. Triplet extraction is performed by employing simple rules: the last noun of the first noun phrase is taken as the Subject; the Predicate is a part of a verb phrase located between the noun phrases; the Object is then detected as a part of the first noun phrase after the verb phrase. The triplets are further enhanced by linking them to the associated biological lexicon of synonyms, BioLexicon [13]. In addition to these rules, pattern matching against the dictionary is performed to search for more complicated phrases in the text, to enhance the information extraction. The relevant information (a graph) is then visualized using the native Orange graph visualizer or the Biomine visualization component provided by Orange4WS. An example of such a graph is shown in Figure 4.
8 http://poppler.freedesktop.org/
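A sketch (ours) of the kind of rule-based triplet extraction described above, using NLTK's tokenizer, part-of-speech tagger and regular-expression chunker; the chunk grammar and the example sentence are our own, and the actual pipeline additionally uses the GENIA tagger and the hand-crafted domain vocabulary.

```python
import nltk
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

grammar = "NP: {<DT>?<JJ>*<NN.*>+}\nVP: {<VB.*>+}"   # simple noun/verb phrase chunks
chunker = nltk.RegexpParser(grammar)

def extract_triplet(sentence):
    tokens = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tokens)
    chunks = [(st.label(), [w for w, _ in st.leaves()])
              for st in tree.subtrees() if st.label() in ("NP", "VP")]
    # rule of thumb: last noun of the first NP, the following VP, first NP after that VP
    for i, (label, words) in enumerate(chunks):
        if label == "NP":
            subject = words[-1]
            for j in range(i + 1, len(chunks)):
                if chunks[j][0] == "VP":
                    predicate = " ".join(chunks[j][1])
                    for k in range(j + 1, len(chunks)):
                        if chunks[k][0] == "NP":
                            return (subject, predicate, " ".join(chunks[k][1]))
    return None

print(extract_triplet("The PAD4 protein activates the EDS5 gene."))
# roughly -> ('protein', 'activates', 'the EDS5 gene'), depending on the tagger output
```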
Fig. 3. Workflow schema which enables information retrieval from public databases to support modelling of plant defence response.
While such automatically extracted knowledge currently cannot compete, in terms of detail and correctness, with the manually crafted Petri net model, it can be used to assist the expert in the process of building and curating the model. It can also provide novel relevant information not known to the expert. Provided that wet lab experimental data are available, some parts of the automatically built models could also be evaluated automatically. This, however, is currently out of the scope of the research presented here.
4 Results: An illustrative example

Consultation with the biology experts resulted in a first round of experiments performed on a set of the ten most relevant articles from the field published after 2005. Figure 4 shows the extracted triplets, visualized using the Biomine visualizer, which is available as a widget in the Orange4WS environment. Salicylic acid (SA) appears to be the central component in the graph, which confirms the biological fact that salicylic acid is indeed one of the three main components in plant defence response. The information contained in the graph of Figure 4 is similar to the initial knowledge obtained by the biology experts through manual information retrieval from the literature9. Such a graph, however, cannot provide the cascade network structure which is closer to reality (and to the manually crafted Petri Net model). The first feedback from the biology experts is positive. Even though this approach cannot completely substitute human experts, biologists consider it a helpful tool for accelerating information retrieval from the literature. The presented results indicate the usefulness of the proposed approach, but also the necessity to further improve the quality of the information extraction.

Fig. 4. A set of extracted triplets, visualized using the Biomine graph visualizer.

9 It is worth noting that before the start of the joint collaboration between the computer scientists and the biology experts, the collaborating biology experts had already tried to manually extract knowledge from scientific articles in the form of a graph, and had succeeded in building a simple graph representation of the SA biosynthesis and signalling pathway.
5 Conclusion

In this paper we presented a methodology which supports the domain expert in the process of creation, curation, and evaluation of plant defence response models by combining publicly available databases, natural language processing tools, and hand-crafted knowledge. The methodology was implemented in a service-oriented workflow environment by constructing a reusable workflow, and evaluated using a hand-crafted Petri Net model. This Petri Net model has been developed by fusing expert knowledge, experimental results and literature reading; it serves as a baseline for the evaluation of automatically mined plant defence response knowledge, but it also enables computer simulation and prediction. In further work we plan to continue the development and curation of the Petri Net model, and to implement additional filters and workflow components to improve the computer-assisted creation of plant defence response models. As the presented methodology is general, future work will also concentrate on the development of other biological models. Finally, we are preparing a public release of our workflow-based implementation. This will provide us with much-needed feedback from experts, which will help us improve the knowledge extraction process.
Acknowledgments

This work is partially supported by the AD Futura scholarship and the Slovenian Research Agency grants P2-0103 and J4-2228. We are grateful to Lorand Dali and Delia Rusu for constructive discussions and suggestions.
References
1. M. Baitaluk, M. Sedova, A. Ray, and A. Gupta. BiologicalNetworks: visualization and analysis tool for systems biology. Nucl. Acids Res., 34(suppl 2):W466–471, 2006.
2. E. Demir, O. Babur, U. Dogrusoz, A. Gursoy, G. Nisanci, R. Cetin-Atalay and M. Ozturk. PATIKA: An integrated visual environment for collaborative construction and analysis of cellular pathways. Bioinformatics, 18(7):996–1003, 2002.
3. T. Erl. Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall, 2006.
4. T. Genoud, M. B. Trevino Santa Cruz, and J.-P. Metraux. Numeric simulation of plant signaling networks. Plant Physiology, 126(4):1430–1437, August 1, 2001.
5. S. Haider, B. Ballester, D. Smedley, J. Zhang, P. Rice and A. Kasprzyk. BioMart Central Portal - unified access to biological data. Nucleic Acids Res., 37(Web Server issue):W23–27, 2009. Epub 2009 May 6.
6. S. Hariharaputran, R. Hofestädt, B. Kormeier, and S. Spangardt. Petri net models for the semi-automatic construction of large scale biological networks. Natural Computing, Springer Science and Business, 2009.
7. Z. Hu, J. Mellor, J. Wu, and C. DeLisi. VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Research, 33:W352–W357, 2005.
8. T. Huan, A.Y. Sivachenko, S.H. Harrison, J.Y. Chen. ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining. BMC Bioinformatics, 9(Suppl 9):S5, 2008.
9. J. Köhler, J. Baumbach, J. Taubert, M. Specht, A. Skusa, A. Rüegg, C. Rawlings, P. Verrier and S. Philippi. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics, 22(11), 2006.
10. E. Loper and S. Bird. NLTK: The Natural Language Toolkit. Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 62–69, Philadelphia, Association for Computational Linguistics, July 2002.
11. H. Matsuno, S. Fujita, A. Doi, M. Nagasaki, S. Miyano. Towards biopathway modeling and simulation. Lecture Notes in Computer Science, 2679:3–22, 2003.
12. N. Le Novère, M. Hucka, H. Mi, S. Moodie, F. Schreiber, A. Sorokin, E. Demir, K. Wegner, M.I. Aladjem, S.M. Wimalaratne, F.T. Bergman, R. Gauges, P. Ghazal, H. Kawaji, L. Li, Y. Matsuoka, A. Villéger, S.E. Boyd, L. Calzone, M. Courtot, U. Dogrusoz, T.C. Freeman, A. Funahashi, S. Ghosh, A. Jouraku, S. Kim, F. Kolpakov, A. Luna, S. Sahle, E. Schmidt, S. Watterson, G. Wu, I. Goryanin, D.B. Kell, C. Sander, H. Sauro, J.L. Snoep, K. Kohn, H. Kitano. The Systems Biology Graphical Notation. Nature Biotechnology, 27(8):735–741, 2009.
13. D. Rebholz-Schuhmann, P. Pezik, V. Lee, R. del Gratta, J.J. Kim, Y. Sasaki, J. McNaught, S. Montagni, M. Monachini, N. Calzolari, S. Ananiadou. BioLexicon: Towards a reference terminological resource in the biomedical domain. Poster at the 16th International Conference on Intelligent Systems for Molecular Biology, 2008.
14. D. Rusu, B. Fortuna, D. Mladenić, M. Grobelnik, R. Sipoš. Document Visualization Based on Semantic Graphs. In Proceedings of the 13th International Conference Information Visualisation, 2009.
15. P. Sevon, L. Eronen, P. Hintsanen, K. Kulovesi, and H. Toivonen. Link discovery in graphs derived from biological databases. In Proceedings of the 3rd International Workshop on Data Integration in the Life Sciences, 2006.
16. P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, T. Ideker. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13:2498–2504, 2003.
17. Y. Tsuruoka, Y. Tateishi, J. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii. Developing a robust part-of-speech tagger for biomedical text. Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382–392, 2005.
OpenTox: A Distributed REST Approach to Predictive Toxicology

Tobias Girschick1, Fabian Buchwald1, Barry Hardy2, and Stefan Kramer1

1 Technische Universität München, Institut für Informatik/I12, Boltzmannstr. 3, 85748 Garching b. München, Germany
{tobias.girschick, fabian.buchwald, stefan.kramer}@in.tum.de
2 Douglas Connect, Baermeggenweg 14, 4314 Zeiningen, Switzerland
[email protected]
Abstract. While general-purpose data mining has a role to play on the internet of services, there is a growing demand for services particularly tailored to application domains in industry and science. In the talk, we present the results of the European Union funded project OpenTox [1] (see http://www.opentox.org), which aims to build a web service-based framework specifically for predictive toxicology. OpenTox is an interoperable, standards-based framework for the support of predictive toxicology data and information management, algorithms, (Quantitative) Structure-Activity Relationship modeling, validation and reporting. Data access and management, algorithms for modeling, feature construction and feature selection, as well as the use of ontologies, are core components of the OpenTox framework architecture. Alongside the extensible Application Programming Interface (API) that can be used by contributing developers, OpenTox provides the end-user oriented applications ToxPredict (http://www.toxpredict.org) and ToxCreate (http://toxcreate.org/create). These are built on top of the API and are especially useful to non-computational scientists. The very flexible component-based structure of the framework allows for the combination of different services into multiple applications. All framework components are API-compliant REST web services that can be combined into distributed and interoperable tools. New software tools developed by OpenTox partners, such as FCDE [2], FMiner [3] or Last-PM [4], which are particularly suited for toxicology predictions with chemical data input, are integrated. The advantages of the framework should encourage researchers from machine learning and data mining to get involved and develop new algorithms within the framework, which offers high-quality data, controlled vocabularies and standard validation routines.
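To convey the REST style of interaction described above, the following sketch uses Python's requests library; the service URLs, resource paths and parameter names are placeholders of our own and not the documented OpenTox API, which should be consulted for the real interface.

```python
import requests

# Placeholder URIs: consult the OpenTox API documentation for the real resource paths.
ALGORITHM_URI = "http://example-opentox-service.org/algorithm/example-learner"
DATASET_URI = "http://example-opentox-service.org/dataset/42"

# Train a model by POSTing a dataset URI to an algorithm resource; in this REST style
# the service answers with the URI of the newly created model resource.
response = requests.post(ALGORITHM_URI, data={"dataset_uri": DATASET_URI},
                         headers={"Accept": "text/uri-list"}, timeout=60)
response.raise_for_status()
model_uri = response.text.strip()

# Apply the model to another dataset (again using a placeholder parameter name).
prediction = requests.post(model_uri, data={"dataset_uri": DATASET_URI},
                           headers={"Accept": "text/uri-list"}, timeout=60)
print(prediction.text)
```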
Acknowledgements

This work was supported by the EU FP7 project (HEALTH-F5-2008-200787) OpenTox (http://www.opentox.org) and the TUM Graduate School.
References
1. Hardy, B., Douglas, N., Helma, C., Rautenberg, M., Jeliazkova, N., Jeliazkov, V., Nikolova, I., Benigni, R., Tcheremenskaia, O., Kramer, S., Girschick, T., Buchwald, F., Wicker, J., Karwath, A., Gütlein, M., Maunz, A., Sarimveis, H., Melagraki, G., Afantitis, A., Sopasakis, P., Gallagher, D., Poroikov, V., Filimonov, D., Zakharov, A., Lagunin, A., Gloriozova, T., Novikov, S., Skvortsova, N., Druzhilovsky, D., Chawla, S., Ghosh, I., Ray, S., Patel, H., Escher, S.: Collaborative Development of Predictive Toxicology Applications. Journal of Cheminformatics, accepted (2010)
2. Buchwald, F., Girschick, T., Frank, E., Kramer, S.: Fast Conditional Density Estimation for Quantitative Structure-Activity Relationships. In: Proc. of the 24th AAAI Conference on Artificial Intelligence, AAAI Press (2010) 1268–1273
3. Maunz, A., Helma, C., Kramer, S.: Efficient Mining for Structurally Diverse Subgraph Patterns in Large Molecular Databases. Machine Learning, in press (2010)
4. Maunz, A., Helma, C., Cramer, T., Kramer, S.: Latent Structure Pattern Mining. In: Proc. of ECML/PKDD 2010, accepted (2010)