Automatic Definition of KDD Prototype Processes by Composition

Claudia Diamantini, Domenico Potena and Emanuele Storti
Dipartimento di Ingegneria Informatica, Gestionale e dell'Automazione "M. Panti", Università Politecnica delle Marche, via Brecce Bianche, 60131 Ancona, Italy
{diamantini, potena, storti}@diiga.univpm.it

Abstract One of the most interesting challenges in Knowledge Discovery in Databases (KDD) is to support users in composing tools to form a valid and useful KDD process. As a matter of fact, the design of a KDD experiment implies the combined use of several data manipulation tools suited to the knowledge discovery problem at hand. Users must therefore possess a considerable amount of knowledge and expertise about the functionalities and properties of all KDD algorithms implemented in the available tools, in order to choose the right ones and compose them properly. To support users in this demanding activity, we introduce a goal-driven procedure that automatically discovers candidate prototype processes by composing basic algorithms. The core element of this procedure is algorithm matching, which exploits an ontology formalizing the domain of KDD algorithms. The present work focuses on the definition and evaluation of algorithm matching criteria.

1 Introduction

Knowledge Discovery in Databases (KDD) aims to extract valid, unknown and potentially useful knowledge from a given database. To achieve such a goal, a KDD process, involving different tools of various nature and interacting with human experts, has to be designed and executed. Since each tool has its own characteristics and performance, and is suitable for a specific task, users need technical skills and strong expertise in order to choose, set up, compose and execute tools. This makes KDD process design a highly complex activity, hard to manage both for domain managers, with limited knowledge of KDD tools, and for KDD experts, who typically master only a few KDD techniques. For these reasons, one of the most interesting challenges in the KDD field is to support any kind of user in both tool discovery and process composition. These two activities are closely related, because composition requires discovering suitable tools, which are then linked together in order to build valid and useful knowledge discovery processes.


We adopt an automatic top-down process composition strategy: at first we define a process, at a conceptual level, as a composition of algorithms; then this process is instantiated by appropriately substituting each algorithm with one of the tools implementing it. We would like to point out that the proposed strategy allows us to produce prototype KDD processes, which are general and reusable; moreover, the generated processes can themselves be considered useful, valid and unknown knowledge. The design of a process can ultimately be reduced to the sub-problem of algorithm matching, which is the basic step for automatically composing KDD processes. The present work focuses on defining and evaluating algorithm matching criteria. Given two algorithms, matching is based on the comparison between the output of the first and the input of the second, in order to determine whether they can be executed in sequence, i.e. whether their interfaces are compatible. In order to guide the matching, we have described KDD algorithms, their properties and their interfaces in KDDONTO, a formal domain ontology [6]. In this way, it is possible to reason on the ontology to deduce non-explicit, hidden relations among algorithms. In particular, by browsing semantic relations it is possible not only to find suitable algorithms on the basis of an exact match between input and output, but also to define approximate matches. These matches are based on subsumption relations and, unlike many previous works, also on parthood relations between a compound data structure and its subcomponents. A score can be assigned to each kind of match according to a semantic distance function; the generated processes are thus ranked, and users can choose the most suitable one w.r.t. their requests. The remainder of this paper is organized as follows. Section 2 presents KDDONTO and its main concepts and relations. Section 3 introduces the algorithm matching criteria, which are used as the basic step of the process composition procedure. Finally, Section 4 discusses relevant related works and Section 5 ends the paper.

2 KDD Ontology

KDDONTO is an ontology describing the domain of KDD algorithms, conceived for supporting the discovery of KDD algorithms and their composition. In order to build such an ontology, among the proposed ontology building methodologies we chose a formal approach based on the goal-oriented step-wise strategy described in [11]; moreover, the quality requirements and formal criteria defined in [8] are taken into account, with the aim of making meaning explicit and unambiguous. The key class is Algorithm, because it is the basic component of each process. The other top-level classes, from which any other class is defined, are the following:

• Method: a technique used by an algorithm to extract knowledge from input data;
• Phase: a phase of a KDD process;
• Task: the goal aimed at by whoever executes a KDD process;
• Data: models (a set of constructs and rules for representing knowledge), datasets (a set of data in a proper format) and parameters (any information required in input or produced in output by an algorithm);


• DataFeature: specific preconditions/postconditions that an input (or output) must satisfy in order to be used by a method or an algorithm. Such conditions concern format (normalized dataset), type (numeric or literal values), or quality (missing values, balanced dataset) properties of an input/output datum;
• PerformanceIndex and PerformanceClass: an index and a class of values about the way an algorithm works.

Subclasses are then defined by means of existential restrictions on the main classes, which can be considered the fundamental bricks for building the ontology. Many relations are defined among classes, but for lack of space we introduce only the most relevant for the purposes of this work:

• has input/has output (and the inverse properties input for/output for): n-ary relations with domain Algorithm, Method or Task and codomain Data. Furthermore, the optional relations has condition and is parameter are defined for each I/O datum. The former specifies a pre/postcondition on the input/output at hand. The latter is a boolean property allowing us to distinguish between an input on which the algorithm works and an input which is used for tuning the internal functionalities of the algorithm (i.e. a parameter). For instance, an MLP requires, among others, a dataset as input and the number of training epochs as a parameter: the algorithm elaborates upon the dataset until the chosen number of epochs is reached. As concerns preconditions, a value expressing the precondition strength may be provided: a value equal to 1.0 corresponds to a mandatory precondition, whereas lower values correspond to relaxable ones. Two conditions can be linked together through the property in contrast if they are contradictory, e.g. the condition NUMERIC is in contrast with the condition LITERAL, because a datum cannot be numeric and literal at the same time;
• uses links an instance of Algorithm to one or more implemented instances of Method, whereas specifies task assigns an instance of Method to the related instances of Task;
• in module/out module connect an instance of Algorithm to others, which can be executed respectively before or after it. These relations provide suggestions about process composition, representing in an explicit fashion the KDD experts' experience about process building;
• part of (and its inverse has part), between a compound datum and one of its components (Data instances). Many different meanings of the parthood relation have been studied in mereology theory. According to the terminology first introduced in [12], in this work we refer to component/integral part-of, i.e. a configuration of parts within a whole. Such a transitive relation allows us to describe a model in terms of the subcomponents it is made of, and is useful for identifying algorithms working on similar models, that is, models having common substructures, as discussed in the next section.

At present, KDDONTO is represented in OWL-DL, whose logical model is based on Description Logics and is decidable. An implementation of KDDONTO is available at http://boole.diiga.univpm.it/kddontology.owl. For further details about KDDONTO classes and relations, we refer the interested reader to [5].
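To make these notions concrete, the following minimal Python sketch shows one way the interface-related classes and relations (has input/has output, is parameter, has condition) could be mirrored in code for matching experiments. All identifiers are illustrative assumptions and do not reproduce the actual OWL entities of kddontology.owl; the sketch merely sets up a running example reused in the later sections.

```python
# A minimal in-memory sketch of KDDONTO-style algorithm interfaces.
# All names are illustrative assumptions, not the actual OWL identifiers.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Condition:
    name: str              # e.g. "NUMERIC", "LITERAL", "NORMALIZED"
    strength: float = 1.0  # 1.0 = mandatory precondition, lower = relaxable

@dataclass
class IOData:
    concept: str                # Data concept, e.g. "DATASET", "LVQ_MODEL"
    is_parameter: bool = False  # True for tuning inputs (e.g. training epochs)
    conditions: tuple = ()      # pre/postconditions attached via has_condition

@dataclass
class Algorithm:
    name: str
    inputs: list = field(default_factory=list)   # has_input
    outputs: list = field(default_factory=list)  # has_output

# Example: an MLP takes a numeric dataset as input and the number of
# training epochs as a parameter, and outputs a classification model.
mlp = Algorithm(
    name="MLP",
    inputs=[IOData("DATASET", conditions=(Condition("NUMERIC", 1.0),)),
            IOData("EPOCHS", is_parameter=True)],
    outputs=[IOData("CLASSIFICATION_MODEL")],
)
```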


Fig. 1 An example of approximate match: VQ is part of LVQ and DATASET ≡o DATASET′. The only precondition is on the input DATASET′ and its value is FLOAT, which is a specialization of NO LITERAL, which in turn is the value of the postcondition of the output DATASET. The precondition strength is 0.4.
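Under the same illustrative assumptions as the sketch in Section 2, the Figure 1 scenario could be encoded as follows; the algorithm names and the DATASET/LVQ/VQ identifiers are hypothetical stand-ins for the ontology instances shown in the figure.

```python
# The Figure 1 scenario in the toy representation: the first algorithm
# outputs a DATASET (postcondition NO_LITERAL) and an LVQ model; the second
# requires a DATASET whose precondition FLOAT has strength 0.4, plus a VQ
# model, which is part of LVQ. All identifiers are hypothetical.
lvq_producer = Algorithm(
    name="LVQ_Training",
    outputs=[IOData("DATASET", conditions=(Condition("NO_LITERAL"),)),
             IOData("LVQ_MODEL")],
)
vq_consumer = Algorithm(
    name="VQ_Consumer",
    inputs=[IOData("DATASET", conditions=(Condition("FLOAT", strength=0.4),)),
            IOData("VQ_MODEL")],
)
```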

3 Algorithm Matching

For the purposes of this work, we define a KDD process as a workflow of algorithms that allows the goal requested by the user to be achieved. The basic issue in composition is to define algorithm matching, that is, to specify under which conditions two or more algorithms can be executed in sequence. Each algorithm takes data with certain features in input, performs some operations and returns data in output, which are then used as input for the next algorithm in the process. Therefore, two algorithms can be matched if the output of the first is compatible with the input of the second. An exact match between a set of algorithms $\{A_1,\dots,A_n\}$ and an algorithm $B$ is defined as:

$match_E(\{A_1,\dots,A_n\}, B) \leftrightarrow \forall\, in_B^i\ (is\_parameter(in_B^i) \lor \exists A_k\, \exists out_{A_k}^j : out_{A_k}^j \equiv_o in_B^i \land valid(out_{A_k}^j, in_B^i))$

where $in_B^i$ is the $i$th input of algorithm $B$ and $out_{A_k}^j$ is the $j$th output of algorithm $A_k$. The symbol $\equiv_o$ represents conceptual equivalence and is defined as follows: let $a$ and $b$ be two data; $a \equiv_o b$ if $C_a \sqsubseteq C_b$, i.e. $a$ and $b$ refer to the same concept¹ or $C_a$ is a subconcept of $C_b$. The predicate $valid(out_{A_k}^j, in_B^i)$ is satisfied if none of the postconditions of $out_{A_k}^j$ is in contrast with any of the preconditions of $in_B^i$. By exploiting the properties of algorithms described in the previous section, it is possible to define a match based not only on exact criteria, but also on similarity among data. We can define a match among algorithms even if their interfaces are not perfectly equivalent: the relaxation of constraints results in a wider set of possible matches. Hence, an approximate match between a set of algorithms $\{A_1,\dots,A_n\}$ and an algorithm $B$ is defined as:

$match_A(\{A_1,\dots,A_n\}, B) \leftrightarrow \forall\, in_B^i\ (is\_parameter(in_B^i) \lor \exists A_k\, \exists out_{A_k}^j : similar(out_{A_k}^j, in_B^i) \land valid(out_{A_k}^j, in_B^i))$

where the similarity predicate $similar(x, y)$ is satisfied if $x \equiv_o y$, or if $x$ and $y$ are similar concepts, i.e. if there is a path in the ontology graph linking them together.

¹ Hereafter we use "class" and "concept" as synonyms.
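As a concrete reading of the exact-match definition, here is a sketch of $match_E$ over the toy structures introduced in Section 2. The IS_A and IN_CONTRAST tables are illustrative stand-ins for the ontology's subsumption and in contrast relations, not actual KDDONTO content.

```python
# A sketch of match_E over the toy structures above. IS_A and IN_CONTRAST
# are assumed stand-ins for the ontology's subsumption and in_contrast
# relations (illustrative content only).
IS_A = {("LVQ_MODEL", "CLASSIFICATION_MODEL"), ("FLOAT", "NO_LITERAL")}
IN_CONTRAST = {frozenset({"NUMERIC", "LITERAL"})}

def equiv_o(a: str, b: str) -> bool:
    """a ≡o b: same concept, or C_a is a subconcept of C_b."""
    return a == b or (a, b) in IS_A

def valid(out: IOData, inp: IOData) -> bool:
    """No postcondition of `out` is in contrast with a precondition of `inp`."""
    return not any(frozenset({post.name, pre.name}) in IN_CONTRAST
                   for post in out.conditions for pre in inp.conditions)

def match_exact(providers: list, b: Algorithm) -> bool:
    """match_E({A_1..A_n}, B): every non-parameter input of B is covered by
    some conceptually equivalent, condition-compatible output, mirroring the
    is_parameter disjunct in the definition."""
    outs = [o for a in providers for o in a.outputs]
    return all(inp.is_parameter or
               any(equiv_o(o.concept, inp.concept) and valid(o, inp)
                   for o in outs)
               for inp in b.inputs)
```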


An approximate match is useful not only when an exact match cannot be performed, but also for extracting unknown and non-trivial processes. The similarity between concepts can be evaluated on the basis of various ontological relations. Previous works about matching (see Section 4) commonly take into account only relations at the hierarchical level: a specific datum is similar to its superclass, e.g. a Labeled Vector Quantization model (LVQ) is similar to a more general classification model. However, generalization/specialization is not enough to exploit the complex data structures which are common in the Data Mining field; for this reason we consider similarity also at the structural level: a compound datum can be made of simpler data, according to part of/has part relationships. To give a practical example, an LVQ has part a VQ model and a Labeling function (L). As shown in Figure 1, if an algorithm requires a VQ model in input, an LVQ model can be provided in its place, because the latter contains all the needed information. In order to evaluate degrees of approximation, it is useful to assign a numeric weight to each match, such that the higher the weight, the less accurate the match. Given two similar data, a path is the shortest sequence of is-a or part of edges needed to link them in the ontology graph. We define the ontological distance between two data as the summation of the weighted edges in the path:

$D_o(out_A, in_B) = \sum_{i=1}^{|path|} \delta_i$

where $\delta_i$ is the weight of the $i$th edge in the path; its value depends on the relation type, such that $\delta_{spec(ialization)} < \delta_{part(hood)} < \delta_{gen(eralization)}$. Unlike many previous works (e.g. [1, 3]), we weight generalization and specialization differently; this asymmetry is due to the different amount of information carried by the two relations. For instance, since CLASSIFICATION MODEL is a generalization of LVQ, the latter contains all the information of the superclass and can easily be used when a CLASSIFICATION MODEL is required, but the vice versa does not hold. For this reason $\delta_{spec} \le \delta_{gen}$ and, since we use the subsumption relation in the exact match, $\delta_{spec}$ is set to 0. The part-of relation could be considered a kind of specialization and weighted in the same manner, e.g. an LVQ could also be viewed as a specialization of a VQ, because an LVQ adds features to a VQ. However, this approach, besides being incorrect from a conceptual point of view, is also wrong from a pragmatic perspective. As a matter of fact, part-of intrinsically requires an additional manipulation (which implies an additional cost) in order to extract the needed subcomponents from the compound datum; in the example, this means separating the VQ model from its Labeling function (the dotted square in Figure 1). Finally, given two algorithms A and B, we define the cost of their match:


$C_m(A, B) = \frac{\alpha\,\gamma}{n_B} \sum_{i=1}^{n_B} \beta\, D_o^i$

where $n_B$ is the number of inputs of algorithm $B$ and $D_o^i$ is the ontological distance computed for the $i$th input; the coefficients $\alpha$, $\beta$ and $\gamma$ are introduced to take into account other weighting factors, as follows:

α) use of link modules: given two algorithms linked through the in module and out module properties, $C_m$ is decreased, because these relations state that a specific connection between them has proved to be effective;
β) precondition relaxation: in algorithm matching, the postconditions of the first algorithm must not be in contrast with the preconditions of the second one, as regards the same data. Preconditions on some data can be relaxed if they are non-mandatory, i.e. if their condition strength value is lower than 1. Relaxing preconditions increases the terms in the $C_m$ summation, because in such cases algorithm execution may lead to lower quality outcomes;
γ) performance evaluation: algorithm performances can affect $C_m$. For example, in the case of a computational complexity index, the higher the complexity, the higher the cost (a computational sketch of $D_o$ and $C_m$ follows this list).
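Putting the pieces together, the following sketch shows how the ontological distance $D_o$ (as a weighted shortest path) and the matching cost $C_m$ could be computed under the weighting scheme above. The edge set, the $\delta$ values and the $\alpha$, $\beta$, $\gamma$ defaults are illustrative assumptions; for simplicity we also assume that parameters are excluded from the summation.

```python
# A sketch of the ontological distance D_o (weighted shortest path) and the
# matching cost C_m. Edge weights, alpha/beta/gamma and the toy graph are
# illustrative assumptions, not values prescribed by the paper.
import heapq

# (specific/whole, general/part) -> relation type traversed in that direction;
# delta_spec = 0 < delta_part < delta_gen, as required in Section 3.
EDGES = {("LVQ_MODEL", "CLASSIFICATION_MODEL"): "gen",
         ("LVQ_MODEL", "VQ_MODEL"): "part",
         ("FLOAT", "NO_LITERAL"): "gen"}
DELTA = {"spec": 0.0, "part": 0.3, "gen": 0.6}

def ontological_distance(out_concept: str, in_concept: str) -> float:
    """D_o: cost of the cheapest path of is-a/part-of edges (Dijkstra)."""
    queue, seen = [(0.0, out_concept)], set()
    while queue:
        dist, node = heapq.heappop(queue)
        if node == in_concept:
            return dist
        if node in seen:
            continue
        seen.add(node)
        for (src, dst), rel in EDGES.items():
            if src == node:
                heapq.heappush(queue, (dist + DELTA[rel], dst))
            elif dst == node and rel == "gen":
                # walking an is-a edge downwards counts as specialization
                heapq.heappush(queue, (dist + DELTA["spec"], src))
    return float("inf")  # no linking path: the concepts are not similar

def match_cost(a_outputs: list, b_inputs: list,
               alpha: float = 1.0, beta: float = 1.0,
               gamma: float = 1.0) -> float:
    """C_m(A,B) = (alpha*gamma/n_B) * sum_i beta * D_o^i, pairing each
    non-parameter input of B with its cheapest matching output of A."""
    data_inputs = [i for i in b_inputs if not i.is_parameter]
    if not data_inputs:
        return 0.0
    total = sum(beta * min((ontological_distance(o.concept, i.concept)
                            for o in a_outputs), default=float("inf"))
                for i in data_inputs)
    return (alpha * gamma / len(data_inputs)) * total
```

For the Figure 1 pair, match_cost(lvq_producer.outputs, vq_consumer.inputs) would charge one $\delta_{part}$ step for covering the VQ model input from the LVQ output, while the DATASET input matches at distance 0.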

3.1 Process Composition Procedure

Based on algorithm matching, we define a goal-driven procedure for composing KDD processes. Our approach is aimed at generating an ordered list of all valid prototype processes satisfying the user's requests; this allows the user to choose among processes with different characteristics and to experiment with more than a single solution. In each phase of the procedure, a reasoner is exploited to make inferences on the ontology in order to search for matchable algorithms. The proposed process composition procedure consists of the following phases (a sketch of the process-building phase is given after this list):

• dataset and goal definition: the user provides the characteristics (i.e. instances of the DataFeature class) of the dataset to mine and specifies the goal (i.e. an instance of the Task class) to achieve;
• process building: an iterative phase, which starts from the given task and proceeds backwards, adding one or more algorithms to a process at each step. Firstly, through the specifies task and uses relations, all instances of the Algorithm class performing the given task are extracted. From these starting algorithms, prototype processes are generated by means of the algorithm matching criteria defined in the previous section. This backwards procedure goes on until the first algorithm of each process is compatible with the given dataset;
• process ranking: the generated processes are ranked on the basis of the process cost, i.e. the summation of all the matching costs $C_m$ in the process.

Due to page limitations, we refer the interested reader to [6] for further details about the composition procedure.
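As referenced above, here is a simplified sketch of the backwards process-building phase, reusing the match predicate from Section 3. The enumeration strategy, the pruning depth and the dataset-compatibility test are illustrative assumptions; the actual procedure in [6] works on the ontology through a reasoner.

```python
# A simplified sketch of the backwards process-building phase: start from
# algorithms performing the requested task and prepend matching algorithms
# until every remaining input can be served by the user's dataset.
# `goal_algos`, `all_algorithms` and `dataset_concepts` are assumed inputs.

def compatible_with_dataset(algo: Algorithm, dataset_concepts: set) -> bool:
    """True if every non-parameter input of `algo` is available as-is
    among the concepts describing the user's dataset."""
    return all(i.is_parameter or i.concept in dataset_concepts
               for i in algo.inputs)

def build_processes(goal_algos, all_algorithms, dataset_concepts, max_len=4):
    """Enumerate candidate prototype processes (lists of algorithms) by
    growing each process backwards from a goal algorithm."""
    frontier = [[a] for a in goal_algos]  # each candidate starts at the goal
    results = []
    while frontier:
        proc = frontier.pop()
        head = proc[0]
        if compatible_with_dataset(head, dataset_concepts):
            results.append(proc)   # first algorithm fits the dataset: done
            continue
        if len(proc) >= max_len:
            continue               # prune overly long candidates
        for cand in all_algorithms:
            if cand not in proc and match_exact([cand], head):
                frontier.append([cand] + proc)
    return results
```

Process ranking would then sum the matching costs $C_m$ along each returned process, as described in the last phase above.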


4 Related Works

In recent years, researchers in the Data Mining and Knowledge Discovery in Databases fields have shown more and more interest in techniques for supporting the design of knowledge discovery processes. Some earlier approaches were proposed for supporting process composition, but most of them define no automatic procedure, and only recent works have dealt with this issue ([2, 13]). In detail, in [2] the authors define a simple ontology of KDD algorithms, which is exploited for designing KDD processes dealing with cost-sensitive classification problems. A forward composition, from dataset characteristics towards the goal, is achieved through a systematic enumeration of valid processes, which are ranked on the basis of the accuracy achieved on the processed dataset and of the process speed. [13] introduces a KDD ontology representing concrete implementations of algorithms and any piece of knowledge involved in a KDD process (dataset and model), which is exploited for guiding a forward state-space search planning algorithm in the design of a KDD workflow. Such an ontology describes algorithms with very few classes and a poor set of relationships, resulting in a flat knowledge base. Both in [2] and [13], the ontologies are not rich enough to be extensively used for deducing hidden relations among algorithms or for supporting approximate matches. Our approach, moreover, aims to achieve a higher level of generality, by producing abstract and reusable KDD prototype processes. Outside KDD and Data Mining, related works can also be found in the Web Service Composition field. Very few proposals deal only with exact matches, while most of them (e.g. [1, 4, 7, 9, 10]) also consider the subsumption relation for matching services. To the best of our knowledge, the only previous approach also considering parthood in matching is [3], which describes a procedure for Web Service discovery using semantic matchmaking. To this end, the authors exploit relations of the lexical ontology WordNet, including the meronymy relation, i.e. part-of. As concerns the functions used to assign a score to a match, some of the above-cited works weight the generalization/specialization relations in the same way ([1, 3]). On the contrary, like [4, 7, 10], we assign them different weights, as explained in Section 3. Furthermore, like [1, 3, 4] we define a numeric value for the matching cost instead of a discrete score as in [9, 10]: this allows us to define an overall cost for a process with more accuracy.

5 Conclusion

The main contribution of this work is the introduction of criteria for matching KDD algorithms, which are used as the basic brick for the automatic composition of KDD prototype processes. These matching criteria are based on KDDONTO, a domain ontology formalizing knowledge about algorithms and their interfaces.


In particular, we exploit KDDONTO for defining approximate matches, which are based on the subsumption and parthood relations. In the proposed approach we weight each match on the basis of the ontological path linking the algorithms, the kind of ontological relations in the path and other elements which affect the quality of the match, namely preconditions, linkable modules and performance indexes. As future extensions, in order to provide an effective ranking of the generated processes, we are evaluating specific values for weighting both matches and processes. Furthermore, we are working on techniques for translating a KDD prototype process into an executable process of tools.

References

1. Akkiraju, R., Srivastava, B., Ivan, A., Goodwin, R. and Syeda-Mahmood, T. SEMAPLAN: Combining Planning with Semantic Matching to Achieve Web Service Composition. In Proc. of the IEEE International Conference on Web Services (ICWS 2006), pages 37–44, Chicago, USA, 2006.
2. Bernstein, A., Provost, F. and Hill, S. Towards Intelligent Assistance for a Data Mining Process: An Ontology-Based Approach for Cost-Sensitive Classification. IEEE Transactions on Knowledge and Data Engineering, 17(4):503–518, 2005.
3. Bianchini, D., De Antonellis, V. and Melchiori, M. Flexible Semantic-Based Service Matchmaking and Discovery. World Wide Web, 11(2):227–251, 2008.
4. Budak Arpinar, I., Aleman-Meza, B., Zhang, R. and Maduko, A. Ontology-Driven Web Services Composition Platform. In Proc. of the IEEE International Conference on e-Commerce Technology, pages 146–152, San Diego, CA, USA, 2004.
5. Diamantini, C., Potena, D. and Storti, E. KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms. In Proc. of the ECML/PKDD 2009 Workshop on Service-oriented Knowledge Discovery, Bled, Slovenia, Sep 7–11, 2009.
6. Diamantini, C., Potena, D. and Storti, E. Ontology-Driven KDD Process Composition. In Proc. of the 8th International Symposium on Intelligent Data Analysis, Lyon, France, 2009 (to appear).
7. Fuji, K. and Suda, T. Dynamic Service Composition Using Semantic Information. In Proc. of the 2nd International Conference on Service Oriented Computing, pages 39–48, New York City, NY, USA, 2004.
8. Gruber, T. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Int. J. Hum.-Comput. Stud., 43(5-6):907–928, 1995.
9. Lemmens, R. and Arenas, H. Semantic Matchmaking in Geo Service Chains: Reasoning with a Location Ontology. In Proc. of the International Workshop on Database and Expert Systems Applications, pages 797–802, 2004.
10. Ni, Q. Service Composition in Ontology Enabled Service Oriented Architecture for Pervasive Computing. In Workshop on Ubiquitous Computing and e-Research, Edinburgh, UK, 2005.
11. Noy, N.F. and McGuinness, D.L. Ontology Development 101: A Guide to Creating Your First Ontology. Stanford University, 2002.
12. Winston, M.E., Chaffin, R. and Herrmann, D. A Taxonomy of Part-Whole Relations. Cognitive Science, 11(4):417–444, 1987.
13. Žáková, M., Křemen, P., Železný, F. and Lavrač, N. Using Ontological Reasoning and Planning for Data Mining Workflow Composition. In SoKD: ECML/PKDD 2008 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery, Antwerp, Belgium, 2008.
