Fundamenta Informaticae 129 (2014) 341–364
DOI 10.3233/FI-2014-975 IOS Press
Using Domain Knowledge in Initial Stages of KDD: Optimization of Compound Object Processing

Marcin Szczuka∗†
Institute of Mathematics, The University of Warsaw, Warsaw, Poland
[email protected]
Łukasz Sosnowski
Systems Research Institute, Polish Academy of Sciences, and Dituel Sp. z o.o., Warsaw, Poland
[email protected]
Adam Krasuski, Karol Krenski
Section of Computer Science, The Main School of Fire Service, Warsaw, Poland
[email protected],
[email protected]
Abstract. We present a set of guidelines for improving quality and efficiency in initial steps of the KDD process by utilizing various kinds of domain knowledge. We discuss how such knowledge may be used to the advantage of system developer and what kinds of improvements can be achieved. We focus on systems that incorporate creation and processing of compound data objects within the RDBMS framework. These basic considerations are illustrated with several examples of implemented database solutions.

Keywords: Domain knowledge, KDD process, database sharding, hierarchical systems, compound data objects, data cleansing.
∗ This work was supported by the Polish National Science Centre grants 2011/01/B/ST6/03867 and 2012/05/B/ST6/03215, as well as by the Polish National Centre for Research and Development (NCBiR) under SYNAT - Grant No. SP/I/1/77065/10 in the frame of the strategic scientific research and experimental development program "Interdisciplinary System for Interactive Scientific and Scientific-Technical Information".
† Address for correspondence: Institute of Mathematics, The University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
Received November 2012; revised September 2013.
1. Introduction
The process of Knowledge Discovery in Databases (KDD) is traditionally (see, e.g., [6]) presented as a sequence of operations which, applied iteratively, lead from the raw input data to high-level, interpretable and useful knowledge. The major steps in the KDD process are typically: Selection, Preprocessing, Transformation, Data Mining, and Interpretation/Evaluation. In this paper we focus on the first three of these steps, i.e., Selection, Preprocessing, and Transformation. Our goal is to demonstrate how the entire KDD process can be improved by using background (domain) knowledge in these three phases, with special attention paid to selection and preprocessing.

During the selection and preprocessing phases of the KDD cycle the original raw data pool is sampled, cleansed, normalized, formatted and stored in a convenient way. The original, raw data is first turned into target data (selection) and then converted into preprocessed, analytic data (preprocessing). At this point the data is ready for mining and analysis; however, further transformation may be required if the data mining and analysis algorithms are to run efficiently. By utilizing various kinds (layers) of knowledge about the problem, the nature and structure of the data, the objective, and the available computational tools, we want to improve both the processing speed and the overall quality of the KDD results. In the general case, not much can be done to optimize the quality of the data mining step beforehand, since the knowledge needed to do that has not been discovered yet. However, in particular applications we can at least prepare the data for the mining algorithms in such a way that the computational effort needed to manage the data and obtain results is decreased and the chance of discovering meaningful knowledge is increased.

In this paper we narrow the general task of data preparation for data mining to cases that meet some additional criteria. We assume that it is necessary (required) to use a data representation that involves creation and processing of compound (complex) data objects. Such a complex data object can be a structured text (document), a set of images, and so on. The main feature that defines such an object is the existence of an internal, non-trivial structure that can be used to preprocess and transform the data entity for the purposes of data mining algorithms. Another condition for a problem to fit our scheme is its complexity as a whole. We want to address situations in which there is room for significant improvement. Therefore, we are mostly interested in using knowledge to deal with data sets that are large and/or complicated. For such data the straightforward, standard approaches may fail due to prohibitive computational cost. Last but not least, we mostly (but not exclusively) deal with situations where storage and processing of data entities involves a Relational Database Management System (RDBMS). The use of an RDBMS imposes some additional constraints, but at the same time provides more versatile tools for data manipulation.

In this paper we use our experience in constructing and using large data warehouses to form a set of hints (guidelines) for a practitioner who needs to deal with tasks that require storing and processing of big data represented with use of compound (complex) data objects. We provide some insights into the ways of utilizing various kinds of knowledge about the data and the application domain in the process of building a data-warehouse-based solution.
We use several examples of practical projects to demonstrate what kinds of knowledge, and in what ways, can be used to improve data processing in the KDD process when compound data objects are involved. We explain what kinds of compound/complex objects one can encounter and attempt to characterize data processing tasks by the way they handle such objects. Once the types of computations are defined, we demonstrate how to tackle them, using examples (case studies) of practical projects we have implemented.
The paper is organized as follows. First, we explain what constitutes a complex data object, what kinds of operations we want to perform with such objects and what improvements we want to achieve (Section 2). Then, in Section 3, we explain how knowledge can be used to improve (optimize) data processing at the initial stages of the KDD process. In particular, we propose a rudimentary classification (categorization) of several types (layers) of knowledge and relate it to the processing of compound data objects. The concepts introduced in Sections 2 and 3 are then illustrated by several examples of practical projects (Sections 4, 5, and 6) related to KDD (see [13, 23, 25, 26]) in which utilization of domain knowledge significantly improves the performance. We finish with a discussion and conclusions in Section 7.
2. Definition of the problem
The leading scientists in the field of Data Mining (DM) emphasize the key role of interaction with experts and the usage of domain knowledge in problem solving [20, 27]. The extraction, representation, and usage of domain knowledge is the key to creating useful models of complex real-world phenomena. New algorithms should be able to interact with experts while they work, in order to produce more robust results of operations on complex objects.

In our opinion, domain knowledge plays a pivotal rôle not only in DM, but in the initial steps of KDD as well. This is the key issue when we deal with complex objects described by an enormous number of attributes that can be poorly quantized, e.g., sensory data. At the stage of data selection the usage of domain knowledge can support selection of the most valuable attributes, which leads to creation of more relevant models in further stages of the KDD process. Similarly, domain knowledge is crucial during data selection and cleansing. For example, automatic spell checking and error correction, which is typically based on the Levenshtein (editorial) distance1, can lead to the introduction of factual errors. Domain knowledge improves this process by narrowing the number of possible word variants to a specific, relevant vocabulary, such as a list of surnames or street names.

In many real-life projects the use of background (domain, expert) knowledge is the only reasonable option. Frequently, to obtain a robust attribute transformation (within the KDD stage) we have to involve domain knowledge in the process. For example, if we deal with sensory data (an EEG stream, a temperature sensor) we have to attach semantics to this data in order to make further processing viable. Without it, we would be clueless as to how to process the given data stream. Similarly, if we process the data related to the operation of the Fire Service (cf. [11]) for the purpose of predicting the possibility of a new intervention, we need to know, e.g., what resolution of the time window (days, hours, minutes, ...) would be most relevant/appropriate for describing the situation to a given unit in the chain of command.

Storage and processing of data entities representing compound objects, with use of domain knowledge, needs to be considered in many aspects. Firstly, we have to clearly state what kind of object we consider to be compound (complex). To begin with, by an object we will understand any element of the real world that can be stored as a data object (database entity) represented using a chosen, formal ontology. These general objects possess certain properties, following from their ontological representation:

1. An object is always a member of at least one class of objects in the ontology. A single object may belong to several classes.
1 en.wikipedia.org/wiki/Levenshtein_distance
2. An object has properties defined in the context of the class(es) it belongs to. In particular, object properties may change as the class context changes.

3. An object can be bound by an ontology-definable relation to other objects from the same, as well as different, classes.

A compound object is an object that combines several (at least two) objects into an ontology-definable object. Compound objects can be characterized by two crucial properties:

1. A compound object can always be decomposed into at least two parts and each of these parts is a proper ontology-definable object. In other words, the object is truly compound.

2. The components that make up the compound object are bound by relation(s) from the ontology. In other words, the compound object is something more than just a container; it has an internal structure.

A compound object as a whole may possess certain properties that are specific for a given domain (given context). It may also be related to other compound objects, not necessarily from the same domain. Using these relations we may construct more compound objects from existing ones. We may describe objects directly by deriving the values of their attributes (properties) in a given domain(s). In some models we may also derive the values of such properties using the relations between objects. Moreover, such descriptions may be parameterized. Properties and attribute values of a compound object may also be derived by examining its structure and the sub-objects it contains in relation to other objects, e.g., by measuring the amount of common sub-objects. Please note that the relation of being a component (sub-object) of a compound object introduces a partial order between objects and may be used to specify partial object hierarchies. Such hierarchies are crucial for computations on compound objects. For example, when we compare two compound objects we should use the appropriate level of the hierarchy; otherwise (see the example in Section 4) we may end up with a negative result even when the objects in question are in fact similar, but at a high level of generality.

In further considerations we will assume that the ontologies used to define objects are given as a part of the domain knowledge. The problem of choosing the right ontology for a given task is in itself fascinating, but beyond the scope of this paper. From the ontology comes the representation of objects. Such a representation for compound objects is usually complex. It is not just a vector of attribute values corresponding to an object. We may rather expect a mixture of various types of data, such as a stream of sensor readings combined with its textual description.

To select, store, preprocess and transform data that contains compound objects, so that it can be used in further steps of the KDD process, we need to consider the most probable data processing scenarios. Then, we have to design data structures and algorithms in such a way that they are efficient and produce high quality output. At this point, using all available domain knowledge may be crucial for the overall success of KDD. The aim of this article is to demonstrate that by incorporating knowledge one can improve the KDD process. The improvement we are trying to achieve is measured in terms of computational efficiency of algorithms and versatility of data structures that are the input to the two final KDD steps: Data Mining and Result Interpretation. By properly choosing the representation and carefully designing algorithms we want to get the most out of the technologies we are using.
Since we have to facilitate storage and processing that includes both data entities (objects) and relations (inter- and intra-object structures), it is quite natural to use RDBMS technology. Since at the same time we are aiming at really complex KDD tasks, which are usually accompanied by large amounts of data, we rely on technologies that are dedicated for use in large data warehouses. For most applications our RDBMS of choice is Infobright2, a column-based solution that incorporates data compression and data granulation in the engine.
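To make the notion of a compound object and the component partial order more concrete, here is a minimal, hypothetical Python sketch (class and field names are ours, not part of any system described in this paper): a compound object holds references to its components, and the containment relation induces a partial order that can be queried.

from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class CompoundObject:
    """A data object with ontology classes, attributes and (possibly) components."""
    name: str
    classes: Set[str]                                   # ontology classes the object belongs to
    attributes: Dict[str, object] = field(default_factory=dict)
    components: List["CompoundObject"] = field(default_factory=list)

    def is_compound(self) -> bool:
        # A truly compound object decomposes into at least two proper sub-objects.
        return len(self.components) >= 2

    def contains(self, other: "CompoundObject") -> bool:
        # The "is a component of" relation, taken transitively, gives a partial order.
        return any(c is other or c.contains(other) for c in self.components)

# Usage: a county is a compound object built from its communes.
commune_a = CompoundObject("commune A", {"Commune"})
commune_b = CompoundObject("commune B", {"Commune"})
county = CompoundObject("county X", {"County"}, components=[commune_a, commune_b])
assert county.is_compound() and county.contains(commune_a)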
3. Solution outline
We claim that the use of domain knowledge in the process of designing and using data structures for compound objects may bring several benefits. We distinguish several categories (layers) of domain knowledge that might be utilized for optimizing storage and processing of complex objects. These are:

Layer 1 Knowledge about the underlying, general problem to be solved with use of the available resources and the collection of compound objects we have gathered. This kind of knowledge also includes such elements as: optimization preferences (e.g., storage or computation time), number of end-users of the system, availability of data, etc.

Layer 2 Knowledge about objects, their internal structure, the importance of attributes, their ordering and types (including knowledge about measurement errors), relations between objects, as well as knowledge about the computational methods used to process them. This type of knowledge also includes knowledge of probable computation scenarios: typical queries and processing methods, potential high-level bottlenecks, and the most frequent elementary data operations.

Layer 3 Knowledge about the technologies, models, and data schemes that are used to store the information about objects within a database. We can utilize high level knowledge – of general assets and shortcomings of particular technologies – as well as some low level aspects of knowledge specific to the chosen technology, e.g., about the physical representation of objects inside the database, such as Infobright's column-wise data packages.

While designing a data-based process, the optimization steps should take into consideration all the levels mentioned above. These levels are inter-connected and only by considering all of them can we achieve significant improvements in terms of the speed (computational cost) and accuracy of algorithms. The general methodology for designing a solution (a system for performing the initial KDD steps) is outlined below, relative to the three layers of knowledge we have just introduced. This methodology is presented in the form of a goal-driven checklist (or recipe) that is typical for best practice guides on project management (e.g., PRINCE2®3).

For Layer 1

1. Gather all available pieces of general knowledge about the problem, such as:

• "Lessons learned" (previous experience) report(s) describing known solutions for similar problems;
2 http://www.infobright.org/
3 www.prince2.com
• identification of the data sources;
• amount of data to be processed;
• ontology(-ies) describing the data sources;
• background knowledge that has a direct or indirect influence on the format of objects (or their descriptions), their features or the solution (e.g., season, number of banking holidays in a given period, etc.).
GOALS: To use previous experiences and avoid prior mistakes. To include all necessary and accessible data. To select the optimal technological framework. To assure uniform representation of data coming from various sources. To eliminate external factors that may create noise or vagueness in data.

2. Gather all available reference data sets, such as:

• vocabularies, concept indexes, dictionaries (e.g., lists of diseases, lists of relevant pharmaceutical products, catalogs of replacement car parts, etc.);
• taxonomies, thesauri, ontologies relevant to the domain of the problem;
• collections of reference objects associated with the domain of the problem.

GOALS: To assure unambiguity of the data. To perform data standardization, cleansing and translation. To eliminate errors/noise in data.

3. State the global targets that need to be achieved.

GOALS: To visualize the final effects of the endeavor. To better understand the rôles of particular sub-steps, i.e., not to lose the "bigger picture".

4. Identify mutually independent components of the problem.

GOALS: To investigate the possibilities for decomposing the problem into smaller and simpler ones that can be dealt with independently (e.g., concurrently, in parallel).

5. Select the strategy for solving the problem (global architecture of the system, selection of methods, etc.).

GOALS: To create the initial set of means and methodologies that will be used (define the area to work in).

6. Prepare a coherent and flexible environment to store and retrieve the knowledge about the problem, one capable of providing:

• convenient and simple access to the gathered knowledge as well as the ability to quickly categorize it;
• simple and efficient knowledge sharing;
• simple and efficient knowledge extension, in particular by adding the new knowledge obtained during experiments.

GOALS: To stay focused on the problem. To centralize knowledge and to facilitate knowledge sharing/exchange.
For Layer 2

1. Identify the desired representation for the objects to be processed.

GOALS: To understand the object description method and its properties. To allow for translation of objects.

2. Identify relations in which the processed object(s) participate. Deal both with the relations that bind the components of the objects and with other relations.

GOALS: To identify the relationships, connections, constraints and neighborhood relative to a given compound object.

3. Gather data (information) about feasible solutions for the problem, given the object representation.

GOALS: To identify the possible and plausible (feasible) path(s) leading to the solution.

4. Project the relations (from pt. 2) onto the selected solution methods (pt. 3) and perform measurement (weigh the complexity/cost of implementation against the degree of membership in the relation).

GOALS: To assess the potential gain from performing a given (local) procedure.

5. Select the final method for calculating the (global) solution by considering (some of) the methods identified in pt. 4.

GOALS: To select the most promising method for achieving the global solution.

For Layer 3

1. List the available technologies that can be used to construct the solution.

GOALS: To identify the technological abilities and limitations.

2. Analyze the relevant technologies (from pt. 1 above) with special attention to the areas for which they are dedicated (locate strengths, find out what they are best at, what they are famous for, etc.). Analyze (measure) how the identified strengths of particular technologies can be useful in solving the particular problem at hand, given the knowledge and representation established in the previous layers.

GOALS: To group the available technologies according to their major strengths and rank them (assign them scores).

3. From the available technologies, select the highest ranking one (from pt. 2).

GOALS: To select the most appropriate (amongst available) technology for the problem.

4. Design the data model optimal for the selected technology, using the knowledge from Layers 1 and 2.

GOALS: To create a synergy of all kinds (layers) of knowledge in order to achieve the optimal global solution.
The general requirements, knowledge of the application domain and knowledge of the structure of the problem – as outlined above – have a crucial influence on the design of data structures and computational methods. The design patterns that need to be applied in a particular situation may be as varied as the kinds of data we are dealing with. There is, however, a certain degree of regularity that can be exploited to our advantage. We show how one can make use of these general observations in particular applications by illustrating them with the examples presented in Sections 4 to 6. In the first of these examples (Section 4) the knowledge about the nature of the task (shape recognition), the knowledge about the data representation (images, shapes, maps), and the knowledge about the existence of a hierarchical structure to which the compound data objects belong led to significant improvements in the performance of the implemented solution. The second example (Section 5) shows how knowledge about dependencies between objects makes it possible to identify an optimal decomposition of the computational task (sharding), which leads to an improvement in the computational efficiency of data transformation tasks. The third example (Section 6) showcases the improvements that can be made in the process of data cleansing if we are able to use domain knowledge for the purpose of eliminating irrelevant results and reducing the search space.
4. Example: Object comparators in identification of contour maps
This example shows a practical implementation of the comparator theory described in [23] and its application in a commercial project aimed at visualization of the results of the 2010 Polish local elections [25] and of the elections to both chambers of the Polish Parliament (Sejm and Senat) in 2011. It demonstrates a path leading from identification of domain knowledge, through its skillful use, to optimization of the implemented solution.

The overall task was to assess and visualize (display) the attendance ratios for the administrative areas of Poland. There were three sources of input data:

1. Attendance figures for each area;
2. Contour of each area;
3. The map of Poland divided into areas.

Attendance results and contours were labeled with the areas' administrative codes. However, the codes were not present on the map. In order to identify the areas extracted from the map, we had to match them against the repository of reference objects (shapes). The algorithm devised to do that, introduced in [25], consists of three phases: segmentation, granulation, and identification (see Figure 1). Firstly, we acquire objects, i.e., administrative areas, from the map of Poland. Secondly, we extract granules of objects [18], compute their characteristics, and synthesize them into granular descriptions of objects. We consider only two types of characteristics, although the framework is open for adding more types. In the case of the first type, referred to as coverage, we use the histogram technique [3] to compute a vector of overlaps of the image's granules with the area that the image represents. In the case of the second type, referred to as contour, we produce a linguistic description of the directions of lines connecting the extremum points of the area's contour within each of the granules. In the third phase, we compute similarities between the input and reference objects with respect to each type of granular description, and we synthesize the similarity scores. For linguistic descriptions of contours we define the following membership function:

    \mu_{contour}(a, b) = 1 - DL(a, b) / \max(n(a), n(b))        (1)
Figure 1. Activity diagram of the described algorithm. We refer to [23] for more details about the third phase. Modification of the granulation parameters may change the final outcome from no reference objects to a single object or multiple objects.
where DL(a, b) is the Levenshtein distance4 between the linguistic descriptions of objects a and b, and n(a), n(b) denote the lengths of these descriptions, respectively. With regard to the granules' coverage, we may consider the following:

    \mu_{coverage}(a, b) = 1 - \sum_{i}^{n \times m} |cov_i^a - cov_i^b| \,/\, (n \times m)        (2)

For the purposes of this paper, we also use the aggregated similarity:

    \mu(a, b) = \frac{1}{2}\left(\mu_{contour}(a, b) + \mu_{coverage}(a, b)\right)        (3)
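As a concrete illustration of formulas (1)-(3), the following minimal Python sketch computes the two partial similarities and their aggregation for two objects described by a direction string (contour) and a coverage vector. It is our own illustrative code with assumed function names, not the implementation used in the project.

from typing import Sequence

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance DL(a, b).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def mu_contour(desc_a: str, desc_b: str) -> float:
    # Formula (1): 1 - DL(a, b) / max(n(a), n(b)).
    return 1.0 - levenshtein(desc_a, desc_b) / max(len(desc_a), len(desc_b))

def mu_coverage(cov_a: Sequence[float], cov_b: Sequence[float]) -> float:
    # Formula (2): 1 - sum_i |cov_i^a - cov_i^b| / (n * m); both vectors have n*m entries.
    return 1.0 - sum(abs(x - y) for x, y in zip(cov_a, cov_b)) / len(cov_a)

def mu(desc_a: str, desc_b: str, cov_a: Sequence[float], cov_b: Sequence[float]) -> float:
    # Formula (3): simple average of the two partial similarities.
    return 0.5 * (mu_contour(desc_a, desc_b) + mu_coverage(cov_a, cov_b))

# Example: two granular descriptions (direction strings and 4x4 coverage vectors).
print(mu("NNEESSW", "NNESSSW", [0.9] * 16, [0.8] * 16))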
We implemented an additional procedure for the cases that meet the following conditions:

1. If some reference object was not chosen for any of the investigated objects, then use it for the most similar unidentified object, even if its degree of similarity is not greater than the activation threshold.

2. If some reference object was chosen for many investigated objects, then use it for the most similar of them and re-identify the remaining ones, excluding the already used reference objects.
4.1. Description of system's operation
For illustration purposes, let us consider a sub-map of one of the counties in Poland – the Wejherowo County (Figure 2). As a result of the first phase of the algorithm, we obtain 9 image files to be identified.
4 en.wikipedia.org/wiki/Levenshtein_distance
Figure 2. The Wejherowo County (grey). One of the areas is selected for identification (dark-grey). Its granulation is compared with the granulations of the reference objects (right part of the figure). Additionally, we can see an example of an area (dark) identified as including a smaller area (white) in the first phase of the algorithm.
The first image has width w = 61 and height h = 69. For parameters m = n = 4, after rounding 61/4 to 15 and 69/4 to 17, we obtain granules g1 = {(x, y) : x ∈ [0, 15), y ∈ [0, 17)}, g2 = {(x, y) : x ∈ [15, 29), y ∈ [0, 17)}, etc. Let us now take a look at how the contour's description is built. For the analyzed image, not all extreme points are distinguished. For g1 we obtain only two of them: (7, 16) and (14, 2). For g2 we have all four: (15, 1), (17, 0), (29, 15), and (28, 16). This shows that the corresponding strings of directions can vary in length and that the formulas for µ_contour need to take it into account.

Table 1 presents the results for all 9 communes (boroughs) in the Wejherowo County. In this case, our algorithm was 100% accurate, although the numbers reported in brackets might suggest otherwise. Of the two components of function (3), µ_coverage looks better. However, our tests show that using µ_coverage by itself would provide worse results. It seems that µ_coverage plays the leading role, but µ_contour contributes additionally in situations when a comparator based only on µ_coverage would provide multiple reference objects or no reference objects at all.

Table 1. The values of µ_contour and µ_coverage for the 9 communes in the Wejherowo County. The numbers in brackets denote whether the correct reference objects are the most similar, 2nd most similar, etc., to the particular objects to be identified.
Area    Contour      Coverage
1       0.670 (2)    0.974 (1)
2       0.735 (1)    0.953 (1)
3       0.596 (4)    0.849 (3)
4       0.632 (2)    0.972 (1)
5       0.761 (1)    0.904 (1)
6       0.676 (1)    0.970 (1)
7       0.660 (1)    0.936 (1)
8       0.628 (1)    0.888 (1)
9       0.573 (4)    0.944 (1)
It is important to add that without the extra verification based on the two rules outlined above, the accuracy would decrease by 0.5%. One such case is actually the 3rd item in Table 1. Indeed, its correct
identification was possible only because other communes in the Wejherowo County were matched with sufficiently high confidence.
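To make the granulation step described above more tangible, here is a small, hypothetical Python sketch (our own illustration, not the project code) that splits a w × h image into an m × n grid of granules and computes the per-granule coverage vector used by µ_coverage. The exact boundary convention used in the project (cf. g2 = [15, 29) above) may differ slightly from this sketch.

def granule_bounds(w: int, h: int, m: int, n: int):
    # Split a w x h image into an m x n grid; granule sizes follow from rounding w/m and h/n.
    gw, gh = round(w / m), round(h / n)
    for gy in range(n):
        for gx in range(m):
            x0, y0 = gx * gw, gy * gh
            x1 = w if gx == m - 1 else (gx + 1) * gw
            y1 = h if gy == n - 1 else (gy + 1) * gh
            yield (x0, x1, y0, y1)

def coverage_vector(pixels, w, h, m=4, n=4):
    # pixels: set of (x, y) coordinates belonging to the area; returns m*n overlap ratios.
    cov = []
    for x0, x1, y0, y1 in granule_bounds(w, h, m, n):
        area = (x1 - x0) * (y1 - y0)
        inside = sum(1 for (x, y) in pixels if x0 <= x < x1 and y0 <= y < y1)
        cov.append(inside / area if area else 0.0)
    return cov

# For the example in the text: w = 61, h = 69, m = n = 4 gives granules of roughly 15 x 17 pixels.
print(list(granule_bounds(61, 69, 4, 4))[0])   # (0, 15, 0, 17), i.e. g1 from the text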
4.2. Using domain knowledge to increase the accuracy and reduce the processing time
The methodology described in Section 3 involves action on several layers. The first layer is focused on knowledge about the general problem to be solved. In this case, it is knowledge about the source of the reference objects and the objects to be identified. The objects to be identified come from the decomposition of a large map (see Figure 2). Each of the acquired areas is consistent (or piece-wise consistent). Both the reference objects and the acquired objects are disjoint. In practice, this leads to a significant (sometimes even several-fold) reduction in computational effort, which is a direct consequence of the independence of the processes of identifying individual objects.

The use of the knowledge described above also affects the optimization process for the reference set. Using this knowledge we can split the reference set into subsets and then process these subsets concurrently. In order to obtain correct global results for particular comparators, we need to use the characteristic function for each of the partial results (each of the subsets). The ability to process several sub-tasks at once (through parallelization/concurrency) gives us a significant advantage over single-threaded calculations. Results of experiments confirming the positive effects of this step are presented in Table 2.

Table 2. Summary of processing efficiency for different variants of domain knowledge corresponding to Layer 1 defined in the methodology (Section 3). Experiments were carried out for a set of counties with 10 threads.
Variant of knowledge        Time    Efficiency
no domain knowledge         40 s    92 %
subsets of A                15 s    92 %
subsets of reference set    21 s    92 %
both                        12 s    92 %
The second knowledge layer (as defined in Section 3) draws attention to other benefits resulting from knowledge about objects, their structures, and the relationships which bind them. In this case the reference objects and the objects to be identified are the areas in the administrative division of Poland (voivodeships, counties, and communes). These objects belong to a hierarchy which has the following properties:

1. A voivodeship consists of the union of its counties;
2. A county consists of the union of its communes (boroughs);
3. Each area belongs to exactly one parent unit.

Knowledge of the hierarchy makes it possible to limit the cardinality of the reference set w.r.t. the parent object, i.e., counties are selected by voivodeship and communes by county. Limiting the cardinality of the reference set directly corresponds to a reduction of processing time and an increase in the efficiency of identification. In order to embed this additional knowledge in the implementation, one can use a multi-layer network of compound object comparators. Each layer in such a network is responsible for identifying the element belonging to the appropriate level of the hierarchy. Having identified the object at the first level, we automatically set the reference set for the second level (based on the hierarchical relationships between objects). Such a network can be built in many different ways by using different types of compound object comparators [24]. One possible solution of this kind is shown in Figure 3. Results of experiments with it are presented in Table 3.
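The hierarchy-driven narrowing of reference sets can be expressed in a few lines. The following Python sketch is our own illustration with assumed names, not the comparator-network implementation of [24]; it only shows how identifying a parent area restricts the reference set for the next level of the hierarchy.

def identify(target, reference_set, similarity, threshold=0.0):
    # Return the most similar reference object above the activation threshold (or None).
    best = max(reference_set, key=lambda ref: similarity(target, ref))
    return best if similarity(target, best) > threshold else None

def identify_in_hierarchy(area, voivodeships, counties_of, communes_of, similarity):
    # Layer 1: identify the voivodeship among the 16 voivodeship reference shapes.
    voivodeship = identify(area, voivodeships, similarity)
    # Layer 2: only the counties of that voivodeship form the reference set.
    county = identify(area, counties_of[voivodeship], similarity)
    # Layer 3: only the communes of the identified county are compared (cf. Table 3).
    return identify(area, communes_of[county], similarity)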
Figure 3. Multi-layer compound comparator network: the first layer identifies a voivodeship and translates it into a reference set of counties; the second layer identifies a county and translates it into a minimal reference set of communes.
Table 3. Process optimization based on the additional domain knowledge. The following abbreviations are used: Ref. set – reference set cardinality, Min – the minimal number of comparisons, Max – the maximal number of comparisons.
Variant of knowledge    Ref. set    Min     Max
no domain knowledge     2874        2874    2874
only voivodeships       16          16      16
only counties           379         379     379
only communes           2479        2479    2479
voivodeship             avg: 155    79      314
county                  avg: 8      3       19
The third aspect of our methodology is to use the knowledge of the technology, data model or schema. In this example the data objects, as well as the processing results, are stored in a relational database, ICE5 (a column database), as ROLAP6 cubes in the form of a star schema. The ROLAP cubes collectively make up a constellation. The constellation is based on certain fixed dimensions arising from the knowledge of objects. The prepared cubes contain different levels of granularity of information, therefore the processing can take advantage of the data most useful for a specific analysis. A data flow diagram of the relationships between the cubes is shown in Figure 4.

In addition to an appropriate architecture schema, it is important to have knowledge about the technology used to perform the tasks. In our case, the ROLAP cubes are implemented in ICE, which is a column database and has specific mechanisms for storing data (see [21]). With this knowledge we can obtain a significant performance improvement, as shown in Table 4 for some of the processing steps.
5 Infobright's Community Edition, www.infobright.org
6 Relational Online Analytical Processing, http://en.wikipedia.org/wiki/ROLAP
Figure 4. Data flow diagram of the cube feeding. The following abbreviations are used: Cr stands for fact_images; Cg – fact_granule_pixels; Ce – fact_granule_extrema; Cc – fact_contours; Ccov – fact_area_coverage; Cs – fact_object_similarities.
Table 4. Processing times for phases 2 (granulation, including extraction of extremum values) and 3 (two types of comparators: contour and coverage) of the processing scheme. The DB engines – PostgreSQL 9.0 (pg) and ICE 4.0 (ICE) – are given in brackets. The following abbreviations are used: Gr – granulation, Cov – coverage, Con – contour.
# of objects    Gr (ICE)    Gr (pg)     Cov (ICE)    Cov (pg)    Con (ICE)     Con (pg)
1               30 ms       980 ms      30 ms        60 ms       140 ms        520 ms
10              50 ms       1740 ms     140 ms       60 ms       9030 ms       2090 ms
400             1300 ms     4170 ms     5790 ms      2660 ms     4 min 39 s    >10 min
2000            4530 ms     14360 ms    19980 ms     7050 ms     >10 min       >10 min
10000           21790 ms    87328 ms    85920 ms     38180 ms    >10 min       >10 min

5. Example: Knowledge driven query sharding
The SYNAT project (an abbreviation of the Polish "SYstem NAuki i Techniki", see [2]) is a large, national R&D program of the Polish government aimed at the establishment of a unified network platform for storing and serving digital information in widely understood areas of science and technology. Within the framework of this larger project we want to design and implement a solution that will make it possible for a user to search within repositories of scientific information (articles, patents, biographical notes, etc.) using their semantic content. Our prospective system for doing that is called SONCA (an abbreviation of Search based on ONtologies and Compound Analytics, see [9, 15, 16]). Ultimately, SONCA should be capable of answering a user query by listing and presenting the resources (documents, Web pages, et cetera) that correspond to it semantically. However, from our (developers') perspective SONCA is also a platform for searching for new solutions in the field of semantic measures. During the research, we tried to develop a new measure of semantic similarity between documents, used, among other things, to group them into semantically coherent clusters.
A sample process of knowledge discovery (KDD) with the SONCA platform is presented in Figure 5. Our process of finding new insight into the field of semantic measures starts with the selection of a set of documents from a repository (e.g., the PubMed Central database – PMC, see [1]). The selected document collection is stored in our local warehouse. Next, we pre-process the documents and store them in a format convenient for us. This includes, inter alia, extraction of entities from documents, document and author matching, and so on.
Figure 5. The process of knowledge discovery within the SONCA system: external repository (PubMed), data pre-processing, data transformation, data mining; experts' labels (PubMed) are used at the data mining stage.
The third step is the transformation of the data into the required format, e.g., for a document: bag-of-words, tf-idf representation, or a set of concepts related to the document semantically. Then we perform data mining using many different methods from the field. The final step produces sets of documents grouped (clustered) w.r.t. the chosen measure of similarity. Then we compare the results with the experts' evaluation. In the case of similarity measures we utilize the labels attached to the documents by PubMed's editors (taggers).

In order to be able to attach semantic labels to a document (the data transformation step) we employ a method called Explicit Semantic Analysis (ESA) [7]. This method associates elementary data entities with concepts coming from a knowledge base. In our system the elementary data entities are documents with all their contents, abstracts and so on. As a knowledge base we use the Medical Subject Headings (MeSH7) – a comprehensive controlled vocabulary created for the purpose of indexing journal articles and books in the life sciences. We have field-tested a modified version of the ESA approach on PMC using MeSH (see [8, 22]), and found out that while conceptually the method performs really well, the underlying data processing, especially the part performed inside the RDBMS, requires the introduction of new techniques. Each of the relational database engines used by us had problems with answering the required set of queries in acceptable time. Some database engines were not even able to complete the computation. Therefore, the data mining step of KDD was severely jeopardized by the unacceptable processing cost. From the database viewpoint, the problem we needed to solve was that of performing a top-k analytic, agglomerative SQL query that involves joins on very large data tables (see [14]).
7 http://www.nlm.nih.gov/pubs/factsheets/mesh.html
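Before turning to the relational formulation in Section 5.1, the ESA-style labeling can be sketched in plain Python. This is our own simplified illustration with assumed in-memory data structures (dictionaries keyed by stems), not the SONCA implementation; the weighting follows the association measure reconstructed in Formula (4) below, and the top-k cut-off reflects the k ≤ 30 limit discussed in Section 5.1.

from collections import defaultdict
from math import sqrt

def esa_concept_vector(doc_tf, stem_concept_assoc, stem_idf, top_k=30):
    """doc_tf: {stem: tf}; stem_concept_assoc: {stem: {concept: assoc}}; stem_idf: {stem: idf}."""
    scores = defaultdict(float)
    for stem, tf in doc_tf.items():
        idf = stem_idf.get(stem)
        if idf is None:
            continue
        for concept, assoc in stem_concept_assoc.get(stem, {}).items():
            # Association measure: sqrt(tf) * assoc * idf^2, summed per concept.
            scores[concept] += sqrt(tf) * assoc * idf ** 2
    # Keep only the top-k concepts; anything beyond ~30 is perceptual noise.
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Usage with toy data modeled on Figure 6:
doc = {"stem1": 0.3, "stem2": 0.23}
assoc = {"stem1": {"concept1": 2.423, "concept2": 1.423}, "stem2": {"concept1": 0.423}}
idf = {"stem1": 0.403, "stem2": 1.12}
print(esa_concept_vector(doc, assoc, idf))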
While optimizing the execution of queries in our particular case, we noticed that the problem we are dealing with is of a general nature. Our way of solving it, by extracting domain knowledge and injecting it into the method of execution, also appears to be generally applicable. This led us to the formulation of a general scheme for processing a certain kind of queries with use of the knowledge that we have about the underlying data objects and query types.
5.1. Attaching semantic labels
As mentioned in Section 5, we are interested in attaching semantic labels to documents with use of the ESA method. In order to achieve this using the analytical relational database, we first need to perform joins on three large tables. Figure 6 outlines a simplified representation of the tables in question.

Figure 6. The calculation of ESA on a relational database: the tables STEM_DOC_TF (doc_id, stem, tf), MESH_STEM_CONCEPT (stem, concept, assoc) and MESH_STEM_IDF (stem, idf) are joined on the stem column.
In Figure 6, the set of documents is represented by the table stem_doc_tf. In the relational algebra formula below we will refer to this table as R. Each consecutive document is represented by the columns doc_id, stem and term_tf – the frequency of the stem within the document. The MeSH controlled vocabulary is represented by two tables. The first of them, mesh_stem_concept (denoted by S), quantitatively associates a concept from MeSH with a given stem (column assoc). The second one, mesh_stem_idf (denoted by T), determines the idf value for each of the stems from the MeSH controlled vocabulary. According to ESA, document labeling requires performing the following operations: joining (JOIN) these three tables on the column stem, calculation of the association measure for each of the stems from a given document, calculation of the sum of this measure for each of the concepts within the document, and finally ordering of the result according to the measure values for the document. Equation (4) describes this calculation using a formula from relational algebra:

    U \leftarrow \Pi_{R.doc\_id,\; T.concept,\; measure}\Big(\tau_{measure\; DESC}\big(\gamma_{R.doc\_id,\; S.concept,\; SUM(\sqrt{R.tf}\, \ast\, S.assoc\, \ast\, T.idf^{2}) \rightarrow measure}\,(R \bowtie_{stem=stem} S \bowtie_{stem=stem} T)\big)\Big)        (4)
The query presented above returns complete information, i.e., for each document it gives us the level of association with each and every concept in our ontology (knowledge base). This is both unnecessary and unwanted in practical applications. Empirical experiments show that if we are to present the results to the user, we should present no more than the top-k most associated concepts, with k ≤ 30. Anything above 30 produces perceptual noise. So, as a last step in the calculation, we prune the result, leaving only the top-30 most associated concepts in each document's representation. We have observed that while some database engines (e.g., PostgreSQL) partially support the kind of operations we want to perform, others do not. The lack of support for decomposition of query processing was especially problematic for us in the case of column-oriented database systems (Infobright, MonetDB), since the column-oriented model is for various reasons recommended for our application.
5.2. Performance improvement with domain knowledge
To solve the query processing problem we resorted to the scheme presented in Section 3. The characteristic property of our task is the possibility of decomposing it into smaller subtasks (tasks on sub-tables), given some knowledge about the nature and structure of our data set. The fact that the query answering can be decomposed into largely independent sub-tasks makes it possible to optimize it by using only top-k instead of all sub-results most of the time. Inasmuch as the sub-tasks are largely independent from one another, we can also create shards and process them concurrently using, e.g., multiple cores (processors).

As explained in the previous section, we are not interested in calculating all possible answers to a query. Hence, we propose to split the table and process it piece-wise. Each piece (shard) contains information about an object in the data, such that it can be processed independently from other objects without distorting the global outcome. Once we have this sharding (task decomposition), we can distribute the calculation among multiple threads on a single (multicore) processor or among several machines in a networked environment. The key to success is the knowledge about the composition of, and relationships between, the objects. If we possess the information (knowledge) that the objects are largely independent, then we conclude that they can be processed in parallel, each shard separately. In our current approach this knowledge is derived from the domain by hand. However, it is imaginable that in the future an automated data (structure) analysis tool would make it possible to discover rules (criteria) for detecting the situations we discuss here, and to implement these rules using database triggers.

It is crucial to note that the approach to processing queries using sharding which we propose does not require a rewrite of the existing query optimizers. We propose a rewrite of the large query into a collection of smaller ones that can be executed in parallel. We do not interfere with the intra-query parallelization implemented in most RDBMSs. Instead we apply a trick, creating a collection of virtual clients that send very simple queries to the database, instead of processing one global query that may be very time-consuming to answer. By running queries for each of the pieces (documents) separately we achieve an additional benefit: we are able to process queries that require application of a LIMIT operator within a GROUP BY statement. This functionality was added in the SQL:1999 and SQL:2003 standards [4] by introducing windowing functions and elements of procedural languages. Unfortunately, these functionalities are not supported in most column-based systems, such as Infobright8 and MonetDB9. The ability to limit
8 http://www.infobright.org/
9 http://www.monetdb.org/
processing to the top-k objects (documents) can make a big difference in execution time, as demonstrated experimentally in Section 5.3.

In the case of the example presented in Section 5.1, sharding corresponds to the creation of a separate query for each of the objects (documents), since we know that there is no interference with other objects during the calculation. Objects correspond to documents, and the boundary of an object can easily be determined by detecting the change of id in the column doc_id. Now, the query presented in Section 5.1 can be decomposed into a series of simpler ones, using the scheme presented in Algorithm 1.

    N := SELECT DISTINCT doc_id FROM TABLE
    for doc_id ∈ N do
        run SELECT ... WHERE DOC_ID = doc_id in K threads concurrently
    end
    Algorithm 1: Query sharding
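A minimal sketch of how Algorithm 1 could be driven from client code is given below. It is our own illustration in Python, assuming table and column names modeled on Section 5.1, a DB-API style connection factory, and the association measure from Formula (4); the actual SONCA implementation was external Java code with Apache DBCP for connection pooling (Section 5.3), so this sketch only conveys the idea of a pool of virtual clients issuing per-document top-k queries.

from concurrent.futures import ThreadPoolExecutor

# Per-shard query: the heavy join/group-by restricted to a single document,
# with the top-k (k = 30) pruning that the column stores cannot express inside GROUP BY.
SHARD_SQL = """
    SELECT r.doc_id, s.concept,
           SUM(SQRT(r.tf) * s.assoc * t.idf * t.idf) AS measure
    FROM stem_doc_tf r
    JOIN mesh_stem_concept s ON s.stem = r.stem
    JOIN mesh_stem_idf t ON t.stem = r.stem
    WHERE r.doc_id = %s
    GROUP BY r.doc_id, s.concept
    ORDER BY measure DESC
    LIMIT 30
"""

def label_document(connect, doc_id):
    # Each virtual client uses its own connection and answers one simple query.
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute(SHARD_SQL, (doc_id,))
        return doc_id, cur.fetchall()
    finally:
        conn.close()

def label_all(connect, threads=10):
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute("SELECT DISTINCT doc_id FROM stem_doc_tf")
        doc_ids = [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
    # K threads process the shards concurrently; results can be stored as they arrive.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(lambda d: label_document(connect, d), doc_ids))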
5.3. Experimental verification
In order to validate the usefulness of the proposed approach we performed a series of experiments in which we compared the processing of queries with and without the use of sharding. To have a better overview we included in the experiment representatives of three major types of database technologies:

a) Infobright, which combines a column-oriented architecture with a Knowledge Grid;
b) PostgreSQL10, which represents a row-oriented, object-relational architecture with PL/pgSQL – a procedural language extension of SQL;
c) MonetDB, which represents a purely column-oriented database architecture.

The results are summarized in Table 5. Due to the fact that in the column-oriented architectures that we use it is not possible to run a query with LIMIT within GROUP BY, the comparison with the performance of windowing functions and elements of procedural language (LOOP within PL/pgSQL) was performed only for the PostgreSQL database. The experiments are based on an external implementation in Java, with Apache DBCP11 used for connection pooling, which was required for parallel query processing.
5.4. Discussion
The experiments clearly demonstrate that joining query sharding with parallel execution of sub-tasks has potential. In some cases, the queries processed with query sharding were executed from 3 to 23 times faster. Also, in the column-oriented databases sharding was the method to get around the problem with enforcing LIMIT inside GROUP BY. The experiments, however, also demonstrate that specific conditions must be met in order for query sharding to be beneficial. Other experiments [13] show that, due to the imbalance between the computational overhead created by the parallelization of a task and the complexity of the task itself, sharding does not always pay off.
10 http://www.postgresql.org/docs/current/static/plpgsql.html
11 http://commons.apache.org/dbcp/
Table 5. Results of the performed experiments.

Vector of concepts calculation (Formula (4)):

    Database      No sharding           Sharding
1   Infobright    22 h 22 m 0.39 s      8 h 42 m 6.74 s
2   PostgreSQL    24 h – no results     7 h 3 m 1.74 s
3   MonetDB       MALException error    8 h 17 m 30.50 s

Vector of concepts calculation (Formula (4) with LIMIT k = 30):

    Database                            No sharding          Sharding
1   Infobright                          NA                   0 h 29 m 11.98 s
2   PostgreSQL LOOP within PL/pgSQL     16 h 58 m 28.03 s    1 h 27 m 51.64 s
3   PostgreSQL WINDOWING FUNCTION       17 h 22 m 30 s       1 h 27 m 51.64 s
4   MonetDB                             NA                   0 h 35 m 31.34 s
The use of domain knowledge about the technology and about the way objects are processed is crucial in judging which types of operations are worth sharding.

The conclusion from the experimental verification is a set of guidelines that have to be followed in order for sharding to be effective. These guidelines are a specific expression of the general ideas stated in Section 3:

1. The query to be decomposed must contain a central, complex, agglomerative task which involves joins on very large data tables. Typically, we would decide to use sharding if the auxiliary structures used to store GROUP BY data exceed the RAM allocation capabilities.

2. Secondly, all arithmetic operations must occur inside the JOIN operation.

We strongly believe that these guidelines can be used to formulate a set of rules for automatic tuning of query execution in database engines. That is, if certain conditions are met, the database engine would transform the query processing from the traditional to the sharded model. The key to success is the knowledge about the data structure and purpose, which makes it possible to avoid unnecessary calculations.

The proposed approach has one more advantage, which was especially relevant for us in the context of our SONCA system. The set of smaller queries obtained as the result of sharding may be executed independently and concurrently. Thanks to this, we can regulate the number of threads (machines, processors, cores) involved in the calculation at any given point. Since the results of each sub-query execution are stored and do not need to be accessed by others, the entire calculation can be interrupted and then picked up without any loss of information. This ability is usually hard to achieve in database systems that use multi-threading in query processing (see [17]). In our implementation we have achieved good control over load-balancing by performing the scheduling outside of the database, using our own code. However, we strongly believe that a similar result can (and should) be achieved by implementing sharding inside the database engine. For the moment, we benefit from query sharding in the SONCA system. It gives us the ability to plan tasks ahead and perform them with optimal use of computing resources. This is not so crucial for simpler tasks, such as document processing (stemming, stem association), which normally
take less than an hour, but for finding semantic relationships between concepts and sections, sentences or snippets it is of paramount importance, as these calculations may last for several days.
6. Example: Data cleansing of fire & rescue reporting system
The national fire and rescue services (just like, e.g., the police) are typically equipped with incident data reporting systems (IDRS), which gather information about conducted actions. Each of the approximately 500 Fire and Rescue Units (JRG) of the State Fire Service of Poland (PSP) conducts around 3 fire & rescue actions a day. After every action a report is created in EWID – the internal computerized reporting system of the PSP. Currently, the total number of reports in EWID is around 6 million, of which about 0.3 million records were available for the purposes of this research. Each record contains around 560 attributes. Most of these attributes provide yes/no information about action parameters (stored as binary values), but there are also timestamps, quantities and short text entries. There is one attribute which we consider distinct: the natural language description of the action.
6.1. The analytical framework
EWID will be a key component of a multi-layer Analytical Framework for computer-assisted problem solving in the PSP, which we are starting to construct [12]. On top of this framework there will be a "Model Layer" [10, 11]. The model layer will be the most abstract one, providing fire service officers with means to specify (define) in a natural way the problems that they may encounter in everyday practice. On the bottom there will be a "Raw Data Layer" and a "Quality Data Layer", where the unprocessed "dirty" data will be fixed (cleansed), formatted and preprocessed. Obviously, the quality of the data affects the performance of the upper layers. At this moment it is not possible to quantify the improvement in the performance of the system as a whole resulting from the data cleansing described here. Due to the bottom-up character of the framework construction process we cannot currently measure the overall performance, since it requires the consecutive layers to be constructed and evaluated. These layers are still under development. However, once the data cleansing is complete, we may proceed with further data processing and transformation steps, including: semantic enhancements, formulation of information granules, establishment of relations between data objects, etc. These steps will eventually lead to the creation of higher-level models.
6.2. Knowledge driven data cleansing
The collection of the 0.3 million EWID descriptions contains about 60 MB of text, written in a semi-natural, technical language similar to the following (the awkward vocabulary and misspelled words are intentional): "After arriving at the fire scene the undergrowth fire was observed. Two firefighting jets ware applied and suction line from the nearby lake was created. After putting out the fire, appliance crew came back to fire station."

The concern is that over the years this large corpus, which contains valuable information, has been collected with limited validation of the input. Thus, the text input in the description field requires data cleansing followed by some semantic improvements. This case study is focused on data cleansing only. Yet, it may be beneficial for other text corpora which are affected by typographic errors. We know of projects
where the data cleansing step was explicitly skipped, as the main option to deal with it appeared to be laborious human work [5, 19]. In this work we propose a mostly automatic, iterative process supervised by domain experts. It is important to mention that we are more interested in having the entities in the text unified (disambiguated) than grammatically correct. For the purpose of further operations on the EWID corpus, such as clustering, statistical analysis, and so on, this unification may be crucial.
6.2.1. Removal of redundant characters
The lack of validation for the description section results in various characters being wrongly inserted. This may result in the creation of alternative forms of the same entities (e.g., GBA3 vs GBA-3). The selection of the characters which should be removed requires input from the expert – this step can be done by searching for words containing non-alphanumeric characters and deciding which of the characters should be dropped. In the case of the EWID corpus most of the non-alphanumeric characters were replaced by a space. Additionally, digits at the beginning of a word boundary, and hyphens within word boundaries occurring after a letter and before a digit, were removed.
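The character-level cleanup described above can be approximated with a few regular expressions. The sketch below is our own, hypothetical rendering of these rules in Python; the exact character set and decisions used for the EWID corpus were made by the domain expert.

import re

def clean_description(text: str) -> str:
    # Replace most non-alphanumeric characters by a space (Polish letters are kept).
    text = re.sub(r"[^0-9A-Za-ząćęłńóśźżĄĆĘŁŃÓŚŹŻ\s-]", " ", text)
    # Drop digits glued to the beginning of a word (e.g., "2x" -> "x").
    text = re.sub(r"\b\d+(?=[^\W\d])", "", text)
    # Drop a hyphen sitting between a letter and a digit inside a word (e.g., "GBA-3" -> "GBA3").
    text = re.sub(r"(?<=[^\W\d_])-(?=\d)", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_description("GBA-3; two 2x jets"))   # -> "GBA3 two x jets"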
6.2.2. Frequent words not recognized by the dictionary
At this stage there were 8,044,535 words, including 309,036 words which were not recognized by the popular aspell spell-checker (a 3.8% error ratio). The 500 most frequent of the 309,036 unrecognized words (a reasonable number for a human to process) were extracted from the corpus. 200 of these entries proved to be correct w.r.t. the domain vocabulary. Automatic corrections by aspell were proposed for the remaining 300 and they were later manually adjusted by the expert. The knowledge from the domain expert was instrumental in achieving a reasonable outcome, as some cases were not quite obvious. For example, aspell proposed 'dzielenie' (division) as the correction for the misspelled word 'dzialenie', which was overruled by the expert's 'dzialanie' (action). The corrections were applied to the corpus and the spell-checker was rerun. The error ratio dropped to 2.9% as a result of the application of the fixes described above.
6.2.3. The additional dictionaries
In order to extend the spell-checker, a search for domain vocabulary was started. There exists a number of text collections which could serve as a reference in composing the domain vocabulary (domain knowledge). Ultimately, the expert decided that the resources of the domain journal "Przeglad Pozarniczy" (PP12) contain the texts most relevant for the operational content of the EWID corpus. PP publishes fire & rescue related articles and, being a journal, it contains (hopefully) a very small amount of misspellings. We spell-checked the acquired PP corpus and all the aspell-reported misspellings (words not found in the standard dictionary) were treated as candidates for the domain vocabulary; thus a domain dictionary was created. The extended spell-checker reported an error rate of 2.15%. Knowing the content of EWID, the expert added more elements to the dictionary. Geographical entities – streets, cities and districts – were obtained from external public sources and became another extension of the dictionary. The spell-checker extended with the domain vocabulary and geographical entities was rerun and the error ratio dropped to 1.75%.
ISSN 0137-8910, http://www.ppoz.pl/
M. Szczuka et al. / Using domain knowledge in initial stages of KDD
361
Another key step was the inclusion of surnames. Surnames in EWID are frequently reported as misspellings by aspell; fortunately, most first names are recognized. A public database of Polish surnames and first names was acquired and roughly checked for completeness against our students' surname database (around 200 entries). 97% of the surnames were recognized, so the completeness of the surname database seems quite reasonable. The spell-checker extended with the domain vocabulary, geographical entities, names and surnames reported 1.56% of errors.
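To make the above concrete, the sketch below shows one way the collected domain words could be packaged as an aspell personal word list and used in a rerun of the spell-check. The file names are hypothetical, and the word-list header follows the aspell personal-dictionary convention as we understand it; treat this as an illustration rather than the exact pipeline used for EWID.

```python
import os
import subprocess

# Hypothetical input: candidate domain words (PP corpus vocabulary, geographical
# entities, first names and surnames), one whitespace-separated token each.
with open("domain_words.txt", encoding="utf-8") as f:
    domain_words = sorted(set(f.read().split()))

# aspell personal word lists start with a header line: format, language, word count.
# An absolute path keeps aspell from looking for the file in the home directory.
pws_path = os.path.abspath("ewid_domain.pws")
with open(pws_path, "w", encoding="utf-8") as f:
    f.write(f"personal_ws-1.1 pl {len(domain_words)}\n")
    f.write("\n".join(domain_words) + "\n")

# Rerun the spell-check with the personal dictionary attached and report the error ratio.
with open("ewid_descriptions.txt", encoding="utf-8") as f:
    corpus = f.read()
result = subprocess.run(
    ["aspell", "--lang=pl", f"--personal={pws_path}", "list"],
    input=corpus, capture_output=True, text=True, check=True,
)
unknown = result.stdout.split()
print(f"error ratio with domain dictionary: {len(unknown) / len(corpus.split()):.2%}")
```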
6.2.4. The n-gram approach
The knowledge-based extensions of the spell-checker's dictionary exhausted the inventory of easy fixes. The remainder of the corpus required a more extensive approach. The method that we have applied replaces (corrects) a misspelled word using its nearest correct neighbor. The neighbor(s) of a given word need to be identified in a meaningful way. For this purpose, the list of all 3-grams of words from the corpus was created. This list was spell-checked using all of the dictionaries introduced above and, as a result, split in two. The two parts, 3-grams.correct and 3-grams.errors, contain the 3-grams recognized as correct and erroneous, respectively. Then the 3-grams.errors list was iterated to find the nearest entry on the 3-grams.correct list. The measure we use is the Levenshtein (edit) distance. The correction was applied if the distance between neighbors was at most 2. The threshold of 2 (inclusive) was determined by the domain expert. At the moment of writing, this stage has not been completed for the entire EWID corpus, due to the computational effort required. However, experiments done on random samples of 30 elements taken from 3-grams.errors indicate that we may expect the overall error rate to drop to around 1% once the method is applied to the entire collection.
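The nearest-neighbor correction can be sketched as follows. The Levenshtein function and the brute-force pairwise search are generic textbook ingredients, and the toy 3-grams are invented for illustration; on the full 3-grams.errors and 3-grams.correct lists some indexing or blocking would be needed to keep the computation tractable, which is exactly the cost mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def correct_3grams(errors, correct, max_distance=2):
    """Map each erroneous 3-gram to its nearest correct 3-gram, accepting the
    replacement only if the distance stays within the expert-chosen threshold."""
    corrections = {}
    for bad in errors:
        best, best_dist = None, max_distance + 1
        for good in correct:
            d = levenshtein(" ".join(bad), " ".join(good))
            if d < best_dist:
                best, best_dist = good, d
        if best is not None:
            corrections[bad] = best
    return corrections

# Toy example: 'pozra' is a misspelling of 'pozar' (fire), within distance 2.
errors = [("pozra", "poszycia", "lesnego")]
correct = [("pozar", "poszycia", "lesnego"), ("pozar", "budynku", "mieszkalnego")]
print(correct_3grams(errors, correct))
```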
6.3. Discussion
The knowledge-based data (text) correction method that we propose makes it possible to reduce the error (typo) ratio in the EWID corpus from about 4% down to 1% (a four-fold reduction). The method based on n-grams has the potential to improve the results even further if we add 2-grams to the neighborhood analysis. After all the methods involving domain and expert knowledge are applied, plain aspell auto-correction of single words (1-grams) may be used. The cleansing/correction methods described above may also be used for clearing the corpus of sensitive and private data. For example, there is an issue with sharing the EWID corpus because of the personal data (names, addresses, etc.) included there. These sensitive data are not always easy to pinpoint, and the presented methods may help in this task, making anonymization of the text corpus feasible.
7. Conclusions
Our approach to the utilization of domain (background) knowledge in the initial stages of the KDD process is not an answer to every problem. The area of KDD is too diverse and complicated for any single methodology to work equally well in every case. Our approach to the selection, preprocessing and transformation steps in KDD is an attempt to identify characteristic features that help choose the right, knowledge-based tool for the task at hand. Sometimes the results of these attempts may be difficult to ascertain, as we only operate on the initial steps of KDD. It may be hard, if not impossible, to know in advance whether all operations performed in the initial stages of KDD will bring significant improvement
after data mining and result interpretation are concluded. It is, after all, a discovery process, and we usually do not know what exactly we should expect to discover. The series of examples presented in the paper illustrates both the variety of issues that have to be addressed and the apparent existence of an overall scheme behind them (see [24]). It appears that by properly identifying the level of complication of the task and the kind (level) of domain knowledge we possess, one can achieve significant improvements in the efficiency and quality of the solution. It has to be stated that the knowledge-based methods demonstrated in the examples bring improvements on very different scales. They range from an observable improvement of the entire KDD result in the case of identification of contour maps (Section 4), through data cleansing and knowledge acquisition (Section 6), up to optimization of query processing in a relational database for the purpose of data transformation (Section 5). Another issue that should be taken into account, and possibly analyzed in detail in the future, is the cost of adopting domain knowledge. A big problem with the evaluation of knowledge-driven approaches is that it is very hard to make comparisons with other methods in terms of classical cost measures. The additional computational and mental effort of using domain knowledge is hard to quantify, as the only other method we can compare with is, in most cases, the one that uses no knowledge at all. Hence, to evaluate the costs and profits resulting from using knowledge, one should first think how to introduce cost-to-effect measures as well as benchmarks that would make it possible to quantify the cost associated with knowledge acquisition, storage, and usage. The establishment of a sufficiently general and versatile measurement framework for doing that is a challenge for the future.
References

[1] Jeff Beck and Ed Sequeira. PubMed Central (PMC): An archive for literature from life sciences journals. In J. McEntyre and J. Ostell, editors, The NCBI Handbook, chapter 9. National Center for Biotechnology Information, Bethesda, 2003.

[2] Robert Bembenik, Łukasz Skonieczny, Henryk Rybiński, and Marek Niezgódka, editors. Intelligent Tools for Building a Scientific Information Platform, volume 390 of Studies in Computational Intelligence. Springer, Berlin / Heidelberg, 2012.

[3] Sagarmay Deb. Multimedia Systems and Content-Based Image Retrieval. ITPro collection. IGI Global, 2004.

[4] Andrew Eisenberg, Jim Melton, Krishna Kulkarni, Jan-Eike Michels, and Fred Zemke. SQL:2003 has been published. SIGMOD Rec., 33(1):119–126, March 2004.

[5] Paul Elzinga, Jonas Poelmans, Stijn Viaene, Guido Dedene, and Shanti Morsing. Terrorist threat assessment with formal concept analysis. In Intelligence and Security Informatics (ISI), 2010 IEEE International Conference on, pages 77–82. IEEE, 2010.

[6] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. 1996.

[7] Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12, 2007.

[8] Andrzej Janusz, Wojciech Świeboda, Adam Krasuski, and Hung Son Nguyen. Interactive document indexing method based on explicit semantic analysis. In Proceedings of the Joint Rough Sets Symposium (JRS 2012), Chengdu, China, August 17-20, 2012, Lecture Notes in Computer Science. Springer, 2012.
[9] Marcin Kowalski, Dominik Ślęzak, Krzysztof Stencel, Przemysław Pardel, Marek Grzegorowski, and Michał Kijowski. RDBMS model for scientific articles analytics. In Bembenik et al. [2], chapter 4, pages 49–60.

[10] Adam Krasuski, Karol Kreński, and Stanisław Łazowy. A method for estimating the efficiency of commanding in the State Fire Service of Poland. Fire Technology, 48(4):795–805, 2012.

[11] Adam Krasuski, Karol Kreński, Piotr Wasilewski, and Stanisław Łazowy. Granular approach in knowledge discovery - real time blockage management in fire service. In Tianrui Li, Hung Son Nguyen, Guoyin Wang, Jerzy W. Grzymala-Busse, Ryszard Janicki, Aboul Ella Hassanien, and Hong Yu, editors, RSKT, volume 7414 of Lecture Notes in Computer Science, pages 416–421. Springer, 2012.

[12] Adam Krasuski, Dominik Ślęzak, Karol Kreński, and Stanisław Łazowy. Granular knowledge discovery framework. In New Trends in Databases and Information Systems, volume 185 of Advances in Intelligent Systems and Computing, pages 109–118. Springer, 2013.

[13] Adam Krasuski and Marcin Szczuka. Knowledge driven query sharding. In Louchka Popova-Zeugmann, editor, CS&P 2012, volume 225 of Informatik Berichte, pages 182–190, Berlin, Germany, 2012. Humboldt-Universität zu Berlin.

[14] Sebastian Michel, Peter Triantafillou, and Gerhard Weikum. KLEE: a framework for distributed top-k query algorithms. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 637–648. VLDB Endowment, 2005.

[15] Anh Linh Nguyen and Hung Son Nguyen. On designing the SONCA system. In Bembenik et al. [2], chapter 2, pages 9–35.

[16] Hung Son Nguyen, Dominik Ślęzak, Andrzej Skowron, and Jan Bazan. Semantic search and analytics over large repository of scientific articles. In Bembenik et al. [2], chapter 1, pages 1–8.

[17] Victor Pankratius and Martin Heneka. Parallel SQL query auto-tuning on multicore. Karlsruhe Reports in Informatics 2011-5, Karlsruhe Institute of Technology, Faculty of Informatics, 2011.

[18] Witold Pedrycz, Andrzej Skowron, and Vladik Kreinovich, editors. Handbook of Granular Computing. John Wiley & Sons, 2008.

[19] Jonas Poelmans, Paul Elzinga, Stijn Viaene, and Guido Dedene. An exploration into the power of formal concept analysis for domestic violence analysis. In Petra Perner, editor, ICDM, volume 5077 of Lecture Notes in Computer Science, pages 404–416. Springer, 2008.

[20] Andrzej Skowron. Discovery of processes and their interactions from data and domain knowledge. In Piotr Jedrzejowicz, Ngoc Thanh Nguyen, Robert J. Howlett, and Lakhmi C. Jain, editors, KES-AMSTA (1), volume 6070 of Lecture Notes in Computer Science, pages 12–21. Springer, 2010.

[21] Dominik Ślęzak and Victoria Eastwood. Data warehouse technology by Infobright. In Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul, editors, SIGMOD Conference, pages 841–846. ACM, 2009.

[22] Dominik Ślęzak, Andrzej Janusz, Wojciech Świeboda, Hung Son Nguyen, Jan G. Bazan, and Andrzej Skowron. Semantic analytics of PubMed content. In Andreas Holzinger and Klaus-Martin Simonic, editors, Information Quality in e-Health - 7th Conference of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society, USAB 2011, Graz, Austria, November 25-26, 2011. Proceedings, volume 7058 of Lecture Notes in Computer Science, pages 63–74. Springer, 2011.

[23] Dominik Ślęzak and Łukasz Sosnowski. SQL-based compound object comparators: A case study of images stored in ICE. In Tai-Hoon Kim, Haeng-Kon Kim, Muhammad Khurram Khan, Kiumi Akingbehin, Wai-Chi Fang, and Dominik Ślęzak, editors, Advances in Software Engineering - International Conference, FGIT-ASEA 2010. Proceedings, volume 117 of Communications in Computer and Information Science, pages 303–316. Springer, 2010.
[24] Łukasz Sosnowski. Identification of compound objects using comparators. In A. Myśliński, editor, Information Technology Theory and Application, volume 2, pages 114–135. IBS PAN, 2012. In Polish.

[25] Łukasz Sosnowski and Dominik Ślęzak. Comparators for compound object identification. In Sergei O. Kuznetsov, Dominik Ślęzak, Daryl H. Hepting, and Boris Mirkin, editors, RSFDGrC 2011. Proceedings, volume 6743 of Lecture Notes in Computer Science, pages 342–349. Springer, 2011.

[26] Łukasz Sosnowski and Dominik Ślęzak. RDBMS framework for contour identification. In Marcin Szczuka, Ludwik Czaja, Andrzej Skowron, and Magdalena Kacprzak, editors, CS&P 2011, pages 487–498, Pułtusk, Poland, 2011. Białystok University of Technology. Electronic edition.

[27] Yingxu Wang, Witold Kinsner, James A. Anderson, Du Zhang, Yiyu Yao, Phillip C.-Y. Sheu, Jeffrey J. P. Tsai, Witold Pedrycz, Jean-Claude Latombe, Lotfi A. Zadeh, Dilip Patel, and Christine W. Chan. A doctrine of Cognitive Informatics (CI). Fundamenta Informaticae, 90(3):203–228, 2009.