A Pattern Based Data Mining Approach

Boris Delibašić¹, Kathrin Kirchner² and Johannes Ruhland²

¹ University of Belgrade, Faculty of Organizational Sciences, Center for Business Decision-Making, 11000 Belgrade, Serbia [email protected]
² Friedrich-Schiller-University, Department of Business Information Systems, 07737 Jena, Germany {k.kirchner, j.ruhland}@wiwi.uni-jena.de

Abstract. Most data mining systems follow a data flow and toolbox paradigm. While this modular approach delivers ultimate flexibility, it gives the user almost no guidance on choosing an efficient combination of algorithms in the current problem context. In the field of Software Engineering the pattern-based development process has empirically proven its high potential. Patterns provide a broad and generic framework for the solution process in its entirety and are based on equally broad characteristics of the problem. Details of the individual steps are filled in at later stages. Basic research on pattern-based thinking has provided us with a list of generally applicable and proven patterns. User interaction in a pattern-based approach to data mining will be divided into two steps: (1) choosing a pattern from a generic list based on a handful of characteristics of the problem, and later (2) filling in data mining algorithms for the subtasks.

1 Current situation in data mining

The current situation in the data mining area is characterized by a plethora of algorithms and variants. The well-known WEKA collection (Witten and Frank (2005)) implements approximately 100 different algorithms. However, there is little guidance in selecting and using the appropriate algorithm for the problem at hand, as each algorithm may have its own specific strengths and weaknesses. As Figure 1 shows for large German companies, the most significant problems in data mining are application issues and the management of the process as a whole, not the lack of algorithms (Hippner, Merzenich and Stolz (2002)). Standardizing the process as proposed by Fayyad et al. (1996) and later refined into the CRISP-DM model (Chapman et al. (2000)) has resulted in a well-established phase model with preprocessing, mining and postprocessing steps, but has failed to give hints for choosing a proper sequence of processing tools or for avoiding pitfalls.

Fig. 1. Proposals for improvement of current data mining projects (46 questionnaires, average scores, 1 = no improvement, 5 = highly improvable)

Design always has elements of integrated and modular solutions. Integrated solutions provide simplicity but lack adaptability. Modular solutions give us greater influence on the solution, but demand more knowledge and human attention. In reality all solutions lie between full modularity and full integrality (Eckert and Clarkson (2005)). We believe that for solving problems in the data mining area a modular solution is more appropriate than an integrated one. Patterns are meant to be experience packages that give a broad outline of how to solve specific aspects of complex problems. Complete solutions are built through chaining and nesting of patterns. Thus they go beyond the pure structuring goal. They have proven their potential in diverse fields of science.

2 Introduction to patterns

Patterns are already very popular in software design, as the well-known GoF patterns for object-oriented design exemplify (Gamma et al. (1995)). The patterns we envisage are, however, applicable to a much wider context. With the development of pattern theories in various areas (architecture, IS, telecommunications, organization), it seems that the problems of adaptability and maintenance of data mining (DM) algorithms can also be solved using patterns. The protagonist of the pattern movement, Christopher Alexander, defines a pattern as a three-part rule that expresses the relation between a certain context, a problem and a solution. It is at the same time a thing that happens in the world (empirics) and a rule that tells us how to create that thing (process rule) and when to create it (context specificity). It is at the same time a process, a description of a thing that is alive, and a process that generates that thing (Alexander (1979)). Alexander's work concentrated on identifying patterns in architecture, covering a broad range from urban planning to details of interior design. The patterns are shells which allow various realizations, all of which will solve the problem.


Fig. 2. Small public square pattern

We shall illustrate the essence and power of Alexander-style patterns with two examples. Figure 2 presents a pattern named Small public squares. Such squares enable people in large cities to gather, communicate and develop a community feeling. The core of the pattern is to make such squares not too large, lest they be deserted and look strange to people. Another example is shown in Figure 3. The pattern Entrance transition advocates and enables a smooth transition between the outdoor and indoor space of a house. People do not like an abrupt transition: it makes them feel uncomfortable and makes the house look ugly.

Fig. 3. Entrance transition pattern

Alexander (2002b) says:
1. Patterns contain life.
2. Patterns support each other: the life and existence of one pattern influences the life and existence of another pattern.
3. Patterns are built of patterns; this way their composition can be explained.
4. The whole (the space in which patterns are implemented) gets its life depending on the density and intensity of the patterns inside the whole.



We want to provide the user with the ability to build data mining (DM) solutions by nesting and pipelining patterns. In that way, the user can concentrate on the problems to be solved through the deployment of some key patterns, and may then nest patterns deep enough to get the job done at the data processing level. Current DM algorithms and DM process paradigms do not provide users with such an ability, as they are typically based on the data flow diagram principle. A standard problem solution in the SPSS Clementine system is shown in Figure 4; it is a documentation of a chosen solution rather than a solution guide.

Fig. 4. Data flow principle in SPSS Clementine

3 Some data mining patterns

We have already developed some archetypical DM patterns. For their formal representation the J.O. Coplien Pattern Formalization Form has been used (Coplien and Zhao (2005), Coplien (1996), p. 8). This form consists of the following elements: Context, Problem, Forces, Solution and Resulting context. A pattern is applicable within a Context (description of the world) and creates a Resulting context, as the application of the pattern will change the state space. Problem describes what produces the uncomfortable feeling in a certain situation. Forces are key to understanding a pattern. Each force yields a quality criterion for any solution, and as forces can be (and generally are) conflicting, the relative importance of the forces will drive a good solution into certain areas of the solution space, hence their name. In many contexts, for instance, the relative importance of the conflicting forces of economic, time and quality considerations will render a particular solution a good or a bad compromise. When a problem and the forces that describe it are well understood, a solution is most often easily evaluated. Understanding a problem is crucial for finding a solution.

Patterns are functions that transform problems in a certain context into solutions. Patterns are familiar and popular concepts because they systematize repeatedly occurring solutions in nature. The solution, the pattern itself, resolves the forces in a problem. On the other hand, a pattern is always a compromise and is not easy to recognize. Because it is a compromise, it resolves some forces but may add new ones to the context space. A pattern is best recognized through solving and generalizing real problems. The quality and applicability of patterns may change over time as new forces gain relevance or new solutions become available. The process of recognizing and deploying patterns is therefore continuous. For example, house building changed very much when concrete was invented.

3.1 The Condense Pattern: a popular DM pattern

The pattern in the Coplien form is:
1. Context: The collection of data is completed.
2. Problem: The data matrix is too large for efficient handling.
3. Selected Forces: The efficiency of DM algorithms depends on the number of cases and variables. Irrelevant cases and variables will hamper the learning capabilities of a DM algorithm. Leaving out a case or a variable may lead to errors and delete special, but important, cases.
4. Solution: Condense the data matrix.
5. Resulting context: A manageable data matrix with some information loss.

The Condense pattern is a typical preprocessing pattern that has found diverse applications, for example on variables (by calculating a score, choosing a representative variable or clustering variables), on cases (through sampling, clustering with subsequent use of the centers only, etc.) or in the transformation of continuous variables (e.g. through equal-width binning or equal-frequency binning).

3.2 The Divide et Impera Pattern

A second pattern which is widely used in data mining is Divide et Impera. It can also be described in the Coplien pattern form:
1. Context: A data mining problem is too large or complicated to be solved in one step.
2. Problem: Structuring of the task.

3. Forces: It may not be possible to subdivide the problem when many strongly interrelated facets influence it. The mere combination of subproblem solutions may be grossly suboptimal. Subproblems may have very different relevance for the global problem. The complexity of a generated subproblem may be grossly out of proportion to its relevance.
4. Solution: Divide the problem into subproblems that are more easily solved (and quite often structurally similar to the original one) and build the solution to the complete problem as a combination of the subproblem solutions.
5. Resulting context: A set of smaller problems that are more amenable to solution. It is possible that the problem structure is bad or that the effort has not been reduced in sum.

The Divide et Impera pattern can be used for problem structuring where the problem is too complex to be solved in one step. It is found as a typical metaheuristic in many algorithms such as decision trees or divisive clustering. Other application areas, which also vouch for its broad applicability, are segmented marketing (if an across-the-board marketing strategy is not feasible, try to form homogeneous segments and cater to their needs) and the division of labor within divisional organizations.

3.3 More patterns in data mining

We have already identified many other patterns in the field of data mining. Some of them are:
• Combine voting (with boosting, bagging, stacking, etc. as corresponding algorithms),
• Training / Retraining (supervised mining, etc.),
• Solution analysis,
• Categorization and
• Normalization

This list is in no way closed. Every area of human interest has its characteristic patterns. However, the number of patterns is not infinite but always limited. Collecting them and making them available gives users the possibility to model the DM process, but also to understand the DM process through patterns.
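To make the Coplien form and the Condense pattern of Section 3.1 more concrete, the following is a minimal Python sketch and not part of our prototype. The names `Pattern`, `CONDENSE` and `apply_condense` are hypothetical and chosen for illustration only. The sketch assumes that a pattern can be stored as a simple record and that the two binning variants mentioned above are interchangeable realizations of the same pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A realization condenses one continuous variable into k bin indices.
Realization = Callable[[Sequence[float], int], List[int]]


@dataclass(frozen=True)
class Pattern:
    """Hypothetical record holding the five elements of the Coplien form."""
    name: str
    context: str
    problem: str
    forces: List[str]
    solution: str
    resulting_context: str


CONDENSE = Pattern(
    name="Condense",
    context="The collection of data is completed",
    problem="Data matrix is too large for efficient handling",
    forces=[
        "Efficiency depends on the number of cases and variables",
        "Irrelevant cases and variables hamper learning",
        "Leaving out cases or variables may delete important information",
    ],
    solution="Condense the data matrix",
    resulting_context="Manageable data matrix with some information loss",
)


def equal_width_bins(values: Sequence[float], k: int) -> List[int]:
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: everything in bin 0
        return [0] * len(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], k: int) -> List[int]:
    """Assign bins by rank so that each bin holds about len(values) / k cases.

    Ties are broken by position, so equal values may fall into adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def apply_condense(values: Sequence[float], k: int,
                   realization: Realization) -> List[int]:
    """Apply the Condense pattern with whichever realization is filled in."""
    return realization(values, k)


skewed = [1, 2, 3, 4, 5, 100]
print(apply_condense(skewed, 3, equal_width_bins))      # [0, 0, 0, 0, 0, 2]
print(apply_condense(skewed, 3, equal_frequency_bins))  # [0, 0, 1, 1, 2, 2]
```

On the skewed variable both realizations resolve the same problem but weigh the forces differently: equal-width binning leaves the middle bin empty, while equal-frequency binning spreads the cases evenly at the cost of distorting distances.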

4 Summary and outlook

Pattern-based data mining offers some attractive features:
1. Algorithm creators and algorithm users have different interests and different needs. The two sides often do not understand each other's needs and, quite often, do not need to know the specific details relevant to the other side. A pattern is something that is understandable to all people.
2. Science converges. Concepts from one area of science are applicable in another area. Patterns support these processes. This potential is comparable to the promises of Systems Theory.
3. The decision for a specific algorithm can be postponed to later stages. A solution path as a whole is sketched through patterns, and algorithms need only be filled in immediately prior to processing. Using different algorithms in places will not invalidate the solution path, creating "late binding" at the algorithm level.

Current data mining applications occasionally provide the user with first traces of pattern-based DM. Figure 5 shows the example of bagging of classifiers within the TANAGRA project and its graphical user interface (Rakotomalala (2004)). Bagging cannot be described with a pure data flow paradigm; rather, a nesting of a classifier pattern within the bagging pattern is needed. This nested structure will then be pipelined with pre- and postprocessing patterns.
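The nesting and late binding described above can be sketched in a few lines of Python. This is an illustrative sketch under our own assumptions, not the TANAGRA implementation: the names `bagging`, `solution_path`, `nearest_neighbour` and `majority_class` are hypothetical, a learner is assumed to be a function that takes training data and returns a prediction function, and min-max scaling stands in for an arbitrary preprocessing pattern.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, ...]
Classifier = Callable[[Row], str]                       # fitted model
Learner = Callable[[Sequence[Row], Sequence[str]], Classifier]


def nearest_neighbour(X: Sequence[Row], y: Sequence[str]) -> Classifier:
    """One classifier realization: 1-nearest-neighbour on squared distance."""
    X, y = list(X), list(y)

    def predict(row: Row) -> str:
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for x in X]
        return y[dists.index(min(dists))]
    return predict


def majority_class(X: Sequence[Row], y: Sequence[str]) -> Classifier:
    """Another realization: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


def bagging(inner: Learner, n_models: int = 15, seed: int = 0) -> Learner:
    """Bagging pattern: nests any inner learner and combines it by voting."""
    def fit(X: Sequence[Row], y: Sequence[str]) -> Classifier:
        rng = random.Random(seed)
        n = len(X)
        models: List[Classifier] = []
        for _ in range(n_models):
            idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
            models.append(inner([X[i] for i in idx], [y[i] for i in idx]))

        def predict(row: Row) -> str:
            votes = Counter(m(row) for m in models)     # majority vote
            return votes.most_common(1)[0][0]
        return predict
    return fit


def solution_path(learner: Learner) -> Learner:
    """Pipeline: min-max scaling (preprocessing) followed by the learner."""
    def fit(X: Sequence[Row], y: Sequence[str]) -> Classifier:
        cols = list(zip(*X))
        lo = [min(c) for c in cols]
        span = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by 0

        def scale(row: Row) -> Row:
            return tuple((v - l) / s for v, l, s in zip(row, lo, span))

        model = learner([scale(r) for r in X], y)
        return lambda row: model(scale(row))
    return fit


X = [(1.0, 10.0), (1.2, 12.0), (0.9, 11.0), (5.0, 50.0), (5.2, 52.0), (4.8, 49.0)]
y = ["a", "a", "a", "b", "b", "b"]

# The solution path is fixed first; the inner algorithm is bound last.
model_nn = solution_path(bagging(nearest_neighbour))(X, y)
model_mc = solution_path(bagging(majority_class))(X, y)
print(model_nn((1.1, 11.0)), model_mc((1.1, 11.0)))
```

Exchanging `nearest_neighbour` for `majority_class` changes only the innermost argument; the surrounding nesting and pipeline stay untouched, which is the sense in which the algorithm is bound late.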

Fig. 5. Screenshot of Tanagra Software

Further steps in our project are to
• collect a list of patterns which are useful in the whole knowledge discovery and data mining process (the list will be open-ended),
• integrate these patterns into data mining software to help design ad hoc algorithms, choose an existing one or provide guidance in the data mining process,
• develop a software prototype with our patterns and run experiments with users to see how it works and what the benefits are.



References

ALEXANDER, C. (1979): The Timeless Way of Building, Oxford University Press.
ALEXANDER, C. (2002a): The Nature of Order Book 1: The Phenomenon of Life, The Center for Environmental Structure, Berkeley, California.
ALEXANDER, C. (2002b): The Nature of Order Book 2: The Process of Creating Life, The Center for Environmental Structure, Berkeley, California.
CHAPMAN, P., CLINTON, J., KERBER, R., KHABAZA, T., REINARTZ, T., SHEARER, C. and WIRTH, R. (2000): CRISP-DM 1.0. Step-by-step data mining guide, www.crispdm.org.
COPLIEN, J.O. (1996): Software Patterns, SIGS Books & Multimedia.
COPLIEN, J.O. and ZHAO, L. (2005): Toward a General Formal Foundation of Design Symmetry and Broken Symmetry, Brussels: VUB Press.
ECKERT, C. and CLARKSON, J. (2005): Design Process Improvement: a review of current practice, Springer Verlag London.
FAYYAD, U.M., PIATETSKY-SHAPIRO, G. and UTHURUSAMY, R. (Eds.) (1996): Advances in Knowledge Discovery and Data Mining, MIT Press.
GAMMA, E., HELM, R., JOHNSON, R. and VLISSIDES, J. (1995): Design Patterns. Elements of Reusable Object-Oriented Software, Addison-Wesley.
HIPPNER, H., MERZENICH, M. and STOLZ, C. (2002): Data Mining: Einsatzpotentiale und Anwendungspraxis in deutschen Unternehmen, In: WILDE, K.D.: Data Mining Studie, absatzwirtschaft.
RAKOTOMALALA, R. (2004): Tanagra – A free data mining software for research and education, www.eric.univ-lyon2.fr/~rico/tanagra/.
WITTEN, I.H. and FRANK, E. (2005): Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, San Francisco.