A System Architecture for Database Mining Applications


Vijay V. Raghavan (1), Hayri Sever (1), Jitender S. Deogun (2)

(1) University of Southwestern Louisiana, Lafayette, LA 70504, USA
(2) University of Nebraska, Lincoln, NE 68588, USA

Abstract

The problem of enhancing a database management system (DBMS) to support mining applications is twofold. First, the DBMSs of today have limited functionality for supporting mining applications. Second, scaling traditional knowledge discovery techniques to large data sets is not straightforward. We propose a mining kernel that could be incorporated into future DBMSs. The mining kernel provides a common knowledge base encapsulated by a complete set of knowledge management operators, such that interactive modules for database mining applications can be built on top of it. The novelty of our approach is to study how the concept-based retrieval, relevance feedback, and information dissemination techniques used in intelligent information retrieval systems relate to each other, and to apply the result of this study to the database mining problem.

1 Introduction

Following rapid advances in computer technology, it has become feasible to extend database systems into new application areas. One such application that is likely to receive considerable attention in the near future is database mining [1,2,3]. The goal of database mining is to discover valuable information that is hidden in a very large database and of which the user is unaware before discovery. The kinds of database mining queries that we are specifically interested in are classification, hypothesis testing, and association. We believe that knowledge discovery is an area of interest shared by the database and information retrieval research communities. One of the current research issues in Intelligent Information Retrieval Systems (IIRSs) is to equip an IIRS with the following capabilities [4,5,6,7]:

i. Retrieval based on the presence or absence of descriptors or keywords in a desired combination,
ii. Retrieval based on learning the characterization of relevant documents through user feedback,
iii. Categorization of text based on preselected concepts or topical areas,
iv. Automatic extraction of concepts or facts from text, and
v. Concept-based retrieval of text.

Although, at present, practically no text retrieval system aims to provide all of these functions, there is a great deal of commonality among the functional components of information retrieval, concept-based retrieval, and fact extraction systems. For example, the relevance feedback techniques needed in retrieval systems to characterize a user's concept of relevance can also provide a mechanism for deriving a characterization of the class of documents each element of which contains a fact of interest. Similarly, the process of concept structuring based on relationships among concepts is important for both concept-based retrieval and fact extraction.

Our research is directed towards the idea of organizing accumulated knowledge for use by future queries, which will substantially reduce the time needed for searching existing knowledge and for generating new knowledge. Our approach is to design and implement a kernel consisting of a set of constructs encapsulated by primitive operations to create, maintain, and manipulate the knowledge. The mining operations developed will be orthogonal to each other and complete enough to allow a user to build interfaces on top of the kernel. This design choice is motivated by the fact that a concept-based system is an area of investigation in its own right; that is, it is not affected by particular data analysis methods if a common knowledge base is used. This is in contrast to EXIS [8], which assumes that knowledge discovery is based on some specific methods. Our approach is similar to the INLEN system [9], which clearly differentiates Knowledge Management Operators (KMOs) from Knowledge Generation Operators (KGOs); however, INLEN organizes the KMOs around the KGOs.

This paper is organized as follows. In Section 2, we describe the types of mining queries in which we are interested. The notion of a concept, and the concept hierarchy as persistent knowledge, are discussed in Section 3. In Section 4, a brief outline of our extended DBMS architecture for database mining applications is developed. Concluding remarks are provided in Section 5.

2 The Database Mining Queries

Let A be the set of attributes of a relation R. In database mining, we use rules to specify how the value of an attribute of interest, or the class label of a tuple, is determined by the values of other attributes. Rules are also used to specify functional dependencies among attributes. A rule may be associated with a confidence factor. Each attribute a ∈ A in the antecedent part of a rule is called a condition attribute. There is exactly one attribute in the consequent part of a rule, and it is called the decision attribute. If a decision attribute is an element of A, then it is said to be persistent; otherwise it is nonpersistent. We are interested in three kinds of mining queries, as shown below.

i. Association: Given a value of a persistent decision attribute and a set of condition attributes, generate a rule that can be used to determine how values of certain condition attributes are associated with the given value of the decision attribute. For this kind of query, the user may omit the set of condition attributes.
ii. Hypothesis testing: Given a decision rule and its confidence factor, test whether it is validated by the tuples in the population. The confidence factor is assumed to constitute a lower bound for the validation of a hypothesis.
iii. Classification: Upon the specification of positive and negative samples with respect to a nonpersistent decision attribute (i.e., a classification label), determine the potential condition attributes and/or generate a decision rule.

A rule can be labeled as dynamic or static, depending on the nature of the established relationship. For example, if we state that “a desktop is a personal computer,” then it is a static rule and thus is not affected by changes in the database population. However, a rule such as “if a customer buys a personal computer, then it is a desktop with 0.6 certainty” depends on the database population at a given time. Hence, the DBMS should offer some facilities (e.g., daemon processes for if-condition-then-fire triggers) to support dynamic rules.

3 The Representation and Organization of Concepts

We use the language of predicate logic to represent concepts; a reader not familiar with the basic notation of this mathematical discipline may refer to [10]. Let W be a vocabulary. For each symbol t ∈ W, we assume there is an integer δ(t) called the degree of t. For t a function symbol, δ(t) ≥ 0, while for t a relation symbol, δ(t) > 0. In W, the symbols r, v, and g (with or without subscripts) are reserved for relations, constants, and functions, respectively. We call a relation symbol whose degree is 1 an attribute symbol. We reserve the subscripted symbols c, p, and d in W for attributes. Every atomic W-formula is supposed to stand for a relation, which we denote by the capital letter of the corresponding relation symbol (e.g., R for r). The projection of a relation R onto the components 1, 2, ..., k is denoted by $\pi_{1,2,\ldots,k}(R)$, where $k \le \delta(r)$.

Views are crucial in the provision of logical data independence and also represent a form of data security. Hence, ideally, a mining component integrated with a DBMS must facilitate the specification of a user context that is consistent with the user’s view. In this paper, however, we consider a global context; that is, a W-sentence in the language is interpreted using the entire database population.

3.1 Notions of a Concept

A basic concept corresponds to either a subset of attribute values or a subset of tuples. We obtain a derived concept by applying association, classification, or generalization to existing concept(s). For example, in Table 1, “corel_draw” is a basic concept obtained by grouping all versions of the Corel Draw package in the domain of the attribute PNAME. Similarly, “graphics” is a derived concept whose domain is the set of print_shop, harvard_graphic, and corel_draw.

software | hardware → product
operating_system | user_environment | application → software
accounting | spread_sheet | education | database | graphics → application
accessories | peripherals | personal_computer → hardware
PTYPE(window) → user_environment
PNAME[window/3.1 | window/3.0] → PTYPE(window)
PTYPE[harvard_graphic | print_shop | corel_draw] → graphics
PNAME[corel_draw/3.0 | corel_draw/3.1] → PTYPE(corel_draw)
PTYPE[ms_dos_system | macintosh] → personal_computer
PNAME[286PC | 386PC | 486PC] → ms_dos_system

Table 1. A partial concept hierarchy from the perspective of generalization of the computer products

First, we specify the association query for a relation symbol r ∈ W in predicate logic by the following formula:

$$\forall (x_1, x_2, \ldots, x_{k-1})\; \exists (x_k, \mathit{certainty\_factor})\; [\, r_{1,k-1}(x_1, x_2, \ldots, x_{k-1}) \Rightarrow d(x_k) \;\&\; \mathit{certainty\_factor} = g(x_1, x_2, \ldots, x_{k-1}, x_k) \,]$$

where $k \le \delta(r)$, $D = \pi_k(R)$, and $R_{1,k-1} = \pi_{1,2,\ldots,k-1}(R)$. We are not usually interested in all possible interpretations of the formula given above. Hence, we define the association query by using the following notation:

$$c_1(v_1) \;\&\; c_2(v_2) \;\&\; \cdots \;\&\; c_{k-1}(v_{k-1}) \xrightarrow{\;r\;} d(v_{concept}),\ \mathit{certainty\_factor}$$

It means that the values of the condition attributes of the relation R functionally determine the value of the decision attribute of R with some certainty factor. Using rough set theory, we may assign a value to the certainty factor or eliminate superfluous condition attributes.

The second way to derive a concept is a classification method. We specify the classification query in predicate logic by the following formula:

$$\forall (x_1, x_2, \ldots, x_k)\; [\, (p_1(x_1) \;\&\; p_2(x_2) \;\&\; \cdots \;\&\; p_k(x_k)) \Rightarrow \mathit{dummy}(v_{concept}, g(x_1, x_2, \ldots, x_k)) \,]$$

We say that the values of the patterns are related to $v_{concept}$ with some value of the confidence factor returned by the classification function g. The concept learned by this method is dynamic. From time to time, a decision may be made to incorporate a dynamic concept into the concept hierarchy. In that case, we drop the tag “dummy” from the concept and interpret it as a relation symbol.

A generalization, the last method to derive a concept, groups some sub-concepts into a more abstract concept. Its notation is defined as “$v_{sub\text{-}concept} \rightarrow v_{concept}$.” If either the sub-concept or the concept is drawn from the domain of an attribute, then it is surrounded by brackets following the attribute name. For the sake of simplicity, we combine sub-concepts by disjunction if they generalize to the same concept. Table 1 contains a partial list of generalization rules drawn from the domains of PNAME (product names) and PTYPE (product types), and from some other derived concepts. These rules are constructed for a hypothetical warehouse that sells computer hardware and software products by mail order.

We have explained what a concept is and how to derive a new concept. As a final note, the type of a concept can be either basic or derived. The status of a concept is persistent if it is associated with an attribute’s domain, and nonpersistent otherwise. In the next subsection, we present two essential structures used for organizing the concepts. The first one, a concept-set, contains the descriptions of concepts; the second one, a concept hierarchy, holds the relationships between concepts.
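To make the generalization rules of Table 1 concrete, the following sketch stores each rule as a (sub-concepts, concept) pair and walks upward from a concept to the root. It is an illustrative simplification, not the paper's representation: the attribute qualifiers PNAME and PTYPE are dropped so that every concept is a plain string, and the names `TABLE_1_RULES`, `PARENT`, and `generalizations` are hypothetical.

```python
# Each entry mirrors one generalization rule of Table 1:
# the sub-concepts combined by disjunction, then the more abstract concept.
TABLE_1_RULES = [
    (["software", "hardware"], "product"),
    (["operating_system", "user_environment", "application"], "software"),
    (["accounting", "spread_sheet", "education", "database", "graphics"],
     "application"),
    (["accessories", "peripherals", "personal_computer"], "hardware"),
    (["window"], "user_environment"),
    (["window/3.1", "window/3.0"], "window"),
    (["harvard_graphic", "print_shop", "corel_draw"], "graphics"),
    (["corel_draw/3.0", "corel_draw/3.1"], "corel_draw"),
    (["ms_dos_system", "macintosh"], "personal_computer"),
    (["286PC", "386PC", "486PC"], "ms_dos_system"),
]

# Every sub-concept generalizes to exactly one concept in Table 1,
# so a child-to-parent dictionary is sufficient here.
PARENT = {sub: concept for subs, concept in TABLE_1_RULES for sub in subs}


def generalizations(concept: str) -> list:
    """Return the chain of increasingly abstract concepts above `concept`."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


corel_chain = generalizations("corel_draw/3.0")
# ['corel_draw', 'graphics', 'application', 'software', 'product']
pc_chain = generalizations("486PC")
# ['ms_dos_system', 'personal_computer', 'hardware', 'product']
```

A single-parent dictionary only works because the generalization perspective of Table 1 forms a tree; links from other perspectives would need a richer structure.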

3.2 Organization of the Concepts

A concept-set contains the concepts and their descriptions. A description gives various information depending on a concept’s type and status. For example, if the concept is a basic concept, then the description keeps the set of corresponding values. For a derived concept, we keep the type of the rule and related information (e.g., the access path to the classification function if a classification rule is used). Similarly, if a concept is persistent, then the name of the attribute associated with that concept is kept in the description. To access a concept in the concept-set, the name of the concept is used. A synonym list, which keeps a set of equivalent concepts, is used to provide alternative access paths to a concept. For example, in Table 1, the concept “drawing” is a synonym for “graphics”.

A concept hierarchy is a weighted AND/OR polytree (i.e., a singly connected graph) and is used to define relationships between concepts. In a polytree, OR arcs are used to generalize a concept, and AND arcs are used to either classify or associate a concept to other concepts. The weight between two nodes gives the confidence factor for the corresponding relationship. Every leaf node of the graph corresponds to a basic concept, and every intermediate node to a derived concept.

We introduce the notion of a perspective, which gives the type of a link established between concepts. The concept hierarchy can be viewed from three perspectives: association, classification, and generalization. If we had treated all concepts without any perspective, then the semantic power of our mining model would have been reduced to the power of syntactically driven models. For example, suppose we have two derived concepts: “$\mathrm{PTYPE(486PC)} \xrightarrow{\;sale\;} \mathrm{PTYPE(window)},\ 0.9$” and “PTYPE[window] → user_environment.” Then, without any knowledge of their perspectives, nothing in the model prevents the inference “PTYPE(486PC) → user_environment, 0.9.” The reflection of a perspective on the concept hierarchy gives a tree whose root is possibly a dummy node. The rules or restrictions applicable to all members of a perspective (e.g., whether the propagation of confidence factors is allowed) are kept in the description of the perspective. The perspective of generalization, however, allows us to detect association rules or classification rules that are closely related.
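The role of perspectives in blocking the unwanted inference above can be sketched as follows. The sketch assumes a simplified model in which each link of the hierarchy carries a perspective tag and a weight, and traversal may be restricted to one perspective; the `Link` class and the `reachable` function are hypothetical names, not part of the proposed kernel.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    perspective: str      # "association", "classification", or "generalization"
    weight: float = 1.0   # confidence factor of the relationship


# The two derived concepts from the example in the text.
LINKS = [
    Link("PTYPE(486PC)", "PTYPE(window)", "association", 0.9),
    Link("PTYPE(window)", "user_environment", "generalization"),
]


def reachable(links: Iterable[Link], start: str,
              perspective: Optional[str] = None) -> Set[str]:
    """Concepts reachable from `start`.

    With perspective=None every link is followed (the syntactic view);
    otherwise only links of the given perspective are followed.
    """
    links = list(links)
    found: Set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for link in links:
            usable = perspective is None or link.perspective == perspective
            if link.source == node and usable and link.target not in found:
                found.add(link.target)
                frontier.append(link.target)
    return found


# Ignoring perspectives links a 486PC to user_environment.
untyped = reachable(LINKS, "PTYPE(486PC)")
# Restricting the traversal to one perspective avoids that inference.
by_association = reachable(LINKS, "PTYPE(486PC)", "association")
by_generalization = reachable(LINKS, "PTYPE(486PC)", "generalization")
```

Here `untyped` contains `user_environment`, while neither perspective-restricted traversal does, which mirrors the argument that perspectives carry semantic information that a purely syntactic model lacks.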

4 The System Architecture

The idea of a concept is central to our system. On the basis of this idea, the notions of a concept-set and a concept-set structure are defined. The various functional modules are then described by specifying what kinds of operations need to be performed with respect to concepts and concept-set structures. As can be seen in Figure 1, the mining system consists of two major subsystems: the mining kernel and the interactive user interface. This two-level design paradigm provides a common concept-base structure and user interface to various mining methods and makes the mining system customizable. The user controls the concepts either directly through the interactive user interface or implicitly through the query language interface of the DBMS.

4.1 Mining Kernel

This kernel system provides primitive operations related to the management and manipulation of concepts. As mentioned before, a concept can be specified either by generalization or association, and more than one rule can be associated with a concept. The relationship between database functions and the mining kernel depends on how they are coupled. We are currently investigating the pros and cons of different coupling choices.

4.1.1 Concept Discovery Subsystem

This subsystem provides operations to be used for discovering (or deriving) a rule with respect to a concept. The functionality of the operations is listed below:

• Dropping concept(s) from the condition part of a concept’s rule,
• Joining concept(s) to the condition part of a concept’s rule,
• Adding a value to the basic concept,
• Finding a distance between two concepts,
• Generating the concept hierarchy for the set of concepts within a particular perspective,
• Constructing a rule for a concept,
• Proposing relevant domains (i.e., either attribute or relation names) for a concept,
• Deleting a rule of the concept,
• Accessing a particular rule of a concept, and
• Testing a hypothesis.
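The first two operations in this list, dropping concepts from and joining concepts to the condition part of a rule, might look as follows. This is only a sketch under assumed names: the `Rule` class, its fields, and the choice to reset the certainty factor after an edit (since it would have to be recomputed against the data) are illustrative assumptions rather than the kernel's actual design.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Rule:
    conditions: FrozenSet[str]          # concepts in the condition part
    decision: str                       # concept in the decision part
    certainty: Optional[float] = None   # None means "not yet evaluated"


def drop_concepts(rule: Rule, concepts: Iterable[str]) -> Rule:
    """Return a copy of the rule without the given condition concepts."""
    remaining = rule.conditions - frozenset(concepts)
    # The old certainty factor no longer applies to the edited rule.
    return replace(rule, conditions=remaining, certainty=None)


def join_concepts(rule: Rule, concepts: Iterable[str]) -> Rule:
    """Return a copy of the rule with extra condition concepts added."""
    extended = rule.conditions | frozenset(concepts)
    return replace(rule, conditions=extended, certainty=None)


# Hypothetical rule over concepts drawn from Table 1.
original = Rule(frozenset({"ms_dos_system", "graphics"}), "window", 0.8)
narrower = drop_concepts(original, ["graphics"])
wider = join_concepts(original, ["spread_sheet"])
```

Returning new `Rule` values instead of mutating in place keeps earlier versions of a rule available, which may be convenient when several rules are associated with one concept.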

4.1.2 Concept-Set Structuring Subsystem

This subsystem is responsible for the management of concepts. The functionality of the operations can be listed as follows:

• Realizing dictionary operations for a concept (and synonyms),
• Making a non-persistent concept persistent,
• Accessing description fields of a concept,
• Moving forward/backward from a concept in the concept hierarchy for a given perspective,
• Creating or revoking a view.
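A minimal sketch of the dictionary operations on a concept-set, including synonym access and the transition from non-persistent to persistent status, is given below. The class and field names (`ConceptSet`, `Description`, `attribute`, and so on) are hypothetical, and the sketch assumes that a concept counts as persistent exactly when an attribute name is recorded in its description.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass
class Description:
    kind: str                             # "basic" or "derived"
    values: FrozenSet[str] = frozenset()  # value set of a basic concept
    rule_type: Optional[str] = None       # rule type of a derived concept
    attribute: Optional[str] = None       # associated attribute, if any

    @property
    def persistent(self) -> bool:
        return self.attribute is not None


class ConceptSet:
    def __init__(self) -> None:
        self._concepts: Dict[str, Description] = {}
        self._synonyms: Dict[str, str] = {}   # synonym -> canonical name

    def insert(self, name: str, description: Description) -> None:
        self._concepts[name] = description

    def add_synonym(self, synonym: str, name: str) -> None:
        if name not in self._concepts:
            raise KeyError(name)
        self._synonyms[synonym] = name

    def lookup(self, name: str) -> Description:
        """Access a concept by its name or by any of its synonyms."""
        return self._concepts[self._synonyms.get(name, name)]

    def delete(self, name: str) -> None:
        del self._concepts[name]
        self._synonyms = {s: n for s, n in self._synonyms.items() if n != name}

    def make_persistent(self, name: str, attribute: str) -> None:
        self.lookup(name).attribute = attribute


concepts = ConceptSet()
concepts.insert("graphics", Description(kind="derived",
                                        rule_type="generalization"))
concepts.add_synonym("drawing", "graphics")
was_persistent = concepts.lookup("drawing").persistent
concepts.make_persistent("drawing", "PTYPE")
is_persistent = concepts.lookup("graphics").persistent
```

The synonym table gives “drawing” and “graphics” the same description object, so a change made through either name is visible through the other.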

4.2 Interactive User Interface

The interactive user interface provides browsers and an interactive query language for mining applications in terms of the primitive operations defined in the mining kernel. The concept hierarchy browser offers the following features. With respect to a persistent concept, it retrieves the associated rule(s) and then navigates the concept hierarchy. With respect to a dynamic concept, one can select concept(s) in the concept hierarchy that are "close" and browse the condition attributes involved in the associated rules.

5 Conclusion

In this paper, we develop a system architecture that emphasizes the functions of a database system that are essential for data mining queries. Our work on user-oriented clustering [11] and on the organization of concepts discovered through past interactions with users [12], in the context of text retrieval systems, provides the motivation for our proposal. In particular, we provide subsystems to organize persistent concepts into a hierarchy and to enable users to perform a variety of operations on the concept-set structure. As a result, our approach can support the discovery of knowledge through cooperation among a group of users. In addition, our approach offers opportunities to make the concept discovery phase more efficient, since the process can begin with the rule associated with a similar concept already in the hierarchy.

Our eventual goal is to integrate such functionality with currently available DBMS software. Although we have not specifically discussed how the proposed subsystems would affect the design principles of future DBMSs, we currently favor embedding them into a DBMS package because of the need to use trigger facilities and to access status information of the DBMS, such as authorization privileges. This design choice can offer important benefits. For example, mining queries are strictly read-oriented, long-running transactions. Since we deal with classification of the data, in most cases it is unnecessary and even harmful to exclude other transactions from write access to data on which a mining query has acquired a read lock (or vice versa). The tradeoff here is between the performance of the system and the possible amount of classification error. If the error tolerance, say ε, is known in advance, then we can use an ε-serializability protocol instead of the traditional one.

References

[1] Agrawal R., Ghosh S., Imielinski T., Iyer B., and Swami B. An Interval Classifier for Database Mining Applications. In: Proc. of the 18th VLDB Conf., Vancouver, British Columbia, Canada, 1992, pp. 560-573.
[2] Krishnamurty R. and Imielinski T. Research Directions in Knowledge Discovery. SIGMOD Record 1991; 20:76-78.
[3] Shapiro D. G. and Frawley W. J. (Editors). Knowledge Discovery in Databases. AAAI/MIT Press, Cambridge, 1991.
[4] Belkin N. J. and Croft W. B. Information Filtering and Information Retrieval: Two Sides of the Same Coin. Comm. of ACM 1992; 35:29-39.
[5] Deogun J. S. and Raghavan V. V. Description of the UNL/USL System used for MUC3. In: Proc. of DARPA's 3rd Message Understanding Conference (MUC-3). Morgan Kaufmann Pub., San Diego, 1991, pp. 234-242.
[6] Hayes P. and Weinstein S. CONSTRUE/TIS: A System for Content-based Indexing of a Database of News Stories. Second Annual Conf. on Innovative Applications of Artificial Intelligence, 1990.
[7] McCune B. P., Tong R. M., Dean J. S., and Shapiro D. G. RUBRIC: A System for Rule-based Information Retrieval. IEEE Trans. Software Engineering 1985; SE-11:939-945.
[8] Yasdi R. Learning Classification Rules from Database in the Context of Knowledge Acquisition and Representation. IEEE Trans. Knowl. Data Eng. 1991; 3:293-306.
[9] Kaufman K. A., Michalski R. S., and Kerschberg L. Mining for Knowledge in Databases: Goals and General Description of the INLEN System. In: Knowledge Discovery in Databases. AAAI/MIT Press, Cambridge, MA, 1991.
[10] Davis M. D. and Weyuker E. J. Computability, Complexity, and Languages. Academic Press, New York, 1983.
[11] Bhatia S. K., Deogun J. S., and Raghavan V. V. Automatic Rule-Base Generation for User-oriented Information Retrieval. In: Proc. of the Fifth Int'l. Symposium on Methodologies for Intelligent Systems. Knoxville, 1990, pp. 118-125.
[12] Zhang Y., Raghavan V. V., and Deogun J. S. An Object-Oriented Modeling of History of Optimal Retrievals. In: Proc. of the 14th Int'l. ACM-SIGIR Conf. on Research and Development in Information Retrieval. Chicago, Oct 1991, pp. 241-250.
