
Rule Generation Methods Based on Logic Synthesis

Marco Muselli Institute of Electronics, Computer and Telecommunication Engineering Italian National Research Council via De Marini, 6 16149 Genoa Italy voice: +39 010 6475213 fax: +39 010 6475200 email: [email protected]


Marco Muselli, Italian National Research Council, Genoa, Italy

INTRODUCTION

One of the most relevant problems in artificial intelligence is allowing a synthetic device to perform inductive reasoning, i.e. to infer a set of rules consistent with a collection of data pertaining to a given real-world problem. A variety of approaches, arising in different research areas such as statistics, machine learning, and neural networks, have been proposed during the last 50 years to deal with the problem of realizing inductive reasoning.

Most of the developed techniques build a black-box device, which has the aim of efficiently solving a specific problem by generalizing the information contained in the sample of data at hand, without caring about the intelligibility of the solution obtained. This is the case with connectionist models, where the internal parameters of a nonlinear device are adapted by an optimization algorithm to improve its consistency on the available examples while increasing prediction accuracy on unknown data. The internal structure of the nonlinear device and the training method employed to optimize the parameters determine different classes of connectionist models: for instance, multilayer perceptron neural networks (Haykin, 1999) consider a combination of sigmoidal basis functions, whose parameters are adjusted by a local optimization algorithm known as back-propagation. Another example of connectionist model is given by support vector machines (Vapnik, 1998), where replicas of the kernel of a reproducing

kernel Hilbert space are properly adapted and combined through a quadratic programming method to realize the desired nonlinear device. Although these models provide a satisfactory way of approaching a general class of problems, the behavior of the synthetic devices they realize cannot be directly understood, since they generally involve the application of nonlinear operators whose meaning is not directly comprehensible. Discriminant analysis techniques, as well as statistical nonparametric methods (Duda et al., 2001) like k-nearest-neighbor or projection pursuit, also belong to the class of black-box approaches, since the reasoning followed by probabilistic models to perform a prediction cannot generally be expressed in an intelligible form.

However, in many real-world applications the comprehension of this prediction task is crucial, since it provides a direct way to analyze the behavior of the artificial device outside the collection of data at our disposal. In these situations the adoption of black-box techniques is not acceptable, and a more convenient approach is offered by rule generation methods (Duch et al., 2004), a particular class of machine learning techniques that are able to produce a set of intelligible rules, in the if-then form, underlying the real-world problem at hand. Several different rule generation methods have been proposed in the literature: some of them reconstruct the collection of rules by analyzing a connectionist model trained with a specific optimization algorithm (Setiono, 2000; Setnes, 2000); others generate the desired set of rules directly from the given sample of data. This last approach is followed by algorithms that construct decision trees (Quinlan, 1993; Hastie et al., 2001) and by

techniques in the area of Inductive Logic Programming (Boytcheva, 2002; Quinlan & Cameron-Jones, 1995). A novel methodology, adopting proper algorithms for logic synthesis to generate the set of rules pertaining to a given collection of data (Hong, 1997; Boros et al., 1997; Boros et al., 2000; Sanchez et al., 2002; Muselli & Liberati, 2000), has recently been proposed and forms the subject of the present chapter. In particular, the general procedure followed by this class of methods will be outlined in the following sections, analyzing in detail the specific implementation adopted by one of these techniques, Hamming Clustering (Muselli & Liberati, 2002), to better comprehend the peculiarities of the rule generation process.

BACKGROUND

Any logical combination of simple conditions can always be written as a Disjunctive Normal Form (DNF) of binary variables, each of which takes into account the fulfillment of a particular condition. Thus, if the inductive reasoning to be performed amounts to making a binary decision, the optimal set of if-then rules can be associated with a Boolean function f that assigns the most probable output to every case. Since the goal of methods for logic synthesis is exactly the determination of the DNF of a Boolean function f, starting from a portion of its truth table, they can be directly used to generate a set of rules for any pattern recognition problem by examining a finite collection of examples, usually called the training set. To allow the generalization of the information contained in the sample at hand, a proper logic synthesis technique, called Hamming Clustering (HC) (Muselli & Liberati, 2000; Muselli & Liberati, 2002), has

been developed. It proceeds by grouping together binary strings with the same output value that are close to one another according to the Hamming distance. Theoretical results (Muselli & Liberati, 2000) ensure that HC has a polynomial computational cost O(n²cs + nc²), where n is the number of input variables, s is the size of the given training set, and c is the total number of AND ports in the resulting digital circuit. A similar, though more computationally intensive, methodology has been proposed by Boros et al. (2000). Every method based on logic synthesis offers the two following advantages:

• It generates artificial devices that can be directly implemented on a physical support, since they are not affected by problems connected with the precision used when numbers are stored.

• It determines automatically the significant inputs for the problem at hand (feature selection).

MAIN THRUST OF THE CHAPTER

A typical situation where inductive reasoning has to be performed is given by pattern recognition problems. Here, vectors x ∈ ℜⁿ, called patterns, have to be assigned to one of two possible classes, associated with the values of a binary output y, coded by the integers 0 and 1. This assignment task must be consistent with a collection of m examples (xi,yi), i = 1, …, m, called the training set, obtained from previous observations of the problem at hand. The target is to retrieve a proper binary function g(x) that provides the correct answer y = g(x) for most input patterns x.

Solving pattern recognition problems through logic synthesis

Inductive reasoning occurs if the form of the target function g is directly understandable; a possible way of achieving this result is to write g as a collection of intelligible rules in the if-then form. The conditions included in the if part of each rule act on the input variables contained in the vector x; consequently, they take a different form depending on the range of values assumed by the component xj to which they refer. Three situations can be distinguished:

1. continuous variables: xj varies within an interval [a,b] of the real axis; no upper bound on the number of different values assumed by xj is given.

2. discrete (ordered) variables: xj can assume only the values contained in a finite set; typically, the first positive integers are considered with their natural ordering.

3. nominal variables: as for discrete variables, xj can assume only the values contained in a finite set, but there is no ordering relationship among them; again, the first positive integers are usually employed for the values of xj. Binary variables are treated as a particular case of nominal variables, but the values 0 and 1 are used instead of 1 and 2.

Henceforth, only threshold conditions of the kind xj < c, where c is a real number, will be considered for inclusion in the if part of a rule when xj is a continuous or a discrete variable. On the other hand, a nominal variable xj ∈ {1,2,…,k} will participate in g only through membership conditions, like xj ∈ {1,3,4}. Separate conditions are composed only by means of AND operations, whereas different rules are applied as if they were linked by an OR operator.

As an example, consider the problem of analyzing the characteristics of clients buying a certain product: the average weekly expense x1 is a continuous variable assuming values in the interval [0,10000], whereas the age of the client x2 is better described by a discrete variable in the range [0,120]. The client's activity x3 is an example of a nominal variable; suppose that only four categories are considered: farmer, worker, employee, and manager, coded by the integers 1, 2, 3, and 4, respectively. A final binary variable x4 is associated with the gender of the client (0 = male, 1 = female). With this notation, two rules for the problem at hand can assume the following form:

if x1 > 300 AND x3∈{1,2} then y = 1 (the client buys the product)
if x2 < 20 AND x4 = 0 then y = 0 (the client does not buy the product)

Note that x3∈{1,2} refers to the possibility that the client is a farmer or a worker, whereas x4 = 0 (equivalent to x4∈{0}) means that the client must be male to satisfy the conditions of the second rule. The general approach followed by techniques relying on logic synthesis is sketched in Fig. 1.
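As a sketch of how such a rule set behaves, the two rules above can be written directly in Python; all function names below (`rule1`, `rule2`, `predict`) are illustrative, not part of the chapter.

```python
def rule1(x1, x2, x3, x4):
    # if x1 > 300 AND x3 in {1, 2} then y = 1
    return x1 > 300 and x3 in {1, 2}

def rule2(x1, x2, x3, x4):
    # if x2 < 20 AND x4 = 0 then y = 0
    return x2 < 20 and x4 == 0

def predict(x1, x2, x3, x4, default=0):
    # Conditions inside a rule are ANDed; distinct rules act as if ORed.
    if rule1(x1, x2, x3, x4):
        return 1
    if rule2(x1, x2, x3, x4):
        return 0
    return default  # no rule fires: fall back to a default class

# A 28-year-old male farmer spending 350 dollars a week fires the first rule.
print(predict(350, 28, 1, 0))  # -> 1
```

The default class for uncovered patterns is an assumption made here for completeness; the chapter does not specify one.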

Coding the training set in binary format

At the first step the entire pattern recognition problem has to be rewritten in terms of Boolean variables; since the output is already binary, only the input vector x has to be translated into the desired form. To this aim, we consider for every component xj a proper binary coding that preserves the basic properties of ordering and distance.

1. The input vector x is mapped into a binary string z by using a proper coding β(x) that preserves the basic properties (ordering and distance) of every component xj.

2. The AND-OR expression of a Boolean function f(z) is retrieved starting from the available examples (xi,yi), coded in binary form as (zi,yi), where zi = β(xi).

3. Each logical product in the AND-OR expression of f(z) is directly translated into an intelligible rule underlying the problem at hand. This amounts to writing the target function g(x) as f(β(x)).

Figure 1: General procedure followed by logic synthesis techniques to perform inductive reasoning in a pattern recognition problem.

However, as the number of bits employed in the coding depends on the range of values assumed by the component xj, it is often necessary to perform a preliminary discretization step to reduce the length of the resulting binary string. Given a specific training set, several techniques (Dougherty et al., 1995; Liu & Setiono, 1997; Boros et al., 2000) are available in the literature to perform the discretization step while minimizing the loss of information involved in this task. Suppose that in our example, concerning a marketing problem, the range for input x1 has been split into the five intervals [0,100], (100,300], (300,500], (500,1000], and (1000,10000]. We can think that the component x1 has now become a discrete variable assuming integer values in the range [1,5]: value 1 means that the average weekly expense lies within [0,100], value 2 is associated with the interval (100,300], and so on. Discretization can also be used for the discrete input x2 to reduce the number of values it may assume. For example, the four intervals [0,12], (12,20], (20,60], and (60,120] could have been determined as an optimal subdivision, thus resulting in a new discrete component x2 assuming integer values in [1,4]. Note that after discretization any input variable is either discrete (ordered) or nominal; continuous variables no longer occur. Thus, the mapping required at Step 1 of

Fig. 1 can be realized by considering one component at a time and by employing the following codings:

1. thermometer code (for discrete variables): it adopts a number of bits equal to the number of values assumed by the variable minus one, and sets to 1 the leftmost k–1 bits to code the value k. For example, the component x1, which can assume five different values, will be mapped into a string of four bits; in particular, the value x1 = 3 is associated with the binary string 1100.

2. only-one code (for nominal variables): it adopts a number of bits equal to the number of values assumed by the variable, and sets to 1 only the kth bit to code the value k. For example, the component x3, which can assume four different values, will be mapped into a string of four bits; in particular, the value x3 = 3 is associated with the binary string 0010.

Binary variables do not need any code, but are left unchanged by the mapping process. It can be shown that these codings maintain the properties of ordering and distance, if the Hamming distance (given by the number of different bits) is employed in the set of binary strings. Then, given any input vector x, the binary string z = β(x), required at Step 1 of Fig. 1, can be obtained by applying the proper coding to each component and concatenating the resulting binary strings. As an example, a 28-year-old female employee with an average weekly expense of 350 dollars is described (after discretization) by the vector x = (3,3,3,1) and coded by the binary string z = 1100|110|0010|1 (the symbol '|' only serves to separate the contributions of different components). In fact, x1 = 3 gives 1100, x2 = 3 yields 110, x3 = 3 maps into 0010 and, finally, x4 = 1 is left unchanged.
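The discretization and coding steps can be sketched together in a few lines of Python; the helper names (`discretize`, `thermometer`, `only_one`, `encode`) are ours, while the cut points are the ones quoted in the text.

```python
import bisect

# Cut points from the marketing example: x1 falls in [0,100], (100,300],
# (300,500], (500,1000], (1000,10000]; x2 in [0,12], (12,20], (20,60], (60,120].
X1_CUTS = [100, 300, 500, 1000]
X2_CUTS = [12, 20, 60]

def discretize(value, cuts):
    # Map a raw value to the 1-based index of its right-closed interval.
    return bisect.bisect_left(cuts, value) + 1

def thermometer(k, n_values):
    # n_values - 1 bits; the leftmost k - 1 bits are set to 1.
    n_bits = n_values - 1
    return "1" * (k - 1) + "0" * (n_bits - k + 1)

def only_one(k, n_values):
    # n_values bits; only the k-th bit is set to 1.
    return "0" * (k - 1) + "1" + "0" * (n_values - k)

def encode(expense, age, activity, gender):
    # Discretize the ordered inputs, then concatenate the per-component
    # codings: x1 and x2 thermometer, x3 only-one, x4 left unchanged.
    x1 = discretize(expense, X1_CUTS)   # 5 possible values
    x2 = discretize(age, X2_CUTS)       # 4 possible values
    return (thermometer(x1, 5) + thermometer(x2, 4)
            + only_one(activity, 4) + str(gender))

# The 28-year-old female employee spending 350 dollars a week:
print(encode(350, 28, 3, 1))
```

The printed string is the concatenation of the per-component codings, i.e. the binary string z of the example with the '|' separators removed.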

Hamming Clustering

Through the adoption of the above mapping, the m examples (xi,yi) of the training set are transformed into m pairs (zi,yi) = (β(xi),yi), which can be considered as a portion of the truth table of the Boolean function to be reconstructed. Here, the procedure for rule generation in Fig. 1 continues at Step 2, where a suitable method for logic synthesis, like HC, has to be employed to retrieve a Boolean function f(z) that generalizes the information contained in the training set. A basic concept in the procedure followed by HC is the notion of cluster, i.e. the collection of all the binary strings having the same values in a fixed subset of components; for instance, the four binary strings 01001, 01101, 11001, and 11101 form a cluster, since all of them have the values 1, 0, and 1 in the second, the fourth, and the fifth component, respectively. This cluster is usually written as *1*01, by placing a don't care symbol '*' in the positions that are not fixed, and the cluster *1*01 is said to cover the four binary strings above. Every cluster can be associated with a logical product among the bits of z, which gives output 1 for all and only the binary strings covered by that cluster. For example, the cluster *1*01 corresponds to the logical product z2 AND (NOT z4) AND z5, where NOT z4 is the complement of the fourth bit z4. The desired Boolean function f(z) can then be constructed by generating a valid collection of clusters for the binary strings belonging to a selected class. The procedure employed by HC consists of the four steps shown in Fig. 2.
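The two notions just introduced, Hamming distance and cluster coverage, can be sketched in Python (function names are ours):

```python
def hamming(a, b):
    # Hamming distance: number of positions where two equal-length
    # binary strings differ.
    return sum(x != y for x, y in zip(a, b))

def covers(cluster, string):
    # '*' marks a don't-care position; every fixed bit must match exactly.
    return all(c == '*' or c == s for c, s in zip(cluster, string))

print(hamming('01001', '01101'))  # -> 1
print([s for s in ['01001', '01101', '11001', '11101', '00001']
       if covers('*1*01', s)])    # the first four strings are covered
```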

1. Choose at random an example (zi,yi) in the training set.

2. Build a cluster of points including zi and associate that cluster with the class yi.

3. Remove the example (zi,yi) from the training set. If the construction is not complete, go to Step 1.

4. Simplify the set of clusters generated and build the AND-OR expression of the corresponding Boolean function f(z).

Figure 2: Procedure employed by Hamming Clustering to reconstruct a Boolean function from examples.

Once the example (zi,yi) has been randomly chosen at Step 1, a cluster of points including zi is generated and associated with the class yi. Since each cluster is uniquely associated with an AND operation among the bits of the input string z, it is straightforward to build at Step 4 the AND-OR expression of the reconstructed Boolean function f(z). However, every cluster can also be directly translated into an intelligible rule having in its if part conditions on the components of the original input vector x. To this aim, it is sufficient to analyze the patterns covered by that cluster to produce proper consistent threshold or membership conditions; this is the usual way to perform Step 3 of Fig. 1. An example may help in understanding this passage: suppose the application of a logic synthesis method, like HC, to the marketing problem has produced the cluster 11**|***|**00|* for the output y = 1. Then, the associated rule can be generated by examining the parts of the cluster that do not contain only don't care symbols; each such part yields a condition on the corresponding component of the vector x. In the case above the first four positions of the cluster contain the sequence 11**, which covers the admissible binary strings 1100, 1110, and 1111 (according to the thermometer code), associated with the intervals (300,500], (500,1000], and

(1000,10000] for the first input x1. Thus, the sequence 11** can be translated into the threshold condition x1 > 300. Similarly, the sequence **00 covers the admissible binary strings 1000 and 0100 (according to the only-one code) and corresponds therefore to the membership condition x3∈{1,2}. Hence, the resulting rule is

if x1 > 300 AND x3∈{1,2} then y = 1
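This translation from cluster parts back to conditions can be sketched for the two coded components of the example; the helpers below are ours and simplified (they only handle a prefix of fixed 1-bits in the thermometer part and fixed 0-bits in the only-one part, which is all the example needs).

```python
# Lower end of each x1 interval, so that "value > k" maps to a threshold.
X1_LOWER = [0, 100, 300, 500, 1000]

def thermometer_condition(part, lower_bounds):
    # In a thermometer code, a prefix of k fixed 1-bits means the coded
    # value exceeds k, i.e. the variable exceeds the k-th cut point.
    k = 0
    while k < len(part) and part[k] == '1':
        k += 1
    return f"x1 > {lower_bounds[k]}" if k > 0 else None

def only_one_condition(part):
    # In an only-one code, fixed 0-bits exclude values; positions left
    # as '*' remain admissible.
    admissible = {i + 1 for i, c in enumerate(part) if c != '0'}
    return f"x3 in {sorted(admissible)}"

print(thermometer_condition('11**', X1_LOWER))  # -> 'x1 > 300'
print(only_one_condition('**00'))               # -> 'x3 in [1, 2]'
```

Joining the two conditions with AND reproduces exactly the rule derived in the text.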

FUTURE TRENDS

Note that in the approach followed by HC several clusters can lead to the same rule; for instance, both the implicants 11**|***|**00|* and *1**|***|**00|* yield the condition x1 > 300 AND x3∈{1,2}. On the other hand, there are clusters that do not correspond to any rule, such as 01**|***|11**|*. Even though these last implicants cannot be translated into rules, their presence in the search space increases the complexity of reconstructing a Boolean function that generalizes well. To overcome this drawback, a new approach is currently under examination: it considers the possibility of removing the NOT operator from the resulting digital circuits, which is equivalent to employing the class of monotone Boolean functions for the construction of the desired set of rules. In fact, it can be shown that such an approach leads to a unique bi-directional correspondence between clusters and rules, thus reducing the computational cost needed to perform inductive reasoning while maintaining good generalization ability.
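The NOT-free restriction can be illustrated with a toy monotone DNF evaluator (the representation and names are ours, not the method under examination): each implicant only requires that certain bits equal 1, so flipping any input bit from 0 to 1 can never turn the output from 1 to 0.

```python
def monotone_dnf(implicants, z):
    # implicants: list of sets of 0-based positions that must be 1.
    # With no NOT operator, an implicant never requires a bit to be 0.
    return any(all(z[i] == '1' for i in imp) for imp in implicants)

f = [{0, 1}, {2, 4}]              # z1 z2 OR z3 z5, with no complemented bit
print(monotone_dnf(f, '11000'))   # -> True  (first implicant fires)
print(monotone_dnf(f, '00101'))   # -> True  (second implicant fires)
print(monotone_dnf(f, '01010'))   # -> False (no implicant fires)
```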

CONCLUSION

Inductive reasoning is a crucial task when exploring a collection of data to retrieve some kind of intelligible information. The realization of an artificial device, or of an automatic procedure, performing inductive reasoning is a basic challenge that involves researchers working in different scientific areas, such as statistics, machine learning, data mining, and fuzzy logic. A possible way of attaining this target is offered by the employment of a particular kind of technique for logic synthesis. Such techniques are able to generate a set of understandable rules underlying a real-world problem, starting from a finite collection of examples. In most cases the accuracy achieved is comparable to or better than that of the best machine learning methods, which are, however, unable to produce intelligible information.

REFERENCES

Boros, E., Hammer, P. L., Ibaraki, T. & Kogan, A. (1997). Logical analysis of numerical data. Mathematical Programming, 79, 163-190.

Boros, E., Hammer, P. L., Ibaraki, T., Kogan, A., Mayoraz, E. & Muchnik, I. (2000). An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12, 292-306.

Boytcheva, S. (2002). Overview of inductive logic programming (ILP) systems. Cybernetics and Information Technologies, 1, 27-36.

Dougherty, J., Kohavi, R. & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In ML-95: Proceedings of the Twelfth International Conference on Machine Learning (pp. 194-202). San Francisco, CA: Morgan Kaufmann.

Duch, W., Setiono, R. & Zurada, J. M. (2004). Computational intelligence methods for rule-based data understanding. Proceedings of the IEEE, 92, 771-805.

Duda, R. O., Hart, P. E. & Stork, D. G. (2001). Pattern Classification. New York: John Wiley.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. London: Prentice Hall.

Hong, S. J. (1997). R-MINI: An iterative approach for generating minimal rules from examples. IEEE Transactions on Knowledge and Data Engineering, 9, 709-717.

Liu, H. & Setiono, R. (1997). Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9, 642-645.

Muselli, M. & Liberati, D. (2000). Training digital circuits with Hamming Clustering. IEEE Transactions on Circuits and Systems—I: Fundamental Theory and Applications, 47, 513-527.

Muselli, M. & Liberati, D. (2002). Binary rule generation via Hamming Clustering. IEEE Transactions on Knowledge and Data Engineering, 14, 1258-1268.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Quinlan, J. R. & Cameron-Jones, R. M. (1995). Induction of logic programs: FOIL and related systems. New Generation Computing, 13, 287-312.

Sanchez, S. N., Triantaphyllou, E., Chen, J. & Liao, T. W. (2002). An incremental learning algorithm for constructing Boolean functions from positive and negative examples. Computers & Operations Research, 29, 1677-1700.

Setiono, R. (2000). Extracting m-of-n rules from trained neural networks. IEEE Transactions on Neural Networks, 11, 512-519.

Setnes, M. (2000). Supervised fuzzy clustering for rule extraction. IEEE Transactions on Fuzzy Systems, 8, 416-424.

Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley.

TERMS AND THEIR DEFINITION

Inductive Reasoning: The task of extracting intelligible information from a collection of examples pertaining to a physical system.

Rule Generation: An automatic way of performing inductive reasoning through the generation of understandable rules underlying the physical system at hand.

Pattern Recognition Problem: A decision problem where the state of a system (described by a vector of inputs) has to be assigned to one of two possible output classes, generalizing the information contained in a set of examples. The same term is also used to denote classification problems, where the number of output classes is greater than two.

Boolean Function: A binary function that maps binary strings (with fixed length) into a binary value. Every Boolean function can be written as an expression containing only AND, OR, and NOT operations.

Truth Table: The collection of all the input-output pairs for a Boolean function.

Logic Synthesis: The process of reconstructing an unknown Boolean function from (a portion of) its truth table.

Hamming Distance: The distance between two binary strings (with the same length) given by the number of different bits.
