Discovering Knowledge by Fuzzy Predicates in Compensatory Fuzzy Logic Using Metaheuristic Algorithms

Marlies Martínez Alonso, Rafael Alejandro Espín Andrade, Vivian López Batista, and Alejandro Rosete Suárez

Abstract. Compensatory Fuzzy Logic (CFL) is a logical system that enables an optimal way of modeling knowledge. Its axiomatic character makes it possible to translate between natural language and logic, so it is used in knowledge discovery and decision-making. Obtaining CFL predicates with high truth values is a general and flexible approach that can be used to discover patterns and new knowledge from data. This work proposes a method for knowledge discovery based on obtaining CFL predicates, which yields different knowledge structures using a metaheuristic approach. A series of experiments, together with a description of the results, illustrates certain advantages of the proposed method for representing patterns and tendencies in data.

1 Introduction

Multivalent Logics generalize Bivalent Logic (BL) to a whole range of intermediate values between true (1) and false (0). Thus, it is possible to work with sets that do not have perfectly defined limits, where the transition between membership and non-membership of a variable in a set is gradual [3].

Marlies Martínez Alonso · Rafael A. Espín Andrade
“Jose Antonio Echeverria” Higher Technical Institute, Cuba
e-mail: [email protected], [email protected], [email protected]

Vivian López Batista · Alejandro Rosete Suárez
Department of Computer Science and Automation, University of Salamanca, Spain
e-mail: [email protected]

R.A. Espín Andrade et al. (eds.), Soft Computing for Business Intelligence, Studies in Computational Intelligence 537, DOI: 10.1007/978-3-642-53737-0_11, © Springer-Verlag Berlin Heidelberg 2014


Despite their many advantages for modeling ambiguous or vague knowledge, multivalent systems have certain difficulties modeling knowledge in a natural way. For that reason, applications frequently resort to free operators and to the extralogical resource called defuzzification [5]. The main difficulty of multivalent systems in knowledge modeling is that they do not completely generalize all formulas of Bivalent Logic. This sometimes causes poor behavior for some interpretations of the logical variables. Another difficulty becomes evident when modeling decision-making problems, where these systems do not achieve the most appropriate behavior. In this case, the deficiency stems from the associative character of the conjunction and disjunction operators, and from the lack of sensitivity to changes in the truth values of basic predicates when calculating the truth value of compound predicates [7].

CFL is a multivalent system distinguished by its ability to generalize all formulas of BL. It has been demonstrated, using Kleene's axiomatic, that the valid formulas of BL are exactly the formulas that have a truth value greater than 0.5 in the context of CFL. CFL is also a sensitive system: any variation in the truth value of the basic predicates is reflected in the truth value of the compound predicates. CFL gives up the classical associative property of conjunction and disjunction in order to achieve a knowledge representation closer to human thinking. Its ability to formalize reasoning makes it usable in situations that require multicriteria evaluations and ambiguous verbal descriptions of knowledge. It therefore offers an opportunity to use language in the construction of semantic models that facilitate evaluation, decision making and knowledge discovery [7, 8].

Knowledge discovery by finding CFL predicates uses metaheuristics to explore the space of possible predicates and to identify those that have a high truth value. This generalizes the part of genetic programming dedicated to obtaining predicates that represent logical formulas, a part intended mainly for circuit design [1, 6]. In this case, unlike genetic programming, the learning is not supervised. We also do not use the classical binary trees of genetic programming, which avoids the deficiencies of the binary representation.

This work begins by giving some basic notions of CFL. It then describes the basic notions of the proposed approach. Finally, it presents experimental results that show certain advantages of the proposed approach as a tool for knowledge discovery.

2 Basic Notions of CFL

The main difficulties of multivalent logical approaches in the modeling of knowledge are:
• The associative property of the conjunction and disjunction operators used.
• The lack of sensitivity to changes in the truth values of the basic predicates when calculating the truth value of the compound predicate.


• The total absence of compensation among the truth values of the basic predicates when calculating the truth value of the compound predicate.

Associativity is present in a large part of the operators used for aggregation. This characteristic is not desirable for data mining, because of the equality it imposes among hierarchies of objectives and preferences. Sensitivity, in turn, is the ability to react to changes in the values of the predicates. It ensures that different states of the system produce different assessments of the veracity of knowledge, and that different models behave differently. Finally, compensation is the capacity of the values of the basic predicates to offset one another when calculating the truth value of the compound predicate. Classical approaches of Decision Theory include additive models, which accept compensation without limits, and descriptive models, which accept compensation only partially; the latter are closer to the reasoning of actual agents [7].

In CFL, the conjunction (∧, and) is given by the geometric mean of the truth values of the variables involved. Equation 1 gives the conjunction operator of CFL, where c is the operator representing conjunction and $x_i$ is the truth value of variable i:

$$c(x_1, x_2, \ldots, x_n) = (x_1 \cdot x_2 \cdots x_n)^{1/n} \qquad (1)$$

The disjunction (∨, or) is given by the complement of the geometric mean of the negations of the truth values of the variables. It is calculated according to equation 2, where d is the disjunction operator and $x_i$ is the truth value of variable i:

$$d(x_1, x_2, \ldots, x_n) = 1 - \left((1 - x_1)(1 - x_2) \cdots (1 - x_n)\right)^{1/n} \qquad (2)$$

The negation (¬), as in the other Fuzzy Logic systems, is calculated as the complement of the value of the negated variable. Equation 3 shows how to calculate it, where n represents the negation operator and $x_i$ represents the value of variable i:

$$n(x_i) = 1 - x_i \qquad (3)$$
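The three basic operators can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation (which was written in Visual Prolog); the function names `conjunction`, `disjunction` and `negation` are chosen here for readability. The last lines show that the conjunction is not associative: grouping the arguments changes the result.

```python
import math

def conjunction(*values):
    """CFL conjunction (equation 1): geometric mean of the truth values."""
    return math.prod(values) ** (1.0 / len(values))

def disjunction(*values):
    """CFL disjunction (equation 2): complement of the geometric mean of the negations."""
    return 1.0 - math.prod(1.0 - v for v in values) ** (1.0 / len(values))

def negation(value):
    """CFL negation (equation 3): complement of the truth value."""
    return 1.0 - value

# A low value is partially compensated by a high one, but a zero vetoes the conjunction.
print(conjunction(0.9, 0.4))   # geometric mean of 0.9 and 0.4
print(conjunction(0.9, 0.0))   # 0.0

# Non-associativity: nesting two conjunctions differs from one three-argument conjunction.
nested = conjunction(conjunction(0.9, 0.4), 0.1)
flat = conjunction(0.9, 0.4, 0.1)
print(nested, flat)
```

Because the operators are n-ary and not associative, a predicate with three conjuncts should be evaluated with a single three-argument call rather than with nested binary calls.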

The implication (→) can be defined in either of the two ways shown in equations 4 and 5. In these equations, i represents the implication operator and x, y represent any two variables:

$$i(x, y) = d(n(x), y) \qquad (4)$$

or

$$i(x, y) = d(n(x), c(x, y)) \qquad (5)$$

Where d is the disjunction operator, c is the conjunction operator and n is the negation operator mentioned above.


The equivalence or double implication (↔) is defined as the conjunction of the implication and its reciprocal. For any two variables x and y, the equivalence e can be defined as shown in equation 6:

$$e(x, y) = c(i(x, y), i(y, x)) \qquad (6)$$

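Where c and i are the conjunction and implication operators presented above.

The universal and existential quantifiers are calculated according to equations 7 and 8. For any fuzzy predicate p over the universe U, the universal and existential propositions are respectively defined as:

$$\forall_{x \in U}\, p(x) = \bigwedge_{x \in U} p(x) = \Big(\prod_{x \in U} p(x)\Big)^{1/|U|} \qquad (7)$$

$$\exists_{x \in U}\, p(x) = \bigvee_{x \in U} p(x) = 1 - \Big(\prod_{x \in U} \big(1 - p(x)\big)\Big)^{1/|U|} \qquad (8)$$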
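The derived operators follow directly from the basic ones. The sketch below assumes that the quantifiers are the CFL conjunction and disjunction taken over all elements of a finite universe, and that the equivalence uses the implication of equation 4 unless another one is passed in; the function names are illustrative only.

```python
import math

def c(*xs):
    """Conjunction: geometric mean."""
    return math.prod(xs) ** (1.0 / len(xs))

def d(*xs):
    """Disjunction: complement of the geometric mean of the negations."""
    return 1.0 - math.prod(1.0 - x for x in xs) ** (1.0 / len(xs))

def n(x):
    """Negation: complement."""
    return 1.0 - x

def implication_4(x, y):
    """Implication as in equation 4: d(n(x), y)."""
    return d(n(x), y)

def implication_5(x, y):
    """Implication as in equation 5: d(n(x), c(x, y))."""
    return d(n(x), c(x, y))

def equivalence(x, y, implication=implication_4):
    """Equation 6: conjunction of the implication and its reciprocal."""
    return c(implication(x, y), implication(y, x))

def for_all(truth_values):
    """Universal quantifier, assumed to be the conjunction over the universe."""
    return c(*truth_values)

def exists(truth_values):
    """Existential quantifier, assumed to be the disjunction over the universe."""
    return d(*truth_values)

print(implication_4(0.9, 0.8), implication_5(0.9, 0.8))
print(for_all([0.9, 0.8, 0.7]), exists([0.1, 0.2, 0.9]))
```

Both implications reproduce the bivalent truth table at the extremes (true implies false gives 0, every other crisp combination gives 1), and a single example with truth value 0 drives the universal quantifier to 0.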

3 Search Method of CFL Predicates

The method proposed in this paper aims to provide flexibility to obtain different knowledge structures using different search algorithms. To achieve this flexibility we use a declarative approach, which separates the mechanism used to express requirements from the mechanism used to satisfy them. The declarative approach relies on general-purpose optimization and search methods such as Genetic Algorithms, Evolutionary Algorithms, Simulated Annealing, Tabu Search and classical Artificial Intelligence methods such as Stochastic Hill Climbing (SHC) [1, 6].

To discover knowledge by obtaining CFL predicates, we need to find good predicates in the space of possible predicates. A predicate is considered good if it has a high truth value in the set of examples. The problem therefore lends itself to a metaheuristic approach, which consists in optimizing a function (the objective function) over a search space [2, 6]. The metaheuristic approach separates the evaluation of solutions from the search mechanism used. This separation makes it easier to change either the selected search algorithm or the function that evaluates predicates (the function to be optimized).

The proposed search algorithm is composed of three key elements:
• knowledge representation in the form of predicates
• evaluation of predicates using the operators of the CFL
• metaheuristic approach to perform searches

Representation: For the representation of predicates we use general trees. A general tree is defined as a non-empty finite set T of elements called nodes, such that:
• T contains a distinguished element R, called the root of T.
• The remaining elements of T form an ordered collection of zero or more disjoint trees T1, T2, ..., Tn.


In this case, the terminal nodes of the tree are the variables of the problem and the internal nodes are the CFL operators: negation (¬), conjunction (∧), disjunction (∨), implication (→) and double implication or equivalence (↔).

Evaluation: A fundamental element in evaluating a predicate is the truth value it takes in the data set studied. In Predicate Logic, universal and existential quantifiers are frequently used. The universal quantifier determines whether a formula (predicate) is true for all values within the domain. The existential quantifier indicates that a formula is true for at least one value within the domain. In this paper we use the universal quantifier of CFL (see equation 7) to calculate the truth value that a predicate takes in the data set. The proposed method takes other important characteristics into account when evaluating a predicate. The predicate should:
1. Not repeat variables (to avoid obvious predicates).
2. Have a specific structure, if one is desired (for example, an implication as the root node when rules are sought).
3. Involve as many of the variables present in the examples as possible.
4. Be small.

The first feature is important because it avoids obvious predicates, which acquire a high truth value but provide little new knowledge. To achieve this, predicates that have repeated variables are penalized with an evaluation of 0. The second characteristic is used for association rules. In this case we check that the root node of the tree is an implication, and predicates that are not rules are penalized with 0. This directs the search toward rules only. The third feature is intended to capture more relations between the variables analyzed and therefore more knowledge. The fourth feature is a natural concern from the point of view of knowledge engineering, because small predicates are easier to interpret than large and complex ones. Equation 9 gives the function that evaluates the quality of a predicate:

[Equation 9: E as a function of T, DV, SC and SSC] (9)

Where:
• Evaluation (E): evaluation of the predicate.
• Truth value (T): truth value of the predicate in the set of examples.
• Number of different variables (DV): number of different variables present in the tree.
• Size with constants (SC): number of nodes (terminal and internal) in the tree, counting constants.
• Size without constants (SSC): number of nodes (terminal and internal) in the tree, not counting constants.


Equation 9 is the objective function to be optimized when searching for predicates. It aims to find predicates that have high truth values, are small, and involve as many of the variables in the examples as possible. In this equation, the evaluation is directly proportional to the truth value and to the number of variables used, and inversely proportional to the size of the predicate.
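The representation and evaluation described in this section can be sketched as follows. The `Node` class, the operator labels and the helper names are hypothetical. The score returned by `evaluate` (truth value times number of distinct variables, divided by tree size) is only an assumed stand-in that matches the stated proportionalities; it is not equation 9, and constants in the tree are not handled.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """General-tree node: an operator with children, or a variable leaf."""
    label: str
    children: list = field(default_factory=list)

def conj(values):
    return math.prod(values) ** (1.0 / len(values))

def disj(values):
    return 1.0 - math.prod(1.0 - v for v in values) ** (1.0 / len(values))

def truth(node, record):
    """Truth value of the predicate for one example (dict: variable -> truth value)."""
    if not node.children:
        return record[node.label]
    vals = [truth(child, record) for child in node.children]
    if node.label == "not":
        return 1.0 - vals[0]
    if node.label == "and":
        return conj(vals)
    if node.label == "or":
        return disj(vals)
    if node.label == "implies":  # equation 4
        return disj([1.0 - vals[0], vals[1]])
    if node.label == "iff":      # equation 6, built on equation 4
        return conj([disj([1.0 - vals[0], vals[1]]), disj([1.0 - vals[1], vals[0]])])
    raise ValueError(f"unknown operator: {node.label}")

def leaves(node):
    """Variable names at the terminal nodes, with repetitions."""
    if not node.children:
        return [node.label]
    return [name for child in node.children for name in leaves(child)]

def size(node):
    """Number of nodes, terminal and internal."""
    return 1 + sum(size(child) for child in node.children)

def evaluate(tree, examples, require_rule=False):
    names = leaves(tree)
    if len(names) != len(set(names)):
        return 0.0  # penalty: repeated variables
    if require_rule and tree.label != "implies":
        return 0.0  # penalty: a rule was requested but the root is not an implication
    t = conj([truth(tree, record) for record in examples])  # universal quantifier
    return t * len(set(names)) / size(tree)  # ASSUMED form, not the paper's equation 9

examples = [{"age": 0.9, "diabetes": 0.8}, {"age": 0.7, "diabetes": 0.9}]
rule = Node("implies", [Node("age"), Node("diabetes")])
print(evaluate(rule, examples, require_rule=True))
```

Keeping `evaluate` separate from the tree structure mirrors the declarative approach of the method: the search algorithm only needs a function that maps a tree to a number.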

4 Metaheuristic Approach for Searching

The metaheuristic approach uses two basic mechanisms: the evaluation mechanism and the search mechanism. Both are implemented separately and independently. This facilitates scalability and flexibility when the requirements on the predicates to be obtained, or the algorithm used, change. The method for obtaining predicates proposed in this paper uses three fundamental and independent components:
• Mutations (generate new solutions from others).
• Evaluation (seen in the previous section).
• Search algorithm.

For mutations we define a set of general mutation operators, plus more specific ones that depend on the knowledge structure to be obtained. The general mutation operators are as follows:

Operators 1
• A terminal node is replaced by another terminal node, taken at random.
• An internal node is replaced by another internal node, taken at random.
• A subtree is replaced by a terminal node, both selected at random.
• A terminal node is replaced by a subtree, both selected at random.
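The four general mutations of Operators 1 can be sketched as below. This is an illustration under simplifying assumptions: only binary connectives are used (negation is left out so that every internal node has exactly two children), and `Node`, `all_nodes`, `random_subtree` and `mutate` are hypothetical names. The mutation works on a copy, so the current solution is left untouched for comparison.

```python
import copy
import random
from dataclasses import dataclass, field

BINARY_OPERATORS = ["and", "or", "implies", "iff"]

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    @property
    def is_terminal(self):
        return not self.children

def all_nodes(node):
    """Every node of the tree, in pre-order."""
    nodes = [node]
    for child in node.children:
        nodes.extend(all_nodes(child))
    return nodes

def random_subtree(variables, rng):
    """Hypothetical helper: one operator applied to two distinct variables."""
    left, right = rng.sample(variables, 2)
    return Node(rng.choice(BINARY_OPERATORS), [Node(left), Node(right)])

def mutate(tree, variables, rng=random):
    """Apply one randomly chosen mutation of Operators 1 to a copy of the tree."""
    new_tree = copy.deepcopy(tree)
    nodes = all_nodes(new_tree)
    terminals = [nd for nd in nodes if nd.is_terminal]
    internals = [nd for nd in nodes if not nd.is_terminal]
    operator = rng.randint(1, 4)
    if operator in (2, 3) and not internals:
        operator = 1  # a single-node tree has no internal node to change
    if operator == 1:    # terminal -> another terminal
        node = rng.choice(terminals)
        node.label = rng.choice([v for v in variables if v != node.label])
    elif operator == 2:  # internal -> another internal (connective)
        node = rng.choice(internals)
        node.label = rng.choice([op for op in BINARY_OPERATORS if op != node.label])
    elif operator == 3:  # subtree -> terminal
        node = rng.choice(internals)
        node.label, node.children = rng.choice(variables), []
    else:                # terminal -> subtree
        node = rng.choice(terminals)
        subtree = random_subtree(variables, rng)
        node.label, node.children = subtree.label, subtree.children
    return new_tree

names = ["age", "bmi", "hypertension", "diabetes"]
start = Node("and", [Node("age"), Node("bmi")])
print(mutate(start, names, random.Random(0)))
```

The structure-specific operators (Operators 2) can be obtained from this sketch by restricting which nodes are eligible, for example by excluding the root and the class variable and by disabling the second mutation.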

To obtain different knowledge structures we define the following more specific operators:

Operators 2

Operators to obtain classification rules:
1. Set the implication operator (→) as the root node.
2. Set the variable representing the class as the consequent of the rule.
3. Use the conjunction operator (∧) in the antecedent of the rule.
4. Mutate only the antecedent of the rule using Operators 1, except mutation number 2, since the logical connective does not change: it is always the conjunction (∧).

Operators for supervised learning:
1. Set the double implication operator (↔) as the root node.
2. Set the variable representing the class as one of the sides of the equivalence.
3. Use the conjunction (∧) as the logical connective between variables.
4. Mutate the side of the equivalence that does not contain the variable representing the class using Operators 1, except mutation number 2, since the logical connective does not change: it is always the conjunction (∧).

Operators to obtain clusters:
1. Set the disjunction operator (∨) as the root node of the tree.
2. Set conjunctive clauses (∧) as its subtrees.
3. Mutate only the conjunctive clauses using Operators 1, except mutation number 2, since the logical connective does not change: it is always the conjunction (∧).

The search for CFL predicates begins by generating initial solutions. How they are generated depends on the knowledge one wishes to obtain. If we wish to obtain classification rules, clusters, etc., we use the mutation operators that produce such a structure directly. Otherwise, the initial solution is generated entirely at random, starting from a tree that represents a predicate with a single node: the terminal node 0.5. Subsequently, 10 mutations from Operators 1 are applied to this starting node to obtain a tree. In both cases, at each iteration a new solution is generated from the previous one: the evaluation of the current tree (current solution) is compared with that of the mutated tree (candidate solution), and the tree with the better evaluation is kept. The search ends when it finds a tree with the desired truth value or when the predefined number of iterations has been run.
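The search loop just described is a stochastic hill climber, and it can be written independently of how solutions are represented. The sketch below is generic: `hill_climb`, `mutate` and `evaluate` are hypothetical names, and the numeric toy problem in the usage lines only stands in for predicate trees so that the block stays self-contained. A candidate replaces the current solution only when its evaluation is strictly better, which is one reading of "the tree with the better evaluation".

```python
import random

def hill_climb(initial, mutate, evaluate, iterations=800, target=None, rng=random):
    """Stochastic hill climbing: keep the mutated candidate only if it scores better.

    Stops early once the current score reaches `target`, otherwise runs for
    the predefined number of iterations.
    """
    current, current_score = initial, evaluate(initial)
    for _ in range(iterations):
        if target is not None and current_score >= target:
            break
        candidate = mutate(current, rng)
        candidate_score = evaluate(candidate)
        if candidate_score > current_score:
            current, current_score = candidate, candidate_score
    return current, current_score

# Toy stand-in for the predicate search: find the integer that maximizes -(x - 3)^2.
def toy_mutate(x, rng):
    return x + rng.choice([-1, 1])

def toy_evaluate(x):
    return -(x - 3) ** 2

best, score = hill_climb(0, toy_mutate, toy_evaluate, rng=random.Random(1))
print(best, score)
```

For predicate discovery, `initial` would be the starting tree, `mutate` one of the mutation operator sets and `evaluate` the objective function; swapping the search algorithm then requires no change to either of them.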

5 Experiments

To test the proposed method it was necessary to implement an experimental tool that would enable experiments on real data. The tool was implemented in the Visual Prolog programming language. This is an object-oriented logic programming language based on Prolog which, like Prolog, has a high capacity for deduction to find relationships between the objects created, variables and lists [9]. Experiments were designed to obtain association rules, general predicates, classification rules and supervised learning. We used 800 iterations of SHC to search for each of the predicates. The main metric used to assess the quality of the results is the truth value that the predicates take in the data set.

The diabetes database is derived from a study of a group of persons identified as diabetic or at risk of diabetes. It contains actual data taken from the results of a survey conducted by the Health Center of Jaruco, Mayabeque province, Cuba. It is worth noting that the data used in the experiments had to be preprocessed, since the method works with linguistic labels and truth values and not with the raw values of the variables. It was therefore necessary to assign, for each variable, degrees of membership with respect to previously defined fuzzy sets.


The diabetes database contains 7 variables that describe a set of characteristics associated with patients. These variables are:
1. Age
2. Race
3. Hypertension
4. Body Mass Index (BMI)
5. Cardiovascular and/or Cerebral Vascular Accident (CVA) antecedents (both referred to as “Antecedents”)
6. Sex
7. Classification of diabetes (Diabetes)

For each of the variables the following labels and membership functions are established:

Age:
• Universe of discourse U = {Set of all ages}
• Membership function: Sigmoid
• Tag: “Old Age”

The Sigmoid function has the following equation:

$$S(x; \gamma, \beta) = \frac{1}{1 + e^{-\alpha (x - \gamma)}}, \qquad \alpha = \frac{\ln 0.9 - \ln 0.1}{\gamma - \beta}$$

where γ is the value of x at which the truth value is 0.5 (as true as false) and β is the value of x at which the truth value is 0.1 (almost false). The parameters of the Sigmoid function are thus fixed by two values: the value from which the statement in the predicate is considered true (truth value 0.5), and the value at which the data is considered unacceptable (truth value 0.1). To define that a patient has an “Old Age”, the truth value 0.5 was assigned to age 40 years and the truth value 0.1 to age 19 years.

Race:
• Universe of discourse U = {White, Mixed race, Black}
• Membership function: Singleton
• Tag: “White Race”

To define a patient with “White Race”, a truth value is assigned to each element as follows: White Race = {White|1, Mixed race|0.5}. Black represents the value zero.
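The two kinds of membership function used for Age and Race can be sketched in Python. The parameter names `gamma` (the value with truth 0.5) and `beta` (the value with truth 0.1) are assumed here, as are the function and dictionary names; the slope is derived from the two anchor points so that the curve passes exactly through them.

```python
import math

def sigmoid_membership(x, gamma, beta):
    """Sigmoid membership: truth value 0.5 at x = gamma and 0.1 at x = beta."""
    alpha = (math.log(0.9) - math.log(0.1)) / (gamma - beta)
    return 1.0 / (1.0 + math.exp(-alpha * (x - gamma)))

def old_age(age_in_years):
    """Truth value of "Old Age": 0.5 at 40 years, 0.1 at 19 years."""
    return sigmoid_membership(age_in_years, gamma=40, beta=19)

# Singleton membership for "White Race": one truth value per category.
WHITE_RACE = {"White": 1.0, "Mixed race": 0.5, "Black": 0.0}

for age in (19, 40, 61):
    print(age, round(old_age(age), 3))
print(WHITE_RACE["Mixed race"])
```

By symmetry of the sigmoid, the truth value reaches 0.9 at the same distance above γ as β lies below it, which for these parameters is 61 years.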


Hypertension:
• Universe of discourse U = {Hypertensive Detected, Risk of Hypertension, No Risk Group}
• Membership function: Singleton
• Tag: “Significant Hypertension”

To define a patient with “Significant Hypertension”, a truth value is assigned to each element as follows: Significant Hypertension = {Hypertensive Detected|1, Risk of Hypertension|0.5}. No Risk Group represents the value zero.

BMI:
• Universe of discourse U = {Set of all possible values of BMI}
• Membership function: Singleton
• Tag: “High BMI”

To define a patient with “High BMI”, the truth value 0.5 was assigned to BMI = 25 kg/m2 and the truth value 0.1 to BMI = 17 kg/m2.

Antecedents:
• Universe of discourse U = {Antecedents Detected, Mild Antecedents, No Risk Group}
• Membership function: Singleton
• Tag: “Significant Antecedents”

To define a patient with “Significant Antecedents”, a truth value is assigned to each element as follows: Significant Antecedents = {Antecedents Detected|1, Mild Antecedents|0.5}. No Risk Group represents the value zero.

Sex:
• Universe of discourse U = {Possible Sexes}
• Membership function: Singleton

In the case of sex, as there are only two possible values, we assign the value 1 to males and 0 to females.

Diabetes:
• Universe of discourse U = {Detected Diabetic, Risk Group, Alteration of Fasting Glucose, No Risk Group}
• Membership function: Singleton
• Tag: “Degree of Diabetes”


To define the “Degree of Diabetes”, a truth value is assigned to each element as follows: Degree of Diabetes = {Detected Diabetic|1, Risk Group|0.8, Alteration of Fasting Glucose|0.5}. No Risk Group represents the value zero.

After processing all the data, we performed the experiments using the Stochastic Hill Climber as the metaheuristic algorithm. Each predicate was obtained with 800 iterations of the Hill Climber. The following are some of the general predicates obtained. All have truth values above 0.80. In these results, the variables advanced age and certain antecedents frequently appear together. They also show a strong relationship between high body mass, certain antecedents, advanced age, hypertension and diabetes classification.

General Predicates:

Truth value: 0.9256

Truth value: 0.9205

Truth value: 0.8603

Truth value: 0.8200

Truth value: 0.8131

Truth value: 0.8124

The following are some of the association rules obtained. All predicates have truth values above 0.80. One feature that stands out in these results is the presence of male sex and white race in some predicates. Another is that the influence of advanced age appears in almost all predicates. Advanced age, antecedents, diabetic classification and hypertension also frequently appear together.


Association Rules:

Truth value: 0.9532

Truth value: 0.9189

Truth value: 0.9166

Truth value: 0.9125

Truth value: 0.9036

Truth value: 0.8320

Shown below are a few of the classification rules obtained in the experiments. In these results it is important to note the influence of high body mass, advanced age, hypertension and certain antecedents on having diabetes. They also show the presence of male sex and white race among the patients suffering from this disease.

Classification Rules:

Truth value: 0.9741

Truth value: 0.9014

Truth value: 0.8907

Truth value: 0.8713

Truth value: 0.8347

Truth value: 0.7173

The predicates obtained by supervised learning are shown below. The truth values achieved are lower than in the previous cases. The main feature shown in these results is the influence of being old, having antecedents and suffering from hypertension on the onset of diabetes. A certain influence of white race is also observed.

Supervised learning:

Truth value: 0.8941

Truth value: 0.8632

Truth value: 0.8574

Truth value: 0.7041

Truth value: 0.7014

Truth value: 0.6912

Observations: In these experiments the majority of predicates reach a truth value above 0.80. The main relationships found in the predicates obtained are:
• Relationship between advanced age, presence of hypertension and obesity.
• Relationship between suffering from antecedents and having advanced age.


• Relationship between suffering from antecedents, hypertension, obesity and having diabetes.
• Relationship between the male sex and the presence of diabetes.
• Relationship between the white race and the presence of diabetes.

Besides using high truth values as a measure, we compared the results with the known characteristics of diabetes. According to investigations, obesity increases the risk of diabetes and the risk of developing hypertension. Diabetes and hypertension commonly coexist; the appearance of both is common in the elderly [5, 9]. Comparing the predicates obtained with the real characteristics of diabetes reveals many similarities. The predicates obtained therefore truthfully describe the relationships associated with diabetes. With respect to sex and race, any person may have diabetes irrespective of either. Therefore, the influence of these two features in the analyzed data can be considered a novel discovery.

6 Conclusion

The experimental results indicate that this approach is able to obtain logical predicates that reflect reality. The proposal does not replace existing methods for knowledge discovery, but provides a general and flexible approach that enables a new way to extract knowledge. In the near future we plan to optimize the tool to obtain different knowledge structures and to combine different metaheuristic algorithms. We also intend to experiment with larger volumes of data and to compare the results with those of other knowledge discovery tools.

References

1. Konar, A.: Artificial Intelligence and Soft Computing: Behavioral and Cognitive Modeling of the Human Brain. CRC Press LLC (2000)
2. Rosete, A.S.: Una solución flexible y eficiente para el trazado de grafos basada en el Escalador de Colinas Estocástico. PhD thesis, ISPJAE (2000)
3. Dubois, D., Prade, H.: Fuzzy sets and systems: theory and applications. Academic Press, New York (1980)
4. Messerli, F., Bell, D., Bakris, G.: El carvedilol no modifica el peso ni el Índice de masa corporal de los pacientes con diabetes tipo 2 e hipertensión. American Journal of Medicine 120(7), 3–62 (2007)
5. Zimmermann, H.J.: Fuzzy Set Theory and its applications. Kluwer Academic Publishers (1996)
6. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. The MIT Press (1994)
7. Espín, R.A., Fernández, E.G.: La lógica difusa compensatoria: Una plataforma para el razonamiento y la representación del conocimiento en un ambiente de decisión muticriterio. In: Plaza, Valdés (eds.) Multicriterio para la Toma de Decisiones: Métodos y Aplicaciones, pp. 338–349 (2009)
8. Espín, R.A., Mazcorro, G.T., Fernández, E.G.: Consideraciones sobre el carácter normativo de la lógica difusa compensatoria. In: Evaluación y Potenciación de Infraestructuras de Datos Espaciales para el desarrollo sostenible en América Latina y el Caribe, Idict edn., pp. 28–40 (2007)
9. Randall, S.: A Guide to Artificial Intelligence with Visual Prolog. Outskirts Press (2010)
10. Zegarra, T., Guillermo, G., Caceres, C., Lenibet, M.: Características sociodemográficas y clínicas de los pacientes diabéticos tipo 2 con infecciones adquiridas en la comunidad admitidos en los servicios de medicina del Hospital Nacional Cayetano Heredia. Scielo 11(3), 3–62 (2000)
