Fundamenta Informaticae 132 (2014) 315–330 DOI 10.3233/FI-2014-1046 IOS Press
Application of Rough Set Theory to Prediction of Antimicrobial Activity of Bis-Quaternary Imidazolium Chlorides
Łukasz Pałkowski∗, Jerzy Krysiński Nicolaus Copernicus University Collegium Medicum Department of Pharmaceutical Technology, Jurasza 2, 85-089 Bydgoszcz, Poland {lukaszpalkowski, jerzy.krysinski}@cm.umk.pl
Jerzy Błaszczyński, Roman Słowiński† Poznań University of Technology Institute of Computing Science Piotrowo 2, 60-965 Poznań, Poland {jblaszczynski, rslowinski}@cs.put.poznan.pl
Andrzej Skrzypczak, Jan Błaszczak Poznań University of Technology Institute of Chemical Technology Skłodowskiej-Curie 2, 60-965 Poznań, Poland
Eugenia Gospodarek, Joanna Wróblewska Nicolaus Copernicus University Collegium Medicum Department of Microbiology Skłodowskiej-Curie 9, 85-094 Bydgoszcz, Poland
Abstract. The paper investigates relationships between chemical structure, surface active properties and antibacterial activity of 70 bis-quaternary imidazolium chlorides. Chemical structure and properties of imidazolium chlorides were described by 7 condition attributes and antimicrobial properties were mapped by a decision attribute. Dominance-based Rough Set Approach (DRSA) was applied to discover a priori unknown rules exhibiting monotonicity relationships in the data, which hold in some parts of the evaluation space. Strong decision rules discovered in this way may enable creating prognostic models of new compounds with favorable antimicrobial properties. Moreover, the relevance of the attributes estimated from the discovered rules makes it possible to distinguish which structure and surface active properties describe compounds with the most preferable and the least preferable antimicrobial properties.
Keywords: Structure Activity Relationship (SAR), Rough set theory, Dominance-based Rough Set Approach (DRSA), Confirmation measures
∗ Address for correspondence: Nicolaus Copernicus University, Collegium Medicum, Department of Pharmaceutical Technology, Jurasza 2, 85-089 Bydgoszcz, Poland
† Also works: Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
1. Introduction
In this paper, we propose a new methodology of Structure Activity Relationship (SAR) analysis employing the Dominance-based Rough Set Approach (DRSA) [9]. SAR analysis is based on the assumption that a change in the structure of a chemical compound causes changes in its physicochemical properties and its biological activity. SAR analysis leads, on the one hand, to an explanation of important structure-activity relationships in existing compounds and, on the other hand, to the construction of models able to predict the activity of new entities [11, 21]. The interest in this kind of research is motivated by the search for new compounds having desirable properties. The compounds analyzed in this paper are a group of bis-quaternary imidazolium chlorides with good antimicrobial activity [18]. Quaternary imidazolium compounds have good antielectrostatic properties and are used in the cosmetic, textile and pharmaceutical industries [8]. Their broad spectrum of antimicrobial activity, low toxicity, good stability and the lack of odour of usable solutions make these compounds widely used as disinfectants and cleaning agents in the food and pharmaceutical industries and in hospitals [15, 16]. Numerous tests indicated that antimicrobial properties of bis-quaternary imidazolium chlorides depend on their structure (e.g. length of the alkyl group) [12]. These tests were based on statistical methods (regression) and non-statistical methods (artificial neural networks) [21, 22, 23]. The classical rough set approach was also applied in SAR analysis [14]. The main goal of the analysis presented in this paper is identification of the relationship between structure and surface active properties of newly synthesized n-alkyl-bis-N-alkoxy-N-alkyl imidazolium chlorides.
The choice of DRSA for this type of analysis is motivated by the aim of discovering synthetic rules that exhibit monotonic relationships between structure and surface active properties of the compounds on the one hand, and their biological activity on the other hand. DRSA is able to deal with possible inconsistencies in the information table prior to discovery of knowledge. Then, the sufficiently consistent part of the data can be used to discover decision rules showing important relationships between chemical structure, surface active properties of compounds and their antimicrobial properties. These rules are called monotonic because of their syntax: “if evaluation of object a is greater (or smaller) than given values of some condition attributes, then a belongs to at least (at most) given class”. This syntax takes into account that condition and decision attributes are ordinal and monotonically related. When monotonic relationships are not known a priori and may change from positive to negative at many points of the range of variation of condition attributes, one should be able to discover local monotonicity relationships. A local monotonicity relationship becomes global if it is positive or negative in the whole evaluation space. For example, considering room temperature as condition attribute, and comfort as decision attribute, instead of assuming that the higher (or the lower) the room temperature, the higher the comfort, it is reasonable to split this monotonicity relationship into two local relationships: up to some value of the temperature the monotonic relationship is positive, and above this value it is negative. Even in the case of binary attributes, corresponding to presence/absence of a property indicated by condition and decision attributes, the concept of monotonicity makes sense, because the presence of one property may be more credible when another property holds, or vice versa.
For example, considering weather condition (sunny/rainy) as condition attribute, and playing golf (yes/no) as decision attribute, it is reasonable to expect that “yes” decision is more credible under sunny than
rainy weather, which corresponds to a monotonicity relationship among 0-1 coded condition and decision attributes. As it was shown in [5], using DRSA on a properly (non-invasively) transformed information table, one is able to discover rules showing local monotonic relationships in some parts of the evaluation space, which is not possible using the classical rough set approach. We apply this transformation to discover local and/or global monotonicity relationships in the analyzed data. The rough set approach appears to be more suitable for analysis of qualitative data than many statistical methods, which are well suited only for quantitative data. In the analyzed information table, surface active properties of compounds were described by quantitative parameters, whereas their chemical structure was described by qualitative structural parameters. While the discovered rules show relationships between condition attributes and the decision attribute, it is also important to inquire about the relevance of individual condition attributes in these relationships. To measure this relevance, we apply a Bayesian confirmation measure. In the course of the analysis, jRS and jMAF¹ software were used. The paper is organized as follows. In the next section, material and methods are presented, including description of analyzed compounds, transformation of the information table, application of DRSA for rule discovery, and attribute relevance measurement by Bayesian confirmation. In the third section, results are presented and discussed. The paper is concluded in the last section.
2. Material and Methods
2.1. Material
For the sake of the analysis, 70 objects that represent bis-imidazolium quaternary chlorides were examined. Surface active properties of analyzed chlorides were described by the following parameters:
• CMC - critical micelle concentration (mol/L)
• γCMC - value of surface tension at critical micelle concentration (mN/m)
• Γ · 10^6 - value of surface excess (mol/m²)
• A · 10^20 - molecular area of a single particle (m²)
• ∆Gads - free energy of adsorption of molecule (kJ/mol)
Figure 1. Structure of analyzed compounds
Structure properties of chlorides were described by parameters (see Figure 1):
• n - number of carbon atoms in n-substituent
• R - number of carbon atoms in R-substituent
¹ http://www.cs.put.poznan.pl/jblaszczynski/Site/jRS.html
Table 1. Numerical coding of the structure condition attributes

Code   n        R
1      -        CH3
2      C2H5     C2H5
3      C3H7     C3H7
4      C4H9     C4H9
5      C5H11    C5H11
6      C6H13    C6H13
7      -        C7H15
8      -        C8H17
9      -        C9H19
10     -        C10H21
11     -        C11H23
12     -        C12H25
14     -        C14H29
16     -        C16H33
Staphylococcus aureus ATCC 25213 microorganisms were used to evaluate antibacterial activity of compounds by minimal inhibitory concentration (MIC). Detailed results of biological evaluation can be found in [19]. In the classical rough-set approach, it is necessary to discretize all attributes with continuous values. In this analysis, made by DRSA, the continuous attributes which reflect surface active properties are not discretized. Domains of structure attributes are presented in Table 1. According to the value of MIC, objects were sorted into three decision classes:
• good - good antimicrobial properties: MIC ≤ 0.02 µM/L,
• medium - medium antimicrobial properties: 0.02 < MIC < 1 µM/L,
• weak - weak antimicrobial properties: MIC ≥ 1 µM/L.
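The class thresholds above can be written as a small function. This is only an illustrative sketch: the function name `mic_to_class` is hypothetical, and the sample values are MIC entries taken from the information table (Table 2).

```python
def mic_to_class(mic: float) -> str:
    """Map a MIC value (in µM/L) to one of the three decision classes."""
    if mic <= 0.02:
        return "good"      # MIC <= 0.02
    if mic < 1.0:
        return "medium"    # 0.02 < MIC < 1
    return "weak"          # MIC >= 1


# MIC values of three compounds from Table 2
examples = [0.01826, 0.038492, 1.337692]
labels = [mic_to_class(m) for m in examples]
print(labels)
```

Note that both boundary values belong to the extreme classes: MIC = 0.02 is still good and MIC = 1 is already weak.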
2.2. Discovery of decision rules using DRSA
In the Dominance-based Rough Set Approach (DRSA) (for a complete presentation of DRSA see, for example, [9, 20]), information about objects (classification examples) is represented in the form of an information table. The rows of the table are labeled by objects, whereas columns are labeled by attributes and entries of the table are attribute-values. The set of attributes is, in general, divided into set C of condition attributes and set D of decision attributes (in most cases, a singleton decision attribute d). Condition attributes whose value sets are ordered are called ordinal attributes. Without loss of generality, for ordinal attribute q ∈ C, φ : U → R, for all objects x, y ∈ U, φ(x) ≥ φ(y) means “x is evaluated
at least as high as y on ordinal attribute q”, which is denoted x ⪰_q y. Therefore, it is supposed that ⪰_q is a complete preorder, i.e. a strongly complete and transitive binary relation, defined on U on the basis of evaluations φ(·). Ordinal attribute q may have positive or negative monotonic relationship with the decision attribute d (which is also ordinal). Positive relationship means that the greater the value of the condition attribute the higher the class label (i.e. the value of decision attribute), and negative relationship means that the greater the value of condition attribute the lower the class label. Furthermore, values of decision attribute d make a partition of U into a finite number of decision classes, X = {X_t, t = 1, ..., n}, such that each x ∈ U belongs to one and only one class X_t ∈ X. It is supposed that the classes are ordered, i.e. for all r, s ∈ {1, ..., n}, such that r > s, the objects from X_r are in a higher class than the ones from X_s. More formally, if ⪰ is a comprehensive weak order relation on U, i.e. if for all x, y ∈ U, x ⪰ y means “x is ranked at least as high as y”, it is supposed: [x ∈ X_r, y ∈ X_s, r > s] ⇒ [x ⪰ y and not y ⪰ x]. If it is not so, then we observe an inconsistency between x and y. The above assumptions are typical for consideration of ordinal classification problems with monotonicity constraints, also called multiple criteria sorting problems. As it was shown in [5], non-ordinal classification problems can be analyzed by DRSA. Such problems need a proper transformation of the information table. This transformation is non-invasive, i.e. it does not bias the matter of discovered relationships. The intuition which stands behind this transformation is the following. In case of ordinal condition attributes, for which the presence and the sign of the monotonicity relationship between values of condition and decision attributes is known a priori, no transformation is required and DRSA can be applied directly.
Each non-ordinal condition attribute, for which the presence or absence and the possible sign of the monotonicity relationship is not known a priori, is doubled and for the first attribute in the pair it is supposed that the monotonicity relationship is potentially positive, while for the second attribute, that it is potentially negative. Due to this transformation, using DRSA one will be able to find out if the actual monotonicity is global or local, and if it is positive or negative. The decision attributes are transformed such that:
• in case of a non-ordinal attribute, it is replaced by a new decision attribute making partition of U into two subsets of objects: class X_t and its complement ¬X_t, for t = 1, ..., n,
• in case of an ordinal attribute, it is replaced by a new decision attribute making partition into two subsets of objects: those belonging to class X_t or better (at least t), and those belonging to class X_{t−1} or worse (at most t − 1), for t = 2, ..., n.
To discover rules relating values of condition attributes with class assignment, in case of non-ordinal classification problems, we have to consider n ordinal binary classification problems with two sets of objects: class X_t and its complement ¬X_t, t = 1, ..., n, which are number-coded by 1 and 0, respectively. We also assume, without loss of generality, that the value sets of all non-ordinal condition attributes are number-coded. While this is natural for numerical attributes, nominal attributes must be binarized and get 0-1 codes for absence or presence of a given nominal value. In this way, the value sets of all non-ordinal attributes get ordered (as all sets of numbers are ordered). Now, to apply DRSA, we transform the data table such that each number-coded attribute is cloned (doubled). It is assumed that the value set of each original number-coded attribute q′ is positively monotonically dependent on the decision, i.e.
the greater the value of the condition attribute, the higher the number code (rather 1 than 0) of the class assignment, and the value set of its clone q″ is negatively monotonically dependent on the decision, i.e. the greater the value of the condition attribute, the lower the number code (rather
0 than 1) of the class assignment. Then, using DRSA, we get rough approximations of class X_t and its complement ¬X_t, t = 1, ..., n. These approximations serve to induce “if..., then...” decision rules recommending assignment to class X_t (argument pros) or to its complement ¬X_t (argument cons). Due to cloning of attributes with opposite monotonicity relationships, we can have rules that cover a subspace in the condition attribute space, which is bounded from the top and from the bottom. More precisely, for an attribute q, with clones q′ and q″ assuming positive and negative monotonicity, respectively, the elementary conditions employing these cloned attributes in a rule can be as follows: q′(x) ≥ s and q″(x) ≤ r, where x ∈ U, and s ≤ r belong to the value set of q. This means that the joint elementary condition expressed in terms of q will be: q(x) ∈ [s, r]. In consequence, all of this leads (without discretization) to more synthetic rules than those resulting from induction techniques specific to non-ordinal classification problems. The syntax of decision rules is the following: if E then H, which can also be denoted as E → H. A rule consists of a condition part (also called premise, or evidence) E, and a decision part (also called conclusion, prediction or hypothesis) H. The condition part of the rule is a conjunction of elementary conditions concerning individual attributes, and the decision part of the rule suggests assignment to a decision class or to a union of decision classes. The rules that are considered in DRSA analysis are mainly certain rules. Rules of this kind cover only consistent objects. Decision rules represent the most important cause-effect dependencies between values of condition attributes and value of decision attribute. These rules are not only confined to important condition attributes, but also include the minimal number of elementary conditions indispensable for presentation of the dependencies.
The set of decision rules may be understood as the presentation of cause-effect dependencies in the information table, from which all inessential and redundant information was removed. The rules are characterized by various parameters, such as strength (i.e. the proportion of objects covered by premise that are also covered by conclusion), or confirmation (i.e. measure that is quantifying the degree to which premise provides evidence for conclusion; see [10]).
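The cloning of attributes and the resulting interval conditions can be illustrated with a minimal sketch. It is not the jRS or jMAF implementation: the helper names (`clone_attributes`, `covers`), the `_gain`/`_cost` suffixes standing for q′ and q″, and the example thresholds are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Obj = Dict[str, float]
Condition = Tuple[str, str, float]  # (attribute, operator, threshold)


def clone_attributes(obj: Obj, attrs: List[str]) -> Obj:
    """Double each attribute q into q' (suffix _gain) and q'' (suffix _cost)."""
    out: Obj = {}
    for q in attrs:
        out[q + "_gain"] = obj[q]  # q': assumed positively related to the decision
        out[q + "_cost"] = obj[q]  # q'': assumed negatively related to the decision
    return out


def covers(premise: List[Condition], obj: Obj) -> bool:
    """True if the object satisfies every elementary condition of the premise."""
    for attr, op, thr in premise:
        value = obj[attr]
        if op == ">=":
            if not value >= thr:
                return False
        elif op == "<=":
            if not value <= thr:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


# Hypothetical premise: q'(x) >= 2.71 and q''(x) <= 3.13, i.e. q(x) in [2.71, 3.13]
premise = [("logCMC_gain", ">=", 2.71), ("logCMC_cost", "<=", 3.13)]

inside = clone_attributes({"logCMC": 2.71}, ["logCMC"])
below = clone_attributes({"logCMC": 2.15}, ["logCMC"])
above = clone_attributes({"logCMC": 3.52}, ["logCMC"])
print(covers(premise, inside), covers(premise, below), covers(premise, above))
```

Because both clones carry the same value, a lower bound on the first clone and an upper bound on the second one together describe an interval of the original attribute, which is how a rule can be bounded from both sides without discretization.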
2.3. Attribute relevance
We consider attribute relevance measures that satisfy the property of Bayesian confirmation [4]. These measures take into account interactions between attributes represented by decision rules. In this case, the property of confirmation is related to quantification of the degree to which the presence of an attribute in the premise of a rule provides evidence for or against the conclusion of the rule. The measure increases when more rules involving an attribute suggest a correct decision, or when more rules that do not involve the attribute suggest an incorrect decision, otherwise it decreases. Let us first give some basic definitions. Considering a decision rule and a finite set of condition attributes A = {a_1, a_2, ..., a_n}, we can define the condition part of the rule as a conjunction of elementary conditions on a particular subset of attributes:

$$
E = e_{i_1} \wedge e_{i_2} \wedge \ldots \wedge e_{i_p}, \tag{1}
$$

where {i_1, i_2, ..., i_p} ⊆ {1, 2, ..., n}, p ≤ n, and e_{i_h} is an elementary condition defined on the value set of attribute a_{i_h}, h ∈ {i_1, i_2, ..., i_p} (e.g., e_{i_h} ≡ a_{i_h} ≥ 0.5). The set of rules R induced from data set L can be applied to objects from L or to objects from a testing set T. A rule r ≡ E → H, r ∈ R, covers object x (x ∈ L or x ∈ T) if x satisfies the
condition part E. We say that the rule classifies x correctly if it covers x and x satisfies the decision part H. If the rule covers x but x does not satisfy the decision part H, then we say that the rule classifies x incorrectly. In other words, we say that rule r is true for object x if it classifies this object correctly, and it is not true otherwise. By a_i ▷ E we denote the fact that E includes an elementary condition e_i involving attribute a_i, i ∈ {1, 2, ..., n}. The opposite fact is denoted by a_i ⋫ E. Let us consider object x (x ∈ L or x ∈ T), and set of rules R. We use the following notation throughout the paper:
• a = |H ∧ (a_i ▷ E)| - the number of rules that correctly classify x and involve attribute a_i in the condition part,
• b = |H ∧ (a_i ⋫ E)| - the number of rules that correctly classify x and do not involve attribute a_i in the condition part,
• c = |¬H ∧ (a_i ▷ E)| - the number of rules that incorrectly classify x and involve attribute a_i in the condition part,
• d = |¬H ∧ (a_i ⋫ E)| - the number of rules that incorrectly classify x and do not involve attribute a_i in the condition part.
Given a set of objects T on which the set of rules R is applied, the values of a, b, c, d are defined over set T: a is then interpreted as the number of all rules that correctly classify objects from T and involve attribute a_i. Interpretation of the remaining parameters is analogous. The values of a, b, c, and d can also be treated as frequencies that may be used to estimate probabilities, e.g., Pr(H ∧ (a_i ▷ E)) = a/(a + b + c + d), or Pr(a_i ▷ E) = (a + c)/(a + b + c + d). Formally, a relevance measure c(H, (a_i ▷ E)) has the property of Bayesian confirmation if and only if it satisfies the following conditions:

$$
c(H,(a_i \triangleright E))
\begin{cases}
> 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) > \Pr(H),\\
= 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) = \Pr(H),\\
< 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) < \Pr(H).
\end{cases}
\tag{2}
$$
The conditions of definition (2) thus equate confirmation with an increase of the probability of the hypothesis caused by the evidence, and disconfirmation with a decrease of the probability of the hypothesis caused by the evidence. Finally, neutrality is identified in case of lack of influence of evidence on hypothesis. Now, there are at least three logically equivalent ways to express the fact that a_i ▷ E confirms H [10] in the context of the Kolmogorov theory of probability [13]. Namely: Pr(H|(a_i ▷ E)) > Pr(H), Pr(H|(a_i ▷ E)) > Pr(H|(a_i ⋫ E)), and Pr((a_i ▷ E)|H) > Pr((a_i ▷ E)|¬H). The second way is especially interesting for our purposes. It allows for a redefinition of the relevance measure satisfying the Bayesian confirmation (2) to the form of the following conditions:

$$
c(H,(a_i \triangleright E))
\begin{cases}
> 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) > \Pr(H \mid (a_i \ntriangleright E)),\\
= 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) = \Pr(H \mid (a_i \ntriangleright E)),\\
< 0 & \text{if } \Pr(H \mid (a_i \triangleright E)) < \Pr(H \mid (a_i \ntriangleright E)).
\end{cases}
\tag{3}
$$

When probabilities are estimated in terms of frequencies, (2) and (3) may be expressed in terms of a, b, c, and d. In this study, we use the normalized confirmation measure c1, defined in [10].
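The counting of a, b, c and d can be sketched as follows. This is a toy illustration under stated assumptions: the `Rule` container, the function `count_abcd`, the two rules and the three objects are all invented for the example and are not rules induced in this study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Rule:
    attrs: Set[str]                  # attributes used in the premise E
    covers: Callable[[Dict], bool]   # True if the object satisfies E
    holds: Callable[[Dict], bool]    # True if the object satisfies H


def count_abcd(rules: List[Rule], objects: List[Dict], attr: str) -> Tuple[int, int, int, int]:
    """Count rule applications by correctness and by involvement of attr."""
    a = b = c = d = 0
    for x in objects:
        for r in rules:
            if not r.covers(x):
                continue                 # a rule says nothing about uncovered objects
            involved = attr in r.attrs
            if r.holds(x):
                if involved:
                    a += 1               # correct, attribute in premise
                else:
                    b += 1               # correct, attribute not in premise
            elif involved:
                c += 1                   # incorrect, attribute in premise
            else:
                d += 1                   # incorrect, attribute not in premise
    return a, b, c, d


# Toy rules with conclusion "good" (hypothetical thresholds)
rules = [
    Rule({"logCMC"}, lambda x: x["logCMC"] >= 2.7, lambda x: x["cls"] == "good"),
    Rule({"R"}, lambda x: x["R"] >= 7, lambda x: x["cls"] == "good"),
]
# Toy objects (hypothetical)
objects = [
    {"logCMC": 2.8, "R": 9, "cls": "good"},
    {"logCMC": 3.4, "R": 14, "cls": "medium"},
    {"logCMC": 2.2, "R": 8, "cls": "weak"},
]

a, b, c, d = count_abcd(rules, objects, "logCMC")
pr_h_given_attr = a / (a + c)          # estimate of Pr(H | a_i in E)
pr_h = (a + b) / (a + b + c + d)       # estimate of Pr(H)
print((a, b, c, d), pr_h_given_attr, pr_h)
```

In this toy case the estimate of Pr(H | a_i ▷ E) exceeds the estimate of Pr(H), so the presence of the attribute in the premise confirms the hypothesis in the sense of (2).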
Measure c1 has desirable properties, e.g., proper handling of extreme situations (no counterexamples or no positive examples for the hypothesis). It is a combination of two Bayesian confirmation measures Z(H, (a_i ▷ E)) and A(H, (a_i ▷ E)) defined as follows:

$$
Z(H,(a_i \triangleright E)) =
\begin{cases}
\dfrac{\Pr(H \mid (a_i \triangleright E)) - \Pr(H)}{1 - \Pr(H)} = \dfrac{ad-bc}{(a+c)(c+d)} & \text{in case of confirmation,}\\[2ex]
\dfrac{\Pr(H \mid (a_i \triangleright E)) - \Pr(H)}{\Pr(H)} = \dfrac{ad-bc}{(a+c)(a+b)} & \text{in case of disconfirmation.}
\end{cases}
\tag{4}
$$

$$
A(H,(a_i \triangleright E)) =
\begin{cases}
\dfrac{\Pr(H) - \Pr(H \mid (a_i \ntriangleright E))}{\Pr(H)} = \dfrac{ad-bc}{(a+b)(b+d)} & \text{in case of confirmation,}\\[2ex]
\dfrac{\Pr(H) - \Pr(H \mid (a_i \ntriangleright E))}{1 - \Pr(H)} = \dfrac{ad-bc}{(b+d)(c+d)} & \text{in case of disconfirmation.}
\end{cases}
\tag{5}
$$
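For the reader's convenience, the frequency form of Z in (4) can be checked directly from the frequency estimates given earlier. The shorthand N = a + b + c + d is introduced here only for this derivation and is not part of the original notation.

$$
\Pr(H \mid (a_i \triangleright E)) - \Pr(H)
= \frac{a}{a+c} - \frac{a+b}{N}
= \frac{aN - (a+b)(a+c)}{(a+c)N}
= \frac{ad - bc}{(a+c)N}.
$$

Dividing by $1 - \Pr(H) = (c+d)/N$ gives the confirmation case, and dividing by $\Pr(H) = (a+b)/N$ gives the disconfirmation case:

$$
\frac{ad-bc}{(a+c)(c+d)}, \qquad \frac{ad-bc}{(a+c)(a+b)}.
$$

The same steps with $\Pr(H \mid (a_i \ntriangleright E)) = b/(b+d)$ yield $\Pr(H) - \Pr(H \mid (a_i \ntriangleright E)) = (ad-bc)/((b+d)N)$, from which the two frequency forms of A in (5) follow.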
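Remark that these measures are complementary with respect to measuring how much Pr(H|(a_i ▷ E)) is greater or smaller than Pr(H|(a_i ⋫ E)). Measure Z(H, (a_i ▷ E)) reaches its maximum when Pr(H|(a_i ▷ E)) = 1 and its minimum when Pr(H|(a_i ▷ E)) = 0, while measure A(H, (a_i ▷ E)) reaches its maximum when Pr((a_i ▷ E)|H) = 1 and its minimum when Pr((a_i ▷ E)|H) = 0. Due to this complementarity, it is reasonable to consider a specific combination of measures Z(H, (a_i ▷ E)) and A(H, (a_i ▷ E)), denoted in [10] by c1(H, (a_i ▷ E)), defined as follows:

$$
c_1(H,(a_i \triangleright E)) =
\begin{cases}
\alpha + \beta\, A(H,(a_i \triangleright E)) & \text{in case of confirmation if } c = 0,\\
\alpha\, Z(H,(a_i \triangleright E)) & \text{in case of confirmation if } c > 0,\\
-\alpha + \beta\, A(H,(a_i \triangleright E)) & \text{in case of disconfirmation if } a = 0,\\
\alpha\, Z(H,(a_i \triangleright E)) & \text{in case of disconfirmation if } a > 0.
\end{cases}
\tag{6}
$$

Using the terms a, b, c, d, measure c1(H, (a_i ▷ E)) can be written as:

$$
c_1(H,(a_i \triangleright E)) =
\begin{cases}
\alpha + \beta\,\dfrac{ad-bc}{(a+b)(b+d)} & \text{if } \dfrac{a}{a+c} > \dfrac{a+b}{a+b+c+d} \wedge c = 0,\\[2ex]
\alpha\,\dfrac{ad-bc}{(a+c)(c+d)} & \text{if } \dfrac{a}{a+c} > \dfrac{a+b}{a+b+c+d} \wedge c > 0,\\[2ex]
-\alpha + \beta\,\dfrac{ad-bc}{(b+d)(c+d)} & \text{if } \dfrac{a}{a+c} < \dfrac{a+b}{a+b+c+d} \wedge a = 0,\\[2ex]
\alpha\,\dfrac{ad-bc}{(a+b)(a+c)} & \text{if } \dfrac{a}{a+c} < \dfrac{a+b}{a+b+c+d} \wedge a > 0,
\end{cases}
\tag{7}
$$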
where α > 0, β > 0, and α + β = 1. Analysis of c1(H, (a_i ▷ E)) with respect to desirable properties considered in [10] indicates that the best values of α and β are both equal to 0.5.
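A minimal sketch of the measure c1 computed from the counts a, b, c, d is given below. It follows the four cases listed above; the function name `c1`, the use of the sign of ad − bc to distinguish confirmation from disconfirmation, and the return value 0 for the neutral case are assumptions of this sketch rather than part of the definition in [10].

```python
def c1(a: int, b: int, c: int, d: int, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Confirmation measure c1 from rule counts a, b, c, d (sketch)."""
    det = a * d - b * c
    if det == 0:
        return 0.0  # neutrality; also covers empty margins (assumption of this sketch)
    if det > 0:     # confirmation: a/(a+c) > (a+b)/(a+b+c+d)
        if c == 0:
            return alpha + beta * det / ((a + b) * (b + d))
        return alpha * det / ((a + c) * (c + d))
    # disconfirmation: a/(a+c) < (a+b)/(a+b+c+d)
    if a == 0:
        return -alpha + beta * det / ((b + d) * (c + d))
    return alpha * det / ((a + b) * (a + c))


print(c1(5, 0, 0, 5))   # attribute present in all correct and no incorrect rules
print(c1(0, 5, 5, 0))   # attribute present only in incorrect rules
print(c1(3, 1, 0, 2))   # confirmation without counterexamples (c = 0)
print(c1(1, 1, 1, 2))   # confirmation with counterexamples (c > 0)
```

When ad − bc is positive, a and d are both positive, and when it is negative, b and c are both positive, so none of the denominators used in the four branches can be zero.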
3. Results and Discussion
3.1. Experimental setup
The experimental procedure is composed of the following steps:
Step 1 Setting up the information table that includes data for SAR analysis.
Step 2 Transformation of the information table to the form required by DRSA.
Step 3 Induction of decision rules from the transformed information table.
Step 4 Analysis of the relevance of condition attributes.
Table 2. Information table

      Condition attributes                                  Classification
ID    n   R    -logCMC   γCMC   Γ·10^6   A·10^20   ∆Gads    MIC (µM/L)   Decision class
1     2   1    2.15      61.9   2.75     52        20.2     30.95975     weak
2     2   2    2.23      60.1   2.71     54        20.8     28.46797     weak
3     2   3    2.38      59.8   2.69     56        21.3     13.18131     weak
4     2   4    2.41      57.4   2.65     58        21.7     6.142506     weak
5     2   5    2.49      55.5   2.61     60        22.3     5.741438     weak
6     2   6    2.58      53.4   2.57     62        22.7     1.337692     weak
7     2   7    2.65      51.2   2.53     64        23.5     0.162754     medium
8     2   8    2.72      48.9   2.49     66        23.9     0.038492     medium
9     2   9    2.81      47.5   2.45     68        24.3     0.01826      good
10    2   10   2.92      45.3   2.41     70        24.8     0.01737      good
11    2   11   3.04      42.5   2.37     72        25.6     0.016563     good
12    2   12   3.15      41.4   2.33     74        26.3     0.031655     medium
13    2   14   3.34      37.5   2.25     78        27.5     0.058147     medium
14    2   16   3.52      33.9   2.17     82        28.8     0.107525     medium
15    3   1    2.18      60.8   2.74     54        20.5     29.65203     weak
16    3   2    2.26      58.9   2.7      56        21.1     27.37484     weak
17    3   3    2.32      57.8   2.66     58        21.6     25.42252     weak
18    3   4    2.44      56.3   2.62     60        22       5.932535     weak
19    3   5    2.52      54.4   2.58     62        22.6     5.562255     weak
20    3   6    2.61      52.3   2.54     64        23       1.298397     weak
21    3   7    2.68      50.1   2.59     66        23.8     0.158239     medium
22    3   8    2.75      47.9   2.46     68        24.2     0.00937      good
23    3   9    2.84      46.4   2.42     70        24.6     0.069436     medium
24    3   10   2.95      44.2   2.38     72        25.1     0.008479     good
25    3   11   3.07      42.4   2.34     74        25.9     0.016187     good
26    3   12   3.18      40.3   2.3      76        22.6     0.007742     good
27    3   14   3.37      36.4   2.22     80        27.7     0.055561     medium
28    3   16   3.55      32.8   2.14     84        29.1     0.105535     medium
29    4   1    2.22      59.6   2.71     56        20.9     28.46797     weak
30    4   2    2.3       57.7   2.67     58        21.4     6.590654     weak
31    4   3    2.39      56.6   2.63     60        21.9     12.28501     weak
32    4   4    2.45      55.1   2.59     62        22.3     0.711938     medium
33    4   5    2.53      53.2   2.55     64        22.9     0.345211     medium
34    4   6    2.64      51.1   2.51     66        23.3     0.079343     medium
35    4   7    2.71      48.9   2.47     68        23.9     0.009623     good
36    4   8    2.79      46.7   2.43     70        24.5     0.00913      good
37    4   9    2.88      45.2   2.39     72        24.9     0.008685     good
38    4   10   3         43     2.35     74        25.4     0.008282     good
39    4   11   3.13      41.2   2.31     76        26.2     0.031655     medium
40    4   12   3.22      39.2   2.27     78        26.9     0.060619     medium
41    4   14   3.4       35.2   2.19     82        28       0.027934     medium
42    4   16   3.58      31.5   2.11     86        29.3     0.051809     medium
43    5   1    2.25      58.5   2.68     58        21.2     27.37484     weak
44    5   2    2.33      56.6   2.64     60        21.7     25.42252     weak
45    5   3    2.41      55.7   2.6      62        22.2     5.932535     weak
46    5   4    2.49      53.9   2.56     64        22.7     0.177992     medium
47    5   5    2.58      52.1   2.52     66        23.2     2.617735     weak
48    5   6    2.67      50.1   2.48     68        23.7     0.158239     medium
49    5   7    2.75      47.8   2.44     70        24.2     0.01874      good
50    5   8    2.83      45.6   2.4      72        24.8     0.142432     medium
51    5   9    2.91      44     2.36     74        25.3     0.008479     good
52    5   10   3.07      41.9   2.32     76        25.9     0.004047     good
53    5   11   3.16      40.1   2.28     78        26.5     0.015484     good
54    5   12   3.25      38.1   2.24     80        27.1     0.029679     medium
55    5   14   3.43      33.9   2.16     84        28.3     0.027398     medium
56    5   16   3.61      29.6   2.08     88        29.6     0.203537     medium
57    6   1    2.29      47.1   2.73     56        29.3     6.590654     weak
58    6   2    2.37      44.8   2.66     59        30.2     0.761671     medium
59    6   3    2.45      42.6   2.59     62        30.9     0.711938     medium
60    6   4    2.54      39.4   2.53     65        31.5     0.668846     medium
61    6   5    2.62      37.6   2.47     68        32.1     0.040689     medium
62    6   6    2.71      36.3   2.41     71        32.6     0.076984     medium
63    6   7    2.79      35.1   2.34     74        33.1     0.01826      good
64    6   8    2.87      34.2   2.28     77        33.6     0.008685     good
65    6   9    2.95      33.3   2.22     80        34       0.016563     good
66    6   10   3.07      32.2   2.15     83        34.4     0.003957     good
67    6   11   3.2       31.2   2.09     86        34.8     0.015155     good
68    6   12   3.29      30.3   2.03     89        35.2     0.029074     medium
69    6   14   3.47      28.3   1.9      95        36.1     0.026881     medium
70    6   16   3.65      26.4   1.77     101       37.1     0.099985     medium
As a result of Step 1, we get the information table presented in Table 2. This table is the basis of the structure-activity relationship analysis of the bis-imidazolium quaternary chlorides. In this case, condition attributes describe surface active properties and structure (see Section 2.1 for details). The decision attribute concerns antimicrobial properties of bis-quaternary imidazolium chlorides represented by some limit values of MIC for Staphylococcus aureus ATCC 25213. In the classical rough-set approach, it is necessary to discretize all attributes with continuous values. In this analysis, made by DRSA, the continuous attributes which reflect surface active properties are not discretized. In Step 2 of the procedure, we assumed that all condition attributes should be transformed into pairs of attributes having positive or negative monotonicity relationship with the decision attribute, which is ordinal. In Step 3, we applied DRSA to induce decision rules representing cause-effect relationships discovered from the information table. From the practical point of view, the most interesting are rules that describe decision class good, and decision class weak. The rules that assign to class good provide guidelines for synthesis of compounds with preferable antimicrobial properties. Analogously, the rules that assign to class weak outline features of compounds with the least preferable antimicrobial properties, i.e., the compounds that should not be taken into consideration. We do not induce rules for class medium since these rules are not interesting from the viewpoint of SAR analysis (it is more important to know the features of chlorides with definitely good or weak antimicrobial properties). Note, however, that the presence of chlorides from the class medium is important in the rule induction process.
The rules with conclusion good discriminate chlorides with good antimicrobial properties from those chlorides which have medium or weak properties (analogously for rules with conclusion weak). Thus, the analysis, which we present in the following, was made on two binary partitions of the decision table. The first binary partition was made between class good and class ¬good. Class ¬good consisted of all compounds that do not belong to class good (i.e., compounds belonging either to class medium or class weak). The second binary partition was made analogously for class weak. In Step 4, we identified relevant attributes. To perform this analysis, we independently constructed multiple sets of decision rules. More precisely, we constructed ensembles of VC-DomLEM rule classifiers [3]. The constructed ensembles were composed of rule classifiers induced on bootstrap samples of objects from the information table. The samples of objects used in the induction process were controlled by consistency measures. The approach applied to this end is called VC-bagging [1, 2]. It extends the standard bagging scheme proposed by Breiman [6]. Let us remark that in standard bagging, several classifiers, called component or base classifiers, are induced using the same learning algorithm over different distributions of input objects, which are bootstrap samples obtained by uniform sampling with replacement. Bagging has been extended in a number of ways in an attempt to improve its predictive accuracy. These extensions focused mainly on increasing diversity of component classifiers. The random forest [7] ensemble, which uses attribute subset randomized decision tree component classifiers, is a well known example. Other extensions of bagging take advantage of random selection of attributes. In some cases, the random selection of attributes was combined with standard bootstrap sampling (see [17]).
The motivation behind applying VC-bagging in our analysis is likewise to increase diversity of component classifiers by changing the sampling phase. The increased diversity of classifiers suits the exploration of relevant decision rules and attributes, which is the main goal of the presented study. Moreover, we take into account the postulate that base classifiers used in bagging are expected to have sufficiently high predictive accuracy apart from being diversified [7]. As we have shown in
previous studies [1, 2], this requirement can be satisfied by privileging consistent objects when generating bootstrap samples. The presence of inconsistent objects in bootstrap samples may lead to overfitting of the base classifiers, which decreases their classification accuracy. We change the standard bootstrap sampling, where each object is sampled with the same probability, into a more focused sampling, where consistent objects, in the sense presented in Section 2.2, are more likely to be selected than inconsistent ones. To identify consistent objects, we use the same consistency measures as those used to define probabilistic lower approximations in VC-DRSA [1]. In addition, we consider the consistency of objects with respect to their description by a random subset of attributes (criteria), instead of the whole set of attributes. In consequence, consistency is measured in an evaluation space constructed on random subsets of the whole set of attributes. In this way, we introduce another level of randomization into the method, which should lead to more diversified samples.
The ensembles used in the presented study were constructed in the following setting. The number of component rule classifiers in each ensemble was 100. To get reliable estimates of confirmation, the rules were constructed on training data sets resulting from a stratified 3-fold cross validation. For better reproducibility of the results, we repeated the cross validation 100 times. Detailed results of this experiment are discussed in the two following subsections.
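The sampling scheme and experimental setting described above can be sketched as follows. This is only an illustration under stated assumptions: the per-object consistency scores are made-up numbers, sampling probability is assumed to be proportional to the score, and the random attribute subsets used when measuring consistency are omitted. The actual VC-bagging weighting is defined in [1, 2].

```python
import random

N_CLASSIFIERS = 100  # ensemble size stated in the text
N_FOLDS = 3          # stratified 3-fold cross validation

def stratified_folds(labels, k, rng):
    """Split object indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

def weighted_bootstrap(indices, consistency, rng):
    """Draw len(indices) objects with replacement; objects with a higher
    consistency score are proportionally more likely to be drawn
    (an assumed weighting, for illustration)."""
    weights = [consistency[i] for i in indices]
    return rng.choices(indices, weights=weights, k=len(indices))

# Toy data: 18 objects, 6 per class; scores are hypothetical.
labels = ["good"] * 6 + ["medium"] * 6 + ["weak"] * 6
consistency = [1.0, 1.0, 1.0, 1.0, 1.0, 0.2] * 3

rng = random.Random(0)
folds = stratified_folds(labels, N_FOLDS, rng)
train = [i for fold in folds[1:] for i in fold]  # folds[0] is held out
ensemble_samples = [weighted_bootstrap(train, consistency, rng)
                    for _ in range(N_CLASSIFIERS)]
```

In the setting described in the text, this procedure would be run with each fold held out in turn, and the whole cross validation repeated 100 times with different shuffles.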
3.2. Decision rules
Strong decision rules supported by a large number of objects provide guidelines which may facilitate synthesis of new compounds with preferable antimicrobial properties. In Table 3, we present selected strong, most informative, certain decision rules induced from the information table. These are rules that cover at least half of the objects in a class and have high c1 confirmation. On the basis of these rules, we can recognize the following. The most active compounds are bis-quaternary imidazolium chlorides which possess a -log CMC value between 2.71 and 3.13. For these chlorides, the value of surface excess is below 2.42 mol/m², the value of surface tension at CMC is below 46.4 mN/m, the value of free energy of adsorption of a molecule is greater than 24.6, and the length of the R substituent ranges from 6 to 11 (the alkyl chain is longer than hexyl and shorter than undecyl). The worst class is dominated by bis-quaternary imidazolium chlorides for which the value of -log CMC is below 2.65, the value of surface tension at CMC is over 44.8 mN/m, the value of molecular area of a single particle is below 62, the value of free energy of adsorption of a molecule is below 30.2, and the value of surface excess is over 2.59 mol/m². Chlorides with a short n substituent (below 6, and thus shorter than dioxadecane) belong to this class. The obtained results of the structure–activity relationship analysis of 70 bis-imidazolium chlorides indicate a strong influence of attributes describing surface properties on the antimicrobial activity of compounds. The results clearly show that CMC is the most differentiating attribute, which decides whether a surface active compound has good antimicrobial properties or not. The results of the estimation of attribute relevance show that surface active properties are more relevant than structure properties when one wants to identify the antimicrobial properties of a compound.
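The thresholds quoted above can be read as simple screening predicates. The sketch below conjoins all thresholds given in the text for each class; it is therefore a summary profile and not a single rule from Table 3. The dictionary keys, the strict handling of interval endpoints, and the two example compounds are assumptions made for illustration. Values are expected in the units used in the text.

```python
def matches_good_profile(c):
    """True if compound c meets every 'good' threshold quoted in the text."""
    return (
        2.71 < c["neg_log_cmc"] < 3.13   # -log CMC
        and c["surface_excess"] < 2.42   # surface excess
        and c["gamma_cmc"] < 46.4        # surface tension at CMC, mN/m
        and c["delta_g_ads"] > 24.6      # free energy of adsorption
        and 6 < c["r_length"] < 11       # length of R substituent
    )

def matches_weak_profile(c):
    """True if compound c meets every 'weak' threshold quoted in the text."""
    return (
        c["neg_log_cmc"] < 2.65
        and c["gamma_cmc"] > 44.8
        and c["mol_area"] < 62           # molecular area of a single particle
        and c["delta_g_ads"] < 30.2
        and c["surface_excess"] > 2.59
        and c["n_length"] < 6            # length of n substituent
    )

# Hypothetical compounds, not taken from the data set of the study.
candidate_a = {"neg_log_cmc": 2.9, "surface_excess": 2.3, "gamma_cmc": 44.0,
               "delta_g_ads": 26.0, "r_length": 8, "mol_area": 70.0,
               "n_length": 8}
candidate_b = {"neg_log_cmc": 2.5, "surface_excess": 2.7, "gamma_cmc": 47.0,
               "delta_g_ads": 28.0, "r_length": 4, "mol_area": 60.0,
               "n_length": 4}
```

A compound may match neither profile, since the two sets of thresholds do not cover the whole evaluation space.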
Nevertheless, both the surface active properties and the biological activity of a chemical entity depend on its structure and, in particular, on the length of its hydrophobic hydrocarbon chain. This fact is reflected by the discovered decision rules, which show how structure properties and surface active properties are related to antimicrobial properties. The influence of attributes concerning chemical structure on biological activity was revealed in the decision rules: the length of the R-substituent in the class of good chlorides, and the length of the n-substituent in the class of weak chlorides. The results of SAR
Table 3. Decision rules, supporting examples, strength, and value of c1 measure
ID | Condition attributes: n, R, -logCMC, γCMC, Γ·10⁶, A·10²⁰, ∆Gads | Support | Strength (%) | c1
Support: 12 12 11 11 11 11 11 11 11 11 11 11 10 10 10
Strength (%): 60 60 55 55 55 55 55 55 55 55 55 55 50 50 50
c1: 0.7586 0.7586 0.7330 0.7330 0.7330 0.7330 0.7330 0.7330 0.7330 0.7330 0.7330 0.7330 0.7083 0.7083 0.7083
Support: 16 16 15 15 15 15 15 15 14 14 14 13 13 12 12
Strength (%): 80 80 75 75 75 75 75 75 70 70 70 65 65 60 60
c1: 0.8703 0.8703 0.8409 0.8409 0.8409 0.8409 0.8409 0.8409 0.8125 0.8125 0.8125 0.7850 0.7850 0.7586 0.7586
Decision class good 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>6
>6
>8 >8 (6, 11)
(2.71, 3.13)