DOI: 10.1002/minf.201000063

The One-Class Classification Approach to Data Description and to Models Applicability Domain

Igor I. Baskin,[a] Natalia Kireeva,[b] and Alexandre Varnek*[b]

Abstract: In this paper, we associate the applicability domain (AD) of QSAR/QSPR models with the area of the input (descriptor) space in which the density of training data points exceeds a certain threshold. It can be shown that the predictive performance of models built on the training set is higher for test compounds inside the high-density area than for those outside it. Instead of searching for a decision surface separating high- and low-density areas in the input space, the one-class classification 1-SVM approach looks for a hyperplane in the associated feature space. Unlike other AD definitions reported in the literature, this approach: (i) is purely "data-based", i.e. it assigns the same AD to all models built on the same training set; (ii) provides results that depend only on the initial pool of descriptors generated for the training set; (iii) can be used with a huge number of descriptors, as well as in the framework of structured kernel-based approaches, e.g., chemical graph kernels. The developed approach has been applied to improve the performance of QSPR models for stability constants of the complexes of organic ligands with alkaline-earth metals in water.

Keywords: One-class classification approach · Models applicability domain · Structure-property relationships · Structure-activity relationships

1 Introduction

The problem of defining the applicability domain (AD) of a QSAR/QSPR model is one of the hottest topics in chemoinformatics (see reviews[1–3]). Surprisingly, this concept is currently used almost exclusively in chemoinformatics and has so far been discussed very little in mathematical statistics and machine learning theory. As mentioned in the textbook by Vapnik,[4] statistical models are directly applicable to any test instance drawn from the statistical distribution describing the training set, i.e., the training and test instances should belong to the same data domain. If these data domains are rather close, statistical models can still be applied indirectly to the test instances using dataset shift techniques.[5] Hence, in order to identify an AD, one should describe the data domain of the training instances and check whether the test instances belong to this domain. The data domain of an ensemble of instances can be described using one-class classification or novelty detection approaches; see reviews.[6–8] In the input (descriptor) space, they separate the higher data density areas (typically enclosed by some hypersurfaces) from the open area of lower data density. The ability of novelty detection models to serve as ADs of machine learning models was demonstrated earlier by Bishop.[9] The use of the one-class SVM (1-SVM) novelty detection method to assess the applicability domain of models based on structured graph kernels has recently been suggested by Fechner et al.[10] In this article, we formulate the AD in terms of the novelty detection approach. In contrast to other AD definitions, ours is not confined to any model-specific information; the whole initial pool of

descriptors can be used to define an applicability domain. This approach efficiently uses kernel techniques, which allow one to handle a very big and even infinite number of descriptors.[11] Thus, the disconnection from model-related information (selected descriptors, modelled property) allows one to develop a "universal" AD, common to all models which could potentially be built on a given dataset using the initial pool of descriptors. Here, the AD is associated with the high-density regions of the descriptor space, which can be identified by the 1-SVM approach. This paper is organized as follows. First, the main idea of using one-class classification for defining the AD of QSAR/QSPR models is formulated and, in this context, a short survey of one-class classification and density estimation methods is given. Then the quantile diagrams and quantile analysis adapted to the AD calculation are introduced and applied to regression models for stability constants (log K) of complexes of organic ligands with alkaline-earth cations in solution.

[a] I. I. Baskin
Department of Chemistry, Moscow State University, Moscow 119991, Russia
[b] N. Kireeva, A. Varnek
Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France
*e-mail: [email protected]

 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim


Full Paper

I. I. Baskin et al.

2 The Use of One-Class Classification to Define Models' Applicability Domain

2.1 Data Description Domain

Inductive statistical (machine learning) approaches usually assume that the objects in the training and test sets are drawn from the same probability distribution F. This means that the sampling of a new object, described by a vector of attributes (descriptors) x, is treated as a random event characterized by some probability. In this case, all objects in the training and test sets can be considered as realizations of the same multidimensional random vector variable X with probability distribution function F(x):

$F(x_1, \ldots, x_d) = P(X_1 \le x_1, \ldots, X_d \le x_d)$  (1)

where d is the dimensionality of x and P(·) denotes probability. In the framework of probability theory, for real-valued x the probability density function g(x) can be defined by Equation 2:

$F(x_1, \ldots, x_d) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} g(t_1, \ldots, t_d)\, dt_1 \cdots dt_d$  (2)
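As a quick sanity check of Equation 2, one can numerically integrate a density g and compare with the known distribution function F. The one-dimensional standard normal case below is a hypothetical illustration, not taken from the paper.

```python
# One-dimensional check of Equation 2: F(x) is the integral of the density
# g(t) from -infinity to x. For the standard normal density, the result can
# be compared with the exact value 0.5*(1 + erf(x/sqrt(2))).
import math
import numpy as np

def g(t):
    return np.exp(-t**2 / 2) / math.sqrt(2 * math.pi)

x = 1.0
t = np.linspace(-8.0, x, 20001)   # -8 approximates the -infinity lower limit
dt = t[1] - t[0]
vals = g(t)
F_numeric = float((np.sum(vals) - 0.5 * (vals[0] + vals[-1])) * dt)  # trapezoid rule
F_exact = 0.5 * (1 + math.erf(x / math.sqrt(2)))
print(F_numeric, F_exact)  # both ~0.8413
```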

In this context, the problem of defining the model's applicability domain (AD) can be reduced to the problem of whether a new test example belongs to the same probability distribution F. If this is not the case, any model developed on the training set should not be applied to the test examples. Two alternative ways can be suggested to determine the AD in this formulation. First, one can estimate the probability density of the training set, g(t_1, ..., t_d), at a given point and then compare it with some threshold: if g(t_1, ..., t_d) exceeds the threshold, the test example is considered as belonging to the statistical distribution F. The second way is to estimate the p%-quantile corresponding to the probability mass in the highest probability density region R_p of the input space X:

$p = \int_{(t_1, \ldots, t_d) \in R_p} g(t_1, \ldots, t_d)\, dt_1 \cdots dt_d$  (3)

If a test example lies within R_p, one can consider it as belonging to the p%-quantile of the statistical distribution F. Thus, the data description domain R_p can be considered as the AD of models built on training sets drawn from the probability distribution F. To locate the region R_p in the input space X, one can define a function f which is positive in R_p and negative elsewhere:

$f(x) > 0 \text{ if } x \in R_p, \qquad f(x) < 0 \text{ if } x \notin R_p$  (4)


where x is a vector in the input space. In machine learning, this task is called the one-class classification problem.[12] Thus, the goal of one-class classification is to differentiate the objects belonging to a certain class (the target class) from all other objects (the outlier class). Unlike two-class and multi-class classification, which identify patterns distinguishing different classes, one-class classification uses only objects belonging to the single target class in learning. By seeking features common to the target class objects, one-class classification, on the one hand, defines a boundary around the target class (the decision surface) and, on the other hand, minimizes the chance of accepting outlier objects. In the literature, one-class classification is also called outlier detection, novelty detection, concept learning and data description.[12–15]
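The first, density-threshold route described above can be sketched with a Parzen-type kernel density estimate. The sketch below uses scikit-learn's KernelDensity as a stand-in; the random data, bandwidth and 95% threshold are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the "density threshold" route to an AD: a test point
# is inside the domain if the estimated density g(x) at that point exceeds a
# threshold derived from the training set itself.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))           # training set in descriptor space
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])  # one central point, one far outside

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)
log_g = kde.score_samples(X_test)             # log of the estimated density g(x)

# Threshold chosen so that 95% of the training points lie above it,
# i.e. an empirical 95%-quantile region R_p.
threshold = np.quantile(kde.score_samples(X_train), 0.05)
in_ad = log_g >= threshold
print(in_ad)  # with this seed: [ True False]
```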


2.2 One-Class Classification and Probability Density Estimation

Three main approaches have been suggested to solve the one-class classification problem[6–8]: (1) density estimation, (2) reconstruction, and (3) boundary methods. Density estimation methods evaluate the probability density function of the training data and define a boundary for the target objects according to a density threshold. The most popular methods for density estimation involve the application of a Gaussian model,[16] a mixture of Gaussians,[17] or the Parzen density estimator.[18] In practice, this approach can be used only for a small number of descriptors; the estimation of the probability density in a multidimensional space is a hard problem requiring a large number of training examples.[19] The reconstruction approach is based on the ability to reconstruct objects from their encoded representations. The encoding and decoding rules depend on the training set objects; thus, the error of reconstruction of a test object depends on its similarity to the training set. A user-defined threshold on the reconstruction error can then be used to define the AD. Two main groups of methods are considered: clustering[6, 16, 20–23] and encoding-based[11, 14, 24–26] techniques. Unlike density estimation approaches, boundary methods directly identify the training objects located in high-density regions without aiming to reproduce the density distribution function. In the descriptor (input) space, one can use as AD a threshold on the distance between a test object and specially selected training set objects (as in the k-center method[27]) or the closest ones, as realized in different versions of the kNN approach.[6, 28, 29] One can also define a kernel-representing (feature) space onto which the input space is implicitly mapped.[11] In this case, the boundary between high- and low-density regions is associated with the support vectors. In the feature space, one should speak about the position of the test object with respect to the hyperplane (in one-class SVM (1-SVM)[30]) or inside the hypersphere (in Support Vector Data Description


Mol. Inf. 2010, 29, 581 – 587


(SVDD)[6, 31]) separating "high-density" and "low-density" training objects. In this work, 1-SVM[30] has been chosen among the one-class classification techniques for the following reasons: (i) it works efficiently in a high-dimensional input space; (ii) it does not require an overly large number of training examples; (iii) for the Gaussian RBF kernel it is a consistent density estimator[32] (i.e., the decision surfaces computed with 1-SVM converge towards isosurfaces of the true probability density as the number of training examples increases); (iv) the software package LIBSVM,[33] in which 1-SVM is efficiently implemented, is freely available.

2.3 One-Class Support Vector Machines (1-SVM)

1-SVM[30] is a kernel-based boundary method for one-class classification. It solves the problem of one-class classification and data description by constructing a hyperplane in a feature space H, to which the original vectors x from the input space X are mapped by means of a nonlinear mapping Φ: X → H. The mapping Φ is chosen in such a way that for all pairs of vectors x_i and x_j the dot product of their images Φ(x_i) and Φ(x_j) in the feature space can be computed as a function K(x_i, x_j) (the so-called kernel) in the input space:[11]

$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$  (5)
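Equation 5 can be illustrated with a kernel whose feature map is known in closed form. The polynomial kernel below is a hypothetical stand-in for this purpose (the RBF kernel of Equation 7 has no finite-dimensional map):

```python
# Check of Equation 5 for a kernel with an explicit mapping: for the
# polynomial kernel K(x, y) = (x . y)^2 in 2D, a valid feature map is
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), and K equals the dot product of images.
import math
import numpy as np

def K(x, y):
    return float(np.dot(x, y)) ** 2

def phi(x):
    return np.array([x[0]**2, math.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = K(x, y)                        # kernel evaluated in the input space
rhs = float(np.dot(phi(x), phi(y)))  # dot product in the feature space
print(lhs, rhs)  # both equal 1.0
```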

A one-class decision hyperplane H(w, ρ)[30] separates objects in the feature space belonging to the target class from all other objects (outliers):

$H(w, \rho): \; \langle w, \Phi(x) \rangle - \rho = 0$  (6)

It is built in such a way as to maximize the distance between H(w, ρ) and the origin, on the one hand, and to leave a prescribed fraction of training examples beyond H(w, ρ), on the other hand. Here, w is a vector perpendicular to the hyperplane, whereas ρ/||w|| is its offset (see Figure 1). One can prove that for the Gaussian Radial Basis Function (RBF) kernel

$K_{RBF}(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$  (7)

all points in the feature space lie on the surface of a hypersphere of radius 1 centered at the origin. Indeed, the distance from the origin to any point Φ(x) is equal to the square root of the dot product of its radius-vector with itself:

$\langle \Phi(x), \Phi(x) \rangle = K(x, x) = \exp(-\gamma \|x - x\|^2) = \exp(0) = 1 = \mathrm{const}$  (8)

Since all points Φ(x) are separated from the origin by the same distance, equal to 1, they lie on the hypersphere of radius 1. In this case, the hyperplane H(w, ρ) cuts from this hypersphere a segment whose radius becomes smaller as the distance ρ/||w|| between H(w, ρ) and the origin increases (see Figure 1).

Figure 1. Schematic representation of the training set objects in the initial space defined by two descriptors (a) and in the feature space (b). a) Contours C1 and C2 correspond to different density thresholds separating target class objects from outliers. The region R2 inside C2 corresponds to a higher average density than the region R1 inside C1; thus, the prediction accuracy for a test object falling into R2 is expected to be higher than for R1. b) Separation of target class objects from outliers in 1-SVM. In the feature space, the initial data lie on a hypersphere of radius R = 1 (for the Gaussian RBF kernel). An increase of the distance ρ/||w|| between the hyperplane H(w, ρ) and the origin reduces the number of selected target class objects and, consequently, decreases the coverage q. The hyperplanes H1 and H2 correspond to the contours C1 and C2 in Figure 1a, as proven in the literature.[31]

A decision function f(x), identifying the two areas separated by the hyperplane H(w, ρ) in the feature space, or by a hypersurface in the input space, is determined as

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i K(x_i, x) - \rho \right)$  (9)

where

$0 \le \alpha_i \le \frac{1}{\nu N}$  (10)


N is the number of data points and ν is a 1-SVM parameter varying from 0 to 1. Notice that Equation 9 presents a sparse solution, with α_i ≠ 0 only for a certain number of objects (the support vectors). Both parameters of the method, ν and γ, control the number of outliers, but differently. The parameter γ (see Equation 7) defines the metric of the feature space; its variation leads to a redistribution of data points over the surface of the hypersphere. For fixed γ, an increase of ν (Equation 10) increases both the number of support vectors and the number of outliers. In the feature space, this corresponds to a shift of the hyperplane away from the origin, increasing its offset ρ/||w|| and therefore decreasing the radius of the segment defining the high-density data region on the surface of the hypersphere (see Figure 1). It should be noted that variation of ν may also change the angular orientation of the hyperplane H (i.e., slightly rotate the vector w) if the data are distributed non-symmetrically. Thus, a simple shift of the hyperplane H without tilting (as done in Reference [10]) could result in a non-optimal separation of the feature space into low-density and high-density regions.
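The role of ν can be illustrated with scikit-learn's OneClassSVM, an implementation of the same 1-SVM algorithm; the random data and parameter values below are arbitrary assumptions for the sketch.

```python
# Sketch: at fixed gamma, increasing nu admits more outliers, i.e. the
# hyperplane moves away from the origin and the coverage q drops.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))          # stand-in for a descriptor matrix

coverages = []
for nu in (0.05, 0.2, 0.45):
    ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=nu).fit(X)
    n_out = int((ocsvm.predict(X) == -1).sum())   # training points outside the AD
    q = 1.0 - n_out / len(X)                      # coverage q = N_target / N_tot
    coverages.append(q)
    print(f"nu={nu:.2f}  outliers={n_out:3d}  q={q:.2f}")
```

The ν-property guarantees that the fraction of outliers does not exceed ν, so the coverage stays above 1 − ν for each fitted model.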

2.4 One-Class Classification and Prediction Errors

In practical applications, it is difficult to figure out whether the training and test sets are drawn from the same probability distribution. In this case, Equation 9 is particularly useful for determining whether a test example is "in" or "out" of the AD (Figure 1). Indeed, the decision function f(x) in Equation 9 selects into the "target" class the examples located in the higher probability density regions of the descriptor space, leaving the "outliers" in the low-density regions. Numerous studies in mathematical statistics and machine learning show that the accuracy of predictions increases if the test example falls into those dense regions. Therefore, by moving the decision surface towards denser areas of the descriptor space (or, equivalently, increasing the distance ρ/||w|| between the origin and the hyperplane H(w, ρ) in the feature space (Figure 1)), one can expect an increase of predictive performance, at the price of a decrease in the number of compounds falling inside the AD.

2.5 Quantile Analysis for 1-Class Classification

Since in the one-class classification problem only one (target) class of examples is considered, only the true positive (tp) and false negative (fn) counts can be computed: tp counts the target class examples correctly classified as belonging to the target class, whereas fn counts the target class examples classified as outliers. Thus, a sensitivity (q) parameter, measuring the percentage of examples enclosed by the decision surface, can be calculated:

$\text{sensitivity} = q = \frac{tp}{tp + fn}$  (11)

If one considers these examples as those accepted by the AD, q measures the coverage of the whole data set. It can be shown that, for a data set drawn from some statistical distribution F, the sensitivity q estimates the gross probability mass (quantile) of F enclosed by the decision surface.[32] Moving towards the high-density area enclosed by the corresponding decision surface decreases the value of q. If a regression model is built on the training set objects situated in a high-density area and then applied to test set objects also within this area, the Root Mean Square Error (RMSE) of the predictions depends on the average value of the training set data density g(·) over this area: in such high-density areas each test example has more training set examples in its close neighborhood, which leads to more precise interpolation. In the following, the plot of RMSE (or any other statistical measure) as a function of q will be called a regression quantile diagram.

3 Method

3.1 Data Preparation

Three data sets of stability constants (log K) of Ca2+, Sr2+ and Ba2+ complexes with organic ligands in water at 298 K and ionic strength 0.1 M have been selected from the IUPAC Stability Constants database.[34] These are structurally diverse data sets containing acyclic and macrocyclic, acidic and neutral compounds. The sets were split into training/test sets containing 100/22, 105/25 and 188/47 compounds for Ca2+, Sr2+ and Ba2+, respectively. The NCI diversity data set containing 1990 compounds (derived from a large database of more than 250,000 compounds[35]), a priori dissimilar to the training sets, was used in the assessment of the relative probability density (Section 4.2).

3.2 Descriptors

Two types of ISIDA fragment descriptors[36] – sequences and augmented atoms – have been used. The sequences include either atoms and bonds, atoms only, or bonds only. The number of atoms in a fragment varies from 2 to 15. For sequences with a selected minimal (nmin) and maximal (nmax) number of atoms, all intermediate sequences involving n atoms (nmin < n < nmax) have been generated. An augmented atom represents a central atom with its environment, including either neighbouring atoms and bonds, atoms only, or bonds only. Atom hybridization was taken into account for both classes of fragments. Hydrogen atoms were omitted. The value of descriptor i was calculated as the number of fragments of type i in the molecule. Typically, for each particular data set, the initial pool of descriptors contained thousands of fragments, all of which were used to build both the Support Vector Regression and 1-SVM models.

3.3 Computational Procedure

Calculations on the target class objects only were performed using the Support Vector Regression (SVR) and 1-SVM methods. In SVR modelling, the property (log K) values have been predicted for each compound of the training set (in a 5-fold external cross-validation procedure) and of the additional test set. In 1-SVM modelling, all compounds were ordered with respect to their belonging to denser regions of the input space. The combination of SVR and 1-SVM results allowed us to build regression quantile diagrams displaying the variation of RMSE as a function of q. SVR models were obtained using the LIBSVM software[33] with C-SVR (see Reference [37], Section 1.2.3) and the RBF kernel function. The performance of each model has been optimized in a grid search varying three parameters: C = 2^-5, 2^-3, …, 2^15 and ε = 0.0001, 0.001, …, 10 (internal parameters of the method) and γ = 2^-15, 2^-13, …, 2^3 (parameter of the RBF kernel). 1-SVM classification has been performed using LIBSVM with the RBF kernel function. Two parameters of the method, ν and γ, have been varied within the following ranges: ν = 0.001, 0.051, …, 0.451 (internal parameter of the method) and γ = 2^-15, 2^-13, …, 2^3 (parameter of the RBF kernel). These parameters define the angular orientation and position of the hyperplane H with respect to the origin, as well as the metric of the feature space, and hence the number of objects Ntarget classified as belonging to the target class (Figure 1). At a given γ in the grid, we selected only the one ν providing the maximal coverage q = Ntarget/Ntot, where Ntot is the total number of objects in the data set. Here, we build a 1-SVM model for each combination of the parameters ν and γ, while optimizing the orientation and offset of the separating hyperplane in the feature space. In Reference [10], a single 1-SVM model was built for the default value of the parameter ν = 0.5; thus, the hyperplane H was shifted without any change of its orientation.
Although this simplification seems reasonable for speeding up virtual screening, it does not necessarily lead to an optimal separation between high- and low-density areas. It should also be noted that the consistency of 1-SVM as a density estimator has been proved only for the RBF kernel[32] used in our work; this is not the case in Reference [10], in which structured graph kernels were used. The models were validated in an external 5-fold cross-validation (CV) performed on the training set, as well as in an additional validation (AV) performed on the test set. In 5-fold CV, the models were built on 4/5 of the training set and validated on the remaining 1/5 of the data; this procedure was repeated 5 times, so that each molecule from the training set was predicted. In additional validation, the models were built on the whole training sets and then validated on the corresponding test sets. Comparison of the prediction errors (RMSE) obtained in CV

and AV allowed us (i) to check whether n-fold cross-validation reliably estimates the method's performance in screening external data sets, and (ii) to draw conclusions concerning the impact of the training set size on the performance of predictions.
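A hedged sketch of this SVR + 1-SVM combination (scikit-learn in place of LIBSVM, synthetic data in place of the log K sets, fixed parameters in place of the full grid search) might look as follows:

```python
# 5-fold cross-validated SVR predictions are combined with a 1-SVM ranking of
# the compounds, giving RMSE as a function of the coverage q, i.e. the data
# for a regression quantile diagram.
import numpy as np
from sklearn.svm import SVR, OneClassSVM
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                        # descriptor matrix (synthetic)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

# 5-fold external CV predictions; the C, epsilon, gamma grid of the text
# would normally be scanned here, fixed values keep the sketch short
y_pred = cross_val_predict(SVR(kernel="rbf", C=10.0, gamma=0.1), X, y, cv=5)

# 1-SVM decision values: larger = deeper inside the high-density region
score = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X).decision_function(X)
order = np.argsort(score)[::-1]

for q in (1.0, 0.8, 0.6):
    kept = order[: int(q * len(X))]                  # most "in-domain" compounds first
    rmse = float(np.sqrt(np.mean((y[kept] - y_pred[kept]) ** 2)))
    print(f"q={q:.0%}  RMSE={rmse:.3f}")
```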

4 Results and Discussion

4.1 SVR Modelling

Performance of the Support Vector Regression models for the Ca2+, Sr2+ and Ba2+ complexes is reported in Table 1. Notice that the RMSE values in CV were systematically larger than those in AV. This shows that external 5-fold cross-validation gives a pessimistic estimate of predictive performance, since the CV models were developed on 4/5 of the training set, whereas the AV models were built on the entire training sets.

Table 1. RMSE of predictions for external 5-fold cross-validation (CV) and for validation on external test sets (AV). The coverage q = 100% corresponds to calculations performed without applying 1-SVM models as the applicability domain.

Metal/q                        Ca2+     Sr2+       Ba2+
Cross-validation, q = 100%     1.86     1.62       1.36
Cross-validation, q = 80%      1.41     1.17       1.27
External validation, q = 100%  1.48     1.10       1.03
External validation, q = 80%   1.23     0.68 [a]   1.04

[a] Corresponds to 76% coverage.

4.2 Regression Quantile Diagrams

In 1-SVM, an increase of the distance ρ/||w|| between the hyperplane H(w, ρ) and the origin results in a reduction of the number of selected target class objects and, consequently, in a decrease of the coverage q (Figure 1). Discarding the outliers leads to an improvement of the predictive performance of the models: the RMSE decreases for smaller q (Figure 2). This effect is pronounced for the Ca2+ and Sr2+ data sets and relatively weak for the Ba2+ data set. In most cases, a significant decrease of the RMSE (sometimes up to 40–45%) is observed at q = 80% (Figure 2 and Table 1) for both CV and AV calculations. Obviously, 80% cannot be considered a universal coverage threshold; it always depends on the particular dataset and on the required prediction accuracy. The density of the data can be assessed as the ratio of the number of studied compounds to that of a reference data set uniformly distributed in chemical space. As the latter, we used the NCI diversity data set derived from a large database of more than 250,000 compounds.[35] The 1-SVM modelling was performed on a mixed (NCI + complexants) data set. For each 1-SVM model, both the number of complexants (Ncompl) and the number of NCI compounds (NNCI) beyond the hyperplane H (i.e., inside the AD) were counted. Assuming that the compounds from the structurally diverse


Figure 2. Regression quantile diagrams for the prediction of stability constants (log K) of complexes of Ca2+, Sr2+ and Ba2+ cations with organic ligands: RMSE as a function of the coverage q for a) cross-validation (CV) and b) additional validation (AV).

NCI data set are uniformly distributed in chemical space, the ratio d = Ncompl/NNCI can be considered as a measure of the data density. Figure 3 shows that the prediction error becomes smaller in regions with higher data density in the feature space, in agreement with the theoretical considerations described in Section 2.
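The relative-density estimate of this section can be sketched as follows; the mixture of a dense synthetic cluster (playing the role of the complexants) with a diffuse random reference set (standing in for the NCI diversity set) is an illustrative assumption, not the paper's data.

```python
# Mix the "complexants" with a diffuse reference set, fit a 1-SVM on the
# mixture, and count how many of each group fall inside the AD; the ratio
# d = N_compl / N_NCI then serves as a relative data density.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
X_compl = rng.normal(loc=0.0, scale=1.0, size=(150, 3))  # dense training cluster
X_ref = rng.uniform(low=-4.0, high=4.0, size=(600, 3))   # diffuse reference set

X_mix = np.vstack([X_compl, X_ref])
inside = OneClassSVM(kernel="rbf", gamma=0.3, nu=0.2).fit(X_mix).predict(X_mix) == 1

n_compl = int(inside[:150].sum())   # complexants inside the AD
n_ref = int(inside[150:].sum())     # reference compounds inside the AD
d = n_compl / max(n_ref, 1)         # relative data density
print(f"N_compl={n_compl}  N_NCI={n_ref}  d={d:.2f}")
```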

5 Conclusions

We propose a new approach to the applicability domain (AD) of QSAR/QSPR models, formulated as a one-class classification task. The problem can be solved using data description analysis based on the 1-SVM technique. Unlike some applicability domain techniques described earlier in the literature, this approach: (i) is not confined to any model-specific information; (ii) can handle a large number of descriptors; (iii) can involve low (zero) variance descriptors; (iv) can

even be used in the framework of descriptor-less approaches based on chemical graph kernels. An application of 1-SVM improves the performance of regression models by rejecting compounds (outliers) dissimilar to the training set. The p%-quantile analysis and the related quantile diagrams have been suggested to evaluate the predictive performance of regression models. Regression quantile diagrams describe the dependence of the prediction error (RMSE) on the coverage q. For the case study of the prediction of stability constants of alkaline-earth cation complexes, a significant improvement of model performance due to the discarding of "outliers" has been observed. It should be pointed out that the definition of ADs of QSAR/QSPR models is only one possible field of application of one-class classification (novelty detection) approaches in chemoinformatics. This type of machine learning procedure could be used wherever the class of counter-examples is not well-defined: in similarity-based data analysis and virtual screening (see[38]), to assess the feasibility of chemical reactions and the synthetic accessibility of chemical compounds, etc.

Figure 3. RMSE of predictions (in log K units) as a function of the data density in the CV and AV procedures.

Acknowledgement

The authors thank the ARCUS "Alsace-Russia/Ukraine" program, GDRI SupraChem, GDR PARIS, and the Collège Doctoral Européen (Strasbourg) for support.

References

[1] J. Jaworska, N. Nikolova-Jeliazkova, T. Aldenberg, ATLA Alternatives to Laboratory Animals 2005, 33, 445.
[2] S. Dimitrov, G. Dimitrova, T. Pavlov, N. Dimitrova, G. Patlewicz, J. Niemela, O. Mekenyan, J. Chem. Inf. Model. 2005, 45, 839.
[3] I. V. Tetko, P. Bruneau, H.-W. Mewes, D. C. Rohrer, G. I. Poda, Drug Discovery Today 2006, 11, 700.
[4] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.
[5] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence, in Neural Information Processing Series (Eds: M. I. Jordan, T. Dietterich), MIT Press, Cambridge, MA, 2009.
[6] D. M. J. Tax, Doctoral Thesis, Technische Universiteit Delft, Delft, The Netherlands, 2001.
[7] M. Markou, S. Singh, Signal Processing 2003, 83, 2481.
[8] M. Markou, S. Singh, Signal Processing 2003, 83, 2499.
[9] C. M. Bishop, IEE Proc.: Vision, Image and Signal Processing 1994, 141, 217.
[10] N. Fechner, A. Jahn, G. Hinselmann, A. Zell, J. Cheminformatics 2010, 2.
[11] B. Schölkopf, A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[12] M. Moya, M. Koch, L. Hostetler, in Proc. World Congress on Neural Networks, International Neural Network Society (INNS), Portland, OR, 1993, p. 797.
[13] C. Bishop, IEE Proc.: Vision, Image and Signal Processing 1994, 141, 217.


[14] N. Japkowicz, C. Myers, M. Gluck, in Proc. 14th Int. Joint Conf. on Artificial Intelligence, 1995, p. 518.
[15] G. Ritter, M. Gallegos, Pattern Recognit. Lett. 1997, 18, 525.
[16] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[17] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[18] E. Parzen, Ann. Math. Stat. 1962, 33, 1065.
[19] D. W. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, Wiley, New York, 1992.
[20] J. A. Hartigan, M. A. Wong, Appl. Stat. 1979, 28, 100.
[21] T. Kohonen, Self-Organizing Maps, Springer, Heidelberg, 2001.
[22] G. A. Carpenter, S. Grossberg, D. B. Rosen, Neural Networks 1991, 4, 493.
[23] A. Ben-Hur, D. Horn, H. T. Siegelmann, V. Vapnik, J. Machine Learning Res. 2001, 2, 125.
[24] H. Schwenk, Neural Comput. 1998, 10, 2175.
[25] P. Baldi, K. Hornik, Neural Networks 1989, 2, 53.
[26] J. Hertz, A. Krogh, R. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, New York, 1991.
[27] A. Ypma, R. P. W. Duin, in Proc. ICANN'98, Skövde, Sweden, 1998.
[28] M. Breunig, H.-P. Kriegel, R. Ng, J. Sander, in Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, 2000.
[29] E. Knorr, R. Ng, V. Tucakov, VLDB J. 2000, 8, 237.
[30] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, R. C. Williamson, Neural Comput. 2001, 13, 1443.
[31] D. M. J. Tax, R. P. W. Duin, Machine Learning 2004, 54, 45.
[32] R. Vert, J.-P. Vert, J. Machine Learning Res. 2006, 7, 817.
[33] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001; software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[34] IUPAC Stability Constants Database, http://www.acadsoft.co.uk.
[35] NCI Diversity Data Set, http://dtp.nci.nih.gov/branches/dscb/diversity explanation.htm.
[36] A. Varnek, D. Fourches, F. Hoonakker, V. P. Solov'ev, J. Comput.-Aided Mol. Des. 2006, 19, 693.
[37] A. J. Smola, B. Schölkopf, Stat. Comput. 2004, 14, 199.
[38] D. Hristozov, T. I. Oprea, J. Gasteiger, J. Chem. Inf. Model. 2007, 47, 2044.


Received: June 7, 2010
Accepted: July 11, 2010
Published online: August 30, 2010
