Current Opinion in Drug Discovery & Development 2009 12(1):98-107 © Thomson Reuters (Scientific) Ltd ISSN 1367-6733
Clustering and its application in multi-target prediction

William Liu1 & Dale E Johnson1,2

Addresses
1 University of California at Berkeley, Department of Nutritional Science & Toxicology, 119 Morgan Hall, Berkeley, CA 94720-3104, USA
Email: [email protected]
2 Emiliem Inc, Christie Avenue, Emeryville, CA 94608, USA
Email: [email protected]

Correspondence can be addressed to either author
Drug discovery teams are beginning to apply non-screening techniques to make early associations between chemical structure and various biological and druggability characteristics of compound series. The increasing availability of multiple datasets and models of target potency, ADME characteristics, and toxicity allows a researcher from any discipline to draw quick associations with multiple endpoints on sets of compounds or chemical scaffolds. Cluster analysis, for instance, can be used to correlate screening potency with data predicted from both freely available and commercially available models. In the future, researchers will be able to draw chemical-biological associations 'on-the-fly', using various clustering or similarity techniques to determine whether the proposed toxicity of a drug is related to its chemical structure or to its proposed efficacy mechanism. In this review, associations are illustrated with target potency data gleaned from the literature, associated with CYP450 substrate predictions from GeneGo's MetaDrug.

Keywords: Chemical databases, cluster analysis, computational toxicology, CYP450, QSAR, similarity analysis
Introduction
Classical quantitative structure-activity relationship (QSAR) studies are based on a training set consisting of a large, diverse range of compounds. Ideally, these compounds span a large chemical space, such that new test compounds are covered by the chemical space of the training set and can be reliably evaluated. Frequently, however, the researcher is limited to a set of compounds that is neither large nor diverse. In many cases, restricted budgets dictate limited SAR analysis, and therefore a training set consisting of a limited number of chemical scaffolds. This can decrease model quality and obscure the true SARs in the dataset. Furthermore, limited budgets can imply limited time scales; with the increased cost of drug development, quick and efficient methods of separating data into relevant groups become invaluable, as potential groupings of scaffolds may be non-obvious and difficult to separate without computational methods. Clustering analysis compensates for these limitations by efficiently and effectively separating compounds into distinct clusters, thus improving the predictive power (while limiting the scope) of each individual cluster. This leads to more relevant results, as smaller datasets consisting of just one, or a few, scaffolds control for the similarities in structure between compounds, so that the differences in structure between compounds are emphasized. A non-diverse set of compounds, consisting of just a few scaffolds, can thus be divided into distinct datasets, increasing predictive power and allowing for a clearer interpretation of the structural characteristics that affect activity in each cluster.

In cases where datasets are large and contain a diverse range of compounds, cluster analysis can again play a role in drug development. With smaller datasets, visual, expert-driven clustering is a possible method of grouping compounds; when datasets become extremely large and consist of many scaffolds, however, it becomes almost impossible to sort compounds by structure manually, and computational methods are again preferred.
Clustering methodology
Clustering is not a recent invention, nor is its relevance to computational toxicology a recent application. Its theory, however, is often lost within the black-box treatments used by QSAR programs. Clustering, in the general sense, is the grouping of objects together based on their similarity, while excluding objects which are dissimilar. One of the first applications of cluster analysis to drug discovery was by Harrison [1], who asserted that locales exist within chemical space which favor biological activity; consequently, these localities form clusters of structurally similar compounds. This idea that structure confers activity is also the fundamental premise of all QSAR analyses.

The basic framework for compound clustering consists of three main steps: the computation of structural features,
the selection of a difference metric, and the application of the clustering algorithm.

Molecular descriptors
Before compounds can be clustered, they must first be expressed in a way which is computationally interpretable. For traditional clustering, molecular descriptors are simply numerical, preferably continuous, representations of the structural characteristics of a compound, and can include measures of atom electrotopological state (E-state), connectivity indices, graph measures, and other molecular and physicochemical properties. These descriptors can then be represented as entries in a matrix and, consequently, can easily be used in statistical analyses as well as cluster analysis. The main drawback of traditional molecular descriptors is the possible presence of erroneous descriptors; these may not offer any strong differentiating measure between compounds, and could result in the grouping of compounds where similarity does not exist.

Other methods of clustering avoid the use of traditional molecular descriptors, instead focusing purely on structural keys and molecular fingerprints [2]. Structural keys can be thought of as descriptors that are assigned a binary value representing the presence or absence of certain structural patterns [3]. Compounds are then represented as a bitmap, where each bit represents the absence or presence of a particular structural feature. The main advantage of using structural keys is the relative speed of analysis, as computers are quite adept at performing Boolean operations. In terms of clustering, because the composition of a structural key is determined by the researcher, the main drawback becomes a lack of generality. Compounds are difficult to represent fully with a small number of specific patterns expressed as binary variables. Instead, many structural patterns must be used, and not all are guaranteed to be of any significance. This can result in decreased efficiency, as erroneous variables can be selected in the structural key, leading to either poor results or wasted computation.

Molecular fingerprinting, as characterized by the Daylight methodology [3], is an alternate method of describing compounds, where structures are again expressed as bitmaps. Unlike structural keys, molecular fingerprinting is not based on a preset number of structural patterns. Described best by Butina [4], the molecular fingerprint of each compound is generated algorithmically, and is composed of the following: a pattern for each atom, a pattern for each atom and its neighbors, and a set of patterns describing groups of atoms and their bonds over paths of increasing bond length. The set of patterns generated is quite large and unique. Because patterns are generated uniquely and are not predefined, it becomes impossible to express each pattern as a single bit. Instead, each pattern is hashed, and an output of a small set of bits is used to represent the pattern. This method of describing molecules using non-predefined patterns offers a more complete measure of the composition of a compound, compared to both traditional molecular descriptors and structural keys.

Fingerprints and structural keys are composed of descriptors whose values are represented either as bits or as bit-strings. Traditional descriptors, meanwhile, are typically encoded not as bit-strings but as numerical values. This implies that matrices made up of traditional descriptors are more easily interpretable, but are not composed of as many elements as fingerprints or structural keys.
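As a rough illustration of this hash-to-bits scheme (this is a sketch, not the actual Daylight algorithm; the pattern strings are invented), the following maps a hypothetical list of precomputed path patterns into a fixed-length bitmap, letting each pattern set a small number of bits:

```python
import hashlib

def hash_fingerprint(patterns, n_bits=1024, bits_per_pattern=4):
    """Map each structural pattern to a few bit positions via hashing,
    loosely in the spirit of Daylight-style fingerprints (sketch only)."""
    fp = [0] * n_bits
    for pattern in patterns:
        digest = hashlib.md5(pattern.encode()).digest()
        for k in range(bits_per_pattern):
            # take two bytes of the digest per bit position
            position = int.from_bytes(digest[2 * k:2 * k + 2], "big") % n_bits
            fp[position] = 1
    return fp

# Hypothetical linear-path patterns for a toy molecule (illustration only)
patterns = ["C", "O", "C-C", "C-O", "C-C-O"]
fp = hash_fingerprint(patterns)
```

Each pattern lights up at most a handful of bits, so the bitmap stays sparse and fixed-length regardless of how many unique patterns the enumeration produces.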
Similarity measures
After a suitable method of describing the compounds in the dataset has been chosen, the next step in clustering is the selection of a similarity measure. In order for compounds to be grouped, there must be some metric defining the differences between compounds. Some of the most popular similarity measures include Euclidean [5], city-block [6], Minkowski [6], Pearson's correlation coefficient [7], and cosine [8]; a brief summary of each can be seen in Table 1. It should be borne in mind that Pearson's coefficient is not a measure of similarity in the same sense as the other metrics; rather, it is a measure of the correlation of vectors (association), as opposed to a linear measure of distance. While this is certainly a useful measure, it is not necessarily a good means of determining differences between chemical structures.

The most common method of determining similarity is the Euclidean distance metric [9]. In the context of QSAR analysis, the most-used distance measure is an algorithm which determines the differences in the values of molecular descriptors between all compounds. Euclidean distance is typically the optimal metric, given its simplicity and ease of interpretation. It is important to note, however, that for datasets composed of molecular descriptors that have large differences in magnitude, variables must first be normalized, to prevent descriptors characterized by large values from overshadowing descriptors whose values are orders of magnitude smaller. Furthermore, while Euclidean distance can be used to measure datasets composed of continuous or ordinal categorical variables (where magnitude matters), in cases where variables are either nominal (discrete categories without order) or a mix of nominal and continuous, different measures of distance are required [10]. The heterogeneous value difference metric (HVDM) [11], which is shown in Table 2, is an example of a distance metric that can use variables that are either continuous or nominal.

HVDM is the successor of the heterogeneous Euclidean overlap metric (HEOM), which, in comparing observations, separates out variables which are continuous and non-continuous. Differences between continuous variables are processed in a normalized city-block manner, while nominal variables are processed using an overlap function, where a difference in a nominal variable between
Table 1. Summary of popular similarity measures used for the clustering of compounds.

Euclidean distance [5]:
d(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

City-block distance [6]:
d(x, y) = Σ_{i=1..n} |x_i − y_i|

Minkowski distance (for p = 1 this measure becomes city-block; for p = 2 it becomes Euclidean) [6]:
d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

Pearson's correlation (x̄ and ȳ are the average attribute values for observations x and y) [7]:
r(x, y) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1..n} (x_i − x̄)² · Σ_{i=1..n} (y_i − ȳ)² )

Cosine correlation [8]:
cos(x, y) = Σ_{i=1..n} x_i y_i / ( sqrt(Σ_{i=1..n} x_i²) · sqrt(Σ_{i=1..n} y_i²) )
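As a minimal sketch of the distance measures summarized in Table 1 (plain Python on toy descriptor vectors; no chemistry toolkit assumed):

```python
import math

def euclidean(x, y):
    # straight-line distance between two descriptor vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # generalizes city-block (p = 1) and Euclidean (p = 2)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    # correlation of the two vectors (association, not distance)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(x, y))                          # ≈ 3.7417
print(minkowski(x, y, 1) == city_block(x, y))   # True
```

Note that y here is just a scaled copy of x, so Pearson and cosine both report perfect association even though the Euclidean distance is nonzero, which illustrates the correlation-versus-distance caveat in the text.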
observations is assigned a '1', and a '0' is assigned in all other cases. A more robust method of measuring differences in discrete, nominal values is the value difference metric (VDM). The VDM measure of distance between two observations is determined by comparing the class-conditional probability distributions for the values that each observation takes on for each variable [12]. The HVDM, similar to HEOM, works in a conditional manner: if a variable is continuous, then a normalized city-block distance metric is used; if a variable is nominal, then the VDM method is used. The square root of the sum of these squared distances is the distance between two observations. Although this method has yet to reach prominence in the context of QSAR cluster analysis, a recent study by Guo et al [13] applied this distance metric to a feature-selection analysis of phenols. Because traditional methods of clustering examine descriptors which are either continuous or nominal, but not both simultaneously, this method holds promise in deriving distances between compounds with a large number of continuous and categorical values.
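A sketch of how an HVDM-style distance could be assembled for a mixed dataset follows; the toy data, attribute names, and activity labels are invented for illustration, while the diff and VDM components follow the Wilson and Martinez definitions:

```python
import math
from collections import Counter, defaultdict

def hvdm_factory(data, labels, linear_attrs):
    """Return an HVDM distance function for a mixed dataset:
    normalized difference for linear attributes, VDM for nominal ones."""
    n_attrs = len(data[0])
    classes = sorted(set(labels))
    # standard deviation of each linear attribute, for diff normalization
    sigma = {}
    for i in linear_attrs:
        vals = [row[i] for row in data]
        mean = sum(vals) / len(vals)
        sigma[i] = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
    # class-conditional counts for each nominal attribute value
    counts = defaultdict(Counter)   # (attr, value) -> Counter over classes
    totals = Counter()              # (attr, value) -> occurrences
    for row, c in zip(data, labels):
        for i in range(n_attrs):
            if i not in linear_attrs:
                counts[(i, row[i])][c] += 1
                totals[(i, row[i])] += 1

    def vdm(i, a, b):
        # compare class-conditional probabilities of the two nominal values
        return math.sqrt(sum(
            (counts[(i, a)][c] / totals[(i, a)]
             - counts[(i, b)][c] / totals[(i, b)]) ** 2
            for c in classes))

    def dist(x, y):
        total = 0.0
        for i in range(n_attrs):
            if i in linear_attrs:
                d = abs(x[i] - y[i]) / (4 * sigma[i])
            else:
                d = vdm(i, x[i], y[i])
            total += d * d
        return math.sqrt(total)

    return dist

# toy mixed data: (continuous logP-like value, hypothetical ring-type label)
data = [(1.0, 'pyridine'), (2.0, 'pyridine'), (3.0, 'furan')]
labels = ['active', 'active', 'inactive']
hvdm = hvdm_factory(data, labels, linear_attrs={0})
```

The factory precomputes the statistics once, so the returned function can then be plugged into any distance-matrix-based clustering routine in place of Euclidean distance.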
Table 2. The heterogeneous value difference metric (HVDM), an example of a distance metric that can use variables that are either continuous or nominal.

Steps and equations:
1. HVDM(x, y) = sqrt( Σ_{i=1..m} d_i(x_i, y_i)² )
2. d_i(x, y) = 1 if x or y is unknown; diff_i(x, y) if attribute i is linear; vdm_i(x, y) if attribute i is nominal
3a. diff_i(x, y) = |x − y| / (4σ_i)
3b. vdm_i(x, y) = sqrt( Σ_c | N_{i,x,c}/N_{i,x} − N_{i,y,c}/N_{i,y} |² )

σ_i is the standard deviation of linear attribute i; N_{i,x} is the number of observations with value x for attribute i, and N_{i,x,c} is the number of those observations belonging to output class c. HVDM: heterogeneous value difference metric.

Another popular method of measuring differences between compounds is the Tanimoto coefficient [14], which is particularly effective in cases using molecular fingerprints and structural keys. Because the values used by fingerprints and keys are binary, the Tanimoto similarity can be expressed as C/(A + B − C), where C is the number of values in the fingerprint/key which are present (positive) in both observations, and A and B are the numbers of values which are present (positive) in A or B, respectively [15].
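The C/(A + B − C) formula reads off directly in code; the two 8-bit keys below are made up for illustration:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity for binary fingerprints/keys: C / (A + B - C)."""
    a = sum(fp_a)                                       # bits set in A
    b = sum(fp_b)                                       # bits set in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)   # bits set in both
    return c / (a + b - c)

fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto(fp1, fp2))  # 3 common bits / (4 + 4 - 3) = 0.6
```

Because only set bits contribute, the coefficient rewards shared features without penalizing the (usually vast) number of features absent from both compounds.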
Clustering algorithms Clustering algorithms can be divided into two broad realms: hierarchical and nonhierarchical partitioning [16]. Hierarchical clustering can be further divided into two groups: agglomerative and divisive. The general idea behind agglomerative clustering is that clusters are formed
from the smallest clusters, which eventually merge to become part of the largest. Each individual compound is initially considered a cluster, and as the algorithm proceeds, each compound is absorbed into larger and larger clusters, until the dataset is expressed as a single cluster composed of all compounds. Conversely, divisive clustering works in the opposite direction, whereby the initial dataset is considered one large cluster, which is successively divided into smaller and smaller clusters. In general, agglomerative clustering is the faster type of hierarchical clustering and, as a consequence, the most popular. In the realm of agglomerative clustering, Ward's method is among the most popular [18]. Methods of non-hierarchical clustering include K-means and Jarvis-Patrick clustering [19,21]. A summary of popular clustering methods is presented in Figure 1.

Hierarchical clustering is a method of representing a dataset as a collection of hierarchically arranged subsets. These subsets are determined by some measure of proximity. In classical QSAR analysis, this measure of proximity is determined by measuring the differences in the values of molecular descriptors between different compounds. The generalized procedure for agglomerative hierarchical clustering is as follows. After descriptor calculation, a distance metric must be selected; as mentioned previously, this depends on the dataset: binary values are better described using a Tanimoto metric, while continuous values are better described using a Euclidean metric. After a suitable metric is selected, a distance matrix is created which, using that metric, is composed of rows expressing the differences in molecular descriptor values between all of the observations. This matrix is then searched for the two observations which are closest. These two observations are combined, and a new matrix is created, with distances between this combination of observations and the remaining observations; consequently, the distance matrix has one less row. This procedure of combining observations into groups is repeated until all of the observations are integrated into one group consisting of all observations [18].

Ward's method of clustering follows the previously discussed procedure, but includes a measure of error as a criterion for combining clusters. The 'error sum of squares' (ESS) is defined as the sum of the squared differences between the value of a given molecular descriptor for a particular observation in a cluster and the average value of that molecular descriptor in the cluster. Clusters are combined if the increase in ESS between stages is minimized. It is important to emphasize that Ward's method does not create an optimal set of clusters, and it remains up to the computational chemist to determine a logical clustering stopping point.

In the realm of nonhierarchical clustering, the K-means algorithm is among the oldest and most popular [19].
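The generalized agglomerative procedure can be sketched as follows; this is a naive average-linkage implementation on invented 2D points, standing in for a real descriptor matrix and a chosen linkage rule:

```python
import math

def agglomerate(points, target_clusters=1):
    """Naive agglomerative clustering with average linkage:
    repeatedly merge the two closest clusters (illustrative sketch)."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return math.dist(points[a], points[b])

    def linkage(c1, c2):
        # average pairwise distance between members of the two clusters
        return sum(dist(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > target_clusters:
        # search for the closest pair of clusters, then merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(agglomerate(pts, target_clusters=2))  # → [[0, 1], [2, 3]]
```

Running the loop all the way to a single cluster, while recording each merge, yields the full hierarchy; stopping earlier, as here, corresponds to choosing a cut point in the dendrogram.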
Figure 1. Summary of popular compound clustering methods.

Hierarchical:
- Agglomerative: single-link [17], complete-link [17], group-average [17], median [17], centroid [17], Ward's method [18], mutual neighborhood value method [36]
- Divisive: Edwards and Cavalli-Sforza method [37], MacNaughton-Smith method [37]

Non-hierarchical: K-means [19], Jarvis-Patrick [21], artificial neural networks [9], genetic algorithms [9], MSA [24], MCS [23]

MSA: macrostructure assembly; MCS: maximum common structure.
Nonhierarchical clustering differs from hierarchical clustering in that the former attempts to divide the data into a predefined number of non-related clusters, while the latter nests compounds into larger and larger clusters. The K-means algorithm, in particular, follows an iterative approach to nonhierarchical clustering [19]. Given a group of compounds, 'n' reference points are chosen, depending on the desired number 'n' of clusters. The distance of each compound to each reference point is computed, and cluster membership is determined by the closest reference point. The K-means algorithm initially follows this approach. After clusters are first computed, the reference point for each cluster is moved to the centroid of the cluster, and the distance of this new reference point to each compound is recomputed. If a compound is closer to the reference point of another cluster, then the compound is reassigned to the other cluster. This process is repeated until the clusters are stable and successive iterations yield no changes in cluster membership.

An alternate nonhierarchical clustering algorithm is the Jarvis-Patrick method [21]. This algorithm clusters compounds using a nearest-neighbor approach, where the distances between all compounds are computed using the desired distance metric. Compounds are clustered together if they are suitably close to each other and have, to an extent, matching near-neighbor lists consisting of N near neighbors, where N is some user-defined number. Intuitively, the smaller the value of N, the more inclusive the clusters, while the larger the value of N, the more exclusive the clusters.
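The assign-and-recompute loop described above can be written in a minimal form; the points are invented, and a production run would operate on a descriptor matrix with multiple random restarts:

```python
import math
import random

def k_means(points, k, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest reference point,
    move each reference to its cluster centroid, repeat (sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(n_iter):
        # assignment step: nearest reference point wins
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # update step: move each reference point to its cluster centroid
        new_centers = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:   # stable: no membership changes
            break
        centers = new_centers
    return centers, groups

pts = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0)]
centers, groups = k_means(pts, k=2)
```

The stopping condition mirrors the text: once the centroids stop moving, no compound can change cluster, so the partition is stable.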
Recent developments
The previously described clustering methods have become widely adopted in the field of QSAR analysis. Although the methods remain effective, increased computing power has resulted in the introduction of a host of new techniques in cluster analysis.

Maximum common structure (MCS) and macrostructure assembly (MSA) are methods which hold promise in cluster analysis. MCS is a method of searching for the largest common substructure in a collection of graphs [22]. The application of MCS to cluster analysis has been characterized by Stahl and Mauser [23]. The procedure begins by creating molecular fingerprints of a dataset, which are then clustered according to an exclusion-sphere algorithm. The MCS is then determined for each cluster. For each MCS, a neighbor list is created, where the scaffold for each cluster is compared to the scaffolds of other clusters. The singleton clusters created in the first step can now be compared to the MCSs of other clusters. Along with these singletons, new clusters can also be formed from clusters with closely related MCSs.

MSA is a related method of structural searching, but also has an inherent application to compound clustering. This method, by Cross et al [24], begins with an initial set of predetermined molecular building blocks, which are then reassembled into larger structures. This process is repeated until an appropriate set of MSAs is reached. These MSAs are defined as substructural signatures which can then be used to discriminate for membership in a cluster.
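The exclusion-sphere step of the MCS procedure can be sketched as follows; this is a loose illustration over made-up fingerprints, not the actual Stahl and Mauser implementation, which differs in detail:

```python
def sphere_exclusion(fps, threshold=0.7):
    """Exclusion-sphere clustering over binary fingerprints (sketch):
    pick the compound with the most neighbors above the similarity
    threshold, claim it and its neighbors as a cluster, and repeat
    on the unassigned remainder."""
    def tanimoto(a, b):
        c = sum(1 for x, y in zip(a, b) if x and y)
        return c / (sum(a) + sum(b) - c)

    remaining = set(range(len(fps)))
    clusters = []
    while remaining:
        # neighbor lists restricted to unassigned compounds
        neigh = {i: [j for j in remaining if j != i
                     and tanimoto(fps[i], fps[j]) >= threshold]
                 for i in remaining}
        centroid = max(sorted(remaining), key=lambda i: len(neigh[i]))
        cluster = sorted([centroid] + neigh[centroid])
        clusters.append(cluster)
        remaining -= set(cluster)
    return clusters

# four toy 4-bit fingerprints; compound 2 ends up as a singleton
fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
print(sphere_exclusion(fps, threshold=0.5))  # → [[0, 1, 3], [2]]
```

The singletons this step leaves behind are exactly the compounds that the later MCS-comparison stage attempts to reassign to clusters with related scaffolds.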
Application
Suppose an investigator is given a moderately large set of unique compounds along with their corresponding binding data; some compounds could be considered active, while others could be considered inactive. By grouping compounds into clusters with similar structures, the researcher could add a greater degree of specificity to the QSAR models. Without any knowledge of the source of the compounds, the investigator would traditionally have to separate compounds based on a visual inspection. This could yield erroneous results, as superficial similarities may not indicate accurate groupings. Even if the investigator were to have prior knowledge about the source of the compounds, and could distinguish between the unique scaffolds in the dataset, this does not preclude the use of a clustering algorithm. Scaffolds with a small set of compounds would not make for strong models, and could contain structural characteristics that correlate well with structural features of other scaffolds. Furthermore, clustering could provide hints about a particular dataset's ability to bind other proteins. For example, in drug development, a researcher could quickly gain important knowledge about the binding ability of clustered lead compounds to off-target enzymes by simply examining the structures prevalent in clusters with increased off-target binding.

One application can be demonstrated as follows. A set of compounds spanning distinct scaffolds, designed to bind pyruvate dehydrogenase kinase isozyme 1 (PDK1), was compiled. PDK1 is a particularly attractive target of inhibition because of its position as an upstream regulator in the phosphoinositide 3-kinase/protein kinase B/mammalian target of rapamycin (PI3K/AKT/mTOR) signaling pathway. Because these compounds are potential cancer therapeutics, it is important to also characterize the metabolic effects these compounds may have, in order to determine effective and ineffective dose levels as compounds are moved forward.
These compounds were first manually clustered based on source, then computationally clustered. Models based on PDK1 binding data were created using each of the clusters. Using the computationally derived clusters, associated cytochrome P450 (CYP) [...] naphthyridines. Interestingly, in examining C