Similarity and Dissimilarity Methods for Processing ... - Semantic Scholar

Similarity and Dissimilarity Methods for Processing Chemical Structure Databases VALERIE J. G ILLET1 , DAVID J. W ILD1 , P ETER W ILLETT1,3 J OHN B RADSHAW2

AND

1 Krebs Institute for Biomolecular Research and Department of Information Studies, University of

Sheffield, Sheffield S10 2TN, UK 2 Glaxo Wellcome Research and Development Limited, Gunnels Wood Road, Stevenage SG1 2NY, UK

Email: [email protected] This paper reviews measures of similarity and dissimilarity between pairs of chemical molecules and the use of such measures for processing chemical databases. The applications discussed include similarity searching, database clustering and diversity analysis, focusing upon measures that are based on fragment bit-string occurrence data. The paper then discusses recent work on the calculation of similarity by aligning molecular fields and on the selection of structurally diverse subsets of chemical databases.

1. INTRODUCTION Many different scientific disciplines (such as synthetic organic chemistry, structural biology, pharmacology and toxicology) are needed to discover the new drugs that are the lifeblood of the pharmaceutical industry. The huge costs and extended timescales that characterize the industry mean that it is willing and able to make very substantial investments in any technology that can increase the speed with which drugs, i.e. novel chemical molecules with beneficial biological properties, are brought to the market place (and similar comments apply to the pesticides and fungicides developed by the agrochemicals industry). Such investments have provided one of the principal driving forces for the highly sophisticated systems that have been developed for the storage, retrieval and processing of a range of types of chemical information [1]. Textual information, such as the bibliographic details of a journal article describing the synthesis of a particular substance, or numerical information, such as the melting point and the molecular weight for that substance, can be stored and retrieved using conventional database methodologies. Radically different approaches are required to process the two-dimensional (2D) or three-dimensional (3D) structures of chemical compounds and this paper reviews the use of similarity-based and dissimilarity-based techniques for processing such structures. 2. REPRESENTATION AND CHEMICAL STRUCTURES

SEARCHING

OF

The principal method of representation for a 2D chemical structure diagram is the connection table, which contains a list of all of the (usually non-hydrogen) atoms within 3 To whom all correspondence should be addressed.

T HE C OMPUTER J OURNAL,

a structure, together with bond information that describes the exact manner in which the individual atoms are linked together. A connection table can be regarded as a labelled graph in which the nodes and edges of a graph represent the atoms and bonds, respectively, of a 2D chemical structure diagram and a chemical database can hence be represented by a large number of such graphs. A corporate database, such as those containing the molecules synthesized by a particular pharmaceutical or agrochemical company, will typically contain a few hundreds of thousands of such graphs; a public database, such as those constructed by the Beilstein Institute or Chemical Abstracts Service, will contain some millions of such graphs. The use of a graph representation facilitates two forms of database searching. Structure searching involves an exact-match search of a chemical database for a specific query structure as is required, for example, to retrieve the biological assay results and the synthetic details associated with a particular molecule. Such a search is effected by means of a graph isomorphism search, in which the graph describing the query molecule is checked for isomorphism with the graphs of each of the database molecules (although a variety of coding procedures ensure that the time-consuming graph isomorphism check needs to be carried out on only a very small fraction of the molecules in the database). Substructure searching, or atomby-atom searching, involves a partial-match search of a chemical database to find all those molecules that contain a user-defined query substructure, such as a penicillin ring nucleus, irrespective of the environment in which the query substructure occurs. This is effected by checking the graph describing the query substructure for subgraph isomorphism with the graphs of each of the database molecules [2]. The NP-complete nature of subgraph isomorphism means Vol. 41, No. 8, 1998

548

V. J. G ILLET et al.

that substructure searching in databases of non-trivial size would be totally infeasible if it were not for the use of an initial screen search, where a screen is a substructural feature, the presence of which is necessary, but not sufficient, for a molecule to contain the query substructure. These features are typically small, atom-, bond- or ring-centred fragment substructures that are algorithmically generated from a connection table when a molecule is added to the database that is to be searched. The fragments that have been chosen to comprise the screen set, i.e. the set of fragments that are to be used for screening, are listed in a fragment coding dictionary. The database structures are processed by analysing the corresponding connection tables to identify those screens from the coding dictionary that are present, and then represented for search by fixed-length bit-strings in which the non-zero bits correspond to the screens that are present. The query (sub)structure is subjected to the same process and the screen search is effected by checking the bit-strings representing each of the database structures for the presence of the screens that are encoded in the bit-string representing the query substructure. Only a very small fraction of a database will normally contain all of the screens that have been assigned to a query substructure and thus only this small fraction needs to undergo the final, time-consuming atom-by-atom search, which ensures that there is an exact subgraph isomorphism between the graphs representing the query substructure and each database structure. The simple, two-stage procedure (i.e. screen searching and atom-byatom searching) described previously has formed the basis for most operational 2D substructure searching systems and analogous techniques are used for 3D substructure searching. Here, a database structure is represented by a complete graph in which the nodes denote the atoms of a molecule and the edges denote the inter-atomic distances, with the screens being based on inter-atomic distance ranges [3]. In both 2D and 3D, the combination of a screen search and a subgraph isomorphism search is analogous to the combination of bit-string matching and pattern matching that characterizes signature-based systems for text retrieval [4]. Substructure searching, whether in 2D or in 3D, provides an invaluable tool for accessing databases of chemical structures. It does, however, have several limitations that are inherent in the retrieval criterion that is being used, which is that a database record must contain the entire query substructure in precisely the form that it has been specified by the user. First, and most importantly, a substructure search requires that the user who is posing the query must already have acquired a well defined view of what sorts of structures are expected to be retrieved from the database. For example, a 3D substructure search requires sufficient information about the geometric requirements for activity to be able to specify distance and/or angular constraints to characterize those molecules, and just those molecules, that can fit into some biological receptor site [3]. The identification of these requirements normally involves comparing several bioactive molecules to identify the pharmacophore, i.e. the pattern of features that they have T HE C OMPUTER J OURNAL,

in common and that are thus assumed to be responsible for the observed activity [5]. This is clearly very difficult at the start of an investigation when perhaps only one or two active structures have been identified and when it is not at all clear which particular feature(s) within them are responsible for the observed activity. Second, as with any form of partialmatch retrieval mechanism, there is very little control over the size of the output that is produced by a particular query substructure. Accordingly, the specification of a common ring system, such as the benzodiazepine system that forms the nucleus of many tranquillizers, can result in the retrieval of many thousands of compounds from a chemical database (unless it is also possible to apply additional filters, such as a user-defined range of values for some physicochemical property). Finally, a substructure search results in a simple partition of the database into two discrete sub-sets, viz those structures that contain the query and those that do not, and there is thus no direct mechanism by which the retrieved molecules can be ranked in order of decreasing probability of activity. These inherent limitations have occasioned interest in similarity searching as a complement to substructure searching, in much the same way as best-match searching is increasingly complementing the traditional Boolean approaches in text retrieval systems [6]. Similarity searching requires the specification of an entire target structure, rather than the partial structure that is required for substructure searching. The target molecule is characterized by a set of structural features, and this set is compared with the corresponding sets of features for each of the database structures. Each such comparison enables the calculation of a measure of similarity between the target structure and a database structure and the database molecules are then sorted into order of decreasing similarity with the target. The output from the search is a ranked list, where the structures that the system judges to be most similar to the target structure, the nearest neighbours, are located at the top of the list and are thus displayed first to the user. Accordingly, if an appropriate measure of similarity has been used, the first database structures inspected will be those that have the greatest probability of being of interest to the user. Since its introduction in the mid-eighties [7, 8], similarity searching has proved extremely popular with users, who have found that it provides a means of accessing chemical databases that is complementary to the existing structure and substructure searching facilities. At the heart of any similarity searching system is the measure that is used to quantify the degree of structural resemblance between the target structure and each of the structures in the database that is to be searched. Downs and Willett [9] provide an extended review of inter-molecular structural similarity measures, focusing on those that are sufficiently rapid for similarity searching in databases of non-trivial size. The most common measures of this type are based on comparing the fragment bit-strings that are normally used for 2D substructure searching, so that two molecules are judged as being similar if they have a large number of bits, and hence substructural fragments, in Vol. 41, No. 8, 1998

S IMILARITY

AND DISSIMILARITY METHODS FOR PROCESSING CHEMICAL STRUCTURE DATABASES

common. A normalized association coefficient, typically the Tanimoto coefficient, is used to give similarity values in the range of zero (no bits in common) to unity (all bits the same). Specifically, if two molecules have A and B bits set in their fragment bit-strings, with C of these in common, then the Tanimoto coefficient is defined to be C . A+ B −C While such a fragment-based measure clearly provides a very simple picture of the similarity relationships between pairs of structures, it is both efficient (since it involves just the application of logical operations to pairs of bit-strings) and effective (in that it is able to bring together molecules that are judged by chemists to be structurally similar to each other) in operation. The latter characteristic is most surprising, given that the fragments that are used for the calculation of the similarities were originally designed to maximize the efficiency of substructure searching, not the effectiveness of similarity searching. Many other types of similarity measure have been described [9, 10, 11], and at least some have been used for database searching; however, none of these measures has proved to be anywhere near as popular as the simple, fragment-based measures described earlier, and this type of measure is hence assumed in the remainder of this paper unless stated otherwise. 3. CLUSTERING METHODS IN CHEMICAL INFORMATION SYSTEMS Random screening has long played an important role in leaddiscovery programmes. Here, compounds are selected from a database, either a corporate database or one that is publicly available, and then tested in a bioassay that determines whether the selected compounds have the biological activity of interest. The identification of an active compound is used to initiate an iterative process in which a similarity search is used to identify structurally related molecules that are tested, in their turn, for activity. Once several such actives have been identified, a pharmacophore mapping procedure is used to derive a putative pharmacophore that can then form the basis for substructure searches that delineate the precise geometric requirements for activity. Considerations of cost-effectiveness dictate that the compounds selected for biological testing in the initial stages of such programmes cover the full range of structural types that are available to an organization, and there has thus been much interest in computer-based methods that can be used to maximize the coverage of structural space. Cluster analysis, or automatic classification, was the first such technique to be used for this purpose and continues to attract much attention for use in the combinatorial chemistry programmes that are discussed in the next section. Cluster analysis, or clustering, is the process of subdividing a group of objects (chemical molecules in the present context) into groups, or clusters, of objects that exhibit a high degree of both intra-cluster similarity and T HE C OMPUTER J OURNAL,

549

inter-cluster dissimilarity [12, 13]. It is thus possible to obtain an overview of the range of structural types present within a dataset by selecting one (or some small number) of the molecules from each of the clusters resulting from the application of an appropriate clustering method to that dataset. The representative molecule for each cluster is either selected at random or selected as being the closest to the cluster centroid. These selected compounds are then tested in the bioassay of interest. If a compound proves active it is then appropriate to assay the other compounds in its cluster since these may also exhibit the activity of interest: the fact that structurally similar molecules have similar properties is normally referred to as the similar property principle [11]. Very many different clustering methods have been described in the literature and a considerable amount of effort has gone into comparing the effectiveness of the various methods for clustering chemical structures (typically represented by fragment bit-strings). These comparisons, which are typified by the work of Willett [14] and of Brown and Martin [15] have used a ‘leave-one-out’ experimental methodology that was first suggested by Adamson and Bush [16] and that is based upon the similar property principle. Assume that the value of some quantitative (i.e. interval or ratio scale) property has been measured for each of the molecules in a dataset. The property value of a molecule, I , within this dataset is assumed to be unknown and the classification resulting from the use of some particular clustering method is scanned to identify the cluster that contains the molecule I . The predicted property value for I , P(I ), is then set equal to the arithmetic mean of the observed property values of the other compounds in that cluster. This procedure results in the calculation of a P(I ) value for each of the N structures in the dataset, and an overall figure of merit for the classification is then obtained by calculating the product moment correlation coefficient between the sets of N observed and N predicted values. The most generally useful clustering methods will be those that give high correlation coefficients across as wide a range of datasets as possible. The Adamson and Bush approach to the comparison of clustering methods was used by Willett [14] in a study of over 30 hierarchic (including both agglomerative and divisive) and non-hierarchic (including single-pass, relocation and near neighbour) clustering methods when applied to 10 small datasets for which physical, chemical or biological property data were available. The study found that the best results were obtained with the Ward hierarchicagglomerative method [17], with the non-hierarchic nearestneighbour method of Jarvis and Patrick [18] performing almost as well. At the time that these comparative experiments were carried out, computer limitations (in terms of both raw CPU speeds and the clustering algorithms available) meant that the Ward method could not be applied to chemical databases of substantial size. The Jarvis–Patrick method was thus rapidly adopted as the clustering method of choice in commercial chemical database software, not only to select compounds for random screening but also Vol. 41, No. 8, 1998

550


to cluster the outputs of substructure searches that retrieve very large numbers of molecules, thus providing the searcher with an overview of the structural classes that contain the substructure of interest [19]. However, the method does have limitations (see e.g. [20]) and subsequent comparisons [15, 21, 22] have reaffirmed the general superiority of the Ward method. The availability of improved computer hardware and of the efficient reciprocal nearest neighbours algorithm [23] means that this method can now be applied to databases containing some hundreds of thousands of molecules in an acceptable amount of time and the method is thus becoming available in commercial chemical database software; larger datasets, however, still require use of the Jarvis–Patrick method. 4. MOLECULAR DIVERSITY The last few years have seen the introduction of several alternatives to cluster-based approaches for the selection of sets of compounds. The development of these new approaches has been occasioned by the widespread adoption of combinatorial chemistry in the pharmaceutical industry. Combinatorial chemistry is the name given to a body of techniques that permit the synthesis of sets of molecules, called libraries, that contain large numbers (hundreds up to millions) of structurally related molecules that are rapidly generated by automated procedures [24, 25]. The development of a range of supporting technologies, such as high-throughput bioassays and chemical robotics, has meant that combinatorial chemistry provides a far more costeffective approach to the discovery of bioactive compounds than traditional approaches that require the sequential synthesis and testing of individual molecules. An important concept in the design of combinatorial libraries is the concept of molecular diversity, where diversity denotes the degree of heterogeneity, structural range or dissimilarity in a set of compounds [26]. Combinatorial approaches seek to maximize the structural diversity of the final library, so as to ensure coverage of the largest possible expanse of chemical space in the search for bioactive molecules. This is generally effected by selecting sets of starting materials, sometimes called monomer pools, that are as diverse as possible. These monomers are then reacted together in a combinatorial synthesis, with the expectation that this will maximize the diversity of the reaction products comprising the final library [27, 28]. There is thus much interest in the development of computerbased methods for selecting compounds so as to maximize chemical diversity, with the resulting techniques also being applicable to related tasks such as the identification of structural overlap in databases and the mapping of structural space. The availability of cluster-based methods for selecting compounds (as described previously) meant that the methods were soon adopted for use in combinatorial chemistry programmes, as exemplified by the work of Shemetulskis et al. [29]. However, at least three other approaches to compound selection have also been described. T HE C OMPUTER J OURNAL,

The first of these was dissimilarity-based compound selection. Cluster-based approaches identify a set of dissimilar molecules indirectly (since the approaches require the initial identification of clusters of similar molecules, from which disparate molecules can subsequently be selected) and analogous comments apply to the partitionbased approaches described later. Dissimilarity-based approaches, conversely, try to identify the most dissimilar molecules in a dataset directly, using some quantitative measure of dissimilarity such as the complement of the Tanimoto coefficient; non-chemical examples of this ‘anticlustering’ behaviour are described by Späth [30] and Mirkin and Muchnik [31]. A diverse set of molecules is normally generated by selecting an initial molecule at random from a database and then repeatedly selecting that molecule from the database that is as different as possible from those that have already been selected. This simple algorithm, which seems to have been first described by Kennard and Stone [32], is not guaranteed to identify the most diverse subset possible; however, it has been found to work well in practice [33, 34] and is reasonably efficient, with an expected time complexity of O(n N) for the selection of an n-compound subset of an N-compound dataset [35]. Other algorithms for dissimilarity-based selection are described by Hudson et al. [36], Nilakantan et al. [37], Snarey et al. [38] and Taylor [39], inter alia. Partition-based selection requires the identification of a set of p characteristics to describe each of the molecules in a dataset. These are typically molecular properties that would be expected to affect binding at a receptor site, such as hydrophobicity, polarity, hydrogen bond donor and acceptor power, torsional flexibility and shape [40]. The range of values for each such characteristic, i (1 ≤ i ≤ p), is (sub-)divided into a set of bi bins (or sub-ranges). The combinatorial product of all possible bins then defines the set of groups that make up the partition and each molecule is assigned to the group that matches the set of binned characteristics for that molecule. A subset is obtained by selecting one (or some small number) of the molecules in each of the resulting groups. Partition-based selection is much faster than both cluster-based and dissimilaritybased selection, having a time complexity of just O(N). Moreover, the availability of a partition enables the explicit identification of those sections of structural space that are under-represented, or even unrepresented, in a database; this can provide valuable information for de novo programmes that design structures subject to a set of structural and functional constraints [41]. It does, however, require an appropriate set of characteristics to define the partition (rather than the fragment bit-strings assumed elsewhere in this review): typical approaches are described by Cummins et al. [42], Pearlman [43] and Pickett et al. [44]. All of the selection methods discussed here try to identify a diverse subset of a database and a further approach hence involves defining a quantitative measure of structural diversity and then using some optimization technique to maximize the value of this measure. Such optimizationbased selection approaches might appear to be inappropriate Vol. 41, No. 8, 1998

S IMILARITY


because of the astronomical number of subsets that can be chosen from a database of non-trivial size. Specifically, the identification of the maximally-diverse n-compound subset of an N-compound dataset (where, typically, n N) would involve consideration of up to N! n!(N − n)! possible subsets and the optimization methods that have been used thus far hence seek to identify a reasonably diverse subset in a realistic amount of time. Optimization methods that have been used include D-optimal design [27], genetic algorithms [28, 45] and simulated annealing [46]. The precise nature of the subset that is obtained will depend both upon the optimization method and on the quantitative measure of structural diversity, or diversity index, that is used. Examples of diversity indices include: the HookSpace Index [47], which provides a quantitative description of the arrangement of functional groups in 3D space (and thus an estimate of the geometric diversity of structure databases); a count of the number of bits that are set in the union (the Boolean logical OR) of all of the fragment bit-strings for a database [27]; the number of distinct substructures that can be generated from all the molecules in a database [48]; the number of distinct rings that are present in a database [37]; and the fraction of the possible groups within a partition that contain at least some threshold number of molecules [43]. In addition to quantifying the diversity of an individual database, there have also been reports of methods for comparing the diversities of different databases. Thus, Turner et al. [49] use the sum of all of the pairwise intermolecular dissimilarities for a database to quantify the change in diversity that takes place when an external dataset is merged with an existing database, while Cummins et al. [42] have measured the overlaps between pairs of databases in terms of the numbers of partition-groups that are occupied by compounds from both databases. Having introduced the current status of similarity and dissimilarity methods for processing chemical structure databases, the next two sections discuss current research in Sheffield in these areas: specifically, work on fieldbased similarity searching and on a new algorithm for dissimilarity-based compound selection.

3D similarity searching. Molecular electrostatic potential (MEP) fields are a way of representing the electrostatic potential around a molecule (i.e. the force on a hypothetical probe due to electrostatic effects) and the calculation of inter-molecular similarities using MEP fields has been studied by several workers in the context of small sets of compounds [51, 52, 53, 54, 55, 56]. Over the last few years, work in Sheffield has studied the use of such measures for similarity searching in large chemical databases, with studies being conducted of: the effectiveness of various standardization methods and similarity coefficients for fieldbased similarity searching [57]; the automatic alignment of pairs of MEP fields (two fields must be aligned before similarity can be calculated) by means of graph theory [58] and by means of a genetic algorithm (hereafter a GA) [59]; and the extent to which these two approaches to the generation of molecular alignments can handle conformationally flexible molecules [60]. Here, we discuss the extension of this work to encompass other types of molecular field and the use of the resulting field-based similarity measures for identifying bioisosteres, i.e. structurally different molecules that exhibit the same biological effect [61]. Given a molecular property P that can be calculated at any point around a molecule, a field may be created around the molecule by integrating P with respect to volume. The similarity between a pair of molecules may then be calculated based on the overlap of the corresponding fields of the two molecules, using a similarity coefficient such as the Carbo index [51], a form of the long-established cosine coefficient. The Carbo index is defined to be R PA PB dv R R AB = R 2 ( PA dv)1/2 ( PB2 dv)1/2 where PA and PB are the properties of the two molecules that are being compared and where the summations are over all components of 3D grids that surround these two molecules. If P denotes the electrostatic potential, then the potential, Pr , at a point r for a molecule of n atoms is calculated from the point charges qi on each atom i in the molecule, so that Pr =

n X i=1

5. FIELD-BASED SIMILARITY SEARCHING The fragment-based, 2D approach to similarity searching described previously is an established, important component of chemical information systems and the last few years have seen interest in the development of comparable facilities for 3D similarity searching [9]. Most of the measures of 3D similarity that have been described to date are based on inter-atomic distances and the measures thus quantify the degree of geometric similarity between a target structure and a database structure. Modern approaches to the prediction of biological activity [50] emphasize the use of molecular fields, and it thus seems appropriate to consider the use of field-based information for T HE C OMPUTER J OURNAL,

551

qi (r − Ri )

where Ri denotes the position of the i th atom and where r − Ri denotes the Euclidean distance between the two points. The resulting Pr values can then be inserted back into the Carbo index and the integral calculated using grid-based methods, but this is very time-consuming unless a coarse grid spacing is used (in which case it is unlikely that the calculated similarities will give an accurate representation of the structural resemblances that are present in a dataset). Instead, we have used the Gaussian approximation approach described by Good et al. [52]. Here, the 1/r term in the calculation of the potential is replaced by a Gaussian approximation that is inserted into a version of the Carbo equation that can then be calculated far more quickly Vol. 41, No. 8, 1998

552


than the grid points can be processed in the conventional approach. The Gaussian approach hence provides an elegant means of efficiently calculating the similarity between two molecules’ MEPs and comparable procedures can be employed to calculate steric and hydrophobic similarities. The electron density, ρr , at a point r around a molecule can be calculated from the sum of the contributions from each of the atoms in the molecule, i.e. ρr =

n X

E i (r − Ri )

i=1

where E i (d) is the electron density contribution of atom i at distance d from the nucleus and where r and Ri are defined as previously. Good and Richards [53] have shown that it is possible to use Gaussian approximations to fit the curve of electron density against distance from the atomic nuclei and suggested that the resulting expressions can be used for the calculation of a shape-similarity version of the Carbo index. A measure of lipophilic potential may be gained by assigning partial lipophilicity constants to atoms and designing a distance function that relates lipophilicity to distance from atomic centres, in much the same way as the MEP is calculated. Specifically, Gaillard et al. [62] define the molecular lipophilic potential (MLP) at a point r around a molecule by M L Pr =

n X

f i e−(r−Ri )/2

i=1

where fi is the atomic lipophilicity constant. This function was fit to a two-term Gaussian function, analogous to those used for the MEP and shape similarities, with the f i values being calculated using a scheme described by Croizet et al. [63]. In this way, it is possible to generate electrostatic, steric and hydrophobic field descriptors that can then be used to quantify the degree of similarity between pairs of 3D structures, given an appropriate alignment of the molecular fields. The alignments are generated using a GA that aligns two molecules’ fields so as to maximize the value of the Carbo index [59]. In brief, each chromosome in this GA encodes the rotations and translations that are to be applied to a database structure to align it with the target structure. If no account is taken of conformational flexibility then just the rigid-body rotations are encoded; alternatively, if the molecules are allowed to flex, then the chromosome additionally encodes the torsional rotations [60], although the experiments reported here consider only rigid molecules. The fitness function is the value of the Gaussian similarity coefficient resulting from that particular encoded alignment. The algorithm has been used previously to align molecules on the basis of their MEPs but it is simple to modify it to encompass the Gaussian versions of the shape and MLP similarity measures described earlier, thus allowing the calculation of all three types of similarity by the same basic procedure. Alignments may be made based on a single field-type, or on any combination of the three types of field; T HE C OMPUTER J OURNAL,

FIGURE 1. The ten target structures for the similarity searches and their WDI activity classes.

here, we report the results when all three of them were combined. During the execution of the GA, each alignment of a target structure and a database structure was used to calculate each of the three individual types of field-based similarity (MEP, shape and MLP) and then the fitness for the chromosome encoding that alignment was the mean of the three resulting Gaussian similarity values. No types of weighting or standardization are applied to the individual similarity measures, so that all three types of field are assumed to contribute equally to the overall score of an individual database structure. 2D and 3D similarity searches were also carried out to provide a basis for the evaluation of the field-based similarity measures. Full details of this work and more extended results are presented by Wild and Willett [64]. Vol. 41, No. 8, 1998

S IMILARITY


553

TABLE 1. Analysis of the active molecules retrieved in the top-300 rank positions by the 2D, 3D and FBSS similarity measures, using ten different target molecules. The left-hand part of the table contains the number of actives (with the unique actives in brackets) retrieved and the fourth column lists the total number of molecules with the same activity classification as the target. The right-hand part of the table contains the diversity of the retrieved active molecules. Target molecule Apomorphine Captopril Cycliramine Diazepam Diethylst’ol Fenoterol Gaboxadol Morphine RS86 Serotonin Mean

Numbers of (unique) active molecules 2D 3D FBSS Total 60 (18) 68 (33) 94 (17) 46 (8) 100 (34) 58 (19) 7 (2) 46 (7) 26 (20) 21 (4) 52.6 (16.2)

31 (2) 28 (1) 141 (28) 45 (4) 91 (29) 31 (0) 16 (4) 43 (3) 18 (4) 19 (2) 46.3 (7.7)

49 (15) 26 (0) 120 (27) 37 (6) 100 (2) 33 (0) 14 (2) 30 (5) 14 (3) 22 (3) 44.5 (6.3)

The experiments used the 1995 edition of the World Drugs Index (WDI), which contains the 2D structures and activity classes for about 41,000 drugs. The GA is quite slow in operation, taking about 4 CPU days to search the entire file and the experiments hence used a randomly-selected 1-in10 subset of the full database. This was searched using the 10 target structures employed in a recent study of propertybased similarity measures [65]. These target structures are listed in Figure 1, together with their associated WDI activity classes. For the purposes of these experiments, drugs that lie within the same activity class or classes as the target are considered actives. For each of the target structures, all of the actives were added to the 1-in-10 subset and then a similarity search was carried out to retrieve the 300 nearest neighbours for the target structure. Three types of search were carried out for each target structure: a field-based similarity search (hereafter referred to as FBSS), which, as noted previously, involved all three types of field; a conventional 2D, fragment-based similarity search that was implemented using the search routines in the UNITY chemical information management system produced by Tripos Inc.; and a geometric 3D similarity search that was implemented using atom mapping, a distance-based measure of geometric similarity that maps the most similar atoms in two molecules to each other, based on the 3D geometric environment and the hydrogen-bonding type of the atom [66]. In each case, the search performance was measured by the number of actives present in the topranked 300 nearest neighbours and by the number of unique actives, i.e. actives that were not retrieved by any other of the similarity measures. The left-hand portion of Table 1 shows the number of actives in the top-300 hits returned by each of the methods, where it will be seen that the 2D similarity measure is the most effective at retrieving active molecules. This finding might appear counter-intuitive, given the known importance of 3D information in general (and 3D field information T HE C OMPUTER J OURNAL,

Calculated diversities 2D 3D FBSS

117 86 261 113 189 65 34 128 93 38

0.32 0.41 0.59 0.51 0.38 0.42 0.49 0.31 0.62 0.33

0.32 0.33 0.61 0.51 0.31 0.45 0.53 0.33 0.72 0.42

0.56 0.70 0.60 0.55 0.35 0.44 0.52 0.34 0.70 0.52

112.4

0.44

0.46

0.53

in particular) in determining biological activities [50], but account must also be taken of the types of molecule in the database used here. Many of the molecules with a given activity class in the WDI are topologically very similar, including many close analogues and so-called ‘metoo’ drugs. These molecules would be very easy to retrieve using a similarity measure that explicitly encodes 2D information; however, they might well not represent the full range of bioactive structural types, whereas such diverse sets of molecules might be retrieved by 3D measures that do not focus on the specific patterns of atoms and bonds in molecules. As Briem and Kuntz note [67], such measures permit the identification of molecules that are complementary to a particular biological target whereas 2D measures are well suited to finding molecules with a common substructural moiety. An inspection of the various search outputs certainly suggests that the 2D measure results in less diverse sets of nearest neighbours than do the other measures, and we have sought to quantify this finding by means of a diversity index based on the fragment bit-strings used for the 2D similarity search. The index used here is the mean pairwise dissimilarity when averaged over all of the molecules in a dataset [49]. This index was calculated for the sets of active structures retrieved in each of the searches and the results are listed in the right-hand part of Table 1, where it will be seen that the FBSS measure results in more diverse sets of compounds than do the established 2D and 3D measures; similar results are obtained if all of the nearest neighbours are considered, rather than just the active nearest neighbours as here [64]. This indicates that whilst the FBSS measure generally finds fewer actives, those that it does find are better able to suggest novel structural classes that are additional to the close analogues that are usually retrieved by 2D similarity searches. This finding provides some quantitative support for the suggestion that FBSS measures provide an attractive way of identifying Vol. 41, No. 8, 1998

554


bioisosteres in chemical databases, and a similar conclusion can be drawn from the more extended results reported by Wild and Willett [64]. That said, there are several ways in which the current FBSS algorithm can be extended. For example, thus far, we have considered all three types of field to be of equal importance and we are currently investigating the extent to which improvements in search performance can be achieved by differential weighting of the three types of alignment. We are also studying the effect of conformational flexibility on search effectiveness, by incorporating torsional rotations (as well as rigid-body rotations) in the GA using the procedures described by Thorner et al. [60]. Finally, and most importantly, we need to improve the search efficiency, since our current programs, written in C and running on a Silicon Graphics R10000 workstation under UNIX, require about 30 CPU seconds to match a target structure with each structure in a database when all three fields are used. There is hence a need to develop screening methods, analogous to those used for substructure searching, to enable the GA to be used on a routine basis for similarity searching in large corporate databases, which typically contain some hundreds of thousands of structures. 6. SELECTION OF STRUCTURALLY DIVERSE COMBINATORIAL LIBRARIES The fourth section of this paper introduced the techniques that are now being used to select sets of dissimilar molecules from chemical databases, with the aim of reacting these together in a combinatorial synthesis so that the final products cover a wide range of structural types. It is assumed that if it is possible to identify maximally-diverse (or, more realistically, near maximally-diverse) sets of reactants, then their use will result in the generation of a maximally-diverse combinatorial library of products when the reactants are combined in a combinatorial synthesis. If this assumption is correct, it will permit the full exploration of the potential structural space even though only a relatively small number of compounds are actually synthesized and tested, as we now demonstrate. Consider a combinatorial library, c, that is synthesized from reactants contained in two reactant pools, r1 and r2 , of sizes n 1 and n 2 , respectively (in the following, we consider only dimer libraries for the purpose of simplicity but the analysis can be extended to reactions that involve a greater number of reactants). These two reactant pools have previously been selected as representing diverse subsets of two larger potential-reactant pools, R1 and R2 , of sizes N1 and N2 , respectively, using one of the subsetselection procedures described previously. Let C be the fully enumerated combinatorial library that would have been generated from all possible combinations of R1 and R2 if the subset-selection procedure had not been used. Thus, c and C contain n 1 n 2 and N1 N2 dimers, respectively. The assumption underlying the use of diverse sets of reactants is that the library c will be as diverse as a library obtained by employing the same subset-selection procedure that was used to create the reactant pools r1 and r2 , (i.e. that was used T HE C OMPUTER J OURNAL,

1. Create the N1 N2 products in library C by combining each of the N1 reactants in R1 with each of the N2 reactants in R2 . 2. Create the library L by selecting the n 1 n 2 most diverse products from C. 3. Calculate the diversity D(L). (a) 1. Create r1 by selecting the n 1 most diverse reactants from R1 . 2. Create r2 by selecting the n 2 most diverse reactants from R2 . 3. Create the n 1 n 2 products in library c by combining each of the n 1 reactants in r1 with each of the n 2 reactants in r2 . 4. Calculate the diversity D(c). (b) FIGURE 2. Generation of structurally diverse, n 1 n 2 -molecule libraries by (a) selecting in product space to give the library L and (b) selecting in reactant space to give the library c.

to identify the n 1 most dissimilar molecules in R1 and the n 2 most dissimilar molecules in R2 ) to identify the most dissimilar n 1 n 2 -molecule library from amongst the N1 N2 molecules in C. This subset is referred to subsequently as library L. The two subset-selection procedures are summarized in Figure 2. Let D(X) be a function that returns a value describing the diversity of a set of molecules, X. Then the use of sets of diverse reactants to create diverse libraries assumes that c is comparable in structural diversity to L, i.e. that: D(L) = D(c). The validity of this assumption was challenged by Gillet et al. [28], who took three published combinatorial syntheses, generated libraries by both of the procedures summarized in Figure 2 and then calculated the diversities of the two libraries using the diversity index described by Turner et al. [49]: in all cases, the library L had a diversity that was greater than that of the library c. Thus, the greater effort involved in generating L, which involves the analysis of N1 N2 product molecules as against the analysis of the N1 + N2 reactant molecules required to generate c, results in an increase in the diversity of the final library. However, while L is a library, it is not a combinatorial library in that it contains a maximally diverse set of independent product molecules, rather than a set that can be synthesized using a combinatorial reaction. The synthetic inefficiency that can result from performing selection at the product level is illustrated in Figure 3a, in which a fully enumerated combinatorial library, C, built from two reactant pools is represented by a (9 × 9) matrix. The rows of the matrix represent the N1 reactants available in pool R1 , and the columns of the matrix represent the N2 reactants in pool R2 . The N1 N2 elements of the matrix then represent the full combinatorial library, C, that would result Vol. 41, No. 8, 1998

S IMILARITY


FIGURE 3. A fully enumerated, dimer library (C) represented by a (9 × 9) matrix. (a) The boxed elements represent an example of a subset library, L, that contains the nine most diverse compounds and that was chosen by applying a selection algorithm to C. (b) A n 1 n 2 subset of C that is also a combinatorial library can be selected by intersecting n 1 rows with n 2 columns. (c) Reordering of the rows and columns of the matrix shown in Figure 3b results in the combinatorial library occupying the top left hand corner of the matrix.

from reacting all the reactants in R1 with all the reactants in R2 . In Figure 3a, pool R1 contains the nine reactants labelled x 1 . . . x 9 and pool R2 contains the nine reactants y1 . . . y9 . Assume that we wish to select the nine most diverse compounds from C. Then a selection algorithm can select compounds from anywhere within the matrix: for example, the library, L, might correspond to the boxed elements, as shown. The potential synthetic inefficiency of this approach is highlighted by the fact that thirteen reactants are required to build the nine-member library (viz six reactants—x 3 , x 4 , x 5 , x 6 , x 7 and x 8 —from pool R1 and seven reactants—y1, y2 , y4 , y6 , y7 , y8 and y9 —from pool R2 ), rather than the three from each pool required to build a nine-member subset that is a combinatorial library. A nine-member subset of C that does represent a true combinatorial library can be selected by intersecting three T HE C OMPUTER J OURNAL,

555

rows of the matrix with three columns: for example, a (3 × 3) library built from reactants x 3 , x 6 and x 8 reacted with reactants y2 , y4 and y5 is shown by the boxed elements of the matrix in Figure 3b. For ease of visualisation, assume that the rows and columns of the matrix are reordered so that the boxed elements occupy the top lefthand corner of the matrix, as shown in Figure 3c, and that the diversity of this sub-library is then measured. Finding a maximally diverse combinatorial library is then equivalent to reordering the rows and columns of the matrix and measuring the diversity of all possible sub-libraries that occupy the n 1 n 2 -member, top left-hand corner of the matrix. Exploring all permutations of rows and columns represents an enormous search space even for libraries of moderate size and we have thus developed a GA to generate high-diversity combinatorial libraries. A chromosome in the GA represents one possible combinatorial library. For an n 1 n 2 -member library, a chromosome consists of two parts: the first part represents the n 1 reactants selected from pool R1 (or the rows of the matrix) and the second part represents the n 2 reactants selected from pool R2 (or the columns of the matrix). The fitness function of the GA is applied to each chromosome and involves constructing the n 1 n 2 -member combinatorial library, call this C ∗ , represented by the chromosome and measuring its diversity, D(C ∗ ), using the index of Turner et al. [49]. The GA tries to maximize the diversity, D(C ∗ ) and thus to identify the maximally diverse combinatorial subset of C. One of the chromosomes in the initial population is initialized to the reactant subsets found by performing dissimilarity-based selection on the original reactant pools, thus giving an initial solution with a diversity D(c), and the remaining chromosomes are initialized to random subsets. Full details of the algorithm are presented by Gillet et al. [28]. The operation of the GA is exemplified by its use to generate a combinatorial amide library. The basic synthetic reaction here involves coupling primary amines with carboxylic acids by forming amide bonds, as shown in Figure 4a. Sets of 400 primary amines and 400 carboxylic acids were chosen from the WDI database and represented by Daylight bit-strings. Software was developed to simulate the creation of amide bonds, thus enabling the fully enumerated library, C, containing 160,000 amides, to be constructed by joining all 400 amines to all 400 acids. A diverse subset, L, containing 1600 molecules was selected from C using the dissimilarity-based selection algorithm of Holliday et al. [35]. The same procedure was then used to select 40 diverse amines and 40 diverse acids from the two pools of 400 reactants; the selected acids and amines were used to generate the combinatorial library, c, of 1600 amides. Finally, the GA was applied to C to obtain a diverse combinatorial subset, C ∗ , of 1600 molecules. The same procedure was repeated for the two other combinatorial syntheses shown in Figures 4b and 4c. For each synthesis, the diversity was calculated for the three types of subset, i.e. for L (the library obtained by selecting in product space), for C ∗ (the library obtained Vol. 41, No. 8, 1998

556

V. J. G ILLET et al. product library requiring approximately 20 min for a C program running on a Silicon Graphics R10000 processor. 7. CONCLUSIONS

FIGURE 4. Typical combinatorial synthesis procedures: (a) amide library based on coupling carboxylic acids to primary amines; (b) 3-amino-5-hydroxybenzoic acid library using carboxylic acids and sulphonyl chlorides; and (c) Kemp’s acid library with R1 and R2 both being substituted by carboxylic acids. TABLE 2. Diversity of libraries selected in product space, D(L), in product space subject to the combinatorial constraint, D(C ∗ ), and in reactant space, D(c). The libraries contained 1600 molecules selected from 160,000-molecule fully enumerated libraries. Library

D(L)

D(C ∗ )

D(c)

Amides Benzoic acid Kemp’s acid

0.652 0.408 0.435

0.623 0.396 0.419

0.596 0.388 0.401

The first part of this paper contains an introduction to some of the uses of similarity and dissimilarity techniques in chemical information systems, focusing upon techniques that are capable of processing large structural databases. Since their introduction a decade or so ago, similarity searching and database clustering have become wellestablished components of drug-discovery programmes and the use of such similarity-based techniques can only increase still further with the current, intense interest in combinatorial approaches to drug discovery. This interest is manifested in the development and evaluation both of new methods for selecting compounds and database subsets and of measures of inter-molecular similarity. Both of these areas of activity are exemplified in the second part of the paper, which discusses recent work in Sheffield on field-based similarity searching and on dissimilarity-based compound selection. The current interest in similarity and dissimilarity methods for processing chemical databases has already resulted in many new methods for representing, searching and selecting molecules (see, e.g. [9, 26]) and this has led to comparative studies being carried out to ascertain which methods are the most effective and/or most efficient. Examples of this work thus far include studies of different types of structural representation [21, 68, 69] and of different types of selection method [15, 38]. A limitation of these studies is that they have involved different datasets and there is thus a clear need for the development of standard datasets, containing 2D structure diagrams and measured bioassay data for non-trivial numbers of compounds. These would be analogous to the datasets that have been developed for the TREC [6, 70] and MUC [71] programmes in text retrieval and information extraction, respectively and would be expected to be just as effective in facilitating the comparison of the results obtained by different research groups and, consequently, in progressing the entire field. ACKNOWLEDGEMENTS

by selecting in product space subject to the combinatorial constraint) and for c (the library obtained by selecting in reactant space). The results obtained are shown in Table 2, where it will be seen that the GA consistently identifies a subset of the products that is intermediate in diversity between the libraries c and L. Thus, while the combinatorial subset resulting from our GA is not as diverse as the subset obtained by selecting individual product molecules, it does provide a synthetically efficient solution that can be generated directly from a standard combinatorial synthesis experiment. The algorithm hence represents a significant improvement over existing, reactant-based approaches to compound selection; it is also rapid in execution with the selection of (40×40) reactant pools from a 160,000-member T HE C OMPUTER J OURNAL,

We thank the following: the Biotechnology and Biological Sciences Research Council, GlaxoWellcome and Zeneca Agrochemicals for funding the work described here; Daylight Chemical Information Systems Inc. and Tripos Inc. for software support; Derwent Information for providing the WDI database; and the referees for their comments on an earlier draft of this paper. The Krebs Institute for Biomolecular Research is a designated Biomolecular Sciences Centre of the Biotechnology and Biological Sciences Research Council. REFERENCES [1] Ash, J. E., Warr, W. A. and Willett, P. (eds) (1991) Chemical Structure Systems. Ellis Horwood, Chichester.

Vol. 41, No. 8, 1998

S IMILARITY


[2] Barnard, J. M. (1993) Substructure searching methods: old and new. J. Chem. Inf. Comp. Sci., 33, 532–538. [3] Willett, P. (1995) Searching for pharmacophoric patterns in databases of three-dimensional chemical structures. J. Mol. Recognit., 8, 290–303. [4] Faloutsos, C. (1992) Signature files. In Frakes, W. B. and Baeza-Yates, R. (eds), Information Retrieval. Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ. [5] Martin, Y. C. (1998) Pharmacophore mapping. In Martin, Y. C. and Willett, P. (eds), Designing Bioactive Molecules: Three-Dimensional Techniques and Applications. American Chemical Society, Washington. [6] Sparck Jones, K. and Willett, P. (eds) (1997) Readings in Information Retrieval. Morgan Kaufman, San Francisco. [7] Carhart, R. E., Smith, D. H. and Venkataraghavan, R. (1985) Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comp. Sci., 25, 64– 73. [8] Willett, P., Winterman, V. and Bawden, D. (1986) Implementation of nearest neighbour searching in an online chemical structure search system. J. Chem. Inf. Comp. Sci., 26, 36–41. [9] Downs, G. M. and Willett, P. (1995) Similarity searching in databases of chemical structures. Rev. Comput. Chem., 7, 1– 66. [10] Dean, P. M. (ed.) (1994) Molecular Similarity in Drug Design. Chapman and Hall, Glasgow. [11] Johnson, M. A. and Maggiora, G. M. (1990) Concepts and Applications of Molecular Similarity. Wiley, New York. [12] Everitt, B. S. (1993) Cluster Analysis (3rd edn), Edward Arnold, London. [13] Sneath, P. H. A. and Sokal, R. R. (1973) Numerical Taxonomy. W H Freeman, San Francisco. [14] Willett, P. (1987) Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth. [15] Brown, R. D. and Martin, Y. C. (1996) Use of structureactivity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comp. Sci., 36, 572–584. [16] Adamson, G. W. and Bush, J. A. (1973) A method for the automatic classification of chemical structures. Inf. Stor. Retr., 9, 561–568. [17] Ward, J. H. (1963) Hierarchical grouping to optimize an objective function. J. Amer. Stat. Assoc., 58, 236–244. [18] Jarvis, R. A. and Patrick, E. A. (1973) Clustering using a similarity measure based on shared nearest neighbours. IEEE Trans. Comput., C-22, 1025–1034. [19] Willett, P., Winterman, V. and Bawden, D. (1986) Implementation of non-hierarchic cluster analysis methods in chemical information systems: selection of compounds for biological testing and clustering of substructure search output. J. Chem. Inf. Comp. Sci., 26, 109–118. [20] Doman, T. N., Cibulskis, J. M., Cibulskis, M. J., McCray, P. D. and Spangler, D. P. (1996) Algorithm5: a technique for fuzzy similarity clustering of chemical inventories. J. Chem. Inf. Comp. Sci., 36, 1195–1204. [21] Brown, R. D. and Martin, Y. C. (1997) The information content of 2D and 3D structural descriptors relevant to ligandreceptor binding. J. Chem. Inf. Comp. Sci., 37, 1–9. [22] Downs, G. M., Willett, P. and Fisanick, W. (1994) Similarity searching and clustering of chemical-structure databases


[23] [24]

[25]

[26]

[27]

[28]

[29]

[30] [31]

[32] [33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

557

using molecular property data. J. Chem. Inf. Comp. Sci., 34, 1094–1102. Murtagh, F. (1985) Multidimensional Clustering Algorithms. Physica Verlag, Vienna. Chaiken, I. M. and Janda, K. D. (eds) (1996) Molecular Diversity and Combinatorial Chemistry. Libraries and Drug Discovery. American Chemical Society, Washington. DeWitt, S. H. and Czarnik, A. W. (eds) (1997) A Practical Guide to Combinatorial Chemistry. American Chemical Society, Washington. Perspectives in Drug Discovery and Design (1997) Special volume on molecular diversity. Perspectives in Drug Discovery and Design, 7/8, 1–180. Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C., Wong, A. K. and Moos, W. H. (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J. Med. Chem., 38, 1431–1436. Gillet, V. J., Willett, P. and Bradshaw, J. (1997) The effectiveness of reactant pools for generating structurally diverse combinatorial libraries. J. Chem. Inf. Comp. Sci., 37, 731–740. Shemetulskis, N. E., Dunbar, J. B., Dunbar, B. W., Moreland, D. W. and Humblet, C. (1995) Enhancing the diversity of a corporate database using chemical database clustering and analysis. J. Comp.-Aid. Mol. Design, 9, 407–416. Späth, H. (1986) Anti-clustering: maximising the variance criterion. Control Cybernet., 15, 213–218. Mirkin, B. and Muchnik, I. (1996) Clustering and multidimensional scaling in Russia (1960–1990): a review. In Arabie, P., Hubert, L. J. and De Soete, G. (eds), Clustering and Classification. World Scientific, Singapore. Kennard, R. W. and Stone, L. A. (1969) Computer aided design of experiments. Technomet., 11, 137–148. Bawden, D. (1993) Molecular dissimilarity in chemical information systems. In Warr, W. A. (ed.), Chemical Structures 2. The International Language of Chemistry. Springer, Heidelberg. Lajiness, M. (1990) Molecular similarity-based methods for selecting compounds for screening. In Rouvray, D. H. (ed.), Computational Chemical Graph Theory. Nova Science Publishers, New York. Holliday, J. D., Ranade, S. S. and Willett, P. (1995) A fast algorithm for selecting sets of dissimilar structures from large chemical databases. Quant. Struct.-Activ. Relat., 14, 501–506. Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J. and Osman, J. (1996) Parameter based methods for compound selection from chemical databases. Quant. Struct.-Activ. Relat., 15, 285–289. Nilakantan, R., Bauman, N. and Haraki, K. S. (1997) Database diversity assessment: new ideas, concepts and tools. J. Comp.-Aid. Mol. Design, 11, 447–452. Snarey, M., Terret, N. K., Willett, P. and Wilton, D. J. (1998) Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graph. Model., 15, 372–385. Taylor, R. (1995) Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals. J. Chem. Inf. Comp. Sci., 35, 59–67. Mason, J. S., McLay, I. M. and Lewis, R. A. (1994) Applications of computer-aided drug design techniques to lead generation. In Dean, P. M., Jolles, G. and Newton, C. G. (eds), New Perspectives in Drug Design. Academic Press, London.

Vol. 41, No. 8, 1998

558


[41] Lewis, R. A. and Leach, A. R. (1994) Current methods for site-directed structure generation. J. Comp.-Aid. Mol. Design, 8, 467–475. [42] Cummins, D. J., Andrews, C. W., Bentley, J. A. and Cory, M. (1996) Molecular diversity in chemical databases: comparison of medicinal chemistry knowledge bases and database of commercially available compounds. J. Chem. Inf. Comp. Sci., 36, 750–763. [43] Pearlman, R. S. (1996) Novel software tools for addressing chemical diversity. Accessible via WWW at URL http://www.awod.com/netsci/Issues/Jun96/feature1.html. [44] Pickett, S. D., Mason, J. S. and McLay, I. M. (1996) Diversity profiling and design using 3D pharmacophores: pharmacophore-derived queries (PDQ). J. Chem. Inf. Comp. Sci., 36, 1214–1223. [45] Brown, R. D. and Martin, Y. C. (1997) Designing combinatorial library mixtures using a genetic algorithm. J. Med. Chem., 40, 2304–2313. [46] Hassan, M., Bielawski, J. P., Hempel, J. C. and Waldman, M. (1996) Optimization and visualization of molecular diversity of combinatorial libraries. Mol. Diversity, 2, 64–74. [47] Boyd, S. M., Beverley, M., Norskov, L. and Hubbard, R. E. (1995) Characterizing the geometric diversity of functionalgroups in chemical databases. J. Comp.-Aid. Mol. Design, 9, 417–424. [48] Bone, R. G. A. and Villar, H. O. (1997) Exhaustive enumeration of molecular substructures. J. Comp. Chem., 18, 86–107. [49] Turner, D. B., Tyrrell, S. M. and Willett, P. (1997) Rapid quantification of molecular diversity for selective database acquisition. J. Chem. Inf. Comp. Sci., 37, 18–22. [50] Kubinyi, H. (ed.) (1993) 3D QSAR in Drug Design. ESCOM: Leiden. [51] Carbó, R., Leyda, L. and Arnau, M. (1980) How similar is a molecule to another? An electron density measure of similarity between two molecular structures. Int. J. Quant. Chem., 17, 1185–1189. [52] Good, A. C., Hodgkin, E. E. and Richards, W. G. (1992) The utilisation of Gaussian functions for the rapid evaluation of molecular similarity. J. Chem. Inf. Comp. Sci., 32, 188–191. [53] Good, A. C. and Richards, W. G. (1993) Rapid evaluation of shape similarity using Gaussian functions. J. Chem. Inf. Comp. Sci., 33, 112–116. [54] Manaut, F., Sanz, F., Jose, J. and Milesi, M. (1991) Automatic search for maximum similarity between molecular electrostatic potential distributions. J. Comp.-Aid. Mol. Design, 5, 371–380. [55] Petke, J. D. (1993) Cumulative and discrete similarity analysis of electrostatic potentials and fields. J. Comp. Chem., 14, 928–933. [56] Richard, A. M. (1991) Quantitative comparison of molecular electrostatic potentials for structure-activity studies. J. Comp. Chem., 12, 959–969. [57] Turner, D. B., Willett, P., Ferguson, A. and Heritage, T. W. (1995) Similarity searching in files of three-dimensional chemical structures: evaluation of similarity coefficients and


[58]

[59]

[60]

[61] [62]

[63]

[64]

[65]

[66]

[67]

[68]

[69]

[70]

[71]

standardisation methods for field-based similarity searching. SAR QSAR Environ. Res., 3, 101–130. Thorner, D. A., Willett, P., Wright, P. M. and Taylor, R. (1997) Similarity searching in files of three-dimensional chemical structures: representation and searching of molecular electrostatic potentials using field-graphs. J. Comp.-Aid. Mol. Design, 11, 163–174. Wild, D. J. and Willett, P. (1996) Similarity searching in files of three-dimensional chemical structures: alignment of molecular electrostatic potentials with a genetic algorithm. J. Chem. Inf. Comp. Sci., 36, 159–167. Thorner, D. A., Wild, D. J., Willett, P. and Wright, P. M. (1996) Similarity searching in files of three-dimensional chemical structures: flexible field-based searching of molecular electrostatic potentials. J. Chem. Inf. Comp. Sci., 36, 900–908. Lipinski, C. A. (1986) Bioisosterism in drug design. Ann. Reports Med. Chem., 21, 283–291. Gaillard, P., Carrupt, P., Testa, B. and Boudon, A. (1994) Molecular lipophilicity potential, a tool in 3D QSAR: method and applications. J. Comp.-Aid. Mol. Design, 8, 83–96. Croizet, F., Dubost, J. P., Langlois, M. H. and Audrey, E. (1991) A fragmental code usable for automatic attribution of lipophilicity atomic constants. Quant. Struct.-Activ. Relat., 10, 211–215. Wild, D. J. and Willett, P. (1997) Field-based similarity searching in databases of three-dimensional chemical structures. In Collier, H. (ed.), Proc. 1997 Int. Chemical Information Conf. Infonortics, Calne, UK. Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T. and Sheridan, R. P. (1996) Chemical similarity using physicochemical property descriptors. J. Chem. Inf. Comp. Sci., 36, 118–127. Pepperrell, C. A., Willett, P. and Taylor, R. (1990) Implementation and use of an atom-mapping procedure for similarity searching in databases of 3-D chemical structures. Tetrahed. Comp. Methodol., 3, 575–593. Briem, H. and Kuntz, I. D. (1996) Molecular similarity based on DOCK-generated fingerprints. J. Med. Chem., 39, 3401– 3408. Matter, H. (1997) Selecting optimally diverse compounds from structure databases: a validation study of twodimensional and three-dimensional molecular descriptors. J. Med. Chem., 40, 1219–1229. Patterson, D. E., Cramer, R. D., Ferguson, A. M., Clark, R. D. and Weinberger, L. E. (1996) Neighbourhood behaviour: a useful concept for validation of ‘molecular diversity’ descriptors. J. Med. Chem., 39, 3049–3059. Harman, D. K. (ed) (1993) The First Text REtrieval Conference (TREC-1). Special publication 500-207. National Institute of Standards and Technology, Gaithersburg, MD, and subsequent volumes in this series. Chinchor, N., Hirschman, L. and Lewis, D. D. (1993) Evaluating message understanding systems: an analysis of the third message understanding conference (MUC-3). Comput. Linguistics, 19, 409–449.

Vol. 41, No. 8, 1998