Automatically Detecting Workflows in PubChem
Journal of Biomolecular Screening 17(8) 1071–1079 © 2012 Society for Laboratory Automation and Screening DOI: 10.1177/1087057112449054 http://jbx.sagepub.com
Bradley T. Calhoun1, Michael R. Browning1, Brian R. Chen1, Joshua A. Bittker2, and S. Joshua Swamidass1, 2
Abstract

Public databases that store the data from small-molecule screens are a rich and untapped resource of chemical and biological information. However, screening databases are unorganized, which makes interpreting their data difficult. We propose a method of inferring workflow graphs—which encode the relationships between assays in screening projects—directly from screening data and using these workflows to organize each project’s data. On the basis of four heuristics regarding the organization of screening projects, we designed an algorithm that extracts a project’s workflow graph from screening data. Where possible, the algorithm is evaluated by comparing each project’s inferred workflow to its documentation. In the majority of cases, there are no discrepancies between the two. Most errors can be traced to points in the project where screeners chose additional molecules to test based on structural similarity to promising molecules, a case our algorithm is not yet capable of handling. Nonetheless, these workflows accurately organize most of the data and also provide a method of visualizing a screening project. This method is robust enough to build a workflow-oriented front-end to PubChem and is currently being used regularly by both our lab and our collaborators. A Python implementation of the algorithm is available online, and a searchable database of all PubChem workflows is available at http://swami.wustl.edu/flow.

Keywords: chemoinformatics, statistical analyses, automation or robotics, compound repositories
Introduction

In recent years, the data from hundreds of high-throughput screening (HTS) projects—the type of data once locked away in proprietary industry databases—have become publicly available through the National Institutes of Health’s (NIH’s) screen repository, PubChem.1,2 Each project characterizes large libraries of molecules in several assays designed to identify molecules with useful biological properties. Every month, more projects are deposited.

Public screening databases are a rich and untapped resource of chemical and biological information. Many groups are developing tools and methods capable of discovering new knowledge from HTS data.3,4 Recently, for example, one group successfully used HTS data from PubChem to predict adverse drug reactions.5 Similar studies that mine HTS data to address important questions will hopefully become more commonplace as PubChem grows and scientists better understand its potential.

However, screening repositories are unorganized, complicating the mining of their data. The exact protocols used
to run each assay are often unclear, and the reason that certain molecules are tested in each assay is usually obscured. Even when assay protocols are documented, critical details are stored in free text rather than in a standardized vocabulary or computer-parsable format. Furthermore, it is often unclear and incompletely documented how assays within an HTS project relate to one another. HTS workflows appropriately organize assays within a project. At a high level, screening workflows are often conceptualized as multistage funnels, where, at each stage, an

1 Washington University School of Medicine, St. Louis, MO, USA
2 Broad Institute of Harvard and MIT, Cambridge, MA, USA

Received Mar 19, 2012, and in revised form Apr 17, 2012. Accepted for publication Apr 29, 2012.

Corresponding Author: S. Joshua Swamidass, Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine and Chemical Biology/Novel Therapeutics, Broad Institute of Harvard and MIT, 660 S. Euclid, Box 8118, St. Louis, MO 63108, USA. Email: [email protected]
Table 1. The Quorum Sensing Project’s Assays (Project 2 in Table 2)

Assay ID  Title
2094      Plate Read MB Primary HTS to Identify Modulators of the AI-2 Quorum Sensing System
2106      Broad Institute MLPCN Quorum Sensing Project
2723      Absorbance Microorganism Dose Retest to Identify Inhibitors of Vibrio harveyi
2724      Luminescence MB Dose Retest to Identify Modulators of the AI-2 Quorum Sensing System
2725      Absorbance MB Dose Retest to Identify Inhibitors of Vibrio harveyi
2726      Absorbance MB Retest to Identify Inhibitors of Vibrio harveyi
2727      Luminescence Microorganism Retest to Identify Inhibitors of the AI-2 Quorum Sensing System
2728      Absorbance MB Retest to Identify Inhibitors of Vibrio harveyi
2735      Luminescence Microorganism Dose Retest to Identify Inhibitors of the AI-2 Quorum Sensing System
2736      Luminescence MB Retest to Identify Modulators of the AI-2 Quorum Sensing System
434943    Absorbance Microorganism Dose Response to Identify Inhibitors of Vibrio harveyi
434944    Luminescence Cell-Free Homogeneous Dose Response to Identify Inhibitors of Lux-S
This table shows the titles of several assays, all of which are in the same project (for brevity, “Microorganism-Based” is abbreviated MB), displayed as they would be presented in results from a PubChem BioAssay search. The titles of the screens are not informative, and it is unclear how these assays are related to one another. Likewise, the text descriptions associated with the project do not clarify the relationships between the assays.
experiment filters out uninteresting molecules and forwards the rest on to the next stage. In some cases, it is possible to track down the details of an HTS workflow by reading the free text descriptions scattered throughout the project’s documentation, but this is a labor-intensive process. Moreover, the documented workflow does not always match the deposited data. This problem is exacerbated by the lack of an ontology and controlled vocabulary for describing assays and workflows, as well as by the lack of tools for displaying and validating workflows. Consequently, when users search for assays in PubChem and other databases such as ChEMBL, they receive an unorganized soup of experiments without any guiding structure by which to understand them (Table 1). To address these problems, we propose organizing screening data by project—rather than by assay—and using the project’s workflow as a guide to understanding and organizing assay data. As a first step in support of this proposal, we present an algorithm that groups assays into projects and then detects the relationships between a project’s assays.
Materials and Methods

Data

We include all the assays deposited in PubChem that are not mirrored from ChEMBL. Each assay is identified by an assay ID (AID). Each molecule is identified by its compound ID (CID) rather than its substance IDs (SIDs)—unless no CID is given. Each CID corresponds to a unique chemical structure and is usually associated with several SIDs, each of which corresponds to this CID in a different batch. There are 1 173 820 unique molecules in 4030 assays. On average, each assay contains 39 626 molecules, and each molecule appears in 125 assays.
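This bookkeeping can be sketched as follows. The record layout (`(aid, sid, cid)` tuples) is our own simplification for illustration, not PubChem’s actual download format.

```python
# Sketch: normalize records so each assay maps to a set of molecules.
# SIDs collapse to their CID when one is given (a CID is a unique chemical
# structure); otherwise the SID itself stands in for the molecule.
from collections import defaultdict

def assay_molecule_sets(records):
    """records: iterable of (aid, sid, cid_or_None) tuples."""
    assays = defaultdict(set)
    for aid, sid, cid in records:
        key = ("CID", cid) if cid is not None else ("SID", sid)
        assays[aid].add(key)
    return dict(assays)

records = [
    (2094, 842101, 3333),  # same structure tested as two batches...
    (2094, 842102, 3333),  # ...collapses to one CID
    (2723, 842103, None),  # no CID given: keep the SID
]
sets_ = assay_molecule_sets(records)
```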
Workflow Graphs

Our method is based on the concept of a workflow graph. A workflow graph is a weighted, directed graph, in which each node corresponds to a bundle of assays run in parallel on the same set of molecules and each directed edge traces how a subset of molecules moves through a project from one bundle of assays to the next; the importance of each edge is quantified by its weight. Within this model, four heuristics usually apply:

1. Molecules move from broader experiments (those executed on large numbers of molecules) to narrower experiments (those executed on fewer molecules).
2. Bundles with a preponderance of unique molecules are primary screens and have only outgoing edges.
3. A workflow should contain the minimum number of edges necessary to explain the path of molecules through the project.
4. Summary data sets that list the probes ultimately discovered in a project are leaf nodes with only incoming edges.

Once the workflow graph is determined, its structure exposes critical details about an HTS project.
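As a concrete illustration, a minimal workflow-graph structure with a check for the first heuristic might look like this (the class and names are ours, not the paper’s released code):

```python
# Sketch of the workflow-graph data structure: nodes are assay bundles with
# their molecule sets, and each directed edge carries a weight counting the
# molecules it moves from one bundle to the next.
class WorkflowGraph:
    def __init__(self):
        self.molecules = {}  # bundle name -> set of molecule IDs
        self.edges = {}      # (src, dst) -> weight

    def add_bundle(self, name, mols):
        self.molecules[name] = set(mols)

    def add_edge(self, src, dst, weight):
        self.edges[(src, dst)] = weight

    def respects_funnel(self):
        # Heuristic 1: molecules flow from broader to narrower bundles.
        return all(len(self.molecules[s]) > len(self.molecules[d])
                   for s, d in self.edges)

g = WorkflowGraph()
g.add_bundle("B", range(1000))  # primary screen on the full library
g.add_bundle("C", range(100))   # confirmatory retest of the hits
g.add_edge("B", "C", 100)
```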
Grouping Assays into Projects

We use three pieces of information to group assays into projects: (1) depositor-specified links, (2) text-based similarity, and (3) grant numbers parsed from the descriptions of each assay. Of these methods, text-based similarity is the most general, whereas the others rely on data only contained in specific repositories such as PubChem. As more public funding agencies begin to require deposition of results in public databases, this type of supplementary linking information, such as grant number or data source, should become more available.

Scientists who deposit the data may specify links between assays in PubChem. They are encouraged to link to assays in
Figure 1. In a hypothetical example with four assay bundles (A, B, C, and D), where n(A) > n(B) > n(C) > n(D), the first graph (left) shows the bundles and all candidate edges. The middle panels show incoming edges for C (top) and D (bottom). The table in each panel displays the number of molecules in each combination of bundles considered by the algorithm. In the top panel, the algorithm first identifies B and then A as Imin. In the bottom panel, the algorithm identifies C as Imin and then terminates. The panel for B’s incoming connections is trivial and not shown. The right graph shows the final workflow graph detected by the algorithm.
the same project but are, unfortunately, also permitted to link to assays in different projects. Nonetheless, these links are the best starting point for detecting projects in PubChem. In particular, the PubChem summary assays indicate all associated assays in a screening campaign, although they provide the assays in a date-ordered list format rather than a logical progression of the screening funnel. In some cases, text similarity can form appropriate links between assays that are not explicitly specified.6 Assays from the same project often have nearly identical text in their descriptions because it is common for depositors to cut and paste descriptions between related assays. We link assays together when their text similarity—as computed using TF-IDF weighting—is 100%. Both depositor-specified connections and text similarity can inappropriately link assays from different projects, but sometimes these spurious links can be pruned by considering the grant numbers of each assay. To accomplish this, we link all assays with the same grant number and enforce that no project can include any assays with different grant numbers. We group clusters of linked assays together into projects using these constraints and heuristics. Further analysis proceeds on a project-by-project basis.
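A minimal sketch of this grouping step, under simplifying assumptions: exact description equality stands in for a 100% TF-IDF similarity, and the project-level grant constraint is approximated by pruning individual conflicting links rather than enforced globally.

```python
# Sketch (not the released implementation): cluster assays into projects
# with union-find over depositor links, identical descriptions, and shared
# grant numbers, discarding links that join different grant numbers.
from collections import defaultdict

def group_projects(descriptions, links, grants):
    """descriptions: aid -> text; links: depositor (aid, aid) pairs;
    grants: aid -> grant number or None."""
    # Candidate links: depositor-specified plus identical description text.
    by_text = defaultdict(list)
    for aid, text in descriptions.items():
        by_text[text].append(aid)
    candidates = list(links)
    for aids in by_text.values():
        candidates.extend(zip(aids, aids[1:]))

    parent = {aid: aid for aid in descriptions}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Link all assays sharing a grant number.
    by_grant = defaultdict(list)
    for aid, g in grants.items():
        if g is not None:
            by_grant[g].append(aid)
    for aids in by_grant.values():
        for a, b in zip(aids, aids[1:]):
            parent[find(a)] = find(b)

    # Apply remaining links, pruning those with conflicting grant numbers.
    for a, b in candidates:
        ga, gb = grants.get(a), grants.get(b)
        if ga is not None and gb is not None and ga != gb:
            continue  # spurious cross-project link
        parent[find(a)] = find(b)

    projects = defaultdict(set)
    for aid in descriptions:
        projects[find(aid)].add(aid)
    return list(projects.values())
```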
Detecting Workflow Graphs

In the first pass through a project’s data, assays executed on the same set of molecules are grouped together into an assay “bundle” and said to be “run in parallel.” At this point, the project is defined by several distinct sets of molecules, each
one corresponding to either an assay or bundle of assays. For brevity, we refer to both cases as “bundles” in the following discussion. We propose here a workflow detection algorithm that infers the relationships between these bundles.

At a high level, the workflow detection algorithm first identifies candidate edges according to a set of rules. Next, each node in the workflow graph is considered in turn, and a subset of the candidate edges is included in the workflow graph; the remaining candidate edges are discarded.

Candidate edges are identified as follows (Fig. 1):

1. Let u(A) be the number of molecules unique to bundle A, let n(A) be the total number of molecules in bundle A, and let n(A ∩ B) be the number of molecules tested in both bundles A and B.
2. Add a candidate edge A → B if both n(A) > n(B) and n(A ∩ B) > 0.
3. For all bundles A, where there is no other bundle B such that u(A) < n(A ∩ B), remove all incoming edges to A.
4. Flip the direction of edges such that summary bundles only have incoming edges.

In the next step, the minimum set of incoming candidate edges is selected for each node in turn.

1. Let N be the node under consideration and {I1, . . ., Ik} the nodes with edges pointing toward N.
2. For all possible combinations of {N, I1, . . ., Ik} that include N and at least one other bundle (2^k − 1 possibilities), compute the number of molecules
tested in exactly this combination of bundles. To be clear, membership in bundles not in {N, I1, . . ., Ik} is ignored.
3. Remove all combinations from consideration that include zero molecules.
4. For the combination representing the largest number of molecules, let Imin be the bundle, other than N, with the fewest molecules in this combination. Promote the edge Imin → N to inclusion in the workflow graph. Let the weight of this edge equal the total number of molecules in any combination still under consideration that contains Imin.
5. Remove all combinations from consideration that contain Imin.
6. While combinations remain, return to step 4 to pick a new Imin and promote another candidate edge.
7. While nodes remain unconsidered, pick the next node N and return to step 1.

At the termination of the algorithm, the final workflow graph hypothesizes the path of all molecules through the HTS project. A Python implementation of this algorithm is available at http://swami.wustl.edu/flowpaper.
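The two stages above can be sketched in Python as follows. This is our own rendering of the published steps, not the released implementation; rule 4 (flipping edges into summary bundles) is left to the caller.

```python
# Sketch of the workflow detection algorithm: candidate-edge generation
# followed by greedy Imin selection of incoming edges for each node.
from itertools import combinations

def candidate_edges(bundles):
    """bundles: name -> set of molecule IDs. Returns directed (src, dst) pairs."""
    names = list(bundles)
    edges = set()
    for a in names:
        for b in names:
            # Rule 2: n(A) > n(B) and a nonempty intersection.
            if len(bundles[a]) > len(bundles[b]) and bundles[a] & bundles[b]:
                edges.add((a, b))
    # Rule 3: a bundle whose unique-molecule count dominates every overlap
    # is treated as a primary screen and keeps only outgoing edges.
    for a in names:
        others = set().union(*(bundles[b] for b in names if b != a))
        unique = len(bundles[a] - others)
        if all(unique >= len(bundles[a] & bundles[b]) for b in names if b != a):
            edges = {(s, d) for (s, d) in edges if d != a}
    return edges

def select_edges(bundles, cand):
    """Greedy Imin selection (steps 1-7 above)."""
    final = {}
    for n in bundles:
        inc = [s for (s, d) in cand if d == n]
        # Count molecules tested in exactly N plus each subset of incoming
        # bundles; membership outside {N} and the incoming set is ignored.
        combos = {}
        for r in range(1, len(inc) + 1):
            for subset in combinations(inc, r):
                mols = set(bundles[n])
                for s in subset:
                    mols &= bundles[s]
                for s in inc:
                    if s not in subset:
                        mols -= bundles[s]
                if mols:
                    combos[subset] = len(mols)
        while combos:
            biggest = max(combos, key=combos.get)
            imin = min(biggest, key=lambda b: len(bundles[b]))
            # Weight: molecules in every remaining combination containing Imin.
            final[(imin, n)] = sum(c for sub, c in combos.items() if imin in sub)
            combos = {sub: c for sub, c in combos.items() if imin not in sub}
    return final
```

On a simple three-stage funnel (a primary screen of 100 molecules, a retest of 20, and a dose-response on 5), the algorithm recovers the chain A → B → C with weights 20 and 5.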
Results

Grouping Assays into Projects

We use a combination of three sources of information to group assays into projects. This divides PubChem into 823 projects with, on average, 5 assays per project (Fig. 2). More recent projects tend to have more assays. This reflects various developments in the probe discovery and HTS fields, such as users becoming more familiar with PubChem and more NIH grants requiring public data release through public databases such as PubChem. In addition, the results of government-funded probe discovery efforts such as the Molecular Libraries Initiative are being evaluated according to more stringent criteria to ensure broader applicability of tool compounds to the scientific community.7 To address this, recent probe discovery projects include extensive physical characterization and specificity assays to better define the properties of proposed probes.

Quantitatively assessing the accuracy of our assay grouping is challenging because there is no automated way of identifying true errors. Qualitatively, visual inspection of the projects by our team confirms, in many cases, that our assay connections are reasonable. Projects are usually composed of assays that are clearly related. In particular, simple projects containing only two bundles are generally correct and represent dozens of successfully executed workflow mappings.

We did find some grouping errors. In one case, the depositors did not link their primary screen (AID 686) to its
Figure 2. Project size distribution. About half of PubChem projects contain only one assay, but the rest can contain a dozen or more assays each. More recent projects tend to include more assays as depositing confirmatory screens and follow-up experiments become more common.
confirmatory assay (AID 691), there were no grant numbers in their descriptions, and their text similarity was not sufficiently strong to form a link. In another case, one very large project, with several dozen assays, was formed because a depositor linked the screen (AID 463074) to a counterscreen from another project (AID 2712). This inappropriately created one large conglomerate project out of two distinct HTS campaigns. A similar error can arise when multiple isoforms of a target are the subject of separate HTS campaigns. The depositor may use similar text for the related targets, resulting in a bundling of the projects when in reality they should be distinct. In another case, several screens from the same project deposited from ChemBank were not linked in PubChem. Sometimes, text similarity is enough to link related ChemBank assays together, but often their projects are not formed appropriately. In another case, several assays (AIDs 2423, 2327, 2400, 2388, and 2387) were annotated in PubChem with the wrong grant number and, therefore, separated from the other assays in their project (Project 12 in Table 2). These errors seem to be uncommon, and, in some cases, we can envision methods of correcting them with better parsing algorithms. For example, the unlinked ChemBank assays contain a set of well-defined HTML links to project pages at ChemBank’s Web site. These links could be used to define connections between PubChem assays. In other cases, errors will be difficult to fix algorithmically because they are a consequence of flawed annotations in the PubChem data. Therefore, we included a mechanism in our analytic pipeline for manually adding and removing links between assays to fix errors we find.
Table 2. A Table Detailing Differences between Documented and Algorithmically Generated Workflows for a Set of 12 Screening Projects Executed at the Broad Institute

Project  Name                                           Assays  Bundles
1        Maternal gene expression                       8       5
2        Quorum sensing                                 12      4
3        Glycogen synthase kinase 3α and 3β inhibitors  13      8
4        α-Synuclein 5′UTR activators/inhibitors        27      13
5        β Cell apoptosis inhibitors                    18      6
6        Breast cancer stem cell inhibitors             4       3
7        Hif activators                                 5       3
8        Platelet activation inhibitors                 20      15
9        Streptokinase expression inhibitor             12      6
10       Trypanosoma cruzi inhibitors                   11      7
11       A1 apoptosis inhibitors                        8       5
12       Antifungal                                     28      9

Discrepancies: Extra edges from B to C; Missing edges from D to H and H to F; Extra edge from D to G and G to H; Missing edges from H to J, J to N, and G to K; Extra edges from A to C, C to F, and E to F; Missing edges from C to E, E to D, and D to F; Extra edges from A to E, F to G, and D to E; Missing edges from C to E, F to D, and D to G; Extra edge from B to H; Bundles E, F, G, and J uncertain in documentation.
In all projects except for Projects 3 and 10, all extra edges are correctly inferred by our algorithm but missing from the documentation. The accuracy of these edges was confirmed by the screeners. In these cases, the inferred workflows are more accurate than the screeners’ documentation. Likewise, except for Projects 3 and 10, all missing edges are a consequence of the screeners choosing molecules by structural similarity to hits for structure-activity relationship analysis. This is a specific failure case that our method is not currently designed to handle. The workflows from Projects 3 and 10 are described in more detail elsewhere (Figures 5 and 7). Note that in Projects 3 and 12, several assays missing from our workflow were found to be submitted with incorrect grant numbers and thus were incorrectly grouped together. This table evaluates the workflow generated when correctly grouping Projects 3 and 12’s assays. All workflows can be found in the supplementary information.
Workflow Detection

To evaluate the workflows detected by our method, we contacted the Broad Institute—one of the NIH’s Molecular Libraries Probe Centers Network (MLPCN) comprehensive screening centers—to obtain the internal documentation for several HTS projects. They responded with the documentation for 12 projects, and we compared the documentation they provided with the inferred workflows generated by our method (available at http://swami.wustl.edu/flowpaper). Most of the time, inferred workflows closely matched the documentation (Table 2, Figs. 3 and 4). Discrepancies were discussed with the screeners to determine failure cases. Three types of discrepancies arose.

First, the molecule counts for our bundles and those in the documentation are always slightly different. The documentation reported the exact number of molecules tested in the HTS experiment, whereas our system reports only the number of unique molecules. Occasionally, several samples tested in a screen have the same molecular structure and, therefore, the same CID but different sources. It is, of course, a matter of preference how to handle this bookkeeping detail, but it explains these minor discrepancies.

Second, a small number of edges were missing in five projects. In most cases, these missing edges involved an
assay run on previously untested molecules that were selected for their structural similarity to molecules active in another screen. These screens are run to establish structure-activity relationships (SARs) around the hits the screeners hope to chemically optimize. Our algorithm cannot yet make these edges because it does not consider chemical structures, so this is not surprising. We can imagine modifications that might work in these cases but have not yet implemented them.

Third, a few additional, undocumented edges were included in five projects. In most cases, the original screeners confirmed that our algorithm is correct and the documentation is wrong. This is understandable; HTS projects are complicated, and occasionally components of the documentation are overlooked. In two projects, however, one or two edges were erroneously inferred (Figs. 5 and 7). Of 12 projects, our system inferred only three clearly incorrect edges that were not a consequence of screens run on molecules chosen for SAR.
Project-Oriented View of PubChem

We created a front-end to PubChem that organizes screening data into the workflows detected by our algorithm (http://swami.wustl.edu/flow). This front-end allows
Figure 3. Maternal gene expression (Project 1 in Table 2). In this example, the left graph shows the depositor-specified connections of a group of related assays in PubChem. Each node is labeled with its AID and the bundle to which it is eventually assigned. The middle graph shows the bundles connected by all candidate edges. The right graph is the final workflow computed by our algorithm. In this case, A is the summary screen with zero molecules in it, so it is excluded; B is the primary screen on the full library of molecules; C and D are, respectively, the first and second batches of dose-response confirmatory experiments; and E is an additional confirmatory test where molecules were obtained from a commercial vendor in powder form instead of being sourced from existing high-throughput screening plates. The accuracy of this workflow was confirmed by the original depositor.
searching by keyword, assay ID, MeSH term, date, and project size. Rather than returning a list of matching assays in response to a query, however, it returns a list of projects and their workflow graphs. Most of the critical information in a project is readily accessible through the workflow graph. The graph itself is laid out using GraphViz.8 Stronger connections are denoted with darker lines. Thicker borders around bundles denote those with more assays, and each bundle’s width is proportional to the logarithm of the number of molecules tested. Clicking on a bundle highlights its corresponding assays, and clicking on an assay displays its depositor-contributed description and protocol. Links to related PubMed articles and links to the PubChem assay pages are also included.

Although it is only a prototype, this tool provides a useful—but entirely different—view of PubChem, complementary to other available tools. We now use it daily to analyze PubChem projects and communicate with collaborators. We, therefore, automatically update it on a regular basis as new information is deposited into PubChem, and we hope it will be a useful resource to others. We also intend to improve the search index to allow queries by gene, active compound, and other important features. These enhancements, however, are beyond the scope of this article, which is meant to focus specifically on the workflow detection algorithm and demonstrate the feasibility of applying it broadly across a large HTS repository.
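The layout rules described here can be sketched as a small DOT generator (hypothetical code, not the site’s actual renderer): edge darkness scales with weight, and node width with the logarithm of the molecule count.

```python
# Sketch: emit a GraphViz DOT description of a workflow graph.
import math

def to_dot(node_sizes, edges):
    """node_sizes: bundle -> molecule count; edges: (src, dst) -> weight."""
    wmax = max(edges.values(), default=1)
    lines = ["digraph workflow {"]
    for name, n in node_sizes.items():
        # Node width grows with the logarithm of the molecule count.
        width = 0.5 + 0.3 * math.log10(max(n, 1))
        lines.append(f'  "{name}" [shape=box, width={width:.2f}];')
    for (src, dst), w in edges.items():
        # Heavier edges are drawn darker (gray0 = black, gray80 = light).
        shade = int(80 * (1 - w / wmax))
        lines.append(f'  "{src}" -> "{dst}" [color="gray{shade}"];')
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be rendered with the `dot` command-line tool.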
Figure 4. Quorum sensing (Project 2 in Table 2). In this example, the top graph shows the depositor-specified connections of a group of related assays in PubChem. Each node is labeled with its AID and the bundle to which it is eventually assigned. The lower left graph shows the bundles connected by all candidate edges. The lower right graph shows the final workflow graph computed by the algorithm. In this case, A is the summary screen with zero molecules in it, so it is excluded; B is the primary screen on the full library of molecules; C is a bundle of single-dose retests to confirm activity; and D is the set of dose-response confirmatory tests to characterize seven molecules that yielded discordant results in B and C.
Discussion

In this study, we have introduced a heuristic-based algorithm that accurately infers components of a screening project’s workflow by looking at which molecules are tested in each assay. Usually, the workflows inferred by our method are very close to those documented by screeners, and sometimes they reveal undocumented portions of project workflows. Our key finding is that screening workflows are an organizing concept that makes projects understandable and that they leave a strong, detectable signal in the project’s data. To the best of our knowledge, this is the first algorithm capable of organizing assays in a project into a coherent view. We see two key paths forward in refining our algorithm.
Figure 5. Glycogen synthase kinase 3α and 3β (Project 3 in Table 2). This workflow has an extra, incorrect connection from B to C; both of these bundles are primary screens executed on the National Institutes of Health (NIH) screening library. They are not bundled together into a single node because they were executed on different dates, and extra molecules were added to the NIH library in the intervening time. H is disconnected because its molecules were selected for structure-activity relationships, but the rest of the workflow is correct.
First, it should be possible to infer the rules used to filter molecules at each stage of the project. Using each molecule’s membership in an assay as a prediction target, a rules inference engine—like those included in the WEKA data-mining toolkit9—can extract the rules that filter molecules at each stage of the project. For example, a rules inference engine successfully identified all the documented filters in one project (Fig. 6). Discussing these results with the original screeners, we discovered that most of the molecules incorrectly classified by our rules were molecules that were requested but could not be obtained from the NIH for further testing. Other incorrectly classified molecules were chosen because they were structurally similar to the top hits, a step that was not documented in the project. It should be possible to identify these additional filters using chemical structure mining algorithms such as GraphSig and ParMol.10,11 Our preliminary results suggest the general approach is feasible and suggest strategies for identifying structure-guided filters as well. Determining understandable ways of displaying these filters to users remains an open problem.
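A toy illustration of such an inferred filter, using hypothetical readout values in the style of Figure 6: molecules whose membership in the downstream bundle disagrees with the rule surface as exceptions.

```python
# Sketch: apply an inferred filter rule to primary-screen readouts and
# compare against the molecules actually promoted to the next bundle.
def apply_filter(rows, rule):
    """Return the CIDs a filter rule would promote."""
    return {r["cid"] for r in rows if rule(r)}

# Hypothetical readouts (field names and values are illustrative only).
rows = [
    {"cid": 1, "activity": 30.0, "s_channel": 95.0},
    {"cid": 2, "activity": 26.0, "s_channel": 101.0},  # fails the S-channel cut
    {"cid": 3, "activity": 10.0, "s_channel": 90.0},   # fails the activity cut
]
rule = lambda r: r["activity"] >= 25.05 and r["s_channel"] <= 99.5
predicted = apply_filter(rows, rule)
promoted = {1, 3}  # molecules actually present in the downstream bundle
# Molecule 3 is an "exception": promoted despite failing the rule, e.g.
# because it was chosen for structural similarity to a hit.
exceptions = predicted ^ promoted
```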
Second, as more specific examples are validated and errors are uncovered (e.g., Project 10 in Fig. 7), we hope to build a data set on which we can train machine-learning methods instead of our heuristic-based algorithm. With the right training data, we expect machine-learning approaches to more robustly detect workflows and overcome annotation errors in the source data. This remains an important but long-range goal at the moment, as it will take substantial effort to validate a sufficient number of workflows to form a training set of appropriate size. To accelerate this process, we are considering creating a Web site where screeners can validate the accuracy of the inferred workflows and identify errors that they uncover.

Our study also has implications for assay annotation. The main effort toward creating an ontology for biological assays—the BioAssay Ontology (BAO) project led by researchers at the University of Miami—focuses primarily on creating a controlled vocabulary for describing the technical and biological details of each individual assay.12,13 The BAO does include a few terms—under the “assay stage” heading—that are used to annotate the purpose of each assay in a project as primary, secondary, and several versions of selectivity screens. However, as we have seen, projects can be too complicated to fully capture with a few terms, and terms for annotating the provenance of the molecules in a screen are absent. In our view, the BAO is a critical effort that should be paired with an equally vigorous effort to annotate the purpose of and relationship between assays, as well as the filters used to promote molecules along the workflow. Ultimately, it should be possible to extend the BAO with a workflow ontology that captures these relationships and filters. A project’s structure leaves such a strong signature on its data that it might be possible to directly validate annotations made using an appropriate workflow ontology.
Until a workflow ontology is developed and PubChem is annotated with it, the inferred workflows from our current algorithm—despite their occasional errors—are a simple, intuitive method of browsing and searching through HTS data. We have developed a workflow-based front-end to PubChem available to the community at http://swami.wustl.edu/flowpaper.

Small-molecule screening projects are complicated, multistage experiments that string several assays together to identify molecules that satisfy several constraints. The assays in these projects are best understood in light of their relationship to the other assays, and the project’s workflow exposes many important relationships between assays. Although screening workflows are often unpublished or incompletely documented, they can be inferred directly from the assay data deposited in public repositories. These inferred workflows expose the structure of screening
Filter                                            Molecules  Exceptions
activity ≥ 25.05, S-channel ≤ 99.5                1029       38
25.05 > activity ≥ 24.1, not in B                 354        44
24.1 > activity ≥ 20, S-channel ≤ 99.5, not in B  23         6
activity ≥ 20, S-channel ≤ 100.7, not in A        481        42
Figure 6. Inferred filters in a high-throughput screening project (Project 4 in Table 2). In this project, the filters inferred from the project’s data closely match the filters documented in the project and also uncover undocumented details of the workflow.
Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding

This work was generously supported by the Pathology and Immunology Department at Washington University.
References
Figure 7. Trypanosoma cruzi inhibitors (Project 10 in Table 2). The left graph shows the inferred workflow with incorrect edges depicted with dotted lines. The right graph shows the correct workflow with correct but missed edges depicted with dotted lines. This project contains the most mistakes, all of which involve two bundles, C and D. Bundle D was run on a set of molecules selected for structure-activity relationships.
projects and suggest ways of annotating assays in public screening repositories.

Acknowledgments

Author contributions: SJS conceived the idea of this project, provided project direction, and wrote the final manuscript. BTC designed and implemented the workflow detection algorithm. BTC and BRC implemented the assay grouping algorithm. MRB implemented a Web database of the mined workflows. JAB and BRC compared the inferred and documented workflows. We would also like to thank Lynn VerPlank and Jim Spoonamore from the Broad Institute for validating some of their screens’ workflows.
1. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: A Public Information System for Analyzing Bioactivities of Small Molecules. Nucleic Acids Res. 2009, 37, W623.
2. Bolton, E. E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. PubChem: Integrated Platform of Small Molecules and Biological Activities. Annu. Rep. Comput. Chem. 2008, 4, 217–241.
3. Chen, B.; Dong, X.; Jiao, D.; Wang, H.; Zhu, Q.; Ding, Y.; Wild, D. J. Chem2Bio2RDF: A Semantic Framework for Linking and Data Mining Chemogenomic and Systems Chemical Biology Data. BMC Bioinform. 2010, 11, 255.
4. Zhu, Q.; Lajiness, M.; Ding, T.; Wild, D. WENDI: A Tool for Finding Non-Obvious Relationships between Compounds and Biological Properties, Genes, Diseases and Scholarly Publications. J. Cheminform. 2010, 2, 6.
5. Pouliot, Y.; Chiang, A. P.; Butte, A. J. Predicting Adverse Drug Reactions Using Publicly Available PubChem BioAssay Data. Clin. Pharmacol. Ther. 2011, 90, 90–99.
6. Han, L.; Suzek, T.; Wang, T.; Bryant, S. The Text-Mining Based PubChem BioAssay Neighboring Analysis. BMC Bioinform. 2010, 11, 549.
7. Workman, P.; Collins, I. Probing the Probes: Fitness Factors for Small Molecule Tools. Chem. Biol. 2010, 17, 561–577.
8. Ellson, J.; Gansner, E.; Koutsofios, L.; North, S.; Woodhull, G. GraphViz—Open Source Graph Drawing Tools. 2002. http://www2.research.att.com/~north/people/pdf/GDS-Ellsongknw03.pdf
9. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The WEKA Data Mining Software: An Update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18.
10. Ranu, S.; Singh, A. K. GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases. In Proceedings of the 25th International Conference on Data Engineering, 2009, 844–855.
11. Meinl, T.; Wörlein, M.; Urzova, O.; Fischer, I.; Philippsen, M. The ParMol Package for Frequent Subgraph Mining; Bibliothek der Universität Konstanz: Konstanz, Germany, 2006.
12. Visser, U.; Abeyruwan, S.; Vempati, U.; Smith, R. P.; Lemmon, V.; Schürer, S. C. BioAssay Ontology (BAO): A Semantic Description of Bioassays and High-Throughput Screening Results. BMC Bioinform. 2011, 12, 257.
13. Schürer, S. C.; Vempati, U.; Smith, R.; Southern, M.; Lemmon, V. BioAssay Ontology Annotations Facilitate Cross-Analysis of Diverse High-Throughput Screening Data Sets. J. Biomol. Screen. 2011, 16, 415–426.