A Grid Infrastructure for Mixed Bioinformatics Data and Text Mining

Moustafa Ghanem, Yike Guo, Anthony Rowe
Department of Computing, Imperial College London
180 Queens Gate, London, SW7 2AZ, UK
[email protected], [email protected], [email protected]

Alexandros Chortaras, Jon Ratcliffe
InforSense Ltd
47 Princess Gardens, London, SW7 2PE, UK
[email protected], [email protected]
Abstract

In this paper we present an infrastructure for conducting data and text mining over distributed data and computational resources. Our approach is based on extending the Discovery Net infrastructure, a grid-computing environment for data mining, to allow end users to construct complex distributed text and data mining workflows. We describe our architecture, data model and visual programming approach, and also present a number of text mining examples conducted over biological data to highlight the advantages of our system.
1 Introduction and Motivation
A fundamental problem that biological researchers face today is how to make effective use of the available wealth of online information to improve their understanding of complex biological systems. Online biological information exists in a combination of structured, semi-structured and unstructured forms, and is dispersed between specialised databases providing medical and scientific literature on the one hand and information about genes, proteins and chemical compounds on the other. Effectively extracting, integrating, understanding and making use of the information stored in all these different data sources, and relating it to experimental data, is typically a dynamic and iterative process that requires the integration of data and text mining components in an open computing environment. In the remainder of this section we characterise some of the features of this dynamic process to set the context and describe the motivation for our work.
1.1 Mixed Data and Text Mining
In this paper we note two forms of interaction between data and text mining. The traditional interaction, whereby text mining makes use of standard data mining techniques, has been noted by a large number of authors. For example, it is not difficult to note that many text mining applications follow a generic pipeline that takes in text documents, performs a number of text pre-processing steps (cleaning, NLP parsing, regular expression operations, entity extraction operations, etc.), and then applies statistical operations that encode features of the documents in vector form, where counts are recorded for user-defined features such as keywords, patterns, and gene and disease names. This vector form is then amenable to the application of traditional data mining techniques such as classification, clustering, PCA, association analysis, etc. A second form of interaction appears often in bioinformatics research, whereby text mining is used to validate and interpret the results of a data mining procedure. Consider, for example, a scientist engaged in the analysis of numerical experimental data using traditional data mining techniques, such as the analysis of microarray gene expression data using clustering. The result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit similar experimental behaviour). At this stage, the user may wish to validate the significance of these findings by referring to available online information sources that hold information about these genes. This second phase requires the use of text mining techniques to access, integrate and analyse the content available about such genes in structured, semi-structured and unstructured information sources.
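To make this generic pipeline concrete, the following Python sketch encodes a handful of abstracts as keyword feature vectors and clusters them. It uses scikit-learn purely for illustration; it is not one of the Discovery Net components described later, and the abstracts are invented.

# Illustrative sketch of the generic text mining pipeline described above:
# documents are preprocessed, encoded as feature vectors of keyword weights,
# and then passed to a standard clustering algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "insulin resistance plays a major role in type 2 diabetes",
    "gene expression profiling of pancreatic beta cells",
    "protein kinase signalling and insulin secretion",
    "clustering of microarray gene expression data",
]

# Pre-processing and vector encoding: keyword features weighted by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(abstracts)

# Traditional data mining step applied to the vector form of the documents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(abstracts, labels)))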
1.2 Open Knowledge Discovery Environments
The diversity, distribution and heterogeneity of bioinformatics data make it impractical to use “closed” data/text mining systems that assume a centralised database or data warehouse where all the data required for an analysis task can be materialised locally before being fed into data mining algorithms and tools that were themselves predefined when the system was configured. We thus argue that within a bioinformatics mixed data and text mining scenario a scientific knowledge discovery process is better conducted in an open environment that makes use of distributed compute and data resources. The reasons can be summarised as follows:

1. There are typically many different data and text analysis software components that can be used to analyse the data. Some may reside on the user's local machine; others may be tied for execution to remote servers, e.g. via a web-service interface or simply via a web page. An individual researcher needs to be able to locate such software components and integrate them within their analysis procedures. The ability to execute such components remotely plays a crucial role in compute-intensive applications, such as data and text mining, where efficient execution may depend on dedicated high performance computing resources.

2. The data resources themselves (genomic databases, article repositories, etc.) are by default physically distributed. The ability to effectively integrate such resources within a knowledge discovery process, without downloading them locally, is becoming an important feature of successful bioinformatics platforms.

2 The Discovery Net System

The Discovery Net project (Curcin et al., 2002) is funded under the UK e-Science programme. The requirements for the project have developed from the need for a higher-level layer of informatics middleware to allow scientists to create meaningful data analysis processes and execute them using an underlying distributed computing (Grid computing) infrastructure without being aware of the protocols used by individual services. The Discovery Net system builds on top of the fundamental Grid technologies (Foster and Kesselman, 1997) to provide a bridge between the end user of a Grid service and the developers of individual Grid tools. Using the various tools produced as part of Discovery Net, generating a reusable Grid application becomes the task of selecting the required components and services and connecting them into a process using a workflow model represented in an XML-based language, the Discovery Process Markup Language (DPML) (Syed et al., 2002). A workflow created in DPML is reusable and can be encapsulated and shared as a new service on the Grid for other users. The system is generic and we have developed a number of case studies for bioinformatics applications, including genome and protein annotation. For example, in (Rowe et al., 2003a) we describe how the system was used to combine distributed tools to build a real-time genome sequence annotation pipeline. This application was awarded the “Most Innovative Data Intensive Application” award at the High Performance Computing Challenge at the IEEE/ACM 2002 Supercomputing Conference. In the remainder of this section we briefly describe the architecture, and in the next section we describe how it has been specialised for constructing distributed mixed data/text mining applications.

2.1 Discovery Net Architecture

Figure 1: The Discovery Net Architecture.

The Discovery Net system is designed primarily to support the distributed analysis of scientific data based on a workflow or pipeline methodology. In this framework, analysis components are treated as remote services or black boxes with known input and output interfaces. Such services can be combined into workflows that coordinate the execution of the distributed components. The system architecture is split into a number of core services, as shown in Figure 1. These services can be used via well-defined APIs that allow the user to expand and modify the system with new data types, components and clients, and to compose and execute the services.

Component Service: The Component Service manages the integration of different remote services into the system. It allows a wide variety of protocols to be used when integrating remote services, including HTTP, Web services using the SOAP protocol, and OGSA Grid services.

Computational Service: The Computational Service provides the infrastructure for executing local software components directly on a user's local system. Allowing the user to execute components locally can improve the performance of a variety of applications.

Execution Service: The Execution Service manages the distributed execution of a workflow by analysing its DPML specification and matching the operations to the components made available via the Component Service. The execution engine co-ordinates the scheduling of the workflow, passing each component the correct input data and handling the output results.

Data Access and Storage Service: The Data Access and Storage Service is a utility service designed to aid common data access tasks (e.g. remote ODBC database access), while its storage part allows data that has been accessed to be stored locally. It also provides storage for the representations of workflows that have been designed in the system.

InfoGrid Service: The InfoGrid Service (Giannadakis et al., 2003) provides a standard query interface for heterogeneous databases, such as those found in bioinformatics. It also allows the dynamic creation of data sets from a wide variety of distributed heterogeneous platforms.

User Defined Service: A main aim of the Discovery Net platform is to make the system extensible so that its users can take advantage of new services as they become available. It provides the user with an SDK for adding new remote and local components (and third-party software) into the system using a standard interface.

Discovery Net API: The Discovery Net API allows programmatic access to all of the Discovery Net services. It is used to develop applications using the system.

Discovery Net Clients: The Discovery Net clients provide users with graphical means of constructing their knowledge discovery workflows, as well as access to data resources and result visualisation tools.
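The paper describes these services conceptually rather than through their programming interface, so the following Python sketch is purely hypothetical: class and method names such as Component, Workflow, connect and execute are our own assumptions, not the Discovery Net API. It only illustrates the idea of composing black-box components into a workflow whose execution is coordinated centrally, in the spirit of the Component and Execution services.

# Hypothetical sketch (not the actual Discovery Net API) of how a workflow
# coordinates local or remote components treated as black boxes with known
# input/output interfaces. All names here are illustrative assumptions.
from typing import Callable, List

class Component:
    def __init__(self, name: str, run: Callable[[object], object]):
        self.name = name
        self.run = run          # a local function or a stub calling a remote service

class Workflow:
    def __init__(self):
        self.steps: List[Component] = []

    def connect(self, component: Component) -> "Workflow":
        self.steps.append(component)   # a linear pipeline, for simplicity
        return self

    def execute(self, data: object) -> object:
        # The Execution Service analogue: pass each component the output of
        # its predecessor and return the final result.
        for step in self.steps:
            data = step.run(data)
        return data

fetch = Component("FetchAbstracts", lambda ids: ["abstract about insulin resistance"] * len(ids))
tokenise = Component("Tokenise", lambda docs: [d.split() for d in docs])
result = Workflow().connect(fetch).connect(tokenise).execute(["12345678", "23456789"])
print(result)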
2.2 Component Integration
The main abstraction of a Discovery Net component is a Service Descriptor that provides the basic mechanisms required for integrating local and remote software components into the system. It provides a description of the input and output ports of the component, the type of data that can be passed to the component, and the parameters of the service that a user might want to change. Once the service descriptor has been developed, the component can be added into Discovery Net by registering it with the Component Service. This dynamically makes the service available to clients so that users can take advantage of the new service.
Figure 2: The anatomy of a Discovery Net component.
The abstraction is shown in Figure 2, which depicts the three main types of interface. The first is the Output port interface, which describes what will be output by the service at runtime. The Input port interfaces describe the required inputs to the service; these can be constrained to accept connections only from output ports that provide specific metadata. This is important for describing valid discovery processes according to DPML, and helps users who compose services into a specific application to build processes that make sense. The final interface consists of the parameters that the user is able to change to customise the service.
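The following Python fragment is a minimal, hypothetical rendering of the service descriptor abstraction (all names are our own assumptions, not the Discovery Net SDK): a descriptor declares input and output ports with associated metadata, exposes user parameters, and is registered so that clients can discover it; the accepts check mirrors the metadata constraints placed on input ports.

# Minimal sketch, under assumed names, of the service descriptor abstraction:
# input/output ports with associated metadata and user-changeable parameters.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Port:
    name: str
    metadata: Dict[str, str]          # e.g. {"datatype": "document-collection"}

@dataclass
class ServiceDescriptor:
    name: str
    inputs: List[Port]
    outputs: List[Port]
    parameters: Dict[str, str] = field(default_factory=dict)

    def accepts(self, upstream: Port, input_name: str) -> bool:
        # An input port only accepts connections from output ports whose
        # metadata satisfies its declared constraints.
        target = next(p for p in self.inputs if p.name == input_name)
        return all(upstream.metadata.get(k) == v for k, v in target.metadata.items())

registry: Dict[str, ServiceDescriptor] = {}   # stand-in for the Component Service

pos_tagger = ServiceDescriptor(
    name="POSTagger",
    inputs=[Port("in", {"datatype": "document-collection"})],
    outputs=[Port("out", {"datatype": "document-collection", "annotated": "pos"})],
    parameters={"language": "en"},
)
registry[pos_tagger.name] = pos_tagger        # registration makes it visible to clients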
3 Text Mining in Discovery Net
The complete flexibility that the Discovery Net infrastructure provides with respect to the data structures that can be passed between components makes the modelling of data a crucial element in deciding how best to use the infrastructure within a particular domain. Any data object passed around in the framework has attached to it a metadata object that captures its properties. These can be the column types and table size in the case of relational data structures, the types of annotations present in the case of a set of genomic/proteomic sequences, or the necessary information about the content of text documents in the case of text mining applications. It is this metadata that is handled in the metadata declaration of a component. In the same way that the abstraction layering of components enables the definition of the execution to be decoupled from the actual execution, this metadata layer decouples the definition of the data from the data itself. This is a significant advance over standard service composition frameworks, which provide abstraction only over the execution of the components but not over the data objects that are passed between them. The physical location and nature of the data become relevant only at the execution stage, when the data need to be passed to the component, not during the workflow construction phase. In the remainder of this section we describe our text document model and its associated metadata, which enable us to compose distributed and heterogeneous software components into standardised text mining workflows.
3.1 Document Model
At the core of Discovery Net text mining is an extensible document representation model. The model is based on the Tipster Document Architecture (Grishman, 1998), which uses the notion of a document annotation, defined as “the primary means by which information about a document is recorded and transmitted between components within the system”. Following this model, a single
document is represented by two entities: the document text, which corresponds to the plain document text, and the annotation set structure. The annotation set structure provides a flexible mechanism for associating extra-textual information with certain text segments. Each such text segment is called an annotation, and an annotation set consists of the full set of annotations that make up a document. A single annotation is uniquely defined by its span, i.e. by its starting and ending position in the document, and has associated with it a set of attributes. The role of the attributes is to hold additional information, e.g. about the function, the semantics, or other types of user-defined information related to the corresponding text segment. Each annotation also has a type which, unlike in the original Tipster Architecture, is in our system a low-level notion limited to defining the role of the annotation as a constituent part of the document, i.e. as a single word, as a sequence of words, etc. Any type of additional information is represented by the associated attributes. This distinction prevents the conflicting use of annotation types and attributes as information holders by reserving this role for the attributes alone. Thus each annotation has a unique type and may have any number of attributes. Each attribute has its own type, denoting the kind of information that the attribute represents, and a value. Depending on the particular application, different attributes will be assigned to the annotations. Typical examples include attributes that represent the results of a natural language processing operation such as part-of-speech tagging, stemming and morphological analysis, the results of dictionary lookups or database queries for certain annotations, or the results of a named entity or terminology extraction process. Table 1 shows a simple example of an annotated text.

Text                 Annot. Type      Attributes
Insulin              token            pos:noun, stem:insulin
resistance           token            pos:noun, stem:resist
Insulin resistance   compound token   disease:insulin resistance
plays                token            pos:verb, stem:plai
a                    token            pos:det, stopword
major                token            pos:adj, stem:major
role                 token            pos:noun, stem:role
Table 1: Annotated Text Example
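As a sketch of how the model just described might be realised (the class names are assumptions, not the actual implementation), the following Python fragment represents part of the Table 1 example: each annotation carries a span, a low-level type and a set of attributes, and the attributes alone carry higher-level information such as the disease name.

# Sketch of the document model described above, using assumed class names.
# An annotation is defined by its span (start, end), a low-level type, and
# a set of attributes holding any additional information.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Annotation:
    span: Tuple[int, int]                 # start and end position in the text
    type: str                             # e.g. "token", "compound token"
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

doc = Document(text="Insulin resistance plays a major role")
doc.annotations = [
    Annotation((0, 7),   "token", {"pos": "noun", "stem": "insulin"}),
    Annotation((8, 18),  "token", {"pos": "noun", "stem": "resist"}),
    Annotation((0, 18),  "compound token", {"disease": "insulin resistance"}),
    Annotation((19, 24), "token", {"pos": "verb", "stem": "plai"}),
]

# Retrieve all text segments carrying a "disease" attribute.
diseases = [doc.text[a.span[0]:a.span[1]]
            for a in doc.annotations if "disease" in a.attributes]
print(diseases)   # ['Insulin resistance']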
In addition to the individual annotations, the annotation set structure also holds rich metadata that provides a basic profile for the document. This includes statistical information about each entity in the text, and about the contained attribute types and values.
A set of documents together with their associated annotation set structures makes up a document collection. In addition, each document collection has a separate metadata structure associated with it that provides overall statistical information about the annotation and attribute types and values of the underlying documents.
3.2 Document Indexing
A document collection will typically consist of a large number of documents; an indexing scheme for efficient processing is therefore required. At the individual document level, the annotation set structure is by itself an index and allows the fast retrieval of all text segments that satisfy certain conditions involving particular annotation types, attribute types and values. At the document collection level, the system supports the creation of indices on the document text and on attribute types. A persistent index is generated, which is then used by the processing components to improve their performance.
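A minimal sketch of such an index, under assumed data structures: an inverted index mapping (attribute type, attribute value) pairs to the documents and positions where they occur, so that queries over attributes do not require scanning the whole collection.

# Sketch of a simple inverted index over annotation attributes; the data
# representation below is an illustrative assumption, not the actual index.
from collections import defaultdict

# Each document is represented here minimally as a list of annotations,
# where an annotation is a (start, end, type, attributes) tuple as in 3.1.
collection = [
    [(0, 18, "compound token", {"disease": "insulin resistance"})],
    [(5, 13, "compound token", {"disease": "diabetes"})],
]

# Inverted index: (attribute type, attribute value) -> [(doc id, start position)]
index = defaultdict(list)
for doc_id, annotations in enumerate(collection):
    for start, end, ann_type, attrs in annotations:
        for attr_type, attr_value in attrs.items():
            index[(attr_type, attr_value)].append((doc_id, start))

# All occurrences of the disease "insulin resistance", without scanning documents:
print(index[("disease", "insulin resistance")])   # [(0, 0)]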
3.3 Feature Vector
In many cases the aim of a text mining analysis is the reduction of each document in a document collection to a single feature vector, whose dimensions reflect the significant informational components that the analysis has identified. These features can then be passed to other typical data mining operations, such as clustering and categorization, for further processing. The main characteristic of the feature vectors extracted from documents is that they are usually sparse vectors of high dimensionality. Representing these vectors as normal dense vectors is thus highly inefficient, resulting in a waste of memory and processing resources. Our system is therefore based on the use of a sparse feature vector type as the basic type for representing documents in the feature vector space. All components are designed to generate and operate on sparse feature vectors, and conversions from and to the dense feature vector type can be done explicitly if required.
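A minimal sketch of the sparse representation, with assumed names: only non-zero dimensions are stored, and conversion to a dense vector is an explicit operation performed only when a downstream component requires it.

# Sketch of the sparse feature vector representation: only non-zero dimensions
# are stored, which matters when documents are mapped into a space with many
# thousands of keyword or entity dimensions. Names are illustrative.
from typing import Dict, List

SparseVector = Dict[str, float]      # feature name -> non-zero value

def to_dense(vec: SparseVector, dimensions: List[str]) -> List[float]:
    # Explicit conversion to a dense vector over a fixed list of dimensions.
    return [vec.get(dim, 0.0) for dim in dimensions]

doc_vector: SparseVector = {"insulin": 2.0, "resistance": 1.0}
print(to_dense(doc_vector, ["diabetes", "insulin", "resistance", "kinase"]))
# [0.0, 2.0, 1.0, 0.0]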
4 Discovery Net Text Processing Components
We have used our system to integrate a wide variety of 3rd party text mining and data mining software components. In addition, we have implemented a collection of processing components that offer a basic set of text mining operations, from which advanced processing workflows can be composed. The components are organized in the following seven groups: Import/Export, Pre-processing, Annotations, Indexing, Filtering, Feature Vectors, and Viewers.
The Import/Export group components allow text documents to be imported into (or exported from) the system from external sources, e.g. by converting PDF, DOC, HTML, XML, EndNote and other formats to plain text for use by other components.

The Pre-processing group consists of a set of components that perform any necessary modifications on the original document text before the creation of the annotation structure. They provide rich functionality, including the removal of stop words and of user-specified characters, keywords, phrases and regular expressions, as well as stemming, dictionary lookups and keyword replacement. During pre-processing the granularity of the document analysis may also be changed, by applying a component that splits the documents into, for example, sentences or other document segments.

The Annotations group components manage the creation of the basic annotation set, as well as enriching the set by adding new attributes. The implemented and integrated components allow the addition of natural language processing attributes (e.g. part-of-speech, stem) and of attributes based on the results of looking up the annotations in a user-provided dictionary, as well as identifying and annotating frequently occurring phrases using statistical methods.

The Indexing group contains the components for creating a persistent index of a document collection. The index is created on the annotation attributes that the user specifies and is used by the subsequent components in the workflow.

The Filtering group contains components that filter out from a document collection those documents that do not meet certain criteria. The filtering criteria can involve keywords, phrases, regular expressions and combinations of annotation attribute types and values.

The Feature Vectors group is the most extensive group and includes many components that transform a document collection to the feature vector space and perform statistical and transformation operations on the extracted feature vectors. For the generation of feature vectors, full flexibility over their dimensions is provided: the user selects which annotation types, which attribute types and, possibly, which specific values should form the dimensions of the feature vectors. For example, the user can choose to include all annotations of type token that are nouns and not stop words, or, if attributes have been generated by a specialised entity extraction component, specify that the features of each document will be the extracted entities. Apart from annotations, feature vectors can also be extracted directly from the document text; in this case the user must provide a set of keywords or regular expressions that will correspond to the feature vector dimensions. Other components in the same group allow the computation of co-occurrence matrices between the annotations of the documents, and of the similarity (e.g. cosine similarity) between the extracted feature vectors. Finally, components are provided that compute statistical information for the features of a single document collection (e.g. inverse document frequency, entropy) as well as inter-collection feature statistics (e.g. chi-square, mutual information, information gain).

The Viewers group provides graphical tools for the visualisation of the analysis results. It includes a heatmap viewer for the visualisation of feature vector sets, and viewers for the visualisation of individual documents, document collections and their annotation set structures.

In addition to the seven text mining component groups, the system also provides access to a large number of statistical and data mining components provided by the KDE data mining system (InforSense, 2004).
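As an illustration of what the Feature Vectors group does (the data and function names below are assumptions, not the actual components), the following sketch builds a sparse frequency vector from a document's annotations, keeping only noun tokens that are not stop words.

# Sketch of feature vector generation from annotations: the user's selection of
# annotation and attribute types (here, noun tokens that are not stop words)
# determines the dimensions. Illustrative only.
from collections import Counter

document_annotations = [  # (type, attributes) pairs for one document
    ("token", {"pos": "noun", "stem": "insulin"}),
    ("token", {"pos": "det", "stopword": "true"}),
    ("token", {"pos": "noun", "stem": "role"}),
    ("token", {"pos": "noun", "stem": "insulin"}),
]

def feature_vector(annotations):
    counts = Counter(
        attrs["stem"]
        for ann_type, attrs in annotations
        if ann_type == "token" and attrs.get("pos") == "noun"
        and "stopword" not in attrs
    )
    return dict(counts)                      # sparse vector: stem -> frequency

print(feature_vector(document_annotations))  # {'insulin': 2, 'role': 1}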
Figure 3: Co-occurrences Analysis Workflow.
An example of the type of functionality available is shown in the simple workflow of Figure 3, which computes the co-occurrence matrix between genes and diseases that appear in a set of PubMed abstract texts. The imported data are in XML format, from which the extraction component extracts the texts of the abstracts. The next three components create an annotation set structure for the resulting document collection, and use a gene and a disease dictionary to identify matching annotations in the abstracts. An index is then generated over the annotation set structure, which allows fast computation of the desired co-occurrence matrix.
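A sketch of the computation this workflow performs, assuming the gene and disease annotations have already been produced by the dictionary-lookup components (the data shown are illustrative):

# Sketch of the gene-disease co-occurrence computation of Figure 3, applied to
# abstracts whose gene and disease annotations are already available.
from collections import Counter
from itertools import product

# Per abstract: the gene and disease names identified by dictionary matching.
abstracts = [
    {"genes": {"INS", "IRS1"}, "diseases": {"insulin resistance"}},
    {"genes": {"INS"},         "diseases": {"insulin resistance", "diabetes"}},
]

cooccurrence = Counter()
for abstract in abstracts:
    for gene, disease in product(abstract["genes"], abstract["diseases"]):
        cooccurrence[(gene, disease)] += 1

print(cooccurrence[("INS", "insulin resistance")])   # 2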
5 Case Studies

5.1 Interpreting Gene Expression Data Sets
The first case study using the system was conducted to find associations between gene expression data and disease terms, and is described in the data analysis workflow shown in Figure 4. The workflow is divided into three logical phases.
Figure 4: Interpreting Gene Expression Data Sets.
In the first phase of the workflow (“Gene Expression Analysis”), a biologist conducts analysis over gene expression data using traditional data mining methods (statistical analysis, clustering, etc.). The output of this stage is a set of “interesting genes” that the data mining methods isolate as candidates for further analysis. For example, these can be differentially expressed genes for which the user is interested in finding the biological significance of their differential expression and in identifying the diseases that may be associated with them. In the second phase of the workflow (“Find Relevant Genes from Online Databases”), the user uses the InfoGrid integration framework to obtain further information about the genes. This part of the workflow starts by obtaining the nucleotide sequence for each gene by issuing a query to the NCBI database based on the gene accession number. The retrieved sequence is then used to execute a BLAST query to retrieve a set of homologous sequences; these sequences in turn are used to issue a query to SwissProt to retrieve the PubMed Ids identifying articles relating to the homologous sequences. Finally, the PubMed Ids are used to issue a query against PubMed to retrieve the abstracts associated with these articles. In the third phase of the workflow (“Find Association between Frequent Terms”), the user uses a dictionary of disease terms obtained from the MeSH vocabulary to isolate the key disease terms appearing in the articles. The identified disease terms are then analysed using traditional association analysis techniques to find frequently co-occurring disease terms in the retrieved article sets that are associated with both the identified genes and their homologues.
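A sketch of the association step in this third phase, over illustrative data: counting how often pairs of disease terms co-occur in the retrieved abstracts and keeping the pairs above a support threshold.

# Sketch of frequent disease-term pair mining over the retrieved abstracts.
# The disease term sets and the support threshold are illustrative assumptions.
from collections import Counter
from itertools import combinations

abstract_disease_terms = [
    {"diabetes mellitus", "insulin resistance"},
    {"insulin resistance", "obesity", "diabetes mellitus"},
    {"obesity"},
]

pair_counts = Counter()
for terms in abstract_disease_terms:
    for pair in combinations(sorted(terms), 2):
        pair_counts[pair] += 1

min_support = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # {('diabetes mellitus', 'insulin resistance'): 2}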
5.2 Gene Expression-Metabolite Mapping
The second case study was conducted to find relationships between genomic and metabonomic experimental data relating to mice (Rowe et al., 2003b). The gene expression data measure the amount of RNA expressed at the time a sample is taken, and the NMR spectra are produced from analysing urine samples of the same mice. We used a specific text corpus obtained by accessing 200 different papers about insulin resistance and reactions catalysed by enzymes. The goal is to develop a workflow that allows relationships to be found between the genes and metabolites; this proceeds in three phases. The first phase (“Microarray Analysis”) uses standard gene expression analysis techniques to filter interesting genes within the gene expression domain. The gene expression process used starts by mapping each gene expression probe Id to the sequence that would bind to that area. Using the sequence, we use BlastX to search the SwissProt database. This provides a method for finding known genes. After the BLAST process, we use the hits from this database to download features from the actual Swiss-Prot records to annotate the probe Id with possible gene names for the sequence and with the Enzyme Commission number when it exists.
Figure 5: Gene Expression-Metabolite Mapping.
In parallel, the second phase (“Metabonomic Analysis”) proceeds by analysing the NMR data using multivariate analysis techniques to find interesting features in the metabolic domain. The NMR shifts can be mapped to candidate metabolites either manually or by using NMR shift databases that provide a set of candidates for a given shift. The third phase (“Text Selections and Relationship Functions”) then proceeds by “joining” the outputs of phases 1 and 2: in the text corpus, we use a co-occurrence approach to find the most general relationships possible.
5.3 Scientific Document Categorization
Our third case study is based on using traditional data mining components for the categorization of bioinformatics documents. The case study comes from the KDD CUP 2002 competition. The task dealt with building automatic methods for detecting which scientific papers, in a set of full-text fruit fly genetics papers from FlyBase, contained experimental results about gene products (transcripts and proteins), and also, within each paper, for identifying and scoring which genes had
experimental results about their products mentioned. The task was designed to mirror how human experts at FlyBase Harvard curate papers containing experimental gene expression evidence of interest to the curators, specifically, experimental evidence about the products (mRNA transcripts (TR) or proteins/polypeptides (PP)) associated with a given gene.
Figure 6: Document Classification Workflow.
To address the task (Ghanem et al., 2003), we developed an improved automatic feature selection method used in conjunction with traditional data mining classifiers that learn from examples. The feature selection method is based on capturing frequently occurring keyword combinations within short segments of the text, to statistically mirror how a human expert would tackle the task. These features were then passed into a third-party SVM classifier. Our document classifiers based on this method proved to produce more accurate results than approaches relying solely on keyword-based features, and secured our team an honourable mention in the competition. The initial solution to the competition was implemented using standalone components written in Java and Perl. Over the last year we have ported the implementation into the Discovery Net framework; the corresponding workflow is shown in Figure 6. The port has significantly contributed to better maintenance of the code, including its easy adaptation to similar problems, and to our improved ability to experiment with various feature selection methods and classifiers. It has also allowed us to reduce the execution time of the workflows by mapping the implementation of the compute-intensive components onto high performance resources.
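The following sketch illustrates the idea behind this feature selection method rather than the actual KDD Cup implementation: keyword pairs co-occurring within short segments (here, sentences) are used as features for an off-the-shelf SVM. scikit-learn is used for illustration and is not the third-party classifier referred to above; the example papers and labels are invented.

# Illustrative sketch: keyword-pair features within short text segments,
# fed to a linear SVM. Not the exact competition pipeline.
from itertools import combinations
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def pair_features(document: str) -> dict:
    feats = {}
    for sentence in document.split("."):
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            feats[f"{a}|{b}"] = feats.get(f"{a}|{b}", 0) + 1
    return feats

papers = [
    "transcript detected by northern blot. protein expressed in embryo.",
    "sequence similarity discussed. no expression data reported.",
]
labels = [1, 0]    # 1: contains experimental evidence about gene products

X = DictVectorizer().fit_transform([pair_features(p) for p in papers])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))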
6 Discussion and Conclusion
In this paper we have presented an architecture for mixed data and text mining. Its main features can be summarised as follows: 1) providing a workflow model for the co-ordinated execution of remote services, with a uniform abstraction for invoking components either as local computations or as remote services; 2) providing an easy programmatic interface for integrating third-party components into the system, together with a uniform abstraction for a shared data model and metadata declaration model that allows text, data and models to be passed between components in a consistent way; 3) providing mechanisms for executing compute-intensive components on high performance machines using task farming; 4) providing mechanisms for integrating data from remote data sources; and 5) providing a visual application-building front-end that allows for rapid application building and workflow maintenance. We have used the system to develop a large number of applications. In addition to those described in this paper, we are also in the process of developing other applications that integrate text and data mining components, including named entity recognition applications (e.g. gene name identification) and entity-entity relationship extraction applications (e.g. gene-disease and protein-protein relationships).

References

Curcin, V., Ghanem, M., Guo, Y., Kohler, M., Rowe, A., Syed, J. and Wendel, P. (2002). Discovery Net: Towards a Grid of Knowledge Discovery. In Proceedings of KDD-2002, the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Canada.

Foster, I. and Kesselman, C. (1997). Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115-128.

Ghanem, M., Guo, Y., Lodhi, H. and Zhang, Y. (2003). Automatic Scientific Text Classification Using Local Patterns: KDD CUP 2002 (Task 1). SIGKDD Explorations, Vol. 4, No. 2, December 2002.

Giannadakis, N., Rowe, A., Ghanem, M. and Guo, Y. (2003). InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences, 3:199-226, 2003.

Grishman, R. (1998). TIPSTER Text Architecture Design. http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/docs/arch31.doc

InforSense (2004). http://www.inforsense.com

Rowe, A., Kalaitzopoulos, D., Osmond, M., Ghanem, M. and Guo, Y. (2003a). The Discovery Net System for High Throughput Bioinformatics. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, 2003. Also appears in Bioinformatics (ISMB Supplement), 2003:225-231.

Rowe, A., Ghanem, M. and Guo, Y. (2003b). Using Domain Mapping to Integrate Biological and Chemical Databases. International Chemical Information Conference, Nimes, 2003.

Syed, J., Guo, Y. and Ghanem, M. (2002). Discovery Processes: Representation and Re-use. UK e-Science All Hands Meeting, Sheffield, UK, September 2002.