Multiagent Framework for Bio-data Mining
Pengyi Yang1, Li Tao2, Liang Xu2, and Zili Zhang2,3
1 School of Information Technologies (J12), The University of Sydney, NSW 2006, Australia
[email protected]
2 Faculty of Computer and Information Science, Southwest University, Chongqing 400715, China
3 School of Information Technology, Deakin University, Geelong, Victoria 3217, Australia
[email protected]
Abstract. This paper proposes applying multiagent based data mining technologies to biological data analysis. The rationale is justified from multiple perspectives, with an emphasis on the biological context. We then present an initial multiagent based bio-data mining framework. Based on the framework, we developed a prototype system to demonstrate how it helps biologists perform a comprehensive mining task for answering biological questions. The system offers a new way to reuse biological datasets and available data mining algorithms with ease.
1 Introduction
The unprecedentedly fast development of molecular biology is driven by modern high-throughput data-generating technologies. The massive amount of data accumulated over the last two decades covers a full spectrum of biological aspects and promises to raise our view and understanding to a higher level: systems biology [1]. Yet such vast collections of data are not in themselves meaningful. To extract useful biological information and knowledge from the raw data, various data mining strategies and their hybrids have been explored [2,3].
Owing to high expense, high labor demands, and most importantly the different levels of analysis (genome, transcriptome, proteome, etc.) and the various data-generating protocols (sequencing, genotyping, microarray, serial analysis of gene expression or SAGE, mass spectrometry or MS, etc.), biological data are largely distributed across different databases around the globe, with heterogeneous characteristics and formats [4]. However, the available data mining strategies and their hybrids are often determined by the problem formulation and require careful preparation and editing before they can be applied to a specific problem. This gap creates an application barrier for researchers who want to combine different types of data to answer general biological questions, and makes the reuse of an existing data mining program very difficult.
To address the difficulties of reusability and to make bio-data mining easily accessible to biological researchers, who are often unfamiliar with any specific data mining algorithm, an agent-driven data mining framework is proposed for biological data analysis. This system hides the data mining details from the users and attempts to provide as many available results as possible for a given enquiry. It helps biological researchers view the enquiry from a higher level by combining multiple levels and sources of results, and makes data mining easy for nonexperts to apply.
The paper is organized as follows: Section 2 argues for applying such an agent-driven data mining framework in the context of biological data analysis. Section 3 provides an overview of the proposed framework, details the experimental design, and provides some preliminary results. Section 4 concludes the paper.
P. Wen et al. (Eds.): RSKT 2009, LNCS 5589, pp. 200–207, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Why Multiagent Based Bio-data Mining?
In this section, we present the rationale for introducing multiagent based bio-data mining into biological data analysis from different perspectives.
Hidden Technical Details. The essential goal of bio-data mining is to provide meaningful biological information and knowledge for a better understanding of the organism being investigated. Therefore, the target users of the mining algorithms should be biologists. However, since bio-data mining is by nature a data-driven process, most bio-data mining programs assume that users have at least moderate knowledge of data representation and data mining, and require them to select an appropriate algorithm from a large number of candidates for a specific biological problem. Unfortunately, such requirements are unreasonable for most biologists. Agent-based bio-data mining leaves the technical details of choosing mining algorithms, forming hybrid systems, and preparing specific data formats to the intelligent system itself. It alleviates the technical difficulty while enhancing the reusability of the mining algorithms and available datasets.
Data Format. One major difficulty in reusing an existing bio-data mining program is that many biological datasets are generated with different protocols and stored in different formats. Take microarray data as an example. While many mining algorithms require the genes and samples/conditions to be represented as a data matrix, many microarray datasets are actually stored as one gene vector per sample/condition, spread over multiple files. This is probably because different laboratories often use different technologies and standards for data generation and acquisition, and ever-changing technologies make standardization of bio-data very hard. As one may expect, this has had a similar effect on program and algorithm development: different mining programs often impose different requirements on data format. Even a slight difference in format requirements may force the analyst to go carefully through the data format manual many times during data preparation; otherwise the program will produce erroneous results or simply will not work. By applying an agent-based data mining framework, we can leave the data format details to the agents that actually carry out the dirty work, and the reusability of both data and algorithms can be enhanced.
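To make the data format issue concrete, the following is a minimal sketch of the kind of conversion a mining agent might perform: assembling one gene vector per sample into a single samples-by-genes matrix. The function name `vectors_to_matrix`, the choice to keep only genes present in every sample, and the toy values are all assumptions made for illustration; they are not taken from the prototype system.

```python
from typing import Dict, List, Tuple


def vectors_to_matrix(
    per_sample: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Assemble per-sample gene vectors into one samples-by-genes matrix.

    per_sample maps a sample id to {gene id: expression value}, i.e. the
    parsed content of one (hypothetical) per-sample file.
    """
    samples = sorted(per_sample)
    # Assumption: keep only genes measured in every sample, so that the
    # resulting matrix has no missing cells.
    gene_sets = [set(vector) for vector in per_sample.values()]
    genes = sorted(set.intersection(*gene_sets)) if gene_sets else []
    matrix = [[per_sample[s][g] for g in genes] for s in samples]
    return samples, genes, matrix


# Toy values for illustration only; not taken from any dataset in this paper.
parsed_files = {
    "sample_B": {"geneX": 0.5, "geneY": 2.0, "geneZ": 1.5},
    "sample_A": {"geneY": 1.0, "geneX": 3.0},
}
samples, genes, matrix = vectors_to_matrix(parsed_files)
```

Here each row of `matrix` corresponds to an entry of `samples` and each column to an entry of `genes`; a real transducer would also have to parse the files and write the target format expected by the mining algorithm.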
Parallel Analyzing. A multiagent system (MAS) is a powerful technology for dealing with system complexity [5]. It provides an architecture for distributed computing [6], and is primarily designed to solve computationally intensive problems by delegating the exploration of different portions of the computation and data analysis to different agents, with the objective of expediting the search. We believe that such an architecture is well suited to bio-data analysis because data mining is often a computationally intensive and time-consuming procedure. By applying multiagent based distributed bio-data mining, the computing load can be balanced and the computation carried out in parallel. Such a framework can not only speed up the overall mining process but also incorporate multiple sources of information for answering a given biological problem (bio-information fusion).
Agent-Based Hybrid Construction. Many data mining algorithms have been successfully applied to bioinformatics. Some examples are genetic algorithms (GA) [7], neural networks, and support vector machines (SVM) [8]. However, recent developments indicate that in many cases one technique is not sufficient to solve a problem, owing to ever-increasing problem complexity and requirements. Following this observation, various hybrid systems have boomed in the last few years [9,10,11]. Yet there are numerous ways in which algorithms can be combined. In our previous studies, we demonstrated that a specially designed agent-based framework can be used to create efficient hybrid systems in a short time [12,13]. The basic idea is to provide the data mining agents with some general mining rules and then let the agents evaluate different mining algorithms at runtime. With such an agent-based hybrid system, any mining algorithm can be added to the system dynamically, and the flexibility and robustness of the system are greatly improved.
Mining Multiple Levels of Data. A unique feature of biological data is that they range from very basic DNA sequences to 3-dimensional protein structures. As indicated in Figure 1, we divide them broadly into three major groups, namely genomic data, transcriptomic data, and proteomic data, corresponding to nucleotide, gene expression, and protein analyses. Traditionally, a given biological enquiry is performed by applying certain data mining algorithms to a specific biological data type. However, a full view of the biological system only becomes clear when data from all levels are integrated; mining multiple levels of data may therefore offer a more holistic picture and a more in-depth understanding of the underlying mechanisms. A multiagent bio-data mining framework offers an efficient way to organize and mine multiple levels of bio-data with ease.
Mining Same Level of Data Obtained by Different Technologies. Within a level, we may have different types of data generated by different technologies (Figure 1). Take the transcriptomic level as an example: two types of gene expression profiling technologies are widely used, namely serial analysis of gene expression (SAGE) and DNA microarray. While SAGE data consist of a list of thousands of sequence tags and the number of times each tag is observed in different samples or conditions, microarrays present gene expression as hybridization abundance from different samples or conditions. A multiagent based bio-data mining framework offers the capability to mine and combine the results generated by different types of technologies simultaneously. In this way, multiple outcomes can be used to validate and confirm the mining results.
[Figure 1: data types grouped by level. Genomic level: DNA sequence, SNPs. Transcriptomic level: microarray, RNA blotting, SAGE. Proteomic level: MS, protein sequence, protein 3-D structure.]
Fig. 1. Biological Data. Biological data can be divided into three levels. Each color block indicates a type of data generated by a specific technology in a given level.
Mining Same Data From Multiple Sources. In many cases, a biological dataset may be pre-processed with different criteria and stored in different databases with different formats. When the same data mining algorithm is applied, datasets pre-processed or pre-filtered with different pipelines and stored in different formats may give quite different mining outcomes, which leads to inconsistent results. To enhance reliability, one can employ different mining algorithms to mine the different versions of the same dataset and assess the mining results collaboratively. This gives a less biased analysis and helps biologists discriminate the genuine factors associated with the biological phenomenon of interest. Such a procedure can be carried out by the multiagent system in parallel, and the results can be compared and combined to increase reliability.
With the above analysis and justification, we anticipate that multiagent systems will be an increasingly important framework for biological data mining and analysis in the coming years.
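The idea of mining several versions of one dataset with several algorithms in parallel, and then assessing the results collaboratively, can be sketched as follows. This is only an illustration under stated assumptions: threads from Python's standard `concurrent.futures` module stand in for distributed mining agents, the two feature selectors (`top_by_variance`, `top_by_range`) are hypothetical placeholders rather than the algorithms used in the prototype, and the simple vote threshold is one possible way to combine results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Dataset = Dict[str, List[float]]  # feature id -> values across samples
Miner = Callable[[Dataset, int], List[str]]


def top_by_variance(data: Dataset, k: int) -> List[str]:
    """Hypothetical miner: keep the k features with the largest variance."""
    def variance(values: Sequence[float]) -> float:
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    return sorted(data, key=lambda f: variance(data[f]), reverse=True)[:k]


def top_by_range(data: Dataset, k: int) -> List[str]:
    """Hypothetical miner: keep the k features with the largest value range."""
    return sorted(data, key=lambda f: max(data[f]) - min(data[f]), reverse=True)[:k]


def consensus(versions: List[Dataset], miners: List[Miner],
              k: int, min_votes: int) -> List[str]:
    """Run every miner on every dataset version in parallel, then keep the
    features selected by at least min_votes of the (miner, version) runs."""
    jobs = [(miner, data) for data in versions for miner in miners]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda job: job[0](job[1], k), jobs))
    votes = Counter(f for selected in results for f in selected)
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Toy data: two differently pre-filtered versions of the same dataset.
version_a = {"g1": [0.0, 10.0, 0.0, 10.0], "g2": [1.0, 1.0, 1.0, 2.0],
             "g3": [5.0, 5.0, 5.0, 5.0]}
version_b = {"g1": [0.0, 9.0, 1.0, 10.0], "g2": [1.0, 1.5, 1.0, 2.0]}
stable = consensus([version_a, version_b], [top_by_variance, top_by_range],
                   k=1, min_votes=4)
```

With these toy values, `g1` is the only feature chosen by every miner on every version, so it is the only one that survives the strictest vote threshold.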
3 Bio-data Mining of Human Diseases: A Case Study
3.1 An Initial Framework
An overview of the proposed framework is shown in Figure 2. Essentially, the framework can be divided into three levels. The first level consists of the interface agents. They collect mining tasks from end users and search the yellow page for an available planning agent and aggregation agent. The second level contains the planning agent and the aggregation agent, which are responsible for task planning and for aggregating mining results, respectively. The third level, which is the most important, consists of the mining agents. Each mining agent manages a database, which can be geographically distributed. For each mining agent, the mining knowledge serves as its “brain” while a pack of mining algorithms serves as its tools. The service each mining agent provides is registered in the yellow page. It is worth noting that although every mining agent is implemented in the same way, the mining knowledge and the mining algorithms of one mining agent do not have to be identical to those of the other mining agents.
[Figure 2: labels recoverable from the diagram are End User, Interface Agents, Yellow Page Server (Search, Register), Planning Agent, Aggregation Agent, Bioinformatics Ontology, ACL, Mining Agents with their Mining Algorithms, and agent-based databases holding MS data, sequence data, SNP data, microarray data, and SAGE data.]
Fig. 2. Overview of the initial multiagent based bio-data mining framework
The message flow is as follows. An interface agent searches the yellow page for a planning agent and records its address. When a mining task is collected from the end user, the interface agent sends the task to the available planning agent. The planning agent receives the task and searches for mining agents that are capable of providing mining results for this task. Once the candidate mining agent(s) have been identified, the task or its subtasks are deployed to the mining agent(s) and the aggregation agent is informed. When mining results become available from any mining agent, that agent looks up the address of the aggregation agent in the yellow page and sends the results to it. Once all the mining results have been collected, the aggregation agent combines them and sends them back to the interface agent for display. Throughout the process, a bioinformatics ontology base is used to match the task with multiple data sources.
3.2 Datasets
Table 1 summarizes the datasets used in the system demonstration. The “Ontology Keywords” column provides the keywords used for enquiry matching. The “Features”, “Samples”, “Class”, and “Format” columns are used by mining agents as the data characteristics. As can be seen, many diseases have been studied from multiple aspects using different analysis technologies, and the data are in various formats, which makes them suitable for testing the proposed multiagent based bio-data mining system.
Table 1. Dataset descriptions
Dataset∗         Features   Samples   Class   Format      Ontology Keywords
Leukemia1 [14]   7,129      72        2       Arff^b      Microarray; Leukemia
Leukemia2 [14]   3,571      72        2       Matrix^a    Microarray; Leukemia
MLL [15]         12,582     72        3       Matrix      Microarray; Leukemia; Subtypes
Breast1 [16]     305        15        2       Matrix      SAGE; Breast; Cancer
Breast2 [17]     24,481     97        2       C4.5^c      Microarray; Breast; Cancer
Prostate1 [18]   15,154     322       4       Arff        MS; Prostate; Cancer
Prostate2 [19]   12,600     136       2       C4.5        Microarray; Prostate; Cancer
a A matrix format with sample ids in the first column and feature ids in the first row.
b A data format standard of the Weka data mining package.
c A C4.5 data format in which feature ids and values are stored in two separate files.
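The role of the “Ontology Keywords” column in enquiry matching can be sketched as below. The keyword lists are copied from Table 1; the function name `match_enquiry` and the use of exact, case-insensitive keyword comparison are assumptions made for illustration, since the bioinformatics ontology base of the prototype may match enquiries in a richer way.

```python
from typing import Dict, List

# Ontology keywords per dataset, copied from Table 1.
ONTOLOGY_KEYWORDS: Dict[str, List[str]] = {
    "Leukemia1": ["Microarray", "Leukemia"],
    "Leukemia2": ["Microarray", "Leukemia"],
    "MLL": ["Microarray", "Leukemia", "Subtypes"],
    "Breast1": ["SAGE", "Breast", "Cancer"],
    "Breast2": ["Microarray", "Breast", "Cancer"],
    "Prostate1": ["MS", "Prostate", "Cancer"],
    "Prostate2": ["Microarray", "Prostate", "Cancer"],
}


def match_enquiry(enquiry: str) -> List[str]:
    """Return the datasets whose keyword list contains the enquiry.

    Assumption: matching is an exact, case-insensitive keyword comparison.
    """
    wanted = enquiry.strip().lower()
    return [
        name
        for name, keywords in ONTOLOGY_KEYWORDS.items()
        if wanted in (kw.lower() for kw in keywords)
    ]


leukemia_hits = match_enquiry("Leukemia")
cancer_hits = match_enquiry("Cancer")
```

Under this simple rule, the enquiry “Leukemia” selects three datasets and “Cancer” selects four, and an enquiry such as “SAGE” would select datasets by technology rather than by disease.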
3.3 Deployment and Implementation
We store the above datasets on three different computers connected by an intranet as follows:
Computer    IP address        System            Datasets
Computer1   192.168.208.110   Fedora 5          Leukemia1
Computer2   192.168.208.111   Fedora 5          Breast1, MLL, Prostate2
Computer3   192.168.208.112   XP Professional   Breast2, Prostate1, Leukemia2
The multiagent system is implemented using JADE [20], a FIPA-compliant multiagent programming framework. Communication follows the FIPA ACL message structure specification, and the databases are agentified by adding transducers (mining agents) on top of the systems for request translation and mining algorithm invocation [21]. Another server is used as the container of the planning agent and the aggregation agent, which generate work plans, delegate mining tasks, and aggregate mining results.
3.4 Results
Due to the page limit, we only present the experimental results for the mining enquiries “Leukemia” and “Cancer”. Table 2 provides the mining details of each enquiry. As can be seen, the input enquiry “Leukemia” matches three datasets from multiple databases (192.168.208.110; 192.168.208.111; 192.168.208.112). The first two, namely Leukemia1 and Leukemia2, are the same dataset pre-processed with different pre-filtering procedures [14] and stored in different data formats. The third one was generated by another leukemia study [15]. The mining results provide not only the selected genes and the sample classification accuracy for each dataset, but also the genes that overlap between different mining results. For the input enquiry “Cancer”, four datasets match and the system provides the mining results for each of them. Note that for the breast cancer datasets, the results include those generated from SAGE and from microarrays, and for the prostate cancer datasets, the results include those generated from MS and from microarrays. These results give a multi-level view of the enquired biological problems.
Table 2. Mining results with inputs “Leukemia” and “Cancer”
Input: “Leukemia”
Results:
  Dataset: Leukemia1
    Data Type (Level): Microarray (Transcriptomic)
    Selected BioMarkers (N=5): X95735, L09209, M84526, M27891, U50136_rna1
    Classification Accuracy: 94.22%
    Overlap (with dataset Leukemia2): X95735, L09209, M27891
  Dataset: Leukemia2
    Data Type (Level): Microarray (Transcriptomic)
    Selected BioMarkers (N=5): M27891, U46499, L09209, X95735, M12959
    Classification Accuracy: 96.05%
    Overlap (with dataset Leukemia1): M27891, L09209, X95735
  Dataset: MLL
    Data Type (Level): Microarray (Transcriptomic)
    Selected BioMarkers (N=5): 34168_at, 36122_at, 1096_g_at, 1389_at, 266_s_at
    Classification Accuracy: 92.14%
Comments:
  Leukemia1 results provided by agent: 192.168.208.110
  Leukemia2 results provided by agent: 192.168.208.112
  MLL results provided by agent: 192.168.208.111
Input: “Cancer”
Results:
  Dataset: Breast1
    Data Type (Level): SAGE (Transcriptomic)
    Selected BioMarkers (N=5): CCTTCGAGAT, TTTCAGAGAG, TATCCCAGAA, CTAAGACTTC, TTGGAGATCT
    Classification Accuracy: 98.88%
  Dataset: Breast2
    Data Type (Level): Microarray (Transcriptomic)
    Selected BioMarkers (N=5): NM_003258, AL137514, NM_003079, Contig15031_RC, AL080059
    Classification Accuracy: 73.39%
  Dataset: Prostate1
    Data Type (Level): MS (Proteomic)
    Selected BioMarkers (N=5): 0.054651894, 125.2173, 271.33373, 478.95419, 362.11416
    Classification Accuracy: 88.31%
  Dataset: Prostate2
    Data Type (Level): Microarray (Transcriptomic)
    Selected BioMarkers (N=5): HPN, TSPAN7, GUSB, ALDH1A3, HEPH
    Classification Accuracy: 92.55%
Comments:
  Breast1, Prostate2 results provided by agent: 192.168.208.111
  Breast2, Prostate1 results provided by agent: 192.168.208.112
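The “Overlap” entries in Table 2 can be reproduced with a few lines of code. The sketch below is not the aggregation agent's actual implementation; the function name `overlap` is hypothetical, and the identifier `U50136_rna1` is written with an underscore on the assumption that this is its original spelling. The marker lists themselves are taken from Table 2.

```python
from typing import Dict, List

# Selected biomarkers for the two versions of the leukemia dataset (Table 2).
SELECTED: Dict[str, List[str]] = {
    "Leukemia1": ["X95735", "L09209", "M84526", "M27891", "U50136_rna1"],
    "Leukemia2": ["M27891", "U46499", "L09209", "X95735", "M12959"],
}


def overlap(a: List[str], b: List[str]) -> List[str]:
    """Markers selected in both lists, kept in the order they appear in a."""
    in_b = set(b)
    return [marker for marker in a if marker in in_b]


overlap_1_with_2 = overlap(SELECTED["Leukemia1"], SELECTED["Leukemia2"])
overlap_2_with_1 = overlap(SELECTED["Leukemia2"], SELECTED["Leukemia1"])
```

Three of the five markers are shared between the two differently pre-filtered versions, which is the kind of agreement the framework reports to help validate mining results.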
As more datasets and systems from different biological studies and experiments are integrated, this framework should be able to provide a more holistic picture, allowing the analyst to view a given biological problem from multiple aspects.
4 Conclusion
In this paper, we argue for applying a multiagent based data mining framework to biological data analysis. The argument has been supported from multiple perspectives by briefly reviewing the advantages of applying such a framework in the context of biological data analysis. We believe a multiagent based bio-data mining framework will help to bridge the knowledge gap between the data mining community and the biology community, and enhance the reusability of biological databases as well as data mining algorithms.
References
1. Westerhoff, H., Palsson, B.: The evolution of molecular biology into systems biology. Nature Biotechnology 22(10), 1249–1252 (2004)
2. Wang, J., et al.: Data mining in Bioinformatics. Springer, Heidelberg (2005)
3. Frank, E.: Data mining in bioinformatics using Weka. Bioinformatics 20(15), 2479–2481 (2004)
4. Louie, B., et al.: Data integration and genomic medicine. Journal of Biomedical Informatics 40, 5–16 (2007)
5. Cao, L., Luo, C., Zhang, C.: Agent-mining interaction: An Emerging Area. In: Gorodetsky, V., Zhang, C., Skormin, V.A., Cao, L. (eds.) AIS-ADM 2007. LNCS, vol. 4476, pp. 60–73. Springer, Heidelberg (2007)
6. da Silva, J.C., et al.: Distributed data mining and agents. Engineering Applications of Artificial Intelligence 18, 791–807 (2005)
7. Ooi, C., Tan, P.: Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19, 37–44 (2003)
8. Ding, C., Dubchak, I.: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17(4), 349–358 (2001)
9. Yang, P., Zhang, Z.: A clustering based hybrid system for mass spectrometry data analysis. In: Chetty, M., Ngom, A., Ahmad, S. (eds.) PRIB 2008. LNCS (LNBI), vol. 5265, pp. 98–109. Springer, Heidelberg (2008)
10. Keedwell, E., Narayanan, A.: Discovering gene networks with a neural-genetic hybrid. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(3), 231–242 (2005)
11. Wang, Y., et al.: HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 21, 1530–1537 (2005)
12. Zhang, Z., Zhang, C.: Building agent-based hybrid intelligent systems: A case study. Web Intelligence and Agent Systems 5(3), 255–271 (2007)
13. Zhang, Z., et al.: An agent-based hybrid system for microarray data analysis. IEEE Intelligent Systems (accepted)
14. Golub, T., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
15. Armstrong, S., et al.: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30, 41–47 (2001)
16. Lash, A., et al.: SAGEmap: A public gene expression resource. Genome Research 10, 1051–1060 (2000)
17. van’t Veer, L., et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002)
18. Petricoin, E., et al.: Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute 94, 1576–1578 (2002)
19. Singh, D., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209 (2002)
20. Bellifemine, F., Poggi, A., Rimassa, G.: Developing multi agent systems with a FIPA-compliant agent framework. Software: Practice and Experience 31, 103–128 (2001)
21. Karasavvas, K., Baldock, R., Burger, A.: Bioinformatics integration and agent technology. Journal of Biomedical Informatics 37, 205–219 (2004)