The Research Bulletin of Jordan ACM, ISSN: 2078-7952, Volume II(III), Page 57
Adapting Decision Tree-Based Method to Index Large DNA-Protein Sequence Datasets

Khalid Mohammad Jaber, Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan
Rosni Abdullah, School of Computer Sciences, Universiti Sains Malaysia
Nur'Aini Abdul Rashid, School of Computer Sciences, Universiti Sains Malaysia
Currently, the size of biological databases has increased significantly with the growing number of users and rates of queries; some databases are of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. Where biologists are concerned, the need is for fast, scalable and accurate searching of biological databases. This may seem a simple task, given the speed of currently available processors. However, this is far from the truth, as the amount of data deposited in databases is ever increasing, so searching them becomes a difficult and time-consuming task. Here, the computer scientist can help by organizing data in a way that allows biologists to search the existing information quickly. In this paper, a decision tree indexing model for DNA and protein sequence datasets is proposed. This method of indexing can effectively and rapidly retrieve all the similar proteins from a large database for a given protein query. A theoretical and conceptual framework is first derived, based on published works using indexing techniques for different applications. The methodology is then validated by extensive experiments using 10 data sets of varying sizes for DNA and protein. The experimental results show that the proposed method reduced the search space by an average of 97.9% for DNA and 98% for protein, compared to the Top-Down Disk-based suffix tree methods currently in use. Furthermore, the proposed method was about 2.35 times faster for DNA and 29 times faster for protein than the BLAST+ algorithm, in terms of query processing time.

Additional Key Words and Phrases: Indexing Approaches, Searching Algorithms, Decision Tree, DNA, Protein, Bioinformatics, Data Mining
1. INTRODUCTION
One of the major challenges for computer science researchers is the increase in the amount of biological and genomic data, with growing numbers of users and rates of queries; some databases are of terabyte size. In real-life situations, for any given query against a large database, the ability to quickly retrieve all similar data is of utmost importance. With the ever-growing amount of data deposited in databases, searching becomes a difficult and time-consuming task. This paper is intended to help biologists search existing biological data easily, quickly and effectively. It focuses on a decision tree indexing algorithm that builds indexes for enormous DNA-protein datasets, to reduce the computation time of the searching algorithm as well as the space required to perform the computation and store the indexes.

The protein is the most prevalent bio-molecule. It consists of a long chain of amino acids that folds into a complex three-dimensional structure. Structurally, each amino acid is one of 20 molecules consisting of an amine group, a carboxyl group, and a variable molecular side chain that determines the identity and chemical activity of the amino acid. DeoxyriboNucleic Acid (DNA) consists of two strands wound into a double helix. Each strand of the DNA consists of a sugar (deoxyribose) and bases (adenine, thymine, cytosine or guanine). The four characters (A, T, C and G) in each strand build the code for human life. The strands are connected by hydrogen bonds between adenine and thymine (A, T) and between cytosine and guanine (C, G). Each of these pairings is called a base pair [Westhead et al. 2002].

This paper is organized in six sections. Section 1 explains the protein database problems and the contribution of this paper. Section 2 defines the workflow of adapting the decision tree for indexing, followed by the benchmark for the proposed method in Section 3.
Section 4 presents the data sets for DNA and protein, used for the evaluation of the
indexing model. Section 5 presents the experiments and analyzes the results. Finally, the conclusions and future work are presented in Section 6.

2. ADAPTING THE DECISION TREE FOR INDEXING: FRAMEWORK
Database index searching is one of the approaches adopted to achieve fast, accurate and efficient searching. Indexed genomic searching is a method that reduces the cost and time of searching [Williams and Zobel 2002]. In other words, the index method can efficiently solve memory and storage capacity problems by cutting down the storage requirement and access time [Williams 2003; Jiang et al. 2007]. In the absence of better methods, the index method is necessary to avoid the serious problems arising from the use of exhaustive search methods [Williams and Zobel 2002]. The methods used for genomic indexing are classified under three categories, which are shown in Figure 1.
Fig. 1. Classification of Methods for Genomic Searching Approaches
These approaches were discussed in our previous works [Jaber et al. 2010b; 2010a]. In general, the searching approaches have many drawbacks, which can be classified under three categories:

(1) Storage problem: the indexes produced by the searching approaches are larger than the original data.

(2) Memory problem: this is related to the storage problem; reading the large data located on disk and trying to fit it into main memory causes
problems, since the main memory's capacity is much smaller than that of the storage disk.

(3) Poor accuracy: this problem is due to difficulties in finding the correct query solution. Large databases increase the size of the search space, which in turn increases the chance of choosing a wrong solution.

In some approaches, for example, RAMdb requires a large index, about twice the size of the original flat-file database (storage problem) [Jiang et al. 2007]. In FASTA and BLAST, all sequences must be in main memory when the program executes, so a large space is needed (storage and memory problems) [Williams 2003]. Inverted files suffer from poor retrieval accuracy and as such are not very useful for small query proteins with few SSEs (poor accuracy) [Gao and Zaki 2008]. The suffix tree is a possible data structure for indexing, but it can only be built for small sequences due to a memory bottleneck (memory problem) [Farach et al. 1998; Tian et al. 2005]. Therefore, due to the limitations of these searching approaches, and because searching large genomic data sets is time consuming and computationally complex, the adapted decision tree-based method is proposed, to index data from large protein databases and address these problems. To the best of our knowledge, no research effort so far has used the decision tree as an indexing approach for biological data. However, the decision tree has been used as an indexing approach in other fields, such as information retrieval [Quellec et al. 2007], video indexing [Shearer et al. 2001] and graph databases [Irniger and Bunke 2004; 2008]. Moreover, the decision tree has the ability to deal with large data sets, as several previous efforts have applied it to large data, such as SLIQ [Mehta et al. 1996], SPRINT [Shafer et al. 1996] and CLOUDS [Alsabti et al. 1998].
Furthermore, a decision tree can be constructed without any domain knowledge or parameter setting, and it has good accuracy, speed, robustness, scalability and interpretability [Han and Kamber 2006]. To clarify the problem area, Figure 2 shows the main steps of the proposed framework. In subsequent subsections, the phases are described in detail. They involve:

(1) Downloading DNA-protein datasets from public databases.
(2) The pre-processing phase.
(3) Building the index using decision tree algorithms.
(4) Storing the indices.
(5) Searching DNA-protein queries in the decision tree indices.
2.1. Download DNA-Protein Datasets from Public Databases
The first step in this research is to download DNA and protein datasets from genomic databases such as GenBank and Swiss-Prot (an example of a DNA sequence is given in Figure 3). The sequences are downloaded, stored locally in a text file, and used in the next (pre-processing) phase. They are also used for the evaluation of the indexing model.

2.2. Pre-Processing Phase
The aim of this phase is to prepare the representation of the data sets to be stored in the databases. This phase is carried out in three steps. The first step is to retrieve data from the text file by opening the file, then reading the data it contains line by line. The second step is to extract the description of the sequence, such as the unique library accession number (ID), family name and so on. Figure 4 shows an example of the description of a sequence. The third step is to process and build efficient data representations suitable for SQL databases, to reduce the processing time and space required. The data sets are represented as a collection of tuples (S) that may contain duplicates. Tuples are also known
Fig. 2. Workflow for Decision Tree Indexing
>gi|157697810|ref|NW_001838472.1| Homo sapiens chromosome 18 genomic contig, alternate assembly, whole genome shotgun sequence GTGGTGATGGTGATGGTGATGATGGTGGTGGTGATGGTGATGGTGATGATGGTGGTGATGGTGGTAGTGAT GGTGGTAATAGTGATTCTGAACTGCCTCTTAGGGGCTTAGGGGAAAGCAATGTGCAGAGTTGAAGTCACCTC AGCAGGGTGCCTGGAGATGAGACCAGTAGAAATACAGAAGTGATAAACAGAGCCTCAGGGGAAGGTTCTCC ACGTAAGTAGAATGACCAATCTCCAAAAGAATTCATTGGCATTGAAACAGATTCTAACTGTTCAGAATTTTCTA AACGTTTCTCAAAAATGAAAATTCTCCCACTTGGGAAGGCAGCAGAGAAGAGGAGAAACAGC
Fig. 3. An Example of DNA Sequence
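Records like the one in Figure 3 are read line by line in the pre-processing phase, separating the description header from the sequence body. The sketch below is a hedged illustration of those first two steps, not the authors' implementation; the type and function names are assumptions.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// A sequence record: the description line (without '>') and the raw sequence.
struct FastaRecord {
    std::string description;
    std::string sequence;
};

// Read FASTA-formatted text line by line; lines starting with '>' begin a
// new record, all other lines are appended to the current record's sequence.
std::vector<FastaRecord> parseFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
    }
    return records;
}
```

The extracted description then supplies the ID and family name fields used for the target attribute, while the concatenated sequence feeds the data representation step.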
>gi|157697810|ref|NW_001838472.1| Homo sapiens chromosome 18 genomic contig, alternate assembly (based on HuRef), whole genome shotgun sequence
Fig. 4. An Example for Description of the Sequence
as records, rows, samples or instances. Each tuple is represented as a vector of attribute values (A). An attribute is also known as a field, variable or feature. The data set provides the description of the attributes and domains. In this paper, data sets are denoted as D(A ∪ Y), where A denotes the set of input attributes containing n attributes, A = {a1, ..., ai, ..., an}, as shown in Equation 1, and Y represents the target attributes.
             ⎡ a1¹  a2¹  ...  an¹  y¹ ⎤
D(A ∪ Y) =   ⎢ a1²  a2²  ...  an²  y² ⎥        (1)
             ⎢  ...  ...  ...  ...  ...⎥
             ⎣ a1ᵐ  a2ᵐ  ...  anᵐ  yᵐ ⎦
The possible values of attribute ai are denoted as pos(ai) = {vi,1, vi,2, ..., vi,|pos(ai)|}. The set of all possible values is defined as the Cartesian product of the possible values of all input attributes: X = pos(a1) × pos(a2) × ... × pos(an). In a similar manner, pos(y) = {T1, ..., T|pos(y)|} represents the possible values of the target attribute. The training set consists of m tuples. Formally, the training set is denoted as Tset(D) = (⟨x1, y1⟩, ..., ⟨xm, ym⟩), where xq ∈ X and yq ∈ pos(y). The difficulty of this step lies in the method of selecting the target attributes; here, the target attribute was based on the protein family name for protein data and on the chromosome type for DNA data. After building the data representations in the pre-processing phase, the data were stored in databases, which allow multiple ways of accessing the stored data and make it searchable. In a file system, by contrast, there is a fixed path for accessing data, and it is difficult to search.

2.3. Building the Indices
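This phase repeatedly splits on the attribute with the highest information gain, an entropy-based measure in bits. As a hedged sketch of that criterion (not the authors' implementation; the function names are assumptions), it can be computed over one attribute column and the target labels as follows:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Shannon entropy (in bits) of a list of target labels.
double entropyOf(const std::vector<std::string>& labels) {
    std::map<std::string, int> counts;
    for (const std::string& y : labels) counts[y]++;
    double h = 0.0;
    double n = labels.size();
    for (const auto& kv : counts) {
        double p = kv.second / n;  // relative frequency of this label
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting on one attribute column: the parent entropy
// minus the weighted entropy of each branch (one branch per attribute value).
double infoGain(const std::vector<std::string>& column,
                const std::vector<std::string>& labels) {
    std::map<std::string, std::vector<std::string>> branches;
    for (size_t i = 0; i < column.size(); ++i)
        branches[column[i]].push_back(labels[i]);
    double weighted = 0.0;
    double n = labels.size();
    for (const auto& kv : branches)
        weighted += (kv.second.size() / n) * entropyOf(kv.second);
    return entropyOf(labels) - weighted;
}
```

A column that separates the target values perfectly yields the full parent entropy as its gain, which is why such an attribute is placed at the root in the steps below.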
The first step in this phase is to initialize some parameters, such as the number of target attributes, the data type (DNA or protein), the name of the database, the names of the tables, and so on. The next step is to adapt decision tree building for large data sets. Figure 5 shows the flowchart used to build the decision tree. The building algorithm has several steps:

(1) Initialize the splitting criterion. Information gain, computed from the entropy and measured in units called bits, is used to choose the attribute to split on; a statistical method is used to calculate the information gain.
(2) Select the attribute with the highest information gain, place it at the root node and store it in the database (loading and storing data issues are discussed in Section 2.4).
(3) Create one branch for each possible value pos(ai) = {vi,1, vi,2, ..., vi,|pos(ai)|}.
(4) Select one possible value from pos(ai). Then calculate the information gain and select the attribute with the highest value to place at the branch node and store in the database. The process is then repeated recursively for each branch, using only the instances that actually reach that branch.
(5) If all the instances in the training set belong to a single target value, the node is not split further and the recursive process down that branch terminates.
(6) If the maximum possible depth of the tree is reached, stop developing the tree; otherwise repeat steps 4, 5 and 6.

2.4. Hybrid Decision Tree Indexing Model with Database Management System Using Structured Query Language
Previous research that tried to fit large data sets into memory before decision tree induction has noted limitations: Catlett's method requires that all data be loaded into memory before induction [Sug 2005; Rokach and Maimon 2005], and SLIQ [Mehta et al. 1996] uses a data structure that scales with the dataset size and must be resident in main memory at all times [Shafer et al. 1996]. Furthermore, in previous indexing work, the data sets reside on disk and are moved into memory while the index is built, as in [Joshi et al. 1998; Zaki et al. 1999; Srivastava et al. 1999; Sreenivas et al. 1999; Jin and Agrawal 2003; Jin et al. 2005; Ben-Haim and Tom-Tov 2010]. As for merged trees, [Chan and Stolfo 1997] proposed an approach that divides the dataset into many portions;
Fig. 5. Flowchart for Building the Decision Tree
then each portion is loaded into memory alone and used to induce a decision tree, after which the decision trees are merged to create a single tree; but the performance of the merged trees and the size of the stored index remain problems. Therefore, the hybrid Decision Tree Indexing Model (DTIM) with a Database Management System (DBMS) using SQL is proposed in the current study, to address the memory bottleneck and storage problems. Substantial research efforts have utilized SQL-based data mining techniques to improve scalability [Sattler and Dunemann 2001]. Further examples of investigated SQL techniques are presented in [Sarawagi et al. 1998; 2000] for association rules, in [Ordonez and Cereghini 2000] for clustering, and in [Slezak and Sakai 2009] for decision rules. This section discusses the hybrid method from several perspectives: connecting the proposed method with the database, loading the data for induction, and the data representation for storing the new indexes produced by the decision tree indexing model. The next section describes searching queries with the decision tree index in greater detail.

2.4.1. Connect DTIM with Databases. Connecting the decision tree indexing model with a DBMS has several benefits. For example, it allows multiple ways of easy access, organization, management and retrieval of the data stored in the database, and makes the data searchable by asking simple questions in a query language. File systems, in contrast, have fixed paths for accessing data, which are slow, difficult and expensive to search, and require the whole data set to be resident in main memory during computation.
/* MySQL Connector/C++ specific headers; the concrete header names were
   lost in extraction and are restored here as the usual Connector/C++ set */
#include <memory>
#include <mysql_driver.h>
#include <mysql_connection.h>
#include <cppconn/driver.h>
#include <cppconn/connection.h>
#include <cppconn/statement.h>
#include <cppconn/prepared_statement.h>
#include <cppconn/resultset.h>
#include <cppconn/resultset_metadata.h>
#include <cppconn/exception.h>
/* ========================== */
#define DBHOST "tcp://127.0.0.1:3306"
#define USER "kjaber"
#define PASSWORD "password123"
#define DATABASE "kjaber"
#define TABLE "`100k`"
using namespace sql;
/* ========================== */
Driver *driver = get_driver_instance();
/* create a database connection using the Driver */
std::auto_ptr<Connection> con(driver->connect(DBHOST, USER, PASSWORD));
/* select the appropriate database schema */
con->setSchema(DATABASE);
/* create a statement object */
std::auto_ptr<sql::Statement> prep_stmt(con->createStatement());
std::auto_ptr<sql::ResultSet> res;
/* run a query which returns exactly one result set
   (lbnv holds the query text, as in the original figure) */
res.reset(prep_stmt->executeQuery(lbnv->get()));
/* retrieve the result set metadata */
ResultSetMetaData *res_meta = res->getMetaData();
int numcols = res_meta->getColumnCount();

Fig. 6. Create Connection
To create the connection between the proposed method and the DBMS, the latest connectors for MySQL, developed by Sun Microsystems, were used. The MySQL Connector/C++ provides an object-oriented API and a database driver for connecting C++ applications to the MySQL server. The main components for establishing the connection are the Driver, Connection, Statement, PreparedStatement, ResultSet and ResultSetMetaData. Figure 6 presents the code for establishing a connection to the DBMS by retrieving an sql::Connection object from the sql::Driver object. An sql::Driver object is returned by the sql::Driver::get_driver_instance method, and the sql::Driver::connect method returns the sql::Connection object. The first parameter of the connection call is the TCP/IP address used to connect the system with the DBMS server, specified as "tcp://[hostname[:port]][/schemaname]" in the connection URL (the DBHOST definition and the connect call in Figure 6). Specifying the schema name in the connection URL is optional; if it is not set, the database schema must be selected with the Connection::setSchema method (the setSchema call in Figure 6).
ID | Parent (P) | Node (N) | Element (E) | Decision (y)

Fig. 7. Indexes Data Representation
To prevent memory leaks and to reduce memory usage, dynamic memory was used with the class template auto_ptr, as shown in Figure 6. An auto_ptr object has the same semantics as a pointer; however, when it goes out of scope, it automatically releases the dynamic memory it manages. In other words, when the auto_ptr template class is in use, there is no need to free memory explicitly with the delete operator (e.g., delete con;). A further note on the hybrid DTIM with DBMS: there is no need to load the whole of a huge data set into main memory for decision tree induction. A connection is created, simple questions are asked using SQL, the results are retrieved, and the connection is closed automatically.

2.4.2. Indexes Data Representation. An efficient data representation for storing the indexes produced by the decision tree indexing model is proposed, to minimize the index size. Usually, the results produced by a decision tree are stored as rules, which has many drawbacks, such as the following [Zhou et al. 2003b; 2003a]:
(1) Generated rules are often more complex than necessary and contain redundant information.
(2) Storage problems for large data and inefficient speed.
(3) Memory bottleneck problems, since all the rules must fit in memory to perform the search.

On the other hand, the proposed data representation has several advantages:

(1) It is an efficient way to insert, delete and retrieve the search tree elements.
(2) It minimizes the number of database queries, needing only one query per activity.
(3) It minimizes the storage size.
(4) It is generated efficiently and easily.
(5) It solves the memory problem, with no need to fit all indexes into memory.
The indexes can be represented by five attributes: ID, Parent, Node, Element and Decision, as shown in Figure 7. The ID is the unique identification number of each node. Parent (P) is the ID of each node's father. Node (N) is the position of an element in the sequence. Element (E) represents a value from the set of attributes (A), where A contains n attributes, A = {a1, ..., ai, ..., an}, and the possible values of an attribute ai are denoted by pos(ai) = {vi,1, vi,2, ..., vi,|pos(ai)|}. Finally, Decision represents the target attribute (y), with possible values pos(y) = {T1, ..., T|pos(y)|}, or the next step in the tree. To clarify the proposed index representation, Figure 8 shows an example. It consists of two parts: the first part, on the left, represents the tree from the root node a5 down one branch element A to the last level of the indexes; the second part, on the right, is the indexes data representation. In the first row, ID = 0 and parent P = -1, meaning that it is the root node; the decision is a1, meaning that the next step is the node with Node = a1 and Parent = 0.
ID | Parent | Node | Element | Decision
 0 |   -1   |  a5  |    A    |    a1
 1 |    0   |  a1  |    T    |    a2
 2 |    1   |  a2  |    A    |    3
 3 |    1   |  a2  |    T    |    1
 4 |    1   |  a2  |    G    |    3
 5 |    0   |  a1  |    G    |    1

Fig. 8. Example of Indexes Data Representation
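The five-column rows of Figures 7 and 8 can be sketched in code as follows. This is a hedged illustration with assumed names, not the authors' implementation: in the real system each lookup is one SQL SELECT against the index table, which an in-memory map stands in for here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// One row of the index table from Figure 7: ID, Parent, Node, Element, Decision.
struct IndexRow {
    int id;
    int parent;           // -1 marks the root
    std::string node;     // attribute position, e.g. "a5"
    std::string element;  // attribute value, e.g. "A"
    std::string decision; // target value, or the next node to visit
};

// In the DBMS this lookup would be one SELECT on (Parent, Node, Element);
// a map keyed the same way stands in for that query here.
using IndexTable =
    std::map<std::pair<int, std::string>, std::map<std::string, IndexRow>>;

void insertRow(IndexTable& t, const IndexRow& r) {
    t[{r.parent, r.node}][r.element] = r;
}

// Follow one step: given the current parent/node and a query symbol, return
// the Decision column (empty string when no matching branch exists).
std::string step(const IndexTable& t, int parent,
                 const std::string& node, const std::string& element) {
    auto it = t.find({parent, node});
    if (it == t.end()) return "";
    auto jt = it->second.find(element);
    return jt == it->second.end() ? "" : jt->second.decision;
}
```

Using the first two rows of Figure 8, stepping from the root a5 on element A yields a1, and from a1 on T yields a2, mirroring the left-hand tree.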
2.5. Search DNA-Protein Queries in the Decision Tree Indices
The fifth phase of this experiment involves the searching process, using the indexes which were stored in the previous phase. The searching process can be summarized in the following steps:

Step 1: Enter the query (Q) or queries.
Step 2: Tokenize the query (query chopping).
Step 3: Search for the query in the decision tree indices, using an SQL command.
Step 4: Save or display the result.
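Steps 2 and 3 above can be sketched as follows. This is a hedged illustration: the segment length and the table and column names are assumptions, and a real implementation would use a PreparedStatement rather than string concatenation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Step 2: divide a query sequence into fixed-length segments (query
// chopping); only the segments actually needed are then searched.
std::vector<std::string> chopQuery(const std::string& query, size_t segLen) {
    std::vector<std::string> segments;
    for (size_t i = 0; i < query.size(); i += segLen)
        segments.push_back(query.substr(i, segLen));
    return segments;
}

// Step 3: build the per-node lookup against the five-column index table as a
// single SQL statement (column and table names are assumptions).
std::string buildLookup(const std::string& table, int parent,
                        const std::string& node, const std::string& element) {
    return "SELECT Decision FROM `" + table + "` WHERE Parent=" +
           std::to_string(parent) + " AND Node='" + node +
           "' AND Element='" + element + "'";
}
```

Each Decision returned by the lookup either names the next node to visit or a target value, so the search walks the stored tree one SELECT at a time.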
From the review of previous methods, it was noticed that some popular indexing methods do not support searching for multiple queries (called query multiplexing or packing), such as CAFÉ [Williams and Zobel 2002] and Top-Down Disk-based (TDD) suffix trees [Tian et al. 2005; Tata et al. 2004]. To address this problem, the proposed hybrid method is equipped with query packing, to handle multiple queries and reduce the overhead of reading the queries repeatedly. Additionally, to enhance the searching speed, a query chopping technique was merged into the proposed method. Query chopping originally means dividing an individual query sequence into several segments, searching them independently and then merging the results back together. As adapted for the proposed method, query chopping divides an individual query sequence into several segments, uses only the particular segments needed in the search, and then shows the result. Figure 9 presents a query sequence divided into five segments, of which only a5, a1 and a2 are needed for the query process, whereas segments a3 and a4 are not necessary. So, supposing a query of 80 nucleotides uses only 50 nucleotides in the search process, searching with those particular nucleotides is faster than searching with all 80.

Before moving on to the next section, the adapted decision tree (DTIM) is compared with a normal decision tree. The DTIM has several improvements (advantages). One is the pre-processing phase, carried out in three steps, of which the important one is to retrieve and clean the data, while this
Fig. 9. Query Chopping Technique
step does not exist in normal decision tree algorithms. Moreover, a normal decision tree stores its data representation in a text file, which causes memory problems and difficulties in processing and building the tree, especially for large data, while the DTIM data representation is stored in a DBMS, which is faster and better able to handle large data. Furthermore, a normal decision tree cannot deal with DNA-protein data directly, because DNA-protein data is plain text and has no target value, while the DTIM can deal with DNA-protein data thanks to the pre-processing phase. The DTIM is connected to a DBMS instead of the file system used by a normal decision tree. The results produced by a normal decision tree are rules, whereas the results of the adapted decision tree are records in a table; the advantages and disadvantages are discussed in Section 2.4.2. Finally, a normal decision tree has no concept of query multiplexing and packing, while the DTIM does.

3. BENCHMARK
To evaluate the DTIM method, different tests were performed. First, the DTIM was compared with the Top-Down Disk-based suffix tree [Tata et al. 2004; Tian et al. 2005] to test the index size. Secondly, the DTIM was compared with BLAST+ version 2.2.22, downloaded from [NCB 2010], to test the query processing time. Finally, the performance of the DTIM was tested by accuracy, recall, precision and the elapsed time for building the indexes.

4. TEST DATA SETS
In the experiments, ten DNA data sets were used, extracted from human chromosomes 18, 19 and 21, with sizes varying from 10 thousand to 5 million bases, downloaded from GenBank. Furthermore, ten protein data sets with sizes varying from 10 thousand to 5 million characters were extracted from the Swiss-Prot database. As query sequences, subsequences were randomly extracted from the data sets.

5. EXPERIMENTS AND RESULTS
This section describes the experiments performed on the decision tree indexing model (DTIM) and their results. The first experiment compared the index size of the DTIM and TDD (Top-Down Disk-based suffix tree, downloaded from http://www.eecs.umich.edu/tdd). The experiment was performed with two data types, DNA and protein: DNA sequences are drawn from an alphabet of 4 characters (nucleotides), whereas protein data consists of 20 amino acids and is therefore more complicated. The sizes of the ten data sets range from 10 thousand to 5 million bases. The second experiment compared the DTIM with BLAST+ in terms of query processing time. The number of queries was equal to the number of sequences deposited in the databases, to show the performance effects of the proposed method.
The third experiment examined the time taken to build an index on one processor. The final experiment quantified the relative performance, or retrieval effectiveness, of the proposed method, measured by accuracy, precision and recall.

5.1. Indexing Size Experiments
Figure 10 shows the DTIM index sizes for the nucleotide data sets, and Figure 11 for the amino acid data sets, over varying data sizes. Furthermore, the current approach was compared with the Top-Down Disk-based suffix tree approach in respect of index size, for both DNA and protein. The Top-Down Disk-based (TDD) method is a previous approach based on the suffix tree [Tian et al. 2005; Tata et al. 2004]. From the experimental results, it was observed that the index size increases linearly with the data size in both approaches. However, in comparison with TDD, the current approach saves an average of about 97% of storage space for DNA and 98.2% for protein data. Comparing the DNA-protein indexes with the original data sets, the DTIM reduced the size of the data significantly. For example, the DTIM reduced the size of data set number 10, which contains 5 million nucleotides, by about 99%: the original data is about 5 MB, while the index is about 550 KB. This supports the claim that the proposed DTIM is appropriate for large data sets.
Data set size | 100K | 200K  | 300K  | 400K | 500K | 1M    | 2M    | 3M    | 4M    | 5M
TDD (KB)      | 1210 | 2411.8| 3606.4| 4839 | 6052 | 12501 | 25004 | 37580 | 50115 | 62761
DTIM (KB)     | 39.6 | 59.7  | 82.4  | 145.9| 181.6| 222   | 387.9 | 443   | 488.5 | 533.8

Fig. 10. Index Sizes for DNA Data Sets (storage size in KB)
Data set size | 100K | 200K | 300K | 400K | 500K | 1M    | 2M    | 3M    | 4M    | 5M
TDD (KB)      | 1246 | 2501 | 3656 | 4891 | 5935 | 11742 | 24169 | 36283 | 48573 | 60495
DTIM (KB)     | 48.4 | 55.1 | 60.4 | 65.1 | 73.3 | 222.8 | 403.3 | 548.4 | 661   | 746.2

Fig. 11. Index Sizes for Protein Data Sets (storage size in KB)
Data set size | 100K  | 200K  | 300K  | 400K  | 500K   | 1M     | 2M      | 3M      | 4M      | 5M
BLAST+ (sec)  | 0.93  | 1.978 | 3.266 | 4.919 | 6.878  | 31.287 | 137.742 | 320.855 | 635.264 | 733.334
DTIM (sec)    | 1.095 | 2.585 | 4.428 | 8.765 | 12.424 | 28.569 | 88.272  | 147.105 | 211.139 | 292.64

Fig. 12. DNA Query Processing Time (seconds)
5.2. Query Processing Time
In this experiment, the DTIM was compared with BLAST+ version 2.2.22 in respect of the elapsed time for query search. The elapsed time is the time spent finding all the queries submitted to the DTIM or BLAST+. The effect of varying data set sizes, with a changing number of queries, was also examined for both methods. Figures 12 and 13 show the DNA and protein query processing times of the DTIM and BLAST+ for exact-match queries, with the elapsed time given in seconds for each data set size. From the experimental results, it was observed that the query processing time increases linearly with the data size in both approaches, as shown in Figures 12 and 13. However, in Figure 12 it can be observed that BLAST+ is about 1.48 times faster than the DTIM when the DNA data set size is 500K or less; such data sets are somewhat small, so BLAST+ completes the search quickly. When the data set size exceeds 500K, the DTIM achieves about 2.35 times speedup for DNA and 29 times for protein compared with BLAST+ in terms of query processing time, as shown in Figures 12 and 13. This is for two reasons. The first is the BLAST+ methodology: it examines every possible alternative to find a solution; in other words, BLAST+ needs to compare a query with each of the sequences in the database using sequence alignment techniques. The second is the combination of the query packing and query chopping techniques with the DTIM, which speeds up query retrieval.
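The query-packing idea, reading the submitted queries once and processing them as batches instead of one at a time, can be sketched as follows (a hedged illustration; the batch size and names are assumptions, not the authors' code):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Group the submitted queries into batches; each batch is read once and
// searched together, avoiding re-reading the queries for every pass.
std::vector<std::vector<std::string>> packQueries(
        const std::vector<std::string>& queries, size_t batchSize) {
    std::vector<std::vector<std::string>> batches;
    for (size_t i = 0; i < queries.size(); i += batchSize) {
        size_t end = std::min(i + batchSize, queries.size());
        batches.push_back(
            std::vector<std::string>(queries.begin() + i, queries.begin() + end));
    }
    return batches;
}
```

With a batch size matched to the data set, the whole workload is submitted in a handful of passes rather than one pass per query.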
For example, DTIM packed 1412 queries together when searching the 100K data set and 72753 queries together when searching the 5M data set.

5.3. Building Index Time
In this experiment, the time taken to build the index on one processor was measured for varying data set sizes in the two groups of data, DNA and protein, as shown in Table I. The aim of this experiment was to examine the cost of building the index for the current DTIM approach. The results show that the time required to build the index increased exponentially with the data size. Building the index for 100K of DNA data takes about 72 seconds, while for 100K of protein data it takes about 105 seconds.
Fig. 13. Protein Query Processing Time (elapsed time in seconds against data set size):

Size     100K    200K     300K     400K     500K    1M       2M       3M       4M       5M
BLAST+   65.318  123.077  183.973  224.997  245.02  430.943  940.003  1487.88  2147.42  2910.98
DTIM     0.4747  1.138    1.6155   2.2659   2.867   12.342   32.227   57.899   82.921   107.823
In comparison, building the index for 500K of DNA data takes about 2531 seconds, while for 500K of protein data it takes about 1118 seconds, as shown in Table I. This indicates that large data sets take a long time to index, so the sequential DTIM approach is not practical for them. To improve the performance of DTIM and minimize the time required to build the index, a parallel version of DTIM will be proposed as future work.

Table I. Building Index Time using One Processor

Data     Size   Time (Sec)
DNA      100k   72.2857
DNA      200k   205.76
DNA      300k   541.135
DNA      400k   1704.87
DNA      500k   2531.77
Protein  100k   105.103
Protein  200k   314.421
Protein  300k   542.843
Protein  400k   852.372
Protein  500k   1118.05
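As a rough illustration of what building an index of this kind involves, the sketch below grows a position-test tree (a trie, which is the degenerate form of a decision tree where the node at depth d tests the character at position d) over all length-k windows of the sequences, and times the construction. This is a toy stand-in for DTIM, not the authors' data structure; the window length `k = 4` and the tiny database are assumptions:

```python
import time

def build_index(sequences, k=4):
    """Index every length-k window; the node at depth d branches on the
    character at window position d (a position-test decision tree)."""
    root = {}
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            node = root
            for ch in seq[i:i + k]:
                node = node.setdefault(ch, {})
            node.setdefault("ids", set()).add(sid)  # leaf: sequence ids
    return root

def search(root, query):
    """Follow one root-to-leaf path; return ids of sequences containing query."""
    node = root
    for ch in query:
        if ch not in node:
            return set()
        node = node[ch]
    return node.get("ids", set())

db = ["ACGTAC", "TTACGA", "GGGGGG"]
start = time.perf_counter()
index = build_index(db, k=4)
build_time = time.perf_counter() - start  # grows with total window count
hits = search(index, "ACGT")
```

Because every window of every sequence is inserted, the construction cost grows with the total data size, which is consistent with the build times reported in Table I, while a query only follows a single root-to-leaf path.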
5.4. Evaluating DTIM
In this experiment, the retrieval effectiveness of DTIM was examined using the measures of accuracy, precision and recall. The holdout method was first applied, as shown in Figure 14. In this method, the data set is randomly partitioned into two independent sets, a training set and a test set [Han and Kamber 2006]. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to produce the index data, whose accuracy, precision and recall are then estimated on the test set using a confusion matrix, a useful tool for analyzing DTIM. The experimental results show that the accuracy of DTIM for DNA is slightly low, but according to [Doolittle 1986] it is acceptable, because when two amino acid or nucleotide sequences are more than 30% identical, the sequences are most likely homologous. The low accuracy has several causes; the most important concerns the sampling method (holdout), which selects the training set randomly and so leads to a pessimistic estimate, because only a portion of the initial data is used to derive the model.
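The holdout split and the confusion-matrix metrics described above can be summarized in a short sketch. The 2/3–1/3 split ratio follows the text; the toy "hit"/"miss" labels and the fixed random seed are invented for illustration:

```python
import random

def holdout_split(data, train_fraction=2/3, seed=0):
    """Randomly partition data into a training set (2/3) and a test set (1/3)."""
    items = data[:]
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

def confusion_metrics(actual, predicted, positive):
    """Accuracy, precision and recall from a binary confusion matrix."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

train, test = holdout_split(list(range(9)))          # 6 training, 3 test items
actual    = ["hit", "hit", "miss", "hit", "miss"]    # toy ground truth
predicted = ["hit", "miss", "miss", "hit", "hit"]    # toy index answers
acc, prec, rec = confusion_metrics(actual, predicted, positive="hit")
```

Precision measures how many retrieved results are correct (false positives lower it), while recall measures how many correct results are retrieved (false negatives lower it), which is the distinction drawn in the figures that follow.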
Fig. 14. Estimating Accuracy, Precision and Recall with the Holdout Method
Probably, this portion was not sufficient to train the model to a good accuracy. On the other hand, DTIM achieves better accuracy on the protein data: the average overall accuracy across the ten data sets is about 90.48% for protein, against about 55.89% for DNA. This supports the claim that the proposed DTIM is appropriate for large data sets in terms of accuracy. Figures 15 and 16 compare the average precision of the training set with the average precision of the test set for DNA and protein respectively, while Figures 17 and 18 make the same comparison for average recall. The precision values for the training set were mostly 100% for both DNA and protein, which implies that DTIM produced no false positives on the training data. The DNA test sets produce the most false positives, while the protein test sets produce very few. However, the DNA recall in Figure 17 is better than the DNA test precision in Figure 15, implying that DTIM produces fewer false negatives for DNA. The recall values for the protein test sets are above 79%, indicating that few false negatives are produced. Overall, DTIM performs better for protein than for DNA, which also supports that DTIM is appropriate for complicated data sets such as protein (20 amino acids) in terms of precision and recall.
Fig. 15. Average Precision for DNA (training precision is 100% for all ten data sets; testing precision ranges from roughly 33-35% for the 100K-500K sets down to roughly 25-27% for the 1M-5M sets)
6. CONCLUSION
Database indexing for large biological data has become an important task in bioinformatics. The main objective of indexing is to reduce the computation time of the searching algorithm, as well as the space required to perform the computation and store the indexes. To achieve
Fig. 16. Average Precision for Protein (training precision is 100% for all ten data sets; testing precision lies between about 77.85% and 97.01%)
Fig. 17. Average Recall for DNA (training recall is 100% for nine data sets and 99.98% for one; testing recall ranges from roughly 32-34% for the 100K-500K sets down to roughly 22.85-23.50% for the 1M-5M sets)
Fig. 18. Average Recall for Protein (training recall lies between about 96.85% and 100%; testing recall lies between about 79.77% and 89.96%)
this, the decision tree for indexing large DNA-protein data was adapted from many perspectives, such as data representation, connecting DTIM with a DBMS to store the indexes produced by the model, loading data from the DBMS to perform the induction for DTIM, searching in DTIM, workflow, benchmarks, data sets, system requirements and so on. The methodology was then validated by extensive experiments using 10 data sets of varying sizes for DNA and protein. From the experiments, it was observed that the proposed DTIM reduces the indexing space by an average of 97.9% for DNA and 98% for protein compared with TDD, and processes queries about 2.35 times faster for DNA and 29 times faster for protein compared with the BLAST+ algorithm. Furthermore, the experiments examined the cost of building the index for DTIM and showed that the required time increases exponentially with the data size; to minimize this building time, a parallel version of DTIM will be proposed in future work. Finally, it can be concluded that DTIM achieved better accuracy, precision and recall for the protein data sets than for the DNA data sets across the ten data sets tested. The accuracy for DNA data was low, however, and needs to be improved in future work.
REFERENCES
2010. NCBI website, http://blast.ncbi.nlm.nih.gov.
Alsabti, K., Ranka, S., and Singh, V. 1998. CLOUDS: A decision tree classifier for large datasets. In Conference on Knowledge Discovery and Data Mining (KDD).
Ben-Haim, Y. and Tom-Tov, E. 2010. A streaming parallel decision tree algorithm. Journal of Machine Learning Research 11, 849-872.
Chan, P. K. and Stolfo, S. J. 1997. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems 8, 1, 5-28.
Doolittle, R. F. 1986. Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books.
Farach, M., Ferragina, P., and Muthukrishnan, S. 1998. Overcoming the memory bottleneck in suffix tree construction. In FOCS '98: Proceedings of the 39th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC, USA, 174.
Gao, F. and Zaki, M. J. 2008. PSIST: A scalable approach to indexing protein structures using suffix trees. Journal of Parallel and Distributed Computing 68, 1, 54-63.
Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.
Irniger, C. and Bunke, H. 2004. Graph database filtering using decision trees. In ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition, Volume 3. IEEE Computer Society, Washington, DC, USA, 383-388.
Irniger, C. and Bunke, H. 2008. Decision trees for filtering large databases of graphs. Int. J. Intell. Syst. Technol. Appl. 3, 3/4, 166-187.
Jaber, K., Abdullah, R., and Abdul Rashid, N. 2010a. A framework for decision tree-based method to index data from large protein sequence databases. In 2010 IEEE EMBS Conference on Biomedical Engineering & Sciences (IECBES 2010). Kuala Lumpur, Malaysia.
Jaber, K., Abdullah, R., and Abdul Rashid, N. 2010b. Indexing protein sequence/structure databases using decision tree: A preliminary study. In 4th International Symposium on Information Technology 2010 (ITSim 2010). Vol. 2. Kuala Lumpur Convention Center, Kuala Lumpur, Malaysia, 844-849.
Jiang, X., Zhang, P., Liu, X., and Yau, S. S.-T. 2007. Survey on index based homology search algorithms. J. Supercomput. 40, 2, 185-212.
Jin, R. and Agrawal, G. 2003. Communication and memory efficient parallel decision tree construction. In Proceedings of the Third SIAM Conference on Data Mining.
Jin, R., Yang, G., and Agrawal, G. 2005. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. IEEE Trans. on Knowl. and Data Eng. 17, 1, 71-89.
Joshi, M. V., Karypis, G., and Kumar, V. 1998. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In IPPS '98: Proceedings of the 12th International Parallel Processing Symposium. IEEE Computer Society, Washington, DC, USA, 573.
Mehta, M., Agrawal, R., and Rissanen, J. 1996. SLIQ: A fast scalable classifier for data mining. In EDBT '96: Proceedings of the 5th International Conference on Extending Database Technology. Springer-Verlag, London, UK, 18-32.
Ordonez, C. and Cereghini, P. 2000. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 559-570.
Quellec, G., Lamard, M., Bekri, L., Cazuguel, G., Cochener, B., and Roux, C. 2007. Multimedia medical case retrieval using decision trees. In Engineering in Medicine and Biology Society, 2007 (EMBS 2007), 29th Annual International Conference of the IEEE. 4536-4539.
Rokach, L. and Maimon, O. 2005. Decision trees. In Data Mining and Knowledge Discovery Handbook. 165-192.
Sarawagi, S., Thomas, S., and Agrawal, R. 1998. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD Rec. 27, 2, 343-354.
Sarawagi, S., Thomas, S., and Agrawal, R. 2000. Integrating association rule mining with relational database systems: Alternatives and implications. Data Mining and Knowledge Discovery 4, 2, 89-125.
Sattler, K.-U. and Dunemann, O. 2001. SQL database primitives for decision tree classifiers. In CIKM '01: Proceedings of the Tenth International Conference on Information and Knowledge Management. ACM, New York, NY, USA, 379-386.
Shafer, J. C., Agrawal, R., and Mehta, M. 1996. SPRINT: A scalable parallel classifier for data mining. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 544-555.
Shearer, K., Bunke, H., and Venkatesh, S. 2001. Video indexing and similarity retrieval by largest common subgraph detection using decision trees. Pattern Recognition 34, 5, 1075-1091.
Slezak, D. and Sakai, H. 2009. Automatic extraction of decision rules from non-deterministic data systems: Theoretical foundations and SQL-based implementation. In Database Theory and Application. Springer Berlin Heidelberg, 151-162.
Sreenivas, M. K., Alsabti, K., and Ranka, S. 1999. Parallel out-of-core divide-and-conquer techniques with application to classification trees. In IPPS '99/SPDP '99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing. IEEE Computer Society, Washington, DC, USA, 555-562.
Srivastava, A., Han, E.-H., Kumar, V., and Singh, V. 1999. Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3, 3, 237-261.
Sug, H. 2005. A comprehensively sized decision tree generation method for interactive data mining of very large databases. In Advanced Data Mining and Applications. 141-148.
Tata, S., Hankins, R. A., and Patel, J. M. 2004. Practical suffix tree construction. In VLDB '04: Proceedings of the Thirtieth International Conference on Very Large Data Bases. VLDB Endowment, 36-47.
Tian, Y., Tata, S., Hankins, R. A., and Patel, J. M. 2005. Practical methods for constructing suffix trees. The VLDB Journal 14, 3, 281-299.
Westhead, D., Parish, H., and Twyman, R. 2002. Instant Notes in Bioinformatics. Topeka Bindery.
Williams, H. E. 2003. Genomic information retrieval. In ADC '03: Proceedings of the 14th Australasian Database Conference. Australian Computer Society, Inc., Darlinghurst, Australia, 27-35.
Williams, H. E. and Zobel, J. 2002. Indexing and retrieval for genomic databases. IEEE Transactions on Knowledge and Data Engineering 14, 1, 63-78.
Zaki, M., Ho, C.-T., and Agrawal, R. 1999. Parallel classification for data mining on shared-memory multiprocessors. In ICDE '99: Proceedings of the 15th International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 198.
Zhou, C., Xiao, W., Tirpak, T., and Nelson, P. 2003a. Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation 7, 6, 519-531.
Zhou, C., Xiao, W., Tirpak, T. M., and Nelson, P. C. 2003b. Evolving classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation 7.