2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

An Algorithm for Discovering Deep Order Preserving Submatrix in Gene Expression Data

Qiuhua Kuang, Meizhen Zhang, Zhihao Ma, Bo Ma, Zhiwen Liu, Yun Xue
School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, PR China
E-mail: [email protected].{zhangmzh.mzh4339}@[email protected]. [email protected], [email protected]

Abstract-Recently, the order preserving submatrix (OPSM) model has been widely applied in many fields, such as biological gene expression data analysis, finance data mining and recommendation systems. The OPSM model is widely used in gene expression data analysis because of its biological significance and noise robustness. However, most existing algorithms for mining OPSMs are based on a greedy strategy or the Apriori principle, and therefore miss some meaningful OPSMs, especially the Deep OPSMs that biologists are interested in. In this paper, an algorithm for accurate OPSM searching based on sequential pattern mining is proposed, which can find all OPSMs, especially Deep OPSMs. Dynamic programming, the suffix tree data structure and branch and bound rules are combined to improve the efficiency of the algorithm. The proposed algorithm was verified by experiments on real gene data in terms of biological significance and algorithm performance. Experimental results demonstrate that it is a high-efficiency algorithm and can find meaningful results.

Keywords-bicluster; OPSM; dynamic programming; increasing subsequence; all common subsequences

978-1-4673-6799-8/15/$31.00 ©2015 IEEE

I. INTRODUCTION

With the rapid development of DNA microarray technology in recent years, gene expression data have become so large that they will greatly accelerate the development of biological information technology. Table 1 is an example of a gene expression matrix, in which the rows represent genes, the columns represent different experimental conditions and every element is the expression value of a gene under a condition. Clustering algorithms [1] were the first to be used to analyze gene expression data; they group similar genes together, as similar genes have similar functions. Clustering has two types: clustering by conditions (Fig. 1 (a)) and clustering by genes (Fig. 1 (b)). When clustering by genes, the genes are divided into mutually disjoint subsets based on the similarity of their expression patterns under all conditions, and each of those subsets is a cluster. The genes in each subset have the same biological function or regulatory mechanism. Clustering by conditions is analogous. However, some genes function only under part of the experimental conditions rather than under all of them, while clustering algorithms can only find clusters that include all conditions or all genes. That is to say, clustering algorithms cannot find the local expression patterns in the matrix (Fig. 1 (c) and Fig. 1 (d)). To solve this problem, the biclustering method [26] was proposed, which is more suitable for gene expression data. Biclustering can find the local expression patterns in the matrix, so groups of genes with similar expression under a subset of the conditions can be found. Moreover, different biclusters may overlap: as shown by the green part in Fig. 1 (d), genes and conditions can belong to several clusters at the same time.

Figure 1. Cluster and bicluster: (a) clustering by conditions, (b) clustering by genes, (c) biclusters, (d) overlapped biclusters.

However, high-dimensional genomic microarray data contain a high level of noise, and co-regulated genes are not expressed at the same expression value. It is therefore significant to focus on the relative expression levels of different genes under different conditions rather than on their specific values. That is to say, finding genes whose expression levels fluctuate simultaneously under different experimental conditions can reveal interesting biological knowledge. Our research is concerned with a pattern-based biclustering model called the order preserving submatrix (OPSM). The OPSM model focuses on the relative order of the expression levels in the data matrix instead of the specific values. In fact, from the aspect of genetics, there is no requirement for co-regulated genes to have the same or similar expression values under the same conditions. For instance, since the expression values of genes change with the environment, the variation tendency (up or down) under different conditions is thought to be more reliable than the specific values. Therefore, mining gene clusters with similar variation trends is very important for explaining the relations among gene regulatory networks [7], and this kind of model is more robust than others to the noise introduced during experimental measurement.

Most existing accurate OPSM algorithms based on sequential pattern mining search for the target biclusters by setting thresholds on rows and columns. One common characteristic of these algorithms is that small row and column thresholds increase the computation load significantly. To keep time and space complexity within an acceptable range, they often set large row and column thresholds. However, this misses meaningful biclusters with small row support. Therefore, traditional accurate algorithms cannot solve the Deep OPSM problem well. In our research, an efficient accurate algorithm based on the concept of sequential pattern mining is proposed to mine OPSMs. The proposed algorithm is capable of finding all OPSMs in a dataset that meet the thresholds, especially the Deep OPSMs. Hui Wang [8] proposed a measurement to compute the similarity of two sequences in 2007. This measurement counts all common subsequences (ACS) of two sequences using a dynamic programming algorithm. We improve this method so that it outputs all common subsequences rather than their number. We convert common subsequences into increasing subsequences, use dynamic programming to find the increasing subsequences, then apply branch and bound rules and store the results in a suffix tree. Eventually, all OPSMs that reach the row and column thresholds are found in the suffix tree. The suffix tree data structure is well suited to storing and traversing sequential patterns, which improves the operating efficiency of the program significantly.

The remainder of this paper is organized as follows. Section II gives a brief introduction to related work. The details of our algorithm are presented in Section III. Section IV contains the experimental results and analysis, and Section V is the conclusion.

II. RELATED WORK

The concept of the bicluster was raised as early as 1972 by Hartigan [9], but biclustering was not applied until 2000. At that time, Cheng and Church [2] defined the concept of the mean square residual error: a submatrix is a bicluster only when its mean square residual error is smaller than a threshold. Because theirs is a greedy algorithm, not all the biclusters meeting the conditions can be found. Later, a growing number of biclustering algorithms were proposed [6, 7, 10-12] and applied in various fields such as gene expression analysis and recommender systems. Recently, a growing number of algorithms focus on the pattern in the data instead of the specific values. Ben-Dor et al. [13] were the first to propose the order-preserving submatrix (OPSM) model. The model puts importance on the relative order rather than on specific values, and mining it has been proved to be an NP-hard problem. They also proposed a greedy heuristic algorithm which can find OPSMs with a large support threshold. Based on this, Liu and Wang [7] proposed the OP-Cluster model, Cheung et al. [10] proposed the maximal OPSM model, and Gao et al. [11] proposed the Deep OPSM model, all of which are further research on OPSM. As proposed by Gao et al., a Deep OPSM is a pattern with small support that is nevertheless regarded as biologically significant. Because it includes only a few rows, a Deep OPSM is not likely to be found by traditional OPSM algorithms. Thus, a mining algorithm named KiWi was proposed to find as many Deep OPSMs as possible, using two parameters k and w to balance the computing resources. But since it also uses heuristic strategies, it cannot ensure that all Deep OPSMs are found. To solve this problem, an efficient and accurate algorithm based on sequential pattern mining is proposed in our research to mine all OPSMs, especially Deep OPSMs.

III. ALGORITHM

An accurate OPSM algorithm is proposed in this paper, which converts the OPSM mining problem into a special sequential pattern mining problem and solves the Deep OPSM problem effectively. The definitions of OPSM and Deep OPSM are as follows.

Definition 1: If $S = (R_S, C_S)$ is a submatrix of the data matrix $D = (R, C)$ and there is a column arrangement under which the elements of every row in the row set are in ascending order, then $S = (R_S, C_S)$ is an OPSM in $D = (R, C)$.

Definition 2: If $S = (R_S, C_S)$ is an OPSM and its row set contains fewer rows while its column set has more columns, $S = (R_S, C_S)$ is called a Deep OPSM.

For example, an original matrix $D$ is presented in Table 1. In Fig. 2 (a), we can see that the row set {b,c,e} has consistent increasing and decreasing patterns in the column set {2,3,4,5}. After column arrangement, the row set {b,c,e} is monotonically increasing over the column sequence {5,2,3,4}. This is known as the OPSM pattern. In Fig. 2 (b), we can see that the row set {b,c} meets the definition of OPSM in the column set {1,2,3,4,5,6}. Compared to the OPSM in Fig. 2 (a), the OPSM in Fig. 2 (b) has fewer rows but more columns. This is the desired Deep OPSM.
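Definition 1 can be checked mechanically on the matrix of Table I. The following is a minimal sketch rather than part of the proposed algorithm: the function name `opsm_column_order` and the dictionary layout are assumptions made for illustration, and ties between values are treated as violating the ascending order.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Matrix D of Table I: rows a-e, columns 1-6 (column c is stored at index c - 1).
D: Dict[str, List[int]] = {
    "a": [21, 37, 43, 45, 55, 7],
    "b": [66, 38, 41, 44, 20, 51],
    "c": [53, 10, 27, 39, 5, 44],
    "d": [9, 36, 44, 42, 47, 23],
    "e": [24, 11, 19, 31, 8, 17],
}


def opsm_column_order(
    matrix: Dict[str, List[int]], rows: Sequence[str], cols: Sequence[int]
) -> Optional[Tuple[int, ...]]:
    """Return the column arrangement under which every row in `rows` is
    strictly ascending over `cols`, or None if no such arrangement exists.
    Assumes `rows` is non-empty and `cols` holds 1-based column labels."""
    first = matrix[rows[0]]
    # The only possible arrangement is the one that sorts the first row.
    order = tuple(sorted(cols, key=lambda c: first[c - 1]))
    for r in rows:
        values = [matrix[r][c - 1] for c in order]
        if any(x >= y for x, y in zip(values, values[1:])):
            return None
    return order


print(opsm_column_order(D, ["b", "c", "e"], [2, 3, 4, 5]))    # -> (5, 2, 3, 4)
print(opsm_column_order(D, ["b", "c"], [1, 2, 3, 4, 5, 6]))   # -> (5, 2, 3, 4, 6, 1)
print(opsm_column_order(D, ["a", "b"], [1, 2, 3, 4, 5, 6]))   # -> None
```

The first call reproduces the OPSM of the example (rows {b,c,e}), and the second the Deep OPSM (rows {b,c} over all six columns).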

TABLE I. ORIGINAL MATRIX

Row\col    1    2    3    4    5    6
a         21   37   43   45   55    7
b         66   38   41   44   20   51
c         53   10   27   39    5   44
d          9   36   44   42   47   23
e         24   11   19   31    8   17

TABLE II. SEQUENCE MATRIX

Row\col    1    2    3    4    5    6
a          6    1    2    3    4    5
b          5    2    3    4    6    1
c          5    2    3    4    6    1
d          1    6    2    4    3    5
e          5    2    6    3    1    4
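The step from Table I to Table II (sort each row in ascending order, then replace every value by its original column label) can be sketched as follows. This is an illustration only: the name `to_sequence_row` is an assumption, and Python's built-in `sorted` stands in for the quicksort mentioned in the paper.

```python
from typing import Dict, List

# Values of Table I: rows a-e, columns 1-6.
ORIGINAL: Dict[str, List[int]] = {
    "a": [21, 37, 43, 45, 55, 7],
    "b": [66, 38, 41, 44, 20, 51],
    "c": [53, 10, 27, 39, 5, 44],
    "d": [9, 36, 44, 42, 47, 23],
    "e": [24, 11, 19, 31, 8, 17],
}


def to_sequence_row(values: List[int]) -> List[int]:
    """Return the 1-based column labels ordered by ascending expression value."""
    columns = range(1, len(values) + 1)
    return sorted(columns, key=lambda c: values[c - 1])


# Sequence matrix: one column-label sequence per gene (row).
SEQUENCE: Dict[str, List[int]] = {
    row: to_sequence_row(values) for row, values in ORIGINAL.items()
}

for row, seq in SEQUENCE.items():
    print(row, seq)
```

Rows b and c map to the same sequence, which is why they support a pattern over all six columns.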

Figure 2. Comparison of OPSM and Deep OPSM: (a) an OPSM, (b) a Deep OPSM.

Table 1 presents the original matrix $D$. When the elements of every row of $D$ are sorted in increasing order and the original values are replaced by the corresponding column labels, the original matrix is converted into a sequence matrix (Table 2). Consequently, the OPSM mining problem becomes a sequential pattern mining problem, in which one sequential pattern corresponds to exactly one OPSM. The proposed algorithm mainly consists of three steps. Firstly, the original dataset is preprocessed to get the sequence dataset. Secondly, all common subsequences between any two rows are found. This step includes getting the index sequence of one sequence with respect to another, searching all increasing subsequences of the index sequence and storing them in the suffix tree, and converting the increasing subsequences into common subsequences of the two sequences. Thirdly, the suffix tree is traversed to get the row set and column set of each OPSM.

A. Preprocessing

The elements in every row of the original dataset are put in ascending order by the quicksort method, and the element values are replaced by their original column labels. In this way, the original dataset is changed into a sequence dataset.

B. Search all common subsequences

After data preprocessing, all common subsequences of every two rows which meet the column threshold are searched, so that the OPSM mining problem is transformed into a sequential pattern mining problem.

1) Get the index sequence

The sequence formed by the positions in sequence $b$ of the elements of sequence $a$ is called the index sequence of $a$ in $b$. Its mathematical definition is as follows.

Definition 3: Let $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_n)$, $\beta = (\beta_1, \beta_2, \cdots, \beta_n)$ and $\gamma = (\gamma_1, \gamma_2, \cdots, \gamma_n)$ be three sequences with the same length $n$, and let $i$, $j$, $k$ be the subscripts of the elements in these three sequences respectively. For any $i$, if there is only one $j$ such that $\alpha_i = \beta_j$, $k = i$, $\gamma_k = j$, the sequence $\gamma$ is called the index sequence of $\alpha$ in $\beta$. $\alpha$ is called the mother sequence of $\gamma$ and $\beta$ is the target sequence of $\gamma$, recorded as $\gamma = index(\alpha, \beta)$.

Hui Wang [8] obtained the index sequence by using Definition 3 directly. But this requires a nested loop and has a large time complexity. Hence, we put forward a new method to get the index sequence. If the element values of $\alpha$ are subscripts of $\beta$ and the element values of $\beta$ are subscripts of $\alpha$, then $\alpha$ and $\beta$ are mutual self-index sequences ($\beta = indexself(\alpha)$, $\alpha = indexself(\beta)$). For example, if $\beta = indexself(\alpha)$ and $\alpha_i = j$, then $\beta_j = i$. To get $index(a, b)$, the self-index sequence of $b$, $indexself(b)$, is constructed first. The relationship between $index(a, b)$ and $indexself(b)$ is as follows. Sequences $a$ and $b$ have the same length. Let $ind = index(a, b)$ and $indself = indexself(b)$, and let $i$, $j$ be subscripts of $a$, $b$ respectively. For any $i$, if there is only one $j$ such that $a_i = b_j$, then $ind_i = j = indself_{a_i}$. Our method makes full use of this relationship: it traverses the elements of $a$ and constructs $index(a, b)$ directly from the corresponding values of $indexself(b)$. This only needs two single loops.

2) Get all increasing subsequences

Searching all common subsequences has been confirmed to be an NP-hard problem, so we convert it into a problem of searching increasing subsequences. Moreover, this problem is optimized by combining dynamic programming with branch and bound rules, which reduces the time complexity to a certain extent. Firstly, the elements of the sequence $ind$ are investigated one by one. If the value of an element is larger than that of an element ahead of it, it can be appended to that element, giving a new increasing subsequence. All investigated elements are kept in increasing order in a list called record. The element under investigation is inserted into record by binary search (the dichotomy method). Then record is traversed and the element under investigation is appended to each element with a smaller value, giving new increasing subsequences. This is repeated until all elements in record with a smaller value than the element under investigation have been visited. After all elements of $ind$ have been investigated, all increasing subsequences of $ind$ have been obtained.

3) Get all common subsequences

All increasing subsequences of the sequence $ind$ ($ind = index(a, b)$) can be transformed into all common subsequences of $a$ and $b$ according to the relationship between $ind$, $a$ and $b$. The relationship is as follows. Sequences $a$ and $b$ have equal length. Let $ind = index(a, b)$ and let $i$, $j$ be subscripts of $a$, $b$ respectively. For any $i$, let $j = ind_i$; then $a_i = b_j$.

Therefore, the relationship between the increasing subsequences ($ins$) of $ind$ and the common subsequences ($cs$) of $a$ and $b$ can be concluded. One common subsequence corresponds to exactly one increasing subsequence. Let $ind = index(a, b)$, let $ins$ be an increasing subsequence of $ind$, and let $cs$ be the corresponding common subsequence of $a$ and $b$. Let $i$ be the subscript of $ins$ and $cs$. For any $i$, let $j = ins_i$; then $cs_i = b_j$.
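The three sub-steps of Section B can be illustrated end to end on two rows of the sequence matrix. This is a simplified sketch under stated assumptions, not the paper's implementation: it enumerates increasing subsequences by plain backtracking instead of the record list and suffix tree, its only pruning is a simple remaining-length check (not the paper's bound rules), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def index_sequence(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """index(a, b): 1-based position in b of every element of a.
    Built from indexself(b) with two single loops; assumes a and b are
    permutations of 1..n."""
    indself = [0] * (len(b) + 1)          # indself[v] = position of value v in b
    for pos, value in enumerate(b, start=1):
        indself[value] = pos
    return [indself[value] for value in a]


def increasing_subsequences(ind: Sequence[int], min_len: int) -> List[Tuple[int, ...]]:
    """All strictly increasing subsequences of ind with at least min_len elements."""
    n = len(ind)
    results: List[Tuple[int, ...]] = []

    def extend(prefix: List[int], start: int) -> None:
        if len(prefix) >= min_len:
            results.append(tuple(prefix))
        for p in range(start, n):
            # Illustrative prune: even taking every remaining element
            # could not reach min_len any more.
            if len(prefix) + (n - p) < min_len:
                break
            if not prefix or ind[p] > prefix[-1]:
                prefix.append(ind[p])
                extend(prefix, p + 1)
                prefix.pop()

    extend([], 0)
    return results


def common_subsequences(a: Sequence[int], b: Sequence[int], min_len: int) -> List[Tuple[int, ...]]:
    """Map each increasing subsequence ins of index(a, b) to cs_i = b[ins_i]."""
    ind = index_sequence(a, b)
    return [tuple(b[j - 1] for j in ins) for ins in increasing_subsequences(ind, min_len)]


row_a = [6, 1, 2, 3, 4, 5]   # row a of Table II
row_b = [5, 2, 3, 4, 6, 1]   # row b of Table II
print(index_sequence(row_a, row_b))          # -> [5, 6, 2, 3, 4, 1]
print(common_subsequences(row_a, row_b, 3))  # -> [(2, 3, 4)]
```

With a column threshold of 3, rows a and b share only the pattern (2, 3, 4): both genes increase from column 2 to column 3 to column 4 in Table I.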

C. Branch and bound method

Searching all common subsequences is an NP-hard problem, and in gene expression data analysis we are interested in longer common subsequences. In view of these requirements, we use the branch and bound method to search only for common subsequences meeting the threshold, which avoids unnecessary operations and thus lowers the time complexity. We put forward four branch and bound rules. Suppose the length of sequence $a$ is $n$, the $i$th element in $a$ is $a_i$, and the minimum column threshold is $\delta$. After $a_i$ is inserted into the sequence record, the length of record is $m$ and $a_i$ is the $j$th element in record, that is, $a_i = record_j$. As $a_i$ is appended to an increasing subsequence, the length of the increasing subsequence currently under investigation is $l$. Since the index sequence is a permutation of $1, \ldots, n$, $n - a_i$ counts the values larger than $a_i$ and $m - j$ counts those already in record, so $(n - a_i) - (m - j)$ is the number of larger elements still to come.

1) Bound 1

When searching increasing subsequences, those using overly high-valued elements as the prefix cannot meet the column threshold. On this basis, elements without any potential can be filtered out, which reduces the candidate increasing subsequences significantly. When considering element $a_i$, further operations are made only if its value meets $(n - a_i) - (m - j) \geq \delta - 1$. In other words, the bound rule for element $a_i$ is:

$a_i \leq n + 1 + j - \delta - m$    (1)

2) Bound 2

When searching increasing subsequences, those using small-valued elements as the suffix cannot meet the column threshold. When investigating element $a_i$, its value must meet $n - i \geq \delta - a_i$ and the value of $j$ must meet $n - \delta \geq m - j$. As $a_i \geq j$ and $i = m$, the bound rule is:

$j \geq m + \delta - n$    (2)

Under the premise that the two bound rules are met, the following two constraints are used to further screen candidate sequences.

3) Constraint 1

An element is added behind an existing increasing subsequence. If the element value is too large, the new increasing subsequence cannot be expanded to a sequence meeting the column threshold. It must be guaranteed that $a_i$ has adequate potential to prolong the increasing subsequence to the column threshold, that is, $\delta - l \leq (n - a_i) - (m - j)$. The corresponding constraint is:

$a_i \leq n + l + j - \delta - m$    (3)

4) Constraint 2

An element is added behind an existing increasing subsequence. If it is ranked too far back in the existing sequence, the new increasing subsequence cannot be expanded to a sequence meeting the column threshold. Similarly, the corresponding constraint is:

[equation (4) is missing from the source text]

D. Get row set and column set of OPSM

All recognized common subsequences of every two sequences are stored in the corresponding suffix tree. Later, these suffix trees are attached to the total suffix tree and their column labels are marked at the leaf nodes. After all common subsequences have been investigated, the total suffix tree stores all common subsequences as well as the corresponding supporting row sets. Due to the characteristics of OPSM, the dataset has high dimensionality and density [10]. Therefore, we apply the Apriori principle to narrow the pattern search space. According to the Apriori principle, if a sequence has a subsequence that does not meet the row threshold, the sequence does not meet the row threshold either. The common subsequences and their supporting row sets are obtained by traversing the total suffix tree based on the Apriori principle, from which the OPSM biclusters are constructed.

IV. EXPERIMENTS

Experiments based on a real gene data matrix are conducted to verify the effectiveness of the proposed algorithm. The galactose saccharomyces gene dataset [14] is used. It is a subset of the dataset collected when studying the responses of the GAL pathway of baker's yeast to different gene knockouts. The strains are 9 GAL gene knockouts and the wild type. They were cultured on medium with or without galactose, generating a total of 20 experimental conditions (columns). The 205 rows correspond to the genes whose responses to the knockouts were measured. The development platform is as follows. Processor: Intel(R) Core(TM) i3 CPU M380 @ 3GHz; RAM: 4G; operating speed: 3GHz; operating system: Windows 7; development environment: VS2010; programming language: C++.

A. Visualization analysis

All OPSMs are explored by the proposed algorithm. A Deep OPSM is visualized as a broken line graph, through which the common characteristics of its rows can be seen intuitively. As shown in Fig. 3, the X-axis is the column label of the OPSM bicluster and the Y-axis is the expression value of the corresponding elements in the gene expression data matrix. It can be seen from Fig. 3 that all broken lines show the same variation trend, indicating that the expression values of the corresponding genes in the original matrix change in the same way. This is identified as a Deep OPSM.

Figure 3. Broken line graph of a Deep OPSM.

B. GO analysis

GO (Gene Ontology) annotation is used to test the biological significance of the OPSM biclusters generated by the proposed algorithm, that is, the authenticity of the clustering results. GO is a database established by the Gene Ontology Consortium, which aims to build a vocabulary standard that is applicable to various species, can define and describe the functions of genes and proteins, and is updated as research deepens. Some OPSM biclusters are chosen from those generated by the proposed algorithm for the GO annotation test. The OPSM biclusters generated in the experiment reflect biological features of gene regulation. When the P-value is low, the genes of the bicluster have special biological significance in one or more GO terms. Table 3 shows the GO terms with the smallest P-values (on the order of $10^{-10}$ or below).

For example, the first row in Table 3 is GO:0002181: of a total of 7166 genes in the gene database, only 171 genes (2.4%) belong to the cytoplasmic translation GO term, while a Deep OPSM mined by our algorithm includes 12 genes, all of which (100%) belong to the cytoplasmic translation GO term. As a result, the low P-value of $9.27 \times 10^{-19}$ confirms the biological significance of the Deep OPSM. This indicates that the proposed algorithm can find gene biclusters containing co-regulation information and demonstrates the feasibility of the proposed algorithm in terms of biological significance.

TABLE III. PART OF GO ANALYSIS RESULTS

GO ID        Gene Ontology term                 Cluster frequency         Genome frequency           Corrected P-value
GO:0002181   cytoplasmic translation            12 of 12 genes, 100.0%    171 of 7166 genes, 2.4%    9.27E-19
GO:0006412   translation                        12 of 12 genes, 100.0%    733 of 7166 genes, 10.2%   4.83E-11
GO:0043043   peptide biosynthetic process       12 of 12 genes, 100.0%    739 of 7166 genes, 10.3%   5.33E-11
GO:0006518   peptide metabolic process          12 of 12 genes, 100.0%    765 of 7166 genes, 10.7%   8.10E-11
GO:0043604   amide biosynthetic process         12 of 12 genes, 100.0%    783 of 7166 genes, 10.9%   1.07E-10
GO:0043603   cellular amide metabolic process   12 of 12 genes, 100.0%    833 of 7166 genes, 11.6%   2.26E-10

C. Enrichment analysis

The gene expression data are processed with the BicAT toolbox to obtain the results of CC [2], HCL [15], K-means [16], OPSM by Ben-Dor's algorithm [13] and xMotifs [17]. Next, biclustering results are obtained using the proposed algorithm. The results are then submitted to the GO analysis tool (http://go.princeton.edu/cgi-bin/GOTermFinder). Finally, after the P-values are obtained, the results are analyzed. The enrichment comparison is shown in Fig. 4, which gives, for each of the 6 algorithms, the percentage of its biclusters whose P-values are less than 0.01, 0.005, 0.001, 0.0005, 0.0001 and 0.00001 respectively.

Figure 4. Bicluster percentages of biological enrichment.

A smaller P-value and a higher percentage represent more biological significance. Fig. 4 shows that the proposed algorithm is superior to the CC [2], HCL [15], K-means [16] and Ben-Dor [13] algorithms in terms of Deep OPSM enrichment. In particular, the advantage of our algorithm is more pronounced at smaller P-values. Although the xMotifs [17] algorithm gets results similar to those of our algorithm, it is still slightly inferior.
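The magnitude of the P-values in Table III can be sanity-checked with a hypergeometric tail probability, a common basis for GO enrichment tests. The sketch below is an assumption-laden illustration: the paper does not describe the exact test or the multiple-testing correction applied by the tool, so the function `hypergeom_tail` computes only an uncorrected value and is not expected to reproduce the corrected P-values of the table.

```python
from math import comb


def hypergeom_tail(population: int, annotated: int, drawn: int, hits: int) -> float:
    """P(X >= hits) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, x) * comb(population - annotated, drawn - x)
        for x in range(hits, upper + 1)
    )
    return favourable / total


# First row of Table III: 12 of 12 cluster genes annotated with GO:0002181,
# against 171 of 7166 genes in the genome.
p_uncorrected = hypergeom_tail(population=7166, annotated=171, drawn=12, hits=12)
print(f"uncorrected P-value: {p_uncorrected:.3g}")
```

The uncorrected value is on the order of $10^{-20}$, somewhat below the corrected $9.27 \times 10^{-19}$ reported in Table III, as would be expected if a multiple-testing correction is applied.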

D. Algorithm performance analysis

Besides analyzing the biological significance of the Deep OPSMs found, we also ran experiments with different parameters on real datasets to further evaluate the performance of the proposed algorithm. Subsets of the galactose saccharomyces dataset, which contains 205 rows and 20 columns, are used as the testing dataset. First, the relationship between the running time and the column threshold is analyzed. The dataset contains 200 rows and 10 columns, and the column threshold is set to 4, 5, 6, 7 and 8. Experimental results are shown in Fig. 5. In Fig. 5 (a), given a fixed row threshold, the running time of the algorithm decreases significantly as the column threshold increases from 4 to 8, indicating that the proposed algorithm is very suitable for searching long patterns with many columns. Second, the relationship between the running time and the row threshold is analyzed. The dataset contains 200 rows and 10 columns, and the row threshold is set to 2, 3, 4, 5 and 6. Given a fixed column threshold, it can be seen from Fig. 5 (b) that the row threshold influences the running time of the proposed algorithm only slightly. Therefore, the proposed algorithm is very suitable for fully exploring Deep OPSMs with small row support, and can even find all Deep OPSMs whose row threshold is 2. In contrast, both the time complexity and the space complexity of OPSM algorithms based on traditional sequential pattern mining increase sharply as the row threshold decreases. This is why they prefer larger row thresholds, which makes them miss many meaningful Deep OPSMs.

Figure 5. Evaluating algorithm performances: (a) run time under different column thresholds, (b) run time under different row thresholds.

To study the effect of the branch and bound method on the performance of the proposed algorithm, the branch and bound processing is removed from the algorithm. According to the experimental results on the galactose saccharomyces dataset, the algorithm without branch and bound processing takes a long time and halts when the minimum support threshold is set to 2. This interruption is caused by excessive consumption of RAM. It confirms that the branch and bound method keeps the proposed algorithm from generating large numbers of meaningless short patterns, thus reducing RAM consumption and increasing pattern mining efficiency significantly.

V. CONCLUSION

Since OPSM focuses on the relative order of elements rather than on their specific values, it is strongly robust to noise and has been widely used in gene expression data analysis. However, OPSM mining is an NP-hard problem. Most existing OPSM mining methods are heuristic; they cannot promise to find all OPSMs in the data matrix and miss some Deep OPSMs that biologists are interested in. In this paper, an accurate algorithm for searching OPSMs based on sequential pattern mining is proposed. Firstly, it combines dynamic programming and branch and bound rules to find all common subsequences (ACS) between any two rows, which prevents patterns from being missed. Secondly, the ACS are stored in a suffix tree, which improves the efficiency of the algorithm. Finally, all OPSMs that meet the preset row and column thresholds, including Deep OPSMs, are obtained from this suffix tree. The algorithm is tested on real gene data, which confirms that it is a high-efficiency and meaningful method. However, because our algorithm mines all OPSMs and OPSM mining is proved to be an NP-hard problem, it is not suitable for very large datasets. In the future, we can apply cloud computing technology to our algorithm in order to handle large datasets. Meanwhile, our research could be applied to more fields such as finance and text mining.

[...] us with valuable guidance. We also gratefully thank the colleagues who participated in this work and provided technical support. This work is supported by Guangdong Economy & Trade Committee under Grant No. GDEID2010IS034; National Natural Science Foundation of China (No.71272084, No.61272252); Natural Science Foundation of Guangdong Province (Grant No.2015A030313544) and the PCSIRT (Grant No.IRT1243).

REFERENCES

[1] A. Tanay, R. Sharan, and R. Shamir, "Biclustering algorithms: A survey," Handbook of computational molecular biology, vol. 9, no. 120, pp. 122-124, 2005.

[2] Y. Cheng and G. M. Church, "Biclustering of expression data," in the 8th International Conference on Intelligent Systems for Molecular Biology, 2000, pp. 93-103.

[3]

K. Eren, M. Deveci, O. Kli