How to build a DNA search engine like Google?

Wang Liang
Tencent SOSO, Shenzhen, 430074, P.R. China
*To whom correspondence should be addressed. E-mail: [email protected]

[Abstract] This paper presents a novel method for building a large-scale DNA sequence search system based on web search engine technology. First, using Zipf's law, we find that 12 bp may be the length of most DNA "words". The "vocabulary" of DNA is then constructed by an n-gram statistical method. With this vocabulary we can easily segment DNA sequences, build an inverted index, and provide search services using mature search engine technology. Such a system can provide millisecond-level search over billions of DNA sequences. A DNA statistical language model is also built on this vocabulary. This research may open a completely new avenue to discovering the secrets of DNA.

The exponentially increasing volume of biological data poses new challenges for bioinformatics in the post-genome era. Today, searching for a DNA sequence across all current DNA databases is a very tough mission, yet searching for a word sequence in billions of Internet documents takes only a few milliseconds through Google. So could we build a DNA search engine like Google? This paper discusses this "simple" question.

Most current DNA search and comparison methods are based on the BLAST/FASTA algorithms, which compare one sequence with all the other sequences one by one (1,2). It is therefore very difficult to improve the speed of these methods greatly, especially in this period of DNA information explosion. Many researchers agree that biology needs a completely new search algorithm.

We may find some inspiration in the history of text information retrieval. To search for a word in a few documents, we can simply scan each document for matches. But with millions of documents, scanning every word becomes intolerably slow, especially in the Internet era. Almost all current search systems have therefore evolved into inverted-index-based systems. The idea of the inverted index is very simple (3).
It uses words as the field to index documents. For example, consider three documents:

T0 = "it is what it is"; T1 = "what is it"; T2 = "it is a banana"

Their inverted index:

word      Document ID
a         T2
banana    T2
is        T0, T1, T2
it        T0, T1, T2
what      T0, T1

If we want to search for "banana", we only need to look up "banana" in the word column and obtain its corresponding document ID: {T2}. If the query is "what is it", it is divided into "what", "is" and "it", which gives:

{T0, T1} ∩ {T0, T1, T2} ∩ {T0, T1, T2} = {T0, T1}

Normally word positions are also stored in the inverted files, so T1 ranks first and T0 second. The original scan method must scan all the words in every document; its time complexity is at least O(n), where n is the number of documents. For an inverted index system, however, the number of independent words in the word column, the vocabulary, is normally stable. The time complexity of an inverted index search is therefore essentially constant. In many actual search systems the vocabulary is stored in a hash table or similar in-memory data structure, so the document IDs of a word are found in a single lookup: its time complexity is O(1). For n in the billions, O(n) is vastly more than O(1). This is the "secret" that lets Google provide millisecond-level search over billions of documents.

An inverted index can index any symbol sequence and provide fast search, whether it is Na'vi, noise, or music. We only need to solve two problems. First, how do we divide, or segment, the sequence into "words"? Second, how do we ensure the word set, or vocabulary, is not too large?

In languages like English, sentences are easily segmented into words by spaces and punctuation. A DNA sequence has no spaces or punctuation, but this is not a big problem: some languages, such as Chinese, also have no spaces between words. The simplest method for indexing Chinese is n-gram segmentation. To obtain the n-gram segmentation, one shifts a "reading window" of length n along the sequence one symbol at a time. For example, for a Chinese sentence T = "ABCDE", the 1-gram segmentation divides T into {A, B, C, D, E}, and the 2-gram segmentation divides it into {AB, BC, CD, DE}. A query "ABC" is divided into "A" + "B" + "C" or "AB" + "BC" and searched in the corresponding inverted files. Obviously, we can also use n-gram segmentation to divide DNA sequences.
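The pipeline just described, n-gram segmentation, inverted index construction, and query intersection, can be sketched in a few lines. This is a minimal illustration with invented toy sequences, not the Lucene-based system discussed later:

```python
from collections import defaultdict

def ngrams(seq, n):
    """Slide a window of length n along seq, one symbol at a time."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_index(docs, n):
    """Map each n-gram 'word' to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, seq in docs.items():
        for word in ngrams(seq, n):
            index[word].add(doc_id)
    return index

def search(index, query, n):
    """Intersect the posting sets of every n-gram of the query."""
    words = ngrams(query, n)
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {"D0": "ATCGATCG", "D1": "GGCATCGA", "D2": "TTTTAAAA"}
index = build_index(docs, 2)
print(search(index, "ATCG", 2))  # D0 and D1 both contain ATCG
```

Each query touches only the posting sets of its own n-grams, never the documents themselves, which is what makes the search time independent of the collection size.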
The only remaining question is to determine a proper "word" length that keeps the "vocabulary" from growing too big. If we select 1 as the length of a DNA word, there are only 4 words, {A, T, C, G}, and the document list corresponding to each word would be far too long for the union operations. In practice, the vocabulary should not exceed about a million words, so a proper DNA word length lies in the range [7, 11]. Here we could select length 9 (4^9 = 262,144).
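The vocabulary-size constraint is easy to check directly: over the alphabet {A, T, C, G}, the number of possible DNA words of length n is 4^n.

```python
# Number of possible DNA "words" for each candidate word length n.
for n in range(7, 13):
    print(n, 4 ** n)
# Length 9 gives 262,144 possible words; length 12 already gives
# 16,777,216, so longer words quickly exceed a million-entry budget.
```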

After segmenting the DNA sequences, we can easily apply current search engine technology to create inverted files and build a DNA search engine. Based on Lucene/Hadoop (open-source search engine tools), we need to write only a few lines of code to build a usable DNA search system, which will be faster than BLAST. We could end the paper here. However, most current Chinese search engines use vocabulary-based segmentation rather than simple n-gram segmentation. For example:

T = "ABCDE"; its 2-gram segmentation is {AB, BC, CD, DE}. But "AB" may not be a real, meaningful "word", in which case we need not index it. With a vocabulary, the segmentation may instead be {ABC, DE}, which reduces the size of the inverted files. This is very important when building a large-scale search system. More importantly, segmenting sentences into meaningful words is the key to applying advanced search technology such as classification and clustering. To build a better DNA search engine, we need to build a "meaningful DNA vocabulary", so we continue the paper to discuss this issue.

Consider the Chinese search problem again: if you do not know any Chinese, how can you improve a Chinese search engine? A simple but very effective method is to find the length of most Chinese words. This value is 2 for Chinese, which is why most "foreign language" systems such as MySQL apply 2-gram segmentation to Chinese. But if you do not know Chinese at all, how do you determine this proper length? You may find a Chinese friend through QQ, or you may calculate this value by Zipf's law. Some research has already discussed the "linguistic features" of DNA using this law (4,5).

Here we use the low-frequency form of Zipf's law: in a sufficiently long document, about 50% of the distinct words appear only once. We call these one-frequency words. If we count the 1-gram words of a long Chinese story, the percentage of one-frequency words is well below 50% (about 20%). For 2-gram segmentation it is about 50%, and for n-grams with n > 2 it exceeds 50%. The 50% point of one-frequency words thus identifies the length of most words.

We can use the same method to obtain the word length of DNA: increase the length n of the n-gram segmentation and look for a clear 50% turning point in the percentage of one-frequency words. If such a point exists, the corresponding length is the length of most DNA "words". If not, we may say that DNA is not a language.
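The turning-point scan described above can be sketched as follows. A uniformly random sequence stands in here for a genome (real chromosomes are what produce Fig. 1); the one-frequency fraction still rises toward 1 once 4^n greatly exceeds the sequence length:

```python
from collections import Counter
import random

def one_frequency_ratio(seq, n):
    """Fraction of distinct n-gram words that appear exactly once."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Synthetic stand-in for a chromosome; the paper's scan runs over real
# genomes and looks for the length where this ratio crosses 50%.
random.seed(0)
seq = "".join(random.choice("ATCG") for _ in range(100_000))
for n in range(8, 14):
    print(n, round(one_frequency_ratio(seq, n), 3))
```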
A 9-gram segmentation would already be enough for a DNA search engine. We select the genome of Arabidopsis and one chromosome each of Aspergillus, Fruit Fly, and Mouse (first 30M bp) as experimental data, which include all non-coding and coding DNA. The n-gram statistical results are shown in Fig. 1:

Fig. 1 n-gram statistical results (x axis: word length of the n-grams; y axis: percentage of one-frequency words). The blue line shows five chromosomes of Arabidopsis; the red line corresponds to Fruit Fly and the green line to Mouse. The red broken line is a chromosome of Aspergillus.

Except for Aspergillus, all the genomes clearly show that the 50% point of one-frequency words corresponds to length 12. The Aspergillus chromosome contains only 4M bp, while the others exceed 25M bp; its DNA may simply be too short to show a clear statistical feature. We can therefore say that 12 bp is the length of most DNA "words". This statistical result also shows a strong linguistic feature of DNA; here we only apply it to improve the DNA search engine and do not discuss whether DNA is a language.

We could now use 12-grams to build the DNA index. But knowing the length of DNA words, we can further build a "vocabulary" of DNA and improve the search engine. Chinese search engine research offers many methods for building a word vocabulary automatically; the basic principle is that a high-frequency sequence is likely to be a word. We therefore count the frequency of every 12-gram segment word, and each word whose frequency exceeds a threshold is added to the vocabulary.

Here we select the Mouse chromosome to build the vocabulary, with the frequency threshold set to 1. We then use this vocabulary (about 3,000,000 words) to index the other chromosomes and determine whether different species use the same vocabulary. Ranking the 12-gram words of one chromosome of Arabidopsis and of Fruit Fly in decreasing order of frequency, we find that the Mouse vocabulary covers about 80% of the top 10,000 words of Arabidopsis and Fruit Fly, and about 70% of the top 20,000 words (Fig. 2). There are 4^12 = 16,777,216 possible 12-gram words in all, and our vocabulary accounts for only about 18% of them. So we could say that different species use the same vocabulary.
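The vocabulary construction and coverage measurement can be sketched like this, with toy sequences standing in for the chromosomes (the paper's experiment uses a Mouse chromosome, a threshold of 1, and n = 12):

```python
from collections import Counter

def vocabulary(seq, n, threshold=1):
    """Collect every n-gram whose frequency exceeds the threshold."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return {w for w, c in counts.items() if c > threshold}

def coverage(vocab, seq, n, top_k):
    """Share of seq's top_k most frequent n-grams found in vocab."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    top = [w for w, _ in counts.most_common(top_k)]
    return sum(1 for w in top if w in vocab) / len(top)

vocab = vocabulary("ATCG" * 100, 4)         # repeats give high-frequency words
print(coverage(vocab, "TCGA" * 100, 4, 2))  # shared words across "species"
```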

Fig. 2 Coverage of the vocabulary (x axis: rank of word, sorted by frequency, ×100,000; y axis: coverage of the vocabulary). The red line corresponds to Arabidopsis and the blue line to Fruit Fly.

To build an actual vocabulary for the search engine, we select a high frequency threshold for the 12-gram words. We also add the high-frequency words of the [9-11]-grams and all possible words of the [1-8]-grams to the vocabulary. With a vocabulary in hand, we can apply a Chinese segmentation method such as "maximum matching" to segment DNA sequences into words. For example, for T = "AGAGAGAGAGAG": if "AGAGAGAGAGAG" is in the vocabulary, the segmentation is simply "AGAGAGAGAGAG"; if not, T is divided into shorter words such as "AGAGAGAGAG" + "AG". For a long sequence, take the first line of ">gi|240254421|ref|NC_003070.9| Arabidopsis thaliana chromosome 1, complete sequence":

T = "CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC"

Its segmentation:

"CCC" + "TAAACCCT" + "AAACCCTAA" + "ACCCTAAAC" + "CTCTGAATCCTT" + "AATCCCTAAA" + "TCCCTAAAT" + "CTTTAAATCC"

In practice, we only need to add our vocabulary to the "dictionary file" of a Chinese search engine; no new code is required to build a DNA search engine. We built such a search system based on Lucene (6).

After segmenting DNA sequences, we can also use a statistical language model to describe them. The most common model is the vector space model, which uses terms to represent documents; typically the terms are single words, keywords, or TF*IDF weights (3). Here we use words to represent the documents. For example:
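A maximum-matching segmenter of the kind described above can be sketched as follows; the vocabulary here is a small invented one, not the 3,000,000-word vocabulary from the experiment:

```python
def max_match(seq, vocab, max_len=12):
    """Greedy longest-match segmentation, as used for Chinese text:
    at each position take the longest vocabulary word that matches."""
    words, i = [], 0
    while i < len(seq):
        for n in range(min(max_len, len(seq) - i), 0, -1):
            word = seq[i:i + n]
            if n == 1 or word in vocab:  # single bases always match
                words.append(word)
                i += n
                break
    return words

vocab = {"AGAGAGAGAGAG", "AGAG", "AG"}
print(max_match("AGAGAGAGAGAGAG", vocab))
```

Because single bases always match, the segmenter never fails; out-of-vocabulary stretches simply fall back to 1-grams.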

T1 = "it is a football"; T2 = "it is a banana"; T3 = "I need banana"

Using words as terms:

      it  is  a  football  banana  I  need
T1    1   1   1  1         0       0  0
T2    1   1   1  0         1       0  0
T3    0   0   0  0         1       1  1

So T1 = (1,1,1,1,0,0,0); T2 = (1,1,1,0,1,0,0); T3 = (0,0,0,0,1,1,1). We can then use these vectors to represent the sequences and compare them. Normally, the cosine of two vectors is used to describe their similarity:

V1 = (v_{1,1}, v_{1,2}, ..., v_{1,n}),  V2 = (v_{2,1}, v_{2,2}, ..., v_{2,n})

Similar(V1, V2) = ( Σ_{i=1}^{n} v_{1,i} v_{2,i} ) / ( sqrt(Σ_{i=1}^{n} v_{1,i}²) · sqrt(Σ_{i=1}^{n} v_{2,i}²) )

So Similar(T1,T2) = 0.75, Similar(T1,T3) = 0, Similar(T2,T3) = 0.2887, and thus Similar(T1,T2) > Similar(T2,T3). Yet most people would consider T2 more similar to T3. Words like "it", "is" and "a" exist in almost every document and cannot be used to distinguish documents, so we can use document frequency to improve the vector space model: we use TF*IDF as the term weight.

TF is the term frequency of a word in a document: the number of times the word appears, divided by the total number of words to avoid a preference for long documents. For T1, the TF vector is T1 = (1/4, 1/4, 1/4, 1/4, 0, 0, 0).

IDF is the inverse document frequency. The document frequency of word v is the number of documents that contain v; IDF is its reciprocal. Here, assume "football" and "banana" appear in 40 documents and the other words in 100 documents, so the IDF of "football" and "banana" is 1/40 and that of the other words is 1/100.
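The cosine similarities above are easy to verify in code, using the binary term vectors just given:

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

T1 = (1, 1, 1, 1, 0, 0, 0)
T2 = (1, 1, 1, 0, 1, 0, 0)
T3 = (0, 0, 0, 0, 1, 1, 1)
print(round(cosine(T1, T2), 4))  # 0.75
print(round(cosine(T2, T3), 4))  # 0.2887
```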

For T1, its TF*IDF vector is:

T1 = (1/4 × 1/100, 1/4 × 1/100, 1/4 × 1/100, 1/4 × 1/40, 0, 0, 0) = (1/400, 1/400, 1/400, 1/160, 0, 0, 0)

Similarly, T2 = (1/400, 1/400, 1/400, 0, 1/160, 0, 0) and T3 = (0, 0, 0, 0, 1/120, 1/300, 1/300). Then Similar(T1,T2) ≈ 0.32, Similar(T1,T3) = 0 and Similar(T2,T3) ≈ 0.72, so Similar(T2,T3) > Similar(T1,T2). This result complies with our intuition.
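The TF*IDF reweighting can be checked with a short script. The document frequencies (40 for "football" and "banana", 100 for every other word) are the assumed values from the text:

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

docs = {
    "T1": ["it", "is", "a", "football"],
    "T2": ["it", "is", "a", "banana"],
    "T3": ["i", "need", "banana"],
}
df = {"football": 40, "banana": 40}  # all other words appear in 100 documents

def tfidf(words):
    """TF (count / document length) times IDF (1 / document frequency)."""
    terms = sorted({w for ws in docs.values() for w in ws})
    return [words.count(t) / len(words) / df.get(t, 100) for t in terms]

v1, v2, v3 = (tfidf(docs[d]) for d in ("T1", "T2", "T3"))
print(round(cosine(v1, v2), 2), round(cosine(v2, v3), 2))  # 0.32 0.72
```

The rare shared word "banana" now dominates the similarity, which is exactly the behavior the reweighting is meant to produce.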

We can build the same vector model for DNA sequences. For example, the words "TTTTTTTTTTTT", "AAAAAAAAAAAA", "AGAGAGAGAGAG" and "CTCTCTCTCTCT" are among the top 20 most frequent words in all our test genomes. These words are likely the "the" and "a" of DNA. When comparing two DNA sequence vectors, we can delete these words to reduce the data dimension, which is very important when comparing long sequences. BLAST can obtain similar results, but for long sequences this method is much faster because it does not consider the order of words. Based on this model, many other fast DNA analysis algorithms can be designed.

This paper gives a proposal for building a large-scale DNA search engine that can provide fast search over DNA sequences. A DNA statistical language model is also created. This builds a bridge between information retrieval/search engines and DNA research: almost all search technology can be applied directly to DNA analysis. The methods based on this statistical model are fast but rough, so in the future the combination of statistical language models and formal language models may be the key to discovering the secrets of DNA.

References and Notes
1. WR Pearson, DJ Lipman, Improved tools for biological sequence comparison. PNAS 85(8):2444-2448 (1988).
2. SF Altschul, W Gish, W Miller, EW Myers, DJ Lipman, Basic local alignment search tool. Journal of Molecular Biology 215(3):403-410 (1990).
3. G.G. Chowdhury, Introduction to Modern Information Retrieval (Facet Publishing, 2003).
4. Chikara Furusawa, Kunihiko Kaneko, Zipf's law in gene expression. Physical Review Letters 90 (2003).
5. RN Mantegna, SV Buldyrev, Linguistic features of noncoding DNA sequences. Physical Review Letters 73(23):3169-3172 (1994).
6. http://sourceforge.net/projects/dnasearchengine/