Persistent Suffix Trees and Suffix Binary Search Trees as DNA Sequence Indexes

Ela Hunt, Robert W. Irving, Malcolm Atkinson
Department of Computing Science, University of Glasgow
Glasgow, G12 8QQ, UK
{ela,rwi,[email protected]}

TECHNICAL REPORT TR-2000-63
October 4, 2000

Abstract

We constructed, stored on disk and reused suffix trees and suffix binary search trees for C. elegans chromosomes, and measured their performance using orthogonal persistence for Java (PJama). We compare our implementation with the performance of a transient suffix tree (we use "transient" to mean that a data structure lives only for the duration of one program execution, and "persistent" to mean that a data structure is stored and reused by subsequent program executions [7]), and discuss the suitability of such indexes in pursuing our long-term goal of indexing large genomes. We identify the potential for persistent DNA indexes in a variety of biological and medical contexts, and believe they will complement current string searching methods based on transient data structures.

Keywords: suffix tree, suffix binary search tree, DNA sequence, genome storage, persistence.

1 Introduction

The field of sequence searching and analysis is changing rapidly. This is due to the exponential growth of available sequence data, and will change further with the improving knowledge of protein structure and expression. The publicly available DNA resources approach 10^10 base pairs (bp), corresponding to 10 GB of string data (http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html), but sequence searching and analysis still rely on flat-file storage and high-throughput parallel computers reading all of the data sequentially. Because of the high cost of sequence comparison, heuristics are used. The results delivered by different methods are often hard to compare directly, and are increasingly being used in combination with databases characterising protein structure [9]. As expression databases become available, the picture will become


even more complex, and the requirement to unify access to sequence, structure and expression data will become more pronounced. In this context we set out to explore the possible use of suffix trees and suffix binary search trees as persistent indexes to sequence. The issue of sequence indexing (which is different from indexing the textual fields which accompany a sequence record) has not, to our knowledge, been successfully addressed by existing database technologies. Known string indexing techniques either concentrate on prefix searches [25, 17], which are not sufficient for sequence comparison, or use q-grams [10, 33] to index short substrings of DNA. On the other hand, there are exciting new developments in large-scale sequence analysis algorithms, and suffix trees are used extensively to look for sequence alignments and repeats [26, 13, 42]. Such suffix-tree structures have very good interrogation characteristics, but they must be created every time they are needed and take considerable time to build. We propose to build on suffix trees to develop a database solution to sequence searching, and start by investigating two alternative structures: the suffix tree [43, 32, 41] and the suffix binary search tree (SBST) [24]. We aim to find a data structure which will be suitable for a database server application. Our initial results were presented in [23], and we now report on larger indexes, on a comparison of the suffix tree and the SBST, and on the first use of multi-threading to access a collection of trees. We also compare our work with a transient suffix tree implementation [26].

This paper is structured as follows. Section 2 characterises sequence data, Section 3 reviews alternative data structures, Section 4 introduces suffix trees, and Section 5 discusses suffix binary search trees. PJama is introduced in Section 6. In Section 7 we present our results, and in Section 8 we discuss them. Section 9 outlines our future work, and Section 10 concludes.

2 Sequence data

Biological sequence data can be conceptualised as strings, with DNA length counted in base pairs (letters) and protein length expressed in amino-acid residues. DNA sequences form strings over an alphabet of 4 letters (A, T, C, G), and proteins use an alphabet of 20 letters. Example genomic sequence lengths are Saccharomyces cerevisiae at 12 Mbp, Drosophila melanogaster at around 180 Mbp, and the human sequence at around 3 Gbp (http://www.cbs.dtu.dk/services/GenomeAtlas/). Protein sequences are shorter, but their total number may be around 500,000 for the human (based on 100,000 genes), and several thousand for each bacterium.

The purpose of sequence searching and analysis is to derive biologically significant information from sequence data, and to direct laboratory work toward those sequences and model organisms which may shed light on the origin of diseases, and lead to the development of new drugs and preventive measures. Most sequence searching is currently performed using the BLAST [1] and FASTA [34] families of algorithms. BLAST and FASTA require piping all of the sequence data through computer memory. Similarly, the Smith-Waterman algorithm [37] reads all of the sequence data and compares it with the query. Such approaches are scalable only by purchasing more processors, disk channels, and RAM. Our research addresses an alternative strategy: storing indexes of the sequence data as persistent trees. If our goal is achieved, we can then design databases providing integrated access to sequence, mapping,


protein, expression and clinical data, and compute sequence similarity measures as part of a database query evaluation.

3 String indexing structures

Our choice of suffix trees and SBSTs is based on three premises. First, we note the successful use of transient suffix trees [31, 26, 13, 8], and the existence of approximate matching algorithms which may easily be reimplemented to use sequences stored in databases. MUMmer [13] and our own suffix tree have already been combined in genetic map construction [18]. Secondly, we see the SBST as a potential rival of the suffix tree, in that the space requirement of a database-resident SBST is approximately 50% of that of the suffix tree. In this paper we begin to quantify the performance difference between persistent forms of these two trees. Suffix arrays used to store q-grams have already been explored in a persistent context [10, 33], but have not yet led to the development of a large database. Manber and Myers [30] reported that a suffix array for 100 Kbp of DNA takes up 5 B per bp, as compared with 19.5 B per bp for its suffix tree. However, newer suffix tree implementations [27] report only 12.5 B per bp. Suffix trees, suffix binary search trees and suffix arrays have all been investigated from the point of view of algorithmic theory, but have not been tested sufficiently in the database context.

To select an appropriate persistent indexing technology for large sequences stored in databases, we need to consider the costs that are incurred each time a sequence is stored or replaced, and the costs of supporting searches and other enquiries about properties of a sequence. Although it is logically the combined cost that should be minimised, we currently assume that there will be large numbers of searches and enquiries, and we therefore focus on minimising the search costs. Finally, as we are interested in developing database indexing for a wide variety of biological uses, we think that suffix trees and SBSTs offer the advantages of versatility and good performance. Our plan is to develop compact indexes for exact string matching first, and to progress in the future to approximate matching algorithms.

4 Suffix trees

4.1 Suffix tree description

Suffix trees are well established in string processing, and a good introduction to their use in biology is provided by Gusfield [20]. A suffix tree indexes all suffixes of a given string [43, 32, 41], and the tree storage structure lends itself to space optimisations [27, 28]. A suffix tree over a sequence of length n has construction costs of O(n) time and space, and can be searched in O(k + m) time, where k is the pattern length and m is the number of pattern occurrences. We show a suffix tree for the string ACATCTTA in Figure 1, with a unique terminator character $ appended at the end, so that no suffix of ACATCTTA$ can be a prefix of any other suffix. A suffix tree indexing a string of length n has n leaf nodes, one per suffix (suffixes being numbered from 1 to n). Each edge is labelled with a non-empty substring, and at each branching node the starting letters of the outgoing edges are different, so that each path from the root to a leaf spells the suffix that starts at the sequence position held in the leaf. Suffix tree optimisations and efficient disk access


schemes are discussed in [2, 27], and recent uses of suffix trees in biological sequence analysis are reviewed in [31].

Figure 1: The suffix tree for ACATCTTA$, with the search for substring T traced. The string length n = 9, the query length k = 1, and the number of occurrences m = 3. [figure omitted]

Suffix trees are used in string searching and repeat identification. In Figure 1 an example search for the substring T, of length 1, is traced. The search starts at the root and traces the substring along the edges. When the substring has been completely traced, the search visits the children, leading to leaves 4, 6, and 7, which give the starting positions of that substring. More complex operations include finding repeats, where the entire tree is traversed to build a table of repeats and their frequencies. A comparison of two species may be based on such repeat identification.

4.2 Object-oriented suffix tree prototype in Java

Our implementation of suffix trees [11], based on Ukkonen's algorithm [41], is not yet optimised for space, and incurs overheads resulting from the use of object orientation and persistence. We focus on the viability of object-oriented tree construction and retrieval in the persistent context. We store the DNA sequence as an array of bytes (1 B per bp), and the tree consists of Nodes. Each internal Node represents a substring that occurs in two or more suffixes, and represents the leftmost of those by its start and end index in the

sequence. Leaf Nodes hold the suffix numbers. The following data structure is used:

    class Node {
        Node child, sibling, link;   // children and the suffix link
        int lLabel, rLabel;          // left and right indices of an incoming edge
        int suffix;                  // suffix number, stored in leaf nodes
    }

Space requirements reflect the Java implementation we are using. Each Node consists of three object references (4 B each) and three integers (4 B each). There is also an overhead of an object header (8 B), giving a total of 32 B per Node. We also store the byte array holding the sequence. Additional persistence overheads currently result in stores almost 100 times larger than the sequence array; e.g. 20.5 Mbp of sequence produces a 2 GB store. Some of this space is free space allocated in store segments but not used. Per bp indexed we used 70-75 B of storage, plus 25-30 B for persistence.
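To make the search of Section 4.1 concrete over this structure, here is a minimal sketch (in modern Java) of exact matching, not the code of [11]; it assumes that child links a node's first child, sibling its next sibling, that leaves have no children, and that seq holds the terminated sequence:

    import java.util.ArrayList;
    import java.util.List;

    class SuffixTreeSearch {
        byte[] seq;    // the indexed sequence, 1 B per bp, terminator included
        Node root;

        // Return the starting positions (suffix numbers) of all occurrences of query.
        List<Integer> findMatches(byte[] query) {
            List<Integer> hits = new ArrayList<Integer>();
            Node node = root;
            int q = 0;                               // query characters matched so far
            while (q < query.length) {
                // pick the outgoing edge whose label starts with query[q]
                Node child = node.child;
                while (child != null && seq[child.lLabel] != query[q]) child = child.sibling;
                if (child == null) return hits;      // mismatch: no occurrences
                int edgeLen = child.rLabel - child.lLabel + 1;
                for (int i = 0; i < edgeLen && q < query.length; i++, q++)
                    if (seq[child.lLabel + i] != query[q]) return hits;
                node = child;
            }
            collectLeaves(node, hits);               // every leaf below spells one occurrence
            return hits;
        }

        private void collectLeaves(Node n, List<Integer> hits) {
            if (n.child == null) { hits.add(n.suffix); return; }
            for (Node c = n.child; c != null; c = c.sibling) collectLeaves(c, hits);
        }
    }

A query that ends part-way along an edge is handled by the inner loop: the node then reached is the root of the subtree whose leaves, as in the trace of Figure 1, give all the occurrences.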

5 Suffix binary search trees in Java

Suffix binary search trees (SBSTs) are a new data structure [24] (http://www.dcs.gla.ac.uk/love/tech_report.ps) with potential for string indexing. They are more space-efficient than suffix trees, can be built in O(nh) time (or in O(n log n) for a balanced version), where h is the height of the tree and n the string length, and can be searched in O(m + l) time, where m is the length of the query and l the length of the path traversed in the tree. A simplified representation of an SBST for the string ACATCTTA is shown in Figure 2. The structure used for an SBSTNode uses direction and maximum values to speed up node insertions and searching:

    class SBSTNode {
        SBSTNode left, right;     // children
        boolean direction;        // direction for longest common prefix (lcp)
        int suffix, maximum;      // suffix number, lcp length
    }

Given any two strings w and u, lcp(w, u) is the length of the longest common prefix of w and u. For a node i in an SBST, a left (right) ancestor is any node j such that i is in the right (left) subtree of j. The closest left (right) ancestor of i is the left (right) ancestor j such that no descendant of j is a left (right) ancestor of i. In this context, maximum is defined to be 0 for the root, and otherwise as max lcp(s_i, s_j), taken over all ancestors j of node i, where s_i denotes the suffix stored at node i. direction is undefined where maximum = 0; it is left if node i is in the left subtree of the node j from which maximum has been adopted, and right if the maximum was adopted from a left ancestor. The values of direction and maximum facilitate character skipping within the nodes, so that during search or insertion each character of the query, or of the string to be inserted, is involved in only one equality comparison. Using the maximum values, we can decide which way to branch without having to compare characters or, if a comparison is required, find the index of the character within the node which is to be compared with the current query character; depending on the result of the comparison and the direction values, the left or right subtree is explored next. Our implementation [45] has SBSTNodes holding 2 object references (4 B each), 2 integers (4 B each), a bit, and the object overhead (8 B), plus 1 B for the sequence itself,


totalling 25.125 B per bp. The largest space saving comes from the fact that an SBST has only n nodes, so for a tree indexing 20.5 Mbp, including the byte array, 515 MB are needed. A database overhead of around 500 MB leads to a 1 GB store.

Figure 2: Suffix binary search tree structure, with nodes 4, 6, and 7 corresponding to search string T; direction d is undefined where max = 0. [figure omitted]
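To illustrate the lexicographic ordering that underlies the SBST, the following is a simplified sketch (in modern Java), not the implementation of [45]: it ignores the direction and maximum fields and recomputes the lcp at every node, whereas the real structure uses those fields so that each query character takes part in only one equality comparison.

    class SBSTSearch {
        byte[] seq;        // the indexed sequence, 1 B per bp
        SBSTNode root;

        // Return the suffix number (1-based) of one suffix having query as a
        // prefix, i.e. one occurrence of query in seq, or -1 if there is none.
        int find(byte[] query) {
            SBSTNode n = root;
            while (n != null) {
                int i = n.suffix - 1;                   // 0-based start of this node's suffix
                int k = 0;                              // lcp(query, suffix starting at i)
                while (k < query.length && i + k < seq.length && query[k] == seq[i + k]) k++;
                if (k == query.length) return n.suffix; // query is a prefix of this suffix
                // An exhausted suffix sorts before the query; otherwise the first
                // differing characters decide which subtree can contain a match.
                boolean goRight = (i + k == seq.length)
                        || ((query[k] & 0xff) > (seq[i + k] & 0xff));
                n = goRight ? n.right : n.left;
            }
            return -1;
        }
    }

Finding all occurrences, as findMatches in Section 7.4 does, additionally requires locating both ends of the run of suffixes that share the query as a prefix; the maximum and direction values are what keep that search, and insertion, cheap.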

6 PJama

Our database experiments use PJama [4, 3, 6, 35, 39] to abstract from the issues of data serialisation and disk access. PJama is a persistent distributed object platform for Java, developed in cooperation between the University of Glasgow and Sun Microsystems Laboratories (http://www.dcs.gla.ac.uk/pjama). PJama offers orthogonal persistence by reachability. In practice, an application programmer wanting to store application data persistently adds just a few lines of code to her software, and all application data persists on disk for an unlimited length of time. PJama has evolution facilities [14, 22, 5], enabling new versions of software and data to replace the old ones, recovery support [21], which uses logging to guarantee system recovery on failure, and distribution support [40, 38]. PJama stores offer good performance, and are ideally suited to fast application development, as they free the programmer from the


chores of managing data transfer to and from disk. Several projects [15, 16, 19, 44] have demonstrated the power of persistent programming in PJama, and our bioinformatics applets now give public access to stores which can be updated by external users (http://bioinf.gla.ac.uk). Transactional facilities for PJama are also being investigated [12]. Currently PJama can produce stores in excess of 5 GB. However, as our work on building even larger stores advances, improvements to PJama are becoming necessary. A recent modification involved a new log compaction scheme, which allowed us to build larger suffix trees, exceeding 15 Mbp.

7 Results

7.1 Test data

We performed tests with DNA for C. elegans (ftp://ftp.sanger.ac.uk/pub/C.elegans_sequences/CHROMOSOMES/), using all 6 chromosomes, totalling 97 Mbp. The query set used 72,451 C. elegans cDNA sequences retrieved in 1999 from that ftp site; they have since been retired from the site and replaced by longer cosmids. Sequence identifiers of those cDNAs can be traced in either Entrez or EMBL (http://www.ncbi.nlm.nih.gov/Entrez/, http://www.embl-heidelberg.de/srs5/).

7.2 Computing environment

Tests were carried out using PJama 1.6.5, based on Java 1.2, with the new log compaction scheme, and using the Java 1.2 production release for transient trees. We used a Sun Enterprise 450 machine with four 300 MHz UltraSPARC-II CPUs, 2 GB RAM, and disks mounted on an UltraSCSI disk controller. The single-tree suffix tree and SBST code does not currently use multithreading, and therefore used only one of the processors. Tests on 6 SBSTs, which used multithreading, took advantage of all 4 processors. The machine ran the Sun Solaris 7 operating system. The PJama configuration file for single trees used a 2 GB log file, and 1 GB (SBST) and 2 GB (suffix tree) stores. Tests on 6 SBSTs used a 2 GB log file, and the store consisted of 3 files (2 × 2 GB, and one of 0.7 GB). Log and data files were placed on 2 physical disks (9.1 GB Fujitsu MAB3091, 7,200 rpm [29]), so that disk contention could be observed.

7.3 Test overview and tree building

Transient and persistent suffix trees and SBSTs were built, and tests were carried out on the longest chromosome, chromosome 5 (20.5 Mbp). Subsequently, a store was built with 6 SBSTs, one for each chromosome, indexing all 97 Mbp. Building a transient suffix tree for chromosome 5 took under 7 minutes of elapsed time, and the SBST took 21 minutes; this difference is due to the lack of code optimisation. The persistent suffix tree for 20.5 Mbp took between 34 and 38 hours to build on disk (work in progress [36] is expected to reduce these times dramatically), and the SBST took between 50 minutes and 4 hours. The forest of 6 persistent SBSTs took between 2 hours 50 minutes and 8 hours of elapsed time. The use of shared disks by other users and the presence of background processes influenced the store building times.


7.4 Measuring the query performance

Measurement followed a similar plan for both the single-tree and the multiple-tree tests. Queries were read from a file and passed on to the individual tree or to a forest of trees. Timing for single trees was performed as follows: after each query was obtained, the time was recorded, the query submitted, the time taken again, and the results kept.

    Queries thisdata = new Queries(inFile, querySize, batchSize);
    for (int i = 0; i < batchSize; i++) {
        byte[] one = thisdata.getQueries(i);          // get next query
        starts[i] = System.currentTimeMillis();
        Set matches = tree.findMatches(one);          // submit query
        ends[i] = System.currentTimeMillis();
        results[i] = matches;                         // save results
    }

In the case of a forest of SBSTs, the strategy was modified: large batches were broken into smaller groups of queries and placed into a queue, cycling over the 6 trees, as pairs consisting of a pointer to a tree and a byte array of queries. A number of Java threads were initialised, and each would access the queue, remove an entry and process it. Results were recorded as above. After recording the number of hits resulting from each query, only the elapsed time for each batch was measured: after query preparation and job queue formation, the time was taken, the threads initiated, and after all threads had completed, the time was taken again. The mean time per query was then computed. A sketch of this scheme is given below.
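The following is a hedged sketch of that job queue, not the original test harness: the SBST interface and its findMatches method are illustrative assumptions, and java.util.concurrent post-dates the Java 1.2 platform used in these tests.

    import java.util.Set;
    import java.util.concurrent.ConcurrentLinkedQueue;

    interface SBST { Set<Integer> findMatches(byte[] query); }  // assumed tree interface

    class ForestQuery {
        // One job: a pointer to a tree paired with a group of queries for it.
        static class Job {
            final SBST tree;
            final byte[][] queries;
            Job(SBST tree, byte[][] queries) { this.tree = tree; this.queries = queries; }
        }

        // Drain the queue with nThreads workers; timing the call to run() gives
        // the elapsed time for the whole batch, as described above.
        static void run(ConcurrentLinkedQueue<Job> jobs, int nThreads)
                throws InterruptedException {
            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                workers[t] = new Thread(() -> {
                    Job job;
                    while ((job = jobs.poll()) != null)      // remove an entry and process it
                        for (byte[] q : job.queries) {
                            Set<Integer> hits = job.tree.findMatches(q);
                            int count = hits.size();         // record the hit count per query
                        }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
        }
    }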

7.5 Testing scenario

Tests on transient trees involved batches of 10,000 queries and query lengths 8, 9, 10, 15, 20, 50, 100 and 200. The test set for the persistent suffix tree and the single SBST consisted of 70 invocations of PJama sending batched queries derived from the same cDNA set, submitted to both stores (35 batches per store, with the same batch submitted to both) in alternation, so that the file system cache would be flushed each time a new batch/store combination was tested; this explores the less favoured scenario of single queries or small query batches being submitted, where an index needs to be brought from disk into memory. Each batch contained queries of the same length. Batch sizes used were 10, 100, 1,000, 10,000, and 50,000 queries. Query lengths were as for transient trees, and for each cDNA the initial bases were taken. We calculated the mean query response time and the number of hits per batch.

7.6 Transient tree query results

Transient suffix trees and SBSTs were tested with chromosome 5 (20.5 Mbp). We summarise the total query response times for different query lengths, for a batch of 10,000 queries, showing the sum of the time spent on performing all retrievals:


    query length   ST total time (s)   SBST total time (s)
    8                   37.215              0.503
    9                   11.123              0.437
    10                   3.566              0.450
    15                   0.387              0.460
    20                   0.377              0.449
    50                   0.375              0.448
    100                  0.415              0.456
    200                  0.367              0.485

We note that the transient SBST performance for short queries is better than the transient suffix tree performance. The O(m + k) predicted search time for the suffix tree, where m is the number of pattern occurrences and k the query length, is dominated by m and not by k. For the transient SBST, the query length has little influence on performance, and the length of the path traversed down the tree is the deciding factor.

7.7 Persistent tree query results

Figure 3: Persistent suffix tree for 20.5 Mbp: mean time per query. [figure omitted]

Figure 4: Persistent SBST for 20.5 Mbp: mean time per query. [figure omitted]

An overview of mean query times for the suffix tree is shown in Figure 3, and SBST mean times are in Figure 4. We show only query lengths 8 to 50, as for longer queries the mean is the same as for length 50. As expected, small batches of short queries (10 queries, length 8-10) are processed slowly, taking on average about 6 seconds per sequence looked up for the suffix tree, and up to 0.8 seconds for the SBST. Query length is significant. Short queries bring back many matches, and the persistent SBST returns such results faster than the suffix tree. In this respect query behaviour resembles the behaviour of transient trees. The combination of query length 8 and batch size 10,000 took very long to process (with 65,536 possible strings of length 8, the 10,000 queries brought back 8,568,303 results), giving an average query time of 0.115 s for the SBST and 0.920 s for the suffix tree. With short queries a serial search of the original sequence may be faster, because the suffix tree index is insufficiently selective. For queries of 8-15 bp the performance of the SBST is better than that of the suffix tree, and average response times are around 0.030-0.110 s. For queries over 15 bp, response times level out for both trees, and 0.012-0.013 s is required per response on both trees for a batch of 50,000 queries. Batches containing 10,000 and 50,000 queries exhibited better performance, and those would be typical of database server operation. It is worth noting that we observed a quasi-linear relationship for both trees between the number of returned hits and the time taken to return all results, illustrated in Figure 5. For large batches most of the tree is cached in memory, and slow disk performance is not a deciding factor. As for transient trees, the number of returned hits dominates the equation for the suffix tree, and the suffix tree is slightly inferior to the SBST for

short queries. For longer queries the behaviour of the two trees is similar.

Figure 5: Quasi-linear relationship between the number of hits and the time to retrieve them, for short queries of 8 to 10 bp. Query lengths 8-10 are marked as vertical bars. [figure omitted]

7.8 Measuring the effects of multithreading

Queries on a forest of SBSTs were set against a store containing 4.7 GB of indexes (indexing 97 Mbp). We modified the test plan, as already described. We timed runs of 1,000 queries of lengths 10, 15, 20, 100 and 200, and used 1, 2, 4, 8, 16, and 32 threads. Figure 6 shows the mean time recorded over 4 different iterations of the whole test. We observed large variations between the mean times per query for several runs, and the resulting graph is rough where a small number of threads is used. Optimum system performance is reached at around 16 threads, and stays around 0.2 seconds per query. This time, for short queries (length 10) the response was better than for longer ones.

Figure 6: Average query times over 97 Mbp in 6 SBSTs (4.7 GB PJama store). [figure omitted]

7.9 Comparison with REPuter

We tested the REPuter [26] package to measure the performance of a highly tuned suffix tree implementation. REPuter has a different purpose from our application: it is not interactive, but instead reports sequence repeats, by building the tree and performing a full traversal. Our test therefore only allowed us to measure the tree creation time plus one


traversal, and we chose parameters minimising the possible calculations. Our test data was a concatenated file of the 6 C. elegans chromosomes. We used the following options: -mem (report peak memory usage), -l (repeat length) equal to 50, 100, and 200, -noevalue (to suppress E-value calculations), and -allmax (to print all maximal repeats). 7 runs were made on the same configuration as for the other tests. The mean time observed was 18.06 minutes, and the mean peak memory usage was 1275.78 MB. With exactly 96,934,461 bp indexed, this gives a memory usage of 13.16 B per bp.

8 Discussion

We are pursuing a new avenue of research which takes a different direction from the usual transient performance analysis. We are faced with a lack of appropriate theory to explain the non-linear behaviour of large persistent suffix trees, both during the building phase and at query time. We realise that disk caching and CPU caching play a significant role in the performance of our trees. We have produced good results for both types of tree, but significantly more needs to be done to understand fully the empirical implications of building large suffix indexes, and to decide which of the tree structures is preferable.

We have demonstrated that both the persistent suffix tree and the persistent SBST deliver fast responses to queries, based on stores measuring between 1 and 4.7 GB. Multi-

threading of requests to a forest of SBSTs also delivered satisfactory results, with query results being returned within 0.2 seconds, but similar tests with suffix tree forests still need to be carried out. Transient suffix trees and SBSTs are currently limited in size: our largest transient trees index 26 Mbp for a suffix tree and 60 Mbp for the SBST. We think it is worth investigating those structures further, to produce more efficient implementations and to surpass the limit of 134,217,724 bp inherent in REPuter, as human chromosomes require larger suffix trees. We have shown that transient suffix trees, as implemented in REPuter, could realistically be used in the context of sequence searching, if they supported a query interface and cheap memory were available. For the human genome alone we would require 40 GB of virtual memory dedicated to that data. If that memory had to be shared with other applications, the query cost would have to include the time taken to build the tree. For the C. elegans genome the building time is 18 minutes, and we assume a hypothetical traversal time of 0 s. Equating the cost of tree construction with the execution of x queries on a forest of trees (0.2 seconds per query) gives 18 × 60 s = x × 0.2 s, i.e. a batch of 5,400 queries is required to amortise the cost of tree construction. In comparison with transient suffix trees, once persistent trees are built they are usable without further pre-processing. Using 150 GB of disk space on a multi-processor machine would enable us to index all of the human genome using our SBSTs.

While constructing suffix trees we identified bottlenecks which make the building of large suffix trees extremely hard. The irregular topology of a suffix tree and the splitting of nodes during tree construction have two costs. First, they lead to a high rate of update of the data structures, with consequent write-back costs for any caching technology; second, they lead to an object graph whose growth pattern correlates poorly with its final structure. This latter effect means that incremental allocation of disk sites for parts of the tree, e.g. during checkpoints [35, 36], results in poor locality of the final structure. This is the reason why the chromosome 5 suffix tree currently takes over 30 hours to build. We conclude that new suffix tree storage structures, suited to disk access patterns, are needed. Similar structures will be applicable in both tree types.

9 Future work

Our future work will encompass efficient storage structures, approximate matching algorithms, and more testing, including other methods of providing persistence. We are planning to release a public web server to characterise actual workloads, and thereby to target our optimisation of persistent tree structures to their applications.

10 Conclusions

The increasing demand for searches over large genomic sequences has provoked an investigation of the use of persistent indexes. The relatively long times needed to construct transient indexes, and the relative stability of genomic sequence data, suggest that here, as in other applications, storing these indexes on disk would be advantageous.

Our initial investigations have used an orthogonally persistent platform for Java, PJama, so that prototypes could be built and explored rapidly. Two different indexing

structures, suffix trees and suffix binary search trees, have been subjected to preliminary performance evaluations with 97 Mbp of the C. elegans genome. These were tested with cDNA query sequences, and server performance was evaluated for a range of batch sizes and query lengths, and at various levels of concurrency. These preliminary results suggest that the approach is promising. For large sequences, and with further optimisations and approximate matching algorithms, we expect to outperform current serial searching techniques and techniques which rebuild indexes each time they are needed. However, further research is needed to develop more efficient methods directly applicable in the biological context.

References

[1] S.F. Altschul et al. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.

[2] Arne Andersson and Stefan Nilsson. Efficient implementation of suffix trees. Software - Practice and Experience, 25(2):129-141, 1995.

[3] M. Atkinson and M. Jordan. Providing Orthogonal Persistence for Java. Lecture Notes in Computer Science, 1445, 1998.

[4] M.P. Atkinson, L. Daynes, M.J. Jordan, T. Printezis, and S. Spence. An Orthogonally Persistent Java. ACM SIGMOD Record, 25(4), 1996.

[5] M.P. Atkinson, M. Dmitriev, C. Hamilton, and T. Printezis. Scalable and Recoverable Implementation of Object Evolution for the PJama Platform. In Proc. 9th International Workshop on Persistent Object Systems (POS9), Lillehammer, Norway. Springer-Verlag, September 2000.

[6] M.P. Atkinson and M.J. Jordan. A Review of the Rationale and Architectures of PJama: a Durable, Flexible, Evolvable and Scalable Orthogonally Persistent Programming Platform. Technical Report TR-2000-90, Sun Microsystems Laboratories and Department of Computing Science, University of Glasgow, 901 San Antonio Road, Palo Alto, CA 94303, USA and Glasgow G12 8QQ, Scotland, 2000.

[7] M.P. Atkinson and R. Morrison. Orthogonally Persistent Object Systems. VLDB Journal, 4(3):309-401, 1995.

[8] Alvis Brazma, Inge Jonassen, Jaak Vilo, and Esko Ukkonen. Predicting Gene Regulatory Elements in Silico on a Genomic Scale. Genome Research, 8:1202-1215, 1998.

[9] Steven E. Brenner, Cyrus Chothia, and Tim J. P. Hubbard. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. PNAS, 95:6073-6078, 1998.

[10] S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. q-gram Based Database Searching Using a Suffix Array. In S. Istrail, P. Pevzner, and M. Waterman, editors, Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), pages 77-83, Lyon, France, 1999. ACM Press.

[11] S.D. Cox. A PJama Implementation of Efficient DNA or Protein Sequence Storage and Retrieval. Master's thesis, Department of Computing Science, University of Glasgow, 1999.

[12] L. Daynes. Implementation of Automated Fine-Granularity Locking in a Persistent Programming Language. Software - Practice and Experience, 30:1-37, 2000.

[13] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Research, 27:2369-2376, 1999.

[14] M. Dmitriev and M. P. Atkinson. Evolutionary Data Conversion in the PJama Persistent Language. In Proceedings of the 1st ECOOP Workshop on Object-Oriented Databases, Lisbon, Portugal, June 1999.

[15] Huw Evans and Peter Dickman. DRASTIC: A Run-Time Architecture for Evolving, Distributed, Persistent Systems. In Proc. 11th ECOOP, 1997.

[16] Huw Evans and Peter Dickman. Zones, Contracts and Absorbing Change: An Approach to Software Evolution. In Conference Proceedings of OOPSLA '99, 1999. To appear.

[17] Paolo Ferragina and Roberto Grossi. The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236-280, March 1999.

[18] Heather Fraser. Genome Annotation and Comparison System. M.Res. dissertation, Institute of Biomedical and Life Sciences and Department of Computing Science, University of Glasgow, 2000.

[19] A. Garratt, M. Jackson, P. Burden, and J. Wallis. A Comparison of Two Persistent Storage Tools for Implementing a Search Engine. In Proc. 9th International Workshop on Persistent Object Systems (POS9), Lillehammer, Norway. Springer-Verlag, September 2000.

[20] Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[21] C. G. Hamilton. Recovery Management for Sphere: Recovering a Persistent Object Store. Technical Report TR-1999-51, University of Glasgow, Department of Computing Science, December 1999.

[22] C. G. Hamilton, M. P. Atkinson, and M. Dmitriev. Providing Evolution Support for PJama1 within Sphere. Technical Report TR-1999-50, University of Glasgow, Department of Computing Science, December 1999.

[23] Ela Hunt. PJama Stores and Suffix Tree Indexing for Bioinformatics Applications. 10th PhD Workshop at ECOOP'00, 2000. http://www.inf.elte.hu/phdws/timetable.html.

[24] R.W. Irving and L. Love. The Suffix Binary Search Tree and Suffix AVL Tree. Technical Report TR-2000-54, University of Glasgow, Department of Computing Science, 2000. http://www.dcs.gla.ac.uk/love/tech_report.ps.

[25] H. V. Jagadish, Nick Koudas, and Divesh Srivastava. On effective multi-dimensional indexing for strings. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2000.

[26] S. Kurtz and C. Schleiermacher. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, pages 426-427, 1999.

[27] Stefan Kurtz. Reducing the space requirement of suffix trees. Software - Practice and Experience, 29:1149-1171, 1999.

[28] N. J. Larsson. Structures of String Matching and Data Compression. PhD thesis, Department of Computer Science, Lund University, 1999.

[29] Fujitsu Ltd. 3.5-inch magnetic disk drives, 2000. http://www.fujitsu.co.jp/hypertext/hdd/drive/overseas/mab30xx/mab30xx.html, January 5, 2000.

[30] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22(5):935-948, October 1993.

[31] Laurent Marsan and Marie-France Sagot. Extracting structured motifs using a suffix tree - Algorithms and application to promoter consensus identification. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB00), 2000. To appear.

[32] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, April 1976.

[33] C. Miller, J. Gurd, and A. Brass. A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases. Bioinformatics, 15:111-121, 1999.

[34] W.R. Pearson and D.J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85:2444-2448, 1988.

[35] T. Printezis. Management of Long-Running High-Performance Persistent Object Stores. PhD thesis, Department of Computing Science, University of Glasgow, 2000.

[36] T. Printezis and M. P. Atkinson. An Efficient Promotion Algorithm for Persistent Object Systems, 2000. Submitted to Software - Practice and Experience.

[37] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195-197, 1981.

[38] S. Spence. PJRMI: Remote Method Invocation for a Persistent System. In Proceedings of the International Symposium on Distributed Objects and Applications (DOA'99), 1999.

[39] S. Spence. Limited Copies and Leased References for Distributed Persistent Objects. PhD thesis, University of Glasgow, May 2000.

[40] S. Spence and M. P. Atkinson. A Scalable Model of Distribution Promoting Autonomy of and Cooperation Between PJava Object Stores. In Proceedings of the Thirtieth Hawaii International Conference on System Sciences, Hawaii, USA, January 1997.

[41] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. Also available as TR A-1993-1, Department of Computer Science, University of Helsinki, Finland.

[42] Anne Vanet, Laurent Marsan, Agnes Labigne, and Marie-France Sagot. Inferring Regulatory Elements from a Whole Genome. An Analysis of the Helicobacter pylori σ80 Family of Promoter Signals. J. Mol. Biol., 297:335-353, 2000.

[43] P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1-11, Washington, DC, 1973.

[44] Ray Welland and Malcolm Atkinson. A Zoned Architecture for Large-Scale System Evolution. In Proc. 3rd International Software Architecture Workshop (ISAW3), 1998.

[45] B. E. Young. Suffix Binary Search Trees in Java. Master's thesis, Department of Computing Science, University of Glasgow, 2000.
