A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation

Kunihiko Sadakane

Hiroshi Imai

Department of Information Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN

Abstract. A new text database management method for distributed cooperative environments is proposed, which can collect texts in distributed sites through a network of narrow bandwidth and enables full-text search in a unified, efficient manner. This method is based on two new developments in full-text search data structures and data compression. Specifically, the Burrows-Wheeler transformation is used as a basis for constructing the suffix array (or PAT array) for full-text search and for performing the block sorting compression scheme. A cooperative environment makes it possible to employ these new methods in a uniform fashion. This framework may also be used in the future for the Web text collection/search problem. The paper first describes this method, and then provides preliminary computational results concerning an I/O implementation of suffix arrays and the suffix sorting. These preliminary computational results indicate the practicality of our method.

1 Introduction

We propose a text database management method for distributed cooperative environments. In a cooperative environment, each user works in his private space and his data are transferred to a common database shared by all users. To search data in the common database, a search data structure is required. We treat text data such as source code and documents. The suffix array (Manber and Myers [8]) or the PAT array (Gonnet, Baeza-Yates and Snider [6]) is a memory-efficient data structure for searching for any substring of a text. The PAT array is used for text databases (Yoshikawa, Kato, Kinutani and Watanabe [11]) and we also use it for the common database. For full-text databases, the String B-tree (Ferragina and Grossi [5]) has good worst-case performance. However, its size is more than two times larger than the suffix array, and therefore we use the suffix array. To collect the data of each user via narrow-bandwidth networks, compression of the data is necessary. Moreover, it is desirable that we could transfer a data structure

for search at a low communication cost, because constructing a huge suffix array in the common database is a very time-consuming task. As a compression scheme, we propose using the Block sorting compression. The Block sorting is a general-purpose compression algorithm (Burrows and Wheeler [2]). It is suitable for text data and its compression ratio is better than that of gzip. Though some compression algorithms achieve slightly better compression than the Block sorting, they require much CPU time and memory. Compression by the Block sorting takes more time than gzip. However, its decoding speed is fast enough for common use. Moreover, Sadakane [9] proposed a fast algorithm for the Block sorting compression.

In addition to the good compression ratio and encoding/decoding speed of the Block sorting, we find another nice feature: the relation between the encoding/decoding process and the suffix array. In the encoder of the Block sorting, text symbols are permuted according to the suffix array. The permutation is called the Burrows-Wheeler transformation. In the decoder, the reverse transformation is computed and therefore the suffix array is also produced.

Using these interesting features of the Block sorting, we can newly unify search and compression in distributed environments. We transfer a text in compressed form; the common database server decodes it, obtains the original text and its suffix array in memory, and merges it into a large suffix array on disk (see Figure 1).

To realize our method, we have to overcome two difficulties. One is the compression time for private data at local sites and the other is the merging time at the common database server. The compression ratio of the Block sorting becomes better as the size of the data grows. Each user should therefore collect his data into one file, compress it and send it to the server; that is a cooperative work. Because the size of the data in the common database is huge, we have to update the database incrementally when new data arrives.
The updating becomes an I/O algorithm using a disk. In this paper we show experimental results on suffix sorting in memory and we also propose a suffix array merging algorithm on disk. Our merging algorithm is faster than that of Gonnet et al. We find that our management method, which combines Sadakane's suffix sorting algorithm with the new merging algorithm proposed in this paper, is practical for large text databases such as Web search engines and genome databases, which is confirmed by experiments.

2 Text collecting method using the Burrows-Wheeler transformation

In this section we first explain the suffix array briefly, describe the relationship between the Block sorting compression and the suffix array, and modify the decoding algorithm. Next we propose a text collecting method unifying search and compression using this relationship. For the search data structure we use the suffix array, and for the data compression scheme we use the Block sorting.

Fig. 1. Text collecting method: private texts are compressed by the block sort, transferred to the central server, decoded into suffix arrays in memory, and merged into a large suffix array on disk.

2.1 Suffix array

The suffix array is a memory-efficient data structure for searching for any substring of a text. It is an array of indices. The indices represent suffixes of the text and they are sorted in the lexicographic order of the suffixes. A text T of length n is represented by T[1..n] and its i-th suffix is Ti = T[i..n]. We assume that the text has a unique terminator $, i.e. T[n] = $. The suffix array of T is an array I[1..n]; if I[j] = i then the suffix Ti is the lexicographically j-th suffix. Searching is done by a binary search on the array. Because each comparison in the binary search is a string comparison, searching for a string of length m takes O(m log2 n) time. The suffix array requires only 5n bytes: n for T and 4n for I. Therefore it is suitable for large text databases.

2.2 The Block sorting and the suffix array

We use the Block sorting compression for transferring texts. The encoder of the Block sorting consists of three processes: the Burrows-Wheeler transformation, move-to-front encoding and entropy coding. The Burrows-Wheeler transformation is the most time-consuming process and it is defined by the lexicographic order of the suffixes of a text. Therefore in the encoder the suffix array I of a text T is calculated. The transformation is a permutation of the text symbols and its output L[1..n] is defined as follows:

1. i = 1
2. j = I[i] - 1, if j = 0 then j = n
3. L[i] = T[j], i = i + 1, if i <= n goto 2

Though the transformation takes much time, its reverse transformation from L to T is quickly computed in linear time by a radix-sort-like procedure [2].

1. C[0..255] = 0, for i = 1 to n do P[i] = C[L[i]]++
2. sum = 0
3. for ch = 0 to 255 do sum = sum + C[ch], C[ch] = sum - C[ch]
4. i = pos
5. for j = n to 1
6.   I[i] = j + 1, if I[i] = n + 1 then I[i] = 1
7.   T[j] = L[i], i = P[i] + C[L[i]]

In this code, pos is the index of the terminator $. In the procedure, only step 6 is added to the original. The suffix array is implicitly made in the original procedure and therefore we simply output it to an array I. Therefore we can search for any substring of a compressed text as soon as the compressed text is decoded.
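The two procedures above can be sketched in 0-indexed Python (not the paper's implementation; a naive suffix sort stands in for a fast one). Note how the decoder recovers the suffix array as a by-product of rebuilding the text:

```python
def bwt_encode(T):
    """Burrows-Wheeler transformation via the suffix array.
    T must end with a unique terminator '$'."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])         # suffix array (naive)
    L = ''.join(T[(sa[i] - 1) % n] for i in range(n))  # symbol before each suffix
    return L, L.index('$')                             # pos = row of the terminator

def bwt_decode(L, pos):
    """Reverse transformation in linear time; also outputs the suffix array."""
    n = len(L)
    occ, P = {}, [0] * n
    for i, ch in enumerate(L):        # P[i]: occurrences of L[i] in L[0..i-1]
        P[i] = occ.get(ch, 0)
        occ[ch] = P[i] + 1
    C, total = {}, 0
    for ch in sorted(occ):            # C[ch]: number of symbols smaller than ch
        C[ch] = total
        total += occ[ch]
    T, I, i = [''] * n, [0] * n, pos
    for j in range(n - 1, -1, -1):    # rebuild the text right to left
        T[j] = L[i]
        I[i] = (j + 1) % n            # row i holds the suffix starting at j+1
        i = P[i] + C[L[i]]            # LF-mapping to the previous text position
    return ''.join(T), I

L, pos = bwt_encode("banana$")        # L = "annb$aa", pos = 4
T, I = bwt_decode(L, pos)             # T = "banana$", I = its suffix array
```

The single extra assignment to I[i] is all that is needed to turn the standard inverse transformation into one that also emits the suffix array.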

2.3 Collecting method

By using the Burrows-Wheeler transformation, we can unify search and compression in text databases in distributed cooperative environments. When a user sends data in his private database to another person, he compresses the data by the Block sorting and transfers it. The person who receives the compressed data decodes it and simultaneously obtains the suffix array of the data, so he can search for any substring by using the suffix array. When a user sends data to the common database, the Block sorting compression is also used. The server of the common database decodes the compressed data into memory and merges it into a large suffix array on disk (see Figure 1). In the decoding step we obtain the suffix array of the data and therefore it is not necessary to recompute it. A typical application of this method is a cooperative Web search server. HTML files on a Web server are collected by a search robot and they are compressed and transferred to the central server. By archiving and compressing the files, the communication cost is greatly reduced. In the next section we describe suffix array merging algorithms.

3 Suffix array merging algorithm

In this section we describe two suffix array merging algorithms. One is that of Gonnet, Baeza-Yates and Snider [6] and the other is our straightforward merging algorithm. The algorithms merge a small suffix array of size m in memory into a large suffix array of size n on disk. If we create a suffix array on disk from scratch, we can use an efficient algorithm of Crauser and Ferragina [4]. However, our method requires incremental creation of a suffix array, which is very suitable for the management of texts incrementally collected from cooperatively working sites. We define the variables as follows: TM[1..m] is a text in memory, IM[1..m] is its suffix array, TD[1..n] is a text on a disk and ID[1..n] is its suffix array.

3.1 Gonnet, Baeza-Yates and Snider algorithm

First we describe the merge algorithm of Gonnet et al. The algorithm consists of two passes: counting and merging. A feature of the algorithm is sequential disk access.

Counting. In the first pass, each suffix of TD is read and its lexicographic rank with respect to IM is calculated. We use an additional integer array C[0..m]. After the first pass, C[j] represents the number of suffixes of TD which lie between TM[IM[j-1]..m] and TM[IM[j]..m] in lexicographic order.

1. C[0..m] = 0, s = 1; e = r (r is a constant)
2. read TD[s..e] into memory
3. for each suffix of TD in memory, TD[i..e] (s <= i < s + r/2)
   (a) find j s.t. TM[IM[j-1]..m] < TD[i..e] < TM[IM[j]..m] by binary search
   (b) add 1 to C[j]
4. slide TD[s + r/2..e] to the first half of the buffer, read TD[e + 1..e + r/2] into the latter half of the buffer and goto step 3

Merging. In the second pass, IM is merged into ID according to the array C[0..m]. The result is written to a new array I'D[1..n + m].

1. i = 0; j = 0; k = 0
2. c = C[i]
3. copy ID[j..j + c - 1] to I'D[k..k + c - 1] and let k = k + c
4. copy IM[i] + n to I'D[k] and let k = k + 1
5. i = i + 1; j = j + c, if i <= m then goto step 2

The number of disk accesses of the algorithm is as follows. The number of reads is n/B for TD and 4n/B for ID, and the number of writes is m/B for TD and 4(n + m)/B for I'D. Therefore the total is (9n + 5m)/B. Note that the size of a character in a text is one byte, the size of the elements of the arrays is four bytes, B is the disk page size, and TM is appended to the end of TD. To create a suffix array of a text of size n on disk, we have to divide the text into pieces of size m, sort them in memory and merge them into disk. Therefore the total number of disk accesses becomes

  5m/B + (9m + 5m)/B + (2 * 9m + 5m)/B + ... + ((n - m) * 9 + 5m)/B = 5n/B + 9n(n - m)/(2Bm)

and the time complexity becomes

  O(m log2 m * (m + ... + (n - m))) = O(n^2 log2 m).

This algorithm requires 9m + 4 bytes of memory (m bytes for TM, 4m bytes for IM and 4(m + 1) bytes for C).

3.2 Our algorithm

Next we describe our straightforward merging algorithm. It is also a two-pass algorithm which consists of counting and merging. However, the counting is performed by traversing not all suffixes of TD but those of TM.

1. i = 0; C[m] = n
2. s = IM[i]
3. find j s.t. TD[ID[j-1]..n] < TM[s..m] < TD[ID[j]..n] by binary search
4. C[i] = j; i = i + 1, if i < m goto step 2
5. C[i] = C[i] - C[i-1] (i = m, ..., 1)

The second pass is the same as that of Gonnet et al. However, by writing the elements of the new array I'D from right to left, we can overwrite ID with I'D and therefore temporary disk space is not required. Moreover, the array C is not necessary if the merging step is combined with the counting step. This algorithm requires only 5m bytes of memory (m bytes for TM and 4m bytes for IM).

The number of disk accesses to create a suffix array of size n becomes as follows. One binary search requires 2 log2 n disk accesses, and therefore merging a suffix array of size m into that of size n requires 2m log2 n disk accesses in the counting pass and (8n + 5m)/B disk accesses in the merging pass. Therefore the total number of disk accesses is

  2m(log2 m + log2 (2m) + ... + log2 (n - m)) + 5n/B + 8n(n - m)/(2Bm) <= 2n log2 n + 5n/B + 8n^2/(2Bm).

The time complexity is

  O(m log2 n * m * n/m + (m + 2m + ... + (n - m))) = O(mn log2 n + n^2/m).

Roughly speaking, the time complexity is reduced by a factor of n/m, which causes a practical speedup, and this is confirmed by experiments. Though this algorithm performs binary searches and therefore random disk accesses occur, its time complexity is less than that of Gonnet et al. Furthermore, most of the disk accesses caused by the binary searches are cached because we process the suffixes of TM in lexicographic order.

Theorem 1. Assume that the size of the disk cache is more than 2B(⌈log2 n⌉ + 1) bytes and that in each string comparison in the binary searches the number of symbol comparisons is less than B. When we merge a suffix array IM of size m for a text TM in memory into a suffix array ID of size n for a text TD on disk by our algorithm, the number of disk accesses is at most 4n/B for the suffix array ID on disk and at most n for the text TD on disk.

Proof. First we show that all nodes in the binary search tree are read only once from disk and therefore the number of disk accesses for TD is at most n. We search for all suffixes of TM in the suffix array ID in lexicographic order. Therefore the traversal of the nodes of the binary search tree is in-order. After a node u in the search tree is read, all nodes in the two subtrees of the node are read. While the subtrees are traversed, the node u is in the disk cache, and after the traversals u is not accessed any longer. Next we show that the number of disk accesses for ID is at most 4n/B. For any search path from the root to a leaf of the search tree, the internal nodes of the search tree are always in the disk cache. The leaves of the search tree are read from left to right and adjacent leaves are accessed consecutively. Therefore the total number of disk accesses is equal to the number of disk pages for ID. Because an integer is 32 bits, the size of ID is 4n bytes. For example, if n = 256M, m > 10M and B = 8192, then 4n/B << n < m(log2 n + 1). Therefore the number of disk accesses is reduced by a disk cache of size 2B(⌈log2 n⌉ + 1) bytes.

4 Experimental results

We have experimented on making suffix arrays in memory and on disk. We use a Sun Ultra30 workstation (UltraSPARC-II 296MHz) with 1GB memory running Solaris 2.5.1 and an Ultra60 (UltraSPARC-II 360MHz) with 2GB memory running Solaris 2.6. To perform binary searches on disk, we use the mmap(2) system call. The page size is 8192 bytes. The read size for one binary search is therefore log2 n * 8192 * 2, about 512K bytes. In our merging algorithm the suffixes of a text TM in memory are processed in lexicographic order. Therefore in the next binary search almost the same elements of ID are accessed and they are in the disk cache. In the merging step, the disk is accessed sequentially and therefore we do not use mmap. Though we can execute our algorithm without any temporary disk space, it is slow because of a property of mmap. Therefore we use temporary disk space in the merging step. The texts used in our experiments are HTML files and genome databases. The HTML files are collected to a search server and the files are frequently updated. The genome databases are also updated when new parts of DNA sequences are analyzed. Both kinds of files are merged into a suffix array on disk. Because the files are very large, compression of the files is necessary for transfer.

4.1 Genome databases

Suffix sorting in memory. First we have experimented on making suffix arrays in memory. If we have enough memory to store the suffix array of a text, we can use a fast string sorting algorithm (Bentley and Sedgewick [1]) or a fast suffix sorting algorithm (Sadakane [9]). The former requires 5m bytes of memory and the latter requires 9m bytes of memory to make a suffix array of size m. The Bentley-Sedgewick algorithm can make a larger suffix array than Sadakane's algorithm in limited memory. As for merging suffix arrays, the algorithms of Gonnet et al. and ours require 9m bytes of memory. If we sort suffixes and merge their suffix array into another suffix array on the same workstation, we need not consider the memory requirements of both sorting algorithms. Moreover, the Bentley-Sedgewick algorithm becomes very slow if a text contains many repeated substrings. We use a measure of the difficulty of sorting suffixes: the average match length (AML). The AML of a text is defined as

  AML = (1/n) * sum_{i=1}^{n} lcp(I[i-1], I[i]),

where n is the length of the text, I is the suffix array and lcp is the length of the longest common prefix of the two suffixes. The value varies among texts and if it becomes large, the Bentley-Sedgewick algorithm becomes very slow. On the other hand, Sadakane's algorithm is not affected much by the AML.

Table 1 shows experimental results of suffix sorting time (user time) and the AML of texts on the Ultra30 workstation. The texts are taken from a genome database of human (ddbjhum.seq) [3]. The database is a text of about 400M bytes and it consists of sequences of ATCG and their explanations in English. We use its first 170M bytes and divide it into 17 files of size 10M bytes. They are numbered 0 to 16. In the table, AML represents the AML of a file, and BS and Sadakane represent the sorting time by the Bentley-Sedgewick algorithm and Sadakane's algorithm respectively. Though the Bentley-Sedgewick algorithm is slightly faster than Sadakane's algorithm when the AML is small, it becomes very slow when the AML is large. On the other hand, the sorting time of Sadakane's algorithm is stable.

sorting time and AML

le sorting time (s) No. AML BS Sadakane 0 45.8 89.6 49.4 1 26.4 56.4 49.4 2 21.4 48.8 50.1 3 19.0 45.9 49.3 4 18.8 45.8 49.3 5 20.6 48.0 50.0 6 19.8 47.3 49.3 7 22.5 50.0 49.0 8 20.1 48.1 49.0

le sorting time (s) No. AML BS Sadakane 9 35.3 59.3 48.2 10 55.3 83.4 48.7 11 81.0 109.0 49.8 12 178.6 170.2 50.9 13 137.4 215.4 54.7 14 180.5 270.1 58.4 15 75.0 118.8 50.9 16 85.2 132.8 51.6
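The AML defined above can be computed directly from a text and its suffix array. A small 0-indexed Python sketch (the lexicographically first suffix has no predecessor and contributes 0):

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def aml(T, I):
    """Average match length: mean lcp of lexicographically adjacent
    suffixes T[I[i-1]:] and T[I[i]:] of the text T."""
    n = len(T)
    return sum(lcp(T[I[i - 1]:], T[I[i]:]) for i in range(1, n)) / n

I = sorted(range(7), key=lambda i: "banana$"[i:])  # [6, 5, 3, 1, 0, 4, 2]
print(aml("banana$", I))   # -> 6/7, about 0.857
```

A repetitive text like "banana$" already shows a non-trivial AML; the HTML files in Table 4 push it into the hundreds, which is exactly what slows a plain string sort down.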

Merging suffix arrays into disk. Next we experimented on merging suffix arrays into disk on the Ultra60 workstation. The time is not user time but elapsed time measured by the rusage command. First we merge the text database of the human genome 10M bytes at a time. Figure 2 shows the merging time of BGS (the algorithm of Baeza-Yates, Gonnet and Snider) and of our algorithm. The line graphs show the totals of the merging times of each merging step. A text is divided into pieces of

size m and they are merged into the large suffix array on a disk one by one. The value of m is 10M, 20M, 40M, 80M or 160M bytes. In the figure, BGS shows the total time of BGS and Ours shows the total time of our algorithm. The total time of BGS is reduced as m becomes large. The reason is as follows. The second pass is O(n^2/m) time and it is inversely proportional to m. Though the first pass is O(n^2 log2 m) time, in the average case it depends on the number of symbol comparisons in the string comparison functions. The number of symbol comparisons is proportional to the AML. We assume that two texts whose suffix arrays are merged have the same probability distributions of symbols. The first pass then becomes O((AML/m) n^2 log2 m) time and it is inversely proportional to m.

Fig. 2. Merging time: total merging time versus file size for BGS and our algorithm, with m = 10M, 20M, 40M, 80M and 160M.

On the other hand, the total time of our algorithm is not much affected. The first pass is O(mn log2 n) time in the worst case and O(AML * n log2 n) time on average. The latter does not depend on m.

The Ultra60 workstation has 2G bytes of memory. Our merging algorithm requires 5m bytes of memory and the rest of the memory is used for the disk cache. Therefore many of the accesses to the text TD are cached. To examine the effect of the disk cache, we allocate a dummy space of size 1.5G in memory by using the mlock function. Therefore we can use only 512M bytes. Table 2 shows the merging time of our algorithm. The first column shows the available memory and the other columns show the total time in seconds to make a suffix array of a text of size 240M. The value m, the unit of merging, is 10M or 40M. Our algorithm becomes 2.98 times slower when m = 10M and 2.55 times slower when m = 40M if the available memory is 512M. In our algorithm the text TD on disk is accessed randomly and therefore it becomes slower if the available memory for the disk cache is small. However, our algorithm is still faster than BGS.

Table 2. Available memory and merging time

main memory  ours(10M)  BGS(10M)  ours(40M)  BGS(40M)
2048M         5407.70   29678.6    3848.11    9870.68
512M         16119.7    46269.3    9816.64   21857.3

Our algorithm is fast when m = 80M and it becomes slower when m = 160M. The reason is that suffix sorting in memory becomes slower as m grows. We assume that sorting m suffixes takes O(m log m) time. A text of size n is divided into n/m pieces of size m and the suffixes of each piece are sorted. Therefore the sorting takes O(n/m * m log m) = O(n log m) time. Table 3 shows the total sorting time for a text of size 320M. The sorting time for m = 160M is about log10 160 = 2.2 times slower than that for m = 10M. However, if the text is compressed by the block sorting and we receive it in compressed form, the sorting step is not necessary. In such a case our algorithm becomes faster as m grows.

Table 3. Total sorting time

Unit of merge     10M   20M   40M   80M   160M
sorting time (s)  1444  1632  1814  2188  3123

4.2 HTML files

Table 4 shows the sorting and merging time for HTML files on the Ultra30. We use HTML files held by servers of the University of Tokyo, Kyoto University, Osaka University and NTT (Nippon Telegraph and Telephone Corporation). The HTML files were collected by the ODIN [7]. The rows represent the size of the compressed files, the size of the original files, the size of the files whose HTML tags are removed and their AMLs, the suffix sorting time by the Bentley-Sedgewick algorithm and by Sadakane's algorithm, the sorting and merging time of BGS, and the sorting and merging time of our algorithm. We first make a suffix array of the files of the University of Tokyo on disk and then merge those of the files of Kyoto University, Osaka University and NTT. Note that the tar file for the files of the University of Tokyo is compressed from 111M bytes to 22M bytes by the Block sorting compressor bzip [10]. The rows of merge time show the time for making a suffix array of U. Tokyo and merging the suffix arrays of the other HTML files into disk. In the merge algorithms Sadakane's algorithm is used for sorting suffixes in memory. The table shows that the AML of HTML files is very large and therefore sorting by the Bentley-Sedgewick algorithm becomes very slow. The table also shows that our merging algorithm is faster than the BGS algorithm.

Table 4. Sorting and merging time for HTML files

                       U. Tokyo  Kyoto U.  Osaka U.  NTT
size of .tar.gz         28M       16M       10M      2.7M
size of .tar           111M       63M       41M       13M
size without tags       45M       26M       17M      3.4M
AML                    779       481       441       123
sort time (BS) (s)    2016.9     729.1     439.5     28.2
sort time (Sadakane)   172.8      90.1      28.2      8.5
merge time (BGS)       302       874       888       724
merge time (ours)      337       503       222       128

5 Concluding remarks

We have proposed a text database management method for distributed cooperative environments. Our method can transfer full-text data in compressed form, and the receiver can simultaneously obtain the original text and a search data structure called the suffix array. The key idea of our method is the use of the Burrows-Wheeler transformation. The transformation is used in the Block sorting compression and it is defined by the suffix array of a text. We can obtain the original text and its suffix array from the compressed text.

Our method is applicable to large text databases such as Web search servers and genome databases. In such cases the databases are often updated. Therefore merging new data into a database is necessary. We also proposed a simple but efficient suffix array merging algorithm. The algorithm is faster than that of Gonnet, Baeza-Yates and Snider.

We experimented on suffix sorting and merging for HTML files and genome databases. Both kinds of files have large AMLs, and therefore ordinary sorting algorithms take much time to sort their suffixes. We found that Sadakane's suffix sorting algorithm is effective in sorting such files. Concerning merging suffix arrays, our algorithm is faster than that of Gonnet, Baeza-Yates and Snider, especially when the merged texts are small, and the difference becomes larger if we can use a large disk cache. Therefore our merging algorithm is well suited to frequent updating of large text databases.

Though our algorithm can merge suffix arrays, it cannot handle deletion of texts. As future work we will develop a deletion algorithm for the suffix array and compare it with other data structures such as the String B-tree.

Acknowledgment

We would like to thank Mr. Masanori Harada, who gave us the HTML files of the ODIN, and Dr. Akio Nishikawa, who helped us to obtain the genome databases. The work of the second author was supported in part by the Grant-in-Aid for Scientific Research on Priority Areas (A), `Advanced Database Systems for Integration of Media and User Environments' of the Ministry of Education, Science, Sports and Culture of Japan.

References

1. J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360-369, 1997. http://www.cs.princeton.edu/~rs/strings/.
2. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital SRC Research Report, 1994.
3. Center for Information Biology, National Institute of Genetics. DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/.
4. A. Crauser and P. Ferragina. External memory construction of full-text indexes. In DIMACS Workshop on External Memory Algorithms and/or Visualization, 1998. http://www.di.unipi.it/~ferragin/Latex/WSA.ps.gz.
5. P. Ferragina and R. Grossi. An external-memory indexing data structure and its applications. Journal of the ACM, 1998. (to appear).
6. G. H. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Algorithms and Data Structures, chapter 5, pages 66-82. Prentice-Hall, 1992.
7. M. Harada. ODIN. http://odin.ingrid.org/.
8. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, October 1993.
9. K. Sadakane. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proceedings of the Data Compression Conference (DCC'98), pages 129-138, 1998.
10. J. Seward. bzip, 1996. http://www.cs.man.ac.uk/arch/people/j-seward/bzip-0.21.tar.gz.
11. M. Yoshikawa, H. Kato, H. Kinutani, and M. Watanabe. The ParaDocs document database system and visual user interface for information retrieval. In Advanced Database Systems for Integration of Media and User Environments '98, pages 81-86. World Scientific Publishing, 1998.
