Bit-Sliced Index Arithmetic

Denis Rinfret
UMass/Boston
Dept. of CS, UMass/Boston
Boston, MA 02125-3393
819-376-3691
[email protected] .edu

Patrick O'Neil
UMass/Boston & Microsoft Research
Dept. of CS, UMass/Boston
Boston, MA 02125-3393
617-354-6460
p [email protected]

Elizabeth O'Neil
UMass/Boston & Microsoft Research
Dept. of CS, UMass/Boston
Boston, MA 02125-3393
617-354-6460
[email protected]

Research for this paper was supported by NSF Grant IRI 97-11374 at UMass/Boston. Microsoft Research, where Patrick O’Neil and Elizabeth O’Neil spent their Sabbatical Year, also provided support.

ABSTRACT
The bit-sliced index (BSI) was originally defined in [ONQ97]. The current paper introduces the concept of BSI arithmetic. For any two BSI’s X and Y on a table T, we show how to efficiently generate new BSI’s Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y, and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min) (see [OO00, DB2SQL] for definitions of these clauses).

Another contribution of the current paper is to generalize BSI range restrictions from [ONQ97] to a new non-Boolean form: to determine the top k BSI-valued rows, for any meaningful value k between one and the total number of rows in T. Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms. A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term-Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing we propose for native database operations.
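
To make the meaning of Z = X + Y concrete, the following is a minimal sketch of bit-sliced addition, not the algorithm as given in the paper. It assumes non-negative values, models each bitmap as a Python integer (bit r set means row r has a 1 in that slice), and stores a BSI as a list of such bitmaps, least significant slice first. The names bsi_from_values, bsi_to_values, and bsi_add are hypothetical helpers introduced only for this illustration. The addition is a ripple-carry over slices, so every row is processed at once with bitwise AND, OR, and XOR.

```python
from typing import List

# Assumption: a bitmap is a Python int (bit r <-> row r) and a BSI is a
# list of bitmaps, where slices[i] holds bit i of every row's value.
BSI = List[int]

def bsi_from_values(values: List[int]) -> BSI:
    """Build a BSI from non-negative per-row values (hypothetical helper)."""
    nbits = max(values, default=0).bit_length()
    slices = [0] * nbits
    for row, v in enumerate(values):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def bsi_to_values(slices: BSI, nrows: int) -> List[int]:
    """Decode a BSI back into per-row values (hypothetical helper)."""
    return [
        sum(((s >> row) & 1) << i for i, s in enumerate(slices))
        for row in range(nrows)
    ]

def bsi_add(x: BSI, y: BSI) -> BSI:
    """Z = X + Y by ripple-carry addition over the bit slices."""
    z: BSI = []
    carry = 0  # bitmap of rows carrying a 1 into the next slice
    for i in range(max(len(x), len(y))):
        a = x[i] if i < len(x) else 0
        b = y[i] if i < len(y) else 0
        z.append(a ^ b ^ carry)                  # sum bit for each row
        carry = (a & b) | (carry & (a ^ b))      # carry-out for each row
    if carry:
        z.append(carry)                          # result needs one more slice
    return z

# Two plain bitmaps are one-slice BSIs; adding them gives the per-row
# multiplicities of a UNION ALL of the two row sets.
rows_a = 0b0110  # rows 1 and 2 qualify
rows_b = 0b0011  # rows 0 and 1 qualify
multiplicity = bsi_to_values(bsi_add([rows_a], [rows_b]), nrows=4)
```

In this sketch, row 1 appears in both bitmaps and so ends up with multiplicity 2, while rows 0 and 2 have multiplicity 1 and row 3 has multiplicity 0. The cost is a fixed number of bitmap operations per slice, independent of how many rows carry a 1.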

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA. Copyright 2001 ACM 1-58113-332-4/01/05 $5.00

1. INTRODUCTION
The bit-sliced index (BSI) was originally defined in [ONQ97], where it was demonstrated how to use a BSI representing column quantities to evaluate SQL aggregate queries (specifically, SUM queries), and to impose range restrictions in a SQL WHERE clause. In the current work, we introduce the concept of BSI arithmetic: addition, subtraction, and min, and show how such BSI operations provide a natural way to determine results of SQL clauses UNION ALL, EXCEPT ALL,
and INTERSECT ALL (see [OO00, DB2SQL]), where row sets resulting from subqueries can be combined into multisets (also called bags) of rows: that is, sets with duplicates permitted. For example, Query (1.1) below conforms to ANSI Standard SQL-99 and executes in Microsoft SQL Server to provide a multiset result:

(1.1)   SELECT COUNT(*) CT, PRID
        FROM ( SELECT PRID FROM T WHERE Col_1 = const_1
               UNION ALL
               SELECT PRID FROM T WHERE COL_2 = const_2
               UNION ALL
               . . .
               SELECT PRID FROM T WHERE COL_M = const_M) AS NEW_T
        GROUP BY PRID;

Query (1.1) retrieves the various COUNT(*) multiplicities with corresponding primary key identifiers PRID, from the UNION ALL of the Equal Match predicates in the FROM clause of the outer Select. The GROUP BY PRID would normally select only one row in each group of T, but in this case it selects the appropriate multiplicities of individual rows arising from the UNION ALL. No current database product keeps track of these multiplicities using BSI addition, but we will show that BSI addition is extremely efficient for this purpose. We can also construct examples of queries where multiplicities are subtracted, using EXCEPT ALL, and where the minimum multiplicity of two multisets is determined, using INTERSECT ALL. Note that in the case of EXCEPT ALL and INTERSECT ALL, any negative numbers in the result BSI must be replaced with zeros, since rows do not appear with negative multiplicities in SQL.

The current paper also generalizes BSI range restrictions to a non-Boolean form: instead of finding all rows in a table T with a BSI value greater than some constant C (however many rows that might be), we show how to efficiently determine the top k BSI-valued rows, 1 = c1, C a value column having a BSI, can be found quite efficiently. In Algorithm 4.1 below we provide a variation of this algorithm to find k rows that have the maximum C values in T.

Finding the rows with the k largest values in a BSI.
Given a BSI, $S = S^P S^{P-1} \cdots S^1 S^0$ over a table T and a positive integer k