Querying sparse matrices for Information Retrieval
Roberto Cornacchia
Centrum voor Wiskunde en Informatica
Kruislaan 413, 1090GB Amsterdam, The Netherlands
[email protected]
Abstract. This research hypothesises the array data-model as a key ingredient for the integration of information retrieval and database technology. The main research goals focus on a seamless blending of the efficient evaluation of sparse array computations in the relational domain, with the challenges posed by information retrieval applications.
1 Motivation
Information Retrieval researchers develop methods to assess the degree of relevance of data to user queries. While ideally such a retrieval model could be considered ‘just’ a (somewhat complicated) query for a database system, in practice the researcher attempting to deploy database technology for information retrieval will stumble upon two difficulties. First, database implementations of IR models are still inefficient in runtime and resource utilisation compared to highly optimised custom-built solutions. Second, the set-oriented query languages provided by relational database systems offer a fairly poor abstraction for expressing information retrieval models. Specifically, the lack of explicit representation of ordered data has long been acknowledged as a severe bottleneck for developing scientific database applications [1], and we believe the same problem has hindered the integration of databases and information retrieval.
This research investigates a solution based on the integration of the following ingredients: the Matrix Framework for IR, a well-defined formalism to express IR problems in terms of matrix operations; an interface based on the array data-model; and a mapping to a relational backend, where the computations are ultimately performed. The Matrix Framework for IR, recently presented in [2], maps IR concepts to matrix spaces and matrix operations, providing a convenient logical abstraction that facilitates the design of IR systems. A natural implementation of such a matrix-based formalism is based on the array data-model, which is in turn mapped to the relational domain.
However, this appears to be a viable solution only with specific support for processing sparse arrays. The data representation that matrix spaces offer is highly redundant compared to a set-based representation: although documents contain only a small fraction of the possible terms, all the presence/absence combinations are represented explicitly.
This results in matrices that are typically very large and extremely sparse (densities are easily lower than 0.0001%). The sparse array evaluation presented in this research proposal allows such an integration, which would not be possible in practice otherwise (see [3] for a preliminary study).
This research focuses on the specific optimisation issues raised by the application of the aforementioned techniques to the information retrieval domain. The main research questions can be formulated as follows:
– What are the key issues posed by the processing and the optimisation of sparse arrays on top of a Relational DBMS?
– How do those issues relate to the issues encountered for both dense array and classical set-based processing and optimisation in a Relational DBMS?
– How can IR-specific knowledge be exploited in the optimisation process?
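The redundancy argument made in this section can be illustrated with a minimal sketch. It stores only the non-zero cells of a term-document matrix as rows of a relation and compares their number with the size of the full matrix. Everything in it is an assumption made for illustration: the toy collection, the relation name `td`, the helper names, and the use of SQLite as a stand-in for a relational backend.

```python
import sqlite3

# Toy collection (hypothetical data): each document is a bag of terms.
docs = {
    "d1": ["sparse", "array", "query"],
    "d2": ["array", "database"],
    "d3": ["retrieval", "query", "query"],
}

def build_term_doc_relation(docs):
    """Store only the non-zero cells of the term-document matrix as (term, doc, tf) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE td (term TEXT, doc TEXT, tf INTEGER, PRIMARY KEY (term, doc))"
    )
    for doc, terms in docs.items():
        for term in set(terms):
            conn.execute(
                "INSERT INTO td VALUES (?, ?, ?)", (term, doc, terms.count(term))
            )
    return conn

def density(conn):
    """Return (stored tuples, cells of the full matrix, fraction of non-zero cells)."""
    nnz, n_terms, n_docs = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT term), COUNT(DISTINCT doc) FROM td"
    ).fetchone()
    cells = n_terms * n_docs
    return nnz, cells, nnz / cells

conn = build_term_doc_relation(docs)
nnz, cells, dens = density(conn)
print(f"{nnz} stored tuples vs {cells} matrix cells (density {dens:.2%})")
```

On a toy collection the gap is small, but the number of cells grows with the product of vocabulary size and collection size, whereas the number of stored tuples grows only with the number of distinct terms actually occurring in each document.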
2 Related Work
While information retrieval applications are commonly developed or tested using general purpose programming languages and array-based mathematical tools, or (less commonly) implemented on top of database systems, we are not aware of previous work using the array data-model as a ‘gluing layer’ for information retrieval and database technology. A relevant work on multidimensional arrays in scientific applications is [4], where an algebra for the manipulation of irregular topological structures is applied to the natural science domain. In [5] the authors express the optimisation of sparse matrix operations as a query optimisation problem, confirming the potential of database technology for array processing. Multidimensional database technology is also related, since its cubes, dimensions, facts, and measures are comparable to the concept of multidimensional arrays.
3 Proposal
We propose the combination of the theoretical Matrix Framework for IR with a declarative array-based abstraction over database engines, in order to open up relational technology to IR research. In particular, the evaluation and optimisation aspects that such an integration entails are discussed from a query processing point of view. The key contributions of this proposal are presented in this section, organised as follows: Section 3.1 briefly discusses the possible role of database technology in computationally intensive applications; Section 3.2 introduces the prototype system that constitutes the background for this research; Sections 3.3, 3.4 and 3.5 deal with the storage, the evaluation and the optimisation of multi-dimensional sparse arrays in relational engines, respectively, taking into account IR-specific requirements and challenges.
3.1 The database approach to sparse array computations
Although sparse array computations are not a settled topic in the numerical analysis research field, a large amount of literature and software implementations exists, in particular for two-dimensional arrays (matrices) [6]. This research is highly inspired by such works and aims at using the most interesting results in the presented context. However, such solutions focus on the optimisation of single operations rather than full expressions (for example, countless different implementations exist for the matrix multiplication problem), assume a specific data encoding, or are even tuned for specific hardware. This is in contrast with the ‘database approach’ proposed here, which turns the numerical problem into a query optimisation problem, providing the following potential benefits:
– Data independence: relational expressions are transparent to the physical organisation of data. The data access optimisation problem is taken care of by the relational engine, rather than being bound to specific numerical algorithms.
– Resource utilisation: modern database engines are tuned for effective exploitation of the (possibly limited) hardware resources, and use cache-conscious, CPU-friendly algorithms, becoming more and more attractive for computationally intensive applications.
– Open ‘black boxes’: instead of providing ad-hoc implementations of, say, matrix multiplication or matrix transposition algorithms, express them as a combination of native database primitives (join, selection, etc.). This allows the query optimisation process to take such operations into account within a larger problem and look for the overall best query plan.
3.2 The background: relational processing of dense arrays
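As a bridge from the ‘open black boxes’ argument of Section 3.1 to the mapping described in this section, the following minimal sketch expresses a sparse matrix multiplication purely through native relational primitives (a join followed by a grouped aggregation). It is an illustration under stated assumptions, not the mapping used by the prototype: the relation layout `(i, j, v)`, the helper names `load` and `matmul`, the sample matrices, and the choice of SQLite as backend are all hypothetical.

```python
import sqlite3

def load(conn, name, cells):
    """Create relation name(i, j, v) holding only the non-zero cells of a matrix."""
    conn.execute(f"CREATE TABLE {name} (i INTEGER, j INTEGER, v REAL)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", cells)

def matmul(conn, left, right):
    """C = left x right: join on the shared index, multiply, sum per output cell."""
    query = f"""
        SELECT l.i, r.j, SUM(l.v * r.v) AS v
        FROM {left} AS l JOIN {right} AS r ON l.j = r.i
        GROUP BY l.i, r.j
        ORDER BY l.i, r.j
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
# A = [[1, 0], [0, 2]] and B = [[0, 3], [4, 0]]; zero cells are simply absent.
load(conn, "A", [(0, 0, 1.0), (1, 1, 2.0)])
load(conn, "B", [(0, 1, 3.0), (1, 0, 4.0)])
product = matmul(conn, "A", "B")
print(product)
```

Because the multiplication is an ordinary join and aggregation, the engine is free to choose join order and access paths for it together with the rest of a larger expression. One caveat of this simple encoding is that terms which cancel out would yield explicitly stored zeros, which a fuller treatment would filter.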
The RAM (Relational Array Mapping) system is a prototype tool for mapping arrays to relations and array operations to relational expressions. The current implementation has been designed to support dense arrays only and will be used as a starting point for the research on sparse array mapping and optimisation. RAM defines operations over arrays declaratively in comprehension syntax (see [7]). For example, the expression B = [ log(A(x,y)) | x