A Universal Full Text Index with Access Control and Annotation Driven Information Retrieval

Edgar Chávez, Universidad Michoacana
[email protected]

Eric Sadit Téllez, Universidad Michoacana
[email protected]

Abstract

Full-text databases are tightly linked to the application layer. Currently, IR projects must be integrated in the backend using, at best, a general-purpose, language-independent API. This architecture hinders rapid prototyping. In this paper we present a new approach, a very simple architecture for a general-purpose full-text database. We implemented a standard inverted file index and extended it with extra capabilities. To each stored document we simply add a set of qualifiers (MD5 hashes and keywords) that are algorithmically unrelated to the document content. This makes it possible to control access to documents hierarchically, to improve document categorization iteratively, to add and delete annotations, and to manage document versions. All transactions go through a standard web service interface, which facilitates system integration and testing. We describe a set of applications where our concept can be useful; they encompass the areas where document annotations are relevant. Once stored and annotated with qualifiers, documents can be retrieved by a combination of qualifiers and document content. Additionally, we show our prototype in action and explain how it can be extended to support the storage and retrieval models that have recently appeared in some popular sites.

1. Introduction

Rapid prototyping is a powerful tool to quickly test new ideas, services and research proposals. It has been used in the software industry as a milestone for proof-of-concept developments. In this paper we propose an architecture for a full-text index suitable for use as a rapid prototyping tool, as an access control tool, and as an aid for storing and retrieving annotations. All these facilities are embedded in a unified model, the full-text index.
The design principles of our architecture focus on providing a quick way to test the different components of an information retrieval system, and on supporting user access control and document qualifiers within the index, using the same technology. The central idea is simple, and it is easily implemented with a standard full-text index and standard IR models. We propose to enrich documents with a set of special keywords that we call document qualifiers. Document qualifiers can be interpreted as modifiers that associate special properties with documents. One use of qualifiers is to restrict access to documents at the user or group level. Another is to add information about formats, categories, dates, etc.

Documents are retrieved based on two different similarity operations: qualifier similarity and full-text content similarity. The qualifier similarity can be a simple set of boolean operations or a more complex function, such as a vector model [1]. Ultimately, any IR model can be used in either part of the retrieval. The second part, the full-text content similarity, is independent of the qualifier similarity; its choice is dictated by the information retrieval application. We can use, for example, TF × IDF [1], probabilistic models [1], or PageRank [4]. In our prototype we support boolean operations for the qualifier similarity and cosine similarity for the full-text part.

The system is implemented as a Python application using the SPyRO middleware [8] for distributed information retrieval. It can be seen as a web service or as a web page, and the index server can run on a set of heterogeneous computers under a distributed architecture, while users perceive a single centralized system. Every registered user can push documents and iteratively improve or modify them.
Every user can produce a hierarchy of documents and allow other users to access the hierarchy at any level below the creator. The picture is completed with personalized indices, shared indices, and protected indices (which restrict searches over their documents). Users can annotate documents and retrieve groups of documents by qualifiers, by document content, or by any combination of the two. The recognized document formats are plain text, PDF, PostScript, HTML, RTF and Microsoft Word. Others can be added through the simple plug-in interface of the prototype.

Our Work in Context. Many indexers with special characteristics and features exist in the literature. Notable among them is Google [4], a fault-tolerant, distributed, large-scale hypertextual web search engine. Another example is GLIMPSE [7], a fast and scalable web search software. Harvest [3] is an efficient distributed information gathering architecture and indexer. Lucene [6] is a high-performance text search engine written in the Java programming language.

We propose operations similar to those of a standard Relational Database Management System (RDBMS), where we can join two categories or access a table using a key. RDBMSs can implement complex operations with keys, but they are not suited for full-text retrieval. Furthermore, RDBMSs carry additional overhead in the size of the indices and the storage of tables (records generally have a fixed storage size). Likewise, we cannot disregard the overhead of SQL parsing and of other SQL RDBMS requirements that play no important role in the IR approach. There are many SQL RDBMSs in the market; some of the most popular are PostgreSQL1, MySQL2, Oracle Database3, SQLite4 and Microsoft SQL Server5.

A new set of multimedia information retrieval web sites is based on file qualifiers, often called tags. Some of the most popular are Video Google6, YouTube.com7 and Flickr8. These web applications use tags and text descriptions attached to multimedia entities to index and search multimedia files. Our work is, to the best of our knowledge, the first attempt to manage annotated documents and user access within the same IR model as the documents.

This document is organized as follows.
In Section 2 we present the qualifiers boolean model. Section 3 gives a brief list of possible applications of the model that can be easily implemented. The universal full-text index server prototype is presented in Section 4, followed by future work in Section 5. Finally, Section 6 presents our conclusions.

1 PostgreSQL is a highly scalable, SQL-compliant and mature ORDBMS. It is available at the project web page http://www.postgresql.org
2 MySQL is a very fast RDBMS. Its web page is http://www.mysql.com
3 Oracle is an important enterprise devoted to first-level products. The Oracle website is http://www.oracle.com
4 SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine. It can be downloaded at http://www.sqlite.org
5 Microsoft SQL Server is a product offered by Microsoft Corp. http://www.microsoft.com/sql
6 http://video.google.com
7 http://www.youtube.com
8 http://www.flickr.com
2. Qualifiers boolean model

A document is represented as $D = (K, T)$, where $K$ is the set of all special keywords (the "qualifiers", "properties" or annotations) of the document $D$, and $T$ is the set of terms in the content of the document. Let $m$ be the number of qualifiers in the system and $n$ the number of terms in the collection. The qualifiers of document $i$ are represented as the vector of booleans $K_i = (k_1^i, k_2^i, k_3^i, \ldots, k_m^i)$, and the vector $T_i = (t_1^i, t_2^i, t_3^i, \ldots, t_n^i)$ describes the terms of the document. Then $(k_1^a \wedge k_1^b, k_2^a \wedge k_2^b, \ldots, k_m^a \wedge k_m^b)$ filters the results down to the documents that have all the qualifiers, while $(k_1^a \vee k_1^b, k_2^a \vee k_2^b, \ldots, k_m^a \vee k_m^b)$ obtains the documents that have any of the qualifiers. The model can easily be extended to support more complex expressions, and ultimately full Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF) (or any boolean expression), or a different IR model. The similarity between two documents $D$ and $Q$ is:

\[
\mathrm{sim}(D, Q) = \mathrm{sim}(T_D, T_Q) \times \Big[ \sum_{j=1}^{m} k_j^D k_j^Q \ge a \Big] \tag{1}
\]

where $a$ is 1 for the union operation and $a = \sum_{j=1}^{m} k_j^Q$ for the intersection. The bracket $[A \ge B]$ returns 1 if $A \ge B$ is true and 0 otherwise. The number of qualifiers in the collection is represented by $m$. The function $\mathrm{sim}$ over the two vectorized documents $T_D$ and $T_Q$ is the similarity function used by non-qualified models. In the prototype we implement a TF × IDF similarity ranking, so the similarity function is as follows:

\[
\mathrm{sim}(D, Q) = \frac{\sum_{i=1}^{n} w_i^D w_i^Q}{\sqrt{\sum_{i=1}^{n} (w_i^D)^2}\,\sqrt{\sum_{i=1}^{n} (w_i^Q)^2}} \times \Big[ \sum_{j=1}^{m} k_j^D k_j^Q \ge a \Big]
\]

where $w_i^D$ is the weight of term $i$ in document $D$. (We are aware that these IR models are folklore; for a self-contained treatment the reader may consult [1].) $D$ and $Q$ are used in place of $T_D$ and $T_Q$ without distinction to simplify the notation and to emphasize that $D$ and $Q$ are supersets of $T_D$ and $T_Q$, respectively.
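The boolean filter of Equation (1) can be illustrated with a minimal Python sketch. It is not the code of the prototype: the function names, the sparse representation (a set for the qualifiers that are true, a dictionary for the term weights) and the example qualifiers are assumptions made for illustration, and the content part uses the standard cosine measure over precomputed weights.

```python
import math


def cosine(w_d, w_q):
    """Cosine similarity between two sparse weight vectors (term -> weight)."""
    dot = sum(w * w_q[t] for t, w in w_d.items() if t in w_q)
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


def qualifier_gate(k_d, k_q, mode="all"):
    """Bracket of Equation (1): 1 if the document passes the filter, else 0.

    k_d and k_q are the sets of qualifiers that are true in the boolean
    vectors of the document and of the query.
    """
    shared = len(k_d & k_q)                 # sum over j of k_j^D * k_j^Q
    a = len(k_q) if mode == "all" else 1    # intersection vs. union threshold
    return 1 if shared >= a else 0


def qualified_sim(doc, query, mode="all"):
    """doc and query are (qualifier set, term-weight dict) pairs."""
    k_d, w_d = doc
    k_q, w_q = query
    return cosine(w_d, w_q) * qualifier_gate(k_d, k_q, mode)


# Hypothetical data: qualifier names and weights are illustrative only.
doc = ({"user://alice", "cat://ir"}, {"index": 0.5, "text": 0.5})
query = ({"user://alice", "cat://db"}, {"index": 1.0})

print(qualified_sim(doc, query, mode="all"))  # filtered out: one qualifier is missing
print(qualified_sim(doc, query, mode="any"))  # ranked by content similarity alone
```

Read literally, the intersection threshold is 0 when the query carries no qualifiers, so every document passes the filter, while the union threshold of 1 rejects every document in that case. An application would probably want to decide explicitly how to treat unqualified queries.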
2.1 Qualifier vector model
The boolean model is a strict document filter. It is not well suited to retrieving qualifiers with relaxed accuracy. The boolean perspective is similar to the schema used by RDBMSs. We can use a different, richer approach to handle qualifiers, since many applications may benefit from a relaxed treatment of them. A dot product may be better suited to this objective. In the simplest setup we set the weight of the $k$-th qualifier to 1 if the qualifier is present and to 0 otherwise. The dot product for the qualified ranking is written as:

\[
\mathrm{sim}(D, Q) = \mathrm{sim}(T_D, T_Q) \times \frac{\sum_{i=1}^{m} k_i^D k_i^Q}{\sqrt{\sum_{i=1}^{m} (k_i^D)^2}\,\sqrt{\sum_{i=1}^{m} (k_i^Q)^2}}
\]

Or, specifically for the prototype, the similarity function is as follows:

\[
\mathrm{sim}(D, Q) = \frac{\sum_{i=1}^{n} w_i^D w_i^Q}{\sqrt{\sum_{i=1}^{n} (w_i^D)^2}\,\sqrt{\sum_{i=1}^{n} (w_i^Q)^2}} \times \frac{\sum_{i=1}^{m} k_i^D k_i^Q}{\sqrt{\sum_{i=1}^{m} (k_i^D)^2}\,\sqrt{\sum_{i=1}^{m} (k_i^Q)^2}}
\]

where $n$ is the number of distinct terms in the collection and $m$ the number of distinct qualifiers in the collection. The particular choice of IR model for either the qualifier part or the content part of our model is application dependent. We want to emphasize the joint model rather than a particular application.
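As an illustration of the relaxed ranking, the following Python sketch multiplies a content similarity by a qualifier similarity, both computed as standard cosine measures, with binary qualifier weights as in the simplest setup described above. The function names and the example qualifiers are hypothetical, and the sketch should not be read as the prototype's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (key -> weight)."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def binary_weights(qualifiers):
    """Weight 1 for every qualifier that is present, 0 (absent key) otherwise."""
    return {q: 1.0 for q in qualifiers}


def vector_qualified_sim(doc, query):
    """doc and query are (qualifier set, term-weight dict) pairs."""
    k_d, w_d = doc
    k_q, w_q = query
    content = cosine(w_d, w_q)
    qualifiers = cosine(binary_weights(k_d), binary_weights(k_q))
    return content * qualifiers


# Hypothetical data: the document shares one of the two query qualifiers.
doc = ({"cat://ir", "lang://en"}, {"index": 1.0})
query = ({"cat://ir", "lang://es"}, {"index": 1.0})

# A strict "all qualifiers" filter would discard this document; here it is
# kept with a reduced score because the qualifier vectors overlap partially.
print(vector_qualified_sim(doc, query))
```

Under this scheme a document that shares no qualifier with the query still scores 0, so the qualifier part keeps acting as a filter at the extreme while grading partial matches in between.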
3. Applications

The applications are motivated by the need for access control and by annotation-based retrieval. The possible uses of the Universal Index range from single-user desktop utilities to full multiuser server applications. In the personal sphere, the Universal Index could be used to index and annotate entire filesystems, mail clients, instant messengers, Integrated Development Environments (IDEs), bookmark managers, application managers, address books, etc.

Full-text indexed file systems are very popular tools these days. An example of a desktop search engine is Google Desktop9, offered as a free tool for Windows users and as a commercial tool for enterprises. Beagle10 brings desktop search to GNOME11 VFS and other applications; it uses SQLite to store the metadata of the files as well as the indexes. Beagle uses Mono12, so it can be accessed remotely via web services. Beagle is open source and runs on many architectures, probably as many as its dependencies do, but only Linux13 is supported. Apple14 presents Spotlight, a new filesystem full-text indexer, in Mac OS X Tiger, the new version of its popular operating system. Spotlight facilities are incorporated into many desktop applications to find files, emails, contacts, images, calendars and applications. Spotlight is exclusive to Mac OS X. Microsoft Windows Vista15 will ship with a new filesystem full-text search based on a reduced version of its SQL Server; it will have functionality similar to the previous ones, and it will surely run only on the Vista platform. Our application can offer the same functionality plus annotation-based retrieval and access control, and it is a portable, standalone application.

The Universal Index could be used to perform many tasks of a database filesystem. It provides searching and indexing, including the storage of metadata as qualifiers of the documents. Furthermore, we can filter data using the similarity functions shown in Section 2; the filter can apply to the document content, to the metadata, or to both. Retrieval can be exact or approximate, yielding more powerful results than the tools above. Another useful extension is to provide full SQL additions to the queries, using an SQL RDBMS such as SQLite, MySQL or PostgreSQL. The SQL extension is useful to structure some qualifiers as SQL tables and to perform operations that are hard or impractical to achieve without the structure of an SQL database.

The indexer can be used in instant messengers to search chat sessions, and IDEs (and version managers) can use it to search quickly in big projects.

9 http://desktop.google.com/
10 http://beaglewiki.org/
11 http://gnome.org/
12 Mono is a platform for .NET, a Microsoft framework, available at http://www.mono-project.com
13 Linux is a popular, free, open source, multi-architecture, UNIX-like operating system. Linux is available on the Internet at many sites, for example http://www.debian.org
14 The Apple Inc. website is http://www.apple.com
15 Vista will be the new version of the Windows operating system launched by Microsoft, http://www.microsoft.com

Bookmark managers can gain some attractive features if the bookmarked pages are saved, indexed and classified: automatic classification of bookmarks, searching over them, searching for related data over the Internet, etc. The qualifiers of the bookmarks can be simple time modifiers, categories, popularity indicators, etc. Application managers can be used to ease access to applications and to categorize them. An application can then be launched using its description, like “Text editor” or “Programming editor”, or even using qualifiers such as “A good text editor” or “The new text editor”. Other applications can use the API to search over their help documents.

Other important applications run on the server side or are provided by a server: blogs, web mail providers, forum sites, etc. Gmail16 groups emails in a “conversation” style, so that emails that appear to be related to the same conversation (e.g. by their subject) are grouped to represent their semantic affiliation. Gmail supports searching over the entire pool of mails, which allows finding a mail or a conversation by its content. If we suppose that every user has an identifier, we can treat the identifier as an email qualifier to filter searches by user. If the user explicitly requests it, we can use group qualifiers to let other users search some of the emails, and to search over the entire group. Searches over a set of users incur no penalty from the number of users in the group (if every user had a personal index, we would have to search every index and then join the results).

In general, applications from the personal level to the enterprise level can benefit from the access control feature. Only the key of a document can be used to access the document or to make it searchable. A document can have several access controls; they are overridden only by a public access flag, which gives instant full access.
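The access control idea can be sketched in Python with a single shared inverted index in which qualifiers have posting lists just like terms. This is a toy illustration and not the prototype's code: the class name, the `user://` and `group://` qualifier schemes and the sample documents are assumptions. Terms and qualifiers are stored under separate key namespaces so that text inside a document cannot forge an access qualifier.

```python
from collections import defaultdict


class QualifiedIndex:
    """Toy shared index: terms and qualifiers both map to sets of document ids."""

    def __init__(self):
        # Keys are ("t", term) or ("q", qualifier); the tag keeps the two
        # namespaces apart even when a term looks like a qualifier.
        self.postings = defaultdict(set)

    def add(self, doc_id, text, qualifiers):
        for term in text.lower().split():
            self.postings[("t", term)].add(doc_id)
        for qualifier in qualifiers:
            self.postings[("q", qualifier)].add(doc_id)

    def search(self, terms, access_qualifiers):
        """Documents that contain every term and carry at least one of the
        access qualifiers held by the searching user."""
        visible = set()
        for qualifier in access_qualifiers:
            visible |= self.postings.get(("q", qualifier), set())
        hits = set(visible)
        for term in terms:
            hits &= self.postings.get(("t", term.lower()), set())
        return hits


index = QualifiedIndex()
index.add(1, "meeting notes budget", {"user://alice"})
index.add(2, "budget forecast", {"user://bob", "group://finance"})
index.add(3, "budget draft", {"user://carol"})
index.add(4, "user://alice budget", {"user://dave"})  # text cannot grant access

# Alice belongs to the finance group, so she holds two access qualifiers.
print(sorted(index.search(["budget"], {"user://alice", "group://finance"})))
print(sorted(index.search(["budget"], {"user://carol"})))
```

Because all users share one index, a group search is a union of a few posting lists followed by the usual term intersection, rather than one search per personal index followed by a join.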
4. The Universal Full Text Index Server

The prototype shows the model in action. It can receive a set of URIs to be retrieved, and it supports pushed files. Pushed files are stored in a known format. The prototype supports several formats and offers a simple plug-in interface to add more. Additionally, we can add qualifier keywords to a document, and a set of keywords can be added to the system in an automated way. Keywords can be used to add a format, a category, a date or any other useful information. In addition, the user can qualify documents with any useful information, related or unrelated to the document's content, such as a security flag or accessor qualifiers.

Qualifiers can be written in many ways, but it is useful to adopt a more representative format such as URIs, since any resource can be identified using URIs [2]. For example, we can insert a category qualifier with cat://category[/subcategory]. Another example is to find documents by date, using a special URI that describes a date: date://year/month/day. This discipline supports some range and approximate searches with low overhead.

16 Gmail (http://www.gmail.com) is a popular web mail service provided by Google Inc.
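The following Python sketch suggests why URI-shaped qualifiers make hierarchical and range searches cheap. It is only an illustration under stated assumptions: the helper names are hypothetical, and zero-padding the date components is our assumption (the date://year/month/day form itself does not prescribe it), chosen so that the string order of date qualifiers matches chronological order.

```python
import bisect
from datetime import date


def date_qualifier(d):
    """Build date://year/month/day, zero-padded (an assumption) so that
    lexicographic order equals chronological order."""
    return "date://{:04d}/{:02d}/{:02d}".format(d.year, d.month, d.day)


def category_qualifier(category, subcategory=None):
    """Build cat://category[/subcategory]."""
    uri = "cat://" + category
    return uri + "/" + subcategory if subcategory else uri


def under(sorted_qualifiers, prefix):
    """Qualifiers equal to `prefix` or below it in the URI hierarchy."""
    return [q for q in sorted_qualifiers
            if q == prefix or q.startswith(prefix + "/")]


def date_range(sorted_qualifiers, start, end):
    """Date qualifiers between two dates (inclusive), found by binary search
    over the sorted qualifier vocabulary."""
    lo = bisect.bisect_left(sorted_qualifiers, date_qualifier(start))
    hi = bisect.bisect_right(sorted_qualifiers, date_qualifier(end))
    return sorted_qualifiers[lo:hi]


# Hypothetical qualifier vocabulary of a small collection.
vocabulary = sorted({
    date_qualifier(date(2006, 1, 15)),
    date_qualifier(date(2006, 2, 3)),
    date_qualifier(date(2006, 11, 30)),
    category_qualifier("science"),
    category_qualifier("science", "physics"),
    category_qualifier("scifi"),
})

print(under(vocabulary, "cat://science"))   # the category and its subcategories
print(under(vocabulary, "date://2006"))     # every date in the year 2006
print(date_range(vocabulary, date(2006, 1, 1), date(2006, 2, 28)))
```

Matching on `prefix + "/"` rather than on the bare prefix keeps `cat://scifi` out of the results for `cat://science`. The qualifiers returned by either helper would then be expanded into a union query over their posting lists.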
4.1 Web Services Interface
The prototype is built over SPyRO [8], which supports a Web Services Interface. The Web Services Interface provides access from remote applications, so anyone with an Internet connection is capable of controlling the prototype. The remote controller can add documents to the collection, and qualifiers as well. Searches can be performed through the web service API, and the results can be returned in a variety of formats. A Web Services client can use the results in any application, not just to present them in web pages.
4.2 Web Interface
The Web Interface helps us test the prototype without using the Web Services Interface directly. It provides access to the most useful operations of the prototype:

1. Push files. Sends a file through a web form.
2. Harvesting of documents associated with URIs. A plug-in for the URI scheme must exist (currently http, ftp, and every scheme supported by the Python urllib2 [5]). The http protocol receives special optimization because it is the most popular and the principal source of documents.
3. Direct insert. Sends text directly to the server to be indexed.
4. Change qualifiers. Modifies the qualifiers of a document.

Figure 2 shows the prototype web page. It can be accessed at the Internet address http://www.spyron.org.
5. Future Work

The web interface breaks many web usability principles and does not follow a formal set of web usability guidelines. To solve these problems we will perform usability tests and fix the issues they reveal, giving special attention to easy navigation and accessibility. Regular expressions are powerful tools that will be implemented in the next version of the system, because many applications and useful behaviors can be achieved with them. Another text-based improvement is support for case-sensitive characters in the URI index. The model can easily be extended to support different sets of qualifiers (e.g. access control and annotations kept separate) with a different IR model for each. Access control, for example, is more likely to be implemented as a strict boolean model.
Figure 1. Search and help web pages in the web user interface. (a) Search results; qualifiers are part of the query. (b) Help.
Figure 2. Push files and plain text web pages of the Universal Index. (a) Sending a file. (b) Sending a text. Both interfaces allow the addition of qualifiers.
To increase the functionality, some qualifiers can be interpreted as functions applied to other qualifiers, yielding expressions that represent more complex queries; for example, functions that compare qualifiers, such as q1 > q2 or q1 < q3, or any other set of functions.
6. Conclusions

In this paper we described a qualifier-enhanced Information Retrieval system that yields qualifier-driven retrieval, capable of individualizing an index for a single user, a group, several groups, or all users. The qualifier model can use qualifiers not just for access control: results can also be filtered using a fast and elegant approach. We presented a prototype that implements the qualifier model. The prototype has two interfaces: a web page interface and a web service interface. People who just want to try the system can do so through the web interface, and developers can create new applications using the web services API. Furthermore, our model covers (within the IR framework, and without the optimization level of commercial systems) some of the most popular web applications in the market, such as Video Google, Flickr and YouTube.
References

[1] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[2] T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI). RFC 2396, 1998.
[3] C. M. Bowman, D. R. Hardy, D. P. Wessels, M. F. Schwartz, P. B. Danzig, and U. Manber. Harvest: A scalable, customizable discovery and access system, Apr. 20 1995.
[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine, 1998.
[5] Python Homepage. http://www.python.org, August 1995.
[6] Apache Lucene. http://lucene.apache.org/java/docs.
[7] U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 23–32, San Francisco, CA, USA, 17–21 1994.
[8] E. S. Téllez, E. L. Chávez, and J. Contreras-Castillo. SPyRO: Simple Python Remote Objects.