A Bulk Synchronous Parallel approach for indexing large hierarchical taxonomy datasets

Tommaso Teofili

August 25, 2011

Abstract

Working with structured data in information retrieval can be tricky, because such information is usually stored in systems like DBMSs, which can represent relations between conceptually distinct entities in the dataset, while modern search engines are based on flat representations of data where entities (documents) simply consist of streams of unstructured information (such as text, audio or video). This inherent difference often shows up in industry practice, as both storage and search systems have to coexist and respond to users' information needs over the same data according to their respective capabilities; a software layer responsible for keeping the information synchronized between the two systems is therefore often needed, which involves crunching huge numbers of records, files, etc. This article describes an efficient, parallel way of transforming hierarchical information about taxonomies, stored in a MongoDB based storage system, into flat information to be indexed into an Apache Solr based search server. The approach leverages the parallel programming paradigm called Bulk Synchronous Parallel, implemented using the Apache Hama BSP framework.

1 Introduction

A common scenario when dealing with search engines is one where different systems work together to provide good accuracy and a satisfactory user experience. More specifically, the search engine works as (or can be assimilated to) an inverted index, where domain entities and their relations are usually transformed into a simpler, flattened structure that better addresses search needs. However, some structured information may have to be preserved in its original form in systems other than the information retrieval platform, in order to support other business processes; such scenarios may involve the integration of multiple software systems like CMSs or ontology / thesaurus / taxonomy management systems. Such structured information also often decorates search results (e.g. document tags coming from a hierarchical taxonomy) to provide context, which is often automatically extracted.


Apache Solr

Apache Solr [5] is an open source search platform based on Apache Lucene [3]; it provides a wide set of features for indexing, search, hit highlighting, faceting and more, and it offers an HTTP API to perform most operations, together with a convenient set of bindings to create and consume such HTTP calls. One of Solr's most important features is the ability to scale dynamically; this is achieved either by manually setting up master / slave architectures, eventually with multiple shards, or by leveraging the SolrCloud architecture, a dynamic Solr cluster setup which provides high availability and fault tolerance. A Solr instance / cluster can host multiple sets of homogeneous documents, called "collections". Each collection maps to at least one Lucene index per Solr instance (but may be spread / sharded among different instances). The Lucene index is, in the end, the data structure holding the inverted index used for indexing and search.

MongoDB

MongoDB is an open source document database, part of the NoSQL ecosystem, where each record is represented as a JSON document according to a dynamic schema. It provides high availability and scaling through its clustering capabilities, where clusters are formed by multiple replicas and shards.

The problem

A search engine system (built on Apache Solr) already provides indexing and search over a huge knowledge base of financial articles.
The search system has to be enhanced to index data from a structured taxonomy managed by another system, in order to:

• show the taxonomy items that have been used to label specific documents in the search results

• search over all the taxonomy items' metadata

• perform a "join search", finding the metadata of taxonomy items which have been used to label documents resulting from a specific query

The first requirement means that at indexing time the article text has to be decorated with "tags" coming from the taxonomy; this can be achieved either manually, via human intervention, or automatically, via a "concept tagging" algorithm fed with taxonomy items and applied to the article text. The second requirement implies that the taxonomy items have to be indexed themselves too, together with all their metadata, so that such information can be retrieved separately from the financial articles. The latter is an RDBMS-like requirement, but it is often the case in the industry that search engines have to deal with such join searches, which take into account and retrieve results from heterogeneous datasets. The taxonomy is structured as a tree of 10 million nodes and is managed by a proprietary system running on top of MongoDB.
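To make the three requirements concrete, the following sketch shows one possible flat layout for the two document types and what the "join search" amounts to. All field names (id, type, text, tags, path, label) and helper functions are illustrative assumptions, not the actual schema; on the search side, Solr's join query parser covers this pattern at query time.

```python
# Hypothetical flat document layouts for the two entity types
# (field names are illustrative, not taken from the real schema).

def make_article_doc(article_id, text, tag_ids):
    """A financial article decorated with taxonomy tags at indexing time."""
    return {"id": article_id, "type": "article", "text": text, "tags": tag_ids}

def make_taxonomy_doc(item_id, path, metadata):
    """A taxonomy item indexed with its own metadata, searchable on its own."""
    doc = {"id": item_id, "type": "taxonomy", "path": path}
    doc.update(metadata)
    return doc

article = make_article_doc("a1", "mortgage rates are rising", ["t42"])
item = make_taxonomy_doc("t42", "/finance/personal/loans", {"label": "Loans"})

# A "join search" then means: find taxonomy docs whose id appears in the
# tags field of articles matching a query.
matching_tags = {t for d in [article] if "mortgage" in d["text"] for t in d["tags"]}
joined = [item] if item["id"] in matching_tags else []
```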


The challenge is to extract the taxonomy items with all their metadata from MongoDB and index them in a proper way into Apache Solr, making it possible to satisfy the given requirements. This poses two main problems:

1. how to extract and index such a huge dataset efficiently

2. how to structure the taxonomy items to be indexed in Solr

The former challenge should be quite clear, as taxonomy data is subject to change, and changing a taxonomy item will likely impact other items depending on it: if the name of the taxonomy node /finance/personalfinance changes to /finance/personal, all its descendants change as well (e.g. /finance/personal/loans, /finance/personal/creditcards), and therefore the taxonomy items in MongoDB and in Solr need to be kept up to date, perhaps with a reasonable delay of, e.g., a few minutes. The latter deals with how items are managed in a tree structure versus a flattened structure: MongoDB provides a hierarchical structure, while Solr provides a flattened one, as each document is just a collection of fields and no relationship concept exists there, at least in the document model. Using Solr's out of the box capabilities to index information coming from external sources (aka the DataImportHandler) is not possible in this case, as integration with MongoDB is not currently supported and, regardless of that, the performance cost of such a huge dataset import would be too high to be handled by Solr itself anyway. This leads to the conclusion that a separate application / connector has to be developed, so that neither Solr nor MongoDB is impacted by this potentially long task beyond what cannot be delegated: exporting the taxonomy, for MongoDB, and indexing the resulting documents, for Solr.
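The rename cascade can be sketched with a toy materialized-path flattening (node ids, field names and the `flatten` helper are all hypothetical): the full path of a node is rebuilt by walking parent links upward, so renaming one node silently changes the flattened form of every descendant, which is exactly why all of them must be re-indexed.

```python
def flatten(node_id, nodes):
    """Rebuild the full path of a node by walking parent links upward,
    producing a flat document suitable for a search index."""
    parts = []
    cur = node_id
    while cur is not None:
        node = nodes[cur]
        parts.append(node["name"])
        cur = node["parent"]
    return {"id": node_id, "path": "/" + "/".join(reversed(parts))}

# Toy taxonomy: renaming "personalfinance" to "personal" changes the
# flattened path of every descendant.
nodes = {
    "n1": {"name": "finance", "parent": None},
    "n2": {"name": "personalfinance", "parent": "n1"},
    "n3": {"name": "loans", "parent": "n2"},
}
print(flatten("n3", nodes)["path"])   # /finance/personalfinance/loans
nodes["n2"]["name"] = "personal"
print(flatten("n3", nodes)["path"])   # /finance/personal/loans
```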

2 A BSP based approach

A Bulk Synchronous Parallel (BSP) [6] approach is used for the task of transforming the MongoDB taxonomy export into Solr documents to be indexed. BSP is a bridging model for parallel computation which provides an abstraction for how to design parallel algorithms and how parallel processors should work together; the model was published in 1990 and has seen significant use over time, but Google's adoption of BSP in 2010 for Pregel [4], their large scale graph processing platform, gave it a remarkable boost. That, in turn, originated open source projects dedicated to graph processing, like Apache Giraph [1], but also more general purpose platforms, like Apache Hama [2].

BSP in a nutshell

The Bulk Synchronous Parallel model defines a set of processors as the elaboration units; they have access to a fast local memory and can communicate with each other in an MPI fashion (one sided communication)

through a bus they are all connected to. A BSP algorithm is composed of a number of so called supersteps that get executed concurrently by, most often, all the available processors. Each superstep is composed of:

• a local computation phase: each processor is allowed to access only its local memory and execute operations locally

• a communication phase: each processor can send or receive messages to or from other processors (also called peers)

• a barrier synchronization phase: all the processors entering the barrier express their will to terminate the superstep and wait for the communications to finish; after this phase a new superstep can be executed

Apache Hama in a nutshell

Apache Hama is a generic BSP platform built on top of Hadoop MapReduce and HDFS which provides a generic pure BSP API, a vertex centric BSP API for graph processing, and some machine learning algorithm implementations based on the Bulk Synchronous Parallel model. It allows deploying and executing such parallel BSP algorithms on many machines as well as on a single computer. It uses HDFS as the default input and output resource for reading and writing data, and Apache Zookeeper for synchronization.

BSP taxonomy indexing

For this particular problem BSP seems to fit well, as the model has proved to perform well, better than MapReduce, on highly iterative algorithms and graph processing algorithms. The taxonomy data exported by MongoDB is a tree, and therefore also a graph, so an interesting approach would be to use the same vertex centric approach proposed in the Pregel paper [4] to process the taxonomy items in the tree. Unfortunately this would mean performing a set of "online" operations on MongoDB, which is not acceptable in the given scenario, as it would put too much load on it.
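The three superstep phases (local computation, communication, barrier) can be illustrated with a minimal sequential simulation; a real framework such as Apache Hama runs the peers in parallel, and the `compute` function and message-routing scheme below are toy assumptions, not Hama's API.

```python
# Minimal round-based simulation of the BSP model: each superstep runs a
# local computation on every processor, then delivers the messages sent
# during it; the implicit barrier is the end of the round.

def run_supersteps(states, compute, rounds):
    n = len(states)
    inboxes = [[] for _ in range(n)]
    for _ in range(rounds):
        outboxes = [[] for _ in range(n)]
        for pid in range(n):
            # local computation phase: read own state and inbox,
            # queue outgoing messages
            states[pid], sent = compute(pid, states[pid], inboxes[pid])
            for dest, msg in sent:
                outboxes[dest].append(msg)
        # communication + barrier: messages become visible only
        # in the next superstep
        inboxes = outboxes
    return states

# Example: each processor forwards its running maximum to the next one;
# after enough supersteps every processor knows the global maximum.
def compute(pid, state, inbox):
    state = max([state] + inbox)
    return state, [((pid + 1) % 3, state)]

print(run_supersteps([3, 7, 1], compute, 3))  # [7, 7, 7]
```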
In the chosen approach MongoDB is instructed to extract a set of views on the taxonomy tree by depth, in a breadth first like way, so that the first export file will contain nodes with depth(node) < 2, the second export will contain nodes with depth(node) < 3, and so forth, until there are no more leaves to be exported in the tree; note that the numbers may change so that more than one depth level gets exported at once. Unfortunately, in the given real life industry scenario, this constraint makes everything much harder, as for each node the whole list of parents may have to be rebuilt backwards, so that each node may have to be traversed a number of times proportional to the product of the breadth of its upper depth level by the breadth of its lower depth level. Moreover, nodes have to be handled while traversing the exported file, which is quite costly. Apache Hama is chosen to implement the algorithm below, as it provides a generic and flexible BSP API. The MongoDB export files can be either placed


into the local file system or on a distributed one like HDFS; the algorithm will ingest one file at a time in ascending depth order (so the leaves' export file will be the last to be processed). The first part of the algorithm consists of spreading the lines of each file among the different processors (whether they are local cores or remote machines does not matter for the sake of the algorithm); this will be done iteratively for each export file. Each processor will enter a superstep and parse the assigned nodes (for the sake of simplicity, assuming each line in the export is a node in the tree), extracting the metadata and keeping the nodes' and branches' state in local memory (local computation); then, for each node missing its parent (because it was assigned to another processor), a request for the parent with the given id will be broadcast (communication phase). After synchronization, a new superstep will be started, used to send, join and receive the missing parents. The above couple of supersteps will be executed iteratively for all the export files. Once all the exports have been processed, the final superstep will consist of transforming the subtrees into Solr documents and either sending them to a Solr instance / cluster or writing them to disk in one of the file formats supported by Solr (e.g. XML, JSON) to be indexed later in some other manner (e.g. via a bash script).
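The first superstep described above can be sketched as follows, simulated sequentially with two "processors". The line format (id, parent id, name), the round-robin partitioning and the helper names are all illustrative assumptions; in the real implementation this runs inside Hama's BSP API and the requests are actual peer messages.

```python
# Sketch of the parse superstep: each processor parses its share of the
# export lines, keeps the nodes it owns in local memory, and collects
# the parent ids it cannot resolve locally; a following superstep would
# answer those requests with the missing parent nodes.

def parse_line(line):
    node_id, parent_id, name = line.split(",")
    return node_id, {"parent": parent_id or None, "name": name}

def superstep_parse(lines, num_procs):
    # local computation phase: spread lines round-robin among
    # processors, each parsing only its own slice
    local = [dict(parse_line(l) for l in lines[p::num_procs])
             for p in range(num_procs)]
    # communication phase: each processor broadcasts requests for
    # parents that were assigned to some other processor
    requests = [
        {n["parent"] for n in nodes.values()
         if n["parent"] is not None and n["parent"] not in nodes}
        for nodes in local
    ]
    return local, requests

lines = ["n1,,finance", "n2,n1,personalfinance", "n3,n2,loans"]
local, requests = superstep_parse(lines, 2)
# processor 0 got n1 and n3; it must ask for n3's parent n2,
# which lives on processor 1
```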

3 Conclusions

The main advantage obtained with the BSP implementation is that the most costly operations, resurrecting the trees' structure and converting the data into the Solr format, are split among a number of processors without having to keep the whole exports in memory. This avoided out of memory failures and also yielded better performance when compared to other non BSP or serial solutions. Sample code used for this experiment can be found on GitHub at: https://github.com/tteofili/m2s.

References

[1] Apache Giraph. http://giraph.apache.org, 2011.

[2] Apache Hama. http://hama.apache.org, 2010.

[3] Apache Lucene. http://lucene.apache.org/core, 2010.

[4] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135–146, New York, NY, USA, 2010. ACM.

[5] Apache Solr. http://lucene.apache.org/solr, 2010.

[6] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.
