Distributed Search on Large NoSQL Databases
Fernando G. Tinetti1, Francisco Paez, Luis I. Aita, Demian Barry
III-LIDI, Facultad de Informática, UNLP, La Plata, Argentina
Fac. de Ingeniería, UNPSJB, Sede Pto. Madryn, Puerto Madryn, Argentina
1 Investigador Comisión de Investigaciones Científicas de la Prov. de Bs. As.
Abstract - This work focuses on the performance and scalability of different policies for solving queries on large NoSQL databases with clusters. Distribution of data and queries is among the main problems, given the distributed nature of clusters: basically, a set of networked computers. The basic centralized model (for both data and processing) is used as a departure point, and different distributed configurations are experimented with in order to determine several guidelines for performance improvement. Apache Solr has been used as the database manager and search server. The current contents of the Wikipedia in Spanish (about 4.5 GB) have been used as an example of a NoSQL database for experimentation. Keywords: Parallel and Distributed Computing, Distributed Search, Data Sharding, MapReduce, Apache Solr.
1
Introduction
Currently, there is a large number of websites with large amounts of data available, which must be handled in an efficient way. Several reasons for the growth in information volume have been [1] [11] [13] [7]:
• The popularity of content management systems (CMS) as portals in general and as platforms for collaboration in particular.
• The so-called Web 2.0, roughly defined as the current set of applications with high levels of interaction and access to multimedia data.
• The data generated within organizations, either as output or intermediate data of production systems, or by digitizing existing documents.
In summary, there has been exponential growth in the volume of information produced which, in turn, implies handling terabytes and petabytes of information instead of gigabytes. This scenario has led to the challenge of improving the so-called information retrieval search tools using different/new techniques. Scalability, availability, and performance in handling large volumes of information are now mandatory for most applications in this context, usually requiring distributed systems techniques. Some of the techniques presented in this work include: load balancing, replication, and horizontal distribution (sharding) of information [1] [8]. The White House has used a combination of Drupal and Apache Solr in its portal for document access/contents [15] [12]. In general, solutions to this problem must include strategies for scalability, availability, and performance.
1.1
Information Retrieval in Large Volumes of Data
Sequential search has to be discarded, given its lack of scalability. Auxiliary data structures that allow quick searches are necessary. Indexing provides data structures that facilitate searching and retrieving information quickly and accurately. Some indexing examples are: an inverted index [4], a citation index, a matrix, or a tree [9] [6] [2]. The indexing process usually requires analysis and processing of the documents to include in the index: stemming, tokenization, phonetic analysis, etc. These steps introduce important issues and challenges for processing [4] [8], which are beyond the scope of this work. Instead, this work is focused on alternatives for distributing the indexes and queries in a heterogeneous and scalable environment. A set of desirable properties of a feasible solution is related to [1]: performance and heterogeneous data, fault tolerance, and heterogeneous hardware platforms. Performance and Heterogeneous Data: traditional databases (usually called relational or SQL databases) have devoted a lot of effort to the issue of performance, where most techniques take advantage of parallelization and partitioning of information, relying on structured data stored in (relational) tables. The problem is different with heterogeneous information: not just the information stored in the database, but also all the surrounding information such as documents, pictures, videos, sound, mail, etc. Heterogeneous information causes an increase in the volume of stored data, which requires restructuring and rethinking the forms of storage. Also, heterogeneous information fundamentally requires at least some restructuring of the way in which information must be retrieved. It is worth noting that information retrieval usually requires knowledge of the meaning and an understanding of the data to retrieve.
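As an illustration of the inverted index mentioned above, the following is a minimal Python sketch. The document set, the naive tokenizer, and the AND-only search are simplifications for illustration, not how Lucene/Solr actually implement indexing:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-letter characters; real indexers also
    # apply stemming, stop-word removal, phonetic analysis, etc.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_inverted_index(docs):
    # Map each term to the set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: return the documents containing every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The point of the structure is that each query term costs one dictionary lookup plus set intersections, instead of a sequential scan over all documents.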
Fault Tolerance: one of the effects of distributing information in order to increase performance in information retrieval is that the distributed system must consider fault tolerance. The retrieved information must be consistent, even when one or more of the nodes become unavailable. Even though handling failures is beyond the scope of this paper, the transactions of a database should have the so-called ACID (Atomicity, Consistency, Isolation, Durability) properties. This paper focuses exclusively on the consistency and handling of distributed information indexing and recovery. Heterogeneous Hardware Platforms: in order to guarantee good performance in information retrieval, it should be possible to increase the number of nodes participating in a search. Traditional database parallelization is usually focused on homogeneous hardware, thus limiting the growth in the number of nodes. Moreover, NoSQL solutions for managing large volumes of information are usually based on a set of heterogeneous computing nodes. There are various techniques for configuring heterogeneous environments, some of which will be discussed in this paper.
2
Techniques on NoSQL Databases
Several common techniques are applied in current NoSQL databases: indexing on “shards”, shared nothing data distribution, data replication for load balancing, scatter and gather on distributed data, and map/reduce processing. These techniques are briefly explained below. Indexing on Shards: basically, sharding is a process similar to that of horizontal partitioning of data in a standard (structured) SQL database [5]. Sharding provides multiple capabilities for scaling, allowing data and indexes to be divided over multiple servers, which are known as shards. Indexing shards is the process of producing a data structure that facilitates searching and retrieving some kind of information from data in its original form [4]. Commonly generated data structures are an inverted index, an index of pointers, a matrix, or a tree. The indexing process usually requires analysis and processing of the documents to include in the index: stemming, tokenization, phonetic analysis, etc. These steps introduce important issues and challenges at processing time [4], which are beyond the scope of this work. Every query has to be processed in every shard, and finally a single response is built as an aggregate of the individual shard results. This technique is especially suited to large volumes of data. Database sharding is directly related to shared nothing data distribution. Shared Nothing Data Distribution: shared nothing focuses on independence of nodes and on the distribution of information and processing. A shard is a shared nothing node which handles a set of documents indexed by some criterion. Also, a shard has its own mechanisms for ranking, sorting, and retrieval of information, depending on information or application needs. Possible data distributions can be thematic, ontological, segmented according to preferences, or even combinations of them.
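One simple way to divide documents over shards is hashing the document id. The sketch below illustrates the idea; the `shard_for`/`partition` names and the md5-of-id criterion are illustrative choices, not Solr's actual document routing:

```python
import hashlib

def shard_for(doc_id, num_shards):
    # A stable hash of the document id picks the owning shard; md5 is
    # used instead of Python's salted hash() so that the mapping is
    # reproducible across processes and machines.
    digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def partition(doc_ids, num_shards):
    # Split a collection of document ids into num_shards disjoint groups.
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

Hashing spreads documents roughly evenly but ignores content; the thematic or ontological distributions mentioned above would replace the hash with a content-based rule.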
In all cases, these techniques can be combined with traditional ones, such as replication and parallelization on shared disk (a traditional cluster with a storage area network) [10]. In general, the Shared Nothing concept ensures some information consistency, but it is not necessarily ACID compliant. Also, these distributions make it easier to use independent heterogeneous nodes with their own memory, disk storage, and processing units. Nodes are necessarily interconnected by a network and, clearly, the architecture requires extra effort in coordination and synchronization. Data replication for load balancing, scatter and gather on distributed data, and map/reduce processing are some techniques used for coordinating the shared nothing nodes. Data Replication for Load Balancing: the architecture must guarantee a set of nodes with consistently replicated information across all nodes. The search engine has a pool of data nodes in which the information is searched. Queries are not parallelized, but distributed between nodes, which are independent and capable of solving queries on local data. Data recovery is done in the nodes, and distribution is done at the load balancer. This strategy neither solves the space problem nor parallelizes search [14]. Scatter and Gather on Distributed Data: this method is used when data are not replicated; the query is broadcast from a coordinator to every node known to have data. Then, each node processes the query and sends a reply with the information found locally. All replies are processed in the coordinator which, in turn, consolidates them into one consistent reply to the request source. An additional advantage of the method is that data nodes may further distribute the data onto other new nodes. A hierarchical distribution is then constructed, which is not visible to the “overall” coordinator. In general, the information is partitioned among nodes and queries are effectively parallelized. However, there are also disadvantages: distribution overhead, specifically with logical segmentation, and some query overhead in the coordinator, which has to generate the query result from the gathered data. Some logical segmentations, such as the ontological one, require knowledge and information about the contents (data) to be stored. In some cases this knowledge is relatively complex to obtain, especially with the implementation of ontological rules [14].
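The scatter and gather coordination described above can be sketched as follows. `ShardNode` and `scatter_gather` are hypothetical names, and the per-node search is reduced to exact word matching so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

class ShardNode:
    """A shared nothing node that answers queries on its local documents."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> text

    def query(self, term):
        # Local search only: a node never sees other nodes' data.
        return {d for d, text in self.docs.items() if term in text.split()}

def scatter_gather(nodes, term):
    # Scatter: the coordinator broadcasts the query to every node in parallel.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.query(term), nodes))
    # Gather: partial replies are consolidated into one consistent result.
    result = set()
    for partial in partials:
        result |= partial
    return result
```

The coordinator's extra cost mentioned in the text corresponds to the gather loop: every partial reply must be merged before answering the request source.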
Map/reduce is an effectively used technique in this context. Map/Reduce Processing: traditional databases (usually called relational or SQL databases) have devoted a lot of effort to the issue of performance, where most techniques take advantage of parallelization and partitioning of information, relying on structure. Usually, NoSQL databases start from text and/or heterogeneous and not necessarily structured data. Map/reduce is a good technique for processing a large volume of data in parallel. The model provides a mechanism for data partitioning that can make a “smart” distribution according to predefined rules over different self-contained nodes. An additional advantage lies in saving space with shared keys in the result, by reducing them within a document [3].
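A word-count sketch of the map/reduce idea, including the per-document reduction of shared keys mentioned above. The names and the sequential execution are illustrative; in a real deployment the map calls run on different nodes:

```python
from collections import Counter

def map_phase(doc):
    # Map: emit (word, count) pairs; counting within a single document
    # already "reduces" shared keys locally, saving space in the output.
    return Counter(doc.split())

def reduce_phase(partials):
    # Reduce: merge the per-document counts into one global count per key.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def word_count(docs):
    # On a cluster, map_phase would run in parallel on self-contained
    # nodes; here it runs sequentially to keep the sketch runnable.
    return reduce_phase(map_phase(doc) for doc in docs)
```

Because the reduce step only merges key/count pairs, it is insensitive to which node produced each partial result, which is what makes the technique attractive for heterogeneous shared nothing clusters.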
3
Experimentation Guidelines
The work in this paper is focused on verifying the effectiveness of a NoSQL database manager as a model for scalability and efficiency in information retrieval. The specific manager chosen is the Apache Solr implementation of Apache Lucene. Apache Solr has several advantages for this analysis, since:
• Apache Solr is freely available.
• Apache Solr allows testing heterogeneous document indexing.
• It allows future tests and other analyses with similar platforms of the same family.
There are several performance indices which can be measured by experimentation: CPU load, average response time to queries, memory and swap usage, disk accesses, and network bandwidth involved, among others. The two most important indices have been shown to be CPU load and average response time, since they are more or less directly related to the other indices. Performance and state are measured in both server/s and clients, thus obtaining an approximation of the distributed client/server system state as well as the state of individual computers and processes. At a higher level, the analysis is focused on performance scaling as well as quality of service/individual query response time. The full backup of the current Wikipedia articles in Spanish has been used as a real environment of documents to be indexed and searched, with approximately 1,800,000 items (4 GB of data on disk). Several tests were designed in order to measure the performance of different server configurations. Every test involves the simulation of several concurrent user processes and multiple specific queries with the following characteristics:
• The whole set of words in the dictionary of the Spanish Royal Academy, approximately 86,000 words, was used.
• Each query was generated by randomly grouping from 2 to 4 words. This ensures randomness and heterogeneity of queries.
• Queries are issued from several concurrent client processes (simulated by runtime threads). The numbers of threads used were 64, 128, 256, 384, 512, 768, and 1024, thus allowing a progressive analysis of workload/requirements.
• Every test is repeated 10 times in order to obtain the corresponding average of each index measurement.
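The query-generation step described above can be sketched as follows. The function name and the seeding are illustrative; the actual experiments used awk over the Royal Academy dictionary, while this is an equivalent Python sketch:

```python
import random

def make_queries(words, how_many, seed=None):
    # Each query randomly groups 2 to 4 dictionary words, mirroring the
    # experimental setup; a fixed seed makes a run reproducible.
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, rng.randint(2, 4)))
            for _ in range(how_many)]
```

Sampling without replacement within a query (via `rng.sample`) avoids degenerate queries that repeat the same word.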
The open source tool Siege [16] was used for generating and measuring the concurrent-client environment of each experiment, and awk was used for constructing each specific query. It is worth noting that database updates are not taken into account in these experiments, since only query (information recovery) requests from concurrent clients are being considered. There are no delete/change requests which would change database content/s. The experiments were carried out with three server configurations: a centralized server, a server with two shards, and a replicated server. Fig. 1 shows the centralized server configuration, which is standard, and is used in this work for comparison in order to have a reference point. Fig. 2 shows the server configured with two shards (both shards have the same number of documents). The server front-end has a minimum workload: query replication to both shard servers and aggregation of results from both shard servers in order to send a unique result for each query. Fig. 3 shows the specific configuration defined for a replicated server with two replicas. The Master Server originally indexes the documents and manages the replicas. Every replica independently handles its own queries, and the server front-end has a low workload: it balances queries (round-robin) and sends results to the proper client. Even though the Master Server has to deal with new documents and their indexing, most of the problem is still found in query handling and the workload generated by multiple concurrent client processes.
Figure 1: Centralized Server.
Figure 2: Two Shards Server.
Figure 3: Replicated Server with Two Replicas.
The sar tool is used in every server configuration to monitor and account for the selected performance indices. Experiments are triggered synchronized with the sar command at the server side. Each performance index sample is taken every two seconds at experiment runtime and stored for later processing. In the case of several servers, indices are aggregated and processed after running the experiments.
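The aggregation of per-server samples can be sketched as follows; this is an illustrative reduction over sar-like series, not the actual post-processing scripts used in the experiments:

```python
def average_cpu(samples_per_server):
    # samples_per_server: one list of %CPU samples per server, taken every
    # two seconds. Series are averaged tick by tick across servers, and
    # the per-tick averages are then averaged over the whole run.
    per_tick = [sum(tick) / len(tick) for tick in zip(*samples_per_server)]
    return sum(per_tick) / len(per_tick)
```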
4
Results
Several combinations of hardware and software were used in order to test the server configurations and client requests. The main characteristics of the computer that runs the script collecting all the client-side information and performing all the requests to the Solr server/s are:
• AMD Athlon(tm) 64 X2 Dual Core Processor 5200+
• 1 GB DDR2 RAM
• 100 Mb/s Ethernet NIC
• Fedora Core 11 64 bits
Two different computers were used for the server side, so that heterogeneous hardware can be measured and evaluated. The most powerful computer/hardware is used for the stand-alone/centralized server (Fig. 1 above) as well as for one of the shard and one of the replica servers (Fig. 2 and Fig. 3 above). The main characteristics of this computer (the most powerful at the server side) are:
• Intel i3 CPU 540 @ 3.07GHz
• 8 GB DDR3 RAM
• 1000 Mb/s Ethernet NIC
• Ubuntu Server 10.04 LTS 64 bits
The less powerful computer used at the server side has the following characteristics:
• Intel Core 2 Duo CPU E7400 @ 2.80GHz
• 2 GB DDR2 RAM
• 1000 Mb/s Ethernet NIC
• Ubuntu Server 10.04 LTS 64 bits
Apache Tomcat is used in every server configuration with the default installation; the only changed value is the number of threads, raised to 1024. The JVM (Java Virtual Machine) running Apache Tomcat was specifically configured so that the server can be monitored by enabling JMX (Java Management eXtensions). Apache Solr is used almost with the default configuration and installation from binaries, specifically defining the Wikipedia in Spanish documents to be indexed and searched. Fig. 4 shows the performance in seconds for different numbers of clients triggering queries for the three defined server configurations. Even though all the configurations show performance degradation starting at 512 clients, the shards configuration has a near-linear increase in time which, in turn, suggests better scalability, at least in terms of the number of requests.
As expected, the stand-alone server configuration (shown as “1 Server” results) has performance degradation at a lower number of concurrent clients: between 256 and 384 and, also, the degradation is far from linear starting at 512 clients. As shown in Fig. 4, the replicated server has a performance behavior intermediate between that of the stand-alone server and the sharded server. Perhaps the worst characteristic of the replicated server is the behavior shown starting at 512 clients, since performance degradation is far from linear, even worse than that of the stand-alone server. This, in turn, shows that replication servers should be very carefully designed, configured, and monitored in order to avoid hot spots and/or high performance degradation under request stress.
Figure 4: Response Time Performance.
In Fig. 5 the CPU usage is shown for the experiments already shown in Fig. 4. The stand-alone server configuration is almost overloaded starting at 85 concurrent clients, with more than 85% CPU usage. The sharded and replicated server configurations are not necessarily lightly loaded (both are above 65% CPU usage starting at 256 concurrent clients), but they do not get overloaded: the servers are always below 80% CPU usage. The CPU usage almost directly explains the performance obtained with each server configuration, since the results are almost directly proportional. The direct relationship between response time and CPU usage allows excluding from the analysis other important factors such as network load, disk accesses, RAM usage/footprint, etc. Since heterogeneous computers are used at the server side, further analysis would be useful in order to explain the results as well as to define better load balancing and scaling strategies.
Figure 5: Average Server CPU Usage.
Fig. 6 shows the CPU usage of each computer at the server side. Clearly, the stand-alone server CPU usage (shown as “1 Server”) is the same as that shown in Fig. 5. Results shown as “S1” correspond to the sharded server configuration on the best computer, and those shown as “S2” to the sharded server on the worst computer. Clearly, the worst computer is almost always overloaded starting at 196 concurrent clients. The best computer used at the server side is almost always lightly loaded. There is a hugely unbalanced workload for heterogeneous servers, and there seems to be a clear index for workload balancing: CPU usage.
Figure 6: CPU Usage per Server Computer.
Results shown in Fig. 6 for the replicated servers (R1 and R2, respectively) are analogous to those explained for the sharded servers. In general, analyzing the data shown in Fig. 4, Fig. 5, and Fig. 6, several interesting remarks can be made:
• The sharded server configuration almost always provides the best performance.
• Distributing data and/or queries at the server side almost always improves performance.
• There is still work to be done on enhancing workload balance, which could further improve performance, specifically with heterogeneous hardware.
• Even when some of the distributed server computers are overloaded, the average CPU usage is directly related to performance, i.e., overloading does not necessarily imply too high a performance penalty, provided there are non-overloaded server computers able to handle incoming requests.
5
Conclusions and Further Work
This paper has shown several configuration guidelines and the results obtained for NoSQL databases. Experiments with heterogeneous hardware are also included, basically as a proof of concept and, also, for further analysis of specific results. Working with heterogeneous data can be taken for granted in NoSQL databases by their own nature. Infrastructure software such as Apache Solr has proven to be successful not only for starting a (server-side) content management system, but also for experimenting and measuring runtime tests. Specific experimentation and measurement on a distributed system imply using tools and methodologies on the client side, and a combination of Siege and awk has been used for generating a representative request load on different server configurations. Several tests have shown that a completely distributed data and information recovery configuration, defined by shards, provides the best runtime results. There are several immediate tests and hardware and software configurations to experiment with:
• Using more servers for shards as well as for replicas, also with more heterogeneity.
• Fine-tuning of data and workload on each heterogeneous server.
• Combinations of sharding and replication, since those options do not exclude each other.
Other research lines are not so immediate, since they require more analysis and experimentation, among other tasks. A possible step forward is the heterogeneous distribution of indexes according to different criteria, such as server heterogeneity and type of query (e.g. combination of words).
6
References
[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin, “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”, Proceedings of the VLDB Endowment (2009), Vol. 2, Issue 1, pp. 922–933.
[2] J. Chris Anderson, Jan Lehnardt, Noah Slater, CouchDB: The Definitive Guide, O'Reilly Media, Jan. 2010, ISBN 1449379680.
[3] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM - 50th anniversary issue: 1958 - 2008, Vol. 51, Issue 1, Jan. 2008.
[4] Erik Hatcher, Otis Gospodnetić, Lucene in Action, 2nd ed., Manning Publications Co., 2004.
[5] Cal Henderson, Building Scalable Web Sites, O'Reilly Media, 2006.
[6] Eben Hewitt, Cassandra: The Definitive Guide, O'Reilly Media, Nov. 2010, ISBN 1449390412.
[7] Curt Monash, The 1-petabyte barrier is crumbling, Networkworld, Aug. 2008, http://www.networkworld.com/community/node/31439.
[8] Ken North, “The NoSQL Alternative, Low-cost, high-performance database options make gains,” Information Week, May 2010.
[9] Eelco Plugge, Tim Hawkins, Peter Membrey, The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing, Apress, October 2010, ISBN 1430230517.
[10] Michael Stonebraker, “The Case for Shared Nothing”, Database Engineering, Vol. 9, No. 1, 1986, http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
[11] Carl W. Olofson, Worldwide RDBMS 2005 vendor shares, Technical Report 201692, IDC, May 2006.
[12] Thoughts on the Whitehouse.gov switch to Drupal, http://radar.oreilly.com/2009/10/whitehouse-switch-drupal-opensource.html
[13] Dan Vesset, Worldwide data warehousing tools 2005 vendor shares, Technical Report 203229, IDC, August 2006.
[14] Zhou Wei, Guillaume Pierre, Chi-Hung Chi, CloudTPS: Scalable Transactions for Web Applications in the Cloud, Technical Report IR-CS-53, Vrije Universiteit, February 2010.
[15] WhiteHouse.gov Goes Drupal, http://personal.democracy.com/node/15131
[16] Joe Dog Software, Siege, http://www.joedog.org/index/siege-home