Standardization of NoSQL Database Languages
Malgorzata Bach and Aleksandra Werner
Silesian University of Technology, Gliwice, Poland
{malgorzata.bach,aleksandra.werner}@polsl.pl
http://www.polsl.pl
Abstract. NoSQL database systems are becoming increasingly popular and accepted by database users, and they are developing rapidly. For this reason, the paper surveys modern database engines and presents their categories in the form of a Venn diagram. It also presents the possibilities of using declarative languages modeled on SQL, the language of relational databases, in NoSQL systems. For this purpose, selected NoSQL technologies are discussed in more detail and their query languages are described. Moreover, the NoSQL equivalents of standard SQL commands are provided.
Keywords: NoSQL, key-value databases, column family databases, document-oriented databases, graph databases, declarative language
1 Introduction
NoSQL databases are growing rapidly, although the concept itself appeared quite recently. The term "NoSQL" was first used in 1998 by Carlo Strozzi as the name of his lightweight open-source relational database without an SQL interface. The term appeared again in 2009 at a conference on open, distributed, non-relational databases that took place in San Francisco. Cassandra, Voldemort, Dynomite, HBase and other dynamically developing databases were presented at that conference, and the name "NoSQL" was chosen by vote to describe the class of databases that differ from the classical relational model. Some authors, among them Carlo Strozzi [15], suggest that, since the current NoSQL movement departs from the relational model, it would more appropriately be called "NoREL" (i.e. Not RELational systems) or "NoJOIN" (i.e. systems without JOINing) [17].
It should be emphasized that departing from the relational model is not the same as belonging to the NoSQL class. For example, the hierarchical and network database models, which prevailed in the 1960s, used neither the relational model nor SQL, yet they are not counted as part of the NoSQL trend. This is because the development of the new movement has been tied to the expansion of the Internet, Web 2.0 and social networking sites, which have generated problems with the performance and scalability of computer systems. NoSQL systems are designed to meet these requirements [18].
Reading NoSQL as "Not Only SQL" may suggest a departure from the SQL language, but the authors of the concept meant a departure from the relational model¹ rather than from the language. However, in the early days of this technology the majority of NoSQL systems did not offer declarative languages such as SQL. Queries to NoSQL databases were usually written at a very low level of abstraction, so operating on data required specialized programming knowledge. The various and often unintuitive languages that have to be learnt in order to use NoSQL systems can be a serious obstacle for a large number of potential users. During the decades in which relational databases dominated, users became accustomed to a declarative language rather than an imperative one, since it is easier to specify what data should be retrieved than how to retrieve it. Therefore, more and more people point to the need for language standardization and for more user-friendly languages for accessing non-relational data.
This paper presents and evaluates the work done in the NoSQL area so far. For selected NoSQL solutions, the possibilities of using declarative constructs in data processing are presented. The article can serve as a guide, within this scope, for anyone who has to decide whether to choose a NoSQL solution and, if so, which one.
2 The Database Market
Nowadays, it is difficult to find an area in which databases are not used. However, the variety of applications makes it difficult to offer a single solution (i.e. a single data model) that would be appropriate for all situations. This diversity is reflected in the Venn diagram presented in Fig. 1.
Fig. 1. The Database Market (based on: [14])
It can be seen that the database market is varied. There is a place for both relational and non-relational databases. Database systems can also be divided into transactional (operational) and analytical ones. Solutions called NewSQL are also included in Fig. 1. NewSQL is a class of modern relational database management systems that try to provide scalable performance comparable to that of NoSQL systems for online transaction processing (read-write) workloads, while still maintaining the ACID² guarantees of a traditional relational database system.
In recent years, a variety of NoSQL databases have been developed, mainly by practitioners and web companies, to fit their specific requirements regarding scalability, performance and feature set. Because of the diversity of these approaches, the classification of NoSQL systems is very difficult. Nevertheless, despite their very wide range of (often niche) applications, NoSQL database implementations can generally be classified into one of four main categories, based on the data model:
– Key-Value Stores,
– Column Family Databases (Wide Column Stores, BigTable Clones),
– Document-Oriented Databases,
– Graph Databases.
¹ SQL is commonly identified with the Relational Database Management System (RDBMS).
² ACID (Atomicity, Consistency, Isolation, Durability): a set of properties that guarantee that database transactions are processed reliably.
In the following sections, the possibility of using declarative languages in data processing is discussed with respect to several selected NoSQL systems.
3 Cassandra
The Apache Cassandra system was originally created for Facebook. The project started in 2007 in order to improve the searching of users' messages (the so-called Inbox Search problem). This solution is an interesting hybrid of the data model taken from Google BigTable and the replication and partitioning model taken from Amazon Dynamo. This duality is marked in Fig. 1 by placing Cassandra across the key-value and column-oriented databases [1].
3.1 Data Model
The data model of Cassandra resembles that of Google BigTable, which is regarded as the prototype of all databases implementing column families. However, there are some differences between the two models, including differences in terminology. Apache Cassandra uses the following nomenclature:
– Column – the smallest portion of information. The name, value and date of last modification are specified for each column. Cassandra stores only the newest version of the data, unlike Google BigTable, where many versions of the data are stored.
– Column Family – a set of rows, each of which can have any number of columns. Thus different rows can have different collections of columns. This structure can be compared to a table in a relational database, but a significant difference is the lack of a rigid structure.
– Keyspace – groups column families together, creating a kind of namespace. It is similar to a schema in a relational database.
– Super Column – a structure that groups many columns.
– Super Column Family – a set of rows containing super columns.
3.2 CQL – Cassandra Query Language
At the beginning, Cassandra was considered rather difficult to use, so it was not very popular and did not gain the approval of developers, especially those building small applications. One of the attempts to make the system easier to use was the introduction (in 2011) of CQL (Cassandra Query Language), which is similar to SQL. Currently, version 3 of CQL is available. For example, a keyspace is created by the command:
    CREATE KEYSPACE test WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 1;
where the parameters strategy_options:replication_factor and strategy_class determine the number of replicas and the strategy of their placement, respectively. A sample statement that creates a column family can be written as follows:
    CREATE COLUMNFAMILY users (KEY text PRIMARY KEY, full_name text, email text, state text, gender text, profile text, birth_year int);
Instead of CREATE COLUMNFAMILY, the CREATE TABLE statement can be used. Data can be inserted with an INSERT command, for example:
    INSERT INTO users (KEY, full_name, email, state, gender, birth_year) VALUES ('robnow', 'Robert Nowakowski', '[email protected]', 'TX', 'M', 1980);
Data is retrieved from a Cassandra table with a SELECT statement, analogously to SQL. For example:
    SELECT * FROM users WHERE birth_year = 1975;
The syntax of the SELECT statement is simplified in comparison with SQL in an RDBMS. Column families cannot be joined; the absence of JOIN support is one of the principles of the NoSQL approach. Filter conditions (the WHERE clause) can refer only to a key or to indexed columns. Moreover, the only aggregate function implemented is COUNT. It is also not possible to group rows, and sorting (ORDER BY) is possible only on a column that is part of a composite primary key.
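The data model of Section 3.1 and the filtering restriction described above can be illustrated with a small in-memory sketch in Python. It is not Cassandra code: the class `ColumnFamily` and its methods `create_index`, `insert` and `select` are hypothetical names introduced only for this illustration, the second user is made-up sample data, and the behaviour is deliberately simplified (an integer counter stands in for timestamps, only the newest value of a column is kept, and filtering is allowed only on the row key or on indexed columns).

```python
import itertools


class ColumnFamily:
    """Toy model of a column family: rows addressed by a row key, each row
    holding its own set of columns stored as (value, timestamp) pairs."""

    _clock = itertools.count(1)  # stand-in for a real timestamp source

    def __init__(self, name):
        self.name = name
        self.rows = {}        # row key -> {column name: (value, timestamp)}
        self.indexed = set()  # column names that may be used in filters

    def create_index(self, column):
        self.indexed.add(column)

    def insert(self, key, columns, timestamp=None):
        ts = next(self._clock) if timestamp is None else timestamp
        row = self.rows.setdefault(key, {})
        for name, value in columns.items():
            # Only the newest version of a column is kept.
            if name not in row or row[name][1] <= ts:
                row[name] = (value, ts)

    def select(self, **where):
        # Filters may refer only to the key or to indexed columns.
        for column in where:
            if column != "KEY" and column not in self.indexed:
                raise ValueError(f"cannot filter on non-indexed column {column!r}")
        result = []
        for key, row in self.rows.items():
            values = {name: value for name, (value, _) in row.items()}
            values["KEY"] = key
            if all(values.get(c) == v for c, v in where.items()):
                result.append(values)
        return result


users = ColumnFamily("users")
users.create_index("birth_year")
users.insert("robnow", {"full_name": "Robert Nowakowski", "state": "TX", "birth_year": 1980})
users.insert("annkow", {"full_name": "Anna Kowalska", "birth_year": 1975})  # row without 'state'
users.insert("robnow", {"state": "CA"})  # a newer write replaces the old value

born_1975 = users.select(birth_year=1975)
by_key = users.select(KEY="robnow")
```

Calling `users.select(state="TX")` raises `ValueError` in this sketch, because `state` has not been indexed; this mirrors the restriction on WHERE conditions, not the actual error reporting of Cassandra.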
4 Hypertable
Hypertable is an open-source project inspired by the Google BigTable system. It was started in 2007 by engineers sponsored by Baidu, Rediff.com and Zvents Inc. Hypertable runs on top of a distributed file system such as Apache Hadoop DFS, GlusterFS or the Kosmos File System (KFS). It is written in C++.
4.1 Data Model
Hypertable implements column families and, as described previously for Cassandra, represents data in the form of tables whose rows can have varying structure. Key-value pairs are associated with the individual table cells. A key contains the identifiers of the row and the column, which means that each cell has an exact address. Depending on the configuration, many versions of each cell, differing in their timestamps, can be stored.
4.2 HQL – Hypertable Query Language
HQL is a declarative language similar to SQL that simplifies work with the Hypertable system. Tables are grouped logically by namespaces, which can be compared to the hierarchy of folders in a file system. For example, the following commands:
    CREATE NAMESPACE "/test";
    USE "/test";
    CREATE NAMESPACE "subtest";
create the namespace 'test' and its sub-namespace 'subtest'. If a name begins with the '/' sign, it is treated as an absolute path; otherwise it is treated as relative to the current namespace. A table is created by executing the CREATE TABLE command:
    CREATE TABLE User (full_name, email, state, gender, profile, birth_year, ACCESS GROUP default (full_name, email, state), ACCESS GROUP profile (profile));
Hypertable does not support data types; values are treated as opaque sequences of bytes, which is why no column types are declared in the example. The possibility of declaring access groups is one of the most interesting features of Hypertable, because it influences how data is stored. All data from the columns that belong to one group is located together on disk, which can reduce the number of input/output operations. New data is inserted with the INSERT command:
    INSERT INTO User VALUES ("row1", "full_name", "Robert Nowakowski"), ("2009-08-02 08:30:00", "row1", "email", "[email protected]");
The data takes the form of a comma-separated list of tuples. Each tuple represents a cell and can have one of two forms:
    (row, column, value) or (timestamp, row, column, value)
In the first case the timestamp of the cell is added automatically; in the second it is given explicitly by the user. Queries are expressed with SELECT statements, e.g.:
    SELECT full_name FROM User WHERE full_name = "Robert Nowakowski";
    SELECT * FROM User WHERE '2008-07-28 00:00:02' < TIMESTAMP < '2008-07-28 00:00:07';
Conditions can be defined on rows, cells or timestamps but, as in Cassandra, there are many limitations in comparison with the SELECT statement in an RDBMS. For example, there is no grouping, sorting or aggregate functions [4].
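The two tuple forms accepted by INSERT and the timestamp-range condition of SELECT can be mimicked with a short Python sketch. It is an illustration only and does not use any Hypertable API: the class `CellStore` and its methods are hypothetical, timestamps are plain numbers rather than date strings, and every inserted version is kept regardless of configuration.

```python
import time


class CellStore:
    """Toy model of a table whose cells are addressed by (row, column)
    and may exist in several timestamped versions."""

    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value)

    def insert(self, *cells):
        for cell in cells:
            if len(cell) == 3:
                # (row, column, value): the timestamp is assigned automatically.
                row, column, value = cell
                timestamp = time.time()
            elif len(cell) == 4:
                # (timestamp, row, column, value): the timestamp is given explicitly.
                timestamp, row, column, value = cell
            else:
                raise ValueError("a cell must have 3 or 4 elements")
            self.cells.setdefault((row, column), []).append((timestamp, value))

    def select(self, newer_than, older_than):
        """Return (row, column, value) for every cell version with
        newer_than < timestamp < older_than."""
        return sorted(
            (row, column, value)
            for (row, column), versions in self.cells.items()
            for timestamp, value in versions
            if newer_than < timestamp < older_than
        )


table = CellStore()
table.insert((10.0, "row1", "full_name", "Robert Nowakowski"),
             (20.0, "row1", "state", "TX"),
             (30.0, "row1", "state", "CA"))  # second version of the same cell
table.insert(("row2", "state", "NY"))         # timestamp added automatically

in_range = table.select(15.0, 35.0)
```

Here `in_range` contains both versions of the `state` cell of `row1`, because both timestamps fall inside the requested interval, while the automatically stamped cell of `row2` falls outside it.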
5 Neo4j
Neo4j is an open-source graph database that stores data in a graph, the most generic of data structures, capable of clearly representing any kind of data in a highly accessible way. It is implemented in Java and is one of the older NoSQL systems; it has been used in production environments for 10 years. The community edition of the database is licensed under the free GNU General Public License (GPL) v3 [10].
5.1 Data Model
Neo4j is a graph database, that is, it stores data as nodes and relationships. Both nodes and relationships can hold properties in key/value form. A property value can be either a primitive or an array of one primitive type. Nodes are often used to represent entities but, depending on the domain, relationships may be used for that purpose as well. Both nodes and relationships have internal unique identifiers that can be used when searching for data. Semantics can be expressed by adding directed relationships between nodes.
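The property graph model described above can be written down as two small Python data classes. This is a sketch under stated assumptions, not the Neo4j API: the names `Node` and `Relationship`, the sample properties and the single shared identifier counter are all introduced here for illustration only.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # hypothetical source of internal unique identifiers


@dataclass
class Node:
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


# Property values are primitives or arrays of one primitive type.
mike = Node({"full_name": "Mike", "tags": ["admin", "dev"]})
anna = Node({"full_name": "Anna"})
# A directed relationship (mike -> anna) that carries its own property.
knows = Relationship(mike, "KNOWS", anna, {"since": 2010})

# The internal identifiers can be used to look elements up.
by_id = {item.id: item for item in (mike, anna, knows)}
outgoing = [r.end.properties["full_name"] for r in (knows,) if r.start is mike]
```

The direction of `knows` is what gives the pair of nodes its meaning: following outgoing relationships from `mike` reaches `anna`, but not the other way round.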
5.2 Cypher Query Language
Cypher is a declarative graph query language for querying and updating graphs. Being declarative, Cypher focuses on clearly expressing what to retrieve from a graph, not how to retrieve it, in contrast to imperative languages such as Java and scripting languages such as Gremlin and JRuby, which can also be used with Neo4j. Of the languages described so far, Cypher has the syntax least similar to classic SQL [5, 3]. For example, the SQL query:
    SELECT * FROM User WHERE full_name = 'Mike'
corresponds to the following Cypher command:
    START User=node:User(full_name = 'Mike') RETURN User
The START clause specifies the starting point in the graph from which the query is executed. Its role thus lies somewhere between the FROM and WHERE clauses of the SQL SELECT statement. A Cypher command can comprise several parts, namely:
– START: Starting points in the graph, obtained via index lookups or by element IDs.
– MATCH: The graph pattern to match, bound to the starting points in START (it corresponds to the SQL JOIN clause).
– WHERE: Filtering criteria.
– RETURN: What data should be returned. It is equivalent to the SQL SELECT clause.
– ORDER BY: Sorts the output.
– CREATE: Creates nodes and relationships.
– DELETE: Removes nodes, relationships and properties.
– SET: Sets the values of properties.
– FOREACH: Performs updating actions once per element in a list.
– WITH: Divides a query into multiple, distinct parts (the WITH clause is used to pipe the result from one query part to the next and to separate reading from updating of the graph).
The Cypher clauses listed above thus allow not only data retrieval but also insertion, modification and deletion. Cypher is therefore an equivalent not only of the SQL SELECT statement, but of the UPDATE, INSERT and DELETE statements as well.
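How the reading clauses cooperate can be sketched as a small pipeline in Python. The sketch is an assumption-laden illustration, not an implementation of Cypher: the function `run_query`, the dictionary `user_index` and the sample nodes are hypothetical, MATCH and the updating clauses are left out, and sorting is applied before projection for simplicity.

```python
# Hypothetical in-memory index named User: full_name -> list of nodes.
nodes = [
    {"id": 1, "full_name": "Mike", "age": 31},
    {"id": 2, "full_name": "Mike", "age": 25},
    {"id": 3, "full_name": "Anna", "age": 40},
]
user_index = {}
for node in nodes:
    user_index.setdefault(node["full_name"], []).append(node)


def run_query(index, key, where=None, returns=None, order_by=None):
    """Evaluate a simplified START / WHERE / ORDER BY / RETURN pipeline."""
    rows = list(index.get(key, []))                        # START: index lookup
    if where is not None:
        rows = [n for n in rows if where(n)]               # WHERE: filtering
    if order_by is not None:
        rows = sorted(rows, key=lambda n: n[order_by])     # ORDER BY: sorting
    if returns is not None:
        rows = [{p: n[p] for p in returns} for n in rows]  # RETURN: projection
    return rows


# Roughly: START u=node:User(full_name = 'Mike') WHERE u.age > 20
#          RETURN u.id, u.age ORDER BY u.age
result = run_query(user_index, "Mike",
                   where=lambda n: n["age"] > 20,
                   returns=["id", "age"],
                   order_by="age")
```

The caller states only which starting points, filter, ordering and properties are wanted; the loop that walks over the nodes stays hidden inside `run_query`, which is the sense in which such a query is declarative.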
6 SPARQL
Neo4j supports Semantic Web technology, which means that RDF (Resource Description Framework), a directed, labeled graph data format, can be used. Initially, each RDF data repository implemented its own query language, which made it difficult to move data between repositories and created a serious need for a common query language for the Semantic Web. The W3C took up the issue and in 2008 recommended SPARQL (SPARQL Protocol And RDF Query Language) as the standard language and protocol for RDF data. The concept and syntax of SPARQL are similar to those of SQL, and the language allows querying a data set restricted by criteria specified with RDF predicates. An RDF statement is a triple (entity1, property, entity2) that captures both entity attributes and relationships between entities: entity1 has a property related to entity2 (entity2 can also be a value of the property). A SPARQL query comprises:
1. A prefix declaration, where the URI addresses of data, ontologies or other documents are defined.
2. A part that describes the form of the query (SELECT, CONSTRUCT, ASK, DESCRIBE).
3. A part that consists of a query pattern in the form of RDF triples.
4. Query modifiers (FILTER, ORDER BY, OPTIONAL etc.) that rearrange the query results.
There are four types of SPARQL query, which differ mainly in the form of the returned result:
– The SELECT form is used to create a list of URIs, in the form of a table, that satisfy the pattern-matching requirements specified in the query [7].
– The CONSTRUCT form returns an RDF graph, which is created by taking the results of the equivalent SELECT query and filling in the values of the variables that occur in the CONSTRUCT template [9].
– The ASK form is used to test whether a query pattern has a solution; it returns a boolean True/False value depending on whether the query pattern has any matches in the dataset [8].
– The DESCRIBE form returns all triples containing the URIs that satisfy the pattern-matching requirements specified in the query, i.e. it returns one RDF graph that describes the resources found. The implementation of this return form is up to each query engine.
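The idea of matching a triple pattern against RDF data, and the difference between the SELECT and ASK forms, can be shown with a minimal Python sketch. It is not a SPARQL engine: the functions `match`, `select` and `ask` are hypothetical, the `ex:` prefix and the sample triples are made up, and only a single triple pattern without modifiers is supported.

```python
# Sample data: a set of (entity1, property, entity2) triples.
triples = {
    ("ex:mike", "ex:fullName", "Mike"),
    ("ex:mike", "ex:knows", "ex:anna"),
    ("ex:anna", "ex:fullName", "Anna"),
}


def is_variable(term):
    return term.startswith("?")


def match(pattern, data):
    """Yield one variable binding (a dict) per triple that fits the pattern."""
    for triple in data:
        binding = {}
        for p, t in zip(pattern, triple):
            if is_variable(p):
                if binding.setdefault(p, t) != t:
                    break  # the same variable would get two different values
            elif p != t:
                break      # a constant term does not match
        else:
            yield binding


def select(variables, pattern, data):
    """SELECT form: a table (sorted list of rows) of the requested variables."""
    return sorted(tuple(b[v] for v in variables) for b in match(pattern, data))


def ask(pattern, data):
    """ASK form: True if the pattern has at least one match."""
    return any(True for _ in match(pattern, data))


names = select(["?who", "?name"], ("?who", "ex:fullName", "?name"), triples)
mike_knows_someone = ask(("ex:mike", "ex:knows", "?x"), triples)
anna_knows_someone = ask(("ex:anna", "ex:knows", "?x"), triples)
```

Both forms share the same pattern-matching step and differ only in what they return: `select` builds a table of bindings, whereas `ask` reduces the matches to a single boolean.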
7 Other Solutions
It is not possible to describe in detail, in a single paper, all the declarative languages available in NoSQL products. Therefore, the following paragraphs only mention selected projects developed in this area so far that the authors consider interesting.
OrientDB, a graph-document database written in Java, supports the SQL language and, in comparison with other NoSQL implementations, offers an extended syntax with graph operators. It provides, for example, the ORDER BY and GROUP BY clauses (although the current release supports grouping by only one field), and results can be computed with aggregate functions. In addition, OrientDB allows subqueries [6].
ArangoDB, earlier known as AvocadoDB, is a multi-purpose open-source database with flexible data models for documents, graphs and key-values. This database is equipped with a declarative, SQL-like query language called AQL (ArangoDB Query Language) [2].
It is also worth mentioning the project known under the working name UnQL (Unstructured Query Language), which was founded by Damien Katz and Richard Hipp. According to D. Katz, "UnQL stems from our belief that a common query language is necessary to drive NoSQL adoption in the same way SQL drove adoption in the relational database market." [11] The idea was to create a language that would handle document databases and semi-structured ones, as well as all types of data stored in the JSON (JavaScript Object Notation) format. The syntax of the language was assumed to be similar to SQL. This followed from the personal experience with SQL of both project initiators, who believed that it was sufficient to extend the relational query language with new concepts typical of non-relational databases. Unfortunately, after a period of very intensive work on UnQL, the project slowed down significantly in 2011, and the latest information about it dates from the first half of 2012.
Many research centers study intermediate systems between relational and non-relational databases. A prototype of such a system, which translates queries formulated in SQL into the form used by the MongoDB and Cassandra systems, is described for example in [13].
The Quest Software company has also developed the Toad for Cloud Databases software for migrating data between SQL and NoSQL systems and for querying multiple databases. A query formulated in SQL is converted in the next step into the API calls needed to return data from a database of the specific NoSQL solution. Toad for Cloud Databases works with Apache HBase, Amazon SimpleDB, Azure Table, MongoDB, Apache Cassandra, Hadoop and all databases that use Open Database Connectivity (ODBC) [16, 12].
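The general idea of turning an SQL query into a call of a NoSQL API can be sketched in Python as follows. This is not how Toad for Cloud Databases or the prototype from [13] work internally, which the sources cited here do not describe; the function `translate` is hypothetical, it accepts only a tiny subset of SQL (one table, at most one equality condition), and its output merely resembles the filter and projection documents used by document stores such as MongoDB.

```python
import re

# Accepts: SELECT <columns> FROM <table> [WHERE <column> = <'text' | number>]
_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<col>\w+)\s*=\s*(?P<val>'[^']*'|\d+))?\s*;?\s*$",
    re.IGNORECASE,
)


def translate(sql):
    """Turn a very small subset of SQL into (collection, filter, projection)."""
    m = _SELECT.match(sql)
    if m is None:
        raise ValueError("unsupported query: " + sql)
    cols = [c.strip() for c in m.group("cols").split(",")]
    # SELECT * means "no projection"; otherwise list the wanted fields.
    projection = {} if cols == ["*"] else {c: 1 for c in cols}
    query_filter = {}
    if m.group("col"):
        raw = m.group("val")
        value = raw[1:-1] if raw.startswith("'") else int(raw)
        query_filter[m.group("col")] = value
    return m.group("table"), query_filter, projection


collection, query_filter, projection = translate(
    "SELECT full_name, email FROM users WHERE birth_year = 1975;"
)
```

Queries outside the supported subset, for example ones containing JOIN, are rejected with `ValueError`, which reflects the fact that such constructs have no direct counterpart in many NoSQL APIs.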
8 Summary
The research described here focused on NoSQL databases, and particularly on the availability of declarative languages in NoSQL solutions. The authors attempted to assess the prospects for developing standards in this area. Such standards seem necessary now that NoSQL databases have been present in modern systems for several years. More and more companies use this kind of database instead of a typical relational one, or combine it with an existing RDBMS (as both types can work together). Hybrid solutions and purely non-relational ones are increasingly popular, but the need to use a different API for each database makes programming difficult. In addition, problems with the portability of solutions have appeared. Accordingly, standardization of the interface for accessing NoSQL databases is inevitable. Without it, getting to know a new NoSQL system will always involve learning a new programming language, which can effectively discourage potential users.
As mentioned in the previous sections, the market of NoSQL solutions is quite diverse. Many data models are in use, and even within the same model the differences are significant, as can be seen for example in the case of the Cassandra and Hypertable systems. It will therefore certainly be very difficult to develop a single standard for all categories. In the authors' opinion, the first step of this process should be taken separately for each group of databases, i.e. key-value databases, databases implementing column families, graph databases and document databases.
References
1. Apache Cassandra 1.1 Documentation, 10/11/2013, http://www.datastax.com/doc-source/pdf/cassandra11.pdf
2. ArangoDB Documentation, 04/11/2013, http://www.arangodb.org/documentation
3. Cypher queries, 10/10/2013, http://docs.neo4j.org/chunked/milestone/rest-api-cypher.html
4. HQL Reference, 10/10/2013, http://hypertable.com/documentation/reference_manual/hql/
5. Intro to NOSQL, and Cypher vs SQL: a declarative graph query language, 14/09/2013, http://www.meetup.com/Friends-of-Neo4j-Stockholm/events/87662782
6. OrientDB, 15/10/2013, https://github.com/orientechnologies/orientdb/wiki/SQL-Query
7. SELECT query form, 10/10/2013, https://code.google.com/p/tdwg-rdf/wiki/Beginners6SPARQL#6.4.3._SELECT_query_form
8. SPARQL 1.1 Query Language, 10/10/2013, http://www.w3.org/TR/2013/REC-sparql11-query-20130321
9. SPARQL by Example, 10/10/2013, http://www.cambridgesemantics.com/pl/semantic-university/sparql-by-example
10. The World's Leading Graph Database, 10/09/2013, http://www.neo4j.org/
11. Welcome to the UnQL Specification home, 10/10/2013, http://www.unqlspec.org/display/UnQL/
12. Welcome to Toad for Cloud Databases Community, 10/10/2013, http://toadforcloud.com/index.jspa
13. Curé, O., Hecht, R., Le Duc, C., Lamolle, M.: Data Integration over NoSQL Stores Using Access Path Based Mappings, 10/11/2013, http://hal.inria.fr/docs/00/73/83/56/PDF/finalDEXA.pdf
14. Kaskade, J.: Making Sense of Big Data, 10/10/2013, http://www.slideshare.net/infochimps/making-sense-of-big-data
15. Lith, A., Mattsson, J.: Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data, 10/08/2013, http://publications.lib.chalmers.se/records/fulltext/123839.pdf
16. Mewald, M.: Quest zaprezentowalo narzędzie do zarządzanie bazami NoSQL, 10/10/2013, http://webhosting.pl/Quest.zaprezentowalo.narzedzie.do.zarzadzanie.bazami.NoSQL
17. Strauch, Ch.: NoSQL Databases, 12/11/2013, http://oak.cs.ucla.edu/cs144/handouts/nosqldbs.pdf
18. Tiwari, S.: Professional NoSQL. John Wiley & Sons (2011)