SQL/MR functions - The Stanford University InfoLab
Recommend Documents
COMPUTER SCIENCE DEPARTMENT. REPORT STAN-CS-74-44. PROGRESS REPORT ON PROGRAM-UNDERSTANDING SYSTEMS by. C. Cordell Green ...
facturer's database, while price may come from the dealer's database. Distributed ... and performs translations to convert source information to a common form or ...
Table 3 shows the contents of the Japanese car database. Japanese Japa- Japa- Japa- Japanese. Make nese nese nese. Yen. Doors Seats Km. Accord Honda.
Best Masters Thesis Award in Computer Science, Stanford University (Christofer
Stephenson ... History Search Engine to build a fresher Web Search Engine that
...
Chapter 7. Clustering. Clustering is the process of examining a collection of “
points,” and grouping the points into “clusters” according to some distance
measure.
formal notions of privacy obtained from data perturbation and show that our perturbation ... permission and/or a fee. SIGMOD 2005 June 14-16, 2005, Baltimore, Maryland, USA. ...... [12] M. J. Freedman, K. Nissim, and B. Pinkas. Efficient ...
c 2010 C. Olston and M. Najork. DOI: 10.1561/ ... 1 Introduction. 176 ... Microsoft Research, 1065 La Avenida, Mountain View, CA, 94043, USA [email protected] ..... each access potentially causing a disk seek, which would limit the data.
SQL Queries. Principal form: SELECT desired attributes. FROM tuple variables |
range over relations. WHERE condition about t.v.'s;. Running example relation ...
manage a software engineering project are collectively referred to as a
methodology. There are two well-known software .... 2.5.3 Unit Testing . ..... Page
21 ...
Use SQL together with a general-purpose programming language: embedded ...
Cursors. Inside a PL/SQL program, the result of a SELECT must go somewhere:.
Further, consider the following very simple SQL query: .... commercial DBMSs
e.g., DB2-Client Server and Illustra , although not in all of them. Most of.
Chapter 5. Link Analysis. One of the biggest changes in our lives in the decade
following the turn of the century was the availability of efficient and accurate Web
...
In this chapter, we introduce propositional logic, an algebra whose original
purpose, ... shows how propositional logic can be used in computer circuit design
.
Such a facility is called a recommendation system. We shall begin this chapter
with a survey of the most important examples of these systems. However, to bring
...
2. RE's: Introduction. ◇Regular expressions are an algebraic way to describe
languages. ◇They describe exactly the regular languages. ◇If E is a regular ...
Anish Das Sarma. Basic Information. Home. Internet. Nationality. 5600 Stevens
Creek Blvd., #205 [email protected]. Indian citizen. Cupertino, CA ...
In this paper, we present Google, a prototype of a large-scale search engine which makes .... 1994 -- Navigators, "The b
Google is designed to crawl and index the Web efficiently ...... We hope Google will be a resource for searchers and res
traditional search techniques to data of this magnitude, there are new technical challenges involved with using the addi
It is still free. The new page allows you to select chapters without downloading the entire book. Anand Rajaraman and Je
Crawler-Friendly Web Servers. Onn Brandman, Junghoo Cho, ..... download the minimum number of archives to suit their needs. 3.4 Does exporting meta-data ...
A prototype for the ubiquitous computing environment has ... expanding computing power, coupled with broadband and mobile ... exchanged among various engineering services, for which a common data model is needed. ...... University, a Stanford Graduat
This happens because crawlers try to avoid downloading infinite numbers of dynamically created pages, or because they try to avoid bombarding a single.
It is still free. The new page allows you to select chapters without downloading the entire book. Anand Rajaraman and Je
SQL/MR functions - The Stanford University InfoLab
Jan 16, 2009 ... SQL/MR functions: ▫ Are Java functions meeting a particular API. ▫ Are compiled
outside the database, installed via a command line tool, and ...
SQL/MR Peter Pawlowski Member of Technical Staff January 16, 2009
ASTER BACKGROUND
2
Our Founders 3 PhD students from Stanford C.S.
Cool ideas…
… but no funding, no product, no clients!
OK, they had $ 10,000…
3
Our Product: nCluster A massively scalable database designed for analytics. Runs on a cluster of commodity nodes. Scales from GBs to 100s of TBs and beyond. Standard SQL interface (via a command line tool, JDBC, ODBC, etc). Support MR-like functionality via user-defined SQL/ MR functions.
4
Our Approach: Commodity Nodes
Server nodes
Query Results
5
5
Queen
Processing + Storage
SQL/MR
6
What are SQL/MR functions? SQL/MR functions: Are Java functions meeting a particular API. Are compiled outside the database, installed via a command line tool, and then invoked via SQL. Take a database table of one schema as input and output rows back into the database. Are polymorphic. During initialization, a function is told the schema of its input (for example, (key, value)) and needs to return its output schema. Accept zero or more argument clauses (parameters), which can modify their behavior. Are designed to run on a massively parallel system by allowing the user to specify which slice of the data a particular instance of the function sees. 7
First Example: Word Count Problem: Count the word frequency distribution across a set of documents. Input: A database table containing the documents in question. Map Phase: For each word in each document, outputs a row of the form (word, 1). Shuffle Phase: Brings all rows with the same value for word together. Reduce Phase: Count the number of rows for each word about output (word, ). 8
Input: The Documents Table BEGIN; CREATE FACT TABLE documents (body varchar, PARTITION KEY(body)); INSERT INTO documents VALUES (‘this is a single test document. it is simple to count the words in this single document by hand. do we need a cluster?’); END; SELECT body FROM documents; 9
Map Function: tokenize public class tokenize implements RowFunction {
... public void operateOnSomeRows(RowIterator inputIterator, RowEmitter outputEmitter) { while ( inputIterator.advanceToNextRow() ) { String[] parts =
splitPattern_.split( inputIterator.getStringAt(0) ); for (String part : parts) { outputEmitter.addString(part); outputEmitter.addInt(1); outputEmitter.emitRow(); } } } } 10
Reduce Function: count_tokens public class count_tokens implements PartitionFunction {
... public void operateOnPartition( PartitionDefinition partitionDefinition, RowIterator inputIterator, RowEmitter outputEmitter) { int count = 0; String word = inputIterator.getStringAt(0); while ( inputIterator.advanceToNextRow() ) count++; outputEmitter.addString(word); outputEmitter.addInt(count); outputEmitter.emitRow(); } } 11
Invoking the Functions BEGIN; \install tokenize.jar \install count_tokens.jar SELECT word, count FROM count_tokens ( ON ( SELECT word, count FROM tokenize(ON documents)) PARTITION BY word ) ORDER BY word DESC; ABORT; 12 12
Even Better: Forget the Reduce BEGIN; \install tokenize.jar SELECT word, sum(count) FROM tokenize(ON documents) GROUP BY word ORDER BY word; ABORT;
13
Types of SQL/MR Functions RowFunction Corresponds to a map function. Must implement the operateOnSomeRows method. Must be invoked without a PARTITION BY. “Sees” all the appropriate rows on a particular worker.
PartitionFunction Corresponds to a reduce function. Must implement the operateOnPartition method. Must be invoked with a PARTITION BY, which specifies how rows are reshuffled. “Sees” all the appropriate rows in a partition. 14
Requirements of a SQL/MR Function Must implement either RowFunction or PartitionFunction. Must have a single-argument constructor which takes a single RuntimeContract as a parameter. Class name must be all lowercase. Name of jar file must be the same as the SQL/MR function name. Note: can also upload a .zip file, containing multiple jars. The jar with the SQL/MR function must have same name as the function, but other jars can be included. Useful for including libraries. 15
The Constructor public tokenize(RuntimeContract contract) { ArrayList output = new ArrayList(); outputColumns.add( new ColumnDefinition("word", SqlType.varchar())); outputColumns.add( new ColumnDefinition("count", SqlType.bigint())); contract.setOutputInfo(new OutputInfo(outputColumns)); contract.complete(); }
16
The Constructor The constructor can throw exceptions. If the exception is a subclass of ClientVisibleException, the user sees a descriptive message on the command line tool. Otherwise, they see a generic error message. A full stack trace of the exception can be viewed via the AMC.
17
Full Syntax SELECT ... FROM FunctionName(
ON {tablename | (subquery)} [PARTITION BY ...] [ORDER BY ...] ARGCLAUSE1 (..., ...) MYCLAUSE (...) ... );
18
Tip 1: CTAS BEGIN; \install tokeninze.jar CREATE FACT TABLE counts (PARTITION KEY(word)) AS SELECT word, sum(count) FROM tokenize(ON documents) GROUP BY word; ORDER BY word; END;
19
Tip 2: Use Transactions BEGIN; \install tokeninze.jar CREATE FACT TABLE counts (PARTITION KEY(word)) AS SELECT word, sum(count) FROM tokenize(ON documents) GROUP BY word; ORDER BY word; END;
20
Tip 3: PARTITION BY c BEGIN; \install exact_percentile.jar SELECT * FROM exact_percentile( ON source_data PARTITION BY 1 PERCENTILE(25, 50, 75) ); ABORT;
21
Tip 4: Using act To connect to the cluster, use the command line tool act. bash$ act -h -d -U
Useful commands \d \d
\dF \? \timing
22
List all tables. Show table details. List installed SQL/MR files. More detailed help. Enable query timing.
Beyond Java: Stream BEGIN; \install tokeninze.py SELECT word, sum(count) FROM STREAM(
ON documents
SCRIPT(‘tokenize.py’) OUTPUTS(‘word varchar’, ‘count int’) ) GROUP BY word ORDER BY word; ABORT; 23
Netflix Data Schema movie_titles. Stores movie id, year, and titles. training_set. Main training dataset. Stores (customerid, movieid, viewdate, and rating). probe_set. A random sample of (customerid, movieid) pairs from the training set. Designed to be used for testing your classifier. qualifying_set. A set of (customerid, movieid, viewdate) rows not in the training set. To enter the contest, submit your classifier’s ratings for these movies. 24
Netflix Data Notes Both the probe and qualifying sets are ordered. The file you submit to Netflix needs to be in that same order. Therefore, the probe_set and qualifying_set tables have an extra entryid column. See www.netflixprize.com for more details about the dataset and on entering the contest.