Early Experience with Model-driven Development of MapReduce based Big Data Application

Asha Rajbhoj*, Vinay Kulkarni#, Nikhil Bellarykar$
Tata Consultancy Services, 54B, Industrial Estate, Hadapsar, Pune 411013, India
*[email protected], #[email protected], [email protected]

Abstract— With the internet becoming increasingly pervasive, data analytics is playing an increasingly critical role in business. The data to be analyzed exists in large quantities and in multiple formats. Many technologies exist to support Big Data analytics; however, they remain somewhat of a challenge for the average developer to use. The model-driven development (MDD) approach has been seen to eliminate accidental complexity to a large extent. We discuss an MDD approach for the development of MapReduce based Big Data applications, its efficacy, and the lessons learnt.

Keywords— Model-driven engineering, Meta-model, Big Data, MapReduce
I. INTRODUCTION
Data analytics is playing an increasingly critical role in business. Enterprises benefit from data analytics in multiple ways, from personalized recommendations and offerings to better, smarter, and faster decision making. Development of best-in-class Big Data business solutions demands a deep understanding of the domain, in-depth knowledge of data analytics algorithms, and thorough familiarity with the implementation technology platforms. This is a daunting task, especially when the average developer struggles to come to grips with the MapReduce programming model. The model-driven development (MDD) approach has been effective in eliminating accidental complexity, for instance, in developing typical business applications [1]. Encouraged by this experience, we tried using MDD techniques for developing a small but reasonably complex analytics-intensive Big Data application. We share our experience, the problems faced, and the lessons learnt from this endeavour. Though the experience is shared in the context of a specific case study, we believe MDE researchers, practitioners, and tool vendors will find the takeaways applicable in a more general context as well.
The rest of the paper is organized as follows. Section II sets the background. Section III presents the IPTV case study. Section IV presents MDD for MapReduce based applications. In Section V we discuss early results of its usage. We present related work in Section VI before concluding in Section VII.
Figure 1: Reflexive meta meta model
II. BACKGROUND
A. Modeling Framework
Our modeling framework provides a reflexive modeling language [2], shown in Fig. 1, that is compatible with OMG MOF [3]. Using this language, purpose-specific modeling languages are defined. The modeling framework provides facilities for creating diagramming notations and tree-based user interfaces to edit instance models. To specify consistency and completeness checks on instance models, we use OCL [4].

B. Model-driven Development
We have developed several large business applications using a homegrown MDD toolset [1, 2]. Our MDD approach shifts the focus of software development from code to high-level specifications, i.e., models. Models are annotated to encode design and architectural decisions. These annotated models are then automatically transformed into the desired implementation for the various technology platforms of choice. Eliminating the accidental complexity of developing business applications led to improved productivity as well as uniformly high code quality; only the implementers of the code generators had to deal with the accidental complexity. Success in using modeling and model-based techniques for eliminating accidental complexity in the development of business applications encouraged us to do the same for Big Data applications.

C. MapReduce
Various parallel computing technologies have emerged for Big Data analytics. MapReduce [5] is a popular framework for developing Big Data applications.
Figure 2: Hadoop MapReduce data flow
The overall approach of the MapReduce programming model is shown in Fig. 2. Generally, MapReduce [6] has two steps: the Map step and the Reduce step. The Map step reads the input and assigns a key to each input record. The Reduce step receives all values that have the same key and processes these groups. A MapReduce job splits the input dataset into independent chunks which are processed by map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then fed to the reduce tasks via a partitioner. The partitioner class determines which partition a given (key, value) pair will go to; the default partitioner computes a hash value for the key and assigns the partition based on this result. A combiner can optionally be used for optimization. The combiner, when used, receives the input emitted by the Mapper instances on a given node, and its output is then sent to the Reducers. Combiners are a "mini-reduce" process which operates only on data generated by one machine [6]. Hadoop [7] is an open-source implementation of MapReduce. MapReduce inputs and outputs are stored in a distributed file system, i.e., HDFS.
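As a concrete illustration of this data flow, the following is a minimal sketch of a Hadoop MapReduce job in Java, using the canonical word-count example rather than the IPTV application; it assumes the Hadoop 2.x org.apache.hadoop.mapreduce API, and the class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: read an input record and emit (word, 1) pairs, i.e., assign a key.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: receive all values sharing a key (routed by the partitioner) and aggregate.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional "mini-reduce" on each node
    job.setReducerClass(SumReducer.class);    // default hash partitioner is used
    job.setOutputKeyClass(Text.class);        // must match the generic parameters above
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```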
III. CASE STUDY
We chose as a case study a Big Data application that was small enough to be completed in a reasonably short time and yet had most of the complexities associated with the development of analytics-intensive applications. Our case study involved the use of various data analytics techniques such as clustering, prediction, integer programming, and reporting, thus appreciably covering the canvas of essential and accidental complexities in the Big Data analytics domain. Though these complexities are discussed in a specific context, they apply to any Big Data analytics application.

A. IPTV
Internet Protocol Television (IPTV) is a system wherein a digital television service is delivered using the internet architecture and networking methods [8]. With the bi-directional connectivity provided by the IPTV platform, it is possible to collect channel-surfing data, i.e., a click-stream, from each individual subscriber. From such data collected over a period, dominant profiles of TV viewership can be mined. A viewership profile can be mapped to a customer profile identifying products of interest. Thus, analysis of click-stream data helps in inserting the right advertisement of the right product for the right subscriber. We explored the use of click-stream data analysis for personalized advertising, pay-as-per-viewership, defining viewing policy, most-viewed-channel reports, most-viewed-program reports, and so on.
B. Essential Complexity
For implementing the case study, the first step was to formulate the business problem as a mathematical problem so as to identify the best-suited algorithm. Personalized advertisement involved identification of viewership profiles, profile matching, and selection of advertisements according to demography. Typically, viewership is influenced by various factors such as age group, financial conditions, geographical location, and so on. User profile identification could be solved by mining past click-stream data. We selected the k-means clustering algorithm for mining the click-stream data [9]. This algorithm needs the input data to be represented as a set of numerical vectors. Two dimensions of viewing behaviour were identified: (1) time spent on each genre of programs, e.g., news, movies, etc., and (2) number of transitions from one genre to another. The number of distinct profiles, i.e., the value of 'k', is not known upfront; to determine the correct value of 'k', we used the Silhouette [9] distance measure. Once the profiles are identified, the click-stream of every IPTV user is matched against the mined user profiles to determine which profile is watching TV, and personalized advertisement is inserted accordingly. Profile matching is done using the Euclidean distance measure: the Euclidean distance of the transformed click-stream from each of the cluster centroids is calculated, and the category corresponding to the centroid with the least distance is assigned to the subscriber. Advertisement insertion and revenue maximization were formulated as a binary integer programming problem. Various reports, such as most-viewed-channel and most-viewed-program reports, were implemented as queries over the click-stream Big Data.

C. Accidental Complexity
We used Hadoop MapReduce for implementing the analytics tasks discussed earlier. Each was an atomic task catering to a specific purpose; as a result, the code size for each task was small. However, the implementation involved many technological aspects. Comprehending the MapReduce paradigm and coming up with the first running program took a long time. Incompatible input/output datatypes for the mapper, reducer, combiner, partitioner, etc. was an oft-repeated mistake. The output type of the mapper needs to match the input type of the reducer. If a partitioner or combiner is included, the output of the mapper needs to match the input of the combiner, the output of the combiner needs to match the input of the partitioner, and finally, the output of the partitioner needs to match the input of the reducer. There is also redundancy in specifying types at multiple places; for instance, types need to be specified when extending the generic Mapper class, in the signature of the map function, and also in the JobConf instance. Mistakes in these lead to runtime type mismatch errors.
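To illustrate where these type declarations have to agree, the following is a hedged sketch using the newer org.apache.hadoop.mapreduce API (the text above mentions the JobConf of the older API); all class names and the genre-aggregation scenario are illustrative, not the actual case-study code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TypeConsistencyExample {

  // Place 1: generic parameters of the Mapper (in-key, in-value, out-key, out-value).
  public static class GenreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override  // Place 2: the map method signature repeats the same types.
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      // ... emit (genre, secondsWatched) pairs ...
    }
  }

  // The reducer's input types must match the mapper's (and combiner's) output types.
  public static class GenreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
      // ... aggregate ...
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "genre-aggregation");
    job.setJarByClass(TypeConsistencyExample.class);
    job.setMapperClass(GenreMapper.class);
    job.setCombinerClass(GenreReducer.class);   // must accept the mapper's output types
    job.setReducerClass(GenreReducer.class);
    // Place 3: the types are declared yet again in the job configuration;
    // a mismatch with the generic parameters above surfaces only at runtime.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}
```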
Figure 3: MapReduce meta model
Typically, analytics applications use implementations of machine learning algorithms. Mahout [10] provides implementations of a few machine learning algorithms that can run on Hadoop using MapReduce. For profile identification and matching, we used the Mahout k-means clustering implementation. It took a considerable amount of time to understand the interface of the algorithm and how to integrate it. Using a MapReduce program, we converted the raw click-stream data into vector form, with dimensions for the time spent on each genre and the number of transitions from one genre to another. This required the program schedule and the program-to-genre mapping as input files. When running in a distributed environment, these files need to be made available to the mappers on all nodes. Hadoop provides a distributed cache API to make data files accessible on all nodes of the cluster; developers had to be aware of the existence of such an API and understand its use (a sketch of its use appears below). Though the size of the application was small, the implementation took considerable time. As shown in Fig. 5, arriving at the first working implementation took ~16 weeks, of which ~4 weeks were spent in addressing essential complexity and ~12 weeks in addressing the accidental complexity of the implementation and functional coding.
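The distributed cache mechanism mentioned above can be invoked roughly as follows. This is a hedged sketch assuming the Job.addCacheFile / Context.getCacheFiles calls of the Hadoop 2.x mapreduce API; the file paths and class names are illustrative, not the actual case-study code.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

  public static class GenreMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws java.io.IOException {
      // Mapper side: the registered side files are available on the local node.
      URI[] cached = context.getCacheFiles();
      // ... load the program schedule and program-to-genre mapping from 'cached' ...
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "clickstream-vectorization");
    job.setJarByClass(CacheFileExample.class);
    job.setMapperClass(GenreMapper.class);
    // Driver side: register side files so the framework ships them to every node.
    job.addCacheFile(new URI("/data/iptv/program_schedule.txt"));   // illustrative path
    job.addCacheFile(new URI("/data/iptv/program_to_genre.txt"));   // illustrative path
    // ... input/output paths, reducer, and output types omitted for brevity ...
  }
}
```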
IV. MDD FOR MAPREDUCE

In this section we discuss an MDD approach for MapReduce based Big Data applications that eliminates the accidental complexity.

A. MapReduce Meta-model
We came up with the MapReduce meta-model shown in Fig. 3 to specify the program structure of an analytics application at a higher level of abstraction. It is created as an instance of the reflexive meta meta-model shown in Fig. 1. The problem to be solved using analytics is modelled as an AnalyticsTask. Typically, each analytics task is broken down into a set of small sub-tasks; these sub-tasks are modelled as Job. Each job may need to refer to input data files; for example, in the IPTV case we had to use the program time table. These input data files are modelled as DataFile with the path of the file. Typically, multiple sub-tasks of an analytics task need to be executed sequentially, and they are designed such that the output of one job is used as input by the subsequent job. This execution sequence of jobs is modelled with the nextJob association.
Each analytics job takes a set of input files and produces output data in a set of output files. The paths for the input and output are specified using the job properties inputPath and outputPath respectively. The format of these input/output files can differ; for example, the data in the files could be structured in vector form, or it could be in unstructured text form, and so on. This format information is specified by StreamFormat; the default for input/output files is text. Each job has one Mapper and can optionally have a Reducer. The Mapper transforms input records into intermediate records: it maps input key-value pairs to a set of intermediate key-value pairs. The Mapper class has properties named keyInType, keyOutType, valueInType and valueOutType to specify the data types of the input and output keys and of the input and output data values respectively. Combiners can optionally be used to cut down the amount of data transferred from Mapper to Reducer. They are intermediate Reducers and hence are modelled with the hasCombiner association whose target is Reducer. The Reducer class has properties named keyOutType and valueOutType to specify the data types of its output. Every MapReduce job has a default partitioner; Hadoop MapReduce also provides a mechanism to supply a user-defined partitioner class, which is modelled as Partitioner. To leverage parallel data processing capabilities, machine learning algorithms have to be implemented in the MapReduce paradigm. The meta-model shows representative algorithm classes; these inherit their structure from the Job class.
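To make the structure of Fig. 3 concrete, the following is a hypothetical rendering of the main meta-model entities as plain Java classes. The class and property names follow the description above, but the exact attributes, multiplicities, and representation in our modeling framework may differ; this is a sketch for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified rendering of the Fig. 3 meta-model entities.
class AnalyticsTask {
  String name;
  List<Job> jobs = new ArrayList<>();           // sub-tasks of the analytics task
}

class Job {
  String name;
  String inputPath;                             // job property: input location
  String outputPath;                            // job property: output location
  String streamFormat = "text";                 // StreamFormat; default is text
  List<DataFile> dataFiles = new ArrayList<>(); // reference data files
  MapperSpec mapper;                            // exactly one Mapper
  ReducerSpec reducer;                          // optional Reducer
  ReducerSpec combiner;                         // hasCombiner: an intermediate Reducer
  PartitionerSpec partitioner;                  // optional user-defined Partitioner
  Job nextJob;                                  // execution sequence of jobs
}

class DataFile { String path; }

class MapperSpec {
  String keyInType, valueInType;                // input key/value data types
  String keyOutType, valueOutType;              // output key/value data types
}

class ReducerSpec {
  String keyOutType, valueOutType;              // output key/value data types
}

class PartitionerSpec { String className; }
```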
B. Model to Code Generators
Accidental complexity was reduced by encoding the technology-related concerns into code generators, thus generating the mapper, reducer, partitioner, and combiner classes. For every modelled analytics task, a driver class is generated that integrates the mapper, reducer, partitioner, and combiner classes; the driver class invokes the contained jobs sequentially. MapReduce isolates the programming task into two functions, namely the map and reduce functions. Both these functions are quite procedural and specific to the analytics task, and the data these functions operate on is rarely structured. Hence we did not see the need for a high-level language for specifying the map and reduce function bodies. Instead, we generated placeholders, marked with appropriate comments, for users to insert code for the desired processing logic.
The generative approach also ensured consistency of the input/output data types across the mapper, reducer, combiner and partitioner. It also generated code to invoke the HDFS API for every DataFile modelled for a job. For jobs using machine learning algorithms, it generated appropriate wrapper code to invoke the Mahout [10] library. The model-to-text transformers encoded the relevant design patterns, e.g., use of SequenceFile, which is splittable, compressible as well as compact and hence efficient. MapReduce programs are usually written in Java using the Hadoop APIs, but they can also be coded in languages such as C++, Perl, Python, Ruby, R, etc. As the MapReduce meta-model is abstract and programming-language independent, it can be automatically transformed into any of the desired languages.
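To indicate the shape of the generated artefacts, the following is a hedged sketch of what a generated driver and mapper skeleton could look like; the code actually produced by our generators differs in detail, and all class names, paths, and job names here are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative shape of a generated driver: one method per modelled Job,
// invoked sequentially as per the nextJob association.
public class ClusteringTaskDriver {

  public static class AggregateMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      // TODO: user-written map logic goes here (generated placeholder)
    }
  }

  static boolean runAggregationJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "aggregate-clickstream");
    job.setJarByClass(ClusteringTaskDriver.class);
    job.setMapperClass(AggregateMapper.class);
    job.setOutputKeyClass(Text.class);              // kept consistent with the model
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/iptv/raw"));        // from inputPath
    FileOutputFormat.setOutputPath(job, new Path("/iptv/vectors"));  // from outputPath
    return job.waitForCompletion(true);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Jobs are chained sequentially; a later job reads the previous job's output.
    if (!runAggregationJob(conf)) System.exit(1);
    // ... the second aggregation job and the Mahout k-means invocation would follow ...
  }
}
```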
V. EARLY RESULTS, BENEFITS AND LESSONS LEARNT
Using our MDD approach, we generated the IPTV case study code. An instance of the meta-model shown in Fig. 3 was created for the case study; it comprised 6 analytics tasks with the corresponding model elements for jobs, mappers and reducers. The MapReduce meta-model helped in expressing the mathematical formulation in terms of the MapReduce paradigm. Fig. 4 shows the modelled clustering analytics task in textual form. This task invokes three jobs: the first and second jobs combine and aggregate the data into the required form, and the third job performs the clustering. The accidental complexity discussed in the implementation was addressed through the model-based generative approach. The generated code comprised ~25 classes and ~2K lines of code. Early results of using the MDD approach for a MapReduce based Big Data application have shown an improvement in developer productivity. As shown in Fig. 5, the effort spent in addressing accidental complexity was reduced from 7 weeks to 2 weeks using the MDD approach. Though the typical size of analytics applications is quite small, we believe that, as the number of analytics tasks being developed increases, the perceived gain in developer productivity will be significant. It is also observed that the performance of the generated code is comparable with that of hand-written code; no overhead is introduced by the wrapper library code. We would, however, point out that this work is a preliminary step towards a Big Data framework, and we intend to strengthen the framework in subsequent research. We could not find a way to reduce essential complexity through the use of modeling and model-based techniques: designing an efficient data transformation strategy in terms of MapReduce key-value pairs remained a purely manual task. We think there is a need to investigate whether, and how, model-based techniques can reduce the essential complexity of developing analytics-intensive Big Data applications.
Figure 4: Computing cluster task model in textual form
Figure 5: IPTV case study effort data
VI. RELATED WORK

Several productivity aids exist for MapReduce programming. Languages such as Pig Latin [11] and HiveQL [12] provide an SQL-like syntax for specifying various data transformations such as merging data sets, filtering, and querying. The Pig query planner generates mappers and reducers from Pig Latin code, which are then executed on a Hadoop cluster. Hive [13] provides a logical RDBMS environment on top of the MapReduce engine; using HiveQL, users can write declarative queries that are optimized and translated into MapReduce jobs. We observed that both Pig Latin and HiveQL are suitable only for simple queries over structured data, and we found them inadequate for implementing complex algorithms in the SQL-like syntax they support. To the best of our knowledge, a model-driven development approach for Big Data analytics has not yet been explored and evaluated by the research community.

VII. CONCLUSION

We discussed the essential and accidental complexities involved in constructing a MapReduce based application with the help of the IPTV case study. To overcome the accidental complexities, we presented a MapReduce meta-model and proposed an MDD approach for MapReduce based applications. We demonstrated the effectiveness of the MDD approach using the case study. Further investigations are needed to see if model-based techniques can help in reducing the essential complexity.

REFERENCES
[1] Vinay Kulkarni, R. Venkatesh, Sreedhar Reddy: Generating Enterprise Applications from Models. OOIS Workshops 2002: 270-279.
[2] Vinay Kulkarni, Sreedhar Reddy, Asha Rajbhoj: Scaling Up Model Driven Engineering - Experience and Lessons Learnt. MoDELS (2) 2010: 331-345.
[3] Meta Object Facility (MOF), http://www.omg.org/spec/MOF/2.0
[4] Object Constraint Language (OCL), http://www.omg.org/spec/OCL/2.2
[5] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
[6] http://developer.yahoo.com/hadoop/
[7] Hadoop: Open source implementation of MapReduce, http://hadoop.apache.org/
[8] ATIS Std. ATIS-0800007, IPTV High-level Architecture, 2007.
[9] MathWorks, http://www.mathworks.in/
[10] Apache Mahout, http://mahout.apache.org/
[11] Pig, http://pig.apache.org/
[12] Hive, http://hive.apache.org/
[13] A. Thusoo, et al.: Hive: A Warehousing Solution over a Map-Reduce Framework. PVLDB 2(2), 2009.