Proceedings of the 9th INDIACom; INDIACom-2015; IEEE Conference ID: 35071
2015 2nd International Conference on “Computing for Sustainable Global Development”, 11th - 13th March, 2015
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Data Model for Big Data in Cloud Environment

Imran Khan
S. K. Naqvi
BVICAM, New Delhi, INDIA Email Id:
[email protected]
Additional Director, FTK-CIT, JMI, New Delhi
Mansaf Alam
S. N. A Rizvi Dept. of Mathematics, JMI, New Delhi
Assistant Professor, Comp. Sc. Dept., JMI, New Delhi

Abstract - As enormous amounts of data are generated at high speed, a new era of data processing has begun: Big Data processing. Big Data deals with data that have high Volume, high Velocity, high Variety and high Veracity, known as the 4 V's of Big Data. Traditional systems cannot manage Big Data because of these four properties, so a system is required that can manage Big Data economically and efficiently. Cloud computing is emerging as a major breakthrough in information technology that provides economical solutions to clients on a pay-as-you-use basis, so using cloud services to manage Big Data could provide an effective and economical solution. In this paper we propose a framework that manages Big Data in a cloud environment and provides a schema for Big Data, so that an end user can easily query the system without extra programming skills.

I. INTRODUCTION
Big Data is a buzzword for the huge amount of data generated from many sources, such as meteorological data, social media, mobile devices and web logs. The rate at which data is growing is so high that it has brought a new era in data processing: Big Data processing. Because the datasets in Big Data are extremely large and contain structured, semi-structured and unstructured data, they are difficult to manage effectively with traditional databases such as relational databases. Different techniques have been developed to tackle high-volume, high-velocity and high-variety data. As Internet of Things (IoT) applications have gained popularity and the cost of information and communications technology (ICT) has fallen, huge oceans of data have been generated by a large number of devices, and much valuable information can be mined from them.
Consider some examples. A manufacturing company can leverage its Big Data to increase profit and compete with other companies; meteorological data can be used to provide better weather information; and in healthcare, data analysis helps in assessing the effect of widely prescribed pharmaceuticals [1]. The results obtained from analyzing data are used to overcome previous shortcomings and to make better decisions. A number of developed and developing nations have invested in Big Data research in order to leverage Big Data.

The term Big Data is not only about Volume; it also covers Velocity, Variety, Veracity and Value [2]. Earlier, most DBMS packages managed structured data only, but data also has an unstructured component, and during analysis that component plays a very important role in decision making. For example, web logs make up a very large part of website data and are generally unstructured, yet they are an important resource for understanding the interests of customers. Today the majority of generated data is unstructured, and managing it is still a challenge. Velocity is also an important aspect, because data is generated at very high speed: consider social networking sites such as Facebook and Twitter, where messages and posts go viral in a fraction of a second.

In this paper we propose a system that manages Big Data in a cloud environment. The system controls the high-velocity data being generated and also extracts structured information from the unstructured component. The rest of the paper is organized as follows: the second section discusses related work, the third section presents the proposed system, and the fourth section gives the conclusion and future work.

II. RELATED WORK
A number of systems have been developed, and many are in progress, to manage Big Data in a cloud environment. The Google File System (GFS) [3] is a distributed file system for managing large datasets; it stores data in the form of chunks with fault-tolerance support. Hadoop is also a very popular solution for managing Big Data; its distributed file system, the Hadoop Distributed File System (HDFS) [4], stores data using a Namenode and Datanodes.
Different organizations have different data management needs: some need to store structured data only, while others need to manage both structured and unstructured data. Bigtable [5], introduced by Google, is a data management solution for structured data. It is a distributed storage system that can hold a large volume of data on thousands of commodity servers, and it provides dynamic control over data layout. Similar to Bigtable, PNUTS was designed to support Yahoo's web application data and Dynamo to support Amazon's applications. PNUTS [6] is a distributed database that organizes data in hashed or ordered tables and provides automated load balancing. Dynamo [7] is a distributed key/value data store that provides high availability and scalability to a number of Amazon's applications. Depending on their requirements, many large organizations have created their own data management solutions.

Copy Right © INDIACom-2015; ISSN 0973-7529; ISBN 978-93-80544-14-4

As a major breakthrough in information technology, cloud computing services have been used by a number of organizations to provide better and more efficient solutions for Big Data. A number of architectures and frameworks for cloud computing have been proposed; popular among them are two-tier architectures with only two agents, the cloud service provider and the client. The authors of [8] proposed a service composition framework in which service management and request receiving are handled by a composition agent. The authors of [9] proposed a three-layered service framework for the cloud in which services, application logic and user interface are divided into three layers. The authors of [20] proposed a five-layer framework for cloud databases, in which aspects such as manageability, transparency, security and interoperability are discussed for the various layers. Some work has targeted unstructured databases specifically: in [21], a cloud algebra is proposed that deals with the unstructured component of Big Data in a cloud database management system (CDBMS).

Apart from storage, work is continuing on the variety aspect of Big Data. Big Data consists mainly of unstructured data, around 80% [12], and if that component is left alone no meaning can be drawn from it. It is therefore vital to manage unstructured data [13, 14].
Three categories of approaches to managing unstructured data have been identified: creating a database schema, developing a new data model, and structuring queries [15]. Mansuri and Sarawagi [16] proposed a technique that links unstructured data to a relational database: named entities are first extracted, and the extracted entities are then matched with the entities that already exist in the database. Chu et al. [17] proposed an architecture that extracts structured information from textual documents. Doan et al. [18] introduced an unstructured data management system (UDMS) that extracts structured information from unstructured data. Liu et al. [19] proposed an advanced unstructured data repository (AUDR) that manages multimedia data such as images and audio files.

Big Data and cloud computing are two fast-emerging areas. Work is continuing in both, and many researchers are combining the two to leverage the benefits of each and to provide efficient and economical solutions for organizations that handle large amounts of structured as well as unstructured data.

III. DATA MODEL FOR BIG DATA IN CLOUD
In this work we propose a framework that provides a schema for Big Data and stores Big Data in the cloud. It is a three-layered architecture in which each layer is independent of the others and interacts with the adjacent layer through an interface. The framework is divided into three levels:
i.) External level or User level
ii.) Schema level
iii.) Cloud Storage level

A. External level
Velocity is one of the major concerns of Big Data. To control it, high speed computing nodes (HSCN) are used to hold the data arriving from users or applications. These nodes have large caches and buffers that buffer the data and maintain the performance of the system. When data arrives at very high velocity, the HSCNs cache some transactional data in order to maintain the availability of the data. The system buffer also holds the data so that it can be stored easily on the commodity hardware in the cloud at the inner layer. From this level the user can query the data using HiveQL, an SQL-like language for querying data from the database.
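The buffering role of an HSCN at the external level can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name `IngestBuffer`, its `capacity` and `batch_size` parameters, and the backpressure rule (reject a record when the buffer is full) are all assumptions made for illustration. The idea shown is only that a bounded in-memory buffer absorbs bursts and hands data to the slower storage layer in batches.

```python
from collections import deque
from typing import Callable, Deque, List


class IngestBuffer:
    """Hypothetical stand-in for the buffer held by one HSCN."""

    def __init__(self, capacity: int, batch_size: int) -> None:
        if capacity <= 0 or batch_size <= 0:
            raise ValueError("capacity and batch_size must be positive")
        self.capacity = capacity      # assumed limit on buffered records
        self.batch_size = batch_size  # assumed size of one write to storage
        self._queue: Deque[dict] = deque()

    def offer(self, record: dict) -> bool:
        """Accept a record if there is room; return False to signal backpressure."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(record)
        return True

    def drain(self, sink: Callable[[List[dict]], None]) -> int:
        """Hand at most one batch to the storage layer, oldest records first."""
        count = min(self.batch_size, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        if batch:
            sink(batch)
        return len(batch)

    def __len__(self) -> int:
        return len(self._queue)
```

In use, producers would call `offer` for each arriving record while a background task repeatedly calls `drain` with a function that writes a batch to cloud storage. A rejected `offer` is one possible signal that further HSCNs are needed.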
To analyze the data we use Complex Event Processing (CEP), since real-time data must be analyzed. When data enters the system, it is evaluated against the stored queries, which are written in an Event Processing Language (EPL); the results are analyzed and alarms are raised.
[Figure omitted. Labels: HSCN, Analyzed Results using CEP]
Fig. 1. External View for Analyzing Data
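The CEP step can be sketched in plain Python as a set of standing queries that every arriving event is applied to. This is a stand-in for EPL, not EPL itself: the class `RateAlarmQuery`, the event fields `key` and `timestamp`, and the rule "alarm when more than `threshold` matching events fall inside a sliding time window" are hypothetical choices made for illustration. The sketch also assumes that events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Iterable, List, Optional


@dataclass
class RateAlarmQuery:
    """Hypothetical standing query: alarm when more than `threshold` events
    with the given key arrive within `window_seconds`."""
    key: str
    window_seconds: float
    threshold: int
    _times: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def on_event(self, event: dict) -> Optional[str]:
        if event.get("key") != self.key:
            return None
        now = event["timestamp"]
        self._times.append(now)
        # Evict timestamps that have slid out of the time window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) > self.threshold:
            return f"ALARM {self.key}: {len(self._times)} events in window"
        return None


def process(events: Iterable[dict], queries: List[RateAlarmQuery]) -> List[str]:
    """Apply each arriving event to every stored query and collect the alarms."""
    alarms: List[str] = []
    for event in events:
        for query in queries:
            alarm = query.on_event(event)
            if alarm is not None:
                alarms.append(alarm)
    return alarms
```

The queries are registered once and the data flows through them, which is the reverse of a database, where the data is stored and queries are issued against it. That inversion is what makes the approach suitable for real-time data.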
B. Schema level
This level creates a unified schema for Big Data. To create the schema, it first identifies the type of the arriving data through an Information Extraction system. If the data is application data or structured data, its existing schema is used. If the data is unstructured, the useful information is first extracted from it and stored in a table to form the unstructured data schema. For developing metadata from unstructured data, two kinds of extraction are very important [11]: entity extraction, which covers information such as names and publisher, and fact extraction, which covers information such as the type of content and issues. The useful information to be found in unstructured data is based on the 15 elements of the Dublin Core Metadata Element Set, maintained by the Dublin Core Metadata Initiative (DCMI) [10]. They are:
1. Title
2. Description
3. Date
4. Identifier
5. Creator
6. Publisher
7. Type
8. Source
9. Subject
10. Contributor
11. Format
12. Language
13. Relation
14. Coverage
15. Rights
Once the required information has been extracted, it is categorized according to the type of data, and a table is created that is mapped to the structured data schema. After both schemas have been mapped, a unified schema is prepared that holds data about all the data stored in the database, kept in the Big Data Dictionary.
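The schema-level flow (identify the type of data, extract Dublin Core metadata from unstructured items, and merge everything into one dictionary) can be sketched in Python as follows. The sketch rests on several assumptions that are not in the paper: a record counts as structured when it carries a `schema` key, unstructured items expose already-extracted values under an `attributes` key (the real information extraction step is not modelled), and unstructured tables are named `unstructured_<type>`. The element names are the standard Dublin Core ones.

```python
from typing import Any, Dict, List, Optional, Tuple

# The 15 Dublin Core element names, lower-cased.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def classify(record: Dict[str, Any]) -> str:
    """Assumed rule: a record that declares a schema is structured."""
    return "structured" if "schema" in record else "unstructured"


def extract_dc_metadata(record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Placeholder for information extraction: keep the Dublin Core fields
    that are present and mark the rest as missing (None)."""
    known = record.get("attributes", {})
    return {element: known.get(element) for element in DC_ELEMENTS}


def build_unified_schema(
    records: List[Dict[str, Any]],
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, List[Dict[str, Any]]]]:
    """Return (data_dictionary, metadata_rows) for a mixed list of records."""
    dictionary: Dict[str, Dict[str, Any]] = {}
    rows: Dict[str, List[Dict[str, Any]]] = {}
    for record in records:
        if classify(record) == "structured":
            # Structured data keeps the schema it arrived with.
            dictionary[record["name"]] = {
                "kind": "structured",
                "columns": list(record["schema"]),
            }
        else:
            # Unstructured data is grouped by category into one table per type.
            meta = extract_dc_metadata(record)
            table = f"unstructured_{meta['type'] or 'unknown'}"
            dictionary.setdefault(
                table, {"kind": "unstructured", "columns": list(DC_ELEMENTS)}
            )
            rows.setdefault(table, []).append(meta)
    return dictionary, rows
```

Because every unstructured table shares the same 15 columns, a query layer can treat those tables uniformly, while structured datasets stay addressable by their own column names through the same dictionary.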
C. Cloud Storage Level
At the Cloud Storage level, data is stored on commodity hardware using Hadoop's HDFS. Clusters are formed on the basis of the type of data: two types of clusters are formed, one to store structured data and the other to store unstructured data. Within the unstructured data cluster, further clustering is done by category of unstructured data; for example, with three categories of unstructured data such as text, audio and video, three clusters can be formed. Within the clusters, data is stored in chunks of 64 MB with two replicas, using the underlying file structure of HDFS.

IV. CONCLUSION AND FUTURE WORK
Data has been generated for years and collected by organizations for their proper functioning. With advances in information technology, the rate of data generation and its volume have increased many times over, and the result is termed Big Data. Big Data is an umbrella term for data of high velocity, high volume, high variety and veracity that is difficult to manage with traditional solutions. In this paper we have proposed an economical and effective solution for Big Data: a framework that provides an economical data store in the cloud on commodity hardware. Our framework also extracts metadata from the data, which is used to provide a schema for Big Data. In future we will extend this work through proper experimentation and results.

REFERENCES
[1] http://iveybusinessjournal.com/topics/strategy/why-big-datais-the-new-competitive-advantage#.U6Ajm5RdX78
[2] http://www.linkedin.com/today/post/article/2014030607340764875646-big-data-the-5-vs-everyone-must-know
[3] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[4] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design,” Hadoop Project Website, vol. 11, 2007.
[5] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, “Bigtable: A distributed structured data storage system,” in 7th OSDI, 2006, pp. 305–314.
[6] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, “PNUTS: Yahoo!'s hosted data serving platform,” Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277–1288, 2008.
[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon's highly available key-value store,” in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205–220.
[8] T. V. Pham, H. Jamjoom, K. Jordan, and Z.-Y. Shae, “A service composition framework for market-oriented high performance computing cloud,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. Chicago, Illinois: ACM, 2010, pp. 284–287.
[9] M. Zhang, R. Ranjan, S. Nepal, M. Menzel, and A. Haller, “A declarative recommender system for cloud infrastructure services selection,” in Proceedings of the 9th International Conference on Economics of Grids, Clouds, Systems, and Services. Berlin, Germany: Springer-Verlag, 2012, pp. 102–113.
[10] M. F. Abdullah and K. Ahmad, “The mapping process of unstructured data to structured data,” in 2013 International Conference on Research and Innovation in Information Systems (ICRIIS). IEEE, Nov. 2013, pp. 151–155.
[11] R. Rao, “From unstructured data to actionable intelligence,” IT Professional, vol. 5, pp. 29–35, 2003.
[12] V. Gupta and G. S. Lehal, “A survey of text mining techniques and applications,” Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, 2009.
[13] R. Blumberg and S. Atre, “The problem with unstructured data,” DM Review, vol. 13, pp. 42–49, 2003.
[14] W. M. S. Yafooz, S. Z. Abidin, and N. Omar, “Towards automatic column-based data object clustering for multilingual databases,” in IEEE International Conference on Control System, Computing and Engineering (ICCSCE). IEEE, 2011.
[15] W. M. S. Yafooz, S. Z. Abidin, and N. Omar, “Managing unstructured data in relational databases,” in IEEE Conference on Systems, Process & Control (ICSPC2013), Kuala Lumpur, Malaysia, 2013.
[16] I. R. Mansuri and S. Sarawagi, “Integrating unstructured data into relational databases,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE'06). IEEE, 2006.
[17] E. Chu, A. Baid, T. Chen, A. Doan, and J. Naughton, “A relational approach to incrementally extracting and querying structure in unstructured data,” in Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, 2007.
[18] A. Doan, J. F. Naughton, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao, C. Gokhale, J. Huang, W. Shen, and B.-Q. Vuong, “The case for a structured approach to managing unstructured data,” arXiv preprint arXiv:0909.1783, 2009.
[19] X. Liu, B. Lang, W. Yu, J. Luo, and L. Huang, “AUDR: an advanced unstructured data repository,” in 6th International Conference on Pervasive Computing and Applications (ICPCA). IEEE, 2011.
[20] B. Alam, M. N. Doja, M. Alam, and S. Mongia, “5-Layered architecture of cloud database management system,” AASRI Procedia, vol. 5, pp. 194–199, 2013.
[21] M. Alam, “Cloud algebra for handling unstructured data in cloud database management system,” International Journal on Cloud Computing: Services & Architecture, vol. 2, no. 6, 2012.
[Figure omitted. Labels by level. User Level: users, HSCN nodes. Schema Level: Identify type of Data; Unstructured: Information Extraction based on DCMI parameters, Unstructured Data Schema; Structured: Structured Data Schema; Unified Schema. Cloud Storage level: Structured data cluster; Unstructured data cluster (Text, Audio, Video).]
Fig. 2. Data model for Big Data in Cloud