A Survey on Big Data, Mining: (Tools, Techniques, Applications and Notable Uses)
Nour E. Oweis1,4, Suhail S. Owais2, Waseem George1, Mona G. Suliman3, Václav Snášel1,4
1Department of Computer Science, VŠB-Technical University of Ostrava - Czech Republic
2Department of Computer Science, FIT, Applied Science University, Amman - Jordan
3Faculty of Information Technology, MEU-Middle East University, Amman - Jordan
4IT4Innovations, VŠB-Technical University of Ostrava, Ostrava - Czech Republic
[email protected]
Abstract. Big Data is a massive set of data that is too complex to be managed by traditional applications. It includes huge, complex, and abundant structured, semi-structured, and unstructured data, as well as hidden data, generated and gathered from many fields and resources. The challenges of managing Big Data include extracting, analyzing, visualizing, sharing, storing, transferring, and searching such data. Traditional data processing tools and applications are not capable of managing data of this kind. Therefore, there is a critical need to develop effective and efficient Big Data Mining techniques. This, in turn, has opened research frontiers in exploiting artificial intelligence techniques for Big Data management. This study surveys the most effective Big Data Mining techniques and their applications in various social, medical, and scientific fields.
Keywords: Big Data; Data Mining; Smart Devices.
1
Introduction
From the age of ancient graphic symbols and proto-writing to our modern digital age, data generation has never stopped, and the amount of digital data has been growing at an explosive rate. Small data has a limited storage schema, is well structured, and fits a relational database. The traditional techniques for analyzing small data are not very complex and are built on the relational database model between subjects [1]. Generally, the small data stored in a data warehouse is mostly understood. On the other hand, data now pervades our digital world: large, complex, non-relational, and unstructured data arrives from many fields and resources, including social networks, medical science, commerce, industry, science, and many more. This led to the term Big Data, where one main purpose of big data analysis is to generate small data that can be understood. Big data offers many opportunities for development in many fields. Big data techniques complement the traditional data tools, and big data remains one of the newest and most debated terms in the whole data world [2]. For us as scientific researchers, the massive amount of data has altered the way we approach research implementation and analysis, since we must now handle huge, complicated, hidden, and sometimes unavailable data. This data-intensive approach has evolved into a new form of scientific discovery called big data mining.
The rest of the paper is organized as follows. Section 2 reviews the main big data definitions and characteristics, including the 5 V’s. Section 3 surveys big data sources, including smart hardware devices and system software resources. Section 4 presents the main big data tools. Section 5 discusses the data mining concept and its types. Section 6 briefly describes data mining techniques. Section 7 presents big data applications and their notable uses in companies, healthcare, finance, telecommunications, marketing, and industry. Section 8 gives the conclusion and future work.
2
Big Data Definitions and Characteristics
Now that the huge change in data has been widely accepted, Big Data appears to be the biggest issue and the next frontier for research innovation [3]. One of the most popular definitions of big data is given by Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” [4]. After the 3 V’s of Gartner (Volume, Velocity, Variety), IBM data scientists introduced a fourth V of big data, “Veracity” [5], which covers ambiguous and incomplete data and poses another challenge for keeping big data organized [6]. Discovering and analyzing the hidden valuable data within these 4 V’s (Volume, Velocity, Variety, and Veracity) leads to the fifth V, “Value”, which is the main opportunity that most organizations are looking for. Another definition of big data is given by Matt Aslett, who defines the term as follows: “Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies”.
Each of the 5 V’s (Volume, Velocity, Variety, Veracity, and Value) is part of the scope of big data. These 5 V’s are now the main characteristics of big data [1], [7], as listed below:
1. Data Volume is the size of the dataset, measured in terabytes, exabytes, or zettabytes.
2. Data Velocity indicates the speed at which data flows in and out in real time. Data is being generated fast and needs to be processed fast, so data velocity measures how often data is generated over time.
3. Data Veracity describes the ambiguity and incompleteness of the data.
4. Data Variety describes the range of data types that come from different fields and resources, covering structured, semi-structured, and unstructured data.
5. Data Value refers to exploring, discovering, and analyzing the most valuable information in the dataset.
3
Big Data Sources
Nowadays, more than two billion users are using the newest smart technologies, which generate data daily: smart homes, smart cities, smart business, smart posters, up-to-the-minute entertainment applications, and even new facilities that allow machine-to-machine communication with no user in between, known as the Internet of Things (IoT) [8]. These smart devices are widely used and generate a huge amount of data every day, which increases both the volume and the variety of data. We classify these data sources into two main parts: system hardware and system software.
1. System Hardware
The system hardware sources include the smart devices and infrastructure that generate and collect data daily, such as data centers, Wireless Sensor Networks (WSN), Radio Frequency Identification (RFID), Near Field Communication (NFC), machine log data, cloud computing, smart phones, smart posters, and many more [8, 9, 10].
2. System Software
The system software sources include the many software products and platforms developed specially for big data: big data platforms (IBM, SAP, SAS, Microsoft, Intel, Infobright, Hortonworks, Kognitio, Oracle, and Amazon); Internet of Things technologies such as Internet Protocol version 6 (IPv6), Machine to Machine (M2M) protocols, Service Oriented Architecture (SOA), and Representational State Transfer (REST); the Web of Things; cloud platforms such as Google Cloud and Dropbox; social networks such as Facebook, LinkedIn, and Twitter; and query services such as Google BigQuery [8, 9, 10].
All of these data sources and technologies play an important and critical role in many fields, such as medical science, social networks, commerce, industry, science, and many more.
4
Big Data Tools
Storing a huge amount of data is not the main challenge; one of the greatest challenges of big data is making sense of this data so that it becomes valuable. Big data deals with multiple data types, such as structured, semi-structured, and even unstructured data, and with non-traditional databases. These data can reach storage capacities measured today in petabytes, exabytes, or zettabytes, which require specific tools. Nowadays, many big companies such as Google do not use traditional techniques to process their data; they use special big data tools to manipulate, store, and analyze their data streams. There are also several big data tools used to extract, analyze, and visualize complex and varied data types. In this section we briefly introduce four main open source big data tools: NoSQL, MapReduce, Hadoop, and the R language [11].
1. NoSQL
NoSQL stands for “Not Only SQL”: it combines traditional SQL techniques with new and alternative techniques for querying and accessing large, complex, unstructured, and non-relational datasets [12], which can be stored remotely on multiple virtual servers in the cloud. Many NoSQL databases are open source and are useful for big data management. NoSQL is combined with other tools such as massively parallel processing, column-based databases, and database-as-a-service (DBaaS), and social networks such as Facebook, LinkedIn, and Twitter have been reported to use the Apache Cassandra NoSQL database [13].
2. MapReduce
MapReduce is one of the most widely used programming models for data mining. It allows programmers to process large datasets with a parallel, distributed algorithm on a cluster, using MapReduce libraries available for several programming languages such as C, C++, Java, and Perl [13, 14]. MapReduce was introduced by Google and consists of two parts, Map and Reduce. Map is a procedure that divides, filters, and sorts the data across the distributed cluster, while Reduce is a procedure that summarizes the intermediate results [11]. MapReduce can be applied to large volumes of data processed by a large number of servers; it can be used to sort a petabyte of data in only a few hours. Parallelism also offers some ability to recover from partial server failures: if a worker node [12] performing a map or reduce operation fails, its work may be transferred to another worker node (assuming that the input data for the ongoing operation are still available) [15]. The most popular open source implementation of MapReduce is Apache Hadoop.
3. Hadoop
Hadoop is an open source, freely available set of tools and libraries, built as a Java-based software framework for developing and executing applications that process large volumes of distributed data [12]. Hadoop can process thousands of terabytes of large, complex, and non-relational data under several operating systems such as Windows, Linux, BSD (UNIX), and OS X for Apple Macintosh [11]. The Hadoop framework has been used by large web search companies such as Yahoo, and its design follows the MapReduce model published by Google [13, 15].
4. R Language
R is a programming language and free software environment for statistical computing and graphics, developed as an open source project under GNU. R is an implementation of the S programming language, which was developed at Bell Labs, and it is used for processing large amounts of data.
Big data has altered the way we conduct scientific research. Several analysis techniques are used to extract and visualize this complex data, and data mining and machine learning are used to optimize processes, decisions, design, and implementation [16].
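The Map and Reduce procedures described for MapReduce in this section can be illustrated with a small single-machine sketch. It is not Hadoop code and does not use any MapReduce library; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and word counting is chosen only as a simple illustrative task. On a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially (illustration only)."""
    mapped = []
    for document in documents:  # each split could run on a separate worker
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big mining data"]))
```

Because every call to `map_phase` depends only on its own split, a failed split can simply be mapped again elsewhere, which is the recovery property noted above.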
5
Data Mining Concept and Types
The idea of parallel data mining has emerged with big data to improve the use of huge and complex datasets by means of Artificial Intelligence (AI) systems, which make the computer think and operate like a human being [17]. Data mining refers to the steps of searching, analyzing, and extracting valuable data from data warehouses in support of problem solving and decision making. This process is known as Knowledge Discovery in Databases (KDD) [18]. Data mining techniques have a variety of applications and notable uses that are designed to work with huge datasets. These applications cover a vast domain of our life, and most of them are presented briefly in this study. There are two main types of data mining models. The first type is descriptive data mining, which uses summarization and analysis tools to reorganize the data and extract its basic structure and interconnections; it includes summarization, clustering, and interconnection (association) mining, and is commonly used in marketing and advertising. The second type is predictive data mining, which builds a model from existing data and carries out more specific analysis; it includes classification, specification, and prediction, and is commonly used in marketing predictions, such as which new products may become popular in the future [19]. These models are commonly used with big data as highly valuable techniques for raising productivity in different fields.
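The difference between the two model types can be sketched on a toy dataset. The example below is only an illustration under stated assumptions: the `sales` records are invented, summary statistics stand in for descriptive mining, and a least-squares line stands in for predictive mining; the function names are hypothetical.

```python
from statistics import mean

# Hypothetical records: (month_index, units_sold)
sales = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]


def describe(records):
    """Descriptive mining: summarize what the existing data look like."""
    values = [y for _, y in records]
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def fit_line(records):
    """Predictive mining: fit y = a * x + b by ordinary least squares."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in records)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def predict(model, x):
    """Apply the fitted line to an unseen input."""
    a, b = model
    return a * x + b


if __name__ == "__main__":
    print(describe(sales))
    print(predict(fit_line(sales), 5))
```

The descriptive function only reports on data already collected, while the predictive functions build a model from that data and apply it to a month that has not yet been observed.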
6
Data Mining Techniques
To extract the available and suitable data from a dataset, parallel data mining tools have been developed to solve many problems using techniques such as artificial neural networks, decision trees, rule induction, genetic algorithms, nearest neighbor, and many more. Some of these techniques are described briefly in this section.
Artificial Neural Network
Artificial neural networks are a relatively new technology in computing, inspired by the workings of the human brain. A neural network consists of a large number of processing units (neurons) that are interconnected with each other so that they can handle certain types of problems [20]. Neural networks are characterized by their ability to extract and predict information from complex or imprecise input. They can also extract patterns and detect trends that are too complex to be observed by a human or by other computer techniques [21].
Decision Trees
Decision trees are tree-shaped structures that represent sets of decisions. They are commonly used in operations research, specifically in decision analysis. A decision tree consists of three main types of nodes: decision nodes, represented as squares; chance nodes, represented as circles; and end nodes, represented as triangles.
Genetic Algorithms (GA)
Genetic algorithms are artificial intelligence algorithms that use techniques modeled on natural evolution to solve optimization and search problems; they are an attractive paradigm for improving performance in information retrieval systems. Genetic algorithms are used in several application areas, such as bioinformatics, airline revenue management, artificial creativity, clustering, biology and chemistry, electrical circuit design, financial mathematics, software engineering, and many more.
There are also other data mining techniques, such as Nearest Neighbor and Rule Induction. Nearest Neighbor, sometimes called k-Nearest Neighbor (KNN), is a classification technique that classifies each record according to the records most similar to it in a historical database. Rule Induction extracts useful if-then rules from a dataset based on statistical significance [22].
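As an illustration of the Nearest Neighbor idea, the following minimal sketch classifies a new record by majority vote among the k most similar records in a small historical dataset. Several choices here are assumptions not made in the text: similarity is measured by Euclidean distance, records are numeric feature tuples, and the function name `knn_classify` and the sample data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest records.

    history: list of (features, label) pairs, features being numeric tuples.
    query:   numeric tuple with the same number of features.
    """
    if k < 1 or k > len(history):
        raise ValueError("k must be between 1 and the number of records")
    # Sort historical records by Euclidean distance to the query.
    nearest = sorted(history, key=lambda record: math.dist(record[0], query))[:k]
    # Count the labels of the k closest records and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    history = [
        ((1.0, 1.0), "low"),
        ((1.2, 0.8), "low"),
        ((0.9, 1.1), "low"),
        ((5.0, 5.0), "high"),
        ((5.5, 4.5), "high"),
    ]
    print(knn_classify(history, (1.0, 0.9), k=3))
```

With an odd k and two classes, ties cannot occur; for more classes a tie-breaking rule would have to be chosen.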
7
Big Data Applications and Notable Uses
Many big data applications with notable uses exist in the current information age; they cover the extraction and analysis of data from many data resources. This section briefly introduces these applications and their notable uses.
7.1
Companies
Most big companies use big data applications. Big data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behavior of their customers and potential customers. Big data mining techniques can help companies promote their marketing programs and set prices more vigorously, in order to retain existing customers and attract new ones. An example of a company that already deals with big data and new technologies such as IoT is Infobright [23], which provides software and services for fast analysis of machine-generated data through high-quality, complex queries. One software-as-a-service (SaaS) platform described in the Infobright case study [23] is REZ-1. REZ-1 provides solutions that allow clients to visualize the distribution of their equipment, connect to the intermodal marketplace, and execute efficient and accurate financial services, which helps them get the most out of their assets. REZ-1 provides tailored, web-based solutions in several product areas: tracking, which gives real-time visibility and location for drive movements; inventory management of assets for complex business at any location; and financial management. Several company services and application areas are listed below:
7.1.1
Credit Company
Credit companies use big data mining to detect fraud by identifying behavior that deviates from the typical behavior of a credit card holder, to analyze the reliability of customer accounts, and to run cross-selling programs. An example of financial management software is the REZ-1 software from the Infobright case study [23], which supports billing rules, invoices, disputes, and other business services.
7.1.2
Health Care and Medicine Company
Big data has created a link between the life sciences and medicine: it helps to predict and expedite interactions among patients, doctors, and pharmaceutical companies, and supports diagnosis, the choice of treatment modalities, and prediction of the outcome of surgery [15, 24]. In addition, many other types of companies, such as airlines and insurance companies, also use big data techniques.
7.2
Financial Banking
Big data techniques help banks build analytical techniques, especially predictive ones, to understand their customers’ behavior and serve each category of customer appropriately. Several big data tools are available for business applications, especially in banking [25]; most of them are based on classification and prediction techniques and support intelligent systems known as Business Intelligence (BI) systems [26]. Banking applications [15, 27] use big data mining technology to perform several common tasks, such as identifying credit card fraud by analyzing historical transactions, setting marketing policy, predicting customer changes, analyzing credit risk, and acquiring customers.
7.3
Telecommunications
Nowadays, telecommunication companies use big data mining techniques for the analysis of call records, pricing, failure analysis, attracting customers, prediction of funds, and identification of customer loyalty. One of the most modern tools in telecommunication companies is mobile user data mining, as mobile devices are fast becoming the most important means of communication in our work and life. Mobile user data mining aims to analyze and predict the behavior of mobile users from the collected data; one of its advantages is that the mobile data are based on real user behavior [28].
7.4
Marketing and Retail
The big data marketplace is one of the leading trends in business big data applications. Today, retailers gather detailed information on each individual purchase using credit cards and store-brand computerized control systems. A typical problem in the retail sector that can be solved with data mining algorithms [29] is shopping cart analysis (similarity analysis), which is designed to identify products that customers tend to buy together. Knowledge of the shopping cart is needed to improve advertising, to develop a strategy for stocking goods, and to plan the layout of goods on the sales floor.
7.5
Industrial
Data mining tools include several applications that automatically control industrial production management, such as quality control, logistics, and process optimization [30]. Beyond the applications mentioned above, big data also has an impact on many other areas, such as science and engineering, text and web data mining, and many more.
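The shopping cart analysis mentioned in Section 7.4 (identifying products that customers tend to buy together) can be sketched as a count of co-occurring product pairs. This is a minimal illustration, not a full association-rule algorithm: the function name `frequent_pairs`, the `min_support` threshold, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support=0.5):
    """Return product pairs bought together in at least `min_support`
    of all baskets, mapped to their support (fraction of baskets)."""
    if not baskets:
        return {}
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so that each pair has one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "butter", "milk"],
        ["butter", "jam"],
        ["bread", "milk", "jam"],
    ]
    print(frequent_pairs(baskets, min_support=0.5))
```

Pairs with high support are candidates for joint promotion or adjacent shelf placement, which corresponds to the advertising and layout decisions described in that section.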
8
CONCLUSION and Future Work
In this survey of the current state of big data and data mining, we gave a short history of big data along with its definitions, characteristics, and tools. We discussed traditional small data and the demand for the value of big data in scientific research. We also briefly presented the data mining types and techniques that support searching, extracting, and analyzing different data types. Finally, we introduced the main big data opportunities, which serve a large number of application areas such as finance, telecommunications, companies, science and engineering, and industry. We conclude that big data, which now spans the world, is the master key for any new digital development, and that parallel algorithms are a suitable solution for big data mining. In the near future, the next generation of big data will involve data fusion and data binding, which integrate services and data from multiple resources with remote access, processing, and execution, in order to provide better analysis and decision making.
9
Acknowledgment
This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme and by the SGS in VSB - Technical University of Ostrava, Czech Republic, under the grant No. SP2015/146.
References
1. Kudyba, S. (2014). Big Data, Mining, and Analytics: Components of Strategic Decision Making. CRC Press.
2. Schroeder, R., & Cowls, J. (2014). Big Data, Ethics, and the Social Implications of Knowledge Production.
3. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity [Kindle edition]. McKinsey Global Institute. Retrieved June 11, 2012.
4. Gartner IT Glossary, http://www.gartner.com/it-glossary/big-data/ (last seen 05 April 2015).
5. The Four V’s of Big Data, IBM, http://www.ibmbigdatahub.com/infographic/four-vs-bigdata (last seen 05 April 2015).
6. Elorie Knilans, “The 5 V’s of Big Data”, Avnet Advantage: The Blog, Solution-Focused Insight for Growth-Minded VARs. http://blogging.avnet.com/ts/advantage/2014/07/the-5vs-of-big-data/#comment-474 (last seen 05 April 2015).
7. Gupta, R. (2014). Journey from Data Mining to Web Mining to Big Data. arXiv preprint arXiv:1404.4140.
8. Gubbi, J., Buyya, R., Marusic, S., & Palaniswami, M. (2013). Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7), 1645-1660.
9. Domingo, M. C. (2012). An overview of the Internet of Things for people with disabilities. Journal of Network and Computer Applications, 35(2), 584-596.
10. Whitmore, A., Agarwal, A., & Da Xu, L. (2014). The Internet of Things: A survey of topics and trends. Information Systems Frontiers, 1-14.
11. Wu, X., Zhu, X., Wu, G. Q., & Ding, W. (2014). Data mining with big data. Knowledge and Data Engineering, IEEE Transactions on, 26(1), 97-107.
12. Barbierato, E., Gribaudo, M., & Iacono, M. (2014). Performance evaluation of NoSQL big-data applications using multi-formalism models. Future Generation Computer Systems, 37, 345-353.
13. Lee, K. M., Park, S. J., & Lee, J. H. (2014). Soft Computing in Big Data Processing.
14. Koch, C. (2013). Compilation and synthesis in big data analytics. In Big Data (pp. 6-6). Springer Berlin Heidelberg.
15. Srinivasa, S., & Bhatnagar, V. (Eds.). (2012). Big Data Analytics: First International Conference, BDA 2012, New Delhi, India, December 24-26, 2012: Proceedings (Vol. 7678). Springer.
16. Verzani, J. (2014). Using R for introductory statistics. CRC Press.
17. Jain, N., & Srivastava, V. (2013). Data Mining Techniques: A Survey Paper. IJRET: International Journal of Research in Engineering and Technology.
18. Saed Sayad (2012). Data Mining Map, An Introduction to Data Mining, http://www.saedsayad.com/ (last seen 05 April 2015).
19. Zaki, M. J., & Meira Jr, W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press.
20. Sanjesh Ghore. (2014). Data Mining used of Neural Networks Approach. Department, CSE, Govt. Engg. College Bilaspur, Chhattisgarh, India. ISSN 2348-7968.
21. Singh, Y., & Chauhan, A. S. (2009). Neural networks in data mining. Journal of Theoretical and Applied Information Technology, 5(6), 36-42.
22. Lahoti, A. A., & Ramteke, P. L. (2014). Data Mining Technique its Needs and Using Applications. IJCSMC, Vol. 3, Issue 4, April 2014, pp. 572-579.
23. Infobright, Data Analysis Institute, https://www.infobright.com/index.php/case-study/rez1-ad-hoc-reporting-reduced/#.VE5MOiLF98F (last seen 05 April 2015).
24. Wang, Y., Kung, L., Ting, C., & Byrd, T. A. (2015). Beyond a Technical Perspective: Understanding Big Data Capabilities in Health Care. Proceedings of 48th Annual Hawaii International Conference on System Sciences (HICSS), Kauai, Hawaii.
25. Rajendra Akerkar. (2014). Big Data Computing. Chapman & Hall Book, CRC Press. Western Norway Research Institute, Sogndal.
26. Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561-2573.
27. Saraswathi, K., & Ganesh Babu, V. (2015). A survey on data mining trends, applications and techniques. History, 30(135), 383-389.
28. Du, R., Huang, J., Huang, Z., Wang, H., & Zhong, N. (2014). A System to Generate Mobile Data Based on Real User Behavior. In Web Information Systems Engineering–WISE 2013 Workshops (pp. 48-61). Springer Berlin Heidelberg.
29. Feinleib, D. (2014). Doing a Big Data Project. In Big Data Bootcamp (pp. 103-123). Apress.
30. Johnston, W. J. (2014). The future of business and industrial marketing and needed research. Journal of Business Market Management, 7(1), 296-300.