Machine Learning and Big Data Processing: A Technological Perspective and Review

Roheet Bhatnagar
Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
[email protected]

Abstract. This paper discusses the role of Machine Learning (ML) based algorithms and methods in Big Data Processing & Analytics (BDA). ML and BDA are both evolving fields of computing, and developments in each complement the other. The ever-changing data landscape of the modern digital world has led to new data processing frameworks designed to extract unprecedented, meaningful insights. This paper presents a detailed review of the latest developments in ML algorithms for Big Data Processing. A later section discusses the key challenges associated with applying ML based approaches. ML based Big Data Processing has gained popularity, and new methods and approaches for efficient data processing are emerging rapidly in order to discover interesting patterns for decision making. With the surge of data from newer sources, the heterogeneous nature of data, and uncertain & unstructured data, the so-called Big Data with all its characteristics (the 5 Vs), there is an ever-increasing need for approaches that aid in modelling and processing these data and that automate data processing. These new processing requirements have given a big boost to the development of new ML based methods for managing & processing such data. The paper will be useful to scholars researching the interesting & challenging domain of ML and Big Data Processing.

Keywords: Machine Learning · Big Data Processing & Analytics · Decision making

1 Introduction
The world is witnessing tremendous technological advancement and an ever-increasing need to understand data, because today ‘Data is Power and Data is Money’. We are living in an era in which an unprecedented amount of data is being generated from previously unheard-of and unseen sources. Technology has been developed to capture, manage and process these data, but there are still many challenges and issues which need to be tackled. Much research is under way in this direction to better understand Big Data and draw meaningful insights from it. Today every field of study, be it Basic Sciences, Applied Sciences, Engineering, Social Sciences, Bio-Medical Sciences and so on, deals with Big Data. All of these fields work with large-scale datasets [1], and a lot of work is being carried out to better harness and process Big Data using domains like Machine Learning (ML), which holds tremendous potential for handling modern data challenges. According to one 2011 study [2], digital information had grown nine times in volume in just 5 years, and its amount in the world was projected to reach 35 trillion gigabytes by 2020 [3].

This paper discusses issues related to these huge amounts of data, their processing and analytics, the current research focus and future trends. It also discusses the use of machine learning approaches for Big Data Processing and highlights the current scenario in the domain from different perspectives.

The paper is organized as follows. Section 2 is concerned with the advent of Big Data and recent developments in the Big Data Processing & Analytics domain. Section 3 starts with a description of the evolution of Machine Learning and assesses traditional machine learning techniques and their applications, followed by advanced learning methods that are a direct outcome of recent research on performing Big Data Processing efficiently. Issues, challenges and opportunities associated with Big Data Processing, and the solutions proposed by different researchers, are discussed in Sect. 4. Section 5 is concerned with future issues and challenges which need to be worked upon to gain new meaningful insights.

© Springer International Publishing AG, part of Springer Nature 2018. A. E. Hassanien et al. (Eds.): AMLTA 2018, AISC 723, pp. 468–478, 2018. https://doi.org/10.1007/978-3-319-74690-6_46
2 Big Data - An Introduction
“Big Data” refers to a collection of tools, techniques and technologies for working with data productively, at any scale. Increased storage capacity and advanced storage technologies, the increased processing capacity of modern computers, and the availability of large-scale data have together driven developments in the Big Data Processing field. Modern hardware and software technologies can manage, manipulate, process and analyze huge amounts of data as never before. The explosion of the Internet, social media, devices and apps is creating a tsunami of data. Extremely large data sets can be collected and analyzed to reveal patterns, trends and associations related to human behavior and interactions. Big data is being used to better understand consumer habits, target marketing campaigns, improve operational efficiency, lower costs, and reduce risk. International Data Corporation (IDC), a global provider of market intelligence and information technology advisory services, estimates that the global big data and analytics market will surge in times to come [4]. The challenge for businesses is how to make the best use of this wealth of information. Some experts break big data down into three subcategories: (i) Smart data; (ii) Identity data; and (iii) People data [5]. Big data sets are so large that traditional processing methods are often inadequate [6]. Gartner’s 2014 Hype Cycle includes Big Data as a technology of the future [7–9].
Today’s Big Data may not be big tomorrow: data is generated continuously, and this flood of data needs to be harnessed & processed to gain new insights. As such, traditional data processing tools which do not scale to big data will eventually become obsolete. Organizations everywhere are processing Big Data and trying to reap its benefits using various Big Data processing frameworks. Apache Hadoop and Spark are among the most popular and best-known frameworks, while others are more niche in their usage but have still managed to carve out respectable market shares and reputations [9]. Generally speaking, these frameworks can be categorized as Proprietary Frameworks and Open Source Frameworks, and both types are popular in industry. Hadoop, Spark, Flink, Storm, and Samza are some of the popular Open Source Big Data processing frameworks.
3 Machine Learning - A Brief Introduction
This section discusses the concepts of machine learning, its evolution, and the applications of different ML techniques, and finally the advanced Machine Learning techniques proposed in the recent past. ML based problem solving is very much required in the field of Big Data Processing.

3.1 Machine Learning Techniques - Classification and Use
The concept of Machine Learning is not new in the field of computing; however, owing to the ever-changing requirements of today’s world, it has re-emerged in an altogether new ‘Avatar’. ML based solution strategies are now widely discussed for almost any given problem set. ML is a subset of Artificial Intelligence in which computer algorithms are used to learn autonomously from data and information. With the rise of the internet, a lot of digital information is being created, which means there is more data available for machines to analyse and ‘learn’ from [10]. As a result, we see a resurgence of Machine Learning. Today, machine learning algorithms enable computers to communicate with humans, autonomously drive cars, write and publish sport match reports, and find terrorist suspects. Machine learning is one of the fastest-growing fields in computer science [11]. Classification [12,13], regression [14], topic modelling [15,16], time series analysis [16], cluster analysis [12,16,17], association rules [14,16], collaborative filtering [13,18,19], and dimensionality reduction [20,21] are some of the popular machine learning techniques. These are used to perform analytics and predict future trends based on the existing patterns and correlations among data in a given dataset. Agneeswaran [22] proposed a maturity model for describing advanced analytics, which also divides analytical tools into three generations of machine learning as follows [23]:
– 1st Generation Machine Learning (1GML) requires the data workload to fit into the memory of a single machine. Such tools are restricted to vertical scaling, which is a drawback when considering Big Data. Tools in this group were usually developed before Hadoop and are referred to as traditional analytical tools (R, RapidMiner, KNIME, SAS and WEKA are some examples of 1GML tools).
– 2nd Generation Machine Learning (2GML) enhances 1GML with capabilities for distributed processing across Hadoop clusters. In contrast to 1GML, data remains at its location while the code execution is divided and processed on each required data node in parallel (Mahout (MapReduce) is an example).
– 3rd Generation Machine Learning (3GML) enhances 2GML with capabilities to efficiently perform distributed processing of iterative algorithms. This class is referred to as beyond Hadoop (Mahout (Spark/H2O/Flink), MLlib, H2O ML, Flink-ML, SAMOA and MADlib are some examples).

ML has evolved tremendously, from the classical Turing Test proposed by Alan Turing in 1950 to the AlphaGo algorithm by Google DeepMind in 2016 [24]. Qiu et al. [25] compare the three subdomains of ML from different perspectives and outline the ML technologies for data processing (Table 1).

Table 1. Comparison of machine learning technologies [25].

| Learning types | Data processing tasks | Distinction norm | Learning algorithms | References |
|---|---|---|---|---|
| Supervised learning | Classification/regression/estimation | Computational classifiers | Support vector machine | [26] |
| Supervised learning | Classification/regression/estimation | Statistical classifiers | Naive Bayes | [27] |
| Supervised learning | Classification/regression/estimation | Statistical classifiers | Hidden Markov model | [28] |
| Supervised learning | Classification/regression/estimation | Statistical classifiers | Bayesian networks | [29] |
| Supervised learning | Classification/regression/estimation | Connectionist classifiers | Neural networks | [30] |
| Unsupervised learning | Clustering/prediction | Parametric | K-means | [31] |
| Unsupervised learning | Clustering/prediction | Parametric | Gaussian mixture model | [32] |
| Unsupervised learning | Clustering/prediction | Nonparametric | Dirichlet process mixture model | [33] |
| Unsupervised learning | Clustering/prediction | Nonparametric | X-means | [34] |
| Reinforcement learning | Decision making | Model-free | Q-learning | [33] |
| Reinforcement learning | Decision making | Model-free | R-learning | [33] |
| Reinforcement learning | Decision making | Model-based | TD learning | [34] |
| Reinforcement learning | Decision making | Model-based | Sarsa learning | [35] |
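K-means, listed in Table 1, is a typical iterative algorithm of the kind that 3GML tools aim to distribute efficiently. The following is a minimal single-machine sketch in Python (that is, in the 1GML sense), assuming small 2-D points and caller-supplied starting centroids. The function name `kmeans` and the sample data are illustrative assumptions, not taken from the reviewed works.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Every iteration rescans all points, which is why repeated passes over data that does not fit in memory become the bottleneck that distributed, iteration-aware platforms try to remove.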
Supervised learning, unsupervised learning, and reinforcement learning are the three subdomains of Machine Learning [36]. Many developments in theory, mechanisms and application services have been proposed for dealing with data tasks [37–39] in these subdomains. Spam Detection, Credit Card Fraud Detection, Digit Recognition, Speech Understanding, Face Detection, Shape Detection, Product Recommendation, Medical Diagnosis, Stock Trading and Customer Segmentation are some of the key applications of Machine Learning [33]. Big Data Processing frameworks like Apache Spark include Machine Learning libraries and components for applying ML to Big Data. Apache Spark is a general data processing framework, and its various components are used by researchers across the globe for specific purposes [34]. MLlib is Spark’s machine learning (ML) library; its goal is to make practical machine learning scalable and easy. Similarly, Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
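To make the supervised case concrete, spam detection can be sketched with a multinomial Naive Bayes classifier using add-one (Laplace) smoothing. This is a plain-Python illustration only: the training messages, labels and function names are made up for the example, and at Big Data scale a library such as MLlib would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior score for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled messages, for illustration only.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting tomorrow", "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "win cheap money"))
print(predict(model, "agenda for the meeting"))
```

Training only accumulates counts, so the same computation can in principle be split across data partitions and the counts summed afterwards, which is one reason Naive Bayes is a common candidate for distributed ML libraries.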
4 Big Data Processing
Big Data Processing is a focus area of research, and many frameworks & techniques have been proposed in the recent past by different researchers. Big Data has become important as many organizations, both public and private, have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology [35]. Big Data Analytics is helping organizations to improve business efficiency, but there are many challenges & issues associated with Big Data Processing & Analytics for each of the 5 Vs (Volume, Velocity, Variety, Veracity, and Value) [25].

4.1 Issues, Challenges and Opportunities in Big Data Processing Using Machine Learning
Business needs are changing, and ever-increasing & diversified data poses new challenges to researchers. Big Data Processing has gained prominence over the past few years as the world has realised the potential of discovering meaningful insights from unstructured data. The same is true of Machine Learning algorithms, which have been pushed to the forefront and are aiding in making more timely & accurate predictions. ML algorithms are used to realize the value of Big Data and to process larger data volumes at higher velocity than ever before. The field of ML continues to develop, but model scalability and distributed computing [40] remain key challenges for ML implementations in Big Data Processing.
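As a rough illustration of the distributed computing challenge, the data-parallel pattern used by MapReduce-style (2GML) tools can be sketched on a single machine: each partition is summarised independently, and the partial summaries are then merged. In this sketch a thread pool merely stands in for cluster nodes, and the partitioning and function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(partition):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(partition), sum(partition), sum(x * x for x in partition))


def combine(a, b):
    """Reduce step: merge two partial summaries (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(partitions):
    # Each partition is processed independently; only small summaries are merged.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    count, total, total_sq = reduce(combine, partials)
    mean = total / count
    variance = total_sq / count - mean * mean  # population variance
    return mean, variance


# Hypothetical data already split into three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, variance = parallel_mean_variance(partitions)
print(mean, variance)
```

The design point is that only the three-number summaries travel between workers, never the raw records. Algorithms whose updates cannot be expressed as such mergeable summaries are the ones for which scalability is hardest to achieve.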
A typical ML system usually consists of a Data Pre-processing unit, a Model Builder & Evaluation unit and an Output unit. Data pre-processing is an important phase and poses new challenges, issues and opportunities to researchers & practitioners in the Big Data era. During pre-processing the raw data is transformed into a representation that can support effective ML applications [41]. Traditional data pre-processing and preparation relies on human intervention and is costly and error-prone, whereas applying ML algorithms to Big Data creates opportunities to reduce the reliance on human monitoring and supervision, as ML algorithms learn on their own from massive, diverse and in-motion data. Zhou et al. [40] and others have provided an excellent review; the following discusses each pre-processing issue along with its challenges & the solutions proposed to mitigate the associated risks [40].

a. Redundancy in Data: Also known as data duplication, it leads to inconsistent data, which can be detrimental to an ML based system. Techniques exist for identifying duplicates in a given data set [42], but these traditional methods are not effective for Big Data. Techniques such as optimized Dynamic Time Warping search have been reported to be much faster than state-of-the-art Euclidean distance search algorithms [43,44].
b. Noisy Data: Missing or incorrect values are one of the primary sources of noise in data, and they may severely hamper the outcome of analytics applied to a noisy data set. Traditional mechanisms for removing noise fail in Big Data Processing because they do not scale, and noisy data cannot simply be deleted, as some very interesting insights may be part of them. Efforts are being made to increase the scalability of outlier detection for effectively exploring anomalies in large data sets [45].
c. Heterogeneous Nature of Data: This is the Variety characteristic of Big Data: data are collected from various sources in different formats and are thus essentially heterogeneous in nature. These heterogeneous data in different formats, e.g. unstructured, text, audio and video data formats [46], pose challenges to ML algorithms vis-à-vis their learning rate. We cannot treat all the features of a data set as equally important and concatenate them into one, as this will not provide an optimal learning outcome or optimal performance. Big Data is seen as an opportunity to learn from multiple views in parallel and then learn the importance of each feature view w.r.t. the task to be accomplished. Such an approach is robust to data outliers and addresses optimization and convergence issues [47]. Heterogeneous mixture data, i.e. data collected and stored according to different patterns or rules, can be challenging in the analysis of large-scale data. A solution for such data has been proposed by the authors of [36], who describe ‘heterogeneous mixture learning’, an advanced analysis technology developed by NEC.
d. Discretization of Data: This is the process of translating quantitative data into qualitative data, resulting in a non-overlapping division of a continuous domain. Decision Trees and Naive Bayes are examples of ML algorithms that, in their basic forms, work with discrete attributes. Attribute discretization categorizes the data in a way that is effective for the learning task. However, such traditional approaches are not efficient when dealing with Big Data. One solution is to parallelize standard discretization methods by developing a distributed version of the Entropy Minimization Discretizer, based on the Minimum Description Length Principle, on big data platforms, boosting both performance and accuracy [48]. Another is to first sort the data by the values of the numerical attributes and then split them into fragments according to the original class attribute [49].
e. Data Labelling: Annotations are important for understanding data, but the annotation process becomes tedious as data grows in size and dimension. Alternative methods have been proposed for data labelling when dealing with Big Data, e.g. online crowd-generated repositories, which can serve as a source of free annotated training data [50]. Probabilistic program induction is another approach, which addresses human-level concept learning. User-specific context is another issue that must be addressed properly; otherwise performance will suffer.
f. Imbalance of Data: Traditional methods such as stratified random sampling can be time-consuming and cannot efficiently support user-specified data sets for value-based sampling. They fail to cope with Big Data; the proposed solution is parallel data sampling based on multiple distributed index files.
g. Feature Representation and Feature Selection: The way the data is represented and the way features are selected (prominent feature identification) affect the performance of ML algorithms [41]. Current algorithms for these purposes are not sufficiently equipped to handle Big Data. Different solutions, such as distributed feature selection, low-rank matrix approximation, representation learning, an adaptive feature scaling scheme for ultra-high-dimensional feature selection, a spectral graph theory based framework, and fuzzy clustering, have been proposed over the years, and this is still an active area of research. Deep Neural Network based auto-encoding has proven effective in learning video, audio and textual features.

Even prior to the advent of Big Data, developing scalable ML algorithms for handling large datasets was an active research area, with researchers proposing newer algorithms over time. It has now gained new impetus & significance as a result of the ever-increasing challenges posed by Big Data. Algorithm scalability was mainly aimed at improving efficiency in terms of time complexity and space complexity. State-of-the-art ML algorithms for Big Data Processing focus on parallelism by exploiting data geometry in the input and/or algorithm/model space. Parallelism may be further classified into data parallelism (e.g. MapReduce, distributed graph) and model/parameter parallelism (e.g. multi-threading, MPI/OpenMP). Non-parallel ML algorithms aim to incorporate much faster optimization methods which can deal with big data without any parallelism [40]. Most of the existing work on Big Data Processing using ML focuses on the first three Vs, namely Volume, Velocity and Variety, but there is a need to focus on the other Vs as well, viz. Veracity and Value.

4.2 Trends and Open Issues in Big Data Processing Using Machine Learning
Now that ML based methods and their applications are understood to be an integral part of Big Data Processing, this is a hot research area with many new developments. Although research in ML based application development has achieved significant results in deriving meaningful insights from Big Data, much more is yet to be accomplished in this important domain. Qiu et al. [25] describe the following future trends in ML based applications for Big Data Processing from different perspectives.

1. Data Meaning Perspective: how to make ML more intelligent so as to achieve context-awareness.
2. Pattern Training Perspective: how to avoid overfitting during the process of training patterns.
3. Technique Integration Perspective: how to integrate other related techniques with ML for Big Data Processing. Developing a composite, integrated and seamless platform for Big Data Processing has great research potential.
4. Privacy & Security Perspective: how to ensure security and privacy in Big Data Processing using ML techniques.
5. Realization and Application Perspective: how and where to apply ML research in Big Data to gain optimal results. Applying the developed ML techniques to real-world problems carries huge potential as a research area.
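The overfitting concern raised under the Pattern Training Perspective is commonly detected with a hold-out split: a model that memorises its training data shows near-zero training error but a clearly larger validation error. The sketch below illustrates this with a k-nearest-neighbour regressor on synthetic data. The data-generating rule, the 70/30 split and the function names are assumptions made for illustration, not a method proposed in [25].

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    """Mean squared error of a k-NN model fitted on `train`, evaluated on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


random.seed(0)
# Synthetic data: y = 2x plus Gaussian noise (an assumption for illustration).
samples = [(x, 2 * x + random.gauss(0, 1)) for x in (i / 10 for i in range(100))]
random.shuffle(samples)
train, validation = samples[:70], samples[70:]

for k in (1, 5):
    train_err = mean_squared_error(train, train, k)
    val_err = mean_squared_error(train, validation, k)
    print(f"k={k}: train MSE={train_err:.3f}, validation MSE={val_err:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is exactly zero while the validation error is not. A gap of this kind is the usual signal that a model is fitting noise rather than the underlying pattern.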
5 Conclusion
Big data are now quickly expanding in all science and engineering domains. Learning and gaining newer insights from these massive data brings tremendous opportunities for businesses. Traditional Machine Learning methods are not efficient or scalable enough to meet the demands of high Volume, Velocity, Variety, Veracity and Value (the famous 5 Vs of Big Data); hence ML needs to reinvent itself for big data processing. ML based methods are an inseparable part of Big Data Processing for gathering new unforeseen insights, discovering new knowledge and improving efficiency. The amalgamation of ML and Big Data augurs well for the future of data-driven industry.
This paper aims to present both current practices & future research directions in the domain of Big Data Processing using Machine Learning techniques. It is envisaged that academia and industry will focus on the Veracity & Value aspects in the coming years to improve processing capabilities and trust management and to cover all the important aspects of Big Data. Research scientists, data scientists, analysts and big data practitioners must collaborate towards establishing standards for more efficient Big Data Processing using ML and exploring new domains in future.
References

1. Sandryhaila, A., Moura, J.M.: Big data analysis with signal processing on graphs: representation and processing of massive data sets with irregular structure. IEEE Signal Process. Mag. 31(5), 80–90 (2014)
2. Gantz, J., Reinsel, D.: Extracting value from chaos. Technical report, white paper, International Data Corporation (IDC), sponsored by EMC Corporation (2011)
3. Gantz, J., Reinsel, D.: The Digital Universe Decade - Are You Ready? Basic Books, New York (2010)
4. Press, G.: 6 predictions for the $125 billion big data analytics market in 2015 (2014)
5. The evolution of big data, and where we’re headed | wired. https://www.wired.com/insights/2014/03/evolution-big-data-headed/. Accessed 10 June 2017
6. Inc., T.P.F.S.G.: The evolution of big data. https://content.pncmc.com/live/pnc/corporate/pncideas/articles/CIB ENT PDF 0815-066-196209-CIB FPS BigData rev1.pdf. Accessed 10 June 2017
7. Hype cycle for big data (2014). https://www.gartner.com/doc/2814517/hype-cycle-big-data-. Accessed 10 June 2017
8. Hype cycle - wikipedia. https://en.wikipedia.org/wiki/Hype_cycle. Accessed 10 June 2017
9. Gartner hype cycle for emerging technologies: AI, AR/VR, digital platforms | what’s the big data? https://whatsthebigdata.com/2017/08/16/2017-gartner-hype-cycle-for-emerging-technologies-ai-arvr-digital-platforms/. Accessed 10 June 2017
10. What is the difference between artificial intelligence and machine learning? https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/2/#1f240102483d. Accessed 10 June 2017
11. Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349(6245), 255–260 (2015)
12. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Mag. 17(3), 37 (1996)
13. Ingersoll, G.: Introducing apache mahout. IBM developerWorks Technical Library (2009)
14. Mikut, R., Reischl, M.: Data mining tools. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 1(5), 431–443 (2011)
15. Chen, H., Chiang, R.H., Storey, V.C.: Business intelligence and analytics: from big data to big impact. MIS Q. 36(4), 1165–1188 (2012)
16. Dietrich, D., Heller, B., Yang, B.: Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley, Hoboken (2015)
17. Chopra, A., Madan, S.: Big data: a trouble or a real solution? Int. J. Comput. Sci. Issues 12(2), 221 (2015)
18. Twardowski, B., Ryzko, D.: Multi-agent architecture for real-time big data processing. In: 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 3, pp. 333–337. IEEE (2014)
19. Amatriain, X.: Mining large streams of user data for personalized recommendations. ACM SIGKDD Explor. Newsl. 14(2), 37–48 (2013)
20. Richter, A.N., Khoshgoftaar, T.M., Landset, S., Hasanin, T.: A multi-dimensional comparison of toolkits for machine learning with big data. In: 2015 IEEE International Conference on Information Reuse and Integration (IRI), pp. 1–8. IEEE (2015)
21. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom. Intell. Lab. Syst. 2(1–3), 37–52 (1987)
22. Agneeswaran, V.S., et al.: Big-data-theoretical, engineering and analytics perspective. In: BDA, pp. 8–15. Springer (2012)
23. Lehmann, D., Fekete, D., Vossen, G.: Technology selection for big data and analytical applications. Technical report, Working Papers, ERCIS-European Research Center for Information Systems (2016)
24. A short history of machine learning - every manager should read. https://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/2/#28d56abd6b1b. Accessed 10 June 2017
25. Qiu, J., Wu, Q., Ding, G., Xu, Y., Feng, S.: A survey of machine learning for big data processing. EURASIP J. Adv. Signal Process. 2016(1), 67 (2016)
26. Zheng, J., Shen, F., Fan, H., Zhao, J.: An online incremental learning support vector machine for large-scale data. Neural Comput. Appl. 22(5), 1023–1035 (2013)
27. Mitchell, T.M., et al.: Machine Learning. WCB/McGraw-Hill, USA (1997)
28. Ghosh, C., Cordeiro, C., Agrawal, D.P., Rao, M.B.: Markov chain existence and Hidden Markov models in spectrum sensing. In: 2009 IEEE International Conference on Pervasive Computing and Communications, PerCom 2009, pp. 1–6. IEEE (2009)
29. Yue, K., Fang, Q., Wang, X., Li, J., Liu, W.: A parallel and incremental approach for data-intensive learning of Bayesian networks. IEEE Trans. Cybern. 45(12), 2890–2904 (2015)
30. Dong, X., Li, Y., Wu, C., Cai, Y.: A learner based on neural network for cognitive radio. In: 2010 12th IEEE International Conference on Communication Technology (ICCT), pp. 893–896. IEEE (2010)
31. Safatly, L., Bkassiny, M., Al-Husseini, M., El-Hajj, A.: Cognitive radio transceivers: RF, spectrum sensing, and learning algorithms review. Int. J. Antennas Propag. 2014, 21 (2014)
32. Bkassiny, M., Jayaweera, S.K., Li, Y.: Multidimensional dirichlet process-based non-parametric signal classification for autonomous self-learning cognitive radios. IEEE Trans. Wirel. Commun. 12(11), 5413–5423 (2013)
33. Das, T.K., Gosavi, A., Mahadevan, S., Marchalleck, N.: Solving semi-markov decision problems using average reward reinforcement learning. Manag. Sci. 45(4), 560–574 (1999)
34. Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988)
35. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R., Muharemagic, E.: Deep learning applications and challenges in big data analytics. J. Big Data 2(1), 1–21 (2015)
36. Ryohei, F., Satoshi, M.: The most advanced data mining of the big data era. NEC Tech. J. 7(2), 91–95 (2012)
37. Jones, N.: The learning machines. Nature 505(7482), 146 (2014)
38. Langford, J.: Tutorial on practical prediction theory for classification. J. Mach. Learn. Res. 6(Mar), 273–306 (2005)
39. Bekkerman, R., El-Yaniv, R., Tishby, N., Winter, Y.: Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res. 3(Mar), 1183–1208 (2003)
40. Zhou, L., Pan, S., Wang, J., Vasilakos, A.V.: Machine learning on big data: opportunities and challenges. Neurocomputing 237, 350–361 (2017)
41. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
42. Chen, Q., Zobel, J., Verspoor, K.: Evaluation of a machine learning duplicate detection method for bioinformatics databases. In: Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, pp. 4–12. ACM (2015)
43. Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Addressing big data time series: mining trillions of time series subsequences under dynamic time warping. ACM Trans. Knowl. Discov. Data 7(3), 10 (2013)
44. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M., Herrera, F.: Big data preprocessing: methods and prospects. Big Data Anal. 1(1), 9 (2016)
45. Cao, L., Wei, M., Yang, D., Rundensteiner, E.A.: Online outlier exploration over large datasets. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 89–98. ACM (2015)
46. Gandomi, A., Haider, M.: Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. 35(2), 137–144 (2015)
47. Cai, X., Nie, F., Huang, H.: Multi-view k-means clustering on big data. In: IJCAI, pp. 2598–2604 (2013)
48. Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016)
49. Zhang, Y., Cheung, Y.M.: Discretizing numerical attributes in decision tree for big data analysis. In: 2014 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 1150–1157. IEEE (2014)
50. Nguyen-Dinh, L.V., Rossi, M., Blanke, U., Tröster, G.: Combining crowd-generated media and personal data: semi-supervised learning for context recognition. In: Proceedings of the 1st ACM International Workshop on Personal Data Meets Distributed Multimedia, pp. 35–38. ACM (2013)