2016 International Conference on Computational Science and Computational Intelligence
BIG DATA FOR INDUSTRY 4.0: A CONCEPTUAL FRAMEWORK

Mert Onuralp Gökalp, Kerem Kayabay, Mehmet Ali Akyol, P. Erhan Eren, Altan Koçyiğit
Informatics Institute, Middle East Technical University, Ankara, Turkey
E-mail: {[email protected], [email protected], [email protected], [email protected], [email protected]}

Abstract— Exponential growth in data volume originating from Internet of Things sources and information services drives the industry to develop new models and distributed tools to handle big data. In order to achieve strategic advantages, effective use of these tools and integration of their results into business processes are critical for enterprises. While there is an abundance of tools available in the market, they are underutilized by organizations due to their complexities. Deployment and usage of big data analysis tools require technical expertise which most organizations do not yet possess. Recently, the trend in the IT industry has been towards developing prebuilt libraries and dataflow-based programming models that abstract users from the low-level complexities of these tools. After briefly analyzing trends in the literature and industry, this paper presents a conceptual framework which offers a higher level of abstraction to increase the adoption of big data techniques as part of the Industry 4.0 vision in future enterprises.
Keywords: Industry 4.0; big data; dataflow-based programming languages; machine learning; data mining.

I. INTRODUCTION

Businesses need to process data into timely and valuable information for their decision-making and process optimization activities. Today's competitive business environment forces enterprises to process high-speed data and integrate the resulting information into production processes. For example, the concept of Industry 4.0 is expected to change production in the near future. In this concept, machines in a smart manufacturing plant interact with their environments. Ordinary machines transform into context-aware, conscious, and self-learning devices. This transformation gives these devices the capability to process real-time data to self-diagnose and prevent potential disruptions in the production process. Furthermore, when such a machine is assigned a task, it can self-calibrate and prioritize between tasks to optimize production quality or efficiency.

As approaches like Industry 4.0 gain popularity, the characteristics of the data to be analyzed change. Some processes require high-speed data whose value diminishes over time. Heterogeneous IoT devices and sensors produce unstandardized and unstructured data. The IT industry continuously comes up with new models which can use distributed architectures to process data more quickly and efficiently. However, available analysis methods are insufficient for exploiting high-speed data flowing from various sources due to their low-level complexities and shortcomings [1]. The installation and utilization of existing big data analytics platforms require significant expertise and know-how in the data science and IT domains because of their complex infrastructures and programming models. This may hinder the adoption of big data technologies in the Industry 4.0 domain. Hence, a programming model for big data platforms which provides higher-level abstractions is necessary from the perspective of widespread user adoption.

The latest trends in the big data domain are moving towards providing a level of abstraction over popular data processing platforms [2]. Apache Beam [3] implements its dataflow programming model on multiple runners such as Apache Spark [4] and Apache Flink [5]. Apache SAMOA [6] enables programmers to apply machine learning algorithms to data streams. Applications developed with SAMOA can be executed on Apache Storm [7] [8], Apache S4 [9], and Apache Samza [10].

In this paper, after briefly discussing trends in the data science literature and industry, a visual and dataflow-based architectural framework is proposed to abstract programmers away from the complexities of underlying data processing platforms. This can enable enterprises to easily incorporate data mining and machine learning techniques into their business process monitoring and improvement activities.

II. RELATED WORKS
The dataflow-based programming model, which aims to facilitate the development and orchestration of services, is a commonly used approach in data analysis frameworks. IoT scenarios, in particular, require the coordination of computing resources across the network. Traditional programming tools are typically more complex than visual and dataflow-based programming tools, as they require developers to learn new protocols and APIs, create data processing components, and link them together [11]. In the dataflow-based programming model, applications are modelled as directed graphs of 'black box' nodes that exchange data along connected arcs. Hence, developers do not need to know the internal details of the building blocks comprising the application.

Several dataflow-based programming models have been proposed in the literature. WoTKit and Node-RED, two notable frameworks of this kind, allow users to model their applications via browser-based visual editors. After receiving real-time updates from sensors and other external sources, the WoTKit processor [11] can be used to process input data and respond to changes in the environment. Node-RED [12] is another important service; it is implemented on Node.js, and users can implement their applications in JavaScript. In both the Node-RED and WoTKit application development environments, complex applications can be modelled as directed graphs by dragging and dropping programming blocks onto a flow canvas. ClickScript [13] is yet another dataflow-based programming service for modelling home automation applications visually on a graphical user interface. The visual and dataflow-based programming model enables users to receive data from external sources, including social media and IoT devices, and forward data to external systems such as Twitter and e-mail.
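To make the shared dataflow style concrete, the following minimal sketch expresses an application as a directed graph of black-box transforms using the Apache Beam Java SDK, which the introduction cites as a representative abstraction layer. The input file, the filter predicate, and the CSV layout are illustrative assumptions, and exact method names may differ across Beam versions; the key point is that the same pipeline can be handed to different runners (e.g., Spark or Flink) without changing the application logic.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SensorPipeline {
  public static void main(String[] args) {
    // The runner (DirectRunner, SparkRunner, FlinkRunner, ...) is selected
    // through pipeline options, not in the application code itself.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadReadings", TextIO.read().from("sensor-readings.txt")) // hypothetical input
     .apply("KeepErrors", Filter.by((String line) -> line.contains("ERROR")))
     .apply("ExtractMachineId", MapElements
         .into(TypeDescriptors.strings())
         .via((String line) -> line.split(",")[0]))   // assumes machine id is the first CSV field
     .apply("CountPerMachine", Count.perElement())
     .apply("Format", MapElements
         .into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteResults", TextIO.write().to("error-counts"));

    p.run().waitUntilFinish();
  }
}
```

Each `apply` step is a node in the directed graph; the developer wires nodes together without knowing how any of them is executed on the underlying cluster.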
The term data analysis refers to the utilization of business intelligence and analytics technologies. This corresponds to applying statistical and data mining techniques in organizations to produce additional business value [14]. Various open source and commercial tools for machine learning and data mining have been developed to execute popular algorithms such as classification, clustering, and anomaly detection. Some of these applications support distributed processing across computing nodes to handle big data use cases. There are also tools which provide a visual programming model to users who do not have any know-how in data analytics. Orange [15] is a notable visual data mining and machine learning tool. In the Orange platform, each visual component represents a data analytics algorithm, and these components communicate with each other through data channels. KNIME [16], KEPLER [17], and RapidMiner [18] are other open source data mining tools which support visual programming environments where data mining algorithms are provided as readily available programming elements. Each operator has data input and output ports; elements are linked by dragging and dropping connections between these ports, and the linked elements form a directed graph. These tools are designed for batch data processing, and they are hardly scalable because their performance is limited by a single server's processing capabilities.

MOA [19] and ADAMS [20] are similar to KNIME, KEPLER, and RapidMiner in regards to application design and execution. While the latter are batch processing oriented tools, MOA and ADAMS are focused on data mining and machine learning algorithms that can be applied to stream data. However, none of these systems support distributed execution environments to handle big data effectively.

Mahout [21] is a data mining and machine learning library for Hadoop. The Mahout library provides clustering, regression, classification, and model analysis algorithms. Google, Amazon, Yahoo, and Facebook utilize Mahout in their data mining and machine learning applications. Since Mahout is designed for batch processing applications, it does not support real-time stream processing. Moreover, it does not provide a visual programming model.

III. CONCEPTUAL FRAMEWORK

In order to exploit the potential of big data technologies as part of Industry 4.0, the challenges which hinder the adoption of such technologies should be tackled first. These challenges include handling large amounts of unstructured data coming from IoT devices, expertise barriers, resource management, and delivery of results to appropriate channels. Hence, a framework which can facilitate the development and deployment of big data analytics is necessary. In this section, we explain our conceptual framework architecture, which enables system engineers to model, develop, and deploy their own big data use cases for Industry 4.0 applications, even when they have limited or no experience in big data analytics. We describe the functionalities of the major modules and how these modules can be integrated in a cloud environment. The system architecture is delineated in Figure 1. The framework architecture proposed in this paper is an extended version of the architecture proposed in our previous study [22], in which queries are defined as Groovy scripts. Accordingly, the framework utilizes a dataflow-based visual programming model to facilitate flexible application development.

Figure 1. Architectural Conceptual Framework
The architecture of the proposed conceptual framework consists of the following modules: big data application design, pre-processing of input data streams, distributed infrastructure, and distribution of results.

The Big Data Application Design module allows system engineers to develop their own big data applications with a visual editor. Applications are represented as directed graphs where vertices represent data mining and machine learning algorithms as well as programming constructs, while edges represent data streams which correspond to intermediate results, as shown in Figure 2. The programming nodes consume and produce data in a common standard, so that data from various sources can be handled and nodes can be integrated with one another. Thus, the application logic can be built by simply connecting the programming nodes, without worrying about their internal details and interfaces.
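The paper leaves the internal representation of a visually designed application open. As a rough illustration only, the sketch below models the directed graph with a common record type flowing between black-box nodes; every type and field name here is hypothetical rather than part of the proposed framework.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical common data standard exchanged between programming nodes.
record DataRecord(Map<String, Object> fields) {}

// Every node, whether a filter, a classifier, or a sink, exposes the same
// black-box contract, so the editor can wire arbitrary nodes together.
interface ProgrammingNode extends Function<List<DataRecord>, List<DataRecord>> {}

// A directed edge carries the intermediate results of one node to the next.
record Edge(String fromNode, String toNode) {}

// The visual editor would serialize the canvas into this graph description,
// which the framework then translates into a platform-specific program.
record ApplicationGraph(Map<String, ProgrammingNode> nodes, List<Edge> edges) {}
```

Because every node consumes and produces the same `DataRecord` type, the graph description can be checked and translated without inspecting any node's internals, which is exactly the property the visual editor relies on.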
Figure 2. Programming Model

In this setting, a large number of Data Sources must be integrated into the platform to collect information regarding different aspects of a factory. Due to their heterogeneous nature, these data sources may generate data in disparate formats. Therefore, data variety is an important challenge that can hinder the adoption of big data analytics in the Industry 4.0 domain. Hence, the Preprocessing Input Data Streams module plays a central role in our framework: it converts data into a common format for further processing. This is based on data standardization, which defines a common standard for receiving structured, semi-structured, and unstructured data from a large number of sources.
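A minimal sketch of what such standardization could look like is given below, assuming one registered parser per source type; the class, method, and field names are invented for illustration and are not prescribed by the framework.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-processing step: each heterogeneous source registers a
// parser that maps its native payload into the framework's common format.
public class StreamStandardizer {

  // Converts one raw payload into the common key-value representation.
  public interface SourceParser {
    Map<String, Object> parse(String rawPayload);
  }

  private final Map<String, SourceParser> parsersBySourceType = new HashMap<>();

  public void register(String sourceType, SourceParser parser) {
    parsersBySourceType.put(sourceType, parser);
  }

  // Standardizes an incoming payload; unknown source types are rejected so
  // that downstream programming nodes only ever see the common format.
  public Map<String, Object> standardize(String sourceType, String rawPayload) {
    SourceParser parser = parsersBySourceType.get(sourceType);
    if (parser == null) {
      throw new IllegalArgumentException("No parser for source type: " + sourceType);
    }
    Map<String, Object> record = new HashMap<>(parser.parse(rawPayload));
    record.put("sourceType", sourceType); // keep provenance for later routing
    return record;
  }
}
```

For example, a CSV-emitting temperature sensor and a JSON-emitting PLC would each register their own `SourceParser`, while everything downstream consumes the same map-based record.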
Deployed applications need fast and scalable infrastructures to handle big data use cases effectively. Therefore, the big data platforms are established on a Distributed Infrastructure. User-defined applications are deployed automatically on the distributed infrastructure to handle the unique characteristics of big data. The requirements of big data applications vary according to use cases. For instance, a monitoring application needs to process stream data and produce results in real time, whereas a predictive analytics application needs to deal with bulk data to detect potential production risks in the upcoming weeks or months. There is no "one-size-fits-all" big data solution; each big data platform has its own advantages and disadvantages. Therefore, the proposed framework is intended to support multiple big data platforms such as Storm, Spark, and Flink, so that a suitable platform can be chosen according to the specific characteristics of the application under design. Moreover, by considering the designed application logic and use cases, the framework itself can suggest a suitable big data platform to run the application.
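The platform suggestion step could be as simple as a rule over the application's declared workload profile. The sketch below is one hedged interpretation of that idea; the enum values and the platform mapping are illustrative assumptions, not part of the paper's design.

```java
// Hypothetical rule-of-thumb mapping from an application's workload profile
// to one of the supported execution platforms. A real framework could use
// richer heuristics (latency targets, state size, operator types, ...).
public class PlatformSelector {

  public enum Workload { STREAMING_LOW_LATENCY, STREAMING_STATEFUL, BATCH }

  public enum Platform { STORM, FLINK, SPARK }

  public static Platform suggest(Workload workload) {
    switch (workload) {
      case STREAMING_LOW_LATENCY:
        return Platform.STORM;  // tuple-at-a-time processing, low latency
      case STREAMING_STATEFUL:
        return Platform.FLINK;  // native streaming with strong state support
      case BATCH:
      default:
        return Platform.SPARK;  // batch-oriented, in-memory processing
    }
  }
}
```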
The results of the applications may be forwarded to interested parties in different forms. Each distribution channel is defined as a programming node in the visual editor, so users may select more than one distribution channel to deliver the results. In this way, problems in production can be forwarded to the right staff as notifications. The results can also be used as inputs to actuators, so that manufacturing processes can be controlled and even improved. It is also possible to deliver the results to external entities via web services for data visualization or monitoring purposes.
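Since each distribution channel is itself a programming node, result delivery can be modelled as a fan-out to any number of channel implementations. The interface and class names below are hypothetical, sketched only to show the shape of such a sink stage.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sink-side contract: every distribution channel is just
// another programming node at the end of the application graph.
interface DistributionChannel {
  void deliver(Map<String, Object> result);
}

// Fans one result out to all channels the user dropped onto the canvas,
// e.g. a notification channel for staff, an actuator command channel,
// and a web service channel for external dashboards.
class ResultDispatcher {
  private final List<DistributionChannel> channels;

  ResultDispatcher(List<DistributionChannel> channels) {
    this.channels = channels;
  }

  void dispatch(Map<String, Object> result) {
    for (DistributionChannel channel : channels) {
      channel.deliver(result);
    }
  }
}
```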
IV. CONCLUSION

There is an abundance of tools and application frameworks for processing big data, yet new tools continue to emerge, especially for stream data. These tools are commonly open sourced after being developed by Internet-based companies, including Google, Twitter, LinkedIn, and Yahoo, according to their business requirements. The low-level complexities of data processing platforms make them suitable only for programmers who have knowledge and experience in data science. On the other hand, people who have expertise and deep knowledge only in a specific domain may not be able to use these tools. As a result, real-time data coming from various sources cannot be integrated into the business processes of an enterprise.

Specialized for the big data domain, dataflow-based visual programming models can solve this problem by allowing programmers to iteratively develop new techniques which can utilize real-time data. People in organizations can quickly design and develop small programs to investigate whether there are efficiency or quality issues in production and service processes. We see this approach as an important step towards the Industry 4.0 vision.

In this paper, we propose a conceptual framework which can be utilized in a smart enterprise. Its main components are designed to abstract users away from low-level complexities such as data standardization, platform-specific development, resource management, protocols, and APIs. The framework handles the collection of data from IoT and Web-based data sources, the implementation of big data analytics applications containing machine learning and data mining components, the translation of visually designed programs into platform-specific ones, the management of jobs among processing units, and the delivery of results to people and services. From this perspective, the framework facilitates the integration of big data analytics with business processes by providing an end-to-end approach.
REFERENCES

[1] J. Lee, H. A. Kao, and S. Yang, "Service innovation and smart analytics for Industry 4.0 and big data environment," in Procedia CIRP, 2014, vol. 16, pp. 3–8.
[2] K. Kayabay, M. O. Gökalp, M. A. Akyol, A. Koçyiğit, and P. E. Eren, "Big Data for Future Enterprises: Current State and Trends," in 3rd International Management Information Systems Conference, İzmir, 2016, pp. 298–307.
[3] "Apache Beam." [Online]. Available: http://beam.incubator.apache.org. [Accessed: 10-Nov-2016].
[4] "Apache Spark." [Online]. Available: http://spark.apache.org. [Accessed: 28-Oct-2016].
[5] "Apache Flink." [Online]. Available: http://flink.apache.org. [Accessed: 28-Oct-2016].
[6] "Apache SAMOA." [Online]. Available: https://samoa.incubator.apache.org. [Accessed: 10-Nov-2016].
[7] "Apache Storm." [Online]. Available: http://storm.apache.org. [Accessed: 28-Oct-2016].
[8] A. Toshniwal et al., "Storm@twitter," in Proc. 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '14), 2014, pp. 147–156.
[9] "Apache S4." [Online]. Available: http://incubator.apache.org/s4/. [Accessed: 10-Nov-2016].
[10] "Apache Samza." [Online]. Available: http://samza.apache.org. [Accessed: 10-Nov-2016].
[11] M. Blackstock and R. Lea, "WoTKit," in Proc. Third International Workshop on the Web of Things (WOT '12), 2012, pp. 1–6.
[12] "Node-RED." [Online]. Available: https://nodered.org. [Accessed: 12-Nov-2016].
[13] S. Mayer, N. Inhelder, R. Verborgh, and R. Van de Walle, "User-friendly configuration of smart environments," in 2014 IEEE International Conference on Pervasive Computing and Communication Workshops (PERCOM Workshops), 2014, pp. 163–165.
[14] H. Chen, R. H. L. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact," MIS Quarterly, vol. 36, no. 4, 2012.
[15] J. Demšar, B. Zupan, G. Leban, and T. Curk, "Orange: From Experimental Machine Learning to Interactive Data Mining," in Knowledge Discovery in Databases: PKDD 2004, 2004, pp. 537–539.
[16] M. R. Berthold et al., "KNIME - the Konstanz Information Miner: version 2.0 and beyond," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 26–31, 2009.
[17] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock, "Kepler: an extensible system for design and execution of scientific workflows," in Proc. 16th Int. Conf. on Scientific and Statistical Database Management, 2004, pp. 423–424.
[18] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "YALE: Rapid prototyping for complex data mining tasks," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2006, pp. 935–940.
[19] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive Online Analysis," J. Mach. Learn. Res., vol. 11, pp. 1601–1604, 2010.
[20] P. Reutemann and J. Vanschoren, "Scientific Workflow Management with ADAMS," in Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2012), 2012, pp. 833–837.
[21] "Apache Mahout." [Online]. Available: https://mahout.apache.org. [Accessed: 12-Nov-2016].
[22] M. O. Gökalp, A. Koçyiğit, and P. E. Eren, "A Cloud Based Architecture for Distributed Real Time Processing of Continuous Queries," in Proc. 41st Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2015), 2015, pp. 459–462.