Anomaly Detection and Performance Evaluation of Mobile Agricultural Machines by Analysis of Big Data

T. Steckel, CLAAS E-Systems, Gütersloh; A. Bernardi, Y. Gu, German Center for Artificial Intelligence, Kaiserslautern; S. Windmann, A. Maier, O. Niggemann, Fraunhofer IOSB-INA, Lemgo

Abstract

Thanks to CAN bus architectures, electronic control units and wireless communication, today’s agricultural machines provide detailed process data that can be used to improve productivity in farms and contractors’ enterprises. We explain current efforts within the research project AGATA¹ to set up an appropriate infrastructure, basic principles for analysis, and implementation in real harvest scenarios.

¹ AGATA: “Analyse großer Datenmengen in Verarbeitungsprozessen” (Engl.: Analysis of large amounts of data in manufacturing processes), within the framework “Management und Analyse großer Datenmengen” (Engl.: Management and analysis of large quantities of data); funded by the German Federal Ministry of Education and Research; period 2014-2017; Fkz (project ID): 01IS14008x.

Big Data in Agriculture

Modern agricultural machines are equipped with extensive CAN bus architectures and numerous sensors and actuators, and can thus be seen as mobile data processing units. The original intention was to improve local control. Later, attention shifted towards linking machines with farm management applications. Currently, machine-to-machine communication is becoming an additional topic in both implementation and standardisation. Merging machines and their work environment into one overall system yields something that can be called Big Data. As an example, today’s combine harvesters can provide more than 3,000 different data attributes. They all work in environments that are increasingly well described in both time and space. Agriculture thus meets the 3+1 V’s that define Big Data: Volume, Velocity, Variety and Veracity.

Heuristic approaches are a means for successful farming and will probably remain so; data-centric approaches are an additional key feature for quantitatively based farming, e.g. for optimizing machine settings, logistic processes and maintenance. Overall targets are the reduction of unit costs, on-schedule processes, and the reduction of negative environmental impacts such as soil compaction and CO2 emissions. Big Data and data mining provide a way of introducing machine learning. When applying data mining technologies, the established process model CRISP-DM [1] can serve as a guideline.

The background of this paper is the recently started project AGATA, in which agriculture delivers a use case for data mining. The main driver for data mining is the expectation of reducing the gap between installed and achieved performance in everyday situations. Typical utilization is in the range of 50 percent [2]. An increase to 85 percent is seen as realistic [3]. Applying data mining technologies requires the deployment of an appropriate infrastructure as well as the identification and adaptation of algorithms.

Infrastructure for dealing with Big Data

Big Data processing environments have been operationalized in industry [4]. The primary motivation in AGATA is improving machine performance; other possible applications are advancements in predictive maintenance and in product and process engineering. To make use of such environments for agricultural applications, which means integrating mobile objects and the domain-specific environment, complementary modifications are required. Thus, a major activity within this research project is the definition and provision of such an environment. A more advanced kind of infrastructure is needed, since the existing ones can only deal with small amounts of data and few attributes.

Fig. 1: Infrastructure for machine analytics

The core of this environment (see Fig. 1) consists of machines equipped with embedded data units for acquisition, preprocessing and transmission of process data, and a powerful backend for finding patterns in the data delivered by the machines and other parts of the infrastructure. The embedded system is configured so that both the data for subsequent analysis and the rules for their acquisition are specified (e.g. “record position if driving direction deviates more than 5 degrees”). For the 2015 harvest season, 10 combine harvesters are equipped with this technology; approximately 3,000 different attributes are provided. CAN, GPS and serial data are sent via a Kafka client to the backend. On the backend side, the Hortonworks distribution is installed to provide a Hadoop-based cluster. Important components are a Kafka server for secure and reliable reception of machine data, a Storm server for fast processing and storage of data, HDFS as the distributed file system, and REST and Hive for data access by analytic tools such as R and RapidMiner. The physical infrastructure consists of 6 servers (5 data nodes and 1 name node).
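
An acquisition rule of the kind quoted above (“record position if driving direction deviates more than 5 degrees”) could be sketched as follows. This is only an illustration, not the configuration format of the embedded system: the `Fix` record, the function names and the choice to compare each reading against the last recorded heading (rather than the previous sample) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Fix:
    """One GPS reading (hypothetical record layout for this sketch)."""
    t: float        # timestamp in seconds
    lat: float      # latitude in degrees
    lon: float      # longitude in degrees
    heading: float  # driving direction in degrees, 0..360


def heading_change(a: float, b: float) -> float:
    """Smallest absolute angle between two headings, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_fixes(fixes: Iterable[Fix], threshold_deg: float = 5.0) -> List[Fix]:
    """Keep the first fix, then every fix whose heading differs from the
    last *recorded* heading by more than threshold_deg (assumed semantics)."""
    recorded: List[Fix] = []
    for fix in fixes:
        if not recorded or heading_change(fix.heading, recorded[-1].heading) > threshold_deg:
            recorded.append(fix)
    return recorded
```

Comparing against the last recorded heading means that a slow, steady turn still triggers a new record once the accumulated change exceeds the threshold; the wrap-around handling keeps a change from 6 to 359 degrees from being read as a 353-degree turn.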

Unsupervised Anomaly Detection

Data mining methods (applied here to increase the performance of single machines) can be categorized as either supervised or unsupervised. In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns among all the variables [5]. Consequently, unsupervised approaches can indicate potentially useful patterns and dependencies within the analyzed parameter space, although it is not known a priori which interdependencies, if any, might exist.

With regard to the intended application, three principal approaches are currently under consideration in AGATA. In all cases, the sensor data streams collected from the agricultural machines are seen as a multi-dimensional data space. Each sensor (or parameter) typically delivers a one-dimensional time series: the data stream consists of the respective values delivered over time.

First, outlier detection approaches try to identify “unexpected” values. Within a single data stream, the observed development over time is assumed to follow some regular function (usually with linear or periodic behavior). Minimizing the distances between the actual values and the assumed function makes it possible to determine the best-fitting function and to identify single values which deviate significantly from it. Whether these findings are useful, e.g. whether the identified function can be used for correct predictions, or whether an identified outlier represents an interesting event, has to be determined by agricultural domain experts.

Second, pairwise correlation analysis between the different parameters makes it possible to identify sensors whose readings are closely related. Such observations, if verified by the domain experts, make it possible to reduce the number of dimensions which need to be considered for a valid analysis.

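
Outlier detection against a fitted function and pairwise correlation analysis can be illustrated with a minimal sketch. It assumes a linear trend fitted by least squares; the robust threshold based on the median absolute deviation, the cut-off values and the stream names in the usage note are choices made for this example and are not taken from the AGATA project.

```python
from math import sqrt
from statistics import mean, median


def fit_line(ts, ys):
    """Least-squares fit of y = slope * t + intercept."""
    t_mean, y_mean = mean(ts), mean(ys)
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def find_outliers(ts, ys, k=3.0, min_scale=1e-9):
    """Indices whose residual from the fitted line deviates by more than
    k robust standard deviations (MAD-based scale, an assumed choice)."""
    slope, intercept = fit_line(ts, ys)
    residuals = [y - (slope * t + intercept) for t, y in zip(ts, ys)]
    centre = median(residuals)
    mad = median(abs(r - centre) for r in residuals)
    scale = max(1.4826 * mad, min_scale)  # floor avoids flagging rounding noise
    return [i for i, r in enumerate(residuals) if abs(r - centre) > k * scale]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlated_pairs(streams, min_abs_r=0.95):
    """All pairs of named streams with |r| >= min_abs_r."""
    names = sorted(streams)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(streams[a], streams[b])
            if abs(r) >= min_abs_r:
                pairs.append((a, b, r))
    return pairs
```

Calling `correlated_pairs({"engine_speed": [...], "fuel_rate": [...]})` with equally long series returns candidate pairs for the domain experts to review; whether a flagged outlier or a strongly correlated pair is meaningful remains their decision, as stated above.
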
Finally, by considering a number of different parameters together, similar operational situations over time can be identified by applying clustering approaches. Again, defining the correct (i.e. useful) distance measures between the parameter vectors is the crucial challenge to be solved. Since a typical harvest scenario depends not only on the machine in operation but also on a number of context factors, such as the location and shape of the field, the density of the crop, or the current weather pattern, such data have to be taken into account when looking for unconventional and interesting work situations within the recorded data. The combination of geo-spatial and time-related environment data with the recorded machine sensor readings raises challenging research questions which are still under initial investigation.
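
The clustering idea can be sketched with a small k-means implementation. The text leaves the choice of distance measure open, so this example simply assumes Euclidean distance on z-score standardized parameters, so that no single sensor dominates because of its unit. The deterministic farthest-point initialization and all names are likewise assumptions made for illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Scale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns stay at 0
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    # Deterministic start: first point, then repeatedly the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels
```

A parameter vector could, for instance, combine machine readings with context values such as crop density; `kmeans(standardize(rows), k=2)` then groups similar operational situations. Other distance measures can be substituted by replacing `dist`.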

Supervised Anomaly Detection

In supervised methods, training data with measurements of several input variables and a particular pre-specified target variable of the system output are given. Often, model-based approaches are used to learn the dependency between the values of the target and the values of the input variables. For anomaly detection, the identified model is employed to simulate the normal behavior of a system. For this, the simulation model needs all inputs of the system, e.g. crop variety, user input, status, etc. If the actual measurements deviate significantly from the predictions, the behavior is classified as anomalous (see e.g. [6]).

The behavior models can be learned from data collected about the system and its components in normal, fault-free operation, using algorithms such as (Hy-)BUTLA or OTALA [7]. Since modern combine harvesters create a huge amount of data, MapReduce technology is applied to identify the behavior models [8]. OTALA is applied to learn models of the discrete states, and quadratic regression models (QRM) are generated for the continuous behavior. Both model learning algorithms have been parallelized using MapReduce. The MapReduce version of OTALA makes it possible to distribute the workload over |T| nodes, where T is the set of transitions; a speedup is achieved because each transition can be processed in parallel in the REDUCE function. For the MapReduce version of QRM, the workload can be distributed over |S| nodes by processing the states S of the automaton in parallel in the REDUCE function. Furthermore, online algorithms have been proposed which efficiently handle novel observations in order to update models that have been created from large historical data sets. So far, the proposed MapReduce algorithms have been implemented on a single node. In the next step, the speedup of OTALA and QRM has to be evaluated in a Hadoop cluster (cf. Fig. 1).
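
The per-state parallelism of the QRM step and the residual-based anomaly check can be illustrated with a single-process sketch. It is not the algorithm from [8]: it assumes that each record already carries a discrete state label (which OTALA would provide), uses a single input variable with the model y = c0 + c1·x + c2·x², and solves the normal equations directly. All function names and the tolerance parameter are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """MAP: emit (state, (x, y)) so that samples are keyed by state."""
    for state, x, y in records:
        yield state, (x, y)


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def reduce_phase(samples):
    """REDUCE: least-squares quadratic fit for the samples of one state.
    Needs at least three distinct x values, otherwise the system is singular."""
    s = [sum(x ** p for x, _ in samples) for p in range(5)]
    t = [sum(y * x ** p for x, y in samples) for p in range(3)]
    a = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve3(a, t)


def learn_models(records):
    """One independent reduce call per state; on a cluster these calls
    could run on up to |S| nodes."""
    groups = shuffle(map_phase(records))
    return {state: reduce_phase(samples) for state, samples in groups.items()}


def is_anomalous(models, state, x, y, tolerance):
    """Flag a measurement that deviates from the state's model prediction."""
    c0, c1, c2 = models[state]
    return abs(y - (c0 + c1 * x + c2 * x * x)) > tolerance
```

Because `reduce_phase` only sees the samples of one state, the calls share no data, which is the property that allows the workload to be distributed by state.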

The identified behavior models are then used to detect (i) discrete events (including their timing) that are incompatible with the learned process models, and (ii) deviations of the continuous behavior from the predicted process behavior.

Integration of Data Mining Technologies into the operational environment

In the run-up to the 2015 harvest season, 8 combine harvesters on 2 farms were equipped with data acquisition and communication modules to capture all CAN messages. In addition, a subset of the combine harvesters was equipped with cameras to provide images of the machine’s surroundings, since no sensors or other information sources are available to characterize the crop or the harvesting conditions in detail. Environment data such as field boundaries and digital terrain models were integrated into the cluster to serve as filters in analytic processes. During harvest, drivers noted interesting observations, which serve as labels in the subsequent model-based anomaly analysis.

Benefits, Limitations and Vision

This paper introduces a data mining project at an early stage. The focus is on basic aspects such as infrastructure, some basic procedures, integration, and data acquisition during harvesting. Considering the large number of attributes and the multitude of machines and harvesting seasons, we are dealing with high dimensionality. Other industries such as telecommunications have shown that data analytics technologies make a much deeper insight into processes possible and that as yet unknown interrelations can be discovered. We expect to support machine operators by deriving recommendations from identified anomalies. Since little research has been done on analyzing mobile machines in areas with limited communication network coverage, it is not yet clear whether the current architecture is appropriate for providing near-real-time recommendations.

The authors would like to thank the German Federal Ministry of Education and Research for supporting this research project.

[1] Wirth, R.; Hipp, J.: CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pp. 29-39, 2000.

[2] Feiffer, A.: Großversuch mit dem CR 980: Einfluss des YARA N-Sensors auf die Mähdrescherleistung. Zentr. für Mechanisierung und Technologie, Sondershausen, 2004.

[3] Nakajima, S.: Introduction to TPM: Total Productive Maintenance. Productivity Press, Cambridge, 1988.

[4] Weber, M.; Shahd, M.; Hampe, K.: Potenziale und Einsatz von Big Data – Ergebnisse einer repräsentativen Befragung von Unternehmen in Deutschland. BITKOM, Berlin, May 2014. Online at https://www.bitkom.org/Bitkom/Publikationen/Publikation_2564.html, last access 29 July 2015.

[5] Larose, D. T.: Data Mining and Predictive Analysis, p. 160. Wiley & Sons, Hoboken, New Jersey, 2015.

[6] Faltinski, S.; Flatt, H.; Pethig, F.; Kroll, B.; Vodenčarević, A.; Maier, A.; Niggemann, O.: Detecting anomalous energy consumptions in distributed manufacturing systems. In: Industrial Informatics (INDIN), 2012 9th IEEE International Conference on, pp. 358-363, 2012.

[7] Maier, A.: Identification of timed behavior models for diagnosis in production systems. Ph.D. dissertation, University of Paderborn, Feb 2015.

[8] Windmann, S.; Niggemann, O.: MapReduce Algorithms for Efficient Generation of CPS Models from Large Historical Data Sets. In: IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2015), Luxembourg, Sep 2015.