Apostolos Papageorgiou, Mischa Schmidt, JaeSeung Song, and Nobuharu Kami. Efficient filtering processes for machine-to-machine data based on automation modules and data-agnostic algorithms. International Journal of Business Process Integration and Management (IJBPIM), Inderscience Publishers, 2014, Vol. 7, No. 1, pp. 73–86. DOI: 10.1504/IJBPIM.2014.060606

Efficient filtering processes for machine-to-machine data based on automation modules and data-agnostic algorithms

Apostolos Papageorgiou* and Mischa Schmidt
NEC Laboratories Europe, Kurfürstenanlage 36, 69115 Heidelberg, Germany
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

JaeSeung Song
Sejong University, Department of Computer and Information Security, Gwangjin-Gu, Seoul, 143-747, Korea
E-mail: [email protected]

Nobuharu Kami
NEC Corporation, 1753 Shimonumabe, Nakahara-ku, Kawasaki, Kanagawa 211-8666, Japan
E-mail: [email protected]

Abstract: Machine-to-machine (M2M) platforms are evolving as large-scale, multi-layer solutions that unify the access to and the control of all devices that are being equipped with the capability to perform automated tasks and to report data based on connectivity to a backend system. As the integration of more and more devices in such platforms results in the need to handle big M2M data, M2M platforms need to automate their configuration and include appropriate data filtering frameworks and algorithms. Otherwise, the collected raw data can become expensive, unmanageable, and of low quality. This paper presents how data filtering processes can be automated as part of an M2M self-configuration framework and describes a solution that enables the seamless adjustment of domain-specific filtering thresholds in domain-agnostic platforms, based on quality-of-information calculations and M2M-specific data categorisation. An evaluation from the facilities-monitoring domain shows that our approach was the only one to achieve, for example, forwarding less than 25% of the monitored data while maintaining a coverage ratio greater than 50% for all considered applications. Further, a projection of this evaluation to a Smart City scale indicates that such gains can make database queries faster by up to several seconds.

Keywords: machine-to-machine; M2M; filtering; autonomic computing; processes; Big Data; platform.

Reference to this paper should be made as follows: Papageorgiou, A., Schmidt, M., Song, J. and Kami, N. (2014) ‘Efficient filtering processes for machine-to-machine data based on automation modules and data-agnostic algorithms’, Int. J. Business Process Integration and Management, Vol. 7, No. 1, pp.73–86.

Biographical notes: Apostolos Papageorgiou is a Research Scientist of the Cloud Systems and Smart Things group at NEC Laboratories Europe.
His research focuses on efficient data handling and configuration in machine-to-machine systems, while he is also active in standardisation, representing NEC at The Broadband Forum. He received his Diploma in Computer Engineering from the University of Patras (Greece) and his PhD degree from the Technische Universitaet Darmstadt (Germany).

Mischa Schmidt is a Senior Researcher at NEC Laboratories Europe, working in the Converged Networks Research and Standards Group. He was active in the standardisation of next generation telecommunication networks in IETF and in ETSI TISPAN, where he was the Vice-Chair of WG3 (protocols) and rapporteur of multiple standards. He received his Diploma in Computer Science, with special focus on computer vision, computer graphics and pattern recognition, from the University of Mannheim, Germany, in 2003. His research interests include connected home technologies and services, decentralised content distribution, as well as energy management.

Copyright © 2014 Inderscience Enterprises Ltd.


JaeSeung Song is a Senior Standard Researcher in NEC Europe Ltd. and an Assistant Professor in the Computer and Information Security Department at Sejong University. His research areas include IoT/M2M platforms, big data analytics, and the reliability and security of networked software systems. Previously, he worked for LG Electronics as a Senior Researcher for seven years. He has occupied leadership positions in the 3GPP and oneM2M standards groups as a rapporteur and contributor. He received his PhD degree at Imperial College London in the Department of Computing, UK. He received his BS and MS in Computer Science from Sogang University.

Nobuharu Kami is a Principal Researcher in NEC Corporation. He has been mostly engaged in developing computer networks. His research interests cover a wide range of topics related to control and management technologies in computer networks and M2M platforms, with a particular focus on algorithm and architecture design. He received his BE, ME and PhD degrees from the University of Tokyo, Tokyo, Japan.

This paper is a revised and expanded version of a paper entitled ‘Smart M2M data filtering using domain-specific thresholds in domain-agnostic platforms’ presented at the IEEE Big Data Congress 2013, Santa Clara, CA, 27 June to 2 July 2013.

1 Introduction

More and more machines are connected to each other, communicating without human intervention. The number of interconnected machines has been growing steeply, and many research surveys estimate that several billions of devices will be interconnected in the near future. This increasingly popular machine-to-machine (M2M) communication paradigm enables the collection of data from a huge number of different sources. This brings Big Data research (Agrawal et al., 2012), which is mainly characterised by approaches for identifying hidden information (Gatti et al., 2012) or for performing new kinds of actions (Laurila et al., 2012), into the domain of M2M research. As explained in Manyika et al. (2011), in many sectors Big Data can refer to up to multiple petabytes, much of it being exhaust data, i.e., data collected blindly that may never be used. At this growth rate, methods for avoiding the consumption of expensive computing and storage resources are needed sooner or later. Even now, for complex analyses such as multivariate statistics, random data sampling is often necessary as a pre-processing step. Considering conventional M2M data handling processes, M2M technologies will sooner or later face a situation where massive data handling capabilities must be part of the core functions of the M2M service layer platform. Thus, among others, filtering will be required as a form of pre-processing in order to maintain the quality and manageability of the big M2M data.

M2M systems are systems in which various devices, sensors, and computing systems communicate with the physical environment and with each other without human intervention, in order to offer easy, homogeneous, and efficient remote access to machines, or to the physical world itself on a higher level. M2M communication, protocols, and platform architectures are hot issues for industry, standards, and research (Matsuda et al., 2011; Wu et al., 2011; Tan et al., 2011).
In terms of data processing, many M2M systems are structured in multiple layers, and each layer performs a different role in handling the data. For example, the bottom layer of an M2M system is typically responsible for gathering data from various sources (i.e., sensors, devices, home appliances, etc.) and local area networks that use different access technologies [e.g., ZigBee (http://www.zigbee.org/), KNX (http://www.knx.org/), Bluetooth (http://www.bluetooth.org/), etc.]. The collected data is then managed or delivered to an upper layer for further processing. Big Data processing can be required in this layer in order to identify useful information inside massively collected datasets. However, one of the main problems of this process is that the datasets collected from lower layers typically contain raw data. Raw data are more difficult to process and often contain much unnecessary information. We believe that data filtering can be applied as an effective and feasible technique to eliminate unnecessary information at the bottom layer, so that the data collected at the M2M service layer platform is cheaper to store, faster to process, and less likely to overload the database interfaces. To accommodate workable filtering functionalities inside M2M systems, priority control based on smart quality-of-information (QoI) handling will become the heart of the enabling technology, because it allows keeping the data size in the backend system manageable while remaining highly efficient. In that direction, this work contributes the following:

• A high-level framework for self-configuration of M2M platforms and the description of data filtering-related processes as part of this framework.

• A data filtering solution tailored to M2M platforms that span application domains, based on automated filtering threshold adjustments and on the usage of M2M-specific data categories.

• An evaluation from the facility-monitoring domain, which demonstrates unique benefits of the approach, i.e., the ease of configurability, as well as the capability to achieve higher filtering ratios without withdrawing useful data. Further, a projection of this evaluation to a Smart City scale and an analysis of the impact of data filtering on database queries is also provided.

The next section introduces the state of the art of modern M2M platforms and their data filtering processes. Afterwards, detailed presentations of autonomous M2M systems, our approach for autonomic M2M data filtering, and our Smart City-oriented evaluations follow.

2 M2M data filtering processes

This section explains how the processes of M2M data filtering are changing due to the technological trends of modern M2M platforms. Afterwards, related approaches to data filtering are introduced.

2.1 M2M data handling processes

An M2M system typically follows a layered architecture style, as shown in Figure 1. Standardisation bodies such as the European Telecommunications Standards Institute (ETSI) (Boswarthik et al., 2012) and oneM2M (http://www.onem2m.org/), as well as many published solutions (Matsuda et al., 2011; Wahle et al., 2012), apply very similar layered architectures. In such a layered architecture, M2M data is handled in four stages: sensing, gathering, processing, and utilising.

Figure 1  M2M data handling process and different possible bottlenecks (see online version for colours)


Data gathering is normally performed at a gateway, i.e., a node that interconnects constrained devices (e.g., sensors) with the network infrastructure. The gateway normally runs as a proxy, so that all data from local area networks are delivered to the core network via the gateway. Since a gateway normally manages many sensors and gathers all data from them, it often faces data processing bottlenecks. Next is the data processing layer, which is normally equipped with databases or data warehouses in order to manage, process, and store the data collected from underlying layers. Finally, the data is utilised by applications which access a common API, also provided by the backend. As depicted in Figure 1, this work focuses on three main bottlenecks that might appear in such a layered system, namely:

1 M2M data storage

2 M2M backbone bandwidth consumption

3 M2M database access.

The latter refers mainly to the prevention of system crashes due to many concurrent database access requests, as well as to the problems caused when database tables with billions of rows are created and must be queried. Note that these problems are technically different from (1), which refers to the high data centre costs of storing big amounts of data. Intelligently filtering the data flow between the frontend and the backend can be a promising way to reduce the problems caused by all three bottlenecks. However, filtering at this point is particularly challenging for modern M2M platforms, because various – possibly competing – industrial domains might be served at the same time, which increases the complexity of filter management.

2.2 M2M data filtering challenges

Data sensing is the process of reading contextual information about the environment and things of interest. Devices such as sensors, meters, embedded computers, and smartphones can act as M2M data sources. For example, multiple sensors can be deployed in water to read the level of tides, motion sensors can be used to detect suspicious behaviour in empty buildings, etc. Depending on the types of sensors and on the industry domain, different data collection strategies may be required.

Since current M2M systems are fragmented, the traditional approach to launching a new M2M application is to deploy new devices, gateway software modules, and databases, which are dedicated to this application and serve its needs. This approach can provide domain-specific value-added services to the application users. However, because of interoperability issues, current research and latest trends suggest a transition towards horizontal solutions (Matsuda et al., 2011; Boswarthik et al., 2012). In the horizontal approach, M2M platforms are configurable and enriched with various technologies such as device abstraction, autonomic functionalities, and flexible data processing modules. Figures 2 and 3 illustrate the difference between the traditional approach of vertical systems and the modern paradigm of horizontal integration.

Figure 2  Vertical M2M systems (see online version for colours)

Although horizontal integration has many other challenges, e.g., the difficulty of reusing specific device technologies for different domains/verticals, one of the main issues is that single domains or applications cannot directly dictate to the lower levels what data to select or how to filter them. This is mainly because applications are not exclusively served by the platform, but co-exist with applications of different domains. This brings up the following question: how can smart filtering mechanisms be applied in horizontal approaches, if domain-specific modules deployed on the gateways are not allowed to encompass vertical filtering and selection logic? In addition, horizontal solutions need to simplify the usage of smart filters. For example, an M2M operator often needs to configure the thresholds of hundreds of filters that refer to different data, metrics, and scales for assessing the data. Such filters are usually too complex for a backend operator to handle.

Figure 3  The horizontal M2M paradigm (see online version for colours)

2.3 Related work on data filtering

A first class of filtering approaches consists of general-purpose filtering methods. These approaches use algorithms that remove or aggregate duplicate data, erroneous data, outliers, etc. Mylyy (2008) describes such filtering and aggregation for RFID data, while Wedin et al. (2008) give a good overview of general data filtering methods. These algorithms do not care about the semantics of the data and are obviously domain- and application-independent. Thus, it is feasible to apply them in horizontal solutions, too. However, they are more appropriate for fault-tolerance and much less promising for load reduction: they cannot be expected to filter significant amounts of the data reported by a well-functioning M2M area network. Therefore, approaches that ‘look into’ the data have appeared. QoI assessment techniques are used for evaluating how much, or how important, information is contained in the data. Hossain et al. (2011) present algorithms for evaluating the importance of multimedia content, Zöller et al. (2011) assess sensor readings based on providers’ score sheets, while Stvilia et al. (2007) present a general framework that can be customised for assessing QoI in different use cases. However, these approaches are perfectly appropriate only for vertical solutions (cf. Section 2). For example, healthcare or security applications may have a different notion of multimedia content importance than home environments. Thus, in a platform that serves all verticals, it is difficult to decide which formulas and algorithms (i.e., which instantiation of a QoI calculator such as the one described in Hossain et al. (2011)) should be used. Similarly, score sheets such as those used in Zöller et al. (2011) are too fine-granular for the frontend of a horizontal M2M platform. Another idea would be to filter based on data classification.
The logic would be similar to QoI assessment, but the importance of the data would be judged based on the category it belongs to, rather than on a QoI score. The scientific methods used are also quite different here. For example, the Kobe solution (Chu et al., 2011) of Microsoft Research uses standard machine learning techniques in order to classify the readings of different sensors (sound recordings, images, GPS series) into various categories, which could be, for example, ‘human voice’, ‘loud noise’, ‘fast movement’, and many more. In order to use similar data classifications for performing M2M

data filtering, we would need to adjust the class definitions and the training methods of the machine learning approach to the M2M scenario. Then, if it were desired to use the classification for filtering purposes, it would be enough to provide the M2M platform operator with the ability to express which classes of readings should be reported to the M2M backend. However, not only are existing class definitions and training methods immature for supporting efficient M2M data filtering, but such a solution would also run into the same problem as QoI assessment does, i.e., missing techniques for the harmonious coexistence of various verticals. Considering the pros and cons of these approaches, and having examined the gaps that appear when it comes to their employment inside a horizontal M2M platform, we believe that the best approach for addressing the gaps of the related work is the following: conceive a solution that allows for the co-existence of domain-specific filters, but in a coordinated, controlled, and regulated manner that does not harm the ‘horizontality’ of the platform. Such an approach is contributed by this work and presented in the following.

3 System and methods for efficient semi-autonomous M2M data filtering

This section presents an approach for tackling the challenges and bridging the previously identified gaps based on an automation framework and a filtering logic that is tailored to horizontal M2M platforms, i.e., platforms that serve many different applications and many different vertical industries. The presentation starts with a description of the general concept and framework of ‘autonomous M2M’. Then, it is explained how data filtering is part of this framework. Finally, the details and the algorithms of the developed filtering logic are presented. The core idea is the hiding of domain-specific filtering thresholds inside the logic of special constructs, so that the filtering algorithms can work domain-agnostically.

3.1 The concept of self-configuring M2M platforms

Although vertical M2M solutions have practically existed for a long time (possibly under different names), horizontal solutions and extensive M2M platforms only now start to make sense, because of the wide scope and the big scale they are intended to have. Thus, like every other computing platform of such a scale, M2M platforms can benefit a lot from automating many of their operational functions.


The term ‘autonomous M2M’, understood as a specific case of ‘autonomic computing’ (Kephart and Chess, 2003), is a very broad term, which can refer to the automation of various tasks of an M2M system. Note that scientific works transferring the concepts of autonomic computing to the M2M context have already started appearing, e.g., Alaya et al. (2012). Traditionally, the so-called ‘self-* properties’ (Kephart and Chess, 2003) are considered elementary parts of an autonomic computing system. In this work, focusing on data as well as on the processes for smart data management, we are mostly concerned with the self-configuration capabilities that can help achieve smarter data handling, but also reduce operational costs, namely:

• personnel costs for the operation of the platform (by automating the configuration tasks)

• hardware costs of the backend system, mainly data centre storage and energy consumption (by including automated intelligence that reduces the amount of data – and data operations – needed to satisfy the applications).

Figure 4 depicts the concept of self-configuration in M2M platforms. Backend-level commands and requirements can be provided by the platform operator. Based on them, as well as on further system context parameters, an automated logic determines the configurations of the gateways. The goal is:

1 to relieve the operator from having to manually set hundreds of parameters on thousands of gateways that control millions of devices

2 to achieve configuration combinations that the operator could not have thought of, because it is impossible to consider all involved system synergies.

3.2 Data filtering as part of self-configuring M2M platforms

The deeper analysis of concrete commands, contexts, and configurations for Figure 4 is out of scope for this paper. This paper rather focuses on a new data filtering process, which is built in accordance with, and as part of, this concept of self-configuring M2M platforms. The idea is to have a filtering module (among the other gateway applications/modules) which controls which of the data captured by other gateway applications should be forwarded to the backend and stored. The filtering module includes different filters, each of which has a – dynamically changing – filtering threshold. More details are provided in the following subsection. Here, Figure 5 simply depicts how the filtering module is part of the examined technical landscape and how the filtering thresholds are defined as ‘self-configurable’ parameters.

Figure 4  Core elements of a self-configuring M2M platform

Figure 5  Filtering and filtering thresholds in the context of self-configuring M2M platforms

Figure 6  Sequence diagram of the M2M data filtering logic

3.3 A solution for enabling domain-agnostic handling of domain-specific filtering thresholds

The main points that render the solution presented here conceptually and technically different from related approaches are the following:

• A system (flow, algorithms, and configuration) and methods that unify the handling of various filters from different verticals. To the best of our knowledge, this is not only the first work that specifies filtering as an independent module of an M2M platform, assembled from well-defined components, but it also has the following novelties:

1 it is based on QoI without prescribing exact QoI assessment functions

2 it uses M2M data categorisation

3 it includes intelligence for filter threshold management

4 it combines filtering with access control.

• A common, system-wide logic for meaningfully adjusting the thresholds of heterogeneous filters based on M2M-specific data categorisation and M2M-specific platform structures. This includes the first filter-independent filtering-related algorithm, a step towards unifying filters or supporting their co-existence.

Assuming a standard M2M architecture enhanced by a filtering module on each gateway, the flow of the proposed filtering logic is summarised in the sequence diagram of Figure 6. The entities ‘M2M platform frontend’, ‘M2M data gateway’, and ‘M2M device’ are also shown in Figure 6 and have obvious tasks, so that their role and their position in the technical landscape should be clear. As introduced in the previous subsection, the ‘M2M filtering module’ can be understood as a part of the M2M platform frontend, consisting of heterogeneous filtering functions, along with the software needed to interwork with the other components, as well as some configuration. It could be a gateway component such as a software library, a standalone gateway application, or an OSGi bundle. Below is a summary of the flow of the filtering system, whereby the most innovative parts relate to the way the threshold calculation, the automated threshold adjustment, and the transition from data assessment to final data forwarding are handled internally:

• Step 1: The idea is to let the M2M filtering module maintain different (possibly heterogeneous) thresholds for each filtering method that it supports or implements. These are adjusted each time the backend sends updated requirements. Note that the actions 1.x in the sequence diagram are decoupled and happen asynchronously to the others; the sequence only indicates an intuitively logical order. As is also shown in the definitions of Table 1, the requirements implicitly indicate which top percentage of the information of a certain category should currently be reported and stored in the backend databases. A design decision with regard to the thresholds is to implicitly define many different levels for the QoI values that can be returned by each method (i.e., timax levels for the ith method), in such a way that the distribution of the values over the intervals (ti, ti+1) would be approximately uniform. Of course, this cannot be specified further until the methods are defined and implemented. Once the above is specified, ‘current thresholds’ can be set upon reception of backend requirements with the logic of Algorithm 1 or sophisticated extensions of it. Algorithm 1 is a basic logic for filter-independent threshold adjustment based on backend requirements. The exact threshold adjustment approach is understood best by a combined study of the necessary definitions (Table 1), Algorithm 1, and the illustrative examples that are provided after the description of the steps of the flow.







• Step 2: This step refers to the standard process of communicating with the M2M devices and reading information from them. The details of this step are out of scope for this work. Different technologies may be involved, usually with periodic, but also event-based, triggering of this step. The evaluation of the next section is performed with polling at fixed intervals.

• Step 3: For this step, it is important to have established an appropriate set of filters. The decision here is to keep the filtering module open to extensions, i.e., to the addition of new filters not necessarily known when implementing the proposed filtering framework. However, this requires a new filter to have two features: firstly, the return values of its QoI function (qi) can be meaningfully split into levels; secondly, it can be meaningfully applied to a category of data (cqi) which is known and used in the system. When the actual data is captured from the M2M devices (Step 2), it is assessed by an appropriate filter fi of the filtering module (selected based on the category cqi of its QoI assessment function qi) and the calculated QoI values are compared with the corresponding thresholds. According to the result, the M2M data gateway may be allowed to report the data to the backend or not.

• Step 4: According to the definitions of Table 1, the final filtering action looks like: if qi(data) ≥ v(ti) then return_passkey else reject, where the function v(t) returns the QoI value that corresponds to the threshold t, while a passkey may be generated by the filtering module, allowing the reporting of the data to the backend. Employing an access control mechanism in the context of data filtering, so that a platform operator can control the rate of the reported data without having to blindly reject write requests, is nothing obvious. In a standard approach, the M2M filtering module would forward the data by itself to the backend. However, the described passkey-based mechanism stems from the scenario-specific feature that – according to standardisation trends – the M2M data gateway has components of many providers, while the M2M filtering module is controlled by the M2M platform operator.
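The final filtering action of Step 4 can be sketched in Python as follows; the dictionary-based filter structure, the example QoI function, and the token-based passkey are illustrative assumptions, since the paper leaves the passkey implementation open:

```python
import secrets

def filtering_decision(filt, data, v):
    """If q(data) >= v(t), the data may be reported and a passkey is
    returned to the reporting gateway application; otherwise the data
    is rejected (filtered out)."""
    qoi = filt["q"](data)            # QoI assessment function q_i
    if qoi >= v(filt["t"]):          # compare with current threshold value
        return secrets.token_hex(8)  # passkey allowing the report
    return None                      # reject: data is filtered

# Illustrative filter: QoI is the absolute deviation of a temperature
# reading from 20.0 degrees; levels every 1 point, so v(t) = t.
f = {"q": lambda reading: abs(reading - 20.0), "t": 2}
v = lambda t: float(t)

print(filtering_decision(f, 25.0, v) is not None)  # → True (reported)
print(filtering_decision(f, 20.5, v) is not None)  # → False (filtered)
```

In a fully operator-owned and trusted gateway, the passkey can degenerate to a simple yes-or-no response, as also noted later in this section.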

Table 1  Definitions

R = [r1, r2 … rn], ri = (ci, wi), ci ∈ ℕ, wi ∈ ℝ, 0 ≤ wi ≤ 1

R: Set of requirements r, each of them being a category-weight pair.
ci: An integer corresponding to a category (or type) of M2M data (range and mapping rules are ontology- and implementation-specific).
wi: A real number between 0 and 1 providing the weight of a category (i.e., wi × 100% of the data – the most important – shall be stored).

F = [f1, f2 … fm], fi = (qi, cqi, ti), ti ∈ [0, timax], ti, cqi ∈ ℕ

F: Set of filters f implemented in the filtering module.
qi: Function used by fi for QoI calculation, implementation-specific (any logic; qi(data) returns a value in the range [0, qimax]).
cqi: An integer corresponding to the category of data fi is used for.
ti: Current threshold for fi; data mapped by qi below it are filtered.

Algorithm 1  Basic logic for threshold adjustment

// Input: R (set of requirements)
// Output: T (list of thresholds)
for i = 1 to m
    for j = 1 to n
        if cqi = cj
            ti = ⌊(1 – wj) × timax⌋
            T.add(ti)
        end if
    end for
end for
return T
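A minimal Python rendering of Algorithm 1, following the definitions of Table 1 (the list-of-dictionaries representation of the filters is an illustrative assumption):

```python
from math import floor

def adjust_thresholds(requirements, filters):
    """Algorithm 1: for every filter whose data category cq matches a
    requirement category c, set the current threshold to
    floor((1 - w) * tmax), so that, given approximately uniformly
    populated QoI levels, at least w * 100% of that category's data
    passes the filter."""
    thresholds = []
    for f in filters:
        for c, w in requirements:
            if f["cq"] == c:
                f["t"] = floor((1 - w) * f["tmax"])
                thresholds.append(f["t"])
    return thresholds

# Two filters as in the running example: f1 (temperature, 6 levels)
# and f2 (video, 10 levels), with requirements for categories 1-3.
R = [(1, 0.75), (2, 0.7), (3, 0.5)]
F = [{"cq": 2, "tmax": 6}, {"cq": 1, "tmax": 10}]
print(adjust_thresholds(R, F))
```

Sophisticated extensions, e.g., resolving conflicts when several requirements match one filter, are left open by the framework.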

The following paragraphs provide a better understanding of the proposed approach through an illustrative overview and a simple example for the system compilation and the filtering configuration. Figure 7 depicts a landscape with multiple gateways showing how each of them may contain multiple filters, and thus multiple filter configurations, as well. Further, it is shown how requirements are sent from the M2M backend to the M2M frontend, while at the bottom of the figure, there is a zoom into two sample instances of filter configurations. The following example explains what these instances could refer to and how they could be set based on an example instance of R.

Figure 7  System overview with example filter configurations

For example, assume a filter f1 using as q1 an M2M equivalent of the approach of Zöller et al. (2011), thus assigning to sensor readings a QoI value (∈ ℝ) in the range [0, 6], and a filter f2 using as q2 an algorithm that evaluates the importance of captured video data between 0% and 100% [refer to Hossain et al. (2011) for a similar example]. Both are implemented in the filtering module of an M2M platform that works with video (category ‘1’), temperature (category ‘2’), and vehicle location (category ‘3’) data, because it serves, for example, a Smart City scenario (safety, energy, logistics, etc.). For the given filters, it is meaningful to set cq1 = 2 and cq2 = 1. Further, due to the continuous and ‘score-like’ nature of the values returned by q1 and q2, it would be possible to define, for example, six levels (i.e., seven possible threshold values; t1max = 6) for f1, e.g., every 1 point, and ten levels (i.e., eleven possible threshold values; t2max = 10) for f2, e.g., every 10%. Running Algorithm 1 for, e.g., R = [(1, 0.75), (2, 0.7), (3, 0.5)] would then set t1 = 1 and t2 = 2 as the current thresholds of the two existing filters.

What has been presented is a formal description of the framework parts that help understand the innovative aspects of the solution. In terms of engineering and algorithmic details, many parts have not been specified, either because they are obvious or because they are left open to the implementation. Such parts could be, for example, the software needed to implement such a filtering middleware, the exact logic of the used filters, or variations and more sophisticated extensions of Algorithm 1. Some further important features of the presented solution with regard to flexibility and scalability are the following:



•	Data for which it is impossible or not desirable to calculate their QoI can still be integrated in the solution. For such data, a trivial filter with two possible thresholds can be employed, set either to allow all or to filter all. Data that are critical and must always be sent can, of course, bypass the filtering module.

•	The implementation and the security levels of the passkey mechanism are left open. In a solution where all modules of the M2M data gateway are implemented and owned by the platform operator, and thus trusted, the mechanism can be implemented as a simple yes-or-no response.

•	Independently of the nature of a QoI calculation function q, an appropriate selection of thresholds can make the distribution of the returned values approximately uniform, thus rendering the function suitable for use by a filter of the presented approach.
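One way to realise the last point, sketched below under our own assumptions (this helper is not part of the described framework), is to place the level boundaries at empirical quantiles of a sample of q's outputs, so that each level captures roughly the same share of readings even when q's raw output distribution is skewed.

```python
def quantile_thresholds(qoi_samples, levels):
    """Place `levels - 1` threshold boundaries at empirical quantiles of
    observed QoI values, so that each level bucket holds a roughly equal
    share of readings even when q's output distribution is skewed."""
    xs = sorted(qoi_samples)
    n = len(xs)
    # The k-th boundary sits at the k/levels empirical quantile.
    return [xs[min(n - 1, (k * n) // levels)] for k in range(1, levels)]

# A skewed sample: most readings score low, few score high.
sample = [(i * i) / 100 for i in range(100)]
print(quantile_thresholds(sample, 4))  # [6.25, 25.0, 56.25]
```

Note how the boundaries are not equally spaced on the QoI axis; it is exactly this unequal spacing that makes the per-level hit rates approximately uniform.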

4	Evaluation

The main advantages of the presented approach do not lie in the filtering ratios themselves. Filtering ratios can in any case be configured directly or set to the desired levels, and they also depend heavily on the exact datasets, in a way that renders general-purpose comparisons meaningless. The advantages of our approach lie rather in enabling intelligent control of the filtering system and in easing filtering-related configuration actions.


Table 2	Impact of approach features on operator goals

Personnel costs:
•	Self-configured filtering thresholds: not requiring an operator to check and manually set the filtering thresholds can obviously lead to reduced manpower requirements.
•	Filter-independent threshold setting algorithm: no impact.
•	Passkey-based backend access admission: this security mechanism enables third-party gateway applications to run on the gateways; it thus implicitly outsources development effort, reducing personnel costs.

Hardware and software costs:
•	Self-configured filtering thresholds: if the self-configuration logic of the threshold is smart enough, database costs can be reduced by reducing the amount of data that needs to be stored.
•	Filter-independent threshold setting algorithm: the filter-independence of the self-configuration logic means that no new software or significant updates of the filtering modules are needed whenever new filters are added into the filtering module.
•	Passkey-based backend access admission: no impact.

Required technical skills:
•	Self-configured filtering thresholds: hiding the knowledge about the characteristics of the filters inside the self-configuration logic also lowers the requirements on the technical skills of the operator.
•	Filter-independent threshold setting algorithm: no impact.
•	Passkey-based backend access admission: as with personnel costs, 'outsourcing' the development of gateway applications in technical areas that are unrelated to the expertise of the platform owner can lower the needed technical skills.

Therefore, this evaluation consists of three parts. The first part is a simple qualitative analysis of the features that a platform which employs our approach does not need to have and the tasks that its operator does not need to perform. The second part is a quantitative evaluation and comparison of different filtering approaches based on a real deployment in the Commerzbank Arena football stadium in Frankfurt. The third part projects the results of the second part to a bigger scale in a methodical way, in order to evaluate what data filtering can mean in terms of database sizes in M2M deployments of Smart Cities.

4.1	Qualitative analysis of the benefits of data filtering self-configuration

Before discussing measurements from concrete real deployments, we first summarise in this subsection the positive effects of the presented solution with regard to the main goals of an M2M platform operator. Because it is currently difficult to quantify all of these effects, this is done with a simple qualitative analysis rather than with a scientific evaluation process; it is nevertheless presented here because it is a post-development analysis and is considered complementary to the rest of the evaluation. For this purpose, Table 2 explains how each of the three unique features of the approach (cf. previous sections) impacts each of the three selected goals of a platform operator. The rest of the evaluation is quantitative and focuses on the behaviour of the filtering approach in a real deployment and various Smart City-related scenarios.

4.2	Comparison of filtering ratios and application coverage

4.2.1	Scope and setup

The evaluation has been performed on data collected by a middleware which NEC and its partners have deployed to monitor and control the utilities of the Commerzbank Arena football stadium in Frankfurt. The middleware gives access to 13,143 objects (sensors, switches, meters, etc.) of the utilities and has been run in a read-only, 5-minute-interval polling mode for one day, so that 24 * 12 * 13,143 = 3,785,184 values have been read in total. Although the experiment is limited in size, it must be considered that it refers to only one day and only one facility. Further, there is no issue of scaling with regard to the evaluated metrics; that is, this small experiment is valid for examining filtering ratios and application coverage. A projection of the evaluation to a Smart City scale and to longer periods is provided in the next section in order to show the significance of the results in terms of database storage costs. The following setup details explain what has been evaluated and how:

•	Compared approaches: General-purpose filtering approaches would give the same result as simple data forwarding, because no redundancies in the sense of erroneous or duplicate data have been observed. This approach will be called BASIC and will serve as the evaluation baseline. Our approach will be called M2M-NECtar. Its implementation for this evaluation is described for the sake of completeness in note 3. Two further application-specific (AS) filters are considered for the comparisons: AS-FILTER-1 filters out zero and non-numerical values, while AS-FILTER-2 forwards only changed values.

•	Metrics: Two metrics have been used: the amount of forwarded (and stored) data (S), which will be presented over time for the day of the measurements, and the coverage (COV) provided to different applications after filtering. COV is defined as the percentage of the expected, i.e., desired or useful, readings that are actually forwarded. This is application-dependent, as each application may expect different readings. For the purpose of our analysis, we calculate this for three fictive applications: APP1 and APP2 expect exactly the data that is forwarded by AS-FILTER-1 and AS-FILTER-2, respectively, while APP3 expects the energy-related data, i.e., as if the QoI assessment functions used here for M2M-NECtar were specifically designed for it. The reason for this choice was to ease comparison by having (baseline) COV values at 100%. This will be better understood in the discussion of the results.

•	Variables: The dependent variables are, of course, the defined metrics S and COV. The only independent variable is time, as all the rest (expected readings, state of the utility, reading values, dataset size) are controlled variables, because they are identical for all compared approaches.

4.2.2	Results

The size of the complete dataset, i.e., the amount of data forwarded and stored by BASIC (not visible in the results), was 261.2 MB. Given that this comes from only one utility and with only trivial numerical entries (e.g., no video, audio, or complex data are monitored), it is easy to imagine that the storage requirements of a Smart City system can grow extremely large and become very expensive over the months. Figure 8 presents the data sizes forwarded and stored by each of the filtering approaches, while Table 3 presents the data coverage provided by the approaches to the mentioned fictive applications.

The storage requirements after filtering vary from 4.7 to 83.0 MB (Figure 8). However, due to the trade-off between storage requirements and application coverage, these results need to be examined in parallel with the values of Table 3, while there is also a certain degree of scenario-dependence in the judgement of the results. The next subsection summarises the main findings.

Table 3	Coverage (COV) provided to example applications after filtering

Filtering approach	APP1	APP2	APP3
BASIC	100%	100%	100%
M2M-NECtar (R1)	72.7%	45.5%	100%
M2M-NECtar (R2)	73.5%	60.7%	100%
M2M-NECtar (R3)	5.6%	51.0%	100%
AS-FILTER-1	100%	98.2%	80.2%
AS-FILTER-2	5.5%	100%	14.1%

4.2.3	Discussion

In general, the results prove the flexibility and the quality of the results achieved by M2M-NECtar, but they also indicate some limitations of the approach. Given that forwarding more data naturally leads to better coverage, the main conclusions are the following:

•	The most interesting observation stems from a comparison of the COVs of M2M-NECtar (R1) and M2M-NECtar (R2), looking in parallel at the data sizes they forward compared to AS-FILTER-1. More concretely, compared to AS-FILTER-1, both M2M-NECtar (R1) and M2M-NECtar (R2) reduce the data size by ca. 25% (Figure 8), but R2 achieves an unambiguously better coverage of APP2 than R1, without performing worse for APP1 and APP3. This implicitly demonstrates a unique advantage of the M2M-NECtar approach: threshold adjustments can lead to better coverage without forwarding and storing more data. The enablement and ease of such adjustments is what M2M-NECtar is mainly about, and the other approaches are obviously not flexible in this matter.

•	As a result of the above, M2M-NECtar presents in this case some unique achievements: for example, M2M-NECtar (R2) is the only approach that forwards less than 25% of the original dataset while maintaining a coverage bigger than 50% for all applications, and M2M-NECtar (R3) is the only approach that drops the data size to less than 10 MB without falling below 50% coverage for more than one application.

•	Although AS-FILTER-1 filters less data than all other approaches, it is very successful in maintaining a high degree of coverage (>80%) for all applications. Though this is partly because the definition of the examined applications favours this approach, it also reveals a limitation of our M2M-NECtar approach: in simple cases with a limited variety of data types and served applications, simpler approaches (such as AS-FILTER-1) may cover all the requirements while avoiding the complexity introduced by our approach. However, AS-FILTER-1 becomes by definition inapplicable when new applications appear or new data types (e.g., audio/video) are considered.

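For illustration, the two application-specific baselines above and the COV metric can be expressed as simple predicates over a reading stream. The sketch below runs on a toy stream; the function names and the stream are ours, not part of the actual deployment.

```python
def as_filter_1(readings):
    """Forward only numeric, non-zero values (sketch of AS-FILTER-1).
    Returns the indices of forwarded readings."""
    return [i for i, v in enumerate(readings)
            if isinstance(v, (int, float)) and v != 0]

def as_filter_2(readings):
    """Forward only values that changed since the previous reading
    (sketch of AS-FILTER-2); the first reading always counts as changed."""
    kept, prev = [], object()  # sentinel differs from any reading
    for i, v in enumerate(readings):
        if v != prev:
            kept.append(i)
        prev = v
    return kept

def coverage(forwarded, expected):
    """COV: share of the expected readings that were actually forwarded."""
    return len(set(forwarded) & set(expected)) / len(expected)

stream = [0, 21.5, 21.5, "n/a", 22.0, 0, 22.0]
f1, f2 = as_filter_1(stream), as_filter_2(stream)
# COV that AS-FILTER-1's output provides to an application expecting
# exactly AS-FILTER-2's output (the APP2 case of the text).
print(coverage(f1, f2))  # 0.5
```

This also makes the APP1/APP2 construction of the setup concrete: by definition, each AS filter provides 100% coverage to 'its own' application, while its coverage of the other application depends on the data.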

Figure 8	Data forwarded per filtering approach on a single day (see online version for colours)

Table 4	Device readings per day

Devices/building	13,143	13,143	13,143	13,143	13,143
Reading interval [min]	5	5	5	5	5
#Buildings	1	5	10	100	1,000
Device readings per hour	1.58E+05	7.89E+05	1.58E+06	1.58E+07	1.58E+08
Device readings per day	3.79E+06	1.89E+07	3.79E+07	3.79E+08	3.79E+09

4.3	Evaluation of the effects of higher filtering ratios in a Smart City scale

When considering scalability in terms of data load, not only the mere size of data is of importance, but also the implications on the amount of data records for searching and indexing. As this paper deals with M2M data in a general way, we keep the discussion on a general level as well, i.e., we do not discuss the used data types or sources. Based on the aforementioned Commerzbank Arena installation, we assume for the following discussion a 5-minute interval for reading all 13,143 data values per building and calculate the amount of data points for up to 1,000 such buildings to be handled by a single platform for a single day. The result is presented in Table 4. Based on the results of the previous subsection, which have also been presented in our earlier work (Papageorgiou et al., 2013), we reproduce in Table 5 the dataset size factor corresponding to the different algorithms.

Table 5	Dataset factor per algorithm

Algorithm	Resulting dataset factor
BASIC	1
M2M-NECtar (R1)	0.2
M2M-NECtar (R2)	0.2
M2M-NECtar (R3)	0.01
AS-FILTER-1	0.25
AS-FILTER-2	0.01

Applying the dataset factors of Table 5 to the daily datasets of Table 4 yields Table 6. While different index, search, and computation operations take different amounts of time and depend on the underlying data types, we try to estimate the time needed to fully scan one data attribute, e.g., a column in a database, represented by eight bytes (e.g., the column index). The resulting size of this column in bytes for a single day is shown in Table 7. According to a database guide of Oracle (2013), a CPU core can be roughly estimated to serve 200 MB/s of throughput when dimensioning a data warehouse. Based on this, the resulting time to fully scan and compare the data attribute of an entire day on a single CPU core is calculated in Table 8. Of course, faster CPUs, mechanisms to parallelise search such as MapReduce (Dean and Ghemawat, 2008), well-defined data indices, and appropriate data structures (e.g., bit-accurate storage of data) will dramatically reduce the time needed to fully search a data column in a database when compared to Table 8. However, the examined columns can also become bigger and more complex, especially if multiple days are stored in the same database tables. Thus, Table 8 gives a strong indication that data filtering approaches will help to drastically reduce the time needed for data operations in M2M settings, and this field of research is therefore another valid dimension of optimising the handling of large amounts of data in M2M systems.
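The projection behind Tables 6-8 is pure arithmetic and can be reproduced as follows. The sketch assumes, consistently with the table values, 8-byte column entries and a scan rate of 200 MiB/s per core; the constant and function names are ours.

```python
# Readings per building per day: 13,143 devices polled every 5 minutes.
READINGS_PER_BUILDING_PER_DAY = 13_143 * (24 * 60 // 5)  # = 3,785,184

# Dataset size factors from Table 5.
DATASET_FACTORS = {
    "BASIC": 1.0, "M2M-NECtar (R1)": 0.2, "M2M-NECtar (R2)": 0.2,
    "M2M-NECtar (R3)": 0.01, "AS-FILTER-1": 0.25, "AS-FILTER-2": 0.01,
}

BYTES_PER_ENTRY = 8            # one 8-byte column attribute per reading
SCAN_RATE = 200 * 1024 * 1024  # 200 MiB/s per CPU core (Oracle sizing guide)

def scan_seconds(algorithm, buildings):
    """Single-core time to fully scan one day's filtered column."""
    rows = READINGS_PER_BUILDING_PER_DAY * buildings * DATASET_FACTORS[algorithm]
    return rows * BYTES_PER_ENTRY / SCAN_RATE

print(round(scan_seconds("BASIC", 1000), 2))            # 144.39, as in Table 8
print(round(scan_seconds("M2M-NECtar (R1)", 1000), 2))  # 28.88
```

Matching the 144.39 s figure of Table 8 requires interpreting the 200 MB/s of the Oracle guide as 200 MiB/s; with decimal megabytes the baseline value would be about 151 s instead.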

Table 6	Filtered daily device dataset sizes

Algorithm	1 Bldg.	5 Bldg.	10 Bldg.	100 Bldg.	1,000 Bldg.
BASIC	3.79E+06	1.89E+07	3.79E+07	3.79E+08	3.79E+09
M2M-NECtar (R1)	7.57E+05	3.79E+06	7.57E+06	7.57E+07	7.57E+08
M2M-NECtar (R2)	7.57E+05	3.79E+06	7.57E+06	7.57E+07	7.57E+08
M2M-NECtar (R3)	3.79E+04	1.89E+05	3.79E+05	3.79E+06	3.79E+07
AS-FILTER-1	9.46E+05	4.73E+06	9.46E+06	9.46E+07	9.46E+08
AS-FILTER-2	3.79E+04	1.89E+05	3.79E+05	3.79E+06	3.79E+07

Table 7	Column size estimation (bytes)

Algorithm	1 Bldg.	5 Bldg.	10 Bldg.	100 Bldg.	1,000 Bldg.
BASIC	3.03E+07	1.51E+08	3.03E+08	3.03E+09	3.03E+10
M2M-NECtar (R1)	6.06E+06	3.03E+07	6.06E+07	6.06E+08	6.06E+09
M2M-NECtar (R2)	6.06E+06	3.03E+07	6.06E+07	6.06E+08	6.06E+09
M2M-NECtar (R3)	3.03E+05	1.51E+06	3.03E+06	3.03E+07	3.03E+08
AS-FILTER-1	7.57E+06	3.79E+07	7.57E+07	7.57E+08	7.57E+09
AS-FILTER-2	3.03E+05	1.51E+06	3.03E+06	3.03E+07	3.03E+08

Table 8	Time (in seconds) needed to scan full column

Algorithm	1 Bldg.	5 Bldg.	10 Bldg.	100 Bldg.	1,000 Bldg.
BASIC	0.14	0.72	1.44	14.44	144.39
M2M-NECtar (R1)	0.03	0.14	0.29	2.89	28.88
M2M-NECtar (R2)	0.03	0.14	0.29	2.89	28.88
M2M-NECtar (R3)	0.00	0.01	0.01	0.14	1.44
AS-FILTER-1	0.04	0.18	0.36	3.61	36.10
AS-FILTER-2	0.00	0.01	0.01	0.14	1.44

6	Conclusions

One of the main needs that appears together with the enablement and expansion of Big Data is an efficient way of data filtering that helps maintain the manageability and affordability of the data. This issue is particularly challenging in horizontal M2M platforms, because the latter co-serve different industrial domains and handle different categories of data, so that specific applications are no longer allowed to dictate which data should be forwarded to the backend and when, while the overall complexity of configuring the data filters grows as well. The present work describes a solution for including data filtering as part of a self-configuring M2M platform. Among others, the paper focuses on tackling the complexity of configuring heterogeneous filters in an automated but also efficient manner. The proposed filtering approach is called M2M-NECtar and its core novelties lie mainly in:

1	the module interactions and the flow of actions that unify the handling of heterogeneous filters

2	the filter-independent nature of the logic for adjusting filtering thresholds of heterogeneous filters based on M2M-specific data categorisation and M2M-specific requirements

3	the combination of filtering with access control mechanisms.

An evaluation from a real deployment in the facilities-monitoring domain has demonstrated the unique ease of configurability of the M2M-NECtar approach, as well as some unique results in a scenario from the utility-monitoring domain, e.g., providing the only solution that managed to forward less than 25% of the original dataset while maintaining a coverage bigger than 50% for all applications. Further, a projection of these results to a larger scale (Smart City scenario) has shown that the impact of the achieved filtering ratios can be significant. More concretely, an analysis based on characteristics of modern database technologies indicated that appropriate usage of our filtering solution can make queries on database tables of realistic sizes faster by many seconds.

Acknowledgements

The presented research work has been partly funded by the European Commission within the Seventh Framework Programme FP7 (FP7-ICT) as part of the Campus-21 project under Grant agreement 285729.


References

Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Halevy, A., Han, J., Jagadish, H.V., Labrinidis, A., Madden, S., Papakonstantinou, Y., Patel, J.M., Ramakrishnan, R., Ross, K., Shahabi, C., Suciu, D., Vaithyanathan, S. and Widom, J. (2012) Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States [online] http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. Alaya, M.B., Matoussi, S., Monteil, T. and Drira, K. (2012) ‘Autonomic computing system for self-management of machine-to-machine networks’, International Workshop on Self-aware Internet of Things (Self-IoT ‘12), pp.25–30, ACM. Boswarthik, D., Elloumi, O. and Hersent, O. (2012) M2M Communications: A Systems Approach, John Wiley & Sons Ltd, UK. Chu, D., Lane, N.D., Lai, T.T-T., Pang, C., Meng, X., Guo, Q., Li, F. and Zhao, F. (2011) ‘Balancing energy, latency and accuracy for mobile sensor data classification’, in ACM Conference on Embedded Networked Sensor Systems (SenSys ‘11), pp.54–67, ACM. Dean, J. and Ghemawat, S. (2008) ‘MapReduce: simplified data processing on large clusters’, ACM Communications, Vol. 51, No. 1, pp.107–113. Ericsson (2011) More than 50 Billion Connected Devices, Ericsson White Paper No. 284 23-3149, pp.1–12. Gatti, M., Herrmann, R., Loewenstern, D., Pinel, F. and Shwartz, L. (2012) ‘Domain-independent data validation and content assistance as a service’, in IEEE International Conference on Web Services (ICWS ‘12), pp.407–414, IEEE. Hossain, M.A., Atrey, P.K. and El-Saddik, A. (2011) ‘Modeling and assessing quality of information in multisensor multimedia monitoring systems’, ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), Vol. 7, No. 1, pp.3:1–3:30. Kephart, J. and Chess, D. (2003) ‘The vision of autonomic computing’, Computer, Vol. 36, No. 1, pp.41–50.
Laurila, J.K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J. and Miettinen, M. (2012) ‘The mobile data challenge: Big Data for mobile computing research’, in Mobile Data Challenge Nokia Workshop. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A.H. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, May [online] http://www.mckinsey.com/mgi/publications/big_data/pdfs/MGI_big_data_full_report.pdf. Matsuda, N., Sato, H., Koseki, S. and Nagai, N. (2011) ‘Development of the M2M service platform’, NEC Technical Journal, Special Issue on the Network of Things, Vol. 6, No. 4, pp.19–23. Mylyy, O. (2008) RFID Data Management, Aggregation and Filtering, Hasso Plattner Institute Publications, Seminar on RFID Technology. Oracle (2013) Oracle Database 2 Day + Data Warehousing Guide 11g, Release 1 (11.1) [online] http://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_system.htm#CHDFIGHF (accessed December 2013).

Papageorgiou, A., Schmidt, M., Song, J. and Kami, N. (2013) ‘Smart M2M data filtering using domain-specific thresholds in domain-agnostic platforms’, IEEE International Congress on Big Data (BigData 2013), pp.286–293, IEEE. Stvilia, B., Gasser, L., Twidale, M.B. and Smith, L.C. (2007) ‘A framework for information quality assessment’, Journal of the American Society for Information Science and Technology, Vol. 58, No. 12, pp.1720–1733. Tan, S.K., Sooriyabandara, M. and Fan, Z. (2011) ‘M2M communications in the smart grid: applications, standards, enabling technologies, and research challenges’, International Journal of Digital Multimedia Broadcasting, Vol. 2011, pp.1–8. The Broadband Forum (2013) TR-069 Amendment 4, CPE WAN Management Protocol, July [online] http://www.broadbandforum.org/technical/download/TR069Amendment-4.pdf (accessed November 2013). Wahle, S., Magedanz, T. and Schulze, F. (2012) ‘The OpenMTC framework – M2M solutions for smart cities and the internet of things’, in International Symposium on World of Wireless, Mobile and Multimedia Networks (WoWMoM ‘12), pp.1–3, IEEE. Wedin, O., Bogren, J. and Grabec, I. (2008) Data Filtering Methods, EU Project Deliverable, Roadidea 215455. Wu, G., Talwar, S., Johnsson, K., Himayat, N. and Johnson, K.D. (2011) ‘M2M: from mobile to embedded internet’, IEEE Communications Magazine, Vol. 49, No. 4, pp.36–43. Zöller, S., Reinhardt, A., Schulte, S. and Steinmetz, R. (2011) ‘Scoresheet-based event relevance determination for energy efficiency in wireless sensor networks’, in IEEE Conference on Local Computer Networks (LCN), EDAS Conference Services, pp.207–210.

Notes

1	For example, the TR-069 data model of the Broadband Forum (2013) specifies many hundreds of parameters that can be remotely set on a device such as a home/M2M gateway. Although not all of them are relevant in the context of this work, this is a representative example.

2	Related surveys, e.g., Ericsson (2011), even predict billions of M2M devices.

3	For this evaluation, M2M-NECtar defines three categories of M2M data (c1: energy-related sensors, c2: meters, and c3: switches) and three respective filters with QoI assessment functions q1, q2, q3. q1 is ignored (dummy) because the respective category is assumed to be critical and thus always weighted with 1.0. q2 calculates the QoI score of a reading as the absolute value of the relative change of its value since the last reading; everything above 5% is reduced to 5% and assigned the top score. q3 assigns integer scores in the range [0, 2], i.e., 0 for unchanged switches in off-state, 1 for unchanged switches in on-state, and 2 for changed switches. Further, t2max = 10 and t3max = 2 were defined, with the obvious (uniform) assignment of thresholds to QoI values (v(t) function). Finally, the measurements are performed for three different requirements: R1 = [(1, 1.0), (2, 0.7), (3, 0.7)], R2 = [(1, 1.0), (2, 0.9), (3, 0.5)], and R3 = [(1, 1.0), (2, 0.8), (3, 0.2)].
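The QoI functions of note 3 can be sketched as follows. The 5% cap of q2 is mapped linearly onto the eleven thresholds of t2max = 10, which is our reading of the 'uniform assignment' v(t); edge cases such as a zero previous value are handled arbitrarily here, and all names are illustrative.

```python
def q2(prev, curr):
    """QoI of a meter reading: absolute relative change since the last
    reading, capped at 5% (which receives the top score)."""
    if prev == 0:  # assumption: treat any change from zero as maximal
        return 0.05 if curr != 0 else 0.0
    return min(abs((curr - prev) / prev), 0.05)

def q3(prev_state, curr_state):
    """QoI of a switch reading: 0 = unchanged and off,
    1 = unchanged and on, 2 = changed."""
    if curr_state != prev_state:
        return 2
    return 1 if curr_state else 0

def passes(qoi, threshold, t_max, qoi_top):
    """Uniform mapping v(t): forward a reading iff its QoI, rescaled to
    the 0..t_max level range, reaches the current threshold t."""
    return qoi / qoi_top * t_max >= threshold

print(q3(False, True), q3(True, True), q3(False, False))  # 2 1 0
# A 3% meter change maps to level 6 of 10, so it passes threshold t = 5.
print(passes(q2(100.0, 103.0), threshold=5, t_max=10, qoi_top=0.05))  # True
```

With t3max = 2 the same `passes` helper covers the switch filter: a threshold of 2 then forwards only changed switches, matching the integer scores of q3 directly.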
