Tata Consultancy Services. Kolkata, India, 700091. debnath.mukherjee@tcs.com. ABSTRACT. Web-scale stream reasoning is based on continuous queries.
Windowing Mechanisms for Web Scale Stream Reasoning

Snehasis Banerjee
TCS Innovation Labs, Tata Consultancy Services, Kolkata, India, 700091.

Debnath Mukherjee
TCS Innovation Labs, Tata Consultancy Services, Kolkata, India, 700091.

[email protected]

ABSTRACT

Web-scale stream reasoning is based on continuous queries and reasoning over a snapshot of dynamic knowledge combined with background knowledge. Existing stream reasoners usually use either time-based or count-based window techniques, following data stream principles; however, these do not fit all scenarios in the stream reasoning area. In this paper, different types of windowing mechanisms are described, with exemplary scenarios in which each is most suitable for reasoning on a stream of facts. A new windowing technique, the Adaptive Window, is also proposed. Lastly, some important questions related to windowing techniques for web-scale stream reasoning are posed.

Categories and Subject Descriptors H.m [Information Systems]: Miscellaneous

General Terms Design, Management

Keywords stream reasoning; data streams; window specification

1. INTRODUCTION

Stream reasoning [2] is logical reasoning on large volumes of dynamic knowledge (like sensor readings) combined with background knowledge, which includes static knowledge (like geospatial knowledge of cities) and slowly changing knowledge (like profiles of citizens). The sensors involved can be hard sensors (such as GPS posts from mobile sensors and temperature readings from thermometers) as well as soft sensors (like RSS feeds or microblog posts). Many useful applications have been developed using stream reasoning, such as the public alert system presented in [1]. As the size of streaming knowledge is unbounded, the query results depend on the portion of the streaming knowledge that is available for processing.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s). Web-KR’13, November 1, 2013, San Francisco, CA, USA. ACM 978-1-4503-2424-3/13/11 http://dx.doi.org/10.1145/2512405.2512409 ...$15.00.


The two basic issues in processing streaming knowledge are: a) the knowledge usually has to reside in memory instead of physical storage, which in practice limits the amount of knowledge that can be processed at a time; b) old knowledge from sensors usually becomes irrelevant in practical applications, so selective deletion is needed. This leads to the concept of a ‘window’, which abstractly is the selection of a finite set of tuples from the unbounded streaming knowledge. Query evaluation is done on a specified window. The progression step of a window can be unity or hops, i.e. tuples can be deleted one at a time or in groups. If all the tuples in a window expire at the same time, the window is called tumbling; otherwise it is called sliding. The query logic is usually expressed in a SPARQL-like syntax (footnote 1); however, for ease of understanding, this paper uses plain-language examples that can easily be mapped to SPARQL queries. Section 2 describes the different types of windowing. Section 3 discusses some challenges related to windowing mechanisms in this area.
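The difference between unit, hopping and tumbling progression can be made concrete with a small sketch. The function below is an illustration only, not an implementation taken from any stream reasoner; the name `sliding_count_windows` and its `size` and `step` parameters are hypothetical, and a plain Python iterable stands in for the stream of tuples.

```python
from collections import deque
from typing import Any, Iterable, Iterator, List


def sliding_count_windows(
    stream: Iterable[Any], size: int, step: int = 1
) -> Iterator[List[Any]]:
    """Yield count based windows of `size` tuples, advancing by `step` tuples.

    step == 1        -> tuples expire one at a time (unit progression)
    1 < step < size  -> tuples expire in groups (hopping progression)
    step == size     -> all tuples expire together (tumbling window)
    """
    if not 1 <= step <= size:
        raise ValueError("need 1 <= step <= size")
    window = deque(maxlen=size)  # the oldest tuple is dropped automatically
    since_last_emit = 0
    for item in stream:
        window.append(item)
        since_last_emit += 1
        # Emit only once the window is full and has advanced by `step` tuples.
        if len(window) == size and since_last_emit >= step:
            yield list(window)
            since_last_emit = 0


readings = range(1, 7)  # six tuples: 1, 2, 3, 4, 5, 6
sliding = list(sliding_count_windows(readings, size=3, step=1))
hopping = list(sliding_count_windows(readings, size=3, step=2))
tumbling = list(sliding_count_windows(readings, size=3, step=3))
print(sliding)   # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print(hopping)   # [[1, 2, 3], [3, 4, 5]]
print(tumbling)  # [[1, 2, 3], [4, 5, 6]]
```

In the tumbling case no tuple appears in two windows, which matches the definition above: every tuple of a window expires at the same moment.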

2. TYPES OF WINDOW MECHANISMS

In this section, the different types of window are listed, each with scenarios in which it is most suitable. (The semantics of window types 1-3 is discussed in detail in [5].)

Footnote 1: http://www.w3.org/TR/rdf-sparql-query/

1) Partitioned window: Here the stream is partitioned into several sub-streams based on attributes, and each sub-stream has a separate window. This is useful where the window size should be maintained differently for each stream type, classified on some property. As an example, events like a road blockage due to a traffic accident have a long duration of validity, whereas context data about the GPS location of a citizen mapped to a street has very short validity. Here separate windows are maintained for long-lasting tuples and fast-changing ones, using a large and a small count window size respectively. A count based window is a special case of a partitioned window in which there is only one partition. A count based window maintains the last N tuples in memory. This is useful for queries whose logic spans only the last N tuples.

2) Time based window: This window is defined in time units instead of number of tuples. Tuples that are up to N time units old are kept, while the rest are deleted. An example is retrieving the list of places visited by a person in the last 1 hour, where the window should contain 1 hour of sensor data about the location of that person.

3) Landmark window: Landmarks on the temporal axis are used to define the upper and lower bounds of this window type. This can be:
a) Upper bounded: here the upper temporal bound is defined. An exemplary query is to get the place most visited by a person from now until the end of today. Note that the value of ‘now’ is not fixed; it changes at each instant of query execution.
b) Lower bounded: here the lower bound is defined. An exemplary query is to provide the place most visited by a person from the start of today until now.
c) Fixed band: here both upper and lower bounds are fixed. An exemplary query is to find the place most visited by a person between two time instants, such as a 24 hour day with fixed start and end times.

4) Application managed window: In stream reasoning scenarios, the aforementioned windowing mechanisms sometimes fail. The concept of applications controlling the window tuples in a custom fashion is described in [3] and [4]. As an example, a critical event E like ‘Fire at a building’ may last for 4 hours; however, if a time window of 3 hours is used, E will be deleted while it is still valid. A similar problem exists for a count window that evicts E due to the entry of other events. Hence, in such situations, the application reporting the event E (usually at the sensor level, having the best knowledge about the validity of the tuples it generates) can send a tuple stating that the fire is over, so that the event can be deleted from the window.

5) Adaptive window: In stream reasoning, the continuous queries are usually registered in the system. However, the application developer may dynamically change the logic of a query as needed. Also, end-users may sometimes be given a provision to run on-demand custom queries on the combined knowledge of the stream (which resides in a window) and the background knowledge. In such scenarios, due to a sudden logic change or a new query registration, the results of the system are erroneous until enough data is available to satisfy the query logic’s requirement. This shortcoming may prove fatal in critical use cases.
Examples to clarify the issue are as follows. Suppose a count window is specified with a size of 1000 tuples. If the query logic changes and demands a window size of 2000, then either query evaluation has to be stopped until the window is filled with 1000 fresh tuples, or execution continues with missing data until the window is populated. This is because the 1000 tuples preceding the current window are no longer available, having been deleted. However, if the query logic changes to demand a size of 500 tuples, simply reducing the window size to 500 (thereby deleting the oldest 500 tuples) serves the purpose. For an upper bounded landmark window there is no problem of losing data; however, for lower bounded and fixed band landmark windows, where the new lower bound may lie outside the range covered by the old query logic, a similar problem exists.

The issue applies to time based windows as well. An example of a registered time based window query where the problem arises is a change of logic from ‘give all traffic accidents occurring in the city in the last 1 hour’ to ‘give all traffic accidents occurring in the city in the last 2 hours’. In this case only the last 1 hour of data is actually kept in the window, so the result of query execution will be erroneous. To overcome this, either an extended window has to be maintained, or a separate store can be kept that holds streaming data up to a certain limit (perhaps a day in this example, as determined by a domain expert); when a logic change requires old data for query evaluation, that data is loaded from the store. If the store is an in-memory database instead of a network-separated persistent store, performance will be higher. As logic changes in registered queries are very occasional, system performance is not affected by this type of data loading. One may think that the domain expert could simply keep all the data of the time range (a day in the example) in the window; however, that directly contradicts the necessity of windowing discussed earlier.

For occasional changes of query, the adaptive windowing mechanism seems to work. For custom queries by end-users, however, a problem remains: end-users can run queries that need a large window size, and the current window size (serving the existing registered queries), as assessed by domain experts, may fail to serve the need. In such cases, the window sizes that end-users’ queries usually need have to be learned; based on this learning, an optimal window size and a limit on the data resident in the store can be determined that will support custom queries of a particular pattern. In future work, the adaptive window concept will be formalized and experimentally evaluated in real-life scenarios.
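The store-backed variant of the adaptive window can be sketched as follows. This is a minimal illustration under stated assumptions, not the formalization announced as future work: the class `AdaptiveTimeWindow`, its methods and the `window_span` and `store_span` parameters are hypothetical names, an in-process deque stands in for the in-memory store, timestamps are plain numbers in arbitrary time units, and tuples are assumed to arrive in non-decreasing time order.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

Fact = Tuple[float, Any]  # (timestamp, payload)


class AdaptiveTimeWindow:
    """Time based window whose span can change, backed by a bounded store."""

    def __init__(self, window_span: float, store_span: float) -> None:
        if not 0 < window_span <= store_span:
            raise ValueError("need 0 < window_span <= store_span")
        self.window_span = window_span      # span visible to queries
        self.store_span = store_span        # retention limit set by a domain expert
        self.window: Deque[Fact] = deque()  # facts inside the current window
        self.store: Deque[Fact] = deque()   # older facts kept for later reloads
        self.now = float("-inf")

    def _evict(self) -> None:
        # Facts that fall out of the window move to the store, oldest first.
        while self.window and self.window[0][0] <= self.now - self.window_span:
            self.store.append(self.window.popleft())
        # Facts older than the retention limit are deleted for good.
        while self.store and self.store[0][0] <= self.now - self.store_span:
            self.store.popleft()

    def insert(self, timestamp: float, payload: Any) -> None:
        self.now = max(self.now, timestamp)
        self.window.append((timestamp, payload))
        self._evict()

    def resize(self, new_span: float) -> bool:
        """Change the span; return False if the store cannot fully cover it."""
        if new_span <= 0:
            raise ValueError("new_span must be positive")
        covered = new_span <= self.store_span
        self.window_span = min(new_span, self.store_span)
        # Growing: reload stored facts that now fall inside the window.
        while self.store and self.store[-1][0] > self.now - self.window_span:
            self.window.appendleft(self.store.pop())
        # Shrinking: move facts that no longer fit back to the store.
        self._evict()
        return covered

    def contents(self) -> List[Any]:
        return [payload for _, payload in self.window]


# Accidents reported at minutes 0, 30, 70, 100 and 130; window of 60 minutes,
# store retention of one day (1440 minutes).
w = AdaptiveTimeWindow(window_span=60, store_span=1440)
for t in (0, 30, 70, 100, 130):
    w.insert(t, f"accident@{t}")
print(w.contents())    # ['accident@100', 'accident@130']
print(w.resize(120))   # True
print(w.contents())    # ['accident@30', 'accident@70', 'accident@100', 'accident@130']
```

When the registered query changes from the last 60 minutes to the last 120 minutes, the older accidents are reloaded from the store instead of being lost, so the first evaluation after the change is already complete. A request beyond `store_span` returns `False`, which is the case where learning suitable window and store sizes from end-user queries would be needed.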

3. SOME CHALLENGING QUESTIONS

Some windowing problems for web scale stream reasoning that need active attention are listed here:

1) Will a hybrid windowing mechanism, combining the aforementioned windowing mechanisms in a suitable way, be the solution for stream reasoning? Or does an entirely new window mechanism need to be devised?

2) Based on the streaming knowledge flowing into the system and the registered logic, can the type of window to be used, along with its optimal specification, be determined automatically? What strategies (such as machine learning) should be adopted for dynamic adaptive windows?

3) If the query expression is fuzzy, like ‘what crime events happened around an hour ago?’, how can such queries be mapped to window specification constructs? How will such fuzzy window operators be handled?

We shall try to actively address the questions raised above. Further, we would like to carry out a comparative study of the different windowing mechanisms on standard datasets.

4. REFERENCES

[1] Banerjee, S., Mukherjee, D., and Misra, P. ‘what affects me?’: a smart public alert system based on stream reasoning. In Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication (2013), ICUIMC ’13, ACM, pp. 22:1–22:10.

[2] Della Valle, E., Ceri, S., van Harmelen, F., and Fensel, D. It’s a streaming world! reasoning upon rapidly changing information. Intelligent Systems, IEEE 24, 6 (2009), 83–89.

[3] Mukherjee, D., Banerjee, S., and Misra, P. Ad-hoc ride sharing application using continuous sparql queries. In Proceedings of the 21st international conference companion on World Wide Web (2012), WWW ’12 Companion, ACM, pp. 579–580.

[4] Mukherjee, D., Banerjee, S., and Misra, P. Towards efficient stream reasoning. In Proceedings of OTM Workshops (2013).

[5] Patroumpas, K., and Sellis, T. Window specification over data streams. In Current Trends in Database Technology - EDBT 2006 (2006), vol. 4254 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 445–464.
