Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based Activity Models

Massimiliano Albanese, Andrea Pugliese, and V.S. Subrahmanian

Abstract—Today, numerous applications require the ability to monitor a continuous stream of fine-grained data for the occurrence of certain high-level activities. A number of computerized systems, including ATM networks, web servers, and intrusion detection systems, systematically track every atomic action we perform, thus generating massive streams of timestamped observation data, possibly from multiple concurrent activities. In this paper, we address the problem of efficiently detecting occurrences of high-level activities in such interleaved data streams. A solution to this important problem would greatly benefit a broad range of applications, including fraud detection, video surveillance, and cyber security. There has been extensive work in the last few years on modeling activities using probabilistic models. In this paper, we propose a temporal probabilistic graph model in which the time elapsed between observations also plays a role in defining whether a sequence of observations constitutes an activity. We first propose a data structure called the "temporal multiactivity graph" to store multiple activities that need to be concurrently monitored. We then define an index called Temporal Multiactivity Graph Index Creation (tMAGIC) that, based on this data structure, examines and links observations as they occur. We define algorithms for insertion and bulk insertion into the tMAGIC index and show that this can be accomplished efficiently. We also define algorithms to solve two problems: the "evidence" problem, which tries to find all occurrences of an activity (with probability over a threshold) within a given sequence of observations, and the "identification" problem, which tries to find the activity that best matches a sequence of observations. We introduce complexity-reducing restrictions and pruning strategies to make the problem, which is intrinsically exponential, linear in the number of observations. Our experiments confirm that tMAGIC has time and space complexity linear in the size of the input, and can efficiently retrieve instances of the monitored activities.

Index Terms—Activity detection, indexing, stochastic automata, timestamped data
1 INTRODUCTION

There are numerous applications where we need to monitor whether certain (normal or abnormal) activities are occurring within a stream of transaction data. For example, an online store might want to monitor the activities occurring during a remote login session on its Web site in order to either better help the user or to identify users engaged in suspicious activities. A company providing security in an airport might want to monitor activities in a baggage claim area or in a secure part of the tarmac in order to identify suspicious activities. A bank might want to monitor activities at its automatic teller machines for similar reasons. It is well recognized that models of activities are likely to be uncertain. We can rarely predict exactly how a particular activity may be executed, especially as a large number of irrelevant activities might be intermixed together. As a consequence, though early models of activities were
"certain" about what constituted an activity and used logical methods or context-free grammars [15], more recent activity detection is based on either graphical models [7], [11] or stochastic automata [2] in which vertices correspond to observable atomic events. However, most existing work on stochastic activity recognition has two main limitations. First, it often does not account for the time between the observations associated with an activity. For instance, Fig. 1 shows an example of an online bill payment activity. Each vertex corresponds to an observation (made by the system) of the activity. However, in almost all online systems, there are temporal constraints on how much time can elapse between one observation and another for the two to jointly constitute part of a single occurrence of an activity. For instance, if we observe that a user checks his balance and then 6 hours (or 6 days, or 6 months!) elapse between that observation and goBillpay, should these two observations be counted as part of the same activity? The answer is that "it depends" on the application. Our first contribution in this paper, although not the most important one, is a temporal stochastic automaton framework for expressing activities with probabilities and temporal (application-dependent) constraints (Section 2). Subsequently, in Section 3, we formally define the "evidence" and "identification" problems. In the formalization of these problems, we model a sequence of observations that are observed in real time, as well as a previously collected database of observations, as a (continuously updated) observation table.
Fig. 1. Example of temporal stochastic activity.
The evidence problem looks at an observation table and tries to find all minimal sets of tuples that jointly support the assertion that a given activity occurs with a probability exceeding a given threshold. Thus, for instance, our application may have a set A of activities it wants to monitor in a sequence of observations. A given activity in A might occur zero, one, or many times in the associated observation table. We would like to find all sets of tuples in the table that are believed to be occurrences of a given activity with a probability exceeding some user-specified threshold. The identification problem tries to find the activity in A that has the highest probability of being present in the table.

Section 4 introduces an index structure called Temporal Multiactivity Graph Index Creation (tMAGIC) to monitor multiple activities concurrently. tMAGIC starts by merging multiple temporal stochastic automata (each representing an activity) into a special kind of graph. As observations arrive, the tMAGIC index can be updated either through individual insertions or through bulk insertions. Finally, appropriate algorithms are developed on top of the tMAGIC index structure to solve the evidence and identification problems. The insertion, evidence, and identification problems are also studied under various "restrictions" that can be applied to reduce computation time. We also show that the proposed Time Frame, Best Path, and Compaction pruning strategies do not lead to loss of solutions.

Section 5 provides a detailed set of experiments showing that our tMAGIC index structure and associated algorithms work efficiently. We tested our framework on both synthetic and real data. On modest hardware, we can process approximately 28,500 observations per second for insertion into the tMAGIC index, while the evidence and identification problems can also be solved very efficiently, in under a second even when a million activity occurrences are present in the index.
1.1 Related Work
Limitations of traditional database management systems in supporting streaming applications and event processing have prompted extensive research in Data Stream Management Systems (DSMSs). An early yet comprehensive survey of relevant issues in data stream management was presented in [8]. Amongst the several systems resulting from research efforts in this direction, of particular relevance is TelegraphCQ [6], a streaming query processor that filters, categorizes, and aggregates flow records according to one or more CQL [4] continuous queries, generating periodic reports. Unlike the results of traditional queries on static data collections, the results of continuous queries on streaming data need to be periodically and
incrementally updated as new data is received. A significant portion of research in this area has been devoted to the optimization of continuous queries [13]. Other works target the recognition of events based on streams of possibly uncertain data [16]. Although the system we propose in this paper operates on streams of observation data, the scope of our work is drastically different from the scope of DSMSs. In fact, we are not interested in retrieving a set of data items satisfying (exactly) certain conditions and keeping this set up to date as new data items are received. Instead, we are interested in finding sets of records such that, with a probability above a given threshold, the records in each set together constitute the "evidence" that a given activity occurred in a specific time interval. Additionally, we want to track partially completed activity occurrences. To the best of our knowledge, DSMSs do not provide support for this type of probabilistic inference.

Moreover, there has been limited work on efficient indexing to support probabilistic activity recognition. The aim of past work on indexing of activities was merely to retrieve previously recognized activities, not to recognize new ones. Such work includes that of Ben-Arie et al. [5], who use multidimensional index structures to store body pose vectors in video frames. Kerkez [10] develops indexes for case-based plan recognition, where knowledge about planning situations enables the recognizer to focus on a subset of the plan library containing relevant past plans. A two-level indexing scheme, along with incremental construction of the plan libraries, is proposed to reduce the retrieval effort of the recognizer. In short, past work does not address the issue of indexing observations to find activity instances; more importantly, these indexing approaches do not account for uncertainty in what defines an activity, which is key to any HMM or stochastic automaton-based definition of activities.

With respect to the problem of modeling activities, Hidden Markov Models (HMMs) and their variants have been used extensively. For instance, Duong et al. [7] introduce the Switching Hidden Semi-Markov Model, a two-layered extension of the Hidden Semi-Markov Model (HSMM). The bottom layer represents atomic events and their duration using HSMMs, while the top layer represents high-level activities in terms of atomic events. A survey of temporal concepts and data models used in unsupervised pattern mining from symbolic temporal data is presented in [12]. Automatic learning of transition probabilities in activity models is discussed in [14]. Finally, dynamic Bayesian networks [9] and Petri nets can also be used for tracking multiagent activities. A probabilistic extension of Petri nets for activity detection is proposed in [1]. Context-free grammars have also been used to define activities [15].

In conclusion, our work differs from previous efforts by providing a mechanism to index and update observations (whether they arrive in real time or have been previously stored) in a data structure that also unifies a set of known activities. The index we propose in this paper, which extends [3], stores both activities and observations and enables us to efficiently answer the Evidence and Identification problems, where time and uncertainty together play a role in the definition of an activity.
Fig. 2. Example of timespan distribution.
2 TEMPORAL STOCHASTIC ACTIVITY MODEL
We extend the stochastic automaton activity model for video described in [2] to the case of temporal data. We assume an arbitrary but fixed time granularity which always allows us to differentiate between the exact times of two different observations, so we can assume their timestamps to be different. We use T to denote the set of all time points.

Definition 2.1 (Timespan Distribution). A timespan distribution ω is a pair (I, γ) where:
- I is a set of time intervals¹ such that (i) ∀[x, y] ∈ I, x ≤ y, and (ii) ∀[x, y], [x′, y′] ∈ I with [x, y] ≠ [x′, y′], the time intervals [x, y] and [x′, y′] are disjoint;
- γ : I → [0, 1] is a function that associates a value γ(x, y) ∈ [0, 1] with each time interval [x, y] ∈ I.

We use Ω to denote the set of all possible timespan distributions. Given a timespan distribution ω = (I, γ), we use S(ω) to denote Σ[x,y]∈I γ(x, y); we require that S(ω) ≤ 1.
Intuitively, a timespan distribution (I, γ) specifies a set I of disjoint time intervals during which an observation might occur, and an incomplete conditional probability distribution γ: γ(x, y) is the probability that the observation will occur during the time interval [x, y], given the previous observation. γ may not be complete, as an observation may not occur at all after another given observation. The following example shows a timespan distribution.

Example 2.1. Consider the graph of Fig. 2, and assume that a traveler just landed at JFK, and that the only means of ground transportation are buses and cabs. The edges from landAtJFK to the other two nodes are labeled with timespan distributions ω1 and ω2, respectively. Assuming a time granularity of hours, the timespan distributions can be interpreted as follows:
- there is a 0.3 probability that the traveler will take a cab within an hour from landing;
- there is a 0.1 probability that the traveler will take a cab in 1-2 hours from landing, but will not take a cab after 2 hours;
- there is a 0.2 probability that the traveler will take a bus within an hour from landing;
- the traveler will take a bus during the second or third hour from landing with a 0.3 and 0.1 probability, respectively.
1. A time interval is a closed interval of the set T of time points, which in turn can be assumed to be nonnegative natural numbers.
The total probability of taking a cab is S(ω1) = 0.4, whereas the total probability of taking a bus is S(ω2) = 0.6. Note that S(ω1) is strictly smaller than 1, as the traveler might decide not to take a cab after landing at JFK. A similar reasoning applies to S(ω2). However, S(ω1) + S(ω2) = 1, meaning that the traveler will leave the airport by either bus or cab. Finally, given any time interval, the sum of probabilities over all outgoing edges is not required to add up to 1. For instance, the traveler may not have left the airport yet after an hour.
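As an illustration of Definition 2.1, the following Python sketch shows one possible in-memory representation of a timespan distribution and of S(ω); the class name and the half-open treatment of interval boundaries are assumptions of this sketch (made to keep the intervals of Example 2.1 disjoint), not part of the formal model.

    # Minimal sketch of a timespan distribution (Definition 2.1).
    class TimespanDistribution:
        def __init__(self, intervals):
            # intervals: dict mapping (lo, hi) -> probability; intervals are treated
            # as half-open [lo, hi) here so that they remain pairwise disjoint.
            self.intervals = dict(intervals)
            assert all(lo <= hi for (lo, hi) in self.intervals)
            assert sum(self.intervals.values()) <= 1.0 + 1e-9   # S(omega) <= 1

        def total(self):
            """S(omega): the probability that the successor is observed at all."""
            return sum(self.intervals.values())

        def prob(self, elapsed):
            """gamma(x, y) for the interval containing the elapsed time, 0 if none."""
            for (lo, hi), p in self.intervals.items():
                if lo <= elapsed < hi:
                    return p
            return 0.0

    # Example 2.1, with illustrative interval endpoints (time granularity: hours):
    omega1 = TimespanDistribution({(0, 1): 0.3, (1, 2): 0.1})                # cab
    omega2 = TimespanDistribution({(0, 1): 0.2, (1, 2): 0.3, (2, 3): 0.1})   # bus
    assert abs(omega1.total() + omega2.total() - 1.0) < 1e-9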
We now formalize the notion of Temporal Stochastic Activity (or activity).

Definition 2.2 (Temporal Stochastic Activity). A temporal stochastic activity (or just activity) is a labeled graph (V, E, δ) where
- V is a finite set of observations;
- E is a subset of V × V;
- ∃v ∈ V s.t. ∄v′ ∈ V with (v′, v) ∈ E, i.e., there exists at least one start node in V;
- ∃v ∈ V s.t. ∄v′ ∈ V with (v, v′) ∈ E, i.e., there exists at least one end node in V;
- ∄v ∈ V s.t. (v, v) ∈ E, i.e., no self-loops are allowed;
- δ : E → Ω is a function that associates a timespan distribution with each edge in the graph, such that ∀v ∈ V, Σ{v′∈V | (v,v′)∈E} S(δ(v, v′)) = 1.
If A = (V, E, δ) is an activity and v ∈ V is a node, A.pmax(v) denotes the maximum product of probabilities on any path in A from v to an end node.

Example 2.2. Fig. 1 shows an example temporal stochastic activity modeling a bill payment process in an online banking system. A user will first access her accounts page (goAccounts) and either check her balance (checkBalance) or continue directly to the bill payment page (goBillpay). Assuming a time granularity of minutes, the edges between goAccounts and its successors are interpreted as in Example 2.1, e.g., there is a 0.5 probability that the checkBalance observation will occur in less than 1 minute, and a 0.2 probability that it will occur in 1-3 minutes. The rest of the activity requires users to select an account (selectSource), choose a payee (selectPayee), schedule the amount and date of payment (selectSchedule), and finally confirm the transfer (confirmTransfer). At each stage of the process, a user can cancel the sequence and return to the bill payment page.

Definition 2.3 (Temporal Activity Instance). Given a temporal stochastic activity A = (V, E, δ), an instance of A is a sequence (v1, ..., vn), with vi ∈ V, such that
- v1 is a start node and vn is an end node in A;
- ∀i ∈ [1, n−1], (vi, vi+1) ∈ E.
Intuitively, an activity instance is a path from a start node to an end node in A. Note that past work [2] does not require any temporal constraints; so, as mentioned in the Introduction, it is possible for selectPayee in the above example to occur 10 years after selectSource and still be considered part of the same activity. Moreover, it may turn out that in activity A1 two observations occur quickly one after another, while in activity A2 they both occur, but with a longer time delay
between them. Our framework allows both cases to be handled and recognized.
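To make the structure of Definitions 2.2 and 2.3 concrete, the sketch below represents an activity as a dictionary of edges labeled with timespan distributions and computes A.pmax(v) by exploring simple paths to an end node. The representation, the helper names, the toy edge values, and the assumption that the per-edge probability used by pmax is the largest value of its timespan distribution are ours, for illustration only.

    # Sketch: an activity as edges -> {interval: probability}; pmax(v) is the maximum
    # product of per-edge probabilities over any path from v to an end node.
    # Assumption (ours): the per-edge probability is the best achievable value of its
    # timespan distribution, i.e., the maximum of gamma over its intervals.
    def end_nodes(edges):
        sources = {u for (u, _) in edges}
        targets = {w for (_, w) in edges}
        return (targets | sources) - sources        # nodes with no outgoing edge

    def pmax(v, edges, _on_path=None):
        _on_path = _on_path or frozenset()
        if v in end_nodes(edges):
            return 1.0
        best = 0.0
        for (u, w), omega in edges.items():
            if u != v or w in _on_path:             # only outgoing edges, avoid cycles
                continue
            step = max(omega.values())              # best-case probability of this edge
            best = max(best, step * pmax(w, edges, _on_path | {v}))
        return best

    # Toy activity with paths a -> b -> d and a -> c -> d (illustrative values):
    edges = {("a", "b"): {(0, 1): 0.3, (1, 2): 0.1},
             ("a", "c"): {(0, 1): 0.8},
             ("b", "d"): {(0, 2): 0.9},
             ("c", "d"): {(0, 2): 0.2}}
    assert abs(pmax("a", edges) - 0.3 * 0.9) < 1e-9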
3 EVIDENCE AND IDENTIFICATION PROBLEMS
This section formalizes the Evidence and Identification problems. Without loss of generality, we assume that observations are stored in a single relational observation table, denoted D. Each tuple t ∈ D corresponds to a single observation, denoted t.obs, which is observed at a given time, denoted t.ts. When our framework is used for real-time activity detection, our proposed insertion algorithm (which will be described in Section 4.1) processes each observation as it is received, updates the index, and stores a tuple in the observation table. Conversely, when the framework is used to detect activities in a previously acquired body of data, our bulk insertion algorithm can pull all the observation tuples from the table and build the whole index. Additionally, in some applications, each observation may be associated with context information (e.g., IP address, full name, spatial location), which might help discriminate between observations belonging to different activity occurrences. However, we do not assume this information to be available in general. For instance, in an intrusion detection system, multiple attackers engaged in different activities may need to perform some common steps, and they may appear to come from the same origin if they use proxies to conceal their real identities. We use t.context to denote context information for observation tuple t, and propose a restriction where two tuples are considered to be part of the same activity occurrence only if their context information is "equivalent." Note that t.context can generally be used to represent the result of the evaluation of a given predicate on t.

Definition 3.1 (Probability of a Set of Tuples Belonging to an Activity). Suppose D is an observation table, A = (V, E, δ) is a temporal stochastic activity, and {t1, ..., tn} is a set of tuples in D such that t1.ts ≤ t2.ts ≤ ... ≤ tn.ts. Additionally, suppose that (t1.obs, ..., tn.obs) is an instance of A. The probability prob({t1, ..., tn}, A) that the set {t1, ..., tn} belongs to activity A is Π i∈[1,n−1] γi(xi, yi), where (Ii, γi) = δ(ti.obs, ti+1.obs) is the timespan distribution associated with the edge (ti.obs, ti+1.obs) ∈ E and [xi, yi] ∈ Ii is a time interval such that xi ≤ ti+1.ts − ti.ts ≤ yi.

The above definition says that in order to find the probability that t1, ..., tn represents an instance of activity A, we should find the path in A, along with the time intervals from the associated timespan distributions corresponding to the times when the observations associated with the ti were made, and multiply the corresponding probabilities. As in [2], we make the Markovian assumption, which justifies the use of multiplication.

Example 3.1. Consider the tuples with ids 2, 3, 4, 7, 10, 14, 15 in the example log of Fig. 3. The corresponding sequence of ti.obs's is an instance of the activity of Fig. 1. Based on the timespan distributions and the time elapsed between consecutive observations, the probability that this set of tuples belongs to the activity is 0.5 · 0.9 · 0.95 · 0.95 · 0.97 · 0.98 = 0.386.
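A direct reading of Definition 3.1 is sketched below: given a candidate sequence of (observation, timestamp) pairs and the timespan distributions on the corresponding edges, the probability is the product of the per-interval probabilities selected by the elapsed times. The timestamps and the specific distribution values are illustrative assumptions, not the actual contents of the log in Fig. 3.

    # Sketch of Definition 3.1: prob({t1, ..., tn}, A) as a product of gamma_i(x_i, y_i).
    def occurrence_probability(tuples, edge_dists):
        """tuples: list of (obs, ts) ordered by ts;
           edge_dists: dict (obs_i, obs_j) -> {(lo, hi): probability}."""
        prob = 1.0
        for (o1, ts1), (o2, ts2) in zip(tuples, tuples[1:]):
            omega = edge_dists.get((o1, o2))
            if omega is None:
                return 0.0                       # the sequence is not an instance of A
            elapsed = ts2 - ts1
            p = next((q for (lo, hi), q in omega.items() if lo <= elapsed <= hi), 0.0)
            prob *= p
        return prob

    # Illustrative values only (not the actual log of Fig. 3):
    dists = {("goAccounts", "checkBalance"): {(0, 1): 0.5, (1, 3): 0.2},
             ("checkBalance", "goBillpay"):  {(0, 2): 0.9}}
    seq = [("goAccounts", 0), ("checkBalance", 2), ("goBillpay", 3)]
    print(occurrence_probability(seq, dists))    # 0.2 * 0.9 = 0.18 (approximately)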
Fig. 3. Example log (time is in minutes from the origin 0).
Note that other sets of tuples also belong to the activity (such as 1, 5, 8, 16, 17, 18, 19), but with lower probabilities.

Definition 3.2 (Temporal Activity Occurrence). Given an observation table D and a temporal stochastic activity A = (V, E, δ), an occurrence of A in D with probability p is a set {t1, ..., tn} ⊆ D such that
- t1.ts ≤ t2.ts ≤ ... ≤ tn.ts;
- (t1.obs, ..., tn.obs) is an instance of A;
- prob({t1, ..., tn}, A) ≥ p.
Moreover, we define the span of the occurrence above as the time interval span({t1, ..., tn}) = [t1.ts, tn.ts].

Example 3.2. Consider the example log shown in Fig. 3, and assume we are looking for occurrences of the activity of Fig. 1 with probability greater than or equal to 0.3. As shown in Example 3.1, the set of tuples with ids 2, 3, 4, 7, 10, 14, 15 belongs to the activity with probability 0.386 ≥ 0.3. Therefore, it is a valid occurrence, and its span is the interval [2, 10].

Note that multiple concurrent activities may generate interleaved observations in the observation table. The following proposition characterizes the number of possible occurrences of an activity in terms of its level of concurrency, i.e., the maximum possible number of interleaved activity occurrences.

Proposition 3.1. Let A = (V, E, δ) be a temporal stochastic activity and D an observation table. The number of occurrences of A in D is Θ(k^|V|), where k is the level of concurrency of A.

Hence, it is not feasible to find all activity occurrences. Each observation may be considered as being connected to many occurrences. However, in the real world, each tuple could have been generated by only one activity. Finding all the identifiable activity occurrences is therefore both infeasible and undesirable, as it would lead to a number of identified occurrences much greater than the actual number of occurrences in the observation table. We therefore define reasonable restrictions on what constitutes a valid occurrence, in order to reduce the number of possible occurrences. We propose three restrictions (minimal span, maximal probability, and earliest action) that are applicable in most real-world scenarios. We do not claim these restrictions to be exhaustive: many others could easily be defined, depending on the application's needs, and added to our framework. Moreover, we will show that the most significant complexity reduction in our framework is achieved by introducing a pruning strategy that leverages
the temporal constraints on the unfolding of activities without altering the space of solutions. The minimal span (MS) restriction requires that if two occurrences D1 and D2 of a given activity are found in the observation table and the span of D2 is contained within the span of D1 , we disregard D1 , unless D1 has higher probability than D2 .
Fig. 4. Observation table from bank surveillance.
Definition 3.3 (Minimal Span Restriction). An activity occurrence D1 ⊆ D is said to satisfy the minimal span restriction if and only if there is no other occurrence D2 ⊆ D of the same activity such that span(D2) ⊂ span(D1) and prob(D1) ≤ prob(D2).

Example 3.3. Consider a video surveillance application where we are interested in detecting occurrences of suspicious behaviors and submitting suspicious video segments to a human expert for review. Now, suppose that A is an activity of interest, D is an observation table storing atomic events observed in a video stream (e.g., through suitable image processing libraries), and D1, D2 are two valid occurrences of A such that span(D2) ⊂ span(D1) and prob(D1) = prob(D2). As suspicious segments need to be watched by humans, it is reasonable to include only D2 in the results. If D2 is actually deemed relevant, a superset of D2 could be provided to the user at a later stage.

The MS restriction, by itself, may still allow exponentially many occurrences of an activity (multiple occurrences may have the same span in the worst case scenario of Proposition 3.1). The maximal probability restriction (MP) and its strict version (MPS) select only a few such occurrences.

Definition 3.4 (Maximal Probability Restriction). An activity occurrence D1 ⊆ D is said to satisfy the maximal probability (resp., strict maximal probability) restriction if and only if there is no other occurrence D2 ⊆ D of the same activity (resp., of any activity) such that span(D1) = span(D2) and prob(D1) < prob(D2).

Example 3.4. Consider the video surveillance application of Example 3.3, and suppose there are two activities of interest A1 and A2 modeling attempted and successful bank robberies, respectively. Suppose D is an observation table and D1, D2 are occurrences in D of A1 and A2, respectively, such that span(D1) = span(D2), prob(D1) = 0.6, and prob(D2) = 0.8. Under MPS, only D2 is included in the results.

MP provides a big reduction in the search space. The worst case of Proposition 3.1 still applies, but we expect that for any possible span (the number of possible spans being O(|D|²)) only a very small number of occurrences will be considered.

The earliest action restriction (EA for short) requires that when looking for the next observation in an activity occurrence, we always choose the first possible successor in the sequence. As we will make clearer in the following, this restriction provides a huge reduction in the search space, by making the number of occurrences independent of the size of the activities.

Definition 3.5 (Earliest Action Restriction). An activity occurrence {t1, ..., tn} ⊆ D is said to satisfy the earliest action restriction if and only if ∀i ∈ [2, n], ∄wi ∈ D such that wi.ts < ti.ts and {t1, ..., ti−1, wi, ti+1, ..., tn} is an occurrence of the same activity.

Example 3.5. Consider the video surveillance application of Example 3.3 and the observation table of Fig. 4. Suppose we are interested in detecting bank robbery attempts in real time, and suppose activity A1 contains an edge approachTeller → showHandgun. The showHandgun observation occurs multiple times in the observation table, meaning that the gun appeared and disappeared within the field of view of the camera. EA tries to link approachTeller to the first occurrence of showHandgun, since we want to detect criminal activities as early as possible.

Finally, the context restriction (CTX for short) takes advantage of context information.

Definition 3.6 (Context Restriction). An activity occurrence D′ ⊆ D is said to satisfy the context restriction if and only if for any ti, tj ∈ D′, ti.context ≃ tj.context, where ≃ is an equivalence relation defined over the domain of t.context.

We now formally define the two types of problems we are interested in solving.

Definition 3.7 (Evidence Problem). Given a temporal observation table D, a set of activities A, a time interval [tss, tse], and a probability threshold pt, compute all the occurrences D′ ⊆ D of activities in A such that D′ occurs within the interval [tss, tse] and prob(D′) ≥ pt.

Definition 3.8 (Identification Problem). Given a temporal observation table D, a set of activities A, and a time interval [tss, tse], find the activity which occurs in D in the interval [tss, tse] with maximal probability among the activities in A.

A solution to the identification problem could be biased, because short activity occurrences generally tend to have higher probabilities. To remedy this, we normalize occurrence probabilities as defined in [2], by introducing the relative probability p* of an occurrence D′ of activity A as

p*(D′) = (prob(D′) − pmin) / (pmax − pmin),

where pmin and pmax are the lowest and highest possible probabilities of any occurrence of A.
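The sketch below shows how the MS and MP restrictions defined above can be applied to a set of candidate occurrences; each occurrence is summarized here by its span and probability, and the strict-containment test mirrors Definition 3.3. The data layout and the example values are our own simplification for illustration.

    # Sketch: filtering candidate occurrences of one activity under MS and MP
    # (Definitions 3.3 and 3.4).
    def strictly_contains(outer, inner):
        (a, b), (c, d) = outer, inner
        return a <= c and d <= b and (a, b) != (c, d)

    def satisfies_ms(occ, candidates):
        # Invalid if some other occurrence has a strictly smaller span
        # and a probability at least as high.
        return not any(o is not occ and strictly_contains(occ["span"], o["span"])
                       and occ["prob"] <= o["prob"] for o in candidates)

    def satisfies_mp(occ, candidates):
        # Invalid if some other occurrence has the same span and a higher probability.
        return not any(o is not occ and o["span"] == occ["span"]
                       and occ["prob"] < o["prob"] for o in candidates)

    candidates = [{"span": (2, 10), "prob": 0.386},
                  {"span": (1, 19), "prob": 0.12},
                  {"span": (2, 10), "prob": 0.21}]
    kept = [o for o in candidates
            if satisfies_ms(o, candidates) and satisfies_mp(o, candidates)]
    print(kept)   # only the (2, 10) occurrence with probability 0.386 survives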
4 TEMPORAL MULTIACTIVITY GRAPH INDEX
In order to monitor an observation table for occurrences of multiple activities, we first merge all temporal activity definitions from A = {A1, ..., Ak} into a single graph. We use id(A) to denote a unique identifier for activity A and IA to denote the set {id(A1), ..., id(Ak)}.
Fig. 5. Temporal stochastic activities (top) and corresponding multiactivity graph (bottom).
Definition 4.1 (Temporal Multiactivity Graph). Let A = {A1, ..., Ak} be a set of temporal stochastic activities, where Ai = (Vi, Ei, δi). The Temporal Multiactivity Graph for A is a triple G = (VG, IA, δG) where
- VG = ∪i=1,...,k Vi is a set of observations;
- δG : VG × VG × IA → Ω is a function that maps each triple (v, v′, id(Ai)) to the timespan distribution δi(v, v′) if (v, v′) ∈ Ei, and to ∅ otherwise.
A temporal multiactivity graph merges a number of stochastic activities. It can be graphically represented by labeling nodes with observations and edges with the ids of the activities containing them, along with the corresponding timespan distributions. The temporal multiactivity graph can be computed in time polynomial in the size of A. Furthermore, it has to be computed only once, before building the index. Fig. 5 shows two temporal stochastic activities and the corresponding multiactivity graph.

Definition 4.2 (Temporal Multiactivity Graph Index). Let A = {A1, ..., Ak} be a set of stochastic activities, where Ai = (Vi, Ei, δi), and let G = (VG, IA, δG) be the temporal multiactivity graph built over A. A Temporal Multiactivity Graph Index is a 6-tuple IG = (G, startG, endG, maxG, tablesG, completedG), where
- startG : VG → 2^IA is a function that associates with each node v ∈ VG the set of activity ids for which v is a start node;
- endG : VG → 2^IA is a function that associates with each node v ∈ VG the set of activity ids for which v is an end node;
- maxG : VG × IA → [0, 1] is a function that associates with each pair (v, id(Ai)) the probability Ai.pmax(v) if v ∈ Vi, and 0 otherwise;
- for each v ∈ VG, tablesG(v) is a set of records of the form (current, activityID, t0, prob, previous, next), where current is a reference to an observation tuple, activityID ∈ IA is an activity id, t0 ∈ T is a timestamp, prob ∈ [0, 1], previous is a reference to a record in tablesG, and next is a set of references to records in tablesG;
- completedG : IA → 2^P, where P is the set of references to records in tablesG, is a function that associates with each activity identifier id(A) a set of references to records in tablesG which correspond to completed instances of activity A.
Note that G, startG, endG, and maxG can be computed a priori, based on the set A of activities of interest. All tables that are part of the index (tablesG) are initially empty; as new tuples are added, the index tables are updated as described in Section 4.1. The tMAGIC index tracks which nodes are start/end nodes for the original activities. For each node, it also stores 1) the maximum probability of reaching an end node for each activity in A, and 2) a table that tracks partially completed activity occurrences, where each record points to a tuple whose observation is part of the corresponding activity instance, as well as to the previous and successor records. In addition, each record stores the probability of the activity occurrence so far, and the time at which the partial occurrence began (t0).
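A possible in-memory layout for the index of Definition 4.2 is sketched below; the class and field names are ours, and references are modeled simply as Python object references. It is meant only to make the record structure and the roles of tablesG and completedG concrete, not to prescribe the authors' implementation.

    # Sketch of the tMAGIC index structures (Definition 4.2).
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass(eq=False)
    class Record:
        current: object             # reference to an observation tuple (obs, ts, context)
        activity_id: str            # id of the activity this partial occurrence belongs to
        t0: int                     # timestamp at which the partial occurrence began
        prob: float                 # probability of the partial occurrence so far
        previous: Optional["Record"] = None
        next: set = field(default_factory=set)   # references to successor records

    class TMagicIndex:
        def __init__(self, start, end, max_prob, delta):
            self.start = start        # startG: obs -> set of activity ids
            self.end = end            # endG:   obs -> set of activity ids
            self.max_prob = max_prob  # maxG: (obs, activity id) -> A.pmax(obs)
            self.delta = delta        # deltaG: (v, v', activity id) -> timespan distribution
            self.tables = {}          # tablesG: obs -> list of Records, in arrival order
            self.completed = {}       # completedG: activity id -> list of Records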
4.1 tMAGIC Insertion Algorithm
This section describes an algorithm to insert tuples into the tMAGIC index (Algorithm 1). The algorithm takes as input a temporal multiactivity graph index IG, a new tuple tnew to be added to the index, a probability threshold pt, and seven Boolean flags (fMS, fEA, fMP, fMPS, fBP, fTF, fCTX) indicating which restrictions and/or pruning strategies must be applied (we will explain the BP and TF pruning strategies shortly).

Algorithm 1. insert(tnew, IG, pt, fMS, fEA, fMP, fMPS, fBP, fTF, fCTX)
Input: New tuple to be inserted tnew, temporal multiactivity graph index IG, probability threshold pt, Boolean flags fMS, fEA, fMP, fMPS, fBP, fTF, fCTX indicating which restrictions/pruning strategies must be applied.
Output: Updated temporal multiactivity graph index IG.
1: //Look at start nodes
2: if startG(tnew.obs) ≠ ∅ then
3:   updatedActivities ← ∅
4:   if fMS then
5:     for all records r ∈ tablesG(tnew.obs) s.t. r.activityID ∈ startG(tnew.obs) ∧ r.next = ∅ do
6:       r.current ← ↑tnew
7:       updatedActivities ← updatedActivities ∪ {r.activityID}
8:     end for
9:   end if
10:  for all id ∈ startG(tnew.obs) \ updatedActivities do
11:    add (↑tnew, id, tnew.ts, 1, ⊥, ∅) to tablesG(tnew.obs)
12:  end for
13: end if
14: //Look at intermediate nodes
15: for all nodes v ∈ VG s.t. ∃id ∈ IA, δG(v, tnew.obs, id) ≠ ∅ do
16:  if fTF then
17:    rfirst ← min{r ∈ tablesG(v) | tnew.ts − r.current.ts ≤ max{id∈IA | δG(v,tnew.obs,id)≠∅} maxtG(v, tnew.obs, id)}
18:  else
19:    rfirst ← tablesG(v).first
20:  end if
21:  for all records r ∈ tablesG(v) s.t. r ⪰ rfirst ∧ (¬fEA ∨ (fEA ∧ r.next = ∅)) ∧ (¬fCTX ∨ (fCTX ∧ tnew.context ≃ r.current.context)) do
22:    id ← r.activityID
23:    if δG(v, tnew.obs, id) ≠ ∅ then
24:      (I, γ) ← δG(v, tnew.obs, id)
25:      p ← γ(x, y), where [x, y] ∈ I and x ≤ tnew.ts − r.current.ts ≤ y
26:      if (¬fBP ∧ r.prob · p ≥ pt) ∨ (fBP ∧ r.prob · p · maxG(tnew.obs, id) ≥ pt) then
27:        rn ← (↑tnew, id, r.t0, r.prob · p, ↑r, ∅)
28:        add rn to tablesG(tnew.obs)
29:        r.next ← r.next ∪ {↑rn}
30:        //Look at end nodes
31:        if id ∈ endG(tnew.obs) then
32:          if (fMS ∧ ∃↑rc ∈ completedG(id), span(rc) ⊂ span(rn) ∧ rn.prob ≤ rc.prob) ∨ (fMP ∧ ∃↑rc ∈ completedG(id), span(rc) = span(rn) ∧ rn.prob ≤ rc.prob) ∨ (fMPS ∧ ∃idc ∈ IA, ↑rc ∈ completedG(idc), span(rc) = span(rn) ∧ rn.prob ≤ rc.prob) then
33:            discard rn
34:          else
35:            add ↑rn to completedG(id)
36:            for all idc ∈ IA do
37:              for all ↑rc ∈ completedG(idc) s.t. (fMS ∧ idc = id ∧ span(rn) ⊂ span(rc) ∧ rc.prob ≤ rn.prob) ∨ (fMP ∧ idc = id ∧ span(rn) = span(rc) ∧ rc.prob ≤ rn.prob) ∨ (fMPS ∧ span(rn) = span(rc) ∧ rc.prob ≤ rn.prob) do
38:                remove ↑rc from completedG(idc)
39:                delete(rc)
40:              end for
41:            end for
42:          end if
43:        end if
44:      end if
45:    end if
46:  end for
47: end for
Lines 2-13 handle the case when the tuple contains an observation that is the start node of an activity in A. If MS must be applied, the current pointers of records in tablesG(tnew.obs) that do not have a successor are replaced with pointers to the new tuple, in order to minimize the span (Lines 4-9). If a record corresponding to a start node already has a successor, its current pointer stays unchanged and a new record is added to the table for every activity in startG(tnew.obs) for which no record was updated (Lines 10-12), denoting the fact that the new observation may be the start of a new activity occurrence. Lines 15-29 look at the tables associated with the nodes that precede tnew.obs in the temporal multiactivity graph and check whether the new tuple can be linked to existing partially completed occurrences. For each predecessor v of tnew.obs, Lines 16-20 determine where the algorithm should start scanning tablesG(v). Note that records are added to the index as new observations are received. Therefore, records r in each index table tablesG(v) are ordered by r.current.ts, i.e., the time at which the corresponding observation was received. Given two records r1, r2 ∈ tablesG(v), we use r1 ⪯ r2 to denote the fact that r1 precedes r2 in tablesG(v), i.e., r1.current.ts ≤ r2.current.ts.
In the unrestricted case, the whole table is scanned, tablesG(v).first being the first record in tablesG(v). If TF is applied (see Section 4.2), only the "most recent" records in tablesG(v) are considered. Additionally, if the EA restriction is to be applied, the algorithm requires each record to have at most one successor; each record is linked to the first observation that is a valid successor (Line 21, where the context restriction is applied as well, if needed). On Line 25, timespan distributions are used to determine the probability that observation tnew.obs is the successor of r.current.obs for the activity definition in A identified by id = r.activityID, given the amount of time elapsed between r.current.ts and tnew.ts. We also enforce the probability threshold (Line 26) by checking whether the partial occurrence still has a probability above the threshold. Alternatively, we detect whether the partial occurrence can still have a probability above the threshold upon completion: we refer to this feature of the algorithm as the Best Path (BP) pruning strategy, as it allows us to prune away solutions based on the best possible path to an end node. If all these conditions are met, a new record rn is added to the table associated with tnew.obs and r.next is updated to point to rn (Lines 27-29). Note that rn inherits t0 from its predecessor; hence, the start/end times of a completed occurrence can be quickly retrieved by looking directly at its last tMAGIC record.

Lines 31-43 check whether tnew.obs is an end node for some activity. If so, the algorithm checks whether any of MS, MP, or MPS must be applied, and whether the conditions for inclusion of the newly completed occurrence in the index under such restrictions are violated² (Line 32). If they are, record rn is discarded (Line 33); otherwise a pointer to rn is added to completedG, recording that a new occurrence has been completed (Line 35). In addition, if any of MS, MP, or MPS must be applied, previously completed occurrences rendered invalid by the addition of the new occurrence are removed from completedG and their records are removed from the index (Lines 36-41), using a function delete which recursively deletes records following the chain of previous pointers. Algorithm insert can be used iteratively for bulk insertion of an entire observation table (we refer to this variant as bulk-insert).

Example 4.1. Consider the temporal multiactivity graph of Fig. 5 and its temporal multiactivity graph index IG = (G, startG, endG, maxG, tablesG, completedG). Fig. 6 shows how the index is updated under the MS and EA restrictions when new observations are added to the observation table D. The first tuple in the observation table contains the observation a. Since activities A1 and A2 have a as their start node, two records are added to tablesG(a). The probability of the two partially completed activities is initially set to 1 (Fig. 6(1)). The second tuple is also about observation a; to apply the minimal span restriction, the records in tablesG(a) are updated to point to the new tuple (Fig. 6(2)).

2. span(r) denotes the span of the partially completed occurrence corresponding to r, i.e., the interval [r.t0, r.current.ts].
Fig. 6. Evolution of a temporal multiactivity graph index.
The third observation is b. As a is the only possible predecessor of b in the temporal multiactivity graph, the insertion algorithm looks at tablesG(a) to check whether the new tuple can be linked to a partially completed occurrence. The tuple can be linked to the first record in tablesG(a), which represents a partially completed instance of activity A1. Therefore, a new record is added to tablesG(b), with probability equal to the product of the probability associated with the previous record (1 in this case) and the probability of traversing the edge from a to b for activity A1. This probability is determined by the timespan distribution δG(a, b, id(A1)), based on the time elapsed between a and b: in this case, 1 time unit has passed between the two observations, so the probability associated with the edge is 0.3 (Fig. 6(3)). The fourth tuple involves b again; to apply the earliest action restriction, we do not link it to the first record in tablesG(a), which already has a successor (Fig. 6(4)). The fifth observation, c, can be linked to the second record in tablesG(a), which represents a partially completed instance of activity A2. Therefore, a new record is added to tablesG(c), with probability 0.8, determined by the timespan distribution δG(a, c, id(A2)) (Fig. 6(5)). When d is observed, the insertion algorithm looks at tablesG(b), tablesG(c), and tablesG(e), and adds two new records to tablesG(d), the first linked to the record in tablesG(b) (for activity A1) and the second to the record in tablesG(c) (for activity A2) (Fig. 6(6)). As d is an end node for both activities A1 and A2, pointers to the newly added records are created in completedG(id(A1)) and completedG(id(A2)), respectively.
The following results characterize the time complexity of insertion into a tMAGIC index in the unrestricted case and under the EA restriction.

Proposition 4.1. Given a set A of stochastic activities and a temporal multiactivity graph G = (VG, IA, δG) over A, the worst-case complexity of algorithm insert (resp. bulk-insert) when no restriction is applied is O(k^|VG| · |A| · |D|) (resp. O(k^|VG| · |A| · |D|²)), where D is the observation table and k is the level of concurrency.

Proposition 4.2. Given a set A of stochastic activities and a temporal multiactivity graph G = (VG, IA, δG) over A, the worst-case complexity of algorithm insert (resp. bulk-insert) when only the EA restriction is applied is O(k · |A| · |D|) (resp. O(k · |A| · |D|²)), where D is the observation table and k is the level of concurrency.

Thus, in the unrestricted case, algorithm insert is linear in the size of the observation table and exponential in the size of VG (for a fixed level of concurrency k). Under EA, the algorithm is independent of the size of VG. This result was expected: we still need to fully scan each predecessor's table, but we now limit each record to at most one successor, even in the worst case where all the nodes in VG are legitimate successors of an existing record. Finally, the following result establishes the correctness of algorithm insert.

Proposition 4.3. Given a set A of stochastic activities, a temporal multiactivity graph G over A, a tMAGIC index IG, and an observation tuple tnew, algorithm insert terminates and correctly indexes tnew by updating IG.
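To summarize the core of Algorithm 1 (the linking step of Lines 15-29), the following simplified sketch handles only the EA restriction and the plain probability-threshold check; start-node handling, TF, BP, and the MS/MP/MPS bookkeeping on completedG are omitted, and records are plain dictionaries (keys cur, act_id, t0, prob, prev, next) so the sketch stays self-contained. It is a reading aid under those simplifying assumptions, not the full algorithm.

    # Simplified sketch of the linking step of Algorithm 1 (EA + threshold only).
    def link_new_tuple(t_new, tables, delta, end_nodes, completed, p_t):
        """t_new: dict with keys 'obs' and 'ts'; tables: obs -> list of records;
           delta: (v, v_new, act_id) -> {(lo, hi): prob}; end_nodes: obs -> set of act ids."""
        preds = {v for (v, w, _) in delta if w == t_new["obs"]}
        for v in preds:
            for r in tables.get(v, []):
                if r["next"]:                        # EA: at most one successor per record
                    continue
                omega = delta.get((v, t_new["obs"], r["act_id"]))
                if not omega:
                    continue
                elapsed = t_new["ts"] - r["cur"]["ts"]
                p = next((q for (lo, hi), q in omega.items() if lo <= elapsed <= hi), 0.0)
                if r["prob"] * p < p_t:              # enforce the probability threshold
                    continue
                r_n = {"cur": t_new, "act_id": r["act_id"], "t0": r["t0"],
                       "prob": r["prob"] * p, "prev": r, "next": []}
                tables.setdefault(t_new["obs"], []).append(r_n)
                r["next"].append(r_n)
                if r["act_id"] in end_nodes.get(t_new["obs"], set()):
                    completed.setdefault(r["act_id"], []).append(r_n)

A bulk-insert variant would simply iterate this step, together with the start-node handling of Lines 2-13, over the tuples of an observation table in timestamp order.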
4.2 Improving Time and Space Performance
We now propose two pruning strategies that improve the time and space performance of the tMAGIC index while guaranteeing the correctness of the results. The first strategy is called Time Frame (TF) pruning. It is based on the observation that the number of "recent" records in a tMAGIC index, i.e., those records whose corresponding observations still have a chance of being linked to a new one, is basically independent of the size of the observation table. This assumption is largely confirmed by the experimental results of Section 5. We start by defining the two functions maxt and maxtG:

maxt : (I, γ) ∈ Ω → max{[x,y]∈I | γ(x,y)>0} y,
maxtG : (v, v′, id) ∈ VG × VG × IA → maxt(δG(v, v′, id)).

Intuitively, given the timespan distribution associated with an edge (v, v′) of a temporal stochastic activity, maxt returns the maximum time that can elapse after observing v before the probability of observing v′ drops to 0. maxtG does the same for the timespan distribution associated with an edge (v, v′) of the activity identified by id. If an amount of time larger than maxtG(v, v′, id) has elapsed since v was observed, linking v to v′ will not lead to a viable occurrence of the activity identified by id. This brings us to the definition of the strategy.

Definition 4.3 (Time Frame Pruning). Algorithm insert is said to apply the Time Frame pruning strategy (TF) if, for each observation tuple tnew and each predecessor v of tnew.obs, it starts scanning tablesG(v) at the first record r such that tnew.ts − r.current.ts ≤ max{id∈IA | δG(v,tnew.obs,id)≠∅} maxtG(v, tnew.obs, id).

Thus, this pruning strategy avoids scanning the entire predecessor table when most of the records in tablesG(v) cannot be linked to tnew.obs, because too much time has passed since their corresponding observations were made, causing the overall probability to be zero. The following propositions ensure that the strategy is correct and analyze the resulting time complexity.

Proposition 4.4. A tMAGIC index built by applying the TF pruning strategy in conjunction with any combination of restrictions is equal to a tMAGIC index built on the same data by applying the same combination of restrictions without TF.

Proposition 4.5. Given a set A of stochastic activities and a temporal multiactivity graph G = (VG, IA, δG) over A, the worst-case complexity of algorithm insert (resp. bulk-insert) when only TF is applied is O(k^|VG| · |A|) (resp. O(k^|VG| · |A| · |D|)), where D is the observation table and k is the level of concurrency.

In the complexity result above, we assumed that the number of "recent" records in each table tablesG(v) is independent of the size of D. Thus, the adoption of TF makes the complexity of bulk loading linear in the size of the observation table. Our experiments show that, in practice, time (and space) complexities under TF are independent of the size of the activities. This result is expected, since typically |D| ≫ |VG|.
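Since the records in tablesG(v) are kept in order of r.current.ts, the starting point prescribed by Definition 4.3 can be located with a binary search on the timestamps, as in the sketch below; the function names, the list-of-records layout, and the use of bisect are assumptions of this sketch.

    # Sketch of TF pruning: locate the first record of tablesG(v) that is still
    # "recent enough" to be linked to t_new, given the largest admissible elapsed time.
    from bisect import bisect_left

    def max_elapsed(v, obs_new, delta):
        """max over activity ids of maxtG(v, obs_new, id): the largest y with gamma(x, y) > 0."""
        values = [max(hi for (lo, hi), q in omega.items() if q > 0)
                  for (u, w, _), omega in delta.items()
                  if u == v and w == obs_new and any(q > 0 for q in omega.values())]
        return max(values) if values else None

    def first_recent_index(records, ts_new, bound):
        """Index of the first record r with ts_new - r.current.ts <= bound."""
        timestamps = [r["cur"]["ts"] for r in records]   # already ordered by construction
        return bisect_left(timestamps, ts_new - bound)

The slice records[first_recent_index(records, ts_new, bound):] then yields exactly the candidate records scanned under TF.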
The second pruning strategy we propose is the compaction of a tMAGIC index. Compaction removes records that have not been linked to a successor after a time interval larger than the largest maxt of the timespan distributions associated with the outgoing edges of their observation. We can define the time to live (TTL) of a record r ∈ tablesG(v) as

TTL(r) = max{v′∈VG | δG(v,v′,r.activityID)≠∅} maxtG(v, v′, r.activityID),

and remove r from the index if, after TTL(r) time units, no successor has been found for it. The following result ensures the correctness of compaction. Here, we say that two tMAGIC indexes are equivalent if and only if they return the same solutions to the evidence and identification problems for any given observation table.

Proposition 4.6. Let IG be a tMAGIC index, and let I′G be the tMAGIC index resulting from the removal of all records r from IG such that TTL(r) has elapsed. Then, IG and I′G are equivalent.
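A compaction pass can be sketched as follows: for each successor-less record, compute TTL(r) from the outgoing timespan distributions of its observation and drop the record once that much time has elapsed. The record layout mirrors the earlier sketches, and the pass is deliberately simplified: it does not update the next sets of predecessor records or propagate removals.

    # Sketch of compaction: remove records that can no longer acquire a successor.
    def ttl(obs, act_id, delta):
        """TTL of a record on node obs for activity act_id."""
        bounds = [max(hi for (lo, hi), q in omega.items() if q > 0)
                  for (u, w, a), omega in delta.items()
                  if u == obs and a == act_id and any(q > 0 for q in omega.values())]
        return max(bounds) if bounds else 0

    def compact(tables, delta, now):
        # Keep a record if it already has a successor or if its TTL has not yet elapsed.
        # (For brevity, predecessors' next sets are not updated here.)
        for obs, records in tables.items():
            tables[obs] = [r for r in records
                           if r["next"] or now - r["cur"]["ts"] <= ttl(obs, r["act_id"], delta)]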
4.3 Evidence and Identification Algorithms
The tMAGIC-evidence and tMAGIC-id algorithms (Algorithms 2 and 3) solve the Evidence and Identification problems, respectively, by leveraging the tMAGIC index. Algorithm tMAGIC-evidence selects occurrences from completedG that fall in the time interval [tss, tse] and have probability equal to or greater than a threshold pet (not to be confused with the threshold used to build the index). For each occurrence to be included in the answer, the algorithm scans the index backwards to retrieve all data tuples belonging to the occurrence. Algorithm tMAGIC-id considers all occurrences falling in the time interval [tss, tse], selects those with the highest probability, and returns their ids.

Algorithm 2. tMAGIC-evidence(IG, IA, pet, tss, tse)
Input: Temporal multiactivity graph index IG, set of activity identifiers IA, time interval [tss, tse], probability threshold pet
Output: Set of pairs (id(A), R) where id(A) ∈ IA and R is a set of sequences of tuples that satisfy activity A with probability above pet in the time interval [tss, tse]
1: S ← ∅
2: for all id(A) ∈ IA do
3:   R ← ∅
4:   for all ↑r ∈ completedG(id(A)) do
5:     if r.prob ≥ pet ∧ [r.t0, r.current.ts] ⊆ [tss, tse] then
6:       L ← ⟨r.current⟩
7:       repeat
8:         ↑r ← r.previous
9:         ↑t ← r.current
10:        add ↑t to L
11:      until r.previous = ⊥
12:      add reverse(L) to R
13:    end if
14:  end for
15:  add (id(A), R) to S
16: end for
17: return S
Algorithm 3. tMAGIC-id(IG, IA, tss, tse)
Input: Temporal multiactivity graph index IG, set of activity identifiers IA, time interval [tss, tse]
Output: Identifiers of the activities in IA that are satisfied with the highest probability in the time interval [tss, tse]
1: pmax ← 0
2: for all id(A) ∈ IA do
3:   for all ↑r ∈ completedG(id(A)) s.t. [r.t0, r.current.ts] ⊆ [tss, tse] do
4:     if r.prob > pmax then
5:       S ← {id(A)}
6:       pmax ← r.prob
7:     else if r.prob = pmax then
8:       add id(A) to S
9:     end if
10:  end for
11: end for
12: return S

The following results state the correctness and the time complexity of the algorithms.

Proposition 4.7. Consider a set A of stochastic activities, a temporal multiactivity graph G = (VG, IA, δG) over A, an observation table D, a time interval [tss, tse], and a tMAGIC index built using a probability threshold pt. Algorithm tMAGIC-evidence terminates and, if pet ≥ pt, returns all the activity occurrences in [tss, tse] with probability p ≥ pet. Algorithm tMAGIC-id also terminates and returns the correct answers.

Proposition 4.8. Given a set A of stochastic activities, a temporal multiactivity graph G = (VG, IA, δG) over A, and a tMAGIC index, the worst-case time complexity of algorithms tMAGIC-evidence and tMAGIC-id is O(C · |VG|) and O(C), respectively, where C is the number of indexed activity occurrences.

Obviously, the activity occurrences retrieved by both algorithms follow the restrictions used when building the tMAGIC index. It should be noted that the temporal multiactivity graph index is also capable of efficiently retrieving partially completed occurrences of a given activity A. This amounts to following forward pointers from the records in tablesG(v) having activityID = id(A), for each v such that id(A) ∈ startG(v).
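The backward walk performed by tMAGIC-evidence can be pictured as in the sketch below: starting from each completed record whose span lies in the query window and whose probability clears the query threshold, follow previous pointers back to the start tuple. The dictionary-based record layout again mirrors the earlier sketches and is only illustrative.

    # Sketch of tMAGIC-evidence (Algorithm 2): reconstruct qualifying occurrences
    # by walking previous pointers backwards from completed records.
    def evidence(completed, activity_ids, p_et, ts_s, ts_e):
        answer = {}
        for act_id in activity_ids:
            results = []
            for r in completed.get(act_id, []):
                if r["prob"] >= p_et and ts_s <= r["t0"] and r["cur"]["ts"] <= ts_e:
                    chain, node = [], r
                    while node is not None:          # follow previous pointers to the start
                        chain.append(node["cur"])
                        node = node["prev"]
                    results.append(list(reversed(chain)))
            answer[act_id] = results
        return answer

An identification query needs only the first test: scan the completed records in the window and keep the activity ids achieving the maximum probability, which is consistent with its O(C) cost.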
5 EXPERIMENTAL RESULTS
This section describes experiments on both synthetic and real data to evaluate index creation time, memory usage, and running time for the two problems described in the paper. We first ran experiments on synthetic data in order to evaluate several aspects of the framework in detail. We then used a third-party real-world data set to validate the results obtained on synthetic data and to highlight the differences between the two scenarios. For the first set of experiments, we used synthetic activity definitions and data sets generated using two separate tools: one for generating random activity definitions, and another for simulating a set of activities and generating observation streams. We generated 32 activity definitions and data sets of 5 million tuples each.
Fig. 7. Index build time (synthetic data).
The second set of experiments used a third-party proprietary real-world travel data set including events such as hotel check-ins and check-outs, flight bookings, departures, arrivals, etc. The data set includes over 10 million records collected over a period of 2 years, and includes the names of the individuals involved. Our implementation of tMAGIC consists of approximately 2,200 lines of C code. Experiments were run on a Windows XP platform running on an Intel Centrino Duo 2.33 GHz with 2 GB of RAM. All reported processing times were averaged over 10 independent runs.
5.1 Index Build Time and Memory Occupancy
Fig. 7 shows the index build time for the bulk insertion algorithm on synthetic data, using a probability threshold pt = 0.6 and different combinations of restrictions and/or pruning strategies. The results show that in the unrestricted case, indexing all possible activity occurrences quickly becomes unmanageable: the build time approaches 1,000 seconds with just 50,000 observations (note that both axes are on a log scale). The application of EA provides a marginal improvement, while the combined application of EA and MS provides a more significant improvement. The introduction of TF makes the build time linear w.r.t. the size of the observation table. These results also show that, for a fixed probability threshold, in the worst case 1 million observations can be processed in about 35 seconds,³ at an average rate of 28,500 observations per second. In other words, our algorithm provides real-time performance for any application generating up to 28,500 observations per second.

Fig. 8 shows the number of activity occurrences indexed on synthetic data using any of TF, TF+MS, TF+EA, TF+BP, TF+MP, TF+MPS, TF+EA+MS, for a fixed probability threshold pt = 0.6 and varying sizes of the observation table (recall that TF does not alter the search space but guarantees that it can be explored in linear time).

3. About 4 seconds are needed to compact the index for 1 million observations. However, in real-time scenarios, we can reasonably assume that compaction is incrementally performed during idle times.
Fig. 10. Build time when varying |VG| (real data).
Fig. 8. Number of activity occurrences (synthetic data).
The following conclusions can be drawn: 1) as expected, the number of occurrences indexed when only TF is applied is an upper bound for all the other restrictions or combinations of them, and BP does not alter the search space when used alone; 2) EA+MS imposes the most restrictive constraints on the search space, causing the least possible number of occurrences to be indexed, followed by MS and EA; 3) MPS indexes a slightly smaller number of occurrences than MP, but both are far less restrictive than EA+MS.

Fig. 9 shows the build time for the travel data set. These results confirm that TF builds the index in time linear w.r.t. the number of observations. Additionally, the application of CTX provides a significant improvement over TF alone, as it avoids linking observations that do not have the same context. However, CTX without TF causes the processing time to diverge. As in the case of synthetic data, EA+MS further improves build time. Under TF+CTX, build time for real data is lower than for synthetic data. This can be explained by the differences between the two data sets. The synthetic data stream was generated by simulating multiple times all the activities of interest, so each observation in the stream belongs to a monitored activity and needs to be analyzed. In the case of the real-world travel data, we first acquired the observation table and then defined a number of activities of interest. As a consequence, not all observations in the table belong to an activity of interest, and some of them may be ignored altogether. Fig. 9 also shows that processing times under TF, although linear, are not exactly on a straight line as in the case of synthetic data. This is expected, as synthetic activity occurrences were simulated at a constant rate, whereas real-world activities
Fig. 9. Index build time (real data).
may not occur at a constant rate (as also confirmed by Fig. 13). Fig. 10 shows how processing time increases with the number of nodes in the multiactivity graph (note that both axes are on a log scale).

Fig. 11 shows how memory occupancy for the synthetic data varies for different restrictions as the size of the observation table increases. For each restriction or combination thereof, the figure shows the size of the index (in KB) both before (b) and after (a) applying the compaction pruning strategy.⁴ As expected from the previous discussion, EA+MS generates the smallest index, followed by MS and EA, while TF is the upper bound. On average, EA+MS reduces memory occupancy by a factor of 5 w.r.t. TF. In general, compacting the index reduces its size by a factor of 4, with the exception of BP, which provides a compaction factor of 3, and EA, which provides a compaction factor of 6.4 (similarly, EA+MS has a compaction factor of 5.7). This can be easily explained: BP avoids attaching additional records to partially completed occurrences as soon as it determines that they will not lead to viable solutions. Hence, there are fewer records to be removed during compaction. On the contrary, EA continues to attach records to a partially completed occurrence until the actual probability falls below the threshold. If this happens, all the records corresponding to the partially completed and inadmissible occurrence will need to be discarded during compaction, whereas in general, when EA is not applied, each record may have multiple successors and does not need to be removed unless all of its successors have been removed. Moreover, these results indicate that the average memory occupancy is about 0.06 KB per observation tuple in the worst case, with the total size of the index reaching 60 MB for 1 million tuples.

Fig. 12 shows memory occupancy for the travel data set, and confirms our previous observation that activities are not executed at a constant rate. For instance, there is an increased volume of activities between 10K and 100K observations. Fig. 13 shows how memory occupancy grows as the size of the merged graph increases.

4. In these experiments, compaction is applied after bulk loading.
Fig. 11. Memory occupancy (synthetic data).
Fig. 12. Memory occupancy versus size of the obs. table (real data).
Fig. 13. Memory occupancy when varying |V_G| (real data).
Fig. 14. Index build time (synthetic data).
Fig. 15. Memory occupancy (synthetic data).
We performed further experiments to assess the impact of TF and the interaction between EA and BP, fixing the size of the observation table to 10,000 and varying the value of the probability threshold pt. Fig. 14 shows the results on synthetic data, in terms of index build time. It is clear that the application of TF improves build time by one order of magnitude. The application of BP only marginally improves build time when TF is applied; instead, a significant improvement (10-18 percent) can be observed when BP is applied without TF. However, contrary to one’s expectations, the relative improvement provided by BP is higher for lower probability thresholds. This behavior can be explained by considering that a “basic” pruning mechanism, which discards partial occurrences whose probability drops below the threshold, is applied even without BP. So, the “forecast” capability provided by BP (see footnote 5) is more effective for lower probability thresholds, when this
mechanism does not come into play until the actual probability drops below the threshold. For higher thresholds, most partial occurrences drop below the threshold sooner, typically after the first few observations, and are thus pruned before BP can even step in. The build time usually decreases as the probability threshold increases; this effect can be attributed to the fact that solutions are discarded as soon as they fall below the threshold. However, when TF is applied, this effect is barely noticeable.

Fig. 15 shows that memory occupancy generally decreases as the probability threshold increases, because a higher threshold causes more pruning and limits the number of records added to the index tables. As expected, the application of TF does not affect memory occupancy. Moreover, we note that applying BP in conjunction with EA does not further reduce memory occupancy, but rather increases it. This is because EA allows a record in the index to have at most one successor: if a record is assigned a successor and the corresponding occurrence is later found to be inadmissible, that record is removed and cannot be part of any other solution, whereas it could have been linked to a different successor leading to a valid solution. BP limits this effect by pruning in advance solutions that cannot reach a probability above the threshold upon completion, thus increasing the number of valid solutions identified w.r.t. EA alone. In conclusion, EA+BP can index more occurrences than EA, in less time.

5. We recall that BP prunes partially completed instances that cannot have a probability above the threshold upon completion.
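To illustrate why the forecast check matters mostly at low thresholds, the following is a minimal sketch contrasting basic pruning with a BP-style check; the names (PartialOccurrence, best_remaining_prob) are hypothetical and not taken from the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) contrasting "basic" pruning,
# which discards a partial occurrence once its current probability falls below
# the threshold, with a BP-style "forecast" prune that also discards it when
# even the best possible completion cannot reach the threshold.

from dataclasses import dataclass

@dataclass
class PartialOccurrence:
    prob_so_far: float          # product of probabilities matched so far
    best_remaining_prob: float  # upper bound on the probability of any completion

def keep_basic(p: PartialOccurrence, threshold: float) -> bool:
    # Basic pruning: keep as long as the current probability is above threshold.
    return p.prob_so_far >= threshold

def keep_with_bp(p: PartialOccurrence, threshold: float) -> bool:
    # BP-style pruning: additionally require that the most optimistic
    # completion can still reach the threshold; otherwise prune early.
    return (p.prob_so_far >= threshold and
            p.prob_so_far * p.best_remaining_prob >= threshold)

if __name__ == "__main__":
    # With a low threshold, basic pruning keeps this occurrence alive,
    # but the forecast shows it can never reach the threshold, so BP prunes it.
    occ = PartialOccurrence(prob_so_far=0.4, best_remaining_prob=0.1)
    print(keep_basic(occ, threshold=0.05))    # True  -> kept by basic pruning
    print(keep_with_bp(occ, threshold=0.05))  # False -> pruned by BP
```

With a high threshold the first check already fails for most partial occurrences, so the forecast rarely gets a chance to act, which matches the behavior observed in Fig. 14.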
5.2 Query Execution Time

We conducted experiments to measure query execution times on indexes built on synthetic data under different
Fig. 16. Identification query time (synthetic data).
Fig. 17. Identification query time (synthetic data).
restrictions. We first ran identification queries over the entire timespan of the data set and averaged the results over 10 independent runs. Fig. 16 shows query response times under TF w.r.t. the size of the observation table. Times are generally linear in the number of observations. Moreover, the indexes built under EA+MS, MS, and EA provide shorter query times, since they are smaller. In Fig. 17 we plot query execution times versus the number of activity occurrences. The resulting cloud of points clusters nicely around the regression line shown in the figure, with a coefficient of determination R^2 = 0.882. This result shows that, independently of the specific restriction used to build the index, identification query time is linear in the number of activity occurrences. A similar conclusion can be reached for evidence queries, with R^2 = 0.989.
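As a reference for how such a figure can be obtained, the following is a minimal sketch of a least-squares fit and R^2 computation; the data points are invented for illustration and only the procedure mirrors the evaluation.

```python
# Minimal sketch: fit query time versus number of activity occurrences with a
# straight line and report the coefficient of determination (R^2).
# The data points below are hypothetical, not measurements from the paper.

import numpy as np

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    # Fit y = a*x + b by least squares.
    a, b = np.polyfit(x, y, deg=1)
    y_pred = a * x + b
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    occurrences = np.array([100, 500, 1000, 2000, 4000, 8000], dtype=float)
    query_time_ms = np.array([12, 55, 110, 230, 470, 980], dtype=float)  # hypothetical
    print(f"R^2 = {r_squared(occurrences, query_time_ms):.3f}")
```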
6 CONCLUSIONS
This paper studies the problem of automatically and efficiently detecting activities in very large observation databases collected by systems such as web servers, banks, and security installations. We proposed temporal stochastic automata to model activities of interest and defined a data structure, called a temporal multiactivity graph, to merge multiple activity graphs together and enable concurrent monitoring of multiple activities. We introduced the temporal multiactivity graph index (tMAGIC) to index very large numbers of temporal observations from interleaved activities. We designed algorithms to build the tMAGIC index and showed how it can be leveraged to develop algorithms that efficiently solve the Evidence and Identification problems. Finally, we introduced complexity-reducing restrictions and pruning strategies to solve these problems efficiently. Our experiments have shown that tMAGIC consumes reasonable amounts of memory and can quickly solve both the Evidence and Identification problems on both a synthetic and a real-world data set.
ACKNOWLEDGMENTS

Some of the authors were funded by the Army Research Office MURI award number W911NF-09-1-0525, ARO grant W911NF0910206, and AFOSR grant FA95500610405.
Massimiliano Albanese received the PhD degree in computer science and engineering from the University of Naples “Federico II,” Italy, in 2005. Since 2011, he has been an assistant professor in the Department of Applied Information Technology at George Mason University, and he is a member of the Center for Secure Information Systems (CSIS). His current research interests include modeling and detection of cyber activities, defense strategies, and moving target defense.

Andrea Pugliese received the PhD degree in computer and systems engineering from the University of Calabria, Italy, in 2005. Currently, he is an assistant professor in the Department of Electronics, Computer and Systems Sciences at the University of Calabria. His main research interests include indexing techniques, semistructured data management, RDF and the Semantic Web, and inconsistency management.
V.S. Subrahmanian is a full professor of computer science at the University of Maryland, College Park, where he has served as a faculty member since 1989. He has worked extensively at the intersection of databases and artificial intelligence. He serves on numerous editorial boards, has more than 200 refereed publications, and is a fellow of both the AAAI and AAAS.