Cite this paper as: T. Klerx, M. Anderka, H. Kleine Büning, S. Priesterjahn. Model-based Anomaly Detection for Discrete Event Systems. In Proceedings of the 26th International Conference on Tools with Artificial Intelligence (ICTAI 2014), Limassol, Cyprus, 2014.
Model-based Anomaly Detection for Discrete Event Systems

Timo Klerx, Maik Anderka, Hans Kleine Büning
Department of Computer Science, University of Paderborn, Paderborn, Germany
Email: {timo.klerx, maik.anderka, kbcsl}@upb.de

Steffen Priesterjahn
DE R&D ACT 53, Wincor Nixdorf International GmbH, Paderborn, Germany
Email: [email protected]
Abstract—Model-based anomaly detection in technical systems is an important application field of artificial intelligence. We consider discrete event systems, a system class to which a wide range of relevant technical systems belong and for which no comprehensive model-based anomaly detection approach exists so far. The original contributions of this paper are threefold: First, we identify the types of anomalies that occur in discrete event systems, and we propose a tailored behavior model that captures all anomaly types, called probabilistic deterministic timed-transition automaton (PDTTA). Second, we present a new algorithm to learn a PDTTA from sample observations of a system. Third, we describe an approach to detect anomalies based on a learned PDTTA. An empirical evaluation in a practical application, namely ATM fraud detection, shows promising results.

Keywords—Model-based Anomaly Detection; Automatic Model Generation; Discrete Event Systems; ATM Fraud Detection
I. INTRODUCTION
Model-based anomaly detection, also known as model-based diagnosis, is an important application field of artificial intelligence. It deals with the algorithmic analysis of whether a system operates abnormally, given a system description (a model) and observations of its behavior. Figure 1 shows the fundamental procedure of model-based anomaly detection: A system, e.g. a production plant, is simulated using a behavior model, and the expected behavior predicted by the simulation is compared to the observed behavior of the real system. Model-based anomaly detection techniques have been successfully applied in many practical applications, including energy anomaly detection in production plants [1], fault identification in Web service processes [2], diagnosis of electrical power systems [3], fault detection in automotive systems [4], and network intrusion detection [5].

We consider discrete event systems—a system class for which, to the best of our knowledge, no comprehensive model-based anomaly detection approach exists so far. Discrete event systems are characterized by three properties [6]: First, the system's state space is finite; second, each new system state depends on the state's predecessors; and third, state changes are caused by events occurring at various time instances. Moreover, in most practical applications of interest, events are often triggered by stochastic processes such as unpredictable effects of nature or user interactions. The mentioned properties apply to a wide range of relevant systems for which anomaly
Figure 1. Fundamental procedure of model-based anomaly detection.
detection is an important concern, including automated manufacturing plants, intelligent transportation systems, communication networks, software systems, and self-service devices.

An essential prerequisite for model-based anomaly detection is a model that adequately describes the system's (normal) behavior. More specifically, the model must capture those aspects of the system that are relevant to solve the anomaly detection task at hand. For example, in order to detect suboptimal energy consumption in production plants by means of model-based anomaly detection, an appropriate model must capture the normal energy consumption of a plant. We propose a classification scheme for anomalies that occur in discrete event systems, which comprises four general anomaly types. Based on this classification scheme, we identify a tailored behavior model that captures both the particular properties of discrete event systems mentioned above and the aspects of this system class that are relevant to detect the four anomaly types. Our model is called Probabilistic Deterministic Timed-Transition Automaton (PDTTA), and it is similar to a traditional probabilistic deterministic timed automaton.

The task of model formation (see Figure 1) is usually done manually by human engineers, which entails several problems, including lack of knowledge, high development costs, and increasing system complexity. A promising approach to overcome this modeling bottleneck is automatic model formation using machine learning techniques. While the learning of simple model types can be considered state of the art, the learning of probabilistic deterministic timed automata is still a challenge [1]. Two learning algorithms have been proposed in the relevant literature recently: RTI+ [7] and BUTLA [8]. Although both algorithms achieve satisfactory results in terms of theoretical runtime behavior and model
quality, the learned models cannot (and were not intended to) capture the crucial aspects of discrete event systems that are relevant to detect all four anomaly types. We therefore propose a new learning algorithm for probabilistic deterministic timed-transition automata. The idea is to learn the system's sequential behavior and its timing behavior individually, by first learning an initial probabilistic deterministic finite automaton using a state-of-the-art approach and then augmenting the initial automaton with timing information.

Given a learned behavior model, the actual anomaly detection task (see Figure 1) is to decide whether the observed behavior of the real system is normal or abnormal. This decision highly depends on both the underlying model and the type of anomaly to be detected. Maier et al. [8] propose ANODA (ANOmaly Detection Algorithm), which is used in combination with BUTLA. However, the anomaly detection effectiveness of the algorithm is limited in the context of discrete event systems. The reason for this is twofold: First, as mentioned above, the underlying model does not capture all relevant aspects of this system class. Second, they only consider anomalies for two succeeding system states, and hence, they are not able to detect anomalous event sequences.¹ We present a new anomaly detection approach that is able to detect all types of anomalies based on a probabilistic deterministic timed-transition automaton. The idea is to traverse the automaton for a given sample observation and aggregate the respective probabilities, which are then used to decide whether the observation is normal or abnormal.

Contributions. In this paper, we present the first comprehensive model-based anomaly detection approach for discrete event systems. Our contributions can be summarized as follows:
1) We propose a classification scheme for anomalies that occur in discrete event systems, and we propose probabilistic deterministic timed-transition automata as a crucial model class to capture all anomaly types (Section II).
2) We present a novel machine-learning algorithm for probabilistic deterministic timed-transition automata, which overcomes the limitations of existing model learning algorithms outlined above (Section III).
3) We describe an approach to detect all anomaly types that occur in discrete event systems based on a probabilistic deterministic timed-transition automaton (Section IV).
To evaluate its effectiveness and to highlight its practical relevance, we apply our model-based anomaly detection approach to a real-world application, namely automatic ATM fraud detection. An ATM fulfills all properties of a discrete event system (see Section II-A). Moreover, model-based anomaly detection is an appropriate means to identify ATM fraud, with the assumption that a significant anomaly is a strong indicator of a fraud attempt [9]. We evaluate the anomaly detection effectiveness using a data set that has been recorded on a common Wincor Nixdorf ATM in a period of ten months (Section V). The results are quite promising, and they show that our approach is able to detect anomalies with an acceptable effectiveness.

¹ Note that we are referring to the original approaches described in [8]. In theory, it might be possible to extend BUTLA to consider anomalies for sequences of succeeding system states.

II. MODELING DISCRETE EVENT SYSTEMS
We start with a definition of discrete event systems, including a description of the crucial properties (Section II-A). We then identify four general types of anomalies that occur in discrete event systems (Section II-B). Finally, we describe probabilistic deterministic timed-transition automata, which is a tailored behavior model for discrete event systems that can be used to detect all four anomaly types (Section II-C).

A. Discrete Event Systems

System theory has provided a classification of different types of systems. Discrete event systems represent one particular system class, which is defined as follows:

Definition 1 (Discrete Event System): "A discrete event system is a discrete-state, event-driven system, that is, its state evolution depends entirely on the occurrence of asynchronous discrete events over time." [6]

Specifically, a discrete event system is characterized by three properties:
1) Discrete states. The system's state space is finite (also called finite states).
2) Dynamic. Each new state of the system depends on the state's predecessors (also called with memory).
3) Event-driven. State changes are caused by events occurring at various time instances.
As an illustrative example, consider the board game chess, which can be expressed as a discrete event system as follows: every possible board configuration defines an individual state of the system (discrete states), moves depend on previous moves (dynamic), and moves take place asynchronously at various times (event-driven).

The behavior of a discrete event system can be observed by means of sample paths. (In the chess example, a sample path could be the sequence of moves of an individual game.) Figure 2 depicts a sample path as a timing diagram. The first event e1 occurs at time i1 and causes a transition from state s2 to state s5, the next event e2 occurs at time i2 and causes a transition from state s5 to state s4, and so forth. The sample path can also be written as a timed event sequence ⟨e1, i1⟩, ⟨e2, i2⟩, ⟨e3, i3⟩, ⟨e4, i4⟩, ⟨e5, i5⟩, ⟨e6, i6⟩, ⟨e7, i7⟩. In this case, it is assumed that the start state (s2) is known, and that the system is deterministic, i.e., there exists exactly one succeeding state after an event occurred. Thus, the current state of the system can be determined at any point in time from a given timed event sequence.

In practice, however, discrete event systems frequently operate in a stochastic setting. In this case, a system state describes a random process, such as unpredictable effects of nature or user interactions. Then, probability distribution functions can be assigned to the events if some statistical information about the set of sample paths of the system is available (this will be detailed in Section II-C).
Figure 2. Timing diagram showing a sample path of a discrete event system (taken from [6]). Events are denoted by arrows at the times they occur, and states are shown in between events.

Figure 3. Timing diagram showing a sample path of an ATM, which can be interpreted as a discrete event system.

Example. In our real-world application, we consider ATMs, which can be interpreted as discrete event systems. Figure 3 shows a typical sample path of an ATM. Events are triggered both by customer input, like inserting the card or entering the PIN, as well as by internal mechatronic devices, for instance for dispensing cash or ejecting the card. Continuous variables are not considered, i.e., the state space is finite and the state evolution is entirely event-driven. Moreover, because an ATM typically operates in a transaction-based manner (e.g. card insertion → PIN entry → amount selection → dispense cash → card return), it is also a dynamic system. In general, the behavior of an ATM is deterministic, but stochastic elements are brought in by customer interactions and environmental conditions; examples include operating errors, varying customer input times, or slow server connections.

B. Anomalies in Discrete Event Systems

In different system classes, different types of anomalies can occur. Consider for example a continuous-state system, where one anomaly type could be that a continuous variable reaches some critical threshold; for instance, the fill level of a liquid tank exceeds its capacity. In a discrete event system this type of anomaly cannot be modeled because continuous variables do not exist per definition. However, for each individual system class, general anomaly types can be defined that apply to all systems in this class.

We developed a classification scheme for the anomaly types that occur in discrete event systems. The benefits of this classification scheme are threefold: First, it reveals what kind of anomalies can be encountered in a discrete event system. Second, it provides the basis to identify the information that is essential to detect particular anomaly types, which also includes the design of a respective behavior model (detailed in Section II-C). Third, it provides a means to compare the capabilities of different anomaly detection approaches.

We identified four general anomaly types that occur in discrete event systems:
1) Anomalous event. Given a sample path, a single event is anomalous if it causes a transition to an irregular system state. In a stochastic setting, an event can also be anomalous if its occurrence probability is lower than some anomaly threshold. In the ATM example, the anomalous event could be a card entry that is directly followed by a cash dispense, but without a PIN entry in between. This behavior can be caused by a software-based attack, where the attacker has been able to hook onto the card reading code and to directly issue a dispense if certain card data is read.
2) Anomalous sample path. A sample path is anomalous if every single event is normal but the aggregation of the events' individual occurrence probabilities is lower than some anomaly threshold. (This anomaly type only applies to a stochastic setting.) For example, consider two normal sample paths (card in, select "withdraw", PIN entry, amount selection, dispense cash, card out) and (card in, select "show balance", PIN entry, show balance, card out). Then, a third sample path (card in, select "show balance", PIN entry, amount selection, dispense cash, card out) would be anomalous, although each pair of subsequent events is normal. This could indicate certain software-based attacks as well as a manipulation of the card reader or PIN pad.
3) Anomalous event timing. Given a sample path, a single event e has an anomalous timing if the time gap between e and its directly preceding event is outside the regular range. In the ATM example, an anomalous event timing could occur if the PIN entry takes an unusually long time. This indicates a PIN pad overlay that was placed by an attacker to capture the PIN, while impeding the normal PIN entry.
4) Anomalous sample path timing. The timing of a sample path is anomalous if every single event has a normal timing but the aggregation of the events' individual timing probabilities is lower than some anomaly threshold. Regarding an ATM, this anomaly could be caused by a software or denial-of-service attack that leads to a general slowdown of the overall system.
C. Probabilistic Deterministic Timed-Transition Automaton

Discrete event modeling formalisms can be untimed, timed, or stochastic timed, according to the desired level of abstraction [6]. Untimed models only capture logical properties concerned with the sequential ordering of events, timed models capture properties that involve timing considerations, and stochastic timed models capture properties that involve a probabilistic setting. Here, we apply a stochastic timed model. The reason for this is twofold: First, as motivated earlier, many practical systems of interest operate in a stochastic setting, as is also the case in our ATM fraud detection application. Second, we need a probabilistic mechanism to capture the anomaly type anomalous sample path (see above). We propose the probabilistic deterministic timed-transition automaton (PDTTA) as a means to implement a stochastic timed model for discrete event systems. Figure 4 shows an example
of a PDTTA. For each transition, a probability p for taking the transition is given. Additionally, a probability density function describes the relative likelihood of the event's timing (e.g. the time it takes to enter the PIN).

Figure 4. Example of a probabilistic deterministic timed-transition automaton (PDTTA).

In the following, a PDTTA is formally defined.

Definition 2 (PDTTA): A probabilistic deterministic timed-transition automaton (PDTTA), denoted by A, is a six-tuple A = (S, s0, Σ, T, ξ, τ), where
• S is a finite set of states, with s0 ∈ S the start state.
• Σ is a finite alphabet comprising all relevant events.
• T ⊆ S × Σ × S is a finite set of transitions, e.g. ⟨s, e, s′⟩ ∈ T is the transition between states s, s′ ∈ S triggered by event e ∈ Σ.
• ξ : T → [0, 1] is a transition probability function, which assigns a probability value p to every transition.
• τ : T → Θ is a transaction time probability function, which assigns a probability distribution θ ∈ Θ to every transition, with Θ being the set of all possible probability distributions. Every θ ∈ Θ has the signature θ : I → [0, 1], with I ⊆ N a set of time values.
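The six-tuple of Definition 2 maps naturally onto a small data structure. The sketch below is our own illustration (field names and container choices are assumptions, and the uniform density in the example is a simple stand-in for a fitted distribution θ):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PDTTA:
    """Container mirroring the six-tuple A = (S, s0, Sigma, T, xi, tau)."""
    states: set[str]                                    # S
    start: str                                          # s0, with s0 in S
    alphabet: set[str]                                  # Sigma
    delta: dict[tuple[str, str], str]                   # deterministic T: (s, e) -> s'
    xi: dict[tuple[str, str], float]                    # transition probability function
    tau: dict[tuple[str, str], Callable[[int], float]]  # time density per transition

# The PIN-entry fragment from Figure 4: p = 0.95 for the transition, and a
# uniform density over 0..99 time units as a placeholder for the fitted PDF.
t = ("awaiting PIN", "PIN entry start")
a = PDTTA(
    states={"awaiting PIN", "reading PIN"},
    start="awaiting PIN",
    alphabet={"PIN entry start"},
    delta={t: "reading PIN"},
    xi={t: 0.95},
    tau={t: lambda i: 0.01 if 0 <= i < 100 else 0.0},
)
```

Determinism lets the transition be keyed by (state, event) alone, so ξ and τ can share the same keys.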
A PDTTA is similar to a traditional probabilistic deterministic timed automaton (PDTA).² In contrast to a PDTTA, in a PDTA guards on the transitions limit the availability of these transitions to certain points in time. Consequently, there can be two transitions from a single state with the same label (or event), but with different guards. This implies that the future execution of the system depends on both the time values and the occurrence of events. In contrast, the future execution of a PDTTA depends only on the occurrences of events; the timing is inferred afterwards. This is in line with the definition of a discrete event system (see Definition 1). Moreover, note that no specific acceptance states are required in a PDTTA because, for a given sample path, the automaton is used to compute the likelihood that the path is generated by the model. This likelihood is then interpreted as the path's anomaly score (we will come to this in Section IV).

III. MODEL LEARNING ALGORITHM
In this section, we describe a two-step approach to learn a PDTTA from given sample paths of a discrete event system. As input, let X denote a set of timed event sequences x = ⟨e1, i1⟩, ⟨e2, i2⟩, . . . , ⟨en, in⟩. In a first step, we learn a simple probabilistic deterministic finite automaton (PDFA) from X (Section III-A). Note that a PDFA is a PDTTA without the transaction time probability function τ (see Definition 2). In a second step, we learn a time probability distribution θ ∈ Θ for every transition t ∈ T and thus obtain τ (Section III-B). By combining the results from the first and the second step, we obtain the final PDTTA.

² Probabilistic deterministic timed automata are also often called stochastic timed automata or deterministic real-time automata. For further information about timed automata, refer to [10].

A. PDFA Learning

Learning a PDFA from sequence data is a well-studied problem. Several approaches have been proposed to solve this task, of which the ALERGIA algorithm, proposed in [11], is the most popular one. We apply ALERGIA in favor of a more complex algorithm to underline the robustness of our model learning approach.

The time behavior is irrelevant for PDFA learning. Hence, we omit the time information in the timed event sequences from X for now, which results in a new set X̂ of (untimed) event sequences x̂ = e1, e2, . . . , en, which constitutes the input for the ALERGIA algorithm. As additional input, a confidence level parameter α is required.

The ALERGIA algorithm works as follows. Initially, a so-called prefix tree acceptor (PTA) is built that accepts every sequence x̂ ∈ X̂. A PTA is a tree that accepts sequences, whereas every node can be a final node. For a given sequence, the tree is traversed, and the sequence is accepted iff the last symbol was read and a final node was reached in the PTA. Given the PTA, the algorithm starts with the root node, and for each pair of nodes x and y the algorithm checks whether these two nodes are compatible. If they are compatible, they are merged. Two nodes are compatible if these two nodes and their successor nodes are determined to be equivalent recursively.
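This pairwise test can be sketched as a small function; the Hoeffding-bound criterion it implements is the formula given below. We assume the natural logarithm, as in common presentations of ALERGIA, and the function name is our own:

```python
import math

def different(f: int, n: int, f2: int, n2: int, alpha: float) -> bool:
    """Return True if the termination frequencies f/n and f2/n2 of two
    nodes differ by more than the Hoeffding bound at confidence level
    alpha, i.e. the nodes are NOT equivalent and must not be merged."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) \
        * (1.0 / math.sqrt(n) + 1.0 / math.sqrt(n2))
    return abs(f / n - f2 / n2) > bound
```

In a full ALERGIA implementation, the same bound is also applied per symbol to the outgoing-transition counts of the two nodes, not only to the termination counts.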
The equivalence test uses the Hoeffding bound, comparing at confidence level α the numbers of sequences arriving at x and y, denoted n and n′ respectively, as well as the numbers of sequences ending in x and y, denoted f and f′ respectively:

different(x, y) = | f/n − f′/n′ | > sqrt( (1/2) · log(2/α) ) · ( 1/√n + 1/√n′ )

After all nodes have been checked for difference, the algorithm outputs the resulting PDFA. For more information about ALERGIA, refer to [11].

B. τ Learning

After obtaining the PDFA, we learn the transaction time probability function τ. Algorithm 1 describes this process. We iterate over the original input sequences X, and for each transition t ∈ T we create a list Pt in which we store the time values that belong to the events that triggered t. Thus, for every timed event sequence x ∈ X we traverse the PDFA starting in s0, and for every tuple ⟨e, i⟩ ∈ x we take the transition t = ⟨s, e, s′⟩ ∈ T and put i into Pt. After we have processed all sequences in X, we fit a probability density function (PDF) for each t ∈ T with the values stored in Pt. In previous experiments (cf. [9]) we found that kernel density estimators with a Gaussian kernel perform best on the data set at hand.
Algorithm 1: τ learning
Data: Set of timed event sequences X, PDFA A′ = (S, s0, Σ, T, ξ)
Result: τ
1 foreach sequence x ∈ X do
2     s ← s0;
3     foreach tuple ⟨e, i⟩ ∈ x do
4         t ← getTransition(s, e);
5         P[t].append(i);
6         s ← getTargetState(t);
7 foreach transition t ∈ T do
8     τ(t) ← fitPDF(P[t]);
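Algorithm 1 can be sketched in a few lines of code. The hand-rolled estimator below is a minimal Gaussian-kernel KDE standing in for the kernel density estimators mentioned above; the function names, the fixed bandwidth, and the assumption that every sequence fits the learned PDFA are our own simplifications:

```python
import math
from collections import defaultdict

def fit_pdf(samples: list[float], bandwidth: float = 1.0):
    """Tiny Gaussian kernel density estimator: returns a density
    function over time values, built from the observed samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def pdf(t: float) -> float:
        return norm * sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
                          for s in samples)
    return pdf

def learn_tau(sequences, start, transitions):
    """Algorithm 1 sketch: collect, per transition, the time values of
    the events that triggered it, then fit one density per transition.
    `transitions` maps (state, event) -> next state (deterministic)."""
    per_transition = defaultdict(list)
    for seq in sequences:
        state = start
        for event, time in seq:
            t = (state, event)          # determinism: t identifies <s, e, s'>
            per_transition[t].append(time)
            state = transitions[t]      # KeyError if a sequence leaves the model
    return {t: fit_pdf(times) for t, times in per_transition.items()}
```

The returned dictionary plays the role of τ: evaluating `tau[t](i)` yields the density of time value i on transition t.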
IV. ANOMALY DETECTION

We will now describe how the learned model can be used to detect abnormal behavior. This problem is defined as follows: Given a PDTTA A and a single timed event sequence x, decide whether x is normal or abnormal with respect to A. The idea is to traverse A and store all probabilities associated with transitions and time values in lists PE and PT, respectively, and to make the anomaly decision based on the aggregation of these values.

A naïve approach to aggregate the values into a single anomaly score would be to multiply all probabilities in PE and PT. The result of this multiplication can then be compared to a threshold c. However, this approach has certain challenges:
1) The anomaly score depends on the length of the input sequence: the longer the sequence, the smaller the anomaly score.
2) The probabilities in PE and PT are treated equally.
3) A single very small probability has a big influence on the anomaly score.

We propose a pragmatic solution to address the first two challenges: We define two separate thresholds cE and cT for event and time probabilities, respectively. Then we separately multiply the probabilities in PE and PT, yielding anomaly values aE = ∏_{pi ∈ PE} pi and aT = ∏_{pi ∈ PT} pi. If we multiply different probabilities pi, every resulting anomaly score aX will decrease exponentially in expectation. Hence, we normalize every threshold cX based on the sequence length n, so we compare aX with cX^n instead of comparing aX with cX. With this normalization, sequences of different length can be compared.³ Regarding the third challenge, it may be desired that one very small probability has a big influence, depending on the application. A transition with probability almost zero can, for example, result from noise in the training set and should not be contained in an optimal model, thus indicating an anomaly.

Algorithm 2 describes the anomaly detection. For a timed event sequence x = ⟨e1, i1⟩, ⟨e2, i2⟩, . . . , ⟨en, in⟩ we traverse the PDTTA A starting in s0. For each tuple ⟨e, i⟩ ∈ x we take the transition t = ⟨s, e, s′⟩ ∈ T. In the list PE we store all event probabilities ξ(t) that indicate how likely it is to traverse t in s (line 4 in Algorithm 2). The time probabilities τ(t)(i), which indicate the likelihood of t occurring at time i, are stored in the list PT (line 5). Then we aggregate the event and time probabilities by multiplication (lines 7–10). Finally, we decide based on the two thresholds cE and cT whether x is an anomaly or not. Specifically, x is an anomaly if one of the aggregated probabilities aE or aT is lower than the corresponding normalized threshold cE^length(x) or cT^length(x), respectively (lines 11–14).

³ For the sake of clarity and ease of comprehension, we use products of probabilities here. In our implementation, however, we transfer the computation to a logarithmic space to avoid floating point underflows. The latter approach is described in [12], and it is mathematically equivalent to the former.

Algorithm 2: Anomaly detection
Data: timed event sequence x, PDTTA A = (S, s0, Σ, T, ξ, τ), thresholds cE, cT
Result: true/false
1  s ← s0;
2  foreach tuple ⟨e, i⟩ ∈ x do
3      t ← getTransition(s, e);
4      PE.append(ξ(t));
5      PT.append(τ(t)(i));
6      s ← getTargetState(t);
7  aE, aT ← 1;
8  for k ← 1 to length(x) do
9      aE ← aE · PE.get(k);
10     aT ← aT · PT.get(k);
11 if aE ≤ cE^length(x) or aT ≤ cT^length(x) then
12     return true
13 else
14     return false
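A sketch of Algorithm 2 that already applies the log-space formulation of footnote 3: products become sums of logarithms, and the normalized threshold c^n becomes n·log c. Names are our own, and zero probabilities (which would need special handling before taking logarithms) are assumed not to occur:

```python
import math

def is_anomaly(path, start, transitions, xi, tau, c_event, c_time):
    """Decide whether a timed event sequence is anomalous w.r.t. a
    PDTTA given as dictionaries: transitions (s, e) -> s', xi (s, e) ->
    probability, tau (s, e) -> time density function."""
    log_e, log_t = 0.0, 0.0
    state = start
    for event, time in path:
        t = (state, event)
        log_e += math.log(xi[t])       # event probability of the transition
        log_t += math.log(tau[t](time))  # time probability at this time value
        state = transitions[t]
    n = len(path)
    # Compare against length-normalized thresholds c^n, i.e. n * log(c).
    return log_e <= n * math.log(c_event) or log_t <= n * math.log(c_time)
```

Because both sides of each comparison scale linearly with the sequence length, scores of short and long sequences remain comparable, which is exactly the purpose of the c^n normalization.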
V. EMPIRICAL EVALUATION
The goal of the evaluation is to assess the anomaly detection effectiveness of our approach in a real-world application, namely ATM fraud detection. We first describe the underlying data set (Section V-A). We then discuss how anomaly examples can be inserted into the test data (Section V-B). Finally, we describe the experimental setting (Section V-C) and discuss the results (Section V-D).

A. Data Basis and Preprocessing

Modern ATMs comprise a Diagnosis and Serviceability module that processes and aggregates the data produced by an ATM's internal devices. The aggregated data can be stored in a log file, and ATM-internal applications can obtain access to the log file for process control and for health or security state assessment, for instance. A log file comprehensively describes the ATM's real-time behavior in all monitored operation scenarios, and hence, it is an appropriate source for model learning. The usage of log files for ATM fraud detection was proposed in [13]; in that study, however, only sequence-based anomalies were addressed.

For our experiments, we use a log file that was recorded on a public ATM in the period between June 2011 and April 2012. The log file was provided by Wincor Nixdorf International—one of the world's leading ATM manufacturers. The respective ATM is placed in a bank institute located in
a German city, and hence, it is heavily frequented. The log file comprises more than 15 million status messages, resulting in a file size of 1.6 GB. Each status message consists of a timestamp, a message ID, and an optional payload.

The timestamp represents the moment the message occurred, and the message ID identifies the sending device as well as the reason why the message was sent. A message can also comprise an optional payload that contains additional device-specific data, such as sensor values or control information, but we do not use the payload here.⁴

The log file can be interpreted as a timed event sequence. Each status message represents the occurrence of a single event. The set of possible events (denoted as Σ in Definition 2) is given by the set of possible message IDs. The log file mentioned above comprises 17 different message IDs, or events respectively. (A subset of the events is shown in Figure 3.)

In a preprocessing step, we split the log file into weekly chunks to account for seasonality effects, which often occur in time series data. For example, it is likely that the normal usage patterns of the ATM on a weekday differ from the normal usage patterns on a Sunday, when the bank institute might be closed. We therefore perform the evaluation on the basis of whole weeks (and not days, for instance). In another preprocessing step, we split the weekly chunks into individual sample paths. This is necessary because a sample path is the unit element for both our model learning algorithm and our anomaly detection approach. A typical sample path through an ATM is a transaction, which is initiated by a customer who inserts the card. We identify transactions by splitting the log file at the respective event card entry start.

B. Anomaly Examples

In the period in which the log file was recorded, no attacks were registered, so we consider the monitored behavior as normal. To assess the anomaly detection effectiveness, examples of normal and abnormal behavior are required. The lack of anomaly examples for testing is a general issue in many practical anomaly detection applications.
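The transaction-splitting step of the preprocessing described in Section V-A (cutting the log at every card entry start event) can be sketched as follows; the function name and the tuple representation of the log are our own:

```python
def split_into_paths(log, start_event="card entry start"):
    """Split one long timed event sequence into per-transaction sample
    paths, starting a new path at every transaction-initiating event."""
    paths, current = [], []
    for event, time in log:
        if event == start_event and current:
            paths.append(current)   # close the previous transaction
            current = []
        current.append((event, time))
    if current:
        paths.append(current)       # flush the final transaction
    return paths
```

Each resulting path is then the unit element fed to both the model learning algorithm and the anomaly detector.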
We distinguish four strategies to derive anomaly examples, which correspond to different types of anomaly examples or attacks, respectively:
1) Real-world attacks. Examples are derived from monitored attacks on ATMs in the wild.
2) Physically simulated attacks. Examples are derived from the reproduction of known attacks under laboratory conditions using a real test system.
3) Model-based simulated attacks. Examples are derived from the reproduction of known attacks by manipulating a (learned) behavior model.
4) Artificially generated anomalies. Examples are generated by some probabilistic or random process.

⁴ Part of our current research is the analysis of whether and how payloads can be incorporated into a behavior model in order to improve the anomaly detection effectiveness.
Data on real-world attacks is generally not available for reasons of security and secrecy. Physically simulated attacks and model-based simulated attacks both require a human domain expert. Moreover, for most of the relevant attacks, physical simulation is complicated, expensive, and therefore inefficient. Model-based simulation is more appropriate, but it requires a generative and interpretable model. The latter requirement, however, is often met only by manually created models, because most model learning algorithms apply some state-merging technique, which usually yields a model that cannot be interpreted by a human expert (cf. Section III-A). We therefore decided to generate artificial anomaly examples; how this is done in detail is described in the next subsection.

C. Experimental Setting

Each experiment is performed on the data of three consecutive weeks in the log file, which serve as training set, validation set, and test set, respectively. The training set is used for model learning. The validation set is used to identify the optimal operating point of our anomaly detection approach, i.e., the optimal combination of the parameters confidence level α, event threshold cE, and time threshold cT. Finally, the test set is used to evaluate the anomaly detection effectiveness at the optimal operating point. We intersperse artificially generated anomaly examples into the validation and test sets. This is done by randomly selecting a certain proportion of timed event sequences from the respective set and then modifying these sequences. We intersperse an equal number of anomaly examples for each of the four anomaly types introduced in Section II-B. The modifications performed to generate the respective anomaly types are:
• Delete a randomly chosen ⟨event, time⟩-tuple from the timed event sequence (type: anomalous event).
• Replace the timed event sequence by another sequence that occurs relatively rarely in the validation or test set, respectively (type: anomalous sample path).
• Modify a single time value in the timed event sequence (type: anomalous event timing).
• Modify every time value in the timed event sequence (type: anomalous sample path timing).
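The four modifications above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the time offset delta, and the representation of a timed event sequence as a list of (event, time) tuples are our assumptions.

```python
import random

def anomalous_event(seq, rng=random):
    """Type 1: delete a randomly chosen (event, time) tuple."""
    out = list(seq)
    del out[rng.randrange(len(out))]
    return out

def anomalous_sample_path(seq, rare_sequences, rng=random):
    """Type 2: replace the whole sequence by a rarely occurring one."""
    return list(rng.choice(rare_sequences))

def anomalous_event_timing(seq, delta=100.0, rng=random):
    """Type 3: modify a single time value."""
    out = [list(pair) for pair in seq]
    out[rng.randrange(len(out))][1] += delta
    return [tuple(pair) for pair in out]

def anomalous_sample_path_timing(seq, delta=100.0):
    """Type 4: modify every time value."""
    return [(event, time + delta) for event, time in seq]
```

In an actual experiment the time modifications would of course have to be drawn so that they fall outside the normal time distribution; a fixed offset is used here only to keep the sketch short.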
Note that the generated anomaly examples are used for testing purposes only; model learning is performed solely on normal examples. The anomaly detection effectiveness is assessed in terms of F-measure, the harmonic mean of precision and recall. Since we want to detect anomalies, we define anomalies as the “positive” class and normal sequences as the “negative” class. Thus, precision is the ratio of correctly detected anomalies to all detected anomalies, and recall is the ratio of correctly detected anomalies to all actual anomalies. Formally, F-measure, precision (prec), and recall (rec) are defined using true positives (tp), false positives (fp), and false negatives (fn):

F-measure = (2 · prec · rec) / (prec + rec);  prec = tp / (tp + fp);  rec = tp / (tp + fn)
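The three definitions above translate directly into code; the following minimal sketch (the function name is ours) guards against empty denominators:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion counts
    (anomalies are the positive class, as defined above)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```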
Table I. ANOMALY DETECTION EFFECTIVENESS USING A TEST SET WITH 10% OF ARTIFICIALLY GENERATED ANOMALIES OF ALL FOUR ANOMALY TYPES.

Underlying model (learning algorithm) | F-Measure | α   | cE   | cT
PDTTA (ALERGIA + τ learning)          | 0.63      | 0.5 | 0.01 | 10^-5
PDFA (ALERGIA)                        | 0.51      | 0.4 | 0.1  | −

Table II. ANOMALY DETECTION EFFECTIVENESS SHOWN IN TABLE I BROKEN DOWN INTO THE FOUR ANOMALY TYPES: 1. ANOMALOUS EVENT, 2. ANOMALOUS SAMPLE PATH, 3. ANOMALOUS EVENT TIMING, AND 4. ANOMALOUS SAMPLE PATH TIMING.

Anomaly type | Underlying model | F-Measure | α     | cE   | cT
1            | PDTTA            | 0.62      | 0.4   | 0.1  | 10^-9
1            | PDFA             | 0.69      | 0.7   | 0.3  | −
2            | PDTTA            | 0.90      | 0.001 | 0    | 0
2            | PDFA             | 0.97      | 0.9   | 0.01 | −
3            | PDTTA            | 0.36      | 0.7   | 0.1  | 0
3            | PDFA             | 0.00      | −     | −    | −
4            | PDTTA            | 0.71      | 0.05  | 0    | 0
4            | PDFA             | 0.00      | −     | −    | −
Σ            | PDTTA            | 0.63      | 0.5   | 0.01 | 10^-5
Σ            | PDFA             | 0.51      | 0.4   | 0.1  | −
D. Analysis Results

Table I shows the anomaly detection effectiveness. We report F-measure values both for our new PDTTA-based anomaly detection approach (with time information) and for anomaly detection using the state-of-the-art PDFA (without time information); the latter serves as a baseline. The parameters confidence level (α), event threshold (cE), and time threshold (cT) are tuned for each approach individually. Table I shows that both approaches achieve reasonable F-measure values of 0.63 and 0.51, respectively. As expected, the PDTTA-based approach outperforms the baseline. This is because the test set comprises anomaly examples of all four anomaly types, including time-based anomalies (anomaly types three and four), and timing behavior cannot be modeled with a PDFA.

Table II breaks down the results shown in Table I into the four anomaly types. The parameters are tuned to the respective types of anomalies. For anomaly types one and two, our PDTTA-based approach performs worse than the PDFA-based baseline, which indicates that the additional timing information slightly diminishes the performance in these cases. On the other hand, the PDTTA-based approach is able to detect the time-based anomalies (types three and four); in these cases the baseline approach fails, as expected. Table II also shows that the PDTTA-based approach performs relatively well in detecting anomalies that cover a whole sequence (types two and four), compared to anomalies that can only be observed between two succeeding events (types one and three). This can be attributed to the fact that our approach operates in a sequence-based manner, i.e., the final anomaly decision is made based on probabilities aggregated over sequences.
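The sequence-based decision rule can be sketched as follows. This is a hedged illustration only: the paper does not restate how the per-transition probabilities are aggregated, so the sketch uses a raw product computed in log-space, and the function and parameter names are hypothetical.

```python
import math

def is_anomalous(event_probs, time_probs, c_event, c_time):
    """Flag a sequence as anomalous if its aggregated event
    probability falls below the event threshold c_event or its
    aggregated time probability falls below the time threshold
    c_time. Probabilities are multiplied in log-space for
    numerical stability."""
    agg_event = math.exp(sum(math.log(p) for p in event_probs))
    agg_time = math.exp(sum(math.log(p) for p in time_probs))
    return agg_event < c_event or agg_time < c_time
```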
Figure 5. Anomaly detection effectiveness of a PDTTA in terms of precision and recall over the event threshold cE (in log scale). The underlying test set and the remaining parameters are the same as in Table I (α = 0.5, cT = 10−5 ).
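Precision/recall curves like those in Figure 5 result from sweeping cE over a range of values. Schematically (assuming each test sequence has already been scored with its aggregated event probability, and that a sequence is flagged when its score falls below the threshold; names are illustrative):

```python
def sweep(scores_with_labels, thresholds):
    """For each candidate threshold, classify every scored sequence
    (score < threshold means 'anomalous') and record precision and
    recall; labels are True for actual anomalies."""
    curve = []
    for c in thresholds:
        tp = sum(1 for s, y in scores_with_labels if s < c and y)
        fp = sum(1 for s, y in scores_with_labels if s < c and not y)
        fn = sum(1 for s, y in scores_with_labels if s >= c and y)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        curve.append((c, prec, rec))
    return curve
```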
Figure 5 shows the anomaly detection effectiveness of a PDTTA in terms of precision and recall for varying event threshold cE, while fixing α = 0.5 and cT = 10^-5. Varying cE controls the precision/recall tradeoff: for high values of cE the recall becomes 1 while the precision drops to 0, meaning that every anomaly is detected (high recall) but, simultaneously, almost every normal sequence is falsely labeled as an anomaly (low precision). The break-even point is roughly at cE = 0.05, where precision and recall are equal (≈ 0.6). For cE smaller than 0.2 the precision first rises (while the recall drops), but both stay almost constant for cE = 10^-2 or lower (precision ≈ 0.69, recall ≈ 0.58). In terms of F-measure, the best performance is achieved for cE = 10^-2 (cf. Table I). The results for varying the time threshold cT while fixing cE are not shown because they are very similar, i.e., precision (and recall) range from 0 (and 1) over 0.6 (and 0.6) to 0.69 (and 0.58).

Altogether, the experimental evaluation shows that the anomaly detection effectiveness of both approaches is quite good for anomalous sample paths and acceptable for anomalous events. Only the PDTTA-based approach can detect anomalous event timings and anomalous sample path timings, where the effectiveness is rather poor for the former and acceptable for the latter.

VI. CONCLUSION
We propose probabilistic deterministic timed-transition automata (PDTTA) as a tailored behavior model for model-based anomaly detection in discrete event systems. The model can be learned from given system observations by first building a state-of-the-art PDFA and then augmenting it with the relevant timing information. Anomaly detection is performed by traversing the learned model for given system observations while comparing the aggregated probabilities with some anomaly threshold. The empirical evaluation indicates the practical applicability of our anomaly detection approach. The overall results in the ATM fraud detection scenario are quite promising, although the anomaly detection effectiveness differs among the four anomaly types: two types can be detected with acceptable F-measure values of 0.62 and 0.71, while the other two types are detected with relatively poor and quite good F-measure values of 0.36 and 0.90, respectively.
We are convinced that the presented or similar anomaly detection approaches will help to increase the security and safety of technical systems. Although we evaluate our approach in the context of ATM fraud detection, it is applicable to all technical systems that can be interpreted as discrete event systems, such as automated manufacturing plants, intelligent transportation systems, communication networks, software systems, and self-service devices.

As for future work, instead of using the well-known ALERGIA algorithm for PDFA learning, more elaborate techniques that have been proposed recently should be investigated regarding their applicability and quality (see e.g. [14]). Moreover, part of our current work is to improve the generation of anomaly examples for testing, in order to obtain more reliable evaluation results. In the context of ATM fraud detection, this includes the creation of physically simulated attacks as well as model-based simulated attacks (see Section V-B). Finally, the relatively low performance for anomaly types one and three (anomalies between two succeeding events) could be improved by introducing additional (one-class) classifiers that model the normal behavior at each transition.

ACKNOWLEDGEMENTS

This work was supported by Wincor Nixdorf International, and partly funded by the German Federal Ministry of Education and Research (BMBF) within the Leading-Edge Cluster “Intelligent Technical Systems OstWestfalenLippe” (it’s OWL).

REFERENCES
[1] O. Niggemann, B. Stein, A. Vodencarevic, A. Maier, and H. Kleine Büning, “Learning behavior models for hybrid timed systems,” in Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI’12). AAAI, 2012, pp. 1083–1090.
[2] Y. Yan, P. Dague, Y. Pencole, and M. Cordier, “A model-based approach for diagnosing fault in web service processes,” International Journal of Web Services Research (IJWSR), vol. 6, no. 1, pp. 87–110, 2009.
[3] O. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun, “Probabilistic model-based diagnosis: An electrical power system case study,” IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 40, no. 5, pp. 874–885, 2010.
[4] A. Azarian and A. Siadat, “A global modular framework for automotive diagnosis,” Advanced Engineering Informatics, vol. 26, no. 1, pp. 131–144, 2012.
[5] K. Wang and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID’04). Springer, 2004, pp. 203–222.
[6] C. G. Cassandras and S. Lafortune, Introduction to Discrete Event Systems. Springer, 2008.
[7] S. Verwer, M. de Weerdt, and C. Witteveen, “A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data,” in Proceedings of the 10th International Colloquium on Grammatical Inference: Theoretical Results and Applications (ICGI’10). Springer, 2010, pp. 203–216.
[8] A. Maier, A. Vodencarevic, O. Niggemann, R. Just, and M. Jäger, “Anomaly detection in production plants using timed automata,” in Proceedings of the 8th International Conference on Informatics in Control, Automation and Robotics (ICINCO’11). SciTePress, 2011, pp. 363–369.
[9] T. Klerx, M. Anderka, and H. Kleine Büning, “On the usage of behavior models to detect ATM fraud,” in Proceedings of the 21st European Conference on Artificial Intelligence (ECAI’14). IOS Press, 2014, pp. 1045–1046.
[10] R. Alur and D. L. Dill, “A theory of timed automata,” Theoretical Computer Science, vol. 126, no. 2, pp. 183–235, 1994.
[11] R. C. Carrasco and J. Oncina, “Learning stochastic regular grammars by means of a state merging method,” in Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications (ICGI’94). Springer, 1994, pp. 139–152.
[12] P. Sun, S. Chawla, and B. Arunasalam, “Mining for outliers in sequential databases,” in Proceedings of the 6th SIAM International Conference on Data Mining (SDM’06). SIAM, 2006, pp. 94–105.
[13] M. Anderka, T. Klerx, S. Priesterjahn, and H. Kleine Büning, “Automatic ATM fraud detection as a sequence-based anomaly detection problem,” in Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM’14). SciTePress, 2014.
[14] S. Verwer, R. Eyraud, and C. de la Higuera, “PAutomaC: A probabilistic automata and hidden Markov models learning competition,” Machine Learning, vol. 96, no. 1–2, pp. 129–154, 2014.