Comparing Software Behaviour Models

Neil Walkinshaw, Kirill Bogdanov
Department of Computer Science, The University of Sheffield
E-mail: {n.walkinshaw, k.bogdanov}@dcs.shef.ac.uk

Technical Report: CS-08-16

Abstract

Software development relies on the use of models that either specify how the system is intended to behave, or capture its behaviour in practice. It can often be the case that a developer ends up with models of the same system that differ in some respect: one of the models may be reverse-engineered, or one may stem from an older version that has since evolved. In these circumstances, it is important to be able to concisely capture the differences between them - a challenging task when the models are nontrivial. This paper presents an automated technique that addresses this problem.

1 Introduction

Models of software behaviour are useful for a variety of development tasks. Developers can use them as a basis for communicating with each other; they can be inspected, or used for specification-based testing or model checking. If the model is complete and up-to-date, the aforementioned techniques can be used in harmony to ensure that the software ultimately behaves correctly.

The need to authoritatively compare two different models arises in a number of areas of software engineering. The accuracy of reverse-engineering techniques can only be evaluated by establishing how close a hypothesis is to a reference model. In model-driven development the developer will inevitably end up with multiple models, corresponding to different versions, alternative specifications, or reverse-engineered implementations. Being able to concisely characterise the differences between these models is key to developing the system in such a way that it is both correct and consistent.

Conventional model comparison techniques tend to compare the "languages" of two models, where the language is the set of all possible sequences of events. However, a major difference in languages does not necessarily imply a commensurate difference in the underlying model structure, and vice-versa. Language-based comparison techniques can at best provide an approximate guess at how similar the structures of two models are. To date, no technique exists that can (a) precisely describe the structural difference between two state-based models and (b) accurately quantify these differences in terms of the input/output behaviour of the models.

This paper makes two contributions. It presents the FSMDiff algorithm, which concisely captures the difference between two models in terms of added / removed states and transitions. It also shows how the output of the algorithm can be used with the well-established precision and recall measure [14] to provide a more descriptive measure of similarity between two models.

The rest of this paper is structured as follows. Section 2 introduces the challenge of comparing finite state models, and shows why existing techniques are insufficient. Section 3 presents our model comparison algorithm and shows how to combine it with the precision and recall measure. Section 4 presents a case study that demonstrates the application of the algorithm in comparing the outputs of two different state machine inference techniques. Section 5 … Finally, Section 6 provides conclusions and outlines our future work.

2 Background

This section presents some preliminary definitions, and then provides an overview of existing techniques that are used to compare state-based models. These techniques are primarily drawn from the domain of finite state machine / regular grammar inference, where they are used to establish the accuracy of inferred models with respect to reference models.

2.1 Preliminary definitions

In this work, we assume that software behaviour is modelled in terms of a Labelled Transition System.

Definition 2.1 (Labelled Transition System (LTS)) An LTS [16] is a quadruple A = (Q, Σ, ∆, q_0), where Q is a finite set of states, Σ is a finite alphabet, ∆ : Q × Σ → Q is a partial function and q_0 ∈ Q. This can be visualised as a graph, where states are the nodes, and transitions are the edges between them, labelled by their respective alphabet elements.
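As a concrete illustration of Definition 2.1 (our own sketch, not part of the report), an LTS with a partial transition function can be encoded as a dictionary; the two-state machine below is hypothetical:

```python
# Sketch of an LTS A = (Q, Sigma, Delta, q0) with a partial Delta.

Q = {"q0", "q1"}
SIGMA = {"a", "b"}
DELTA = {
    ("q0", "a"): "q1",   # q0 --a--> q1
    ("q1", "b"): "q0",   # q1 --b--> q0
}
q0 = "q0"

def step(state, label):
    """Follow one transition; None where Delta is undefined."""
    return DELTA.get((state, label))

print(step("q0", "a"))  # -> q1
print(step("q0", "b"))  # -> None (Delta is partial)
```

Encoding ∆ as a dictionary makes the partiality explicit: a missing key simply means the transition is undefined.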

Finite state machines can be interpreted as LTSs. A broad range of generalisations of finite state machines has been developed (e.g. Abstract State Machines [2] and Extended Finite-State Machines [4]), which incorporate explicit data transformations with state transitions. This makes it easier to model complex software systems at a higher level of abstraction.

State-based descriptions of software behaviour are usually presumed to be complete; if a particular transition cannot be followed in the model, it is assumed to be impossible. In practice, however, models are usually incomplete. Whether designed by hand or reverse-engineered, it is rarely realistic to assume that a model is complete in every respect. As a result, the absence of a transition may be due either to the fact that it would be impossible, or to the fact that it is simply not known. For this reason we assume that the models we are going to compare are Partial LTSs (PLTSs) [16]. If they are known to be incomplete, it is possible to explicitly distinguish between model transitions that are known to be invalid, and transitions that are simply not known to exist at all.

Definition 2.2 (Partial LTS (PLTS)) A PLTS is a 5-tuple A = (Q, Σ, ∆, q_0, Ψ). This is defined as an LTS, but it is assumed to be only partial. To make the explicit distinction between unknown and invalid behaviour, Ψ makes invalid transitions explicit: Ψ : Q × Σ → Q, where Ψ ∩ ∆ = ∅.

Definition 2.3 (Language of a (P)LTS) A sequence α ∈ Σ* belongs to the language of a (P)LTS if there is a path labelled α from the initial state q_0 to some state q ∈ Q. In symbols, q_0 →α q.

2.2 Language-Based Model Comparison

One common characteristic of all current comparison techniques is the fact that models are compared in terms of their respective languages. This effectively treats the model as a 'black box', where the internal structure is unimportant; the only aspect of interest is its language - the sequences of events it uses to interact with its environment.

Measuring the extent to which two languages overlap in a valid and useful way is however challenging. Languages are usually infinite, making it impossible to exhaustively enumerate every possible sentence. As a result, the validity of the final measure depends on obtaining two finite samples that are in some sense 'representative' of the languages. Interpreting the (lack of) overlap between the languages can be equally challenging: how can one concisely capture the specific differences between two languages? What does this say about the underlying model?

The most common approach has its roots in the field of regular grammar inference [9]. A random sample is taken from the language of one machine, and the other machine is used to count the proportion of the sample that is accepted by both machines. This is however problematic because (1) it is virtually impossible to select a random sample that will fairly evaluate the underlying machine, and (2) a single proportional value gives little insight into the reasons for the (in-)accuracy. Various techniques have been developed to address these problems. Lo et al. [11] propose the use of precision and recall [14] to better interpret the accuracy of a machine. To address the sampling problem, the authors propose the use of established state machine testing techniques, which can produce a complete, characteristic set of paths if the machine is minimal and deterministic [17]. In this case, it becomes possible to produce the complete test-sets for both machines and to compare them wholesale, producing an absolute measure for comparing the languages.

2.3 Emphasis on Language can Skew Comparison of Software Behaviour Models

One of the main motivations for the work in this paper is the fact that language-based model comparisons are not always suitable by themselves. Emphasis on language ignores the underlying state-transition structure; by comparing languages alone it is difficult to identify precisely which FSM features cause one language to differ from another. For state-based models of software systems, software engineers are not solely interested in the language or range of behaviours that a model represents. A software engineer also wants to gain insights into the inter-dependencies between specific parts of the software system, and to investigate potential interactions between different parts of the model. In other words, the actual structure of a model is at least as important as the language it produces or accepts.

Many evaluation techniques for software-engineering models assume that a similarity in the languages of two models implies a similarity in their underlying structure, and vice-versa. This is however not always the case. Figure 1 illustrates this with a model of a simple text editor. Model (a) is correct and models (b-d) contain errors. Using our precision and recall score [17], the positive recall (the overlap of valid sentences with the language of the correct model) is recorded under each model. In (b) a single loop is missing, and in model (c) two transitions and a state are missing, but both receive similar scores. In terms of transition structure, (c) fares much worse than (b), but because the proportion of overlapping language remains similar, this is not reflected in the scores.
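The sample-based, language-level precision and recall scores discussed above can be sketched as follows. This is our own illustration, not code from the paper: the `accepts` function, the toy reference and hypothesis machines, and the sample of sequences are all hypothetical.

```python
# Sketch: sample-based language comparison of two machines.
# A machine is (delta, q0); delta is a partial transition dict.

def accepts(machine, seq):
    """A sequence is in the language if every transition is defined."""
    delta, state = machine
    for label in seq:
        state = delta.get((state, label))
        if state is None:
            return False
    return True

reference = ({("q0", "a"): "q1", ("q1", "b"): "q0"}, "q0")
hypothesis = ({("q0", "a"): "q0", ("q0", "b"): "q0"}, "q0")  # over-general

sample = [("a",), ("a", "b"), ("b",), ("a", "a")]
rel = {s for s in sample if accepts(reference, s)}   # relevant sentences
ret = {s for s in sample if accepts(hypothesis, s)}  # retrieved sentences

precision = len(rel & ret) / len(ret)
recall = len(rel & ret) / len(rel)
print(precision, recall)  # -> 0.5 1.0
```

The over-general hypothesis accepts the whole sample, so it scores perfect recall but poor precision; this is exactly the kind of single-number summary whose limitations the text describes.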

Figure 1. Simple text editor model; (a) is correct and models (b-d) are faulty. Positive recall scores: (a) Recall+ = 1; (b) Recall+ = 0.57; (c) Recall+ = 0.52; (d) Recall+ = 0.04.

In (d) the transition structure is identical, but one transition has been labelled wrongly. This almost completely changes the language of the machine, which leads to a very low recall score. However, from a software-engineering perspective the rest of the diagram still provides a good overview of the software behaviour; a human would still find the diagram relatively useful, but this is not reflected in the low recall score.

2.4 The Challenge of Structural Comparison of Finite-State Models

When comparing or evaluating models of software behaviour, it is important to account for the structure of the model, as well as the range of behaviours it represents. Ideally, there should be a measure of the structural overlap between two models that can be presented alongside the conventional language-based similarity measures.

One intuitive approach is to look at the difference between two models in terms of the set of states and transitions that are superfluous or missing. This is reminiscent of Levenshtein's edit-distance metric [10], but instead of strings we are comparing state machines. This approach was proposed by Quante and Koschke [13]. To evaluate their technique for recovering software protocols, they propose measuring the number of transitions that would have to be added and deleted to establish the distance between their inferred machines and the target machines. To do this, their approach computes a union machine of the two machines, and compares each machine to the union (counting missing / inserted transitions) by computing the product of each machine with the union. The technique can however run into a number of problems that hamper its usefulness in practice. Computation of a union produces a non-minimal state machine that can be non-deterministic. The process of removing the non-determinism and minimising the non-deterministic machine [8] can result in a large machine, even if there are only a few differences between the two original machines. If this is the case, the difference score that is computed by comparing each machine with the union can be skewed.

The challenge of differentiating two state machines is difficult because the search space is so vast. We need to find out which states and transitions in one machine have equivalent counterparts in the other. States cannot be identified uniquely, because they do not have their own labels; they can only be distinguished by the topology of surrounding states and transitions. Any two transitions in the two machines that have the same label could potentially be equivalent. Whether or not this is actually the case can only be established by pairing up the surrounding states and transitions - a process that can become very expensive. The challenge lies in doing so in an efficient manner.

3 The FSMDiff Algorithm

Our FSMDiff algorithm is inspired by the cognitive process a human would be expected to adopt when comparing two state machines. It is usually possible to identify "landmarks" [15] - certain pairs of states that can be deemed equivalent. Once a set of landmarks has been identified, it can be used as a basis for further comparison of the remaining states and transitions in the machines. Potential landmarks are identified by producing a similarity score for every pair of states; this score is computed by matching up the surrounding network of states and transitions. The rest of this section describes how this score is computed for every pair of states in the machine. This is followed by a description of the FSMDiff algorithm, which takes these similarity scores, identifies a selection of initial landmark pairs, and uses these to compute the differences between two machines.

3.1 Scoring the similarity of states

The algorithm is dependent upon the ability to quantify the similarity of a pair of states in two separate state machines. The similarity of two states is computed by measuring the extent to which the surrounding behaviour of the two states overlaps. This involves comparing the transitions that can (transitively) reach the state, and those that can be reached from it. Computing the overlap of the surrounding behaviour of two states is a recursive process: it is first necessary to establish the overlap of their immediate surrounding transitions (local similarity), and then to consider the similarity of the target / source states of these transitions (global similarity). Below, we describe the process of calculating the local similarity of two states, show how this can be adapted for non-deterministic states, and finally show how it can be expanded into a recursive algorithm that computes the global similarity of two states.

Computing local similarity

Essentially, the local similarity S_{AB} of two states A and B is computed by dividing the number of overlapping surrounding transitions by the total number of surrounding transitions. Given a set Γ, we denote the number of elements in Γ by |Γ|. Using this notation, S_{AB} = |matching surrounding transitions| / |all surrounding transitions|.

To begin with, we consider only the similarity of two states in terms of their outgoing transitions. The set of alphabet elements used on incoming and outgoing transitions of a specific state s is denoted by Σ_s = {σ ∈ Σ | ∃t ∈ Q . (s →σ t ∈ ∆) ∨ (t →σ s ∈ ∆)}. Assuming that the state machines are deterministic, the score can simply be computed as S_{AB} = |Σ_A ∩ Σ_B| / |Σ_A ∪ Σ_B|. If there are no outgoing transitions from either of the two states, the score is considered to be zero. Examples (A, B, C) in Figure 2 all show deterministic states: in (A) the outgoing transitions are identical, which produces a score of 1; in (B) there are no common outgoing transitions, producing a score of 0; in (C) only one of the two outgoing transitions from state B is matched, producing a score of 0.5.

Figure 2. Illustration of score computation in a non-recursive case.

The calculation that we actually use is a slightly expanded version that accounts for non-determinism. As an example, in Figure 2(D) it is impossible to say whether the transition B →a matches the upper or the lower A →a. In (E) there are four possible ways in which the outgoing transitions labelled a from states A and B can match each other. To account for this, we take the total number of transitions that might overlap and divide it by the total number of outgoing transitions. The number of matching outgoing transitions from states A and B is counted in terms of the number of individual pairs of target states that can be reached with any permutation of matching transitions. This is defined as follows:

Succ_{A,B} = {(S_A, S_B) ∈ Q_A × Q_B | ∃σ ∈ Σ . A →σ S_A ∧ B →σ S_B}    (1)

Equation 1 computes the set of matching pairs of states with respect to outgoing transitions. However, we are interested in the structure of the state machine, which characterises a state both in terms of its potential past behaviour (incoming transitions) as well as its potential future behaviour (outgoing transitions). For this reason, we define the set of matching incoming transitions as follows:

Prec_{A,B} = {(S_A, S_B) ∈ Q_A × Q_B | ∃σ ∈ Σ . S_A →σ A ∧ S_B →σ B}    (2)

The incoming and outgoing matches can be combined into the set of matching surrounding transitions as follows:

Surr_{A,B} = Prec_{A,B} ∪ Succ_{A,B}    (3)

Given that the number of possible pairs of states surrounding matched transitions is |Surr_{A,B}|, the score between states A and B can be written as:

S(A, B) = |Surr_{A,B}| / (|Σ_A − Σ_B| + |Σ_B − Σ_A| + |Surr_{A,B}|)    (4)
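Equations (1)-(4) can be sketched directly in code. The following is our own illustration, not code from the paper; machines are encoded as sets of (source, label, target) triples, and the two machine fragments at the end are hypothetical:

```python
# Sketch of the local similarity score of equation (4).
# A machine is a set of (source, label, target) transitions.

def sigma(trans, s):
    """Sigma_s: alphabet elements on transitions entering or leaving s."""
    return {l for (a, l, b) in trans if a == s or b == s}

def succ(ta, tb, A, B):
    """Eq. (1): target pairs reachable from A and B on a shared label."""
    return {(x, y) for (a, l, x) in ta if a == A
                   for (b, m, y) in tb if b == B and m == l}

def prec(ta, tb, A, B):
    """Eq. (2): source pairs that reach A and B on a shared label."""
    return {(x, y) for (x, l, a) in ta if a == A
                   for (y, m, b) in tb if b == B and m == l}

def local_score(ta, tb, A, B):
    """Eq. (4): |Surr| / (|Sig_A - Sig_B| + |Sig_B - Sig_A| + |Surr|)."""
    s = succ(ta, tb, A, B) | prec(ta, tb, A, B)      # eq. (3)
    sa, sb = sigma(ta, A), sigma(tb, B)
    denom = len(sa - sb) + len(sb - sa) + len(s)
    return len(s) / denom if denom else 0.0

# Hypothetical fragments: A has outgoing a and b, B has only a.
ta = {("A", "a", "A1"), ("A", "b", "A2")}
tb = {("B", "a", "B1")}
print(local_score(ta, tb, "A", "B"))  # -> 0.5
```

Here one of A's two surrounding transitions is matched, giving the same 0.5 score as example (C) in Figure 2.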

A pair of states (A, B) may be considered incompatible. This is the case if an outgoing transition A →x is considered valid in one machine but B →x is invalid (or vice versa); with respect to the PLTS, the same transition x belongs to Ψ in one machine and to ∆ in the other. If this is the case, the pair is given a score of -1. By assigning negative scores, we draw a distinction between states that are merely dissimilar (which in the worst case obtain a score of 0) and states that are definitely incompatible. It is important to note that the inclusion of incompatibility information is not crucial for the purposes of this work; the algorithm can just as well be applied to a simple (non-partial) LTS.

Computing global similarity

The previous subsection shows how states are matched in terms of their adjacent transitions. However, we want to be able to account for the similarity of pairs of states in terms of their wider context. When comparing a pair of states, for every matching pair of adjacent transitions we want the final similarity score to account for the similarity of the source or target states of these transitions as well. For two matched transitions, we want to produce a higher score if the source / target states of these transitions are identical, and a lower score if they are dissimilar.

We now describe an algorithm that extends the local similarity scoring scheme to compare states recursively, creating an aggregate score that accounts for the similarity of the states connected by the surrounding transitions. We start from the local similarity score for (potentially) non-deterministic states shown in equation (4). That score is entirely dependent on the number of matching surrounding transitions |Surr_{A,B}|; we extend it to aggregate the similarity scores of every matched pair of surrounding states:

S(A, B) = (|Surr_{A,B}| + Σ_{(s1,s2) ∈ Surr_{A,B}} S(s1, s2)) / (|Σ_A − Σ_B| + |Σ_B − Σ_A| + |Surr_{A,B}|)    (5)

Equation (5) augments the local similarity equation to recursively account for the similarity of surrounding states. This can however lead to unintuitive scores, because the score of every successive pair of states is given equal precedence; the score can be skewed by high scores for state pairs that are far away. To give precedence to state pairs that are closer to the original pair of states, we introduce an attenuation ratio k, which gives rise to equation (6):

S(A, B) = (|Surr_{A,B}| + k · Σ_{(s1,s2) ∈ Surr_{A,B}} S(s1, s2)) / (|Σ_A − Σ_B| + |Σ_B − Σ_A| + |Surr_{A,B}|)    (6)

Figure 3. Illustration of a system of equations for the computation of scores in the general case.

So the final score, accounting for incompatible pairs of states, is given in equation (7):

S(A, B) = (|Surr_{A,B}| + k · Σ_{(s1,s2) ∈ Surr_{A,B}} S(s1, s2)) / (|Σ_A − Σ_B| + |Σ_B − Σ_A| + |Surr_{A,B}|)   if A and B are compatible
S(A, B) = −1   if A and B are incompatible    (7)

3.2 Identifying Key-Pairs

Section 3.1 shows how to compute the similarity of two states. In this section we show how this notion can be used to compare all of the states in two state machines. This will give us a similarity score for every pair of states; the pairs of states with the highest scores are referred to as key pairs, and are the starting point for the FSMDiff algorithm described in the next subsection.

Given two state machines X and Y, it is possible to solve the system of equations S(q_x, q_y) for every pair (q_x, q_y) ∈ Q_X × Q_Y. The equation system is obtained by supplying the local knowledge for every pair of states (i.e. Surr_{q_x,q_y} and the alphabet differences), so that the only unknown variables in each equation are the scores of the neighbouring pairs of states. Figure 3 contains an example of two state machines, and the matrix containing the system of equations that results from their comparison. The first row can be obtained by rearranging the scoring equation S(A, E) (equation (7)). Here Surr_{A,E} = {(B, F)}, which means that |Surr_{A,E}| = 1, |Σ_A − Σ_E| = 0 and |Σ_E − Σ_A| = 1. Put together, the equation is S(A, E) = ½ · (1 + k·S(B, F)) / (0 + 1 + 1), which can be rewritten as 4·S(A, E) − k·S(B, F) = 1; this provides the encoding for the first row of the matrix. A solution to the system of equations produces a similarity score for every pair of states.

We use the system of linear equations to identify those pairs of states that are most likely to be equivalent, i.e. those pairs that produce the highest scores. We adopt a two-stage approach to select them. First, we use a threshold parameter and consider only those pairs that fall above it (e.g. only the top 25%). However, a state may happen to be similar to several other states; even if it is well matched to all of them, it is unclear which of them it should be paired with. For this reason a second criterion is introduced: the ratio of the best match to the next-best one. With a ratio of two, only pairs where the best match is at least twice as good as any other match are added to the set of key pairs. Intuitively, if one plots the scores between a state and all other states, the first parameter is a horizontal threshold eliminating everything under it, and the second checks that there is a clear 'peak' in the plot.
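The construction and solution of this system of equations, together with the two-stage key-pair selection, can be sketched as follows. This is our own illustration, not the StateChum implementation: it uses a dense NumPy solver rather than UMFPACK, it mirrors the matrix encoding of the worked example (including its ½ factor), it omits the incompatibility case of equation (7), and the absolute threshold in `key_pairs` is a simplification of the percentile threshold described above.

```python
import numpy as np

# Sketch: global similarity scores via a linear system, then
# key-pair selection. Machines are sets of (source, label, target)
# triples; k is the attenuation ratio.

def sigma(trans, s):
    """Alphabet elements on transitions entering or leaving state s."""
    return {l for (a, l, b) in trans if a == s or b == s}

def surr(ta, tb, A, B):
    """Matching surrounding pairs Surr_{A,B} (equations (1)-(3))."""
    out = {(x, y) for (a, l, x) in ta if a == A
                  for (b, m, y) in tb if b == B and m == l}
    inc = {(x, y) for (x, l, a) in ta if a == A
                  for (y, m, b) in tb if b == B and m == l}
    return out | inc

def global_scores(ta, tb, k=0.5):
    """Solve for S(qx, qy) over all state pairs of the two machines."""
    qa = sorted({s for t in ta for s in (t[0], t[2])})
    qb = sorted({s for t in tb for s in (t[0], t[2])})
    pairs = [(a, b) for a in qa for b in qb]
    idx = {p: i for i, p in enumerate(pairs)}
    M = np.zeros((len(pairs), len(pairs)))
    rhs = np.zeros(len(pairs))
    for (A, B), i in idx.items():
        s = surr(ta, tb, A, B)
        denom = (len(sigma(ta, A) - sigma(tb, B))
                 + len(sigma(tb, B) - sigma(ta, A)) + len(s))
        # Row encodes 2*denom*S(A,B) - k*sum(neighbour scores) = |Surr|,
        # matching the worked example's row 4S(A,E) - kS(B,F) = 1.
        M[i, i] = 2 * denom if denom else 1.0
        for q in s:
            if q in idx:
                M[i, idx[q]] -= k
        rhs[i] = len(s)
    return dict(zip(pairs, np.linalg.solve(M, rhs)))

def key_pairs(scores, threshold=0.5, ratio=2.0):
    """Stage 1: score above threshold; stage 2: best match at least
    `ratio` times better than the runner-up for that state."""
    picked = []
    for a in {p[0] for p in scores}:
        row = sorted(((s, p) for p, s in scores.items() if p[0] == a),
                     reverse=True)
        best = row[0]
        second = row[1][0] if len(row) > 1 else 0.0
        if best[0] > threshold and best[0] >= ratio * max(second, 1e-9):
            picked.append(best[1])
    return picked

# Hypothetical one-transition machines: p0 -a-> p1 and r0 -a-> r1.
ta = {("p0", "a", "p1")}
tb = {("r0", "a", "r1")}
scores = global_scores(ta, tb, k=0.5)
print(sorted(key_pairs(scores)))  # -> [('p0', 'r0'), ('p1', 'r1')]
```

Each state pair contributes one row, so the matrix is sparse for real machines, which is why the implementation described below resorts to a sparse solver.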

3.3 Computing the difference between two machines

The process of comparing two graphs is similar to the cognitive process a human might undertake when navigating through an unfamiliar landscape with a map. A map reader operates in terms of landmarks, selecting an easily identifiable point in the landscape and trying to find it on the map, or vice-versa. The cognitive process of comparing two graphs is similar: easily identifiable states (e.g. states with a unique surrounding topology of states and transitions) are used as landmarks, and the rest of the comparison is carried out with respect to them.

The FSMDiff algorithm we present below imitates this process; landmarks are key-pairs of states, and the rest of the state machine is navigated (compared) with respect to them. Whereas a human might rely on merely a couple of landmarks for the sake of navigating from one point to another, our algorithm makes a wholesale comparison between the two graphs. For this reason it maps every feature in one graph to the other (i.e. it identifies every pair of landmarks). To do so, it starts off with the most obvious pairs of equivalent nodes, and uses these to compare the surrounding graph areas, until no further pairs of landmarks can be found. What is left is the difference between the two graphs. The algorithm is shown in Algorithm 1. The external functions are largely self-explanatory: computeScores(PLTS_A, PLTS_B) and identifyLandmarks(PairsToScores, t, k) were described in the previous subsections; pickHighest(PairsToScores) picks the pair of states in PairsToScores with the highest score; removeConflicts(Pairs, (q_a, q_b)) removes any pair (x, y) from Pairs where either x = q_a or y = q_b.

Algorithm 1: FSMDiff
Input: PLTS_A, PLTS_B, t, k
  /* PLTS_A and PLTS_B are the two machines, t is the threshold used to identify landmark pairs, and k is the attenuation ratio */
Data: KPairs, PairsToScores, NPairs
Result: (Added, Removed)  /* sets of transitions */
 1: PairsToScores ← computeScores(PLTS_A, PLTS_B)
 2: KPairs ← identifyLandmarks(PairsToScores, t, k)
 3: if KPairs = ∅ then
 4:     KPairs ← {(p_0, q_0)}  /* p_0, q_0: initial states of PLTS_A, PLTS_B */
 5: end
 6: NPairs ← ⋃_{(q_A,q_B) ∈ KPairs} Surr(q_A, q_B) \ KPairs
 7: while NPairs ≠ ∅ do
 8:     while NPairs ≠ ∅ do
 9:         (q_A, q_B) ← pickHighest(NPairs, PairsToScores)
10:         KPairs ← KPairs ∪ {(q_A, q_B)}
11:         NPairs ← removeConflicts(NPairs, (q_A, q_B))
12:     end
13:     NPairs ← ⋃_{(q_A,q_B) ∈ KPairs} Surr(q_A, q_B) \ KPairs
14: end
15: Added ← {(a →σ b) ∈ ∆_B | ∄(c →σ d) ∈ ∆_A . (c, a) ∈ KPairs ∧ (d, b) ∈ KPairs}
16: Removed ← {(a →σ b) ∈ ∆_A | ∄(c →σ d) ∈ ∆_B . (a, c) ∈ KPairs ∧ (b, d) ∈ KPairs}
17: return (Added, Removed)

3.4 Using FSMDiff to compute precision and recall

FSMDiff returns the structural difference between two FSMs. For certain tasks, such as comparing the accuracy of two inferred state machines, it is however also necessary to make some quantitative evaluation of their accuracy. Given the output from FSMDiff, this is relatively straightforward. Here we show how the output can be used to compute the precision and recall [14] of a state machine. Previous work by the authors [17] shows how this can be achieved with respect to the languages of two machines; this section provides an equivalent measure with respect to the actual state transition structure.

Precision and recall is a measure originally developed by van Rijsbergen to evaluate the accuracy of information retrieval techniques [14]. The measure consists of two scores: precision measures the 'exactness' of the subject, and recall measures its 'completeness'. Given a set REL of relevant elements and a set RET of retrieved elements, precision and recall are computed as follows: Precision = |REL ∩ RET| / |RET|, Recall = |REL ∩ RET| / |REL|.

In our case, given two state machines A and B, we assume that machine A is considered 'relevant'. Thus REL = ∆_A and RET = ∆_B. The intersection of RET and REL is computed with the help of FSMDiff, and can be defined as follows:

REL ∩ RET = {δ ∈ (∆_A ∪ ∆_B) | δ ∉ (Added ∪ Removed)}

3.5 Implementation

The FSMDiff algorithm has been implemented as part of our state machine inference and analysis framework StateChum¹. The user can supply two state machines, and the difference between them is graphically displayed in a window. The linear system of equations is solved with the UMFPACK library [6], which takes advantage of the sparse nature of the matrices that are produced. This is combined with the Automatically Tuned Linear Algebra Software (ATLAS) package [18], which can be calibrated to take advantage of any specific hardware features.

In a preliminary, informal study to assess the scalability of FSMDiff, it was used to compare random state machines of various sizes. Bearing in mind that the number of states is only a crude indicator of the complexity of a machine, the algorithm scaled very well for machines with up to 1000 states, but was deemed too expensive for machines beyond that size. A more formal, controlled evaluation of scalability is part of our future work (see Section 6).

¹ http://statechum.sourceforge.net

Figure 4. Original specification of CVS client

4 Practical Example and Discussion

This section serves to show how FSMDiff can be used in practice. This is illustrated with a small example: comparing the state machines produced by two diverse state machine inference techniques. This is followed by a discussion of the merits and potential problems that can arise when the technique is applied in practice.

4.1 Applying FSMDiff in practice: comparing the accuracy of two state machine inference techniques

Comparing state machines is particularly useful for the evaluation of different reverse-engineering techniques, and this was the main motivation for our technique. Here we take a small model of a CVS client (shown in Figure 4, derived from a similar model by Lo et al. [11]), along with a sample of program traces. We use two different inference techniques to infer models from the traces, and use our FSMDiff implementation to compare the results with the ideal specification.

The two techniques we use are Markov and EDSM. Markov was proposed by Cook and Wolf [5] and uses the concept of Markov models to find the most probable state machine. EDSM [9] is an inference algorithm that originated in the domain of grammar inference; without going into detail, EDSM is a heuristic alternative to the k-Tails algorithm [1] that scores sets of state-pairs before merging the pair with the highest score.

The purpose of this section is primarily to illustrate an application of the FSMDiff algorithm. A comprehensive comparison of inference approaches is beyond the scope of this paper. A general, empirically valid comparison would of course require a larger set of models and a diverse range of input samples, whereas here we consider only a single arbitrary sample and a single target model. Nonetheless, we can still use this illustrative example to make some anecdotal remarks about the quality of the inferred models with the help of FSMDiff.

The inferred models are presented in Figure 5. Conventionally these would be compared in terms of their languages. A precision and recall-based comparison of the languages [17] is provided in the table under the headings Prec_Lang and Rec_Lang. The scores indicate a sharp contrast between the two approaches. Positive precision measures what proportion of positive event sequences in the inferred model are present in the target; positive recall measures the proportion of positive event sequences in the target that are captured by the inferred model. It is also possible to compute negative precision and recall to measure the accuracy of the machines with respect to invalid sequences of events, but we omit these here for the sake of simplicity. The language measures indicate that the model produced by the Markov learner is more precise than EDSM; in fact, they suggest that the language of the Markov model is a proper subset of the language of the target model.

The EDSM model has a much higher positive recall, which means it correctly inferred a broader range of the target language. However, the low precision score suggests that it includes a lot of false-positive sequences. This is about as much as can be ascertained from language-based scores: the generality of one machine versus another. With our FSMDiff algorithm, however, we can gain more insight into the accuracy of the structure of the inferred machines themselves. The output from the FSMDiff algorithm is presented in Figure 6 for the Markov model and in Figure 7 for the EDSM model. Dashed (green) transitions are added by the inferred model and thin (red) transitions are removed; the bold transitions in the FSMDiff output are those that are correctly inferred. In this case, 15 out of the 21 edges in the Markov model are also present in the target, whereas only 10 of the 21 edges in the EDSM model are accurate. Alternative precision and recall measures introduced in


Figure 6. Results of FSMDiff for Markov model


section 3.4 are presented in the table in figure 5, under the headings PrecFSM and RecFSM. These scores still show that the Markov model is more precise, but they offer a different perspective on recall: here the EDSM score is much lower. EDSM produces models that are more general in terms of their language because it tends to merge too many states together. Although these merges are often wrong, the language-based score rewards them with a higher recall. Under the structural precision and recall scores produced here, however, they are penalised, because the measure accounts for the set of transitions that disappear as a result of erroneous merges. Interestingly, there is very little overlap between the transitions that were correctly inferred by the two approaches. The patch of correct transitions in the EDSM model starting with the appendfile event is missing from the Markov model, and most of the correct transitions in the Markov model are missing from the EDSM model. This suggests that the two techniques offer complementary strengths, something that could not be ascertained from the language-based measures.
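Once FSMDiff has matched the states, the structural scores themselves are straightforward to compute. The sketch below is illustrative rather than the authors' implementation: each transition is modelled as a (source, label, destination) triple, and the correctly inferred transitions are those common to both machines.

```python
def structural_scores(target_edges, inferred_edges):
    # Edges are (source_state, label, destination_state) triples, with
    # states already identified via the FSMDiff state-pair mapping.
    retained = target_edges & inferred_edges  # correctly inferred transitions
    precision = len(retained) / len(inferred_edges)
    recall = len(retained) / len(target_edges)
    return precision, recall

# For the Markov model above, 15 of its 21 inferred edges are correct,
# giving PrecFSM = 15/21, which rounds to the 0.71 reported in figure 5.
```

In other words, the structural measures are ordinary precision and recall [14], applied to transition sets rather than to sets of accepted sequences.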


4.2 Discussion

           Markov   EDSM
PrecLang     1      0.47
RecLang      0.45   0.8
PrecFSM      0.71   0.47
RecFSM       0.57   0.38

Figure 5. Inferred models with precision and recall scores

FSMDiff has a number of advantages over existing state machine comparison techniques (see section 2). Most state machine comparison techniques require their inputs to be minimal and deterministic. This is not the case with FSMDiff, which can just as easily be applied to non-minimal and non-deterministic machines; the Markov model above is non-deterministic. The example also demonstrates that FSMDiff accepts input graphs that are not even fully connected. Besides its versatility, the technique also tends to be very accurate. In this context, a comparison technique is accurate if it does not claim that there are more differences between two machines than there are in reality. Given that FSMDiff identifies the correct state-pairs, it will produce a minimal



Figure 7. Results of FSMDiff for EDSM model

‘diff’ between the two graphs. Other techniques, such as Quante and Koschke’s [13], rely on determinisation procedures that can (in theory) cause an explosion in the number of states, leading to the inclusion of a large number of superfluous states and transitions in the final diff.

The performance of the algorithm is contingent upon the values of the threshold and attenuation parameters (t and k in algorithm 1). If the parameters are suitable, the algorithm will identify a sufficient set of equivalent key pairs, which leads to an accurate comparison. If the set of initial pairs is insufficient, the two machines will not be compared in their entirety, leading to an inaccurate comparison. The selection of the values depends on the characteristics of the state machines. If they are highly homogeneous, with lots of very similar vertices, the threshold should be higher, to ensure that the algorithm only starts with the pairs of states that stand out as being particularly similar. Typically, however, software models contain a set of states that can easily be distinguished from each other. When this is the case, the authors’ anecdotal experience suggests that setting the threshold value to 25% and the attenuation factor to 0.5 produces correct results.
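One plausible sketch of the key-pair selection step follows. This is a hypothetical reading, not the paper's exact definition: candidate pairs are kept if their score reaches the threshold t (taken here as a fraction of the best score), and pairs that conflict by sharing a state are then removed greedily, in the spirit of the algorithm's removeConflicts step. The function name and threshold semantics are the author's illustration, not part of algorithm 1.

```python
def pick_landmarks(scores, t=0.25):
    # scores: dict mapping (state_a, state_b) -> similarity score.
    # Hypothetical sketch: the paper's exact threshold semantics may differ.
    best = max(scores.values())
    candidates = sorted((pair for pair, s in scores.items() if s >= t * best),
                        key=lambda pair: scores[pair], reverse=True)
    landmarks, used_a, used_b = [], set(), set()
    for (qa, qb) in candidates:
        if qa not in used_a and qb not in used_b:  # conflict removal
            landmarks.append((qa, qb))
            used_a.add(qa)
            used_b.add(qb)
    return landmarks
```

A higher t shrinks the candidate set to only those pairs that stand out as particularly similar, which matches the guidance above for highly homogeneous machines.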

5 Related Work

The problem of comparing state machines of software systems is related to the challenge of fault location in model-based testing [7]. Here, the specification and implementation are both perceived as finite state machines, and the challenge is to work out where the implementation deviates from the specification. That challenge is, however, exacerbated by the fact that the implementation is treated as a black box, so the difference can only be established by probing the implementation with tests derived from the specification, and using the output from the implementation to deduce where the faults are occurring.

The problem shares several traits with the protein sequence-alignment problem [12], which is concerned with aligning two strings of amino acids. The protein-alignment problem is solved by finding the optimal alignment of two proteins that maximises their overlapping segments. The challenge of comparing state machines can be interpreted as an alignment problem; here, however, the primary challenge is to identify the equivalent / overlapping state pairs, as opposed to symbol sequences. Both solutions invariably involve the computationally intensive task of identifying ‘landmarks’ to locate overlapping states / protein segments.

The challenge is a specific instance of the graph isomorphism problem, called error-correcting graph isomorphism [3]. Conventional approaches to computing the edit-distance between graphs take two undirected graphs as input, and have a wide range of applications, particularly in computer vision and structural pattern recognition. We are concerned specifically with directed graphs. In our case, nodes are not merely compared by their topological location in the graph, but also by the labels associated with their incoming and outgoing edges.
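To make the sequence-alignment analogy above concrete, the following is a minimal sketch of the Needleman-Wunsch dynamic programme [12], which computes the optimal global-alignment score of two sequences; FSMDiff faces the analogous, but harder, task of aligning state pairs rather than symbol positions. The scoring parameters here are illustrative defaults, not values from either paper.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # Needleman-Wunsch [12]: F[i][j] holds the best score for aligning
    # the first i symbols of a with the first j symbols of b.
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch),
                F[i - 1][j] + gap,   # gap in b
                F[i][j - 1] + gap)   # gap in a
    return F[m][n]
```

The dynamic programme works because sequence positions are totally ordered; state machines lack such an ordering, which is why FSMDiff must instead solve for landmark state pairs directly.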

6 Conclusions and Future Work

We have presented an algorithm for computing the difference between two finite state machines, along with a proof-of-concept implementation. The FSMDiff algorithm has several advantages over existing approaches to state machine comparison: the machines can be non-deterministic, non-minimal, and disconnected. The implementation has been tested on reasonably large state machines, and has scaled well for state machines with hundreds of states. In our future work we intend to carry out a more formal study into the scalability of the approach. The technique depends on solving linear systems of equations. This is known to scale reasonably well if the matrix of equations is sparse, and it will be our priority to investigate how scalability is affected by dense state machines, which would produce correspondingly dense matrices.

Acknowledgements

This research is supported by the AutoAbstract EPSRC grant (EP/C511883/1). The authors thank Jonathan Cook at New Mexico State University for kindly supplying a copy of the Balboa software, which contained the Markov learner used in our case study.

References

[1] A. Biermann and J. Feldman. On the synthesis of finite-state machines from samples of their behavior. IEEE Transactions on Computers, 21:592–597, 1972.
[2] E. Börger. Abstract state machines and high-level system design and analysis. Theoretical Computer Science, 336(2-3):205–207, 2005.
[3] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689–694, Aug. 1997.
[4] K. Cheng and A. Krishnakumar. Automatic functional test generation using the extended finite state machine model. In 30th ACM/IEEE Design Automation Conference, pages 86–91, 1993.
[5] J. Cook and A. Wolf. Discovering models of software processes from event-based data. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998.
[6] T. Davis. Algorithm 832: UMFPACK V4.3, an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software, 30(2):196–199, 2004.
[7] R. Hierons. Minimizing the cost of fault location when testing from a finite state machine. Computer Communications, 22(2):120–127, 1999.
[8] J. Hopcroft, R. Motwani, and J. Ullman. Introduction to Automata Theory, Languages, and Computation, Third Edition. Addison-Wesley, 2007.
[9] K. Lang, B. Pearlmutter, and R. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In Proceedings of the International Colloquium on Grammar Inference (ICGI'98), volume 1433, pages 1–12, 1998.
[10] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, Feb. 1966. Originally in Doklady Akademii Nauk SSSR, 163(4):845–848, 1965.
[11] D. Lo and S. Khoo. QUARK: Empirical assessment of automaton-based specification miners. In Proceedings of the Working Conference on Reverse Engineering (WCRE'06), pages 51–60. IEEE Computer Society, 2006.
[12] S. Needleman and C. Wunsch. A general method applicable to the search of similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[13] J. Quante and R. Koschke. Dynamic protocol recovery. In Proceedings of the International Working Conference on Reverse Engineering (WCRE'07), pages 219–228, 2007.
[14] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.
[15] M. Sorrows and S. Hirtle. The nature of landmarks for real and electronic spaces. In Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science, pages 37–50. Springer, Berlin, 1999.
[16] S. Uchitel, J. Kramer, and J. Magee. Behaviour model elaboration using partial labelled transition systems. In ESEC/SIGSOFT FSE, pages 19–27. ACM, 2003.
[17] N. Walkinshaw, K. Bogdanov, and K. Johnson. Evaluation and comparison of inferred regular grammars. In Proceedings of the International Colloquium on Grammar Inference (ICGI'08), St. Malo, France, 2008.
[18] R. Whaley and A. Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, February 2005.

