Hardening Adversarial Prediction with Anomaly Tracking

M.A.J. Bourassa
Department of Mathematics and Computer Science, Royal Military College of Canada

D.B. Skillicorn
School of Computing, Queen's University

Abstract—Predictors are often regarded as black boxes that treat all incoming records exactly the same, regardless of whether or not they resemble those from which the predictor was built. This is inappropriate, especially in adversarial settings where rare but unusual records are of critical importance and some records might occur because of deliberate attempts to subvert the entire process. We suggest that any predictor can, and should, be hardened by including three extra functions that watch for different forms of anomaly: input records that are unlike those previously seen (novel records); records that imply that the predictor is not accurately modelling reality (interesting records); and trends in predictor behavior that imply that reality is changing and the predictor should be updated. Detecting such anomalies prevents silent poor predictions, and allows for responses such as human intervention, using a variant process for some records, or triggering a predictor update.

I. INTRODUCTION

Humans have an ability to adapt to changing and unforeseen circumstances that has not proven straightforward to reproduce in computer systems. If two people agree to meet and one of them does not appear, the other does not wait for ever – but this kind of behavior happens all the time in computer systems. This ability of humans to “go meta” has been used as an argument that the human mind is something more than a Turing machine [7]. Here we suggest ways in which systems, in particular adversarial data analysis systems, can be constructed so that they are, in a limited way, reflexively aware of goals and process. When unanticipated actions occur, they are able to react, or at least provide hooks to which reactions can be attached. For concreteness, we will concentrate on prediction in adversarial settings, but the entire discussion applies to other forms of analysis as well. Some of the weaknesses of knowledge-discovery technologies such as prediction in adversarial settings are:





• They can be misleading. Most predictors will silently predict a record to be in a given class whether or not it resembles any record seen during the training process. They rarely provide associated confidences about their predictions, and these confidences can themselves be misleading.
• They are easily manipulated, either by altering data at collection (e.g. identity theft), changing the data to alter the prediction boundary, or changing the data to cause the prediction results not to be acted upon (e.g. social engineering).
• They rarely take into account that the world being modelled is changing and the predictor may need to be updated to reflect this.

Anomaly-tracking tools can be wrapped around predictors to make up for these deficiencies, providing signals when the action of the predictor may need to be enhanced or discounted.

A typical situation is shown in Figure 1. The points A and B will be predicted to be in the cross class, even though neither looks like any previously seen record in this class. Worse still, the prediction for point A will be made with high confidence, as it is far from the decision boundary. The point labelled C will be predicted to be in one of the classes, but small changes in the construction of the predictor might have caused it to be predicted in the other class. Records very similar to C might be predicted to be in either class. Furthermore, if a point such as B can be inserted in the training data as a member of the circle class, then its presence will cause the decision boundary to rotate slightly, creating regions (worse still, known regions) where predictions will be incorrect [3].

Fig. 1. A typical prediction scenario.

These problems, though in a way obvious, have received very little attention in the literature. The contribution of this paper is to propose a general scheme for allowing knowledge-discovery tools to “go meta” by detecting when the results they produce are not meaningful. Signals that this has occurred can be used for a range of responses, from applying special-purpose tools to ‘difficult’ records down to simply discounting results that are questionable. We define several ways of thinking about out-of-band data that require different kinds of responses, and suggest some of the technologies that can be used to implement the necessary reflexive awareness in knowledge-discovery tools.






II. A MODEL FOR ADAPTATION

In order to be able to react to unforeseen circumstances, each action must have, associated with it, three different kinds of models of what should happen. In other words, reacting to the unforeseen can only happen when there is some sense of the foreseen. The three different models of what should happen are:

• An input model, m(i). This model determines whether an input to the system meets the criteria for being a ‘normal’ input or not. This model is embodied in questions such as “Have I been in this situation before?”. Inputs that fail to meet the criteria are novel.
• A mapping model, m(i, o). This model determines whether the mapping of input to output meets the criteria for being ‘normal’. This model is embodied in questions such as “Do I know what will happen?”. Inputs that fail to meet the criteria are interesting. (As Asimov famously said, “The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, ‘hmm.... that’s funny...’”.)
• A process model, {m(i, o)}. This model determines whether the action is ‘normal’ over a time period or sequence of actions. This model is embodied in questions such as “Is it time to change?”.

A given data record may be considered anomalous from the perspective of any or all three of these models, but the action to take in response depends on which of the models considers it anomalous. A novel record signals that some aspect of the data has not been properly accounted for; an interesting record signals that some aspect of the model has not been properly accounted for; and an anomaly in the process model signals that change has not been properly accounted for.

Figure 2 illustrates a very simple dataset. The normal records are shown as crosses. The point A is an example of a point that is novel – it does not resemble any of the other records. The point B is an example of a point that is interesting – its presence suggests that perhaps the lower two clusters are actually only one.

Fig. 2. A typical dataset.

Of course, these three models must be simpler than the action itself, or else they are useless. In human situations, they often seem to be implemented as rules, perhaps nested rules. This simplicity also implies that they must be partially inaccurate. In situations where this matters, as it usually does, the simplicity can be compensated for by making the models conservative. The cost is that there are then more anomalies than there could, in principle, be.

This view of adaptation is only useful if these models can be implemented computationally and at reasonable cost. We now examine this issue, admittedly at a high level because of space limitations. In some settings, it is necessary to set up explicit models in parallel with the action they support. In other settings, the required model is present implicitly in the underlying action and all that is required is to make use of it in a better way. In what follows, we will consider the action to be prediction in an adversarial setting.

III. DETECTING NOVEL RECORDS

Novel records are those that are somehow dissimilar to the records from which the existing predictor was built. In order to consider novelty, therefore, there will always need to be some definition of similarity and some parameter that defines how unusual a record must be to be considered anomalous. Choosing this parameter will, of course, always be somewhat problematic, at least if it is to be chosen in a principled way. There are three broad strategies for detecting when a record is novel:

• A novel record is hard to represent using structure(s) derived from the normal records;
• A novel record lies in regions that are otherwise empty (which requires some geometric view of the data); or
• A novel record is hard to allocate in a clustering.

A. Novel records are hard to represent

One way to distinguish new records that are normal from those that are anomalous is to find a way of expressing the training records in terms of a set of basic building blocks or structures. Novel records are those that are hard, perhaps impossible, to represent using this set. This straightforward idea can be realized in a number of ways.

1) Autoassociative neural networks: Autoassociative neural networks use the standard structure and training algorithm of a feedforward neural network, but instead of learning the classes associated with each input record they learn to reproduce these input records on their outputs. Each record in the training set is presented to the inputs of the network, and the error between its attribute values and the outputs computed by the network is used to adjust the weights on the internal edges using a standard algorithm, perhaps backpropagation. The presence of the hidden layer, which is much smaller than the number of attributes, forces the network to encode the structure of the training data in a compact way. Once the network has been trained, any record similar to those in the training set will be represented fairly well by the network, so the difference between the input and output values will be small. However, a record that is different from those used for training is unlikely to be well represented by the network, and so the difference between the input and output will be large. Hence the autoassociative neural network can act as a novelty detector.
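A minimal sketch of this idea, using scikit-learn's MLPRegressor as a stand-in for an autoassociative network; the hidden-layer size, the scaling, and the threshold percentile are illustrative assumptions rather than choices made in the paper:

    # Autoassociative novelty detection: train a small network to reproduce
    # its inputs and flag records that it reconstructs poorly.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    def fit_autoassociator(X_train, hidden=4, seed=0):
        scaler = StandardScaler().fit(X_train)
        Xs = scaler.transform(X_train)
        net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000,
                           random_state=seed)
        net.fit(Xs, Xs)  # the target is the input itself
        errors = np.linalg.norm(net.predict(Xs) - Xs, axis=1)
        threshold = np.percentile(errors, 99)  # scale of 'normal' reconstruction error
        return scaler, net, threshold

    def is_novel(scaler, net, threshold, record):
        x = scaler.transform(record.reshape(1, -1))
        return float(np.linalg.norm(net.predict(x) - x)) > threshold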



2) Minimum description length: Minimum description length approaches [5] estimate the information content of some structure by how many symbols are required to describe it in some language. For our purposes, we want the language to depend on the training data. New records that resemble the training data should be coded in compact ways, while those that are unlike the training data should require larger representations to code them. This idea can be implemented using compression based on a data-dependent codebook such as a Gray code. If this coding is learned from the training data, then its symbols allow short codings of similar records, but long codings of other records. (Minimum description lengths correspond in a natural way to probabilities of encountering particular records given a particular model, which is exactly what we want.)

3) Non-Negative Matrix Factorization: Any matrix factorization can be understood as expressing the original data records as ‘mixtures’ of a set of underlying ‘factors’. This view is most clearly seen in Principal Component Analysis. Non-negative matrix factorization (NNMF) [6] expresses the original data as a mixture that is purely additive (unlike other decompositions such as, for example, singular value decomposition, where records are mixed both positively and negatively from the underlying factors). NNMF, either automatically (claimed by some authors) or by construction, also emphasizes sparsity in the representation, that is, each record is expressed as a sum of relatively few of the underlying parts. If a non-negative matrix factorization is performed on the training dataset, then the magnitudes of the rows of the mixing matrix (corresponding to the records) should all be small because of the sparsity. New records can be mapped into the same space. If they are novel, the magnitudes of their mixing coordinates may be larger than those of the records in the original data.
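A sketch of how the NNMF view might be used in practice, assuming non-negative data and scikit-learn's NMF; the rank and the percentile threshold are illustrative choices:

    import numpy as np
    from sklearn.decomposition import NMF

    def fit_nmf_detector(X_train, rank=5, seed=0):
        # X_train must be non-negative for NMF.
        model = NMF(n_components=rank, init='nndsvda', max_iter=500,
                    random_state=seed)
        W_train = model.fit_transform(X_train)  # mixing coefficients per record
        norms = np.linalg.norm(W_train, axis=1)
        threshold = np.percentile(norms, 99)    # typical scale for normal records
        return model, threshold

    def nmf_novelty(model, threshold, X_new):
        W_new = model.transform(X_new)          # express new records in the same parts
        return np.linalg.norm(W_new, axis=1) > threshold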

B. Novel records are in ‘empty space’

A second way in which novel records can be seen to be different from those previously encountered is to use a geometric view of the data, and look for records that are somehow ‘far from’ other records, in ‘empty space’. Given a record with m attributes, it can naturally be viewed as a point in m-dimensional space whose coordinates are given by the values of each of its attributes. When some appropriate metric, that is one in which distances are inversely proportional to similarities, can be defined on such a space, the notions of ‘far from’ and ‘empty space’ can be given meanings. Novel records are those whose distances from all of the records in the training set are sufficiently great.

Detecting such a record could be done by computing its distance to all of the training records and classing it as novel if these are all large enough. However, training datasets are typically large, so this would be an expensive calculation for every newly-encountered record. It is more practical to construct some representation of the region(s) where the training data lie and examine each new record to see if it lies within this region or not.
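One concrete realization of such a region test is the 1-class support vector machine discussed in subsection 3 below; a sketch using scikit-learn follows, with illustrative values for nu and gamma (which control how tightly the wrapper fits) and synthetic stand-in data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 4))            # stand-in 'normal' records
    X_new = np.vstack([rng.normal(size=(5, 4)),    # similar to the training data
                       rng.normal(loc=8.0, size=(5, 4))])  # far away: novel

    wrapper = make_pipeline(StandardScaler(),
                            OneClassSVM(kernel='rbf', nu=0.05, gamma='scale'))
    wrapper.fit(X_train)

    # predict() returns +1 for records inside the learned region and -1 outside;
    # here -1 is treated as novel.
    novel = wrapper.predict(X_new) == -1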

1) Wrapping by a convex hull: The simplest way to describe a region populated by the training data is to wrap this data in a convex hull. Newly-encountered records that lie outside this convex hull are then considered to be novel. The underlying assumption in this representation is that new records that lie ‘between’ existing records are generally similar to them. This is plausible for many kinds of data, provided that the attribute values have been normalized in a reasonable way.

2) Covering by small blocks: If convexity is not a justifiable assumption about normal data, then an alternative is to cover the space of the training-data records in another, non-convex way. This can be done by using a set of m-dimensional blocks, each centered on a data record. A newly-encountered record is considered novel if it lies outside the union of these blocks. There are two choices of properties for the covering blocks: (a) what shape they are, and (b) how large they are. For example, the blocks could be m-dimensional cubes or m-dimensional spheres, or other shapes. In each case, a scale parameter, perhaps more than one, is required; for example, the radius of the sphere, or the side length of the cube. Both of these choices encode assumptions about in what way and by how much a novel record must differ from the training data to be considered novel.

3) 1-class support vector machines: Even more fitted wrappings of the training data points can be achieved using support vector machines. Although these are usually used as 2-class predictors, they can also be used as 1-class predictors using radial basis function kernels [9]. These allow the training-data points to be wrapped with surfaces of chosen curvature, so that the wrapper can be fitted arbitrarily tightly to the available points.

4) Far from a manifold – SVD: Often in real datasets the training-data records do not form discrete clusters occupying many or all of the available m dimensions, but rather inhabit a lower-dimensional manifold within the apparently m-dimensional space. This is, of course, not usually easy to detect because the axes are not aligned with the manifold. It is also common for the data to occupy a set of manifolds, each of relatively low dimensionality, but not aligned with each other. In such situations, the empty space unoccupied by data points has a kind of structure. The existence of this structure can be exploited to detect when records are novel. One important way to extract the inherent dimensionality of the data is to use a singular value decomposition (SVD) [4] on the training-data records. This corresponds to a change of basis (and scaling) in which the first new axis is aligned with the direction of maximal variation in the data, the second new axis with the direction of remaining maximal variation uncorrelated with (that is, orthogonal to) the first, and so on. The singular values capture the amount of variation in each dimension. If the data inhabits a manifold of dimension k then the singular values (extents) in dimensions k + 1 and upwards will all be small. For the training data records, the entries in columns k + 1 and higher in the US product matrix will tend to have small magnitude. New records can be transformed using the same SVD, but the corresponding entries will tend to have larger magnitude if they do not lie on the k-dimensional manifold.


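A sketch of this SVD test using numpy; it equivalently measures the residual left after projecting a record onto the first k singular directions, and the assumed manifold dimension k and the threshold rule are illustrative:

    import numpy as np

    def fit_svd_detector(X_train, k=2):
        mu = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
        V_k = Vt[:k]                               # top-k right singular vectors
        resid = (X_train - mu) - (X_train - mu) @ V_k.T @ V_k
        threshold = np.percentile(np.linalg.norm(resid, axis=1), 99)
        return mu, V_k, threshold

    def svd_novelty(mu, V_k, threshold, X_new):
        centered = X_new - mu
        resid = centered - centered @ V_k.T @ V_k  # part not explained by the manifold
        return np.linalg.norm(resid, axis=1) > threshold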

C. Outliers in a clustering

The third way to detect when new data records are unusual is by considering them in relationship to a clustering derived from the training data. We expect that novel records will not fit well with existing clusters, in other words they will be outliers. What it means to be an outlier depends on the particular similarity strategy used by each clustering.

1) Far from a centroid: Distance-based clustering assumes that a cluster is a set of records that are close together in some geometric representation. The best-known example is the k-means algorithm, a two-phase algorithm in which protocenters are moved to the centroids of sets of points, and then points are reallocated to the closest center repeatedly until the allocations become stable. In the standard version of the algorithm, every point belongs to a cluster, although points on the ‘outside’ of the entire set of points can be very far from any center. Novel records can be detected by imposing a maximum distance that a point can be from the center of its allocated cluster. For example, the maximum distance could be slightly larger than half the distance between the most-distant (or, conceivably, the nearest) pair of centers.

2) In a low-probability region: Distribution-based clustering fits a set of probability density functions to the given data in a maximal likelihood way. In practice, these distributions are usually multidimensional Gaussians. The standard algorithm is Expectation-Maximization (EM). Each record is allocated to every cluster with some probability, that is a distribution-based clustering is inherently a soft clustering. A novel record is one whose probability of allocation to all of the clusters is low.

3) In a low-density region ‘outside’: Density-based clustering allocates records to clusters when each point has sufficiently many ‘close’ records in the same cluster. The apparent circularity of this definition is resolved by choosing a single record to nucleate a cluster, and then defining rules, often quite complex, to decide whether its neighbors, and their neighbors, transitively, should be included in the cluster. Density-based clustering algorithms, unlike those discussed previously, incorporate the idea of an outlier in a natural way – a point is an outlier if it is not a member of any cluster. Thus novel records can be detected by considering whether they fulfill the membership criteria of any of the existing clusters. However, points that lie between existing clusters may have no or few neighbors, but would not normally be considered as novel. Thus these techniques may operate best in tandem with a simple boundary test, since novelty requires being in a low-density region that is ‘outside’ the main region occupied by the data.
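A sketch of a density-style check for new records, in the spirit of density-based clustering's noise points; eps and min_samples are illustrative and scikit-learn's NearestNeighbors is assumed for the radius query:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def fit_density_detector(X_train, eps=0.5, min_samples=5):
        nn = NearestNeighbors(radius=eps).fit(X_train)
        return nn, min_samples

    def density_novelty(nn, min_samples, X_new):
        # A record in a low-density region has few training records within eps.
        neighbors = nn.radius_neighbors(X_new, return_distance=False)
        return np.array([len(idx) < min_samples for idx in neighbors])

As noted above, such a test would usually be combined with a simple boundary test so that sparse regions ‘inside’ the data are not flagged as novel.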

4) Connecting high up in a dendrogram: Hierarchical clustering algorithms treat each point initially as a cluster of size 1. The two most similar clusters are then joined into a new cluster, and the process repeated until only a single cluster remains. The order in which clusters are joined is recorded in a dendrogram. There are many different metrics that can be used to judge the similarity of two existing clusters, for example the minimum distance between a point of one and a point of the other, or the average of these distances over all such pairs. It is easy to see, intuitively, that a novel record is one that would be connected into a larger cluster only towards the end of the merging process (high up the dendrogram). However, it is not trivial to turn this intuition into an algorithm, since it requires (at least) recording all of the distance calculations between clusters throughout the construction.
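A sketch of the distance-from-a-centroid test of subsection 1 above, using scikit-learn's k-means; the number of clusters and the threshold rule (just over half the largest inter-center distance, as suggested in the text) are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_kmeans_detector(X_train, n_clusters=3, seed=0):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_train)
        centers = km.cluster_centers_
        gaps = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
        threshold = 0.55 * gaps.max()   # slightly more than half the largest gap
        return km, threshold

    def kmeans_novelty(km, threshold, X_new):
        # transform() gives the distance from each record to every center;
        # a record far from even its closest center is flagged as novel.
        return km.transform(X_new).min(axis=1) > threshold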

IV. DETECTING INTERESTING RECORDS

We now consider records that do not satisfy the model of the expected relationship between input and output. For a predictor as the underlying action, this means records for which the prediction is somehow unexpected or anomalous. We assume, of course, that records have first been checked for normality, since we would not expect a predictor to operate correctly for novel data. As we have seen, detecting novelty often requires choosing a parameter to characterize how different a record must be to be regarded as novel. In contrast, interesting records typically identify themselves without such a parameter choice. There are two broad ways in which records can be detected as interesting:

• They lie close to the decision boundaries of the existing predictor, so that they are either ambiguous or suggest that the boundaries are not quite in the correct places.
• They lie within the region occupied by the existing data (so they are not novel) but they lie far from other data records.

A. Lies close to boundaries

1) Provided confidence: Some predictors provide an associated confidence along with each prediction they make. For example, a neural network can be trained so that there is one output for each class, and the values on the outputs are between 0 and 1 and can be treated as the probabilities that the current input record is a member of each of the classes. In other words, the total outputs approximately sum to 1, and the predicted class is the output with the largest value. Other predictors do not provide an explicit confidence, but the output contains a kind of surrogate for confidence. Support vector machines use maximal margin separators as decision boundaries. In the simplest case, the value allocated to each record is either greater than +1 or less than −1 depending on which of two classes it comes from. The magnitude of the value can be treated as a surrogate for the confidence in the prediction – a value close to zero is a weak prediction for one or other of the classes. Records are interesting when the predictor’s confidence about them is unusually low (assuming, of course, that the predictor has been properly built, with appropriate power for the domain, and high test accuracy).
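A sketch of how such a confidence surrogate might be used to flag interesting records, assuming a classifier that exposes class probabilities; the margin threshold is an illustrative choice:

    import numpy as np

    def interesting_by_confidence(clf, X_new, margin_threshold=0.1):
        # Flag records whose two most probable classes are nearly tied.
        proba = clf.predict_proba(X_new)
        top_two = np.sort(proba, axis=1)[:, -2:]   # two largest probabilities per record
        margin = top_two[:, 1] - top_two[:, 0]
        return margin < margin_threshold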


2) Ensembles – low winning margin: An ensemble predictor [1] is one that contains a number of internal predictors, usually trained from different subsets of the training data. The prediction of the ensemble is the plurality of the predictions of its component predictors. Ensembles can be built using almost any component prediction technology, and are attractive because learning predictors from subsets tends to cause some variance cancelling. Hence, the accuracy of the ensemble is typically better than for a monolithic predictor learned from the same data. The pattern of voting can provide a surrogate for the confidence of the ensemble’s prediction in a number of ways. The simplest is to count the fraction of component predictors that voted for the winning class. The greater this fraction, the more confident the prediction. An interesting record is one where the fraction of component predictors that vote for the winning class is only slightly more than 1/C, where C is the number of classes. However, it is better to consider the difference between the number of predictors that vote for the most-popular class and the number that vote for the next-most-popular class. For example, if there are 100 component predictors, 45 of them vote for class A, 44 for class B, and 11 for class D, then not much confidence should be placed in the prediction, even though the fraction of winning votes is much greater than 1/C. An interesting record is one where the difference between the sizes of these two sets of predictors is small.

3) Random forests: Random forests [2] are decision-tree based predictors, in which each tree is built from a different subset of both the records and attributes of the training data, and the decision about which attribute to test at each internal node is made from a randomly-chosen subset of those available. Random forests are a kind of ensemble predictor, but with several added strengths. The same tests of confidence described in the previous section can be used with random forests. However, random forests calculate a measure of proximity between pairs of training records based not only on how often component trees agree about how to classify them, but on whether or not this occurs at the same leaf nodes of these trees. This finer-grained view of similarity can be extended to new records to compute their proximity to the training records, prefiguring cluster-based measures of interestingness, discussed below.
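A sketch of the winning-margin test computed from the individual trees of a scikit-learn random forest; the margin threshold is illustrative, and the votes are counted on the forest's internal class encoding, which the component trees share:

    import numpy as np

    def vote_margin(forest, X_new):
        # One vote per tree per record, as encoded class indices.
        votes = np.stack([tree.predict(X_new).astype(int)
                          for tree in forest.estimators_])
        n_classes = len(forest.classes_)
        counts = np.stack([np.bincount(votes[:, j], minlength=n_classes)
                           for j in range(votes.shape[1])])
        top_two = np.sort(counts, axis=1)[:, -2:]
        return (top_two[:, 1] - top_two[:, 0]) / len(forest.estimators_)

    def interesting_by_votes(forest, X_new, margin_threshold=0.05):
        # e.g. a 45/44/11 split over 100 trees gives a margin of 0.01, so is flagged.
        return vote_margin(forest, X_new) < margin_threshold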

B. Lies in gaps

The second way that records can be interesting is that they lie far from other records (in some geometric view of the data), but they are not novel, that is they do not lie on the ‘outside’ of the entire data. Clustering algorithms can provide information to determine this. Such records suggest that the visible structure in the data is incorrect or incomplete. For example, such a point may indicate that what appear to be two distinct clusters are actually one; or that there is another cluster present, for which the record is the only member so far observed. Clusters should be drawn homogeneously from a single class, so this has implications for class boundaries.

1) Lies on boundary of a cluster: In distance-based clustering, interesting records are those that lie between two or more clusters, that is equidistant from cluster centers. Although a clustering algorithm will typically allocate such a record to one cluster or other, the allocation is arbitrary, and may depend on such accidental properties as round-off error in the calculations. Unlike the novelty case, an interesting record is one that has approximately the same distance to more than one cluster center, even if these distances are small.

2) In an equal-probability region: In distribution-based clustering, an interesting record is one whose probability density with respect to more than one component distribution is the same, even if these probabilities are large. Such a record is equally well explained by more than one distribution, and so belongs with equal likelihood to more than one cluster.

3) In a low-density region ‘inside’: Density-based clusterings find all of the points that are outliers. Records that are outliers but not novel, that is which occur ‘inside’ the region occupied by all of the data, are interesting.

V. DETECTING PROCESS ISSUES AND TRENDS

The third model is one that observes the actions as they occur repeatedly over time. In our setting, the process model observes the outcomes of predictions over time. The simplest models are those that compute and maintain simple statistics about the process. Alerts may be triggered by changes in these statistics with time. In general, process issues occur because either:

• The predictor has underfit or overfit the actual data, so that errors necessarily increase as more and more data are encountered; or
• The underlying reality is changing, so that the predictor no longer captures appropriate decision boundaries.

The solution in both cases is to rebuild the predictor using more, or fresh, data. Statistics from a number of parts of the process can be tracked, and deviations above a defined threshold used to signal an alert. These include:

• Statistics about the input records. Two subsets of the records, a reference set and a current set, can be maintained. The change in the entropy of each attribute is given by:

H_i = \sum_{c=1}^{C} \sum_{k=1}^{K} \left( n_{ck} \log n_{ck} - m_{ck} \log m_{ck} \right)

where the sum is over C classes, K intervals into which the attribute values are divided, each n is a count of the number of attribute values in a given interval corresponding to a given class in the current set, and each m is the corresponding count for the reference set. The average of the absolute values of H_i over all of the attributes provides a measure for how much the underlying data is changing.


An alert can be issued whenever this value exceeds some predefined threshold, at which point the current set becomes the reference set.

• Statistics about the mapping, for example the distribution of predictions across the available classes. The distribution of records from each class in the training set is known. The distribution for new predictions can be compared to this using the same entropy-based approach as for the input attributes.
• Statistics about the confidences of each prediction when the predictor provides them. For example, in an ensemble the difference between the fraction of predictors predicting the most-popular class and the next most popular class can be recorded. This difference should remain roughly constant when the predictor is working well.
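A sketch of the first of these statistics, the per-attribute change measure defined above; the binning, the treatment of empty intervals (0 log 0 taken as 0), and the choice of K are implementation assumptions, and in practice the class labels for the current window would be the predictor's own outputs:

    import numpy as np

    def _nlogn(counts):
        counts = counts.astype(float)
        out = np.zeros_like(counts)
        nz = counts > 0
        out[nz] = counts[nz] * np.log(counts[nz])
        return out

    def attribute_drift(X_ref, y_ref, X_cur, y_cur, K=10):
        """Average |H_i| over attributes, comparing current to reference counts."""
        classes = np.unique(np.concatenate([y_ref, y_cur]))
        scores = []
        for j in range(X_ref.shape[1]):
            # Intervals are derived from the reference set for this attribute.
            edges = np.histogram_bin_edges(X_ref[:, j], bins=K)
            h = 0.0
            for c in classes:
                m, _ = np.histogram(X_ref[y_ref == c, j], bins=edges)
                n, _ = np.histogram(X_cur[y_cur == c, j], bins=edges)
                h += np.sum(_nlogn(n) - _nlogn(m))
            scores.append(abs(h))
        return float(np.mean(scores))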

In all of these cases, a deviation from the behavior over time that exceeds a given threshold can be used to signal an alert. This approach addresses, in a principled way, the issue of when predictive models need to be updated to reflect changes in the environment. This is especially important in adversarial settings, which can be viewed as a kind of arms race between adversaries and analysts, each trying to exploit weaknesses of the other. This in turn causes the underlying situation being modelled to change much more substantially, and perhaps more rapidly, than in, for example, commercial and business applications.

VI. IMPLICATIONS

We have suggested that, in adversarial settings, analysis tools such as predictors should be surrounded by anomaly-tracking tools that maintain the three models, m(i), m(i, o), and {m(i, o)}, and compare the behavior of the action to the simpler expected behavior embedded in the models. These anomaly-tracking tools provide alerts when some aspect of the analysis goes outside of the envisaged boundaries. The overall module is shown in Figure 3.

Fig. 3. A complete predictor with meta-analysis tools. N is the novelty detector, I is the interestingness detector, P is the process detector, and m(i, o) models the expected mapping by the predictor. Solid lines indicate the normal flow of data, dashed lines the flow to meta-analysis, and dotted lines the possible alerts.

It has been argued elsewhere [8] that adversarial data analysis problems should not be addressed by building single monolithic predictors, but rather by building sequences or more complex arrangements of predictors. Hardened predictors fit well into such systems – the alert outputs can be used to trigger different behavior or actions as a consequence of encountering particular records, for example:

• Alerting a human analyst of a special case that may need extra attention;
• Passing a record to a different analysis path that uses more sophisticated prediction, perhaps involving extra attributes;
• Triggering an incremental update of the underlying predictor;
• Triggering an alarm because it seems likely that the process is being deliberately subverted.
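To make the arrangement of Figure 3 concrete, here is one possible, hypothetical shape for such a wrapper; the class and method names are illustrative rather than taken from the paper, and the three checks would be supplied by detectors of the kinds sketched in Sections III to V:

    class HardenedPredictor:
        """Wrap a base predictor with novelty, interestingness and process checks."""

        def __init__(self, predictor, novelty_check, interest_check, process_monitor):
            self.predictor = predictor              # the underlying prediction model
            self.novelty_check = novelty_check      # m(i): is this input 'normal'?
            self.interest_check = interest_check    # m(i,o): is this mapping 'normal'?
            self.process_monitor = process_monitor  # {m(i,o)}: behaviour over time

        def predict(self, record):
            alerts = []
            if self.novelty_check(record):
                alerts.append('novel')          # e.g. route to a human analyst
            label = self.predictor.predict(record.reshape(1, -1))[0]
            if self.interest_check(self.predictor, record):
                alerts.append('interesting')    # e.g. pass to a more powerful model
            if self.process_monitor.update(record, label):
                alerts.append('process')        # e.g. trigger a predictor rebuild
            return label, alerts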





VII. CONCLUSIONS


In adversarial settings, rare but unusual records are critically important, and adversaries can be expected to probe and otherwise manipulate the prediction process. Using predictors as if they were infallible classification oracles is dangerous because they fail silently in exactly the situations where we do not want them to. We have suggested that the solution is to harden predictors by surrounding them with anomaly-tracking tools that watch for unusual input cases, unusual mappings (predictions), and unusual trends in prediction over time. Basic algorithmic techniques for doing this already exist; they need to be developed and used in practice.

REFERENCES

[1] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[2] L. Breiman. Random forests – random features. Technical Report 567, Department of Statistics, University of California, Berkeley, September 1999.
[3] J.G. Dutrisac and D.B. Skillicorn. Subverting prediction in adversarial settings. In 2008 IEEE Intelligence and Security Informatics, pages 19–24, 2008.
[4] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
[5] P. Grünwald. A tutorial introduction to the minimum description length principle. In Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[6] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
[7] J.R. Lucas. Minds, machines and Gödel. Philosophy, XXXVI:112–127, 1961.
[8] D.B. Skillicorn. Knowledge Discovery for Counterterrorism and Law Enforcement. CRC Press, 2008.
[9] D.M.J. Tax. One Class Classification. PhD thesis, Technical University Delft, 2000.


