Improving Supply Chain Security Using Big Data

David Zage, Kristin Glass, and Richard Colbaugh
Sandia National Laboratories
Albuquerque, NM, USA 87185-9300
{djzage,klglass,rcolbau}@sandia.gov

Abstract—Previous attempts at supply chain risk management are often non-technical and rely heavily on policies/procedures to provide security assurances. This is particularly worrisome as there are vast volumes of data that must be analyzed, and that data continues to grow at unprecedented rates. In order to mitigate these issues and minimize the amount of manual inspection required, we propose the development of mathematically-based automated screening methods that can be incorporated into supply chain risk management. In particular, we look at methods for identifying deception and deceptive practices that may be present in the supply chain. We examine two classes of constraints faced by deceivers, cognitive/computational limitations and strategic tradeoffs, which can be used to develop graph-based metrics to represent entity behavior. By using these metrics with novel machine learning algorithms, we can robustly detect deceptive behavior and identify potential supply chain issues.

Keywords—Supply Chain Risk Management, Deception Detection, Machine Learning

I. INTRODUCTION

In today's world, the very nature of trade and commerce causes supply chains to be globally distributed, interconnected sets of people, organizations, and services. A single weakness or vulnerability in any portion of the supply chain, anywhere across the globe, can adversely affect activity thousands of miles away. Unfortunately, businesses and federal agencies typically have neither a consistent nor a comprehensive way of understanding the often opaque processes and practices used to create and deliver the hardware and software products and services that they procure [1]–[3]. This lack of protocols and understanding significantly increases the challenges and risks of maintaining a viable supply chain that remains free of exploits such as counterfeit materials, malicious software, and untrustworthy products [4].

Previous attempts at supply chain risk management (SCRM) are often non-technical and rely heavily on policies/procedures to provide security assurances [5]–[7]. This is particularly worrisome as there are vast volumes of data that must be analyzed, and that data continues to grow at unprecedented rates. In order to mitigate these issues and minimize the amount of manual inspection required, we propose the development of mathematically-based automated screening methods that can be incorporated into SCRM. This way, we can put big data to use instead of drowning under the flood of data.

In particular, we look at methods for identifying deception and deceptive practices that may be present in the supply chain.

Towards this end, there has been considerable interest in developing robust, reliable methods for detecting and defeating deception. The availability of such methods would represent a crucial advance in SCRM and in a variety of other domains such as cybersecurity, counter-terrorism, border protection, and crime prevention [8], [9]. Recognizing that dependable, scientifically-rigorous detection of deception is a central aspect of this problem, many researchers and practitioners have examined this identification task. Most of this work has focused on exploiting physiological and verbal/linguistic cues, often using concepts and tools taken from psychology, cognitive science, and neuroscience, and has relied upon human expert-intensive observations, evaluations, and decisions [10]–[13]. While recent research has begun to produce promising computational methods for detecting deception [14]–[16], much work remains to be done. In particular, it is desirable to develop automated techniques which discover exploitable deception cues in multiple supply chain domains, combine analyst input with rigorous computational learning strategies, and have been validated against important real-world problems. We note that the goal of this work is not to create a "perfect" SCRM solution, but to help supply chain analysts focus their efforts.

This paper proposes a computational approach to robust deception detection that addresses the above challenges. We begin by identifying constraints faced by deceivers which are both widespread and exploitable. These constraints can be organized into two broad classes:

1) Strategic tradeoffs, reflecting the fact that devoting additional resources to deception typically increases the likelihood that the deception will be successful but decreases the net payoff associated with this success.

2) Cognitive or computational limitations, capturing the fact that "the deceiver cannot think of everything", and in particular that behavioral signatures which are indirectly related to the deceptive activity are both abundant and difficult to distort in a consistent manner.

We examine a variety of graph-based metrics to capture these constraints, which can be used as behavioral signatures for potential deceiver activity. We then employ novel machine learning (ML) algorithms to exploit these constraints to robustly detect deceptive behavior. Importantly, these algorithms require little labeled data (i.e., known examples of honest and deceptive activity), which is expensive to obtain [17]–[19]. The algorithms compensate for the limited availability of labels by leveraging unlabeled data.


Fig. 1. Depiction of a bipartite graph data model, in which instances (red vertices) are connected to the features (blue vertices) they contain, and link weights (black edges) reflect the magnitudes taken by the features in the associated instances.

We demonstrate the effectiveness of our solution through experiments on real-world datasets, showing that supply chain risks can be identified with surprisingly high (> 90%) accuracy.

The rest of the paper is organized as follows: in Section II we provide an overview of deceiver constraints, and in Section III we examine potential metrics that can be used as drivers behind our analytics. In Section IV we demonstrate the effectiveness of using these constraints combined with ML, and we conclude in Section V.

II. OUR APPROACH

"If falsehood, like truth, had only one face, we would be in a better shape. For we would take as certain the opposite of what the liar said. But the reverse of truth has a hundred thousand shapes and a limitless field." - Michel de Montaigne, Essays, 16th century

As noted by Montaigne, identifying deception has always been a difficult, laborious task that has only become more difficult as time has progressed. A key element of our approach to robust deception detection is the recognition that deceivers face constraints, which can be leveraged to reveal deceptions. Identifying these deceptions can then allow us to identify potential supply chain risks. Our analysis has identified two broad classes of constraints that are both widespread and exploitable: cognitive/computational limitations and strategic tradeoffs.

Cognitive/computational limitations capture the fact that "the deceiver cannot think of everything". These limitations often manifest themselves in behaviors and patterns that are indirectly related to the deceptive activity. For instance, it was demonstrated by Colbaugh and Glass [20] that relational data, such as networks of collaborators, are difficult to distort so as to be consistent with the deception. It is even more challenging to manipulate the way such networks evolve over time to produce dynamical patterns that match those found in legitimate interactions [20]. Consider, for example, the Internet site of a supply chain vendor that purports to be something it is not. While it may be straightforward to construct content that is consistent with the deception, and even link to appropriate sites, it is quite another matter to manipulate the inlinks a site receives and the way these links change over time. Additionally, virtually all complex networks, from the Web to social interaction, can be efficiently and effectively represented as graphs. These observations are the basis for the graph-theoretic metrics we develop, which can then be incorporated into behavior signatures and in turn used to power our analytics.
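To make the idea concrete, the following is a minimal sketch (in Python, using networkx) of the kind of graph-based signature extraction described above: it computes the in-degree of a vendor's site and a crude inlink-growth profile from a hyperlink graph. The toy graph, domain names, and the specific metrics are illustrative assumptions, not the exact signatures used in our analytics.

```python
# Minimal sketch of extracting simple inlink-based signatures for a vendor site.
# The toy hyperlink graph and the chosen metrics are illustrative assumptions.
import networkx as nx

# Hypothetical directed hyperlink graph: edge (u, v) means "u links to v".
# Each edge carries an (illustrative) time at which the link was observed.
links = [
    ("partner-a.example", "vendor.example", 1),
    ("partner-b.example", "vendor.example", 2),
    ("news-site.example", "vendor.example", 5),
    ("vendor.example", "standards-body.example", 1),
]
G = nx.DiGraph()
for src, dst, t in links:
    G.add_edge(src, dst, time=t)

def inlink_signature(graph, site):
    """In-degree plus a crude growth profile of the inlinks a site receives."""
    in_edges = list(graph.in_edges(site, data=True))
    times = sorted(d["time"] for _, _, d in in_edges)
    return {
        "in_degree": len(in_edges),
        "first_inlink_time": times[0] if times else None,
        "inlinks_per_period": len(times) / (times[-1] - times[0] + 1) if times else 0.0,
    }

print(inlink_signature(G, "vendor.example"))
```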

The second class of deceiver constraints of interest is strategic tradeoffs. These typically reflect the fact that the deceiver must balance the requirements of the deception process against the underlying goals that are to be achieved via the deception. For instance, devoting additional resources to deception typically increases the likelihood that the deception will be successful but decreases the net payoff associated with this success. Adversarial tradeoffs are naturally represented and analyzed using game theory [21], [22], and as we will demonstrate in Section IV, these tradeoffs can be leveraged to nominate "signatures" that may be associated with deceiver activity. These candidate signatures can then be employed as features in ML algorithms [23] that are designed to distinguish between deceptive and honest behavior.

We exploit the deceiver constraints associated with cognitive/computational limitations and strategic tradeoffs through their incorporation into analyst-informed semi-supervised ML algorithms [24]–[27]. Our approach formulates deception detection as a classification problem, in which deceptive and honest behavior are to be distinguished. As depicted in Fig. 1, the data is modeled as a bipartite graph of activity instances and the features that characterize them. It is assumed that only limited prior information is available regarding the class (deceptive or honest) of any of the instances or features. This label knowledge is augmented with analyst guidance regarding which features might be expected to be useful and with the information present in unlabeled instances, which are typically easy to acquire.

III. METRICS FOR AUTOMATED FRAUD DETECTION

While we focus on leveraging big data to increase supply chain security, one consistent challenge in developing solutions which function accurately in real-world settings is obtaining labeled instances of known ground truth. For example, verified instances of deceptive activity are typically in short supply. For these reasons, we examine domains such as e-commerce which hold similarities to supply chains (e.g., big data, complex interactions) in order to develop our deceiver constraints.

E-commerce sites, such as online auctions, have become extremely popular by creating appealing environments that facilitate convenient interaction between consumers and vendors. As might be expected, these environments also attract dishonest individuals who seek to profit by deceiving honest users. We explore the development of mechanisms for detecting and constraining fraud activity in an unpublished proprietary e-commerce dataset consisting of millions of transactions between millions of users. Besides the transactions (links) and users (vertices), information on the (un)successful completion of each of the transactions was also provided, giving us ground truth.

Many e-commerce sites implement some form of reputation system, which computes and publishes reputation scores for users based on opinions and feedback from their transaction partners (and possibly other behavior measures). One way a fraudster can cheat honest users of a site is to manipulate the site's reputation system in order to appear to be a legitimate transaction partner, typically by engaging in legitimate activity for some time before committing a fraudulent act [28].


This creates a tradeoff: the fraudster must balance the benefit of legitimate behavior (enhanced reputation) against its costs (time and money). Fraudsters manage this tradeoff by inventing clever ways to deceive the reputation system (see the work by Hoffman et al. [28]), and the reputation systems adapt accordingly, leading to an arms race between fraudsters and defenses. One consequence of this co-evolutionary behavior is that the signatures of fraudulent activity become increasingly difficult not only for defenses to detect but also for the fraudsters themselves to manage.

One effective strategy employed by sophisticated fraudsters is to recruit accomplices (or create fake accomplice identities) with whom to interact [29]. The role of these accomplices is to build good reputations for themselves by conducting legitimate transactions with honest users, and then to "propagate" their good reputations to fraudsters by engaging in (or pretending to engage in) legitimate transactions with the fraudsters. This fraudster-accomplice structure enables the following innovative means of deceiving reputation systems. A new fraudster (identity) joins the e-commerce site and inflates his reputation through interactions with accomplices. Once his reputation score is sufficiently high, he is able to attract honest users and defraud them, at which time he is typically removed from the site. However, the accomplices, and their good reputations, remain on the site because they have not engaged directly in the fraud. Thus the fraudster can simply rejoin the site with a new identity, quickly develop a good reputation by transacting with accomplices, and exploit this reputation to commit additional fraud; of course, the cycle can be repeated. It is easy to show through simple calculations that this strategy provides a cost-effective means of circumventing standard reputation systems.

The fraudster-accomplice strategy summarized above is an innovative way for fraudsters to build good reputations without expending excessive resources. However, it turns out that implementing this strategy creates structure in the transaction network of buyers and sellers, which may be easy to detect and difficult for the fraudster to mask [30]. To see this, observe that fraudsters are motivated to interact with accomplices, not with other fraudsters or honest users (at least until the fraud takes place), because their goal is to inflate their reputation. Similarly, accomplices interact with honest users to boost their own reputation, and with fraudsters to propagate this reputation, but there is no incentive for them to interact with other accomplices. As a result, this fraudster-accomplice behavior generates "bipartite cores" within the overall transaction graph. This behavior, which produces subgraphs in which fraudsters are not connected to other fraudsters and accomplices are not connected to other accomplices, can be seen in Fig. 2.

With this in mind, we look to answer three questions: Do these bipartite cores exist in our e-commerce transaction dataset? Can they be efficiently detected in very large transaction networks? And do the cores contain fraudsters? If these bipartite cores actually contain fraudsters, then developing an automated fraudster discovery capability would be straightforward.

Fig. 2. Subgraph of an e-commerce transaction graph. The vertices are users, the edges denote transactions, and the colors denote members of graph communities. The presence of a “bipartite core” of possible fraudsters (red vertices) and accomplices (magenta vertices) is clearly visible.

Interestingly, one method we can use to detect these bipartite cores, even in very large graphs, is unsupervised ML. We employ a simple clustering approach based on spectral analysis and introduce a novel measure of "similarity" with which to cluster the transaction graph: we identify clusters of vertices that tend not to link to each other. This similarity definition is based on the observation that a bipartite core forms a kind of "anti-community" graph structure, consisting of pairs of communities composed of vertices that are connected less densely than would be expected in a random graph. (The standard definition of graph communities, as noted by Newman [31], is groups of vertices with interconnection densities that are greater than expected at random.) This unsupervised ML method can be implemented using standard community detection methods (see [31] for background on community detection), but rather than searching for groups of vertices that maximize intra-community link density, we search for vertex groups that minimize this density. This search procedure can be conducted very efficiently, and so can be applied to large-scale data to detect the bipartite cores characteristic of fraudster-accomplice interaction.

We implemented the proposed approach to detecting bipartite cores and the associated fraudster-accomplice groups on transaction graphs taken from the e-commerce dataset mentioned above. Our tests showed that multiple bipartite cores exist in the dataset and that the cores can be detected accurately and efficiently using the proposed unsupervised ML algorithm. More importantly, the discovered bipartite cores contain confirmed fraudsters, as indicated by information provided with the dataset. For example, we extracted a subgraph of the transaction network consisting of 55 vertices and 310 transactions, shown in Fig. 2; the analysis identified 12 bipartite-core fraudster candidates. Of these 12, 3 were confirmed by outside evidence to be fraudsters and the other 9 were deemed highly suspicious based on the "interaction feedback" scores they received on the site (however, the suspicious nature of the latter 9 was not corroborated by evidence external to the site). Due to time and budgetary constraints, further analysis of the graph and bipartite cores is left for future work.
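The following is a minimal sketch of this kind of anti-community search on a toy transaction graph. It uses a simple spectral heuristic (the eigenvector associated with the most negative adjacency eigenvalue tends to take opposite signs on the two sides of a near-bipartite core) as a stand-in for our similarity-based clustering; the fabricated graph, the heuristic, and the density check are illustrative assumptions, not the algorithm applied to the proprietary dataset.

```python
# Sketch: flag a candidate bipartite core by splitting the graph along the
# eigenvector of the most negative adjacency eigenvalue, then checking that
# both sides are internally sparse ("anti-communities"). Toy data only.
import numpy as np
import networkx as nx

# Toy undirected transaction graph: fraudsters f* trade only with accomplices a*,
# accomplices a* also trade with honest users h* to build reputation.
edges = [("f1", "a1"), ("f1", "a2"), ("f2", "a1"), ("f2", "a2"), ("f3", "a2"),
         ("a1", "h1"), ("a2", "h2")]
G = nx.Graph(edges)
nodes = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)

# Eigenvector for the most negative eigenvalue of A (eigh sorts ascending).
vals, vecs = np.linalg.eigh(A)
v = vecs[:, 0]
side_a = [n for n, x in zip(nodes, v) if x >= 0]
side_b = [n for n, x in zip(nodes, v) if x < 0]

def internal_density(graph, group):
    """Fraction of possible within-group links that are actually present."""
    k = len(group)
    possible = k * (k - 1) / 2
    return graph.subgraph(group).number_of_edges() / possible if possible else 0.0

print("side A:", side_a, "internal density:", internal_density(G, side_a))
print("side B:", side_b, "internal density:", internal_density(G, side_b))
# Low internal densities on both sides flag a candidate fraudster-accomplice core.
```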


Fig. 3. Analysis of the triadic closure of an e-commerce transaction graph using the clustering coefficient. The fraudsters and accomplices have low clustering coefficients (below the red horizontal line) while benign users have higher clustering coefficients.

Not only can we leverage ML-based community detection, we can also use the triadic closure of the network to identify regions of the transaction graph which are incongruent with the rest of the structure. Triadic closure captures the intuitive sense that if two vertices have a neighbor in common, they are more likely to be linked themselves [32], [33], and is typically measured using the clustering coefficient [34]. As large-scale graphs of interest form, they tend to develop small-world characteristics, in which most vertices are connected by a small number of links and have a higher-than-random clustering coefficient [35]. As the links created by the fraudster-accomplice strategy do not form through natural interactions, fraudster and accomplice vertices will have markedly different clustering coefficients than normal users. We calculate the clustering coefficient C_v of a vertex v as follows:

C_v = e_v / (k_v (k_v − 1))

where k_v is the number of neighbors of v and e_v is the number of links between these neighbors. As we can see from Fig. 3, the fraudsters and their accomplices (the points below the red line) have lower clustering coefficients than the normal users. The benign users have clustering coefficients more in line with those found in small-world graphs, while those of the fraudster and accomplice vertices tend more toward the clustering coefficients found in random graphs.

In this section, we have developed complementary mechanisms for constraining deceivers' behavior patterns. As we will demonstrate in Section IV, we can incorporate these into semi-supervised ML algorithms designed to identify potential supply chain risks.
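As an illustration, the sketch below computes the clustering coefficient defined above for every vertex of a toy transaction graph and flags vertices falling below a threshold, analogous to the red line in Fig. 3. The graph and the threshold are fabricated for illustration; note that networkx's built-in nx.clustering() uses the common undirected normalization 2 e_v / (k_v (k_v − 1)), i.e., twice the value computed here.

```python
# Sketch: per-vertex clustering coefficient C_v = e_v / (k_v (k_v - 1)),
# flagging low-clustering vertices as potential fraudsters/accomplices.
# The toy graph and the 0.1 threshold are illustrative assumptions.
import networkx as nx
from itertools import combinations

def clustering_coefficient(graph, v):
    neighbors = list(graph.neighbors(v))
    k = len(neighbors)
    if k < 2:
        return 0.0
    # e_v: links actually present among the neighbors of v
    e = sum(1 for a, b in combinations(neighbors, 2) if graph.has_edge(a, b))
    return e / (k * (k - 1))

# Toy transaction graph: a triangle-rich "honest" cluster plus a fraudster f
# whose accomplices a* never transact with one another.
G = nx.Graph([("h1", "h2"), ("h2", "h3"), ("h1", "h3"), ("h3", "h4"), ("h1", "h4"),
              ("f", "a1"), ("f", "a2"), ("f", "a3"), ("a1", "h1")])

THRESHOLD = 0.1  # illustrative cut-off, analogous to the red line in Fig. 3
for v in G.nodes():
    c = clustering_coefficient(G, v)
    flag = "SUSPICIOUS" if c < THRESHOLD else ""
    print(f"{v}: C = {c:.2f} {flag}")
```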

IV. CONDUCTING AUTOMATED SCREENING OF LARGE DATA SETS

As we noted in Section I, current SCRM screening practices have not kept pace with the evolving environment in which the supply chain exists today, especially when it comes to information technology suppliers. Recent surveys have found that an alarming number of business executives and decision makers do not consider their SCRM strategies and techniques to be effective [2], [3]. In particular, current practices typically do not adequately leverage available information (e.g., Web data), are not sufficiently scalable to handle the vast volumes of data that characterize suppliers, and do not provide quantitative assessments to allow risk prioritization, information sharing, evaluation of assumptions, and analytics. In this section, we examine the feasibility of developing a Web-based, scalable, and quantitative screening process for information technology suppliers in conjunction with the graph-based constraints developed in Section III.

As legitimate businesses are motivated to develop and maintain a commercially-beneficial Web presence, they devote considerable resources to this task. In contrast, vendors that represent increased supply chain risk (e.g., those attempting to make a profit by selling out-of-specification or counterfeit parts) are only motivated to cultivate a "realistic" Web presence in order to appear legitimate. This effort does not contribute directly to their monetary goals. We posit that this discrepancy in incentives will be detectable via analysis of large-scale Web data. Moreover, it may be especially difficult for such vendors to distort the topological characteristics of their Web presence, such as the structure and composition of their links, as seen in Section III.

We approach the task of identifying supply chain risk as an ML classification problem of predicting the label of a given vendor vertex v in a given Web graph G. More specifically, for a given Web-based graph G = (V, E), where V and E are the vertex and edge sets, we consider the following vertex-label prediction problem: given a vertex v ∈ V that is of interest but unlabeled, infer the label of v using information contained in the remainder of the network. Each vendor vertex v ∈ V is represented by a feature vector x ∈ R^|P| which models the vendor's Web presence. We wish to obtain a vector w ∈ R^|P| such that the classifier sign(w^T x) returns +1 if vendor v is "interesting" from a supply chain risk perspective and −1 if v is "not interesting". The semi-supervised ML algorithm presented in [24], which we briefly discuss below, is well-suited for this learning task: it provides accurate classification even when limited labeled data is available for training by leveraging the information present in unlabeled instances, can be implemented efficiently at Web scale, and permits analyst knowledge to be incorporated in a natural and effective manner.
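As a minimal sketch of this formulation, the snippet below builds hypothetical feature vectors for two vendors and applies the linear classifier sign(w^T x). The feature names, values, and the stand-in weight vector are illustrative assumptions; in practice w is learned as described below in Equation 1.

```python
# Sketch of the vertex-label prediction step: classify vendors via sign(w^T x).
# Feature values and weights are fabricated stand-ins for illustration.
import numpy as np

# Hypothetical features per vendor: [constant 1, number of outlinks, community centrality].
X = np.array([
    [1.0, 120.0, 0.80],   # established vendor with a well-embedded Web presence
    [1.0,   6.0, 0.05],   # thin Web presence, weakly tied into its community
])
w = np.array([2.0, -0.01, -2.0])   # stand-in weights; in practice learned via Eq. (1)

scores = X @ w
labels = np.sign(scores)   # +1 = "interesting" (higher risk), -1 = "not interesting"
print(labels)              # [-1.  1.]: the thin Web presence is flagged for review
```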


TABLE I
SEMI-SUPERVISED ML ALGORITHM USING 26 AND 39 LABELS

                        True Positive    True Negative
Predicted Positive            6                2
Predicted Negative            0               70

TABLE II
SEMI-SUPERVISED ML ALGORITHM USING 13 LABELS

                        True Positive    True Negative
Predicted Positive            6                3
Predicted Negative            0               69

Fig. 4. Sample Web graph obtained in a Web crawl for the supply chain screening case study. The vertices are Web domains, the edges denote hyperlinks, and the colors delineate Web communities.

We model the problem data as a bipartite graph G_b of vertex-sign instances and features, as seen in Fig. 1. The adjacency matrix of G_b is given by

A = [  0    X ]
    [ X^T   0 ]

where the matrix X ∈ R^(n×|P|) contains the feature vectors of the n vendor vertices as rows. Let d_est ∈ R^n be the vector of estimated signs for the vertices in the dataset and define the augmented classifier w_aug = [d_est^T  w^T]^T ∈ R^(n+|P|), which estimates the polarity of both vertices and features. We learn w_aug, and therefore w, by solving an optimization problem involving the labeled and unlabeled training data, and then use w to estimate the sign of any new vertex of interest with the simple classifier orient = sign(w^T x). In order to learn the augmented classifier w_aug, we solve the following minimization problem:

min_{w_aug}  w_aug^T L_n^k w_aug + β Σ_{i=1}^{|V_1|} (w_est,i − w_i)^2        (1)

where L = D − A is the graph Laplacian matrix for G_b, with D the diagonal degree matrix for A (i.e., D_ii = Σ_j A_ij), L_n = D^(−1/2) L D^(−1/2) is the normalized Laplacian, k is a positive integer, and β is a nonnegative constant. Minimizing Equation 1 solves for w_aug while enforcing that the learned values for the labeled instances correspond to the true values. For further information, including performance comparisons to other ML algorithms, see [24]–[26].

Given this ML framework, we create sets of candidate quantitative features suitable for use with it. The features employed to characterize the Web presence of a vendor of interest are based on website content, hyperlink connectivity, and Web-graph community structure. For example, a candidate feature set might include the number of links on a vendor's website, the centrality of a vendor's website within its Web-graph community, and the clustering coefficient of a vendor with respect to the entire graph or its Web-graph community. Fig. 4 provides a visual example of the type of Web graph and community structure that can be used in our analysis. It should be noted that our technique works on graphs with millions of vertices and edges; a smaller network is displayed in Fig. 4 for visual clarity.
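A minimal sketch of this learning step is shown below: it forms the bipartite instance-feature adjacency matrix, computes the normalized Laplacian, and solves the quadratic problem in Equation 1 in closed form, (L_n^k + βS) w_aug = βSy, where S selects the labeled instances. The tiny feature matrix, the labels, and the choices k = 1 and β = 10 are illustrative assumptions, not values from our experiments.

```python
# Sketch of the semi-supervised learner in Eq. (1) on fabricated toy data.
import numpy as np

# X: n instances (vendors) x |P| features; entries are feature magnitudes.
X = np.array([[1.0, 0.0, 2.0],
              [0.9, 0.1, 1.8],
              [0.0, 2.0, 0.1],
              [0.1, 1.9, 0.0]])
n, p = X.shape

# Adjacency of the bipartite graph G_b = [[0, X], [X^T, 0]] and its normalized Laplacian.
A = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((p, p))]])
D = np.diag(A.sum(axis=1))
L = D - A
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_n = D_inv_sqrt @ L @ D_inv_sqrt

k, beta = 1, 10.0                          # illustrative hyperparameters
labels = {0: +1, 2: -1}                    # two labeled vendors; the rest are unlabeled

# Quadratic objective w^T L_n^k w + beta * sum_i (w_i - y_i)^2 over labeled i,
# minimized in closed form: (L_n^k + beta*S) w = beta * S y.
S = np.zeros((n + p, n + p))
y = np.zeros(n + p)
for i, lab in labels.items():
    S[i, i] = 1.0
    y[i] = lab

w_aug = np.linalg.solve(np.linalg.matrix_power(L_n, k) + beta * S, beta * S @ y)
d_est, w = w_aug[:n], w_aug[n:]            # estimated vertex signs and feature weights
print("estimated instance signs:", np.sign(d_est))
```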

We tested the proposed vendor screening system with a range of datasets; here we summarize one such test. The dataset used in this section is a representative collection of 78 suppliers of information technology products that were randomly selected from a larger corpus. They were evaluated by a team of SCRM analysts, resulting in 6 of the vendors being labeled as interesting (requiring further assessment) and 72 of the vendors being labeled as uninteresting. To test the performance of the proposed ML-based vendor screening system, we first conducted a large domain-based Web crawl (using a crawling tool similar to Apache Nutch [36]) and collected a large-scale graph of 100,000 domains that contained all the data necessary to compute a feature vector x for each of the 78 vendors.

The accuracy of the proposed ML classifier was estimated using standard two-fold cross-validation [23]. Specifically, we randomly partitioned the data into two groups A and B, giving 39 labeled training instances in each validation round. We used set A to train the classifier and tested with set B, then used set B to train the classifier and tested it with set A, resulting in 78 classification instances. To evaluate the extent to which good performance could be achieved with limited labeled data, we varied the amount of labeled information made available to the algorithm during training. More specifically, we considered training procedures using 39 labeled vendors (100% of labels), 26 labeled vendors (67% of labels), and 13 labeled vendors (33% of labels).

The results of our experiments are presented in Table I and Table II. It can be seen from the confusion matrix in Table I that using the majority of labeled instances resulted in excellent performance, with the classifier making only two misclassifications and achieving better than 97% accuracy. The classifiers were able to correctly identify the 6 interesting vendors that were manually identified by the analysts. More interestingly, the results of the classifier using only 13 labeled instances (Table II) are almost identical, with only one more false positive and 96% accuracy. This is particularly important in the present application, as it is a time-consuming (i.e., multi-day) undertaking to label even a single vendor. Note also that there are no false negatives produced by the screening system; that is, no interesting vendors that could increase the supply chain risk slipped through the screening.
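For reference, the following sketch mirrors the two-fold cross-validation protocol used here: split the 78 vendors into two halves, train on one and test on the other in both directions, and accumulate a single confusion matrix over all 78 predictions. The label assignments and the placeholder train/predict functions are fabricated stand-ins for the semi-supervised learner described above.

```python
# Sketch of the two-fold cross-validation protocol with a confusion matrix.
# Labels and the train/predict placeholders are fabricated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 78
y_true = np.full(n, -1)
y_true[rng.choice(n, size=6, replace=False)] = +1   # 6 "interesting" vendors

def train_classifier(train_idx, labels):
    # Placeholder for the semi-supervised learner of Eq. (1).
    return {"train_idx": train_idx}

def predict(model, test_idx):
    # Placeholder prediction; a real model would score each test vendor.
    return np.full(len(test_idx), -1)

perm = rng.permutation(n)
fold_a, fold_b = perm[:39], perm[39:]

confusion = np.zeros((2, 2), dtype=int)   # rows: predicted +/-, cols: true +/-
for train_idx, test_idx in [(fold_a, fold_b), (fold_b, fold_a)]:
    model = train_classifier(train_idx, y_true[train_idx])
    preds = predict(model, test_idx)
    for pred, truth in zip(preds, y_true[test_idx]):
        confusion[0 if pred > 0 else 1, 0 if truth > 0 else 1] += 1

accuracy = np.trace(confusion) / n
print(confusion, f"accuracy = {accuracy:.1%}")
```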


V. CONCLUSION

In this paper, we have created automated screening methods based on robust deception detection that can be incorporated into supply chain risk management strategies. We identified two classes of constraints faced by deceivers, cognitive/computational limitations and strategic tradeoffs, which are both widespread and exploitable. Based on these constraints, we developed graph-based metrics that can be used accurately and efficiently as behavioral signatures for potential deceiver activity. We then used machine learning algorithms to exploit these constraints to robustly detect deceptive behavior. Importantly, these algorithms require little labeled data and can function in multiple domains. Finally, we demonstrated the effectiveness of our solution through experiments on real-world datasets, showing that minimal labeled data coupled with our constraints can lead to highly accurate detection of potential supply chain risk.

VI. ACKNOWLEDGMENTS

This work was supported by the Laboratory Directed Research and Development Program at Sandia National Laboratories. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

[1] J. Boyens, C. Paulsen, N. Bartol, R. Moorthy, and S. Shankles, "Notional supply chain risk management practices for federal information systems," National Institute of Standards and Technology, Tech. Rep., 2012.
[2] L. Glendon, "Supply chain resilience," Business Continuity Institute, Tech. Rep., 2011.
[3] K. Marchese and S. Paramasivam, "Global supply chain risk," Deloitte, Tech. Rep., 2013.
[4] Georgia Tech Information Security Center & Georgia Tech Research Institute, "Emerging cyber threats report 2013," Georgia Tech, Tech. Rep., 2012.
[5] T. Syrus, M. Pecht, and R. Uppalapati, "Manufacturer assessment procedure and criteria for parts selection and management," IEEE Transactions on Electronics Packaging Manufacturing, vol. 24, pp. 351–358, 2001.
[6] R. Ellison and C. Woody, "Supply-chain risk management: Incorporating security into software development," in Proc. of the Hawaii International Conference on System Sciences (HICSS), 2010.
[7] D. Waters, Supply chain risk management: Vulnerability and resilience in logistics. Kogan Page, 2011.
[8] Proc. of the IEEE International Conference on Intelligence and Security Informatics, 2011.
[9] Proc. of the IEEE International Conference on Intelligence and Security Informatics, 2012.
[10] A. Vrij, Detecting lies and deceit: Pitfalls and opportunities. Wiley-Interscience, 2008.
[11] P. Ekman and E. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1998.

[12] M. Newman, J. Pennebaker, D. Berry, and J. Richards, "Lying words: Predicting deception from linguistic styles," Personality and Social Psychology Bulletin, vol. 29, pp. 665–675, 2003.
[13] B. Jin, A. Strasburger, S. J. Laken, F. A. Kozel, K. A. Johnson, M. S. George, and X. Lu, "Feature selection for fMRI-based deception detection," BMC Bioinformatics, vol. 10, p. S15, 2009.
[14] EACL Workshop on Computational Approaches to Deception Detection, 2012.
[15] V. Rubin and N. Conroy, "Discerning truth from deception: Human judgments and automation efforts," First Monday, vol. 17, 2012.
[16] M. Ott, C. Cardie, and J. Hancock, "Estimating the prevalence of deception in online review communities," in Proc. of the International Conference on World Wide Web (WWW), 2012.
[17] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: Transfer learning from unlabeled data," in Proc. of the 24th International Conference on Machine Learning. ACM, 2007, pp. 759–766.
[18] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[19] R. Colbaugh, K. Glass, and J. Gosler, "Some intelligence analysis problems and their graph formulations," Journal of Intelligence Community Research and Development, p. 27, 2010.
[20] R. Colbaugh and K. Glass, "Early warning analysis for social diffusion events," in Proc. of the IEEE International Conference on Intelligence and Security Informatics (ISI), 2010.
[21] H. Peters, Game theory: A multi-leveled approach. Springer, 2008.
[22] R. Colbaugh and K. Glass, "Predictive defense against evolving adversaries," in Proc. of the IEEE International Conference on Intelligence and Security Informatics, 2012.
[23] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer Series in Statistics, 2001.
[24] R. Colbaugh and K. Glass, "Agile sentiment analysis of social media content for security informatics applications," in Proc. of the European Intelligence and Security Informatics Conference (EISIC), 2011.
[25] ——, "Leveraging sociological models for prediction I: Inferring adversarial relationships," in Proc. of the IEEE International Conference on Intelligence and Security Informatics (ISI), 2012.
[26] ——, "Leveraging sociological models for prediction II: Early warning for complex contagions," in Proc. of the IEEE International Conference on Intelligence and Security Informatics (ISI), 2012.
[27] K. Glass and R. Colbaugh, "Estimating the sentiment of social media content for security informatics applications," Security Informatics, vol. 1, pp. 1–16, 2012.
[28] K. Hoffman, D. Zage, and C. Nita-Rotaru, "A survey of attack and defense techniques for reputation systems," ACM Comput. Surv., vol. 42, pp. 1:1–1:31, 2009.
[29] S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos, "Netprobe: A fast and scalable system for fraud detection in online auction networks," in Proc. of the International Conference on World Wide Web (WWW), 2007.
[30] D. Chau, S. Pandit, and C. Faloutsos, "Detecting fraudulent personalities in networks of online auctioneers," Knowledge Discovery in Databases (PKDD), pp. 103–114, 2006.
[31] M. E. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences, vol. 103, pp. 8577–8582, 2006.
[32] L. J. Grady and J. R. Polimeni, Discrete Calculus: Applied Analysis on Graphs for Computational Science. Springer, 2010.
[33] D. Easley and J. Kleinberg, Networks, crowds, and markets. Cambridge University Press, 2010.
[34] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, pp. 440–442, 1998.
[35] J. Seibert, D. Zage, and C. Nita-Rotaru, "'Won't you be my neighbor?' Neighbor selection attacks in mesh-based peer-to-peer streaming," Purdue University Computer Science, Tech. Rep. 1694, 2008.
[36] Apache Nutch. http://nutch.apache.org/.