
2010 International Conference on Pattern Recognition

Detection and Characterization of Anomalous Entities in Social Communication Networks

Nithi Gupta, Lipika Dey
TCS Innovation Labs, Delhi, India

Abstract
Social networks generated from emails or calls provide enormous geospatial and interaction information about subscribers. These have served as important inputs to intelligence analysts. In this paper, we propose an efficient algorithm for anomaly detection from social networks. Anomalous users are detected based on their behavioral dissimilarity from others. A rich feature set is proposed for outlier detection. A method for providing visual explanation for the results is also proposed.

1. Introduction
Social network analysis (SNA) is dedicated to the study of relationships among entities like people, groups, organizations or websites. A social network can be viewed as a graph in which nodes represent entities like people or groups, while the edges depict relationships or information flows between the nodes. Social networks evolve from communication networks like e-mails or telephone calls, or from social networking sites like Facebook, Orkut etc. Enterprise portals accessible to employees within an organization can also be a source of social network analysis data. SNA has gained a lot of momentum within the pattern recognition and data mining research communities. Social network analysis encompasses both visual [1,2] and mathematical analysis of relationships. Recognizing significant patterns within various sections of the network and characterizing them are viewed as important tasks by intelligence and investigative analysts. Social networks are also viewed as rich sources of information by consumer market researchers.

In this paper, we propose a method for detecting and characterizing anomalous behavioral patterns within large social networks. Though visual detection of anomalous patterns from social networks has become quite popular, automated recognition of anomalous entities has received far less attention. Automated anomaly detection using outlier analysis helps in identifying entities that exhibit abnormal behavior. Not all abnormal entities are necessarily suspicious. However, identification of anomalous behavior provides a useful first step for analysts to restrict the data for visual inspection. Visual inspection is not only tedious but also error-prone. Besides, social mavericks change their behavior continuously to dodge detection. Statistical analysis can help in this regard, since at any given time an anomalous entity remains sufficiently different from the "normal" or "expected" behavior shown by the majority of the entities.

In this paper, we propose a rich feature set that can model social network entities very effectively for anomaly detection. These features are derived from a given network using a number of direct and indirect user interaction parameters. The anomaly detection algorithm ranks entities according to their anomaly scores. We also present a unique visualization technique that explains the reasons for the anomaly score of an entity. The visualization helps analysts understand why a person is considered anomalous by the system. Analysts may use the information in conjunction with other available meta-data to discard non-suspicious entities. For example, in an email network, it is often observed that a person who heads a group is copied on most of the mails that flow within the group. Thus this person may stand out as an exception because of very high connectivity, but can be eliminated using additional knowledge.

The applicability of the method has been established over diverse data sets including the Enron e-mail data set and the synthetic VAST 2008 dataset [8].

1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.186

2. Anomaly detection in social networks – A review

Anomaly detection is a mature area of research. It has been successfully applied to areas such as network intrusion detection, credit-card fraud detection and email-based network analysis. Anomaly detection algorithms work on the premise that normal behavior is more predominant than abnormal behavior and will be exhibited by the majority of the network entities. Chandola et al. [3] present a comprehensive review of the area. Wasserman and Faust [4] proposed the use of centrality-based measures for social network analysis. Lin and Chalupsky [5] proposed the use of indirect connections to detect anomalous subscribers from mail data records. Chalupsky [6] used it to solve the VAST 2008 challenge [8].

3. Feature set identification for anomaly detection

Most of the earlier works on social network analysis assume that users can be characterized by the number of calls made or received, number of contacts, towers used for making calls etc. Statistical properties like average number of calls made per day, average number of contacts, etc. have also been used. However, a user's behavior cannot be analyzed by sums or averages only. A user can be better characterized by taking into account individual interaction patterns in the context of global behavioral features. We propose a richer feature set that can capture the entire range of user behavior more comprehensively. The proposed feature set is derived using both interaction data and content-related data. A communication record contains information about its originator, its receiver(s) and the time of communication. Call details additionally contain call duration, tower used, call type, details about the handset used etc. Mail records contain the mail text, possibly a subject, and the receiver type (TO, CC or BCC). Each interaction can be characterized as OUTGOING or INCOMING for a given user.

Figure 1 shows the distribution of call durations for two Indian call data sets. Network 1 contained data about 8627134 calls made by 30866 users over 9 days. Network 2 had 1125297 calls made by 31494 users over 12 days. Figure 2 shows the distribution of message size for Enron mails. These figures show that shorter calls or communications are much more frequent than longer ones. A communication can be characterized as "long" or "short" based on its relative size within the network. Let d denote the value that covers the 95th percentile of all communications in a network. Content is characterized as SHORT if it is less than d, otherwise LONG. Two users may communicate through all SHORT, all LONG or a mixture of SHORT and LONG content.

Fig. 1. Distribution of call duration for mobile networks in India

Fig. 2. Distribution of mail length for Enron

Interaction between two users can also be characterized based on their frequency of interactions. Figure 3 shows pair-wise interaction patterns for the earlier data sets. It is seen that pairs of users interacting frequently are higher than pairs with infrequent interactions. However, the distribution of interaction patterns is different for email and phone networks. Choosing d in the same way as above, the type of interaction between two users is termed LOW, MED or HIGH. The relationship between a pair of users may change with time.
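The thresholding described in this section can be illustrated with a minimal Python sketch. It derives d as the 95th percentile (nearest-rank convention) of a list of communication sizes, labels the content exchanged by a pair of users as SHORT, LONG or MIXED, and labels the pair's interaction count as LOW, MED or HIGH. The function names, the toy call durations and the two cut points low_cut and high_cut are assumptions made for this example; the text does not state how the frequency cut points are obtained beyond choosing them similarly to d.

```python
import math

def percentile_threshold(values, pct=95):
    """Nearest-rank percentile: smallest observed value covering pct% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def content_type(sizes, d):
    """Label all communications between one pair as SHORT, LONG or MIXED."""
    # A single communication is SHORT if strictly less than d, otherwise LONG.
    labels = {"SHORT" if s < d else "LONG" for s in sizes}
    return labels.pop() if len(labels) == 1 else "MIXED"

def frequency_level(count, low_cut, high_cut):
    """Label a pair's interaction count; both cut points are hypothetical."""
    if count <= low_cut:
        return "LOW"
    if count <= high_cut:
        return "MED"
    return "HIGH"

# Toy call durations in seconds (invented for illustration only).
durations = [10, 12, 15, 20, 25, 30, 30, 40, 45, 50,
             55, 60, 70, 80, 90, 100, 120, 150, 300, 900]
d = percentile_threshold(durations, 95)

print(d)                               # threshold separating SHORT from LONG
print(content_type([10, 20], d))       # all communications below d
print(content_type([10, 900], d))      # a mixture of short and long content
print(frequency_level(7, low_cut=2, high_cut=10))
```

Because SHORT is defined as strictly less than d, a communication whose size equals d is labelled LONG in this sketch.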

Fig. 3. Pair-wise interactions follow power law: (left) call data records, (right) Enron dataset

Based on the above observations, we now characterize a user with a total of 23 features in three categories:

OUTGOING features - fOUT(LOW, SHORT), fOUT(MED, SHORT), fOUT(HIGH, SHORT), fOUT(LOW, LONG), fOUT(MED, LONG), fOUT(HIGH, LONG), fOUT(LOW, MIXED), fOUT(MED, MIXED), fOUT(HIGH, MIXED)
INCOMING features - fIN(LOW, SHORT), fIN(MED, SHORT), fIN(HIGH, SHORT), fIN(LOW, LONG), fIN(MED, LONG), fIN(HIGH, LONG), fIN(LOW, MIXED), fIN(MED, MIXED), fIN(HIGH, MIXED)
GLOBAL features - number of people contacted, number of people contacting, people contacted but not contacting, people contacting but not contacted, activity time period.

fOUT(LOW, SHORT) in the above feature set represents the total percentage of contacts of a user with whom the user-initiated interactions are few (LOW) and the communications are only SHORT messages/calls/mails. Similarly, fIN(MED, SHORT) represents the frequency of incoming messages/calls that are received by the user from all contacts with whom the user's communications are of medium frequency. All other features can be similarly described. The feature f(people contacted but not contacting) denotes the percentage of the people contacted by the user who never contact the user. Similarly, f(people contacting but not contacted) denotes the percentage of the people contacting the user whom the user never contacts. These two features characterize one-way interactions, a crucial feature in the determination of anomalous behavior. The number of people contacted/contacting is represented as a normalized value. The feature set described above can effectively capture time-varying interactions in social networks. It ensures that all nuances of interaction for a user with all other users are represented in an efficient way.
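As a rough illustration of how such a feature vector could be assembled, the Python sketch below computes the nine OUTGOING, nine INCOMING and four of the GLOBAL features for one user from a list of (sender, receiver, size) records. Several points are assumptions of this sketch rather than details given in the text: the record layout, the helper names, the cut points low_cut and high_cut, and the treatment of every fOUT and fIN feature as a percentage of the user's contacts falling into a given bin. The activity time period is omitted because the toy records carry no timestamps, and the contact counts are left unnormalized.

```python
from collections import defaultdict

LEVELS = ("LOW", "MED", "HIGH")
TYPES = ("SHORT", "LONG", "MIXED")

def bin_pair(sizes, d, low_cut, high_cut):
    """Return (frequency level, content type) for one directed pair of users."""
    n = len(sizes)
    level = "LOW" if n <= low_cut else "MED" if n <= high_cut else "HIGH"
    kinds = {"SHORT" if s < d else "LONG" for s in sizes}
    return level, (kinds.pop() if len(kinds) == 1 else "MIXED")

def user_features(records, user, d, low_cut, high_cut):
    """Build directional and one-way features for `user`.

    records: iterable of (sender, receiver, size) tuples (assumed layout).
    """
    out_sizes, in_sizes = defaultdict(list), defaultdict(list)
    for sender, receiver, size in records:
        if sender == user:
            out_sizes[receiver].append(size)
        if receiver == user:
            in_sizes[sender].append(size)

    features = {}
    for name, per_contact in (("fOUT", out_sizes), ("fIN", in_sizes)):
        counts = defaultdict(int)
        for sizes in per_contact.values():
            counts[bin_pair(sizes, d, low_cut, high_cut)] += 1
        total = len(per_contact)
        for level in LEVELS:
            for kind in TYPES:
                pct = 100.0 * counts[(level, kind)] / total if total else 0.0
                features[f"{name}({level}, {kind})"] = pct

    contacted, contacting = set(out_sizes), set(in_sizes)
    features["contacted"] = len(contacted)
    features["contacting"] = len(contacting)
    # One-way interactions, expressed as percentages of the relevant contacts.
    features["contacted_not_contacting"] = (
        100.0 * len(contacted - contacting) / len(contacted) if contacted else 0.0)
    features["contacting_not_contacted"] = (
        100.0 * len(contacting - contacted) / len(contacting) if contacting else 0.0)
    return features

# Toy records, invented for illustration.
records = [("a", "b", 10), ("a", "b", 20), ("a", "c", 500),
           ("b", "a", 15), ("d", "a", 30)]
feats = user_features(records, "a", d=300, low_cut=1, high_cut=5)
print(feats["fOUT(MED, SHORT)"], feats["contacted_not_contacting"])
```

In this toy example user "a" sends to "b" and "c" but hears back only from "b", so half of the contacted people never contact the user.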

4. Anomaly detection

The anomaly detection algorithm proposed in this paper is based on the principle of outlier detection. An outlier in a dataset is defined as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. In most real scenarios outliers are heavily outnumbered by the more commonly occurring observations. Unsupervised approaches to identifying outliers are useful in detecting anomalies or nonconforming entities in large datasets. In this paper, an outlier is defined as follows [7]:

Definition: Outliers of a set are the top n data elements that are farthest from their kth nearest neighbours.

This distance measure has been found to be very effective for analyzing spatial datasets. Given a social network, the following algorithm finds the top n outliers based on the above definition. The parameters n and k are provided as inputs. The complexity of the algorithm is O(N^2), where N is the total number of nodes. The steps of the algorithm are:

Step 1: Choose a value of k.
Step 2: For each node, find the distance from its kth closest neighbor.
Step 3: Arrange the data points in descending order of the distance obtained in step 2.
Step 4: Choose the top n points as outliers.

This algorithm finds all those entities which are very different from their neighbors and therefore qualify as outliers. To improve performance on real data sets, we have applied principal component analysis (PCA) to identify the dimensions along which the data set shows maximum variation. One can choose only the significant principal components to represent the transformed dataset. This increases the efficiency of the whole system. The number of principal components can vary with applications.

5. Experimental results

The proposed approach has been tested on various real and synthetic social network datasets. Correctness of the approach is established through results obtained over the publicly available VAST 2008 challenge dataset [8] and the Enron dataset. The VAST dataset has 400 entities with 10000 interactions. The challenge was to identify 10 members of a gang who conducted suspicious activities. Table 1 compares the performance of our system with that reported in [6]. All other results reported in VAST were for human-driven visual analysis.

Table 1. Comparison of results with KOJAK on the VAST 2008 dataset

Algorithm    Proposed Algorithm    KOJAK [6]
Precision    90%                   60%
Recall       90%                   60%
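The outlier-ranking steps of Section 4 can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the authors' implementation: Euclidean distance is assumed as the distance between feature vectors, the PCA step is left out, and the function name and toy points are invented for the example. Sorting each node's distances makes this version O(N^2 log N) rather than the O(N^2) quoted in the text.

```python
import math

def top_n_outliers(points, k, n):
    """Return the n points farthest from their k-th nearest neighbour.

    points: list of equal-length numeric tuples (feature vectors).
    Result: list of (index, k-th neighbour distance), largest distance first.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scored = []
    for i, p in enumerate(points):
        # Step 2: distances from this node to every other node, ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scored.append((i, dists[k - 1]))
    # Step 3: descending order of the k-th neighbour distance.
    scored.sort(key=lambda item: item[1], reverse=True)
    # Step 4: the top n points are reported as outliers.
    return scored[:n]

# Four clustered toy points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
ranking = top_n_outliers(points, k=2, n=1)
print(ranking)
```

With k = 2 the isolated point at index 4 is ranked first, since its second-nearest neighbour is far away while every clustered point has two neighbours at distance 1.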

The Enron dataset contains mail data of about 150 users, most of whom are senior members of Enron. The top 2 anomalous entities are Jeffrey Skilling and Kenneth Lay. According to the literature, Jeffrey Skilling was convicted of multiple federal felony charges relating to Enron's financial collapse [9]. Kenneth Lay was indicted by a grand jury on eleven counts of securities fraud and related charges [10]. This establishes the correctness of the approach. Figure 4 shows the distribution of the anomalous and non-anomalous entities for the two datasets on the transformed feature space after applying PCA. The top 3 principal components were selected for this experiment for effective visualization. In both cases, the top 10 anomalous entities are colored red and the rest blue.

Fig 4. Anomalous and non-anomalous entities: (left) Enron, (right) VAST 2008 challenge

6. Explanation for anomalous behavior

We now propose the use of stacked pie charts to provide an explanation of anomalous behavior. A pair of stacked pie charts is used to describe the outgoing and incoming behavioral features of an entity. The width of an annular ring in the pie is a function of the total number of contacts of the entity. The innermost ring corresponds to SHORT interactions and the outermost ring to LONG interactions, whereas the middle ring corresponds to MIXED interactions. The color intensity denotes the frequency of interactions.

Fig 5. Comparing behavior of an anomalous entity with its closest neighbors (panels: OUTGOING and INCOMING behavior of the anomalous entity and of two non-anomalous neighbors)

Figure 5 shows the incoming and outgoing behavior of the topmost anomalous entity from the Enron dataset compared to its two closest neighbors. It can be seen that the outgoing behavior of the anomalous entity, shown at top left, is very different from that of the neighbors. It shows unusually high volumes of outgoing long mails sent to infrequent contacts. The incoming behavior is less discriminatory, showing that this entity was possibly involved in sending out inside information. In Figure 6, the stacked bar charts show that the percentages of one-way incoming (blue) and outgoing (green) interactions are much higher for the anomalous entity than for its neighbors. In fact, most of the outgoing interactions for the anomalous entity are one-way, again suggesting information leakage. These behavioral factors are also substantiated by social research.

Fig 6. One way interactions for anomalous entity (leftmost) and its closest neighbors.

7. Conclusions

In this paper, we have presented a method for identifying anomalous entities in social networks using a rich feature set for characterizing user interactions. We have also presented a visualization mechanism for understanding anomalous behavior using stacked pie charts. The work is presently being extended to identify aliases of anomalous entities in evolving social networks.

8. References

[1] E. Swing: Solving the Cell Phone Calls Challenge with the Prajna Project. IEEE Symposium on Visual Analytics Science and Technology, Columbus, Ohio, USA (2008)
[2] J. Payne, J. Solomon, R. Sankar, B. McGrew: Grand Challenge Award: Interactive Visual Analytics Palantir: The Future of Analysis. IEEE Symposium on Visual Analytics Science and Technology, Columbus, Ohio, USA (2008)
[3] V. Chandola, A. Banerjee, V. Kumar: Anomaly Detection: A Survey. ACM Computing Surveys (2009)
[4] S. Wasserman, K. Faust: Social Network Analysis: Methods & Applications. Cambridge, UK: Cambridge University Press (1994)
[5] S. Lin, H. Chalupsky: Discovering and explaining abnormal nodes in semantic graphs. IEEE Transactions on Knowledge and Data Engineering, Volume 20 (2008)
[6] H. Chalupsky: Using KOJAK Link Discovery Tools to Solve the Cell Phone Calls Mini Challenge. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology, Portugal (2008)
[7] S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Texas, United States (2000)
[8] IEEE VAST Challenge 2008, http://www.cs.umd.edu/hcil/VASTchallenge08
[9] Wikipedia, http://en.wikipedia.org/wiki/Jeffrey_Skilling
[10] Wikipedia, http://en.wikipedia.org/wiki/Kenneth_Lay
