Mining (Social) Network Graphs to Detect Random Link Attacks
Nisheeth Shrivastava, Anirban Majumder, Rajeev Rastogi Bell-Labs Research, India {nisheeths,manirban,rastogi}@alcatel-lucent.com
Abstract- Modern communication networks are vulnerable to attackers who send unsolicited messages to innocent users, wasting network resources and user time. Some examples of such attacks are spam emails, annoying tele-marketing phone calls, viral marketing in social networks, etc. Existing techniques to identify these attacks are tailored to certain specific domains (like email spam filtering), but are not applicable to a majority of other networks. We provide a generic abstraction of such
attacks, called the Random Link Attack (RLA), that can be used
to describe a large class of attacks in communication networks. In an RLA, the malicious user creates a set of false identities and uses them to communicate with a large, random set of innocent
users. We mine the social networking graph extracted from user interactions in the communication network to find RLAs. To the best of our knowledge, this is the first attempt to conceptualize the attack definition, applicable to a variety of communication networks. In this paper, we formally define RLA and show that the problem of finding an RLA is NP-complete. We also provide two efficient heuristics to mine subgraphs satisfying the RLA property; the first (GREEDY) is based on greedy set-expansion, and the second (TRWALK) on randomized graph traversal. Our experiments with a real-life data set demonstrate the effectiveness of these algorithms.

I. INTRODUCTION
The past decade has seen rapid advances in communication platforms, ranging from traditional phone services (mobile and fixed telephony) and internet-based email to the recently popular instant messaging (IM) and social networking. These platforms provide a plenitude of new and innovative ways to connect to other people, which not only make it easier to reach distant users worldwide, but also provide richer interfaces (scrapbooks, SMS, voice mails, etc.) for exchanging messages. As these communication networks become increasingly popular, they are also becoming attractive targets for malicious users who leverage them to spread unsolicited (spam) messages. Along with several well-known examples of such misuse, like annoying tele-marketing calls and spam emails, several new kinds of attacks are coming into focus, such as invitations to become a buddy or friend in a social network, viral marketing in social networks [1], fake scrapbook entries, etc. With almost no cost of sending messages in the network, and no guaranteed way of tracing back the identity of the sender, the number of abuses in these networks is growing at an alarming rate; according to [2], almost 80% of emails received at a mail server are spam.
978-1-4244-1837-4/08/$25.00 © 2008 IEEE
Ongoing research aimed at spam identification has addressed this problem by creating filters that detect spam messages. The most common theme in these filters is to look at the content of the message (email headers or text) and classify it as either good or spam. These techniques typically train themselves on a set of known spam messages to form models that can be used in the classification process. Although these techniques have proved fairly effective in detecting email spam, they do not address the spam problem in other networks. For instance, content-based filtering solutions are not applicable in voice-based communication networks, where it is hard to identify patterns in voice that can be marked as spam.

We observe that just looking at the interaction behavior of the attackers can reveal some interesting spamming patterns, even without profiling the content of messages. For example, telemarketers usually call a large set of phone numbers to market their product, which is very different from a good user, whose calls are restricted to a community of her family or friends. Our approach is to find spammers by profiling properties of whom they send messages to, instead of what they are sending.

We generalize the specific attacks mentioned above into a unified attack schema, called the Random Link Attack (RLA). An RLA consists of two sets of nodes, the attackers and the victims (see Figure 1). The eventual goal of the malicious user is to send spam messages to, or spread some other form of influence on, a large number of innocent users, whom we call victims. To this end, she creates a set of identities, in the form of email addresses, social network profiles, phone numbers, etc., which we call the attack identities or attackers. She then randomly chooses the set of victim nodes, and uses the attack identities to send messages to them. For an attack to be successful, the victim set is typically very large compared to the attack set.
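The attack schema described above can be illustrated with a small simulation. The following Python sketch is not taken from the paper: the function name `inject_rla`, the adjacency-dictionary graph representation, and the parameters `num_attackers` and `victims_per_attacker` are all assumptions made for illustration. It adds a few fresh attack identities to an existing graph and links each of them to victims chosen uniformly at random.

```python
import random

def inject_rla(adj, num_attackers, victims_per_attacker, seed=0):
    """Add a hypothetical RLA to an undirected graph {node: set(neighbors)}.

    Attack identities are new nodes named ("atk", i). Each one is linked to
    victims drawn uniformly at random from the nodes already in the graph.
    Returns (attackers, victims).
    """
    rng = random.Random(seed)
    innocents = list(adj)                      # users present before the attack
    attackers = [("atk", i) for i in range(num_attackers)]
    victims = set()
    for a in attackers:
        adj[a] = set()
        # Uniform sampling without replacement: the attacker is oblivious
        # to how the victims interact with each other.
        for v in rng.sample(innocents, victims_per_attacker):
            adj[a].add(v)
            adj[v].add(a)
            victims.add(v)
    return attackers, victims
```

Because the victims are sampled independently of the existing edges, the resulting attack set has many neighbors that are, with high likelihood, only sparsely connected to one another.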
Since our techniques are oblivious to the content of the interaction among users (e.g. text in emails), they are applicable to a variety of networks. We work with a generic abstraction called the social networking graph, which captures the interactions among individuals that are extracted from these communication networks. In this graph, each user (or its identity in the network) is a node and there exists an edge between two nodes if the corresponding users communicate with each other. Such social networking graphs can be extracted from various sources like email server logs, call detail records (CDR) of telecom companies, friend communities in public social networks (like Orkut [3], LiveJournal [4]), blogspaces, etc.¹

ICDE 2008

Fig. 1. An RLA example. [Figure omitted; legend: attack node, victim node, good node.]

We assume that the attacker is fully oblivious of the interaction pattern among its victims. This means that her choice of the victim set will be random, i.e., each node in the graph has an equal probability of being chosen by the attacker, independent of other victims. Hence the attack nodes form a subgraph in the social network, where a small number of nodes (attackers) are connected to a large number of other randomly selected nodes (victims), which are by themselves sparsely connected. Before going into the details of the attack, we briefly give two real-life scenarios which are good examples of RLAs.

* Email Spam: Spam emails are undoubtedly one of the biggest problems in the internet. Malicious users take advantage of the free medium of email communication to send unsolicited messages on topics ranging from product marketing, adult content, financial scams, political opinion, and so on. A majority of such spam mails are sent to victim email addresses that are generated randomly from a popular domain (e.g. [email protected]) or chosen from already existing mailing lists from various websites (personal webpages, email subscription services, etc.). Studies [5] have found that these victim mail addresses rarely exchange emails in practice, hence the email spam is a form of RLA.
* Viral Marketing: The traditional "word-of-mouth" marketing has gained considerable interest in the internet-based marketing community under the name of "viral marketing" [6], [7], [8]. It is based on the simple fact that users are more receptive to a product or service recommended by their friends. In viral marketing, the marketer sends promotional advertisements, in the form of a link to an entertaining video or a website, to a set of seed users. If a user likes it, she recommends it to her friends, who recommend it to their friends, and so on. This generates a cascading effect and diffuses the advertisement in the social structure rooted at the seed users. It has been noted in [8] that in certain popular diffusion models, a randomly chosen set of seed users very successfully diffuses the advertisement. The only investment on the marketer's part is the cost of establishing a relationship with the seed users (and sending them the advertisement). The marketer creates a set of seemingly innocent profiles and uses them to make friendship links with a large set of seed users. These links can now be used to send the initial advertisements.
¹ Although we work with the knowledge of the entire graph, our techniques can be used even by operators that control and have information only about a specific domain (hence only a subgraph of the social networking graph). For example, a corporate mail server can run our schemes on their private email graph, etc.
Fig. 2. The average number of edges in a set of randomly chosen nodes in the LiveJournal dataset. [Plot omitted; x-axis: size of victim set (V).]
The random nodes in the neighborhood of the attackers have a strikingly different structure than the nodes in the neighborhood of a good user. Watts et al. [9], in their seminal work, concluded that the neighborhood of a node in a social network has many triangles² as compared to a node in a random graph. In other words, the neighborhood of a good user typically contains a set of communities, or groups of nodes that also have edges between themselves. Some examples of such communities are: in a friend network, many of my friends are also friends of each other; or, in an email graph, colleagues who interact via email with me also send mails to each other. On the other hand, the number of links between a randomly chosen subset of nodes (victim nodes) tends to be very small. Typically, social networks are modeled as power law graphs [10], where it can be shown that the probability of an edge between a pair of randomly selected nodes is $O(1/n)$, $n$ being the number of nodes in the network. Figure 2 shows the number of edges in a randomly chosen victim set from the LiveJournal dataset. As an example, for a victim set of size 5000, the number of edges is only 40.

To masquerade as good users, the attackers may form a dense web of connections with each other, increasing the number of triangles. From an attacker's point of view, this ensures that her neighborhood is structurally similar to that of a good user; such a collaborative nature of the attack also makes it hard to detect an RLA. However, if we were able to collapse all the attack nodes into a single node, all the randomly selected victim nodes would then appear in the neighborhood, making it possible to detect the RLA. To the best of our knowledge, this is the first attempt to formalize and identify such collaborative attacks in a social network.
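The sparsity of a random victim set can be checked with a short calculation. If $s$ nodes are drawn uniformly at random from a graph with $n$ nodes and $m$ edges, each edge has both endpoints in the sample with probability $s(s-1)/(n(n-1))$, so the expected number of internal edges is $m \cdot s(s-1)/(n(n-1))$. The Python sketch below compares this closed form with an empirical estimate. The function names and the adjacency-dictionary representation are assumptions made for illustration, and the graph is assumed to be undirected with no self-loops.

```python
import random

def expected_internal_edges(n, m, s):
    """Expected number of edges inside a uniform random sample of s nodes."""
    return m * s * (s - 1) / (n * (n - 1))

def sampled_internal_edges(adj, s, trials=200, seed=0):
    """Average number of edges inside random s-node samples of {node: set(neighbors)}."""
    rng = random.Random(seed)
    nodes = list(adj)
    total = 0
    for _ in range(trials):
        chosen = set(rng.sample(nodes, s))
        # Every internal edge is seen from both endpoints, hence the // 2.
        total += sum(len(adj[u] & chosen) for u in chosen) // 2
    return total / trials
```

For a fixed sample size $s$ and a sparse graph with $m = O(n)$, the expectation shrinks as $n$ grows, which matches the intuition that a random victim set has very few edges.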
² A triangle is defined as a clique containing three nodes.
We would like to point out that RLAs are very different from Distributed Denial of Service (DDoS) attacks, where the goal is to overwhelm a single victim node by bombarding it with spurious requests from attack hosts, or hosts compromised by attackers, so that it runs out of resources and becomes incapable of providing service. There too, as in our RLA scenario, the attack is launched by a group of attack nodes to evade detection. However, there is a crucial difference between the two attacks: in a DDoS, attackers target a single victim with a large number of requests, while in an RLA, the targets are a large number of randomly selected victims.

Our approach will ultimately mark a set of spammers as a blacklist, which can be blocked by the mail servers. There are various other works based on similar ideas of blacklisting email addresses or IP addresses [11], but they all rely on some form of content profiling, either based on spam filters or through explicit user feedback. It is important to note that our techniques are orthogonal to these approaches since we exploit only the underlying interaction structure of the network.
A. Our Contributions

The key contributions of our work are as follows.

* We formalize the notion of the Random Link Attack (RLA) based on simple interconnection properties of individuals in any communication network. Our definition of RLA incorporates various attack scenarios in communication networks, such as spam emails, annoying telemarketing calls, and viral marketing in social networks. We also prove that the problem of finding an RLA is NP-complete. To the best of our knowledge, this is the first attempt to unify the attack definition across a variety of communication networks.
* We present two simple tests, called the clustering test and the neighborhood independence test, that can be used to quickly mark a small subset of nodes as suspects and prune away the rest of the social networking graph. We also empirically evaluate the effectiveness of these tests in identifying suspects (attackers and non-attackers).
* We present two heuristics, GREEDY and TRWALK, to mine subgraphs satisfying the RLA property, starting from the suspect nodes.
* Using extensive experimental evaluation on a real-life LiveJournal dataset [4], we show that the above techniques perform extremely well in practice, detecting the injected RLA group with almost complete certainty.

B. Roadmap

The rest of the paper is organized as follows. In the next section, we give a formal definition of an RLA and establish the hardness of finding it. Section III discusses the related research work. In Section IV, we describe tests to mark nodes in the graph as suspects. In Section V, we present the GREEDY and TRWALK algorithms to find an RLA starting from the suspect nodes. We describe the experimental evaluation of our techniques in Section VI, and finally, present some concluding remarks in Section VII.

II. THE RANDOM LINK ATTACK

In an RLA, the attacker aims to create connections with a large set of victim nodes ($V$), which are then used to propagate one of the attack scenarios listed in the previous section. To this end, she creates a set of up to $k$ fake identities, which is called an attack set ($A$). To launch a successful and scalable attack, the set of victims has to be much bigger in size than the attack set itself. Specifically, only if the size of the victim set is larger than a constant factor ($\alpha$) of the attack set, we consider it significant enough to be detected.

Further, we define external triangles as the triangles formed by the attackers with the rest of the graph. These triangles contain one attack node and a pair of non-attackers (victims). Two external triangles are called distinct if they contain two different pairs of victim nodes (or two different edges from the victim set). Since edges among victims must be a part of these triangles, and these edges are few in number, the number of distinct external triangles ($\Delta_A$) will also be very small. With this motivation, we formally define an RLA as follows.

Definition 1: [RLA] Let $A$ be a (non-empty) subset of nodes in the social network graph $G$ and $V$ be a subset of nodes in $G - A$ that share an edge with some node(s) in $A$. $A$ is called a random link attack (RLA) iff it satisfies the following properties.
$|A| \le k$    (1)
$|V| \ge \alpha |A|$    (2)
$\Delta_A \le \theta$    (3)
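To make Definition 1 concrete, the following Python sketch checks the three properties for a given candidate attack set. It is an illustration under stated assumptions rather than the authors' implementation: the function names and the adjacency-dictionary representation are hypothetical, $V$ is taken to be every neighbor of $A$ outside $A$, the bounds are treated as inclusive, and the graph is assumed to be undirected with no self-loops. Distinct external triangles are counted as victim-victim edges whose two endpoints share at least one attacker as a common neighbor.

```python
def victim_set(adj, attack):
    """All nodes outside `attack` that share an edge with some attack node."""
    attack = set(attack)
    return {v for a in attack for v in adj[a]} - attack

def distinct_external_triangles(adj, attack):
    """Count victim-victim edges (u, w) that close a triangle with one attacker."""
    attack = set(attack)
    closing_edges = set()
    for a in attack:
        outside = adj[a] - attack          # victims adjacent to attacker a
        for u in outside:
            for w in adj[u] & outside:     # w is also adjacent to a
                closing_edges.add(frozenset((u, w)))
    return len(closing_edges)

def is_rla(adj, attack, k, alpha, theta):
    """Check properties (1)-(3) for a candidate attack set (inclusive bounds assumed)."""
    attack = set(attack)
    if not attack:
        return False
    size_ok = len(attack) <= k
    victims_ok = len(victim_set(adj, attack)) >= alpha * len(attack)
    triangles_ok = distinct_external_triangles(adj, attack) <= theta
    return size_ok and victims_ok and triangles_ok
```

Verifying a given set is cheap; the difficulty lies in finding a set that passes the check among all subsets of size at most $k$.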
In this paper, we will discuss techniques to find instances of an RLA in the given social networking graph. We first present a simple approach to detect a trivial RLA, consisting of a single attacker. While this approach does not scale to detect larger attacks, it will be useful to quickly prune out good nodes in the graph (Section IV).
A. A Simple Approach

The most important property that distinguishes the attack group from a social subgraph is the existence of few external triangles (property 3 in Definition 1). A simple approach to detect the attacker would be to count the number of triangles containing it. For a malicious node, we would expect this count to be very low, whereas for an innocent node it will be high. For a node $v \in G$ with degree more than $\alpha$, if $\Delta_v$ is smaller than the threshold $\theta$, we mark it as malicious. Using this test, we will obviously find the RLA instances of size $k = 1$. This approach is motivated by the clustering coefficient [9] property, which was used successfully in [5] to find a set of spammers in email graphs.

There is, however, a smarter way of creating the RLA by employing collaboration among the attackers, which is harder to detect by simply counting triangles. In a collaborative RLA, the malicious user starts connecting the attack nodes (e.g. by sending email from one spam-id to another) to camouflage
the attack. If the attackers are connected densely, they will form a large number of triangles inside the attack group (e.g. a $k$-clique will have $O(k^3)$ triangles). Thus each attack node will have a high clustering coefficient. This attack is hard to identify, since no single attacker appears as a malicious node. But, if we collapse the attack group into a single node, the collapsed node will still have very few triangles with the rest of the graph. In this paper, we target such collaborative efforts of the attackers.
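The effect of collaboration on the single-node test, and the effect of collapsing the attack group, can be seen in a small sketch. The Python code below is illustrative only: the names `triangles_at`, `simple_test`, and `collapse`, the merged-node label `"A*"`, and the adjacency-dictionary representation are assumptions, not part of the paper.

```python
from itertools import combinations

def triangles_at(adj, v):
    """Number of triangles that contain node v."""
    return sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])

def simple_test(adj, alpha, theta):
    """Nodes flagged by the single-node test: degree > alpha, triangles < theta."""
    return {v for v in adj
            if len(adj[v]) > alpha and triangles_at(adj, v) < theta}

def collapse(adj, group, merged="A*"):
    """Return a new graph in which all nodes of `group` become one node."""
    group = set(group)
    new_adj = {merged: set()}
    for v, nbrs in adj.items():
        if v in group:
            new_adj[merged] |= nbrs - group    # keep only edges leaving the group
        else:
            new_adj[v] = {merged if u in group else u for u in nbrs}
    return new_adj
```

As a usage example, take four attackers that form a clique and each contact ten otherwise unconnected victims. Every attacker sits in three internal triangles, so with `theta = 2` the single-node test flags nobody, while the collapsed node has forty neighbors and no triangles and is flagged.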
B. Hardness of Finding RLA

We now show the hardness of detecting (the collaborative version of) an RLA, and prove that it is NP-complete. We assume that enumerating all possible attacks in the network, which takes $|G|^k$ time, is infeasible, and non-polynomial in $|G|$. We show a reduction from the following NP-hard problem, called min-partition [12]: given an undirected graph $G = (V, E)$ and a parameter $k$, find a non-empty subgraph $T$ of size at most $k$ which minimizes the number of edges between $T$ and $V - T$. We omit details of the reduction due to lack of space, and only state the following theorem.

Theorem 1: The problem of detecting an RLA is NP-complete.

III. RELATED WORK

In the past few years, there has been a growing interest in social networks, both on the theoretical front [9], [10] as well as innovative system designs in the form of various social networking websites [4], [3], [13], [14]. In their seminal work, Milgram [15] established the "Six degrees of Separation" in (snail) mail networks. Since then, researchers have studied many interesting properties of social networking graphs. Watts et al. [9] introduced the notion of clustering coefficient that differentiates social interaction graphs from random graphs. Later works [16], [17], [18] have used this clustering property to generate models of social networks.

Along with their popularity, social networks have always been vulnerable to attacks. In [19], the author describes a Sybil Attack, in which a malicious user creates multiple identities and uses them to bias the outcome of an online voting process. A sybil attack assumes that the edges between attackers and victims are based on mutual trust, and are very small in number. This is very different from an RLA, since contrary to our definition, a sybil group will have few edges with the victims. Recently, Yu et al. [20], [21] presented a protocol-based approach for preventing the sybil attack in a P2P system. The sybil attack has also been studied in the context of sensor networks [22], [23]. In another recent paper, Kleinberg et al. [24] described a new kind of attack in anonymized social networks, which can be used to compromise the identities of victim nodes and find out the communication pattern between them.

A slightly different kind of attack that is becoming popular in social networking is the idea of Viral Marketing [25], which leverages social networks to produce rapid increase in brand awareness. In [6], Mahajana et al. showed how a friend network can be used in marketing. This work was later enhanced by [7], [8] to find the optimal set of users to target for viral marketing.

Spam email has been the most effective and successful attack in email networks [2]. Existing techniques against spam [26], [27], [28] use various machine learning and Bayesian algorithms to build models of spam based on a (previously-classified) set of good and spam emails. Since we look at only the sender and receiver of an email, and not the content, these techniques are not applicable to our problem. Recently there has been some research on spam detection that leverages the structure of the social networking graph [5], [29], [30]. In [5], the authors utilize the notion of clustering coefficient for finding spammers in a social network. This is similar to the approach presented in Section II-A, since they also exploit graph properties to detect spam. However, they do not consider the collaborative nature of attacks; if the sources of the spam mails are interconnected, then their method will not be very effective.

There has been a substantial amount of research in the database and data mining community on finding large dense subgraphs in a massive graph [31], [32], [33]. But these approaches do not cater to our problem formulation. First, unlike these problems, the RLA graphs are not large as compared to the size of the entire graph. Second, we are not interested in finding all the dense subgraphs, but only those which have a sparse neighborhood. Moreover, social networking graphs are heavily clustered in the sense that they contain many dense subgraphs. In this case, the above approaches may end up generating many false positives.

IV. CATCHING SUSPECTS

As a first step towards finding an RLA, we identify a set of suspect nodes that are potentially part of the attack cluster. We test each node in the graph individually for two properties, and if it doesn't satisfy either of them, we mark it as a suspect. These properties, called clustering and neighborhood independence, are intuitively derived from randomness in the set of victim nodes. They both use the fact that victims are expected to have very few edges among themselves. Both these methods help in pruning away almost all of the good nodes. In the next section, we describe in detail how to grow the neighborhood of the suspect nodes to catch the attack.

A. Clustering Property

The first test is motivated by the definition of the clustering coefficient of nodes in a social graph [9]. We say that a node $v$ satisfies the clustering property iff
$\Delta_v \ge c \, n_v (n_v - 1)/2$.    (4)
where $\Delta_v$ and $n_v$ are, respectively, the number of triangles and neighbors of $v$, and $0 < c < 1$ is a constant. A node in the graph is marked as a suspect if it does not satisfy the clustering property and its degree exceeds $\alpha$. In the LiveJournal dataset, we found that for $c = 0.005$, almost 99% of the nodes satisfy the clustering property.
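The clustering test can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' code: the function names and the adjacency-dictionary representation are hypothetical, the comparison in the clustering property is treated as inclusive, and the graph is assumed to be undirected with no self-loops.

```python
from itertools import combinations

def satisfies_clustering(adj, v, c):
    """True if the triangle count of v is at least c times its neighbor pairs."""
    n_v = len(adj[v])
    # Delta_v: pairs of neighbors of v that are themselves connected.
    delta_v = sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return delta_v >= c * n_v * (n_v - 1) / 2

def clustering_suspects(adj, c, alpha):
    """Nodes with degree above alpha that fail the clustering property."""
    return {v for v in adj
            if len(adj[v]) > alpha and not satisfies_clustering(adj, v, c)}
```

For instance, a hub connected to twenty mutually unconnected nodes has no triangles among 190 neighbor pairs and fails the test for $c = 0.005$, whereas every member of a five-node clique passes it.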
[Figure omitted; legend: Uniform (C), Uniform (I), Non-Uniform (C), Non-Uniform (I).]
large independent set of neighbors. We say that a node satisfies the neighborhood independence property iff
IV {