Globecom 2013 - Communications QoS, Reliability and Modelling Symposium

Adaptive Spammer Detection at the Source Network

Pedro Henrique B. Las-Casas, Universidade Federal de Minas Gerais, [email protected]
Jussara M. Almeida, Universidade Federal de Minas Gerais, [email protected]
Marcos A. Gonçalves, Universidade Federal de Minas Gerais, [email protected]
Dorgival Guedes, Universidade Federal de Minas Gerais, [email protected]
Artur Ziviani, National Laboratory for Scientific Computing (LNCC), [email protected]
Humberto T. Marques-Neto, Pontifícia Universidade Católica de Minas Gerais, [email protected]

Abstract—The large volume of unwanted email (spam) traffic wastes network resources. We have previously proposed SpaDeS, a method for spammer detection at the source network which uses only network-layer metrics. We here present an extension of SpaDeS focused on the diversity of its training data and on its adaptability to new spammer behavior patterns. To that end, we propose a new active-learning-based strategy to select new, highly informative training samples, aiming at reducing the loss of effectiveness over time. The new method was applied to a real data set, and the results show that, despite some variation in performance, the use of active learning to better select the training set improves the classification of legitimate users by as much as 21%, with just a small performance loss (less than 3%) in spammer classification.

I. INTRODUCTION

Despite the large number of anti-spam techniques available, spam remains a large portion of Internet traffic. Indeed, it has been reported that almost 80% of all email messages in the network are spam [1], often related to the spread of malware (e.g., trojans, worms, and viruses). The volume of spam traffic in the network remains large because, despite being effective, anti-spam techniques are often applied only at the destination mail servers (or at an appropriate intermediate point).

We have previously proposed SpaDeS (Spammer Detection at the Source) [2], a method to be used as a complement to anti-spam techniques at the receiving server. SpaDeS targets the early detection of spammers still at their source networks, thus contributing to reducing spam traffic in the network. By analyzing SMTP (Simple Mail Transfer Protocol) sessions, SpaDeS is able to detect potential spammers. With that information, the network administrator can take action in accordance with that network's policies, such as blocking traffic from the suspected source, or less drastic measures, like sending alert messages to the suspected users, issuing (periodic) challenges to test their legitimacy, or introducing delays in the messages they send. SpaDeS uses only network-level metrics and does not require inspection of message content, and thus can be applied, for example, by broadband Internet access providers.

SpaDeS is based on a supervised classification technique. As such, it requires a consistent training set, composed of examples of users previously classified into each group of interest (e.g., spammers and legitimate users) based on external



information (e.g., manual labeling by the system administrator). The algorithm then builds a classification model that can differentiate legitimate users from spammers based on patterns inferred from the training set. Since user characteristics may change over time, the classification model must be rebuilt periodically, i.e., the model is retrained, to capture such changes and maintain its effectiveness. To reduce the frequency at which external data is needed to build a new training set, SpaDeS uses an iterative strategy: during each classification period, it selects the users classified with higher confidence to compose the training set for the next period [2].

SpaDeS' iterative strategy may introduce errors in the training set, since users selected in each iteration may have been misclassified despite receiving high confidence levels, as confidence is not a perfect discriminative measure. More importantly, the SpaDeS heuristic of including in the training set only users previously classified with high confidence may compromise the diversity of the training samples, as selected users tend to exhibit more homogeneous behavior (more similar to the current training set used in their classification). In the long run, this can hurt performance, as new spammers may adopt different behavior patterns and old spammers may adapt to the current detection strategies, sometimes at a fast pace. In fact, initial experiments to assess the impact of the SpaDeS iterative solution on effectiveness over time, using a real data set with anonymized information about user SMTP sessions from a Brazilian broadband ISP, showed that SpaDeS performance fluctuates significantly over a short period of 28 days [2].

Accordingly, we here propose a new strategy to dynamically select training examples, aiming at improving the effectiveness of SpaDeS, particularly when it is applied iteratively over long periods of time. Our strategy uses an active sampling technique (ALAC, Active Lazy Associative Classifier [3]), which selects users with more "informative" behavior patterns in each class, with regard to the current body of knowledge available in the training set, in order to update this training set with new patterns. More informative users exhibit more diverse behavior compared to others whose behavior is already known, strengthening the set of considered patterns and making the learned models more robust. As an additional advantage, user selection based on informative capacity


reduces the need for training considerably when compared to traditional supervised techniques, while keeping similar effectiveness. Thus, ALAC tends to use far fewer users for training while preserving classification effectiveness over time. Those users must then be manually labeled by SpaDeS administrators (or other sources), a simple task given the reduced set of users selected by this technique.

To evaluate the proposed approach, we first assess ALAC's potential for building the initial SpaDeS training set. For that, ALAC was applied to another data set from the same broadband ISP, collected in 2009, to select a training set. That set, comprising 129 users, was used as input for the classification of a more recent data set, collected in 2010, using SpaDeS. Both data sets cover periods of 28 days. Results show that more than 99% of legitimate users and 84% of spammers were correctly classified. This performance is very close to that of the original method, but was achieved using only 2.5% of the amount of training data used by it. We also analyze ALAC's effectiveness combined with the previously proposed iterative strategy [2]. For this, we evaluate SpaDeS over the 28 days of the 2010 data set, using users classified with high confidence in one day's iteration, along with users selected by ALAC (and manually classified), to compose the training set for the next day. The new approach improved the classification of legitimate users by as much as 21%, with a small performance loss (less than 3%) in spammer classification. More importantly, the results suggest a good tradeoff between the correct classification of legitimate users and of spammers, an important issue for detection methods at the source network.

In sum, the main contribution of this work is the improvement of SpaDeS with a new training selection strategy that increases its adaptability and robustness in dynamic and evolving scenarios, as users (both legitimate and spammers) tend to adapt their behavior over time, precisely to try to evade detection measures.

This paper proceeds as follows. Section II discusses related work, and Section III describes SpaDeS and the new proposed strategy. Section IV presents our data sets. Section V presents our results, and Section VI concludes the work.

II. RELATED WORK

We need to understand the spammers' characteristics in order to develop better methods to detect them. Kim et al. [4] showed that the interval between arrivals of spam messages is below the range observed for legitimate emails (less than 5 seconds in 95% of the cases). Gomes et al. [5] analyzed a workload of user messages from a Brazilian university and highlighted a number of characteristics that distinguish spam from legitimate messages. In an extension of that work, the same authors showed that legitimate traffic has lower entropy than the traffic generated by spammers, who usually send emails to their targets indiscriminately [6]. Using network traffic properties to determine whether a message is spam, Ouyang et al. [7] recently showed that metrics of a single packet or of a single flow are not effective for spam classification on their own, but combining both increases classification effectiveness. Clayton et al. [8] proposed SpamHINTS, which aims to develop detection techniques by analyzing SMTP protocol packets and inferring patterns that indicate spamming activity. Venkataraman et al. [9] studied the effectiveness of using the historical behavior of IP addresses to predict whether an email is legitimate or spam. Regarding the analysis of e-mail senders, Duan et al. [10] showed that most email servers tend to send only spam or only legitimate messages. Xie et al. [11] showed that the vast majority of e-mail servers running on dynamic IP addresses are used only for sending spam.

Several studies have proposed identifying spammers at intermediate points in the network. Ramachandran and Feamster [12] investigated network-level characteristics common to spammers, such as persistent IP addresses and routes, as well as specific characteristics of botnets. Hao et al. [13] applied machine learning techniques to data collected from the network layer to classify spammers and legitimate users on a server positioned between the source and destination networks. Schatzmann et al. [14] proposed detecting spammers at the level of autonomous systems, collecting and combining the views of multiple local destination servers.

Unlike these works, some have proposed detecting spammers at the source network, in order to minimize the waste of network resources. For example, Xie et al. [15] developed DBSpam to detect proxy-based spamming activities, relying on the packet symmetry of the traffic. The difference between SpaDeS and DBSpam is that DBSpam detects only spam proxies, while SpaDeS identifies bots involved in all kinds of spamming activities. Other methods for detecting spam also use supervised classification techniques, but explore characteristics of the content of the messages [16], [17]. In contrast, SpaDeS [2] explores similar techniques but considers only metrics related to the protocols involved, without inspecting message content, thus preserving the privacy of legitimate users. Finally, ALAC's active sampling technique has been used to reduce the training set required by a supervised method to detect content pollution on YouTube [18]. Its use to improve SpaDeS effectiveness over time by boosting adaptability is a key contribution of this work. To our knowledge, active learning has not been used in this context before.

III. SPADES AND THE ITERATIVE STRATEGY

SpaDeS uses a machine learning technique to classify users into a number of pre-defined target classes. Its core component is a supervised classification algorithm, LAC (Lazy Associative Classifier) [19], which takes as input a training set containing pre-classified users from each class considered. The algorithm first "learns" a user classification model from the training set. After the learning phase, the derived model is used to classify new users (the test set) into the pre-defined classes. The method is iterative, automatically selecting, after each iteration, a new training set to be used in the training phase of the next iteration.

A. User representation model

The ISP flow logs were processed to extract user information for each day of the period. During the development of the method, five attributes were selected as the most effective for spammer identification: number of observed SMTP transactions, number of distinct SMTP servers targeted, average transaction size in bytes, average geodesic distance to the destination, and average SMTP transaction inter-arrival time (IAT).
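To make this representation concrete, the sketch below (not the authors' code) shows how these five per-user, per-day attributes could be computed from a list of SMTP transaction records; the record field names (start_ts, dst_server, bytes, dst_distance_km) are assumptions for illustration only.

```python
# Illustrative sketch of the per-user, per-day feature vector used by SpaDeS.
# Field names are hypothetical; they are not the ISP's actual schema.
from statistics import mean

def user_features(transactions):
    """transactions: non-empty list of dicts describing one user's successful
    SMTP transactions on a given day."""
    txs = sorted(transactions, key=lambda t: t["start_ts"])
    iats = [b["start_ts"] - a["start_ts"] for a, b in zip(txs, txs[1:])]
    return {
        "num_transactions": len(txs),                           # observed SMTP transactions
        "num_distinct_servers": len({t["dst_server"] for t in txs}),
        "avg_size_bytes": mean(t["bytes"] for t in txs),
        "avg_geo_distance_km": mean(t["dst_distance_km"] for t in txs),
        "avg_iat_seconds": mean(iats) if iats else 0.0,         # transaction inter-arrival time
    }
```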

B. Supervised Classification

The LAC classification algorithm exploits the fact that there are frequently strong associations between attribute values and classes. Such associations are generally implicit in the training set and, when found, reveal patterns that can be used to predict the classes of users. LAC produces a classification model composed of rules X → c_i, each indicating an association between a set of attribute values X and a class c_i. LAC learns the classification model in two phases: on-demand rule extraction and class prediction. Rule extraction is performed on demand, driven by each user in the test set. In other words, for each user u in the test set, LAC projects and filters the training set according to the attribute values of u, extracting rules from the filtered set. This ensures that only rules with information relevant to u are extracted from the training set, reducing the number of possible rules. LAC then estimates a confidence θ(X → c_i) for each extracted rule X → c_i. Given that a user u contains all attribute values in X, θ(X → c_i) estimates the conditional probability that the class of u is c_i, based on the attributes in X. To predict the class of u, LAC combines all rules X → c_i where X contains attribute values that coincide with those of u. Each rule is treated as a vote for class c_i, and the probability that the class of u is c_i is estimated by averaging all confidence votes for c_i. The class with the highest probability is assigned to u. LAC is highly scalable, with polynomial time complexity. Unlike many classifiers, it also provides a relatively reliable confidence estimate for each prediction, which can be interpreted as a probability of correct classification.
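The following minimal Python sketch illustrates the demand-driven flavor of LAC described above: the training set is projected onto the test user's feature-values, rules up to a bounded size are extracted, and confidence votes are averaged per class. It is a simplification, under the assumption that features have already been discretized into categorical values; it is not the authors' implementation.

```python
# Simplified sketch of LAC-style on-demand rule extraction and class prediction.
from itertools import combinations
from collections import defaultdict

def lac_predict(train, user, max_rule_size=2, min_support=1):
    """train: list of (feature_dict, label); user: feature_dict of categorical values.
    Returns (predicted_label, average-confidence score per label)."""
    items = set(user.items())
    # Project the training set onto the feature-values of this test user.
    projected = [(set(f.items()) & items, c) for f, c in train]
    projected = [(fv, c) for fv, c in projected if fv]
    votes = defaultdict(list)
    for size in range(1, max_rule_size + 1):
        for X in combinations(sorted(items), size):
            Xs = set(X)
            matching = [c for fv, c in projected if Xs <= fv]
            if len(matching) < min_support:
                continue
            for c in set(matching):
                conf = matching.count(c) / len(matching)   # theta(X -> c)
                votes[c].append(conf)                      # one vote per rule
    scores = {c: sum(v) / len(v) for c, v in votes.items()}
    return (max(scores, key=scores.get) if scores else None), scores
```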

C. Iterative Operation

The operation of any supervised classification method depends primarily on a training set containing pre-classified users. Obtaining a training set for the classification of spammers and legitimate users is a challenge, since such data is typically not publicly available. A complicating factor is that the goal is to detect spammers still at the source network. Therefore, the training set must be collected at that point in the system; otherwise, the inferred association rules may not generalize, resulting in poor classification.

The strategy previously proposed for selecting the initial training set of SpaDeS [2] is to run the unsupervised X-means clustering algorithm on previously collected data in order to identify profiles (or classes) of users. For each identified class, we select the M users closest to the centroid of the corresponding group, so as to obtain good representatives of each class; the group identified by the algorithm is used as the user class. For classes of users with an abusive pattern, indicative of possible spamming activity, we chose instead to use data from an external, potentially more reliable source, since this information was available. Specifically, the machines identified as sources of spam by reports from other providers were used as representative spammers. These reports, collectively named abuse, are generated both by user complaints and by automatic mechanisms, such as blacklists or any other automatic spam detection mechanism, and are thus a reliable source of information.

To reduce the frequency with which we must resort to this strategy, which depends on a fine-grained grouping of users and on data from external sources that are not always available, we also proposed an iterative strategy that exploits the users previously classified with higher confidence as the training set for the next iteration. This strategy considers successive test sets t_1, t_2, ..., t_n and selects, as the training set for the classification of test set t_i, the users of set t_{i-1} who were classified with a confidence above a certain threshold. Algorithm 1 shows the strategy used; a code sketch of it follows the listing. It ensures that at least α% of the users from each class are selected, maintaining a uniform minimal confidence level across all classes.

Algorithm 1: Generate Training Set for the i-th Iteration (input for the classification of users in test set Ts_i).
Require: L_{i-1}, the list of users in test set Ts_{i-1}, with the corresponding classes and classification confidences produced by LAC in the (i-1)-th iteration.
Ensure: Training set Tr_i for iteration i.
  for each class c = 1..4 do
    Sort the users classified by LAC as c in descending order of classification confidence;
    Select the top α% of the users with highest confidence;
  end for
  Let θ_c^min be the smallest confidence among the users of class c (c = 1..4) selected in the previous step;
  Let δ = min(θ_1^min, θ_2^min, θ_3^min, θ_4^min);
  Insert into Tr_i all users whose LAC classification confidence is at least δ, keeping the classes assigned by LAC in the (i-1)-th iteration.
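As an illustration, a direct Python transcription of Algorithm 1 could look like the sketch below; the `predictions` format (user, predicted class, confidence) is an assumption about how LAC's output is represented.

```python
# Sketch of Algorithm 1: confidence-based selection of the next training set.
from collections import defaultdict

def next_training_set(predictions, alpha=0.20):
    """predictions: list of (user_id, predicted_class, confidence) from the
    previous iteration; alpha is the per-class fraction (e.g. 0.20 for 20%)."""
    by_class = defaultdict(list)
    for user, cls, conf in predictions:
        by_class[cls].append((conf, user))

    # Per class, the smallest confidence among its top alpha% of users.
    thetas_min = []
    for cls, users in by_class.items():
        users.sort(reverse=True)                 # descending confidence
        k = max(1, int(alpha * len(users)))      # at least alpha% per class
        thetas_min.append(users[k - 1][0])

    delta = min(thetas_min)                      # uniform minimal confidence
    # Keep every user whose confidence is at least delta, with its LAC class.
    return [(user, cls) for user, cls, conf in predictions if conf >= delta]
```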

D. Active Selection of the Training Set

The iterative solution reduces the need for building good user clusters a priori and for using external classification sources (e.g., reported abuse data), but it may introduce errors in the training set, as LAC may assign a high confidence to a misclassified user. These errors may greatly affect classification effectiveness over time. Moreover, the iterative solution reduces the diversity of the training set, even though diversity is important to keep classification effectiveness high over time. Because of that, SpaDeS may not adapt well to changes in user behavior patterns. By selecting only users classified with high confidence to compose the next training set, SpaDeS ends up selecting users with very homogeneous profiles (i.e., similar to the profiles used as training for their classification). Since users tend to exhibit high variability and heterogeneity [2], this procedure may fail to capture some relevant behaviors, especially for legitimate users. It may also hurt the detection of malicious behavior in the long run, as spammers may change their behavior over time.

To address these limitations, we here propose the use of ALAC (Active Lazy Associative Classifier) [3] as an active selection method to help build the training set. With active selection we intend to choose, from a set of unclassified users, a small subset that can be more diverse, and therefore useful and effective for the supervised classification process in the


long run. Each selected user is manually classified by the administrator once selected. More formally, consider a set of unlabeled users U = {u_1, u_2, ..., u_n}. The problem we investigate is how to select a small subset of users in U such that the selected users summarize most of the interesting and unusual user patterns found in U. These highly informative users will compose an additional training set D, where ideally |D| ≪ |U|. In particular, ALAC exploits the redundancy in feature space that exists among different users in U. That is, many users in U may share common behaviors represented by common feature-values, and ALAC uses this fact to perform effective selective sampling.

Intuitively, if a user u_i ∈ U is inserted into D, then the number of useful rules for the users in U that share feature-values with u_i will possibly increase. In contrast, the number of useful rules for those users in U that do not share any feature-value with u_i will clearly remain unchanged. Therefore, the number of rules extracted for each user in U can be used as an approximation of the amount of redundant information between the users already in D and the users in U. The sampling function employed by ALAC uses this key idea to select users that contribute primarily non-redundant information. Those informative users are the ones likely to demand the fewest rules from D. More specifically, the sampling function γ(U) returns a user in U according to Equation 1, where R_u denotes the set of rules extracted from D on demand for user u:

γ(U) = { u_i such that ∀ u_j : |R_{u_i}| ≤ |R_{u_j}| }.    (1)

The user returned by the sampling function is inserted into D, but it also remains in U. In the next round of ALAC, the sampling function is executed again, but the number of rules extracted from D for each user in U is likely to change due to the user recently inserted into D. The intuition behind choosing the user who demands the fewest rules is that such a user shares fewer feature-values with the users already inserted into D. That is, if only a few rules are extracted for a user u_i, this is evidence that D does not contain users similar to u_i; thus, the information provided by u_i is not redundant, and u_i is a highly informative user. This simple heuristic works at the fine-grained level of feature-values, trying to maximize the diversity of the training set. The extracted rules capture the co-occurrence of feature-values, helping our goal of increasing diversity, since the user who demands the fewest rules is exactly the one who shares the smallest possible number of feature-values with the users already in the training data. In case of a tie, the algorithm selects the user based on the size of the projection.

Notice that initially D is empty and, thus, ALAC cannot extract any rules from D. The first user to be labeled and inserted into D is therefore selected directly from the set of available users U. In order to maximize the initial coverage of D, the first selected user is the one who maximizes the size of the projected data in U, that is, the user u_d for which the projection U_d is the largest. This is the user who shares the most feature-values with the others in the collection and can be considered its best representative. After the first user is selected and labeled, the algorithm proceeds using the fewest-rules heuristic described above. At each subsequent round, ALAC executes the sampling function and a new example is inserted into D. At the j-th ALAC iteration, the selected user is denoted γ_j(U), and it is likely to be as dissimilar as possible from the users already in D = {γ_{j-1}(U), γ_{j-2}(U), ..., γ_1(U)}. The algorithm keeps inserting users into the training data until the stopping criterion below is achieved.

Stopping criterion:

Lemma 1: If γ_j(U) ∈ D, then γ_j(U) = γ_k(U) ∀ k > j.

Proof: If γ_j(U) ∈ D, then the inclusion of γ_j(U) does not change D. As a result, any further execution of the sampling function must return the same user returned by γ_j(U), and D will never change.

The algorithm stops when all available users in U are less informative than any user already inserted into D. This occurs exactly when ALAC selects a user who is already in D. By Lemma 1, once this condition is reached, ALAC would keep selecting the same user over and over again. At this point, the training set D contains the most informative users, and LAC can be applied to classify the users in the test set.
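A hedged sketch of the whole ALAC loop, including the stopping criterion of Lemma 1, is given below. For simplicity, the rule count |R_u| is approximated by the number of feature-values of u already present in D, and ties are assumed to be broken in favor of the larger projection; neither detail is taken from the original implementation.

```python
# Simplified sketch of ALAC's active selection loop (not the authors' code).
def alac_select(unlabeled, ask_label):
    """unlabeled: dict user_id -> feature_dict of categorical values.
    ask_label: callback performing the manual labeling of a selected user.
    Returns the labeled training data D as a list of (user_id, label)."""
    def projection_size(u):
        # Feature-values shared with the rest of the collection.
        return sum(1 for v in unlabeled.values() if v is not unlabeled[u]
                   for fv in unlabeled[u].items() if fv in v.items())

    def rule_count(u, d_values):
        # Proxy for |R_u|: feature-values of u already covered by D.
        return sum(1 for fv in unlabeled[u].items() if fv in d_values)

    # First user: the best representative (largest projection over U).
    first = max(unlabeled, key=projection_size)
    D = [(first, ask_label(first))]
    d_values = set(unlabeled[first].items())

    while True:
        # Most informative user: the one demanding the fewest rules from D;
        # ties broken by projection size (assumed: larger projection wins).
        pick = min(unlabeled, key=lambda u: (rule_count(u, d_values),
                                             -projection_size(u)))
        if pick in {u for u, _ in D}:        # stopping criterion (Lemma 1)
            return D
        D.append((pick, ask_label(pick)))
        d_values |= set(unlabeled[pick].items())
```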

ALAC tends to select a very small training set, as we shall see in our experimental results. Thus, the extra cost associated with the manual labeling effort imposed on the system administrator tends to be low and worthwhile. Manual labeling may also allow the system to adapt more quickly to changes in user behavior. To further reduce that labeling cost, we propose to build the training set for each iteration by first applying the original automatic selection strategy, picking the users previously classified with higher confidence, and then applying ALAC. In other words, at the beginning of the i-th SpaDeS iteration, the set U is composed of all users who were not selected by Algorithm 1, while the actual training set used to classify the users of test set Ts_i is Tr_i ∪ D. Notice that only the new users selected by ALAC must be manually labeled before being added to the training set.
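Combining the two selection steps for one SpaDeS iteration could then be sketched as follows, reusing the hypothetical helpers next_training_set and alac_select from the earlier sketches; only the users newly picked by ALAC would require manual labels.

```python
# Sketch of building the training set Tr_i U D for one SpaDeS iteration.
def build_training_set(predictions, features_by_user, ask_label, alpha=0.20):
    tr_i = next_training_set(predictions, alpha)              # automatic, high confidence
    selected = {u for u, _ in tr_i}
    remaining = {u: f for u, f in features_by_user.items() if u not in selected}
    d = alac_select(remaining, ask_label)                     # few users, manually labeled
    return tr_i + d                                           # Tr_i union D
```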

IV. DATA SETS

We use four data sets, two containing SMTP flow logs from a large broadband ISP in Brazil, and two with lists of users from that ISP who were denounced as spammers through that ISP’s abuse service during the same periods. Flow logs were collected using Cisco’s Service Control Engine (SCE) devices [20], and contain IP addresses, transport protocol and ports, start timestamp, duration, and volume of bytes sent/received. Each data set was combined with the ISP’s DHCP logs for the same period, so that local IP addresses could be tracked back to users (by their MAC addresses). Direct user identification was prevented by anonymization. The data sets cover the periods from March 1st to March 28th in 2009, and from June 12th to July 9th in 2010. SCE logs were filtered to extract only successful, non-empty SMTP transactions. After filtering, the 2009 and 2010 data sets contained 6.3 million transactions from 5,479 users, and 5 million transactions from 5,389 users, respectively. The abuse data sets include the denounced IP address and the date/time of each report, using ARF (Abuse Reporting Format). We built a tool to extract that information and correlate it with the SMTP traces and DHCP data. Using it, we identified 67 and 93 confirmed spammers in the 2009 and 2010 traces, respectively.
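For illustration, the preprocessing described above could be sketched as below; the flow and DHCP record layouts (dst_port, bytes_sent, src_ip, lease intervals) are assumptions, not the actual SCE or ISP schemas, and "successful" transactions are approximated by a non-zero byte count.

```python
# Hedged sketch of the flow filtering and DHCP/abuse correlation step.
def smtp_transactions(flows, dhcp_leases, abuse_reports):
    """flows: iterable of flow dicts with datetime start_ts;
    dhcp_leases: list of (ip, mac, lease_start, lease_end);
    abuse_reports: set of (ip, date) pairs extracted from ARF reports."""
    def mac_for(ip, ts):
        for lease_ip, mac, start, end in dhcp_leases:
            if lease_ip == ip and start <= ts <= end:
                return mac
        return None

    for f in flows:
        # Keep only successful, non-empty SMTP transactions.
        if f["dst_port"] != 25 or f["bytes_sent"] == 0:
            continue
        user = mac_for(f["src_ip"], f["start_ts"])   # track IP back to user
        if user is None:
            continue
        reported = (f["src_ip"], f["start_ts"].date()) in abuse_reports
        yield {"user": user, **f, "abuse_reported": reported}
```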


V. EXPERIMENTAL RESULTS

In this section, we first evaluate the effectiveness of the new training selection strategy (Section V-A). Next, we evaluate user classification over a period of 28 days, comparing the results obtained with the original method [2] against those achieved by the new SpaDeS with active selection of the training set (Section V-B). We here consider two user classes, namely legitimate users and spammers. In order to evaluate the accuracy of the classification strategies, all users in the 2010 data set were manually inspected and classified according to these two classes. We also used the list of users reported in the 2010 abuse logs as privileged information about the real class of those users (i.e., spammers).

A. Active Selection of the Training Set

To evaluate the potential of active training set selection with ALAC, we classified the users in the 2010 data set using, as training set, users selected from the 2009 data set according to two approaches: (i) selection using ALAC, and (ii) selection using the (original) iterative approach only. In the first case, 129 users out of a total of 5,352 were selected by ALAC. The manual classification of those users led to the identification of 57 spammers and 72 legitimate users. In the second case, which is equivalent to the experiments reported in [2], we first used the X-means clustering algorithm [21] to group users into two classes (legitimate/spammer). The 60 users closest to the centroid of the legitimate class and the 63 users reported as spammers in the abuse logs for the same period were then used as the training set for the classification of the other users in the same (2009) data set. Finally, we selected the users classified in the previous step with the highest confidences (using α = 20%) as the training set for the classification of the 2010 data set. Using this strategy, 3,230 legitimate users and 982 spammers were selected.

Table I shows the classification results for each strategy in terms of the percentages of legitimate users and spammers that were correctly classified, and the size of the training set used. Note that active selection produces results very close to those of the iterative approach. Indeed, it leads to a slightly higher hit ratio for legitimate users (99.16% against 97.16%), while the original SpaDeS approach was slightly superior when classifying spammers (87.21% against 84.22%).

TABLE I. IMPACT OF THE TRAINING SET SELECTION ON CLASSIFICATION.

                   Correctly Classified   Correctly Classified   Training
                   Legitimate             Spammers               Set Size
Original SpaDeS    97.61%                 87.21%                 4,212
SpaDeS + ALAC      99.16%                 84.22%                 129

We argue that, for detecting spammers at the source network, it is important to correctly classify the largest possible fraction of legitimate users, as long as a good rate of spammer identification is maintained. This is because the cost of false positives can be high, since misclassification may cause discomfort or penalties to a legitimate user. Misclassifying a few spammers does not have such serious implications, since the spam they send can still be blocked by a filter at the destination or at some intermediate point. We conclude that the use of active sampling with SpaDeS is viable, being a promising strategy to handle behavioral changes, as discussed next.

B. Temporal Evolution and Classification Effectiveness

It is important to evaluate SpaDeS over time to assess whether it can be applied in practice, given its iterative nature. For example, the frequency of retraining the classification

model depends on how stable it remains over time, and defines how often new external data and user clustering are needed. To evaluate the impact of adaptability issues on SpaDeS effectiveness, we ran experiments on the 2010 data set divided by days. To bootstrap the process, we used X-means to classify the users of the first day as legitimate users or spammers. An initial training set was built with the 40 users closest to the centroid of legitimate users, along with 43 users listed as spammers in the abuse log for that day. For the original version of SpaDeS, that training set was used to classify the users of the second day and, from that point on, the iterative approach (with α = 20%) was applied to classify the users of each day.

For SpaDeS with active selection, the same initial training set was used. For each of the following days, the training set for day i+1 was built in two steps. First, the results produced by SpaDeS for day i were used to select the users classified with the highest confidence (using α = 20%) in each class. Next, ALAC was executed on the results for day i to select users to be manually labeled. Out of the users selected by ALAC (step 2), only those that had not already been selected by SpaDeS (step 1) were manually classified and added to the training set, so as to reduce the manual classification cost. Indeed, over the 27 days considered, only 15 users on average (26 at most) had to be manually labeled each day. The users selected in step 1 were added to the training set with their predicted classes.

Figure 1 shows the results for both strategies on each of the 27 days. Despite some fluctuation, the original iterative strategy used by SpaDeS correctly classified at least 82% of the spammers and 75% of the legitimate users every day. With ALAC, SpaDeS also shows some fluctuation, but the fraction of correctly classified legitimate users reaches much higher levels: it always remains above 80%, reaching close to 100% on several days. Indeed, the new strategy yielded equal or better results on 93% of the days. We note that the fraction of correctly classified spammers is somewhat smaller with ALAC, but it always remains above 72%. Overall, compared to the iterative approach, SpaDeS with ALAC produces an average daily gain in the classification of legitimate users of 8% (maximum gain of 21%), while causing only a very small loss in the identification of spammers (3% on average).

As argued before, the superiority of SpaDeS with ALAC in the correct classification of legitimate users is due to the fact that ALAC tends to include users with different legitimate profiles in the training set, enhancing its diversity. On the other hand, the inclusion of multiple examples of (different) legitimate behaviors adds some noise to the classification of spammers, which causes a (very) small performance loss in spammer identification compared to the original method. The use of ALAC diversifies the training set, covering new behaviors, since it selects users with distinct and discriminative characteristics, whereas the automatic approach proposed in [2] tends to select users similar to the ones already in the training set. Thus, using this new method enables the practical implementation of SpaDeS in real situations, since the actual behavior of spammers (and even of legitimate users) changes over time, and ALAC allows adaptation to newly emerging patterns.

Fig. 1. Temporal Evolution: Original SpaDeS vs. SpaDeS + ALAC. (a) Correctly classified legitimate users (%) per day; (b) correctly classified spammers (%) per day.

In the end, our goal is to minimize the number of misclassified legitimate users while keeping good accuracy for spammers. Thus, the improvement provided by integrating active selection into SpaDeS is significant, justifying the small manual labeling effort required by the new process. Nevertheless, we point out that the choice of strategy depends on the application scenario. The most common case would be to prioritize the correct classification of legitimate users, using SpaDeS with ALAC. But if the goal is to maximize the classification of spammers, then the original iterative method may be a better option.

VI. CONCLUSION

We evaluated the sensitivity of SpaDeS to emerging patterns and behavioral changes over time. A new active learning strategy for training set selection was shown to enhance adaptability as time goes by. Our experimental evaluation showed that SpaDeS with active selection of training examples is equal or superior to the original iterative strategy when classifying legitimate users on 93% of the days, with a daily gain of as much as 21%, while incurring only a small loss (3% on average) in the classification of spammers. We also showed that the use of ALAC improves the applicability of the method in real situations, since it helps to identify new behavior patterns of spammers over time. In the future, we plan to validate these results on newer data sets, possibly over longer periods of time.

ACKNOWLEDGMENTS

This work is supported by the Brazilian National Institute of Science and Technology for the Web (InWeb, MCT/CNPq 573871/2008-6), CNPq, Capes, FAPEMIG, and FAPERJ.

REFERENCES

[1] D. Fletcher, "A Brief History of Spam," Time Magazine, Nov. 2009.
[2] P. H. Las-Casas, D. Guedes, J. M. Almeida, A. Ziviani, and H. T. Marques-Neto, "SpaDeS: Detecting Spammers at the Source Network," Computer Networks, vol. 57, no. 2, pp. 526-539, 2013 (special issue on Botnet Activity: Analysis, Detection and Shutdown).
[3] R. Silva, M. A. Gonçalves, and A. Veloso, "Rule-based Active Sampling for Learning to Rank," in Proc. ECML PKDD, 2011.
[4] J. Kim and H. Choi, "Spam Traffic Characterization," in Proc. Int'l Technical Conference on Circuits/Systems, Computers and Communications, Shimonoseki City, Japan, 2008.
[5] L. H. Gomes, C. Cazita, J. M. Almeida, V. Almeida, and W. Meira Jr., "Workload Models of Spam and Legitimate E-mails," Performance Evaluation, vol. 64, no. 7-8, pp. 690-714, August 2007.
[6] L. Gomes, V. Almeida, J. Almeida, F. Castro, and L. Bettencourt, "Quantifying Social and Opportunistic Behavior in Email Networks," Advances in Complex Systems, vol. 12, no. 1, pp. 99-112, January 2009.
[7] T. Ouyang, S. Ray, M. Rabinovich, and M. Allman, "Can Network Characteristics Detect Spam Effectively in a Stand-alone Enterprise?" in Proc. 12th Passive and Active Measurement Conference, March 2011.
[8] R. Clayton, "spamHINTS: Happily It's Not The Same," Online, February 2006, http://www.spamhints.org/.
[9] S. Venkataraman, S. Sen, O. Spatscheck, P. Haffner, and D. Song, "Exploiting Network Structure for Proactive Spam Mitigation," in Proc. 16th USENIX Security Symposium, 2007.
[10] Z. Duan, K. Gopalan, and X. Yuan, "An Empirical Study of Behavioral Characteristics of Spammers: Findings and Implications," Computer Communications, vol. 34, no. 14, pp. 1764-1776, Sep. 2011.
[11] Y. Xie, F. Yu, K. Achan, E. Gillum, M. Goldszmidt, and T. Wobber, "How Dynamic Are IP Addresses?" SIGCOMM Computer Communication Review, vol. 37, pp. 301-312, August 2007.
[12] A. Ramachandran and N. Feamster, "Understanding the Network-Level Behavior of Spammers," SIGCOMM Computer Communication Review, vol. 36, no. 4, pp. 291-302, 2006.
[13] S. Hao, N. A. Syed, N. Feamster, A. Gray, and S. Krasser, "Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine," in Proc. USENIX Security Symposium, 2009.
[14] D. Schatzmann, M. Burkhart, and T. Spyropoulos, "Inferring Spammers in the Network Core," in Proc. 10th Int'l Conference on Passive and Active Network Measurement, 2009.
[15] M. Xie, H. Yin, and H. Wang, "Thwarting E-mail Spam Laundering," ACM TOIS, vol. 12, no. 2, pp. 1-32, 2008.
[16] A. Kolcz and J. Alspector, "SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs," in Proc. Workshop on Text Mining, 2001.
[17] R. D. Lakshmi and N. Radha, "Spam Classification Using Supervised Learning Techniques," in Proc. 1st A2CWiC, 2010.
[18] F. Benevenuto, T. Rodrigues, A. Veloso, J. Almeida, M. A. Gonçalves, and V. Almeida, "Practical Detection of Spammers and Content Promoters in Online Video Sharing Systems," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 42, no. 3, pp. 688-701, 2012.
[19] A. Veloso, W. Meira, and M. J. Zaki, "Lazy Associative Classification," in Proc. 6th International Conference on Data Mining, 2006.
[20] Cisco, "Cisco Service Control Application for Broadband Reference Guide," May 2010. [Online]. Available: http://www.cisco.com/
[21] D. Pelleg and A. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," in Proc. 17th International Conference on Machine Learning, 2000.
