SOFTWARE: PRACTICE AND EXPERIENCE Softw. Pract. Exper. (2015) Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/spe.2343
Using new scheduling heuristics based on resource consumption information for increasing throughput on rule-based spam filtering systems
David Ruano-Ordás, Jorge Fdez-Glez, Florentino Fdez-Riverola and José Ramón Méndez*,†
Department of Computer Science, University of Vigo, ESEI, Campus As Lagoas, 32004 Ourense, Spain
SUMMARY
The large increase in spam deliveries since the first half of 2013 has created hard-to-solve problems for spam filters. To fight spam adequately, the throughput of spam filtering platforms must be increased. In this context, and given the widespread use of rule-based filtering frameworks in the spam filtering domain, this work proposes three novel scheduling strategies that reduce the time needed to classify new incoming e-mails through an intelligent management of computational resources based on Central Processing Unit (CPU) usage and Input/Output (I/O) delays. To demonstrate the suitability of our approaches, our experiments include a comparative study against other successful heuristics previously published in the scientific literature. The results show that one of our heuristics allows time savings of up to 10% in message filtering while keeping the same classification accuracy. Copyright © 2015 John Wiley & Sons, Ltd.
Received 29 January 2015; Revised 21 May 2015; Accepted 09 June 2015
KEY WORDS: rule optimization schedulers; increasing filtering throughput; spam detection; anti-spam filtering platforms; resource consumption-based heuristics; Wirebrush4SPAM framework
1. INTRODUCTION
Since the first interconnection between computers was established in 1969, the Internet has grown beyond the imaginable. In fact, the Internet has undeniably become an essential part of life for many people living in the most industrialized nations [1]. This circumstance has encouraged e-mail to become one of the most powerful communication tools of the modern age. Because of its ease of use and widespread dissemination, the Internet e-mail service achieved a surprising popularity. However, this fact, together with the uncontrolled nature of the Internet, has turned e-mail communications into the best framework for the promotion of illegal drugs, phishing, and other forms of scam. The rising number of this kind of unlawful activities (also known as spam) confirms that spamming has turned into one of the most profitable businesses for spammers and criminals. The big explosion of this phenomenon was reflected especially in the first decade of this century, when the presence of spam e-mails grew exponentially from 8% in 2001 up to 90% during 2009 [2]. Nevertheless, between the autumn of 2010 and the summer of 2012, the disconnection of rogue Internet Service Providers (e.g., 3FN, which made its fortune largely by hosting content for malware authors, identity thieves, child pornographers, and spammers) as well as targeted takedowns against major spam botnets (e.g., Bredolab, Rustock, and Grum) marked the beginning of a steep and steady decline (up to 8.2%) of junk e-mail volumes worldwide [3].
*Correspondence to: José Ramón Méndez Reboredo, Department of Computer Science, University of Vigo, ESEI, Campus As Lagoas, 32004 Ourense, Spain.
† E-mail: [email protected]
D. RUANO-ORDÁS ET AL.
Nowadays, because of the widespread use of social networks [4], most websites allow users to register with the same profile information and credentials they use in social networks such as Facebook, Twitter, or Google Plus. This way of enrolling has become very common, mainly motivated by two critical factors: (i) the simplicity and comfort of signing up to a website just by entering the login information of the user's social network profile and (ii) users' unawareness of the risk posed by the massive exchange of their own private information, such as their e-mail address [5]. This circumstance enabled spammers to easily obtain user e-mail addresses and hence reactivate the spam business. Moreover, the new possibilities offered by the migration to Web 2.0 technology have prompted novel forms of spam traffic, such as social spam. Illustrating the scale of this situation, a recent research report shows that social spam rose 355% in the first half of 2013 [6]. Moreover, the numerous types of social spam existing across the Internet (e.g., link spam, text spam, or image-based spam) have boosted the need to update current anti-spam filtering platforms in order to attain both (i) the enhancement of existing anti-spam techniques [7–9] and (ii) the development of new spam filtering methods [9–11]. However, we should keep in mind that increasing the complexity of current anti-spam platforms involves a drastic reduction in their filtering performance (i.e., higher filter complexity entails lower filtering throughput).
In order to fight against current spam deliveries, most filtering service providers combine classification results computed by different techniques. Thus, in recent years, rule-based spam filters (such as SpamAssassin [12]) have achieved great popularity.
This filtering software is configured to use a set of scored rules and a filtering threshold. Each rule comprises a triggering condition and a score value, which is added to the global score of the e-mail if the condition matches the message contents. After the evaluation of all the rules that compose the filter, the message is classified as spam if its total score is greater than or equal to the filter threshold.
In the particular context of rule-based spam filtering systems, and considering the latest advances introduced to skip the execution of some rules under certain conditions [13] and the importance of striking the right balance between I/O operations and CPU resources, appropriate mechanisms able to correctly schedule filtering rules become essential. Taking this situation into account, in this work we propose three different scheduling heuristics specifically designed to improve filtering throughput (i.e., reduce the classification time for each new incoming e-mail) while maintaining the same classification accuracy. The ultimate aim of our approach is to reduce the computational resources required by server-side rule-based spam filtering systems, as the most appropriate solution to cope with the large increase of spam deliveries [14].
While this section has introduced the work, the rest of the paper is structured as follows. Section 2 presents a detailed comparison of the operation cycles of the Wirebrush4SPAM [13] and SpamAssassin [12] spam filtering platforms. In Section 3, we present in detail the three new scheduling techniques aimed at increasing spam filtering throughput. Section 4 analyzes the results obtained in this work and discusses lessons learned. Finally, Section 5 summarizes the main conclusions of our approach and highlights future research lines.
2. ANTI-SPAM FILTERING FRAMEWORKS
Spam countermeasures are as old as spam attacks.
The earliest mechanisms to fight spam deliveries were based on the individual execution of one (or several) non-intelligent spam detection techniques, mainly in charge of analyzing and detecting keywords inside the e-mail content. Over the years, the massive increase of spam deliveries has forced the anti-spam filtering community to design new spam countermeasures. As a consequence, a breakthrough in the spam filtering domain happened in August 1997 with the creation of the spamometer [15] and filter.plx [16] applications. These rudimentary tools were able to automatically classify e-mails by joining the execution results of several simple anti-spam techniques such as regular expressions, RBLs, and so on. In order to support the parameterization of these techniques (e.g., customizing regular expressions or the DNS prefix for an RBL), they are added to the filter definition in the form of rules. Each rule identifies one spam-like feature of the e-mail and includes a score parameter (similar to a weight). Whenever an executed rule
NEW SCHEDULING FOR INCREASING THROUGHPUT ON RBS FILTERING SYSTEMS
matches a target message, its score is added to the e-mail total score counter. Finally, when all rules are executed, the e-mail is classified as spam if its total score is greater than or equal to a threshold initially established in the configuration.
This straightforward mode of operation became the core of SpamAssassin, one of the most popular open-source filtering frameworks in the spam industry. In this way, SpamAssassin is able to automatically classify new incoming e-mail messages through user-defined spam filters. A filter in SpamAssassin is defined by a set of scored rules and a global threshold (called required_score). Each rule is associated with an anti-spam filtering technique and is composed of two main parts: (i) a Boolean expression (used to trigger the rule) and (ii) a score. Following this simple design, an e-mail is classified as spam when the sum of the individual scores from triggered rules is greater than or equal to the value of the required_score parameter.
The ease of use and the effectiveness of the e-mail classification process allowed SpamAssassin to become a reference platform in the spam filtering domain. As a consequence, the inner operation of this framework was adopted by other commercial spam filtering products and services developed by international companies. Symantec Messaging Gateway [17] and McAfee Email Protection [18] are in fact successful customizations of the SpamAssassin framework.
Despite the revolution caused by SpamAssassin in the spam filtering domain, its low filtering throughput (i.e., number of messages classified per second), together with the indiscriminate increase of spam deliveries, forced companies to constantly improve their filtering infrastructure. This situation created the need to enhance the spam filter operation model initially introduced by SpamAssassin in order to significantly improve filtering throughput.
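The scored-rule model described above can be illustrated with a minimal Python sketch. It is not SpamAssassin code: the `Rule` class, the rule names, the conditions, and the scores are hypothetical and chosen only to show how triggered scores are summed and compared with required_score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str                        # hypothetical rule identifier
    matches: Callable[[str], bool]   # Boolean expression that triggers the rule
    score: float                     # value added to the total when the rule fires


def classify(message: str, rules: List[Rule], required_score: float) -> bool:
    """Return True (spam) when the summed scores of triggered rules reach the threshold."""
    total = sum(rule.score for rule in rules if rule.matches(message))
    return total >= required_score


# Illustrative filter: two spam-like features and one ham-like feature.
rules = [
    Rule("HAS_VIAGRA", lambda m: "viagra" in m.lower(), 3.5),
    Rule("ALL_CAPS", lambda m: m.isupper(), 2.0),
    Rule("KNOWN_SIGNATURE", lambda m: "regards, alice" in m.lower(), -4.0),
]

print(classify("BUY VIAGRA NOW", rules, required_score=5.0))
print(classify("Buy viagra. Regards, Alice", rules, required_score=5.0))
```

Negative scores model ham-like evidence, which is why the second message stays below the threshold even though one spam-like rule fires.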
In this connection, in 2012, we introduced Wirebrush4SPAM, a novel rule-based filtering platform designed specifically to cope with performance issues. The following subsections describe in detail the fundamental differences between the Wirebrush4SPAM and SpamAssassin operation processes, complemented by an explanation of the rule scheduler component built into the Wirebrush4SPAM platform.
2.1. Wirebrush4SPAM and SpamAssassin filtering process
Although the rule-based philosophy and functionalities of Wirebrush4SPAM are inspired by the SpamAssassin framework, its operation process has been drastically changed with the aim of solving the drawbacks found after an exhaustive analysis of the SpamAssassin architecture. Figure 1 shows a detailed comparison of all the stages comprising the operation of these platforms in order to both illustrate their major differences and reveal the main improvements included in Wirebrush4SPAM.
As we can observe from Figure 1, the operation of both filtering platforms is divided into two phases: (i) framework initialization and (ii) spam filtering life cycle. The first one is only executed when the filtering platform starts and is responsible for loading into memory all the rules comprising the anti-spam filter. Also, as Figure 1 shows, in this phase Wirebrush4SPAM provides two innovative concepts: (i) the computation of the pending_subtract and pending_add values required by the smart filter evaluation (SFE) strategy and (ii) a rule scheduler module able to elaborate an execution plan for rules using different rule schedulers, with the goal of increasing filtering performance while reducing the time needed to classify each e-mail. The great impact of this module on filtering throughput was previously demonstrated in [19] and is also used as the basis for this work. The second phase is automatically executed every time a new e-mail is received for classification.
As we can observe from Figure 1, the spam filtering life cycle of both platforms (Wirebrush4SPAM and SpamAssassin) is divided into five different stages: (i) parsing; (ii) rule execution; (iii) decision making; (iv) e-mail learning; and (v) report generation. In the second phase, the first three steps comprising the filtering process are similar (also in their execution order) in both filtering platforms. Moreover, as we can notice from Figure 1, the parsing system provides an interconnection point between each new incoming e-mail and the filtering platform. To carry out this task, the parsing system analyzes each e-mail extracting the information required for the successful execution of each filtering rule. In order to parse the e-mail contents, SpamAssassin takes advantage of the efficiency achieved by Perl regular expression functions while Wirebrush4SPAM implements its own parser
Figure 1. Comparison of Wirebrush4SPAM and SpamAssassin operation mode.
based on a specific finite-state machine. By discarding regular expressions for parsing tasks, the Wirebrush4SPAM parser has been closely adapted to processing e-mail contents, and consequently, the time required to analyze each e-mail has been significantly decreased.
In the second stage of the filtering process, the system executes all the filtering rules. During this stage, whenever a rule matches the e-mail content (i.e., the filtering technique associated with the rule is fulfilled), the score of the rule is added to the e-mail total score. With the aim of specifically improving the performance of this stage, Wirebrush4SPAM provides three core functionalities: (i) the use of caches to store intermediate results in order to avoid executing complex calculations more than once; (ii) the use of sufficient condition rules (SCR), which allow a definitive score (‘+’ or ‘-’) to be assigned to a rule in order to abort the execution of the filter; and (iii) the introduction of the SFE strategy, which prevents the execution of rules that are irrelevant to the e-mail classification process. Inspired by lazy Boolean expression evaluation, the SFE scheme includes an easy-to-evaluate condition to detect whether unevaluated rules could affect the final e-mail classification. If unevaluated rules cannot affect the final decision, filter evaluation can be safely aborted and the target message is classified according to the score achieved until that moment.
When all rules are executed (or the SFE condition is triggered in Wirebrush4SPAM), both frameworks automatically initiate the execution of the decision module. During this stage, the e-mail total score computed during the previous stage is used to finally classify the new incoming message. If the e-mail total score is greater than or equal to the required_score threshold, the decision support system automatically tags the e-mail as spam.
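The SFE stop condition described above can be sketched in Python as follows. The text names the pending_add and pending_subtract values but does not define them, so this sketch assumes that pending_add is the sum of the positive scores of the rules not yet executed and pending_subtract is the sum of the absolute values of their negative scores. The function name and the rule representation are hypothetical, and this is not the Wirebrush4SPAM implementation.

```python
from typing import Callable, List, Tuple

ScoredRule = Tuple[Callable[[str], bool], float]  # (condition, score)


def classify_with_sfe(message: str, rules: List[ScoredRule],
                      required_score: float) -> Tuple[bool, int]:
    """Return (is_spam, number_of_rules_executed) using an SFE-like early stop."""
    # Assumed definitions: score mass that unevaluated rules could still add/remove.
    pending_add = sum(score for _, score in rules if score > 0)
    pending_subtract = sum(-score for _, score in rules if score < 0)
    total = 0.0
    executed = 0
    for condition, score in rules:
        # Even if every remaining negative rule fired, the message stays spam.
        if total - pending_subtract >= required_score:
            return True, executed
        # Even if every remaining positive rule fired, the threshold is unreachable.
        if total + pending_add < required_score:
            return False, executed
        if condition(message):
            total += score
        executed += 1
        # The executed rule no longer belongs to the pending budgets.
        if score > 0:
            pending_add -= score
        else:
            pending_subtract += score  # score <= 0, so the budget shrinks
    return total >= required_score, executed


always = lambda m: True
never = lambda m: False
rules = [(always, 4.0), (always, 3.0), (never, 1.0), (always, -1.0)]
print(classify_with_sfe("any message", rules, required_score=5.0))
```

In this example the decision is already fixed after two rules, so the remaining two are never executed. This is why the position of expensive rules in the execution plan matters for throughput.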
With the aim of increasing filtering throughput even more, Wirebrush4SPAM incorporates into this stage the learn after report strategy. The use of this functionality alters the execution order of the
latest stages involved in the filtering process (i.e., report generation and learning). In contrast with the operation process of SpamAssassin, the Wirebrush4SPAM learn after report strategy forces the execution of all the auto-learning tasks in separate threads after sending the classification response to the mail transfer agent. This entails a significant increase in the responsiveness of the rule engine without compromising the classification speed.
2.2. Wirebrush4SPAM rule scheduler module
As we stated in the previous subsection, current research showed that the inclusion of the rule scheduler module in Wirebrush4SPAM represents a breakthrough in the reduction of the e-mail classification time [19]. Basically, this module is able to strategically sort rules according to the results obtained through the application of certain rule arrangement heuristics. To accomplish this task, the operation process is divided into three sequential and complementary steps (Figure 2): (i) rule loader and dependence identifier (RLDI); (ii) special rules presorting (SRP); and (iii) remaining rules scheduler (RRS).
As shown in Figure 2, during the RLDI stage, all rules comprising the filter are loaded into memory, while their corresponding type is determined (i.e., META, SCR or standard rules). Finally, the framework computes all execution dependencies associated with each META rule. Subsequently, the second stage (SRP) ensures the appropriate distribution of the SCR and META rules, taking into account the restrictions computed during the previous stage. To make the SRP stage easier to understand, we divided it into two sequential steps: (i) sorting of SCR and META rules and (ii) dependence resolution. During the first step, the SCR rules are placed at the beginning of the rule execution plan. Moreover, all the META rules are positioned just after the SCR rules and sorted by score in descending order.
During the second step, the module locates the dependent rules for each META and places them just before the call to the
Figure 2. Inner working details of Wirebrush4SPAM rule scheduler.
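As a rough companion to the SRP stage summarized in Figure 2, the following Python sketch places SCR rules first, then META rules in descending score order, with the rules each META depends on inserted just before it. The `Rule` class and its fields are hypothetical. The sketch also assumes that META dependencies are standard (non-META) rules, which the text does not state.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Rule:
    name: str
    kind: str                                            # "SCR", "META" or "STANDARD"
    score: float = 0.0
    depends_on: List[str] = field(default_factory=list)  # used by META rules only


def presort_special_rules(rules: List[Rule]) -> Tuple[List[str], List[str]]:
    """Return (plan, remaining): the presorted special rules and the rules left for RRS."""
    # Step 1: SCR rules open the plan; META rules follow, sorted by descending score.
    plan = [r.name for r in rules if r.kind == "SCR"]
    metas = sorted((r for r in rules if r.kind == "META"),
                   key=lambda r: r.score, reverse=True)
    # Step 2: each dependency is placed just before the META rule that needs it.
    for meta in metas:
        for dep in meta.depends_on:
            if dep not in plan:
                plan.append(dep)
        if meta.name not in plan:
            plan.append(meta.name)
    # Whatever is not in the plan yet is handed over to the RRS stage.
    remaining = [r.name for r in rules if r.name not in plan]
    return plan, remaining


rules = [
    Rule("R1", "STANDARD", 1.0),
    Rule("M1", "META", 2.0, ["R1", "R2"]),
    Rule("S1", "SCR", 5.0),
    Rule("R2", "STANDARD", 0.5),
    Rule("M2", "META", 3.0, ["R2"]),
    Rule("R3", "STANDARD", 1.5),
]
print(presort_special_rules(rules))
```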
META test. By following this approach, we can ensure that dependent rules are executed before their corresponding META test is called. Finally, the third stage (RRS) is in charge of organizing the remaining (standard) rules and should be designed to significantly improve filtering throughput through the execution of rule arrangement heuristics. For this stage, we previously proposed in [19] five different heuristics to elaborate an optimized plan for executing the rest of the rules. Table I shows a short description of each heuristic.
The scheduling heuristics shown in Table I can be divided into three categories: (i) based on the value of the score associated with each filtering rule; (ii) designed to maximize the separation between rules belonging to the same filtering technique; and (iii) aimed at getting the most out of the SFE stop condition. Although all the heuristics allow Wirebrush4SPAM to increase its filtering throughput (by decreasing the classification time), experimental results demonstrated that the greater distance value (GDV) algorithm is the best approach, allowing a reduction of up to 50% in the average time needed to classify an e-mail in comparison with the absence of a rule scheduler [19].
Even though previous results achieved a great improvement in filtering performance, we believe that the design of novel rule scheduler strategies (heuristics) represents the best way to improve filtering throughput even further.
3. NEW SCHEDULING APPROACHES FOR IMPROVING RULE-BASED SYSTEMS PERFORMANCE
As can be concluded, the rule scheduler component has been successfully identified as a new research alternative for increasing filtering performance by modifying the rule arrangement according to some specific heuristic.
Under this scenario, and with the aim of improving the functionalities and results previously achieved, we introduce in this work the development and implementation of three new rule arrangement heuristics: (i) resource balancing maximization (RBM); (ii) cost-effectiveness maximization (CEM); and (iii) multi-heuristic ensemble (MHE). The RBM and CEM heuristics take advantage of knowledge about the CPU and/or I/O resources required for the execution of each test included in rules. Thus, the scheduler can achieve an adequate execution balance by combining rules that execute I/O operations with those that use CPU resources intensively (to carry out complex computations). Complementarily, the MHE technique combines the main heuristic with an auxiliary one to sort rules when the main heuristic cannot order them (i.e., when two or more rules obtain the same evaluation).
The RBM heuristic builds an execution plan by sequentially adding each remaining rule to the plan. In each step, RBM adds the rule that achieves the best possible balance between CPU consumption and I/O delays. Figure 3 introduces the pseudo-code of the RBM heuristic assuming that we compute the I/O delay and CPU consumption of each rule
Table I. Available scheduling heuristics in Wirebrush4SPAM framework.
Acronym  Name                          Description
PFS      Positive first scheduling     Carries out a descending sort using the rule score as criterion.
NFS      Negative first scheduling     Performs an ascending sort according to the value of each rule score.
GAV      Greater absolute value        Performs a descending sort using the absolute value of the rule score as criterion.
GDV      Greater distance value        Performs a descending sort using as criterion the distance between each rule score and the required_score parameter.
PSS      Plug-in separation scheduling Arranges the rules in order to maximize the separation among rules belonging to the same filtering technique.
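The four score-based orderings of Table I can be expressed as sort keys, as in the Python sketch below. The `schedule` function and the (name, score) tuples are hypothetical, and the GDV distance is assumed to be the absolute difference between the rule score and required_score. PSS is left out because the table does not detail how the separation is computed.

```python
from typing import List, Tuple

ScoredRule = Tuple[str, float]  # (rule name, score)


def schedule(rules: List[ScoredRule], heuristic: str,
             required_score: float = 5.0) -> List[str]:
    """Order rule names according to one of the score-based heuristics of Table I."""
    keys = {
        "PFS": lambda r: -r[1],                        # descending score
        "NFS": lambda r: r[1],                         # ascending score
        "GAV": lambda r: -abs(r[1]),                   # descending absolute score
        "GDV": lambda r: -abs(r[1] - required_score),  # descending distance (assumed form)
    }
    return [name for name, _ in sorted(rules, key=keys[heuristic])]


rules = [("A", 2.0), ("B", -3.0), ("C", 0.5), ("D", 6.0)]
for heuristic in ("PFS", "NFS", "GAV", "GDV"):
    print(heuristic, schedule(rules, heuristic))
```

Because Python's sort is stable, rules with equal keys keep their definition order in this sketch.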
Figure 3. Pseudo-code algorithm of RBM heuristic.
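A minimal Python sketch of the RBM idea described in this section is given here. It is not the authors' pseudo-code: the dictionary of (cpu, io) estimates is a hypothetical stand-in for get_cpu and get_io, the first rule is assumed to be the one with the smallest cpu + io sum, and the balance is taken as the absolute value of the CPU/I-O difference so that "minimizing" means "closest to balanced" (the absolute value is an assumption of this sketch).

```python
from typing import Dict, List, Tuple


def rbm_schedule(rules: Dict[str, Tuple[float, float]]) -> List[str]:
    """rules maps a rule name to its (cpu, io) estimates; returns the execution plan."""
    pending = dict(rules)
    # First rule: the cheapest one overall (assumed here to mean smallest cpu + io).
    first = min(pending, key=lambda name: sum(pending[name]))
    cpu_counter, io_counter = pending.pop(first)
    plan = [first]
    while pending:
        def crb(name: str) -> float:
            # Balance reached if this rule were scheduled next (absolute value assumed).
            cpu, io = pending[name]
            return abs((cpu_counter + cpu) - (io_counter + io))

        best = min(pending, key=crb)
        cpu, io = pending.pop(best)
        cpu_counter += cpu   # accumulate the resources of the scheduled rule
        io_counter += io
        plan.append(best)
    return plan


# Hypothetical estimates, not the values of Figure 4.
example = {"R1": (4, 1), "R2": (1, 6), "R3": (1, 1)}
print(rbm_schedule(example))
```

With these illustrative numbers, R3 is scheduled first because it is the cheapest rule, and R1 follows because it leaves the accumulated CPU and I/O estimates closer to each other than R2 would.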
through the functions get_io(INPUT: ) and get_cpu(INPUT: ). These values can be easily estimated using the gettimeofday C function. As we can observe from Figure 3, the algorithm is divided into two different functions. The RBM function is responsible for reallocating rules while recalculating the CPU and I/O values (stored in the rule_time data type) every time a rule is scheduled. The RBM heuristic selects the rule with the lowest I/O delays and CPU requirements as the first scheduled rule. Keeping in mind the SFE strategy implemented in Wirebrush4SPAM, this decision places the rules with the highest computational requirements at the end of the schedule, maximizing the time savings when the SFE condition is triggered. To complete the rest of the execution plan, the decision about which rule to select over another is taken by the GET_BEST_BALANCE_RULE function (line 21). This function searches the unscheduled rules for the one that minimizes the computational resource balance (CRB) metric defined by Equation (1).
\[ \mathrm{CRB} = (\mathit{rt.cpu\_counter} + \mathit{get\_cpu}(x)) - (\mathit{rt.io\_counter} + \mathit{get\_io}(x)) \tag{1} \]
where rt.cpu_counter and rt.io_counter represent the accumulated estimation of CPU resources and I/O delays, respectively, and get_cpu(x) and get_io(x) obtain the estimations of CPU and I/O for the execution of the rule x. In order to clarify the operation process of the RBM heuristic, Figure 4 includes a brief example using a filter composed of three rules. As we can observe, each filtering rule is associated with two values: (i) CPU_consumption, which stands for the CPU resources needed by the filtering technique associated with the rule, and (ii) IO_time, which denotes the average delay required to carry out I/O operations. Moreover, as shown in Figure 4, we have divided the process into two steps: (i) less CPU and IO consumption rule and (ii) best fitting CPU and IO consumption rule. During the first step, the rule having the lowest values for I/O and CPU is selected as the first rule for execution. In the proposed example, R3 is placed first because it uses fewer resources than any other rule. In the second step, the algorithm computes the rule-fitting measure
Figure 4. Example of the sorting process carried out by the RBM heuristic.
of each rule using Equation (1). The rule that attains the lowest value is selected, and its CPU and I/O values are added to the rule_time data type. This step is repeated until N - 1 rules are sorted (N being the total number of available rules), because the remaining rule must be executed last regardless of its CRB evaluation. In the first iteration, the rule selected is R1 because it achieves the lower value when applying Equation (1) (R1 = 1 < R2 = 3).
Although RBM is able to take advantage of the CPU consumption and I/O delays of rules to create an execution plan, we also designed the CEM heuristic, which combines computer resource consumption (CPU and I/O) with score information. The CEM heuristic uses the score as the main guide to define the order of rules in the execution plan, trying to delay the execution of the most computationally expensive ones. Figure 5 introduces the pseudo-code of the CEM approach. As shown in lines 06 to 14, the algorithm assesses the cost-effectiveness of each rule using the COMPUTE_CE function (lines 22 and 23). Equation (2) shows how the cost-effectiveness of each rule is computed.
\[ \mathit{cost\_effectiveness}(x) = \left| \frac{\mathit{get\_score}(x)}{(\mathit{get\_cpu}(x) + \mathit{get\_io}(x))/2} \right| \tag{2} \]
where get_score(x), get_cpu(x) and get_io(x) represent the score and the estimated CPU and I/O time required to evaluate the rule x, respectively. The CEM heuristic applies a sorting algorithm (e.g., quicksort) to arrange rules in descending order using the cost-effectiveness approximation as the criterion. In the pseudo-code shown in Figure 5, we provide the implementation of a comparison function (CE_COMP) to facilitate the translation to programming languages such as C (qsort function). In order to clarify the main concepts of the CEM heuristic, Figure 6 presents an example of its operation. As we can observe, the rule arrangement process is divided into two different steps. During the first step, we compute the cost-effectiveness value of each rule by applying Equation (2) (defined inside CE_FUNC). During the last step, we use the cost-effectiveness estimations as the criterion to sort rules in descending order. As we can observe in Figure 6, the final rule arrangement
Figure 5. Pseudo-code algorithm of CEM heuristic.
Figure 6. Example of the sorting process carried out by the CEM heuristic.
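The CEM ordering can be sketched in a few lines of Python by applying Equation (2) as a sort key. The tuple layout (name, score, cpu, io) and the function names are hypothetical, the example values are illustrative rather than those of Figure 6, and the sketch assumes that cpu + io is strictly positive for every rule.

```python
from typing import List, Tuple

CostedRule = Tuple[str, float, float, float]  # (name, score, cpu, io)


def cost_effectiveness(score: float, cpu: float, io: float) -> float:
    """Equation (2): absolute score divided by the mean of the CPU and I/O estimates."""
    return abs(score / ((cpu + io) / 2))


def cem_schedule(rules: List[CostedRule]) -> List[str]:
    """Sort rules by descending cost-effectiveness and return their names."""
    ordered = sorted(rules, key=lambda r: cost_effectiveness(r[1], r[2], r[3]),
                     reverse=True)
    return [name for name, _, _, _ in ordered]


# Hypothetical values: R2 is cheap and heavily scored, R3 is expensive and weakly scored.
example = [("R1", 2.0, 1.0, 3.0), ("R2", -6.0, 1.0, 1.0), ("R3", 1.0, 2.0, 6.0)]
print(cem_schedule(example))
```

Rules with a large absolute score and a low resource cost move to the front of the plan, which gives the SFE condition a chance to trigger before the expensive rules are reached.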
follows the R2-R1-R3 distribution because it establishes a descending order according to the overload value of each rule.
As a final contribution of this work, we introduce the MHE heuristic, a simple method able to combine the knowledge provided by the two previous scheduling heuristics. MHE uses a master heuristic as the main criterion to sort rules. However, when two rules obtain the same evaluation, MHE makes use of an auxiliary heuristic that provides a secondary criterion. Figure 7 shows the pseudo-code algorithm of the MHE meta-heuristic. As we can observe from Figure 7, the MHE function is responsible for sorting all the rules composing the filter according to the result obtained by applying the FIRST_COMPARATOR function to each rule (line 13). This function facilitates the execution of any available scheduler (i.e., RBM or CEM) through the SCHEDULER function parameter (lines 46 to 48). As shown in lines 07 to 32, for each inner loop iteration, the algorithm identifies the best rule (also called the baseline rule) through the application of the defined scheduler. When a single baseline rule is identified, it is automatically arranged in the proper place. Otherwise, if several rules have the
Figure 7. Pseudo-code algorithm of MHE heuristic.
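The tie-breaking behavior of MHE can be illustrated with a compact Python sketch. Instead of the explicit selection loop with a list of candidate baseline rules described in the text, this sketch uses a composite sort key (master value first, auxiliary value second), which yields the same ordering when both heuristics assign each rule a priority that does not depend on the rules already scheduled. The dictionary layout and the choice of absolute score as master and cost-effectiveness as auxiliary are illustrative assumptions.

```python
from typing import Callable, Dict, List

RuleInfo = Dict[str, float]  # hypothetical layout: {"score": ..., "cpu": ..., "io": ...}


def mhe_schedule(rules: Dict[str, RuleInfo],
                 scheduler: Callable[[RuleInfo], float],
                 sub_scheduler: Callable[[RuleInfo], float]) -> List[str]:
    """Sort by the master heuristic; ties are resolved by the auxiliary heuristic.

    Both callables return a priority, and higher priorities are executed earlier.
    """
    return sorted(rules,
                  key=lambda name: (scheduler(rules[name]), sub_scheduler(rules[name])),
                  reverse=True)


def abs_score(rule: RuleInfo) -> float:
    """Illustrative master heuristic: absolute score (GAV-like)."""
    return abs(rule["score"])


def cost_effectiveness(rule: RuleInfo) -> float:
    """Illustrative auxiliary heuristic: Equation (2)."""
    return abs(rule["score"] / ((rule["cpu"] + rule["io"]) / 2))


example = {
    "A": {"score": 2.0, "cpu": 1.0, "io": 1.0},
    "B": {"score": -2.0, "cpu": 2.0, "io": 2.0},
    "C": {"score": 3.0, "cpu": 4.0, "io": 4.0},
}
print(mhe_schedule(example, abs_score, cost_effectiveness))
```

Rules A and B tie on the master heuristic (absolute score 2), so the auxiliary heuristic decides and places the cheaper rule A before B.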
same priority, they are identified as candidate baseline rules. In this case, all candidate baseline rules are automatically stored in a new variable (LIST data type). Once the inner loop ends, the heuristic checks the size of the LIST data type (line 33) in order to determine the number of candidate baseline rules. If there is more than one candidate baseline rule, a second scheduler is executed over these rules in order to resolve the tie between them and therefore select the best rule according to the value obtained by applying the SUB_SCHEDULER function included in the SECOND_COMPARATOR method (line 35). The SUB_SCHEDULER function pointer defines the heuristic used to select the most adequate rule among the list of available rules. Once the definitive baseline rule is identified, the MHE algorithm sets the rule at the correct position inside the array of rules. Neither the SCHEDULER nor the SUB_SCHEDULER function (lines 13 and 35) is defined in the pseudo-code of Figure 7 because they depend on the sorting algorithm used. Finally, we should mention that the MHE heuristic is compatible with the schedulers proposed in this work (RBM and CEM) together with those included in [19].
4. RESULTS AND DISCUSSION
In order to demonstrate the suitability of our novel heuristics when compared with existing approaches, we designed and executed a straightforward and reproducible benchmarking protocol. This section introduces the experimental details and discusses the obtained results in accordance with the following subsections. Section 4.1 includes an in-depth explanation of the experimental
setup containing a brief description of the existing corpus, hardware and software details, and filter configuration issues. Section 4.2 provides a complete analysis of the results achieved by our novel heuristics, compared with the outcomes previously obtained in [19]. Finally, Section 4.3 highlights the experience gained while carrying out this work.
4.1. Experimental setup
According to the work of Pérez-Díaz et al. [20], both the experimental dataset and the training/test methodology should be carefully selected in order to obtain realistic classification performance measurements in the spam filtering domain. However, because of the nature of the current work, the experimental benchmark should be specifically designed to effectively compare the filtering throughput achieved by each heuristic (instead of their classification performance). Taking this fact into account, the use of a specific dataset or a particular evaluation methodology (e.g., k-fold cross-validation) does not provide special evaluation benefits because the filtering rate is independent of the analyzed issues. In this context, we built our own dataset composed of 200 e-mails in different languages (i.e., Spanish, English, and Portuguese), equally divided into two categories (i.e., spam and ham). Following the same line of argument, we applied a simple evaluation process (train and test) using 10% of the available messages for training purposes. Table II shows the final distribution of the selected corpus.
To compare the suitability and performance obtained by each novel heuristic proposed in this work, we designed a spam filter consisting of (i) nine naive Bayes rules; (ii) 169 regular expressions; (iii) seven network tests (i.e., two belonging to RWL/RBL and five associated with the SPF technique); and (iv) a global required_score value of 5.
Additionally, we decided to avoid the usage of SCR because the proposed heuristics only address the arrangement of the standard rules (i.e., carried out in the RRS stage). It should also be noted that the time required to execute any rule performing network tests usually experiences significant variations across different executions due to its dependency on the network status (e.g., latency, congestion, and overload). With the aim of minimizing this circumstance and achieving more realistic results, we processed each e-mail included in the experimental dataset 10 times, simulating the execution of the experiments on a corpus of 1800 messages. All the experiments were executed using the latest version of Wirebrush4SPAM on an Intel 2.7 GHz Core i7 2-Duo CPU with 6 GB of RAM running Ubuntu 12.10 64 bits GNU/Linux OS.

Although in real deployments of anti-spam filters the amount of positive scores is usually greater than that of negative ones, in our experiments we tested two complementary filters having (i) positive and (ii) negative trends. A filter presents a positive trend when the global amount of positive scores is greater than the sum of negative scores. Otherwise, the filter trend is considered negative. The positive filter used in our experiments assigns to e-mails scores in the range [−89.36, 162.98], while the scores associated with the negative trend filter are defined in the interval [−162.98, 89.36] (i.e., the filters are completely symmetric).

To demonstrate the suitability of the RBM heuristic in a real environment (where network I/O elapsed time is highly unstable), we executed the scheduler in both simulated and real environments. In addition, we assigned the overload parameters (I/O and CPU) associated with each filtering technique by calculating the mean of the I/O and CPU values obtained through the execution of Wirebrush4SPAM over the previously described corpus.

Table II. Characteristics of the corpus generated for heuristic validation.

              Train                  Test                   Σ
Corpus ratio  1/10 × total_corpus    9/10 × total_corpus
Ham           10                     90                     100
Spam          10                     90                     100
Σ             20                     180                    200
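The trend definition used for the positive and negative filters can be illustrated with a minimal sketch. It assumes that the "global amount of positive scores" means the sum of all positive rule scores, and that the symmetric filter is obtained by flipping the sign of every score; both are interpretations rather than details confirmed by the text. The function names and the example scores are hypothetical.

```python
from typing import List, Tuple


def score_range(rule_scores: List[float]) -> Tuple[float, float]:
    """Return the (lowest, highest) total score a filter could assign."""
    lowest = sum(s for s in rule_scores if s < 0)   # only negative rules fire
    highest = sum(s for s in rule_scores if s > 0)  # only positive rules fire
    return lowest, highest


def filter_trend(rule_scores: List[float]) -> str:
    """Label a filter 'positive' or 'negative' (assumed reading of the definition)."""
    lowest, highest = score_range(rule_scores)
    return "positive" if highest > abs(lowest) else "negative"


def mirror(rule_scores: List[float]) -> List[float]:
    """Build a symmetric filter by flipping the sign of every rule score."""
    return [-s for s in rule_scores]


# Hypothetical rule scores, not taken from the filter used in the experiments.
example_scores = [2.5, 1.0, 3.0, -1.5, -0.5]

print(score_range(example_scores), filter_trend(example_scores))
print(score_range(mirror(example_scores)), filter_trend(mirror(example_scores)))
```

Under these assumptions, mirroring a positive trend filter swaps and negates the two ends of its score interval, which is the kind of symmetry reported for the two experimental filters.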
In the simulated environment, the Wirebrush4SPAM filtering platform is executed assuming fixed I/O values (by using the C sleep() function). However, in the real environment experiments, we executed the default functions included in Wirebrush4SPAM (performing genuine I/O operations). In addition, with the aim of showing the importance of minimizing the I/O and CPU balance rate, we compared the filter distribution generated by our RBM heuristic against 10 randomly generated filters built from a subset of the positive trend filter. Table III presents the execution time (in milliseconds) together with the CRB for each filter in both environments. To facilitate the interpretation of the obtained results, we arranged the filters in ascending order of execution time. The CRB metric is computed as the summation of the individual CRBs of each filter rule (by using Equation (1)) according to its definition order inside the filter. Therefore, the filter CRB depends solely on the order of the rules inside the filter.

A quick look at the time results of the simulated environment reveals that filter balance and filter execution time are directly related. Accordingly, low filter balance values imply shorter filter execution times and, complementarily, high filter balance values entail longer filter execution times. In the real scenario, CRB and filter execution time largely follow the same relationship as in the simulated environment. However, the execution time measurements for some filters (random_8 - random_9 and random_5 - random_7) showed a slightly different behavior. This occurs because the small CRB difference between these filters cannot offset the high variability of the response time associated with network tests. Nevertheless, as we can observe from Table III, the filters do not present significant variations.

4.2. Comparative analysis

With the aim of correctly evaluating the real impact of each heuristic when applied in both scenarios (positive and negative trends), we implemented and deployed these alternatives in our Wirebrush4SPAM filtering platform. Additionally, in order to compare the real improvements achieved by our novel heuristics, we also included in our experimental protocol the successful heuristics GDV and plug-in separation scheduling (PSS), previously developed by Ruano-Ordás et al. [19]. In the former study, GDV achieved the best throughput results in the group of SFE optimizations, while PSS minimized the effects of locks produced by the usage of shared resources. Although the PSS experimental results included in [19] were poor (mainly due to the absence of rules implementing hard I/O operations), we believe that filters containing a large number of rules belonging to a wide variety of anti-spam filtering techniques (especially network-based ones) could be executed faster using this heuristic. Table IV shows the experimental results achieved in the context defined by the first scenario (positive trend filter) including the following measurements: (i) the average (X̄) and percentage (%) values of the unenforced rules due to the SFE technique; (ii) the total amount (ΣActivations)

Table III. Comparison of RBM performance in a simulated and a real environment.

Simulated environment                         Real environment
Filter name (CRB)        Execution time (ms)  Filter name (CRB)        Execution time (ms)
rbm_filter.cf (7.838)    2.814                rbm_filter.cf (7.838)    160.135
random_0 (8.163)         2.858                random_0 (8.163)         160.744
random_6 (10.068)        2.874                random_6 (10.068)        161.339
random_9 (12.015)        2.880                random_8 (12.307)        174.651
random_8 (12.307)        2.890                random_9 (12.015)        175.818
random_1 (12.910)        2.903                random_1 (12.910)        175.757
random_2 (16.382)        2.976                random_2 (16.382)        176.057
random_7 (17.845)        2.987                random_5 (18.185)        176.276
random_5 (18.185)        3.083                random_7 (17.845)        176.357
random_3 (22.345)        3.113                random_3 (22.345)        183.820
random_4 (23.135)        3.137                random_4 (23.135)        184.622

RBM, resource balancing maximization; CRB, computational resource balance.
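One way to quantify how closely execution time follows CRB in Table III is a rank correlation. The sketch below computes Spearman's coefficient over the values listed in the table; this statistic is not part of the original evaluation and is used here only as an illustration. The helper names are hypothetical, and the simple rank formula assumes there are no tied values (which holds for these data).

```python
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(pairs: Sequence[Tuple[float, float]]) -> float:
    """Spearman rank correlation for (crb, time_ms) pairs without ties."""
    n = len(pairs)
    crb_ranks = ranks([crb for crb, _ in pairs])
    time_ranks = ranks([time_ms for _, time_ms in pairs])
    d2 = sum((a - b) ** 2 for a, b in zip(crb_ranks, time_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


# (CRB, execution time in ms) pairs copied from Table III.
simulated = [(7.838, 2.814), (8.163, 2.858), (10.068, 2.874), (12.015, 2.880),
             (12.307, 2.890), (12.910, 2.903), (16.382, 2.976), (17.845, 2.987),
             (18.185, 3.083), (22.345, 3.113), (23.135, 3.137)]
real = [(7.838, 160.135), (8.163, 160.744), (10.068, 161.339), (12.307, 174.651),
        (12.015, 175.818), (12.910, 175.757), (16.382, 176.057), (18.185, 176.276),
        (17.845, 176.357), (22.345, 183.820), (23.135, 184.622)]

rho_simulated = spearman(simulated)
rho_real = spearman(real)
print(f"simulated: {rho_simulated:.3f}, real: {rho_real:.3f}")
```

A coefficient close to 1 means that ordering the filters by CRB reproduces their ordering by execution time; swaps between filters with nearby CRB values lower it slightly.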
Table IV. Evaluation of heuristics using a positive trend filter.

                                        Simple heuristics                       Complex heuristic
Measure                                 RBM       CEM       GDV       PSS       MHE(GDV, RBM)
Spam
  Unexecuted rules, X̄                   4.540     8.360     23.100    19.660    23.100
  Unexecuted rules, %                   2.494     4.593     12.692    10.802    11.934
  SFE, ΣActivations                     52        50        60        32        60
  SFE, %                                26        25        30        16        30
  X̄ ClassificationTime (ms)             17.374    18.989    13.580    18.076    13.189
Ham
  Unexecuted rules, X̄                   3.880     4.360     4.900     4.440     6.580
  Unexecuted rules, %                   2.131     2.395     2.6927    2.439     2.6927
  SFE, ΣActivations                     40        48        40        32        40
  SFE, %                                20        24        20        16        20
  X̄ ClassificationTime (ms)             13.893    14.453    13.021    13.997    12.26
Spam + ham
  Unexecuted rules, X̄                   4.210     6.360     14.000    12.050    14.150
  Unexecuted rules, %                   2.313     3.494     7.692     6.629     7.692
  SFE, ΣActivations                     92        98        100       84        100
  SFE, %                                46        49        50        42        50
  X̄ ClassificationTime (ms)             15.195    15.228    14.968    17.617    13.684
Computational Resource Balance (CRB)    9.720     481.972   255.280   477.552   248.8283

RBM, resource balancing maximization; CEM, cost-effectiveness maximization; GDV, greater distance value; PSS, plug-in separation scheduling; MHE, multi-heuristic ensemble; SFE, smart filter evaluation; CRB, computational resource balance.
and percentage (%) of times that SFE is executed; (iii) the average time (in milliseconds) required to classify each message (X̄ ClassificationTime); and (iv) the CRB measure for each heuristic.

As can be seen from Table IV, the GDV heuristic provides the best evaluation results when compared with the other simple heuristics analyzed. These results are also backed by the high volume of SFE activations together with the high number of unexecuted rules (representing 12.692% and 2.692% of the total number of rules for spam and ham, respectively) achieved by the heuristic. Moreover, although the CEM heuristic obtains more SFE activations than RBM overall, it yields worse performance values due to the ability of RBM to optimize the use of computational resources. Finally, when classifying spam and ham messages together, the RBM, CEM, and GDV heuristics achieved a similar filtering performance. In summary, and taking into account the overall values of the simple heuristics, we can conclude that GDV attains the best filtering performance, while PSS achieves the worst throughput values.

Taking these results into account, we decided to use the MHE ensemble method to combine the GDV and RBM heuristics. This combination seems the most suitable to maximize SFE activations while balancing (and therefore minimizing) the consumption of computational resources. As we can observe from Table IV, this configuration achieves the same number of SFE activations as GDV and similar values of unenforced rules. However, the use of RBM as the second scheduler allows balancing the computational resource consumption between rules, effectively increasing the filtering performance.

Additionally, with the goal of obtaining a complementary perspective on the proposed heuristics, Table V shows the performance results achieved when Wirebrush4SPAM is configured with a negative trend filter definition. In this case, when classifying any kind of messages (i.e., spam, ham, or both), PSS achieves the best filtering throughput against the other simple heuristics.
However, it obtains the lowest values of SFE activations and unexecuted rules. This fact demonstrates that, under certain circumstances, classification throughput can be enhanced by maximizing the separation of rules belonging to the same filtering technique. Taking into account the results achieved by PSS and GDV, and following the procedure carried out in the previous scenario, one could expect that using MHE to combine the PSS and GDV heuristics would provide the best filtering throughput. However, we were not able to achieve better performance than with the simple PSS heuristic. The reason for this underperformance lies in the behavior of the MHE heuristic. As commented in the previous subsection, MHE executes the second sub-scheduler (GDV in this case) only when more than one rule obtains the
Table V. Evaluation of heuristics using a negative trend filter.

                                        Simple heuristics                       Complex heuristic
Measure                                 RBM       CEM       GDV       PSS       MHE(GDV, RBM)
Spam
  Unexecuted rules, X̄                   6.220     5.340     16.360    4.040     16.360
  Unexecuted rules, %                   3.417     2.934     8.989     2.2198    9.186
  SFE, ΣActivations                     56        50        54        30        54
  SFE, %                                28        25        27        15        27
  X̄ ClassificationTime (ms)             19.603    20.114    19.230    17.524    16.880
Ham
  Unexecuted rules, X̄                   5.480     2.780     4.580     4.060     4.580
  Unexecuted rules, %                   3.011     1.527     2.516     2.230     2.5495
  SFE, ΣActivations                     44        46        50        36        50
  SFE, %                                22        23        25        18        25
  X̄ ClassificationTime (ms)             17.237    16.103    17.440    17.024    14.910
Spam + ham
  Unexecuted rules, X̄                   5.850     4.060     10.407    4.050     10.407
  Unexecuted rules, %                   3.214     2.2308    5.752     2.225     5.868
  SFE, ΣActivations                     100       96        104       66        104
  SFE, %                                50        53        52        33        52
  X̄ ClassificationTime (ms)             18.564    18.700    18.127    17.440    15.026
Computational Resource Balance (CRB)    9.720     481.972   255.280   477.552   248.8283

RBM, resource balancing maximization; CEM, cost-effectiveness maximization; GDV, greater distance value; PSS, plug-in separation scheduling; MHE, multi-heuristic ensemble; SFE, smart filter evaluation; CRB, computational resource balance.
same evaluation values (resolution of tied rule ranks) after the application of the first scheduler (PSS). Nevertheless, using PSS as the first scheduler does not take advantage of the possibilities offered by the MHE scheduler because PSS rarely produces tied rules. In this context, Equation (3) outlines the condition required to achieve tied rules using the PSS heuristic.

$$\exists\, p \in \{plugins\} \;\mid\; number\_of\_rules(p) - 1 > total\_filter\_rules() / 2 \tag{3}$$
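The condition in Equation (3) can be checked mechanically for a given filter. The following is a minimal sketch, assuming each rule is tagged with the name of the plug-in (filtering technique) that implements it; the `- 1` offset follows one reading of Equation (3) and is an assumption, as are the function name and the example plug-in labels.

```python
from collections import Counter
from typing import List


def tie_condition_holds(rule_plugins: List[str]) -> bool:
    """Return True if some plug-in satisfies the inequality of Equation (3).

    rule_plugins holds one entry per rule in the filter, naming the plug-in
    that implements the rule (hypothetical representation).
    """
    total_filter_rules = len(rule_plugins)
    rules_per_plugin = Counter(rule_plugins)
    return any(
        number_of_rules - 1 > total_filter_rules / 2
        for number_of_rules in rules_per_plugin.values()
    )


# Hypothetical filters with ten rules each.
balanced = ["bayes"] * 4 + ["regex"] * 4 + ["network"] * 2
dominated = ["bayes"] * 1 + ["regex"] * 8 + ["network"] * 1

print(tie_condition_holds(balanced))   # no plug-in holds more than half the rules
print(tie_condition_holds(dominated))  # one plug-in dominates the filter
```

In the balanced example no technique accounts for more than half of the rules, so the inequality fails for every plug-in; in the dominated example a single technique does, so the inequality holds.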
As we can observe from Equation (3), the draw condition occurs when the number of rules belonging to a particular filtering technique (implemented as a plug-in in Wirebrush4SPAM) exceeds half the total number of rules in the filter. Given this fact, as it is difficult to find tied rules when using the PSS heuristic, it should not be selected as the main sorting criterion with MHE. As a consequence of the poor results achieved by the MHE heuristic using PSS and GDV as sorting criteria, we executed several experiments using MHE with different combinations of the available heuristics. In order to summarize the results, we include in Table V the evaluation of the MHE configuration that achieved the best filtering results (GDV and RBM). As we can observe from Table V, by using the combination of the GDV and RBM heuristics, we can successfully boost filtering performance.

Analyzing the experimental results from a global perspective, we can conclude that using GDV (able to make the most of SFE) and RBM (able to adequately balance the usage of computational resources) combined by MHE allows us to build a rule scheduling plan that significantly reduces the computational resources and time required to execute a filter regardless of its trend (positive or negative).

4.3. Learned lessons

To support the usage of our novel scheduling heuristics, we have designed and developed a specific rule scheduler module for our Wirebrush4SPAM framework (left part of Figure 8). Although the sorting of rules could be carried out using complementary tools before executing the filtering service, we decided to apply the scheduling heuristics during the initialization of the framework to facilitate system administration tasks. Moreover, the implemented heuristics were developed as plug-ins to ease the research and development of new scheduling strategies to boost filter throughput.
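To make the plug-in idea concrete, a scheduling heuristic can be pictured as a function that assigns each rule a sort key, and an MHE-style ensemble as a sort that consults a second heuristic only to order rules tied under the first one. The sketch below is an illustration in Python rather than the framework's C code; the rule fields and the two key functions are hypothetical stand-ins and do not reproduce the actual GDV or RBM formulas.

```python
from typing import Callable, Dict, List

Rule = Dict[str, object]
Heuristic = Callable[[Rule], float]  # smaller key means the rule runs earlier


def simple_sort(rules: List[Rule], heuristic: Heuristic) -> List[Rule]:
    """Order rules with a single heuristic (a 'simple' scheduler)."""
    return sorted(rules, key=heuristic)


def mhe_sort(rules: List[Rule], primary: Heuristic, secondary: Heuristic) -> List[Rule]:
    """Order rules by the primary heuristic; the secondary key only affects
    the relative order of rules whose primary keys are equal."""
    return sorted(rules, key=lambda rule: (primary(rule), secondary(rule)))


# Hypothetical stand-ins for two heuristics' evaluation values.
def primary_value(rule: Rule) -> float:
    return float(rule["primary"])


def secondary_value(rule: Rule) -> float:
    return float(rule["secondary"])


rules = [
    {"name": "r1", "primary": 2.0, "secondary": 5.0},
    {"name": "r2", "primary": 1.0, "secondary": 9.0},
    {"name": "r3", "primary": 2.0, "secondary": 1.0},
    {"name": "r4", "primary": 1.0, "secondary": 3.0},
]

plan = [rule["name"] for rule in mhe_sort(rules, primary_value, secondary_value)]
print(plan)
```

Because the second key matters only for ties, a primary heuristic that almost never produces equal values leaves the secondary heuristic with nothing to decide, which is consistent with the behavior reported for PSS as first scheduler.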
Figure 8. Different types of plug-ins currently supported by the Wirebrush4SPAM platform.
In Figure 8, scheduler plug-ins provide instances of the prescheduler_t data type, which contains the information required for its operation (data) and a pointer to the function able to sort the rules according to the specific implemented heuristic (prescheduler_func_t). All the plug-ins supported by Wirebrush4SPAM have a similar architecture and can be implemented using the C-Pluff library.‡ This design allowed us to easily develop the supporting code to dynamically load plug-ins configured using XML descriptors.

From the prescheduler_t design view shown in Figure 8, it can be seen that the codification of most scheduling heuristics is straightforward. However, the design and implementation of the MHE rule-sorting scheme presented in this work entailed a new challenge. In order to load plug-ins only once (i.e., while starting the core of the Wirebrush4SPAM platform), we take advantage of the possibility of including additional parameters in the plug-in descriptor file for configuring those plug-ins used by MHE. In detail, all the scheduling plug-ins include a type parameter with two possible values: simple or complex. A complex value for this plug-in descriptor option requires loading an additional parameter (sub_scheduler) from the configuration file, allowing the instantiation of additional schedulers.

‡ Available at http://www.c-pluff.org/

5. CONCLUSIONS AND FURTHER WORK

Rule scheduler components included in the Wirebrush4SPAM platform allow the creation of new rule execution plans by using different heuristics as sort criteria with the aim of finding faster ways of executing spam filters over incoming messages. The utility of these scheduling methods used to adjust the rule execution order is twofold: improving filtering throughput and saving computational resources. This work initially introduces two novel scheduling heuristics (RBM and CEM) that can be used to successfully discover efficient rule execution strategies to improve spam filtering throughput. Moreover, we propose a novel heuristic ensemble method (MHE) able to adequately combine the scheduling knowledge generated by other heuristics. For comparison purposes, we present an empirical analysis specifically intended to measure the performance gained when using these heuristics for scheduling rules. Finally, we compare the performance of our three novel alternatives with the GDV and PSS rule scheduling schemes previously presented in [19].
The results obtained through the execution of our experimental protocol enabled us to reach the following conclusions: (i) scheduling schemes based on minimizing the computational resource consumption (RBM and CEM) generate less efficient execution plans than the GDV heuristic (due to the high time variability of network tests); (ii) although GDV achieves great results for both filtering trends, the implementation of a combined scheduler significantly improves the filtering performance obtained by the GDV algorithm; (iii) regardless of the filter trend, all sorting algorithms require more time to classify spam e-mails than ham messages due to the greater complexity associated with the spam content; (iv) the poor results achieved by the PSS heuristic, together with the underperformance obtained by its use in combination with other approaches, make PSS a non-recommended algorithm for improving filtering performance; and (v) the CEM heuristic achieved results worse than those theoretically expected, mainly because the way it evaluates rules does not fit a real environment well.

By solely using the RBM heuristic, we were not able to build faster execution plans in comparison with other alternatives (i.e., GDV). However, the possibility of combining multiple heuristics introduced by the MHE approach allows us to quickly discover new rule execution arrangements able to significantly outperform the results obtained by using simple heuristics. As we can conclude from the achieved results, the use of the MHE scheduling ensemble with GDV and RBM allows time savings of up to 10% with respect to the use of GDV as a simple heuristic [19].

Current works on this topic address neither the arrangement of META rules nor that of rules having inter-execution dependencies [19].
Keeping this idea in mind, we believe that filtering throughput can be improved even more by using specific heuristics able to take advantage of conclusions extracted in previous works while fulfilling dependencies. Moreover, we consider valuable the development of mechanisms able to save logs about the execution behavior of each rule (e.g., start time and end time) in order to facilitate the development of graphical representations depicting the e-mail classification process. Analyzing this graphical information, we could easily find important constraints and new heuristics that should be applied to scheduling schemes for improving filtering throughput. Finally, the development of filter execution simulators could also reduce the time required to evaluate the suitability of an execution plan and mitigate the impact of delays produced by the execution of network operations while measuring throughput. Thus, by using a rule execution simulator together with artificial intelligence techniques, we could more easily identify valuable information useful for boosting filter performance.

ACKNOWLEDGEMENTS

This work was partially funded by the [14VI05] Contract-Program from the University of Vigo. D. Ruano-Ordás was supported by a pre-doctoral fellowship from the University of Vigo.

REFERENCES

1. Internet World Stats. 2014 Internet growth statistics. Available at: http://www.internetworldstats.com/emarketing.htm [last accessed 15 January 2015].
2. Symantec Corporation. 2000–2009 The spam explosion. Available at: http://www.symantec.com/connect/blogs/2000-2009-spam-explosion [last accessed 15 January 2015].
3. Krebs B. Spam volumes: past & present, global & local. Available at: http://krebsonsecurity.com/2013/01/spam-volumes-past-present-global-local/ [last accessed 15 January 2015].
4. Doreian P, Stokman F. Evolution of Social Networks. Routledge: London, 2009.
5. De W, Irani D, Pu C. A study on evolution of email spam over fifteen years. In: Proceedings of the 9th International Conference on Collaborative Computing: Networking, Applications and Worksharing (Collaboratecom) 2013; 1–10.
6. NextGate Inc. State of Social Media Spam. Available at: http://nexgate.com/wp-content/uploads/2013/09/Nexgate2013-State-of-Social-Media-Spam-Research-Report.pdf [last accessed 15 January 2015].
7. Godwin C, Maozhen L, Yang L. An ontology enhanced parallel SVM for scalable spam filter training. Neurocomputing 2013; 108:45–57.
8. Fdez-Riverola F, Iglesias EL, Diaz F, Mendez JR, Corchado JR. SpamHunting instance-based reasoning system for spam labelling and filtering. Decision Support Systems 2007; 43(3):722–736. DOI: 10.1016/j.dss.2006.11.012.
9. Song Y, Kolcz A, Giles CL. Better naive Bayes classification for high-precision spam detection. Software: Practice and Experience 2009; 31(11):1003–1024.
10. Yevseyeva I, Basto-Fernandes V, Ruano-Ordas D, Méndez JR. Optimizing anti-spam filters with evolutionary algorithms. Expert Systems with Applications 2013; 10(40):4010–4021.
11. Heydari A, Tavakoli M, Salim N, Heydari Z. Detection of review spam: a survey. Expert Systems with Applications 2015; 42(7):3634–3642.
12. SpamAssassin Group. The Apache SpamAssassin project. Available at: http://spamassassin.apache.org/ [last accessed 15 January 2015].
13. Pérez-Díaz N, Ruano-Ordas D, Fdez-Riverola F, Méndez JR. Wirebrush4SPAM: a novel framework for improving efficiency on spam filtering services. Software: Practice and Experience 2012; 43(11):1299–1318.
14. Sandeep Y, Yogesh Y. A review on spam detection methods. International Journal of Management, IT & Engineering 2013; 3(1):155–171.
15. Robinson G. The spamometer. Available at: http://jon.es/spamometer/ [last accessed 15 January 2015].
16. Jeftovic M. filter.plx: a context/keyword based spam filter. Available at: http://web.archive.org/web/19981207030000/antispam.schmooze.net/filter/ [last accessed 15 January 2015].
17. Symantec Corporation. Symantec messaging gateway 10.5. Data sheet: messaging security. Available at: http://www.symantec.com/content/en/us/enterprise/fact_sheets/b-symantec-messaging-gateway-10.5-DS-21320399.enus.pdf [last accessed 15 January 2015].
18. McAfee Inc. McAfee Email Protection. Data sheet: powerful, inclusive security and smart flexibility. Available at: http://www.mcafee.com/us/resources/data-sheets/ds-email-protection.pdf [last accessed 15 January 2015].
19. Ruano-Ordas D, Fdez-Glez J, Fdez-Riverola F, Méndez JR. Effective scheduling strategies for boosting performance on rule-based spam filtering frameworks. Journal of Systems and Software 2013; 86(12):3151–3161.
20. Pérez-Díaz N, Ruano-Ordas D, Fdez-Riverola F, Méndez JR. SDAI: an integral evaluation methodology for content-based spam filtering models. Expert Systems with Applications 2012; 39(16):12487–12500.