Advanced Classification Lists (Dirty Word Lists) for Automatic Security Classification

Paal E. Engelstad, Hugo Hammer, Anis Yazidi and Aleksander Bai

Abstract—With the increasing risk of data leakage, information guards have emerged as a novel concept in the field of security. They bear similarity to spam filters in that they examine the content of exchanged messages. A guard is defined as a high-assurance device used to control the information flow, typically from a domain with a "high" level of confidentiality, such as a corporate or military network, to a domain with a "low" level, such as the Internet or the network of a subcontractor. It often uses simple classification lists (a.k.a. "Dirty Word Lists") to automatically assess the security classification (e.g. "Public" vs "Confidential") of information objects, such as documents or text messages. The object is released into the "low" domain only if the policy allows information objects of that classification level to be released. Otherwise, it will be blocked and possibly quarantined for human inspection and intervention. The classification lists in use today are usually simple and configured manually. This paper demonstrates the use of machine learning to create more advanced classification lists automatically. A major obstacle to the use of machine learning is that it would create long lists that are difficult for humans to inspect, analyze and control. In addition, some of the most efficient machine learning techniques, particularly SVM and Neural Networks, are "black-box" classifiers, meaning that they do not possess an explanatory nature. In this paper, we explore the use of massive/strict dimensionality reduction in order to create a sparse solution that results in a brief classification list that is easier for humans to analyze.

Index Terms—Security, classification, machine learning, lasso, feature selection, dirty word list, cross-domain information exchange.

Document submitted on 20 June 2015. P. E. Engelstad, H. L. Hammer, A. Yazidi and A. Bai are with the Oslo and Akershus University College of Applied Sciences (HiOA), Oslo, Norway (e-mails: paal.engelstad, hugo.hammer, anis.yazidi, [email protected]). This work was partially funded by the University Graduate Center (UNIK).

I. INTRODUCTION

SECURITY classification [1] is a concept used by private corporations, the military, government agencies and international organizations. The classification indicates the sensitivity of the contents of different information objects. Examples of information objects that might be given a security classification are documents and text messages. A large number of private corporations use classification of information objects. Different companies might use different classifications and have different company-specific policies associated with each classification. They also often use different classification names. For instance, Caterpillar Inc. uses "Public", "Confidential Green" and "Confidential Red"; Amoco Corporation uses "General", "Confidential" and "Highly Confidential"; and Whirlpool uses "Public", "Internal" and "Confidential" [2].

The military, on the other hand, often uses classification categories that bear names like "Unclassified", "Restricted", "Confidential", "Secret" and/or "Top Secret". The security classification mandates how the information object shall be treated according to the governing security policy. The "Unclassified" and "Public" categories might mean that the information in the document does not need any particular protection, while other classifications might indicate that the information is classified/sensitive and that the information object must be handled accordingly.

Organizations that implement security classifications might also use guards to enforce information flow control according to the policy. The guard is typically located on the border between two domains, e.g. a "high" domain and a "low" domain (Figure 1). It is responsible for protecting the confidentiality of the "high" domain by denying objects with too high a classification from leaving the "high" domain and being released into the "low" domain. The guard should ideally protect against information leakage out of a "high" domain, both from sloppy and dishonest employees and from possible malware. The guard inspects the information objects coming from the "high" domain (e.g. a corporate or military network) and determines the security classification of each object. If the policy permits information of the determined classification to flow to the "low" domain (e.g. the Internet), the guard permits the information object to be released into the "low" domain. If the security classification of the information object is too high, the guard may deny release of the information object.

The guard often implements a classification list. This is a list of words whose presence can be taken as an indication of the security classification of an object. The text of the information object is inspected, and the words in the text are compared to the list. For example, if a company runs a confidential project named "XEAGLE", the word "XEAGLE" can be put on the list. If an object contains the word "XEAGLE", the guard classifies the information object as "Confidential". The information object might then not be permitted to be released to the "low" domain (if this is the policy). Thus, the simple classification lists that are in use today are similar to the "Dirty Word Lists" found in some spam filters. In a spam filter, a "Dirty Word List" is a list of blacklisted/banned words used to prevent unwanted information objects from entering a "high" domain.
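To make the mechanism concrete, the sketch below shows the kind of lookup a simple classification list boils down to. This is our own illustration, not an actual guard implementation; the word list, levels and policy are hypothetical:

```python
# Minimal sketch of a simple classification-list ("dirty word list") check.
# The list, levels and policy below are hypothetical illustrations.
CLASSIFICATION_LIST = {"xeagle": "Confidential"}  # word -> implied classification
RELEASABLE = {"Public"}  # policy: classifications allowed into the "low" domain

def assess(text):
    """Return the classification implied by the first listed word found."""
    for word in text.lower().split():
        if word in CLASSIFICATION_LIST:
            return CLASSIFICATION_LIST[word]
    return "Public"

def guard(text):
    """Release the object into the "low" domain only if policy allows it."""
    return "release" if assess(text) in RELEASABLE else "block/quarantine"

print(guard("status report for project xeagle"))  # -> block/quarantine
```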

There are a number of commercial guard solutions available. For example, Lockheed Martin's Radiant Mercury (RM), BAE's DataSync Guard and Boeing's eXMeritus Hardware Wall have been certified and officially approved for use by the Department of Defense (DoD) in the US [3]. In terms of content scanning, these guards typically support some type of basic "dirty word" checking, i.e. the classification lists in these guards are not very advanced. A review of such content scanning tools is undertaken in [4].

Even though there are some resemblances between a guard and a spam filter, a guard is often concerned with multiple classification levels and advanced policies, and is a "high-assurance" device that is highly critical for the entire operation of the organization, while a spam filter is often a simpler low- or medium-assurance device. The guard primarily handles "release" decisions about transfers out of a "high" domain, while a spam filter handles information flow control into a "high" domain.

A classification list is typically configured manually. The advantage is that the human has full control and easily decides which words to include on the list. Good human control of the list is critical, since the guard enforces confidentiality protection of the domain, which is often critical for the organization. The disadvantage of manually constructed lists is that it is hard to construct lists with advanced functionality.

An advanced classification list could be similar to a "sentiment list" (see e.g. [5]) used for opinion mining and sentiment analysis. Such advanced lists would take a large number of words into account, and give each word a "weight" per classification level. The sum of the weights of all the listed words that are also found in the document could give an indication of the probability that the document belongs to a certain classification level. Furthermore, a threshold can be set to configure how restrictive the guard should be for each class, and the accumulated weight of the words in the document is compared against this threshold. It is evident that it is very difficult for a human being to create such lists with good accuracy, and to determine weights and thresholds in a meaningful, accurate and consistent way.

An alternative approach is to configure the blacklist automatically. A simple automatic approach is to scan a large number of documents and use word frequency analysis to determine specific words that are indicators of the different classes. A more advanced automatic approach is to use machine learning techniques. Some machine learning techniques, such as neural networks, are probably not suitable for creating classification lists, since they create opaque solutions that are hard to analyze. In this paper, on the other hand, we explore machine learning methods such as Support Vector Machines (SVM), k-Nearest Neighbor (kNN) and Naïve Bayes (NB) [6]. We put even more focus on Lasso (Least Absolute Shrinkage and Selection Operator) [7], [8], which has been shown to have compelling features for this problem in a previous work [9]. Especially the regression methods, such as Lasso, are of interest as creators of some form of advanced classification list. With linear regression, the method creates a linear predictor of the form $\eta = \beta_0 + \sum_i \beta_i w_i$, which will be compared to the word-frequency vector of a document that is to be classified. Here, $w_i$ corresponds to a word on the classification list, while $\beta_i$ corresponds to the weight given to that word. Such weights are not used in the simplest classification lists. The $\beta_0$ term is an additional bias term that also represents an advancement over simple classification lists. These terms are easy for humans to interpret and understand, as demonstrated by a simple example at the end of this paper. Furthermore, an additional advantage of logistic regression methods (such as Lasso, discussed later) is that the numerical result of applying this advanced weighted classification list to a document can be fed directly into the logistic function to produce a classification probability between 0 and 100%, which is also easy to grasp for skilled personnel.
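A toy illustration of such a weighted list follows; the words, weights and bias are invented for the example and not taken from any real list:

```python
import math

# Hypothetical advanced classification list: word -> weight (beta_i),
# plus a bias term (beta_0). Positive weights pull towards "Classified".
WEIGHTS = {"xeagle": 2.5, "budget": 0.4, "public": -1.1}
BIAS = -0.5

def classified_probability(text):
    """Linear predictor eta = beta_0 + sum of beta_i over matching words,
    mapped to a probability with the logistic function."""
    eta = BIAS + sum(WEIGHTS.get(w, 0.0) for w in text.lower().split())
    return 1.0 / (1.0 + math.exp(-eta))

print(round(classified_probability("xeagle budget meeting"), 2))  # ~0.92
```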

SVM, on the other hand, creates a classifying hyper-plane described by a vector that is a linear sum of a number of support vectors. By comparing the vector of the document with the vector of the hyper-plane, one can determine the classification. However, SVM seems less suitable than Lasso as a generator of a classification list. While the results of Lasso are easy to analyse, SVM on the contrary is known to have "notoriously poor interpretability", as noted by Kotsiantis [10]. Nevertheless, we use SVM (as well as kNN and NB) as a representation of machine learning methods in general, against which a logistic regression model of Lasso will be compared.

Machine learning approaches are not fully utilized by guards today, mainly because guards are critical, high-assurance devices that mandate extraordinary control. If machine learning is ever to be fully employed for automatic security classification in guards, it will probably first be used in a human-assisted solution, e.g. "difficult decisions" might require human assistance, or suspicious information objects might be quarantined for human intervention. Humans also need a way to "inspect" the actual result of the machine learning, as simply as a human can inspect a simple classification list or a "Dirty Word List". This is to account for some disadvantages of automatic approaches. First, they can often be manipulated. They are usually undertaken on a large corpus (e.g. a large number of text documents), and it is hard for a human to have sufficient insight into all the details of the corpus. A malicious user or agent can forge documents that become part of the corpus, and this might be difficult to detect later. Secondly, parts of the classification list might stem from coincidental features of the corpus, and might not be sufficiently applicable to the general task of filtering (or classification).

The focus of this paper is how to use machine learning to create classification lists that are advanced and perform well, but at the same time are so brief that they are easily inspected by humans. In our approach, we apply machine learning to the problem of Automatic Security Classification. Little has been published on this problem. To the best of our knowledge, the first published work on the issue is a recent work from 2015 [9].

In fact, Automatic Security Classification as a general problem was also addressed in an article published back in 2005. However, that work applied Rhetorical Structure Theory to the problem [11], which is less relevant to our focus here. Later, in 2008, Clark proposed in a Master's thesis to apply general machine learning techniques to the problem [12]. The thesis did not attempt to apply machine learning, but proposed an architecture for the problem. General machine learning methods were first applied to the problem in a hard-to-access, unpublished technical report from 2010 [4], in which many parameters of the work are poorly described. However, this work was used as a reference and starting point for the work presented in [9]. All the mentioned works comment on the surprising lack of published data. The work in [9] attempted to bring the issue of Automatic Security Classification into the open domain of published works, and to explain experiments and related parameters and methods in detail. In addition to exploring standard machine learning techniques, such as SVM, kNN and Naïve Bayes, it also explored the use of Lasso. The paper expanded the technique for a two-class classification problem into a solution for multi-class classification, and discussed the importance of considering false negatives and false positives. Finally, the paper identified the idea of using Lasso to generate short classification lists. This idea is further explored in this paper.

The main contribution of this paper is an in-depth exploration of how to use machine learning to automatically create classification lists that are more advanced than the simple "dirty word lists" used today, that perform well, and that are brief and easy for humans to inspect and interpret. In addition, it shows that the Lasso machine learner has the potential to create very brief classification lists without sacrificing too much performance. Finally, practical examples of this are demonstrated, analyzed and discussed.

What do we mean by an advanced classification list, anyway? As a simplification, we might order different types of classification lists from simple to more advanced as follows:

1) Blacklists of banned words / "Dirty Word" lists. These kinds of lists typically have a binary outcome: if an information object contains a word or expression on the list, the information object is blocked.

2) Generic sentiment lists. Sentiment lists often assign a number to each term on the list, but usually do not include a bias term. The words contained in an information object are matched against the list, and a match yields the number assigned to the word as a score. All scores are summed into an aggregate number, whose value is used to classify the object. The list can be manually configured, such as the AFINN list, or it can be generated automatically (e.g. using PMI) - cf. [13], [14], [15] and [5] for further information on the topic. Opinion mining often uses sentiment lists to classify information objects, typically into a class of objects with positive sentiment and another class with negative sentiment, or into multiple classes reflecting how positive the sentiment of the object is. As such, this technique can easily be applied to the classification of security levels as well.

3) Advanced classification lists. If there are only two classes, such a list will have a number assigned to each word on the list (similar to a sentiment list), but will in addition contain a bias term. In a situation with more than two security classes, there will be a separate classification list for each class. The most advanced lists will be able to convert the aggregate score into a prediction probability of a document belonging to a class. The lists with assigned numbers, the bias term and the prediction probability are all easily understood by human beings. The purpose of this paper is to construct such lists, and to ensure that the lists are sufficiently short that they can easily be analyzed and inspected by humans.

4) Opaque classification lists. These are typically generated by machine learning where the results are difficult to interpret by human beings. Thus, in a security setting where ensuring correct operation is critical and where human control and intervention might be paramount to good operation, such lists might not be the best option.

Fig. 1: Example of cross-domain information exchange across guard 1 from a "high" domain to a "low" domain, and across guard 2 to the Internet. Neither guard will permit "Secret" information objects to leak from the "high" domain. Guard 2 will also block "Confidential" information objects from being released to the Internet.

II. BENCHMARK EXPERIMENT

A. Experiments on Policy Documents from DNSA

The Digital National Security Archive (DNSA) contains the most comprehensive collection of historic and declassified US government documents available to the public [16]. These were chosen because they contain a mix of both classified and unclassified documents from three unrelated domains:

• AF, Afghanistan: The Making of U.S. Policy, 1973-1990
• CH, China and the United States: From Hostility to Engagement, 1960-1998
• PH, The Philippines: U.S. Policy during the Marcos Years, 1965-1986

We use the same corpus as in [9].

Of the 5867 documents available within these three topics, we skip documents that are not very useful for the scenario we are targeting, prior to the analysis itself (Table I). For instance, 9 documents are removed since they are duplicates of other documents already in the set, while 620 documents are not useful for classification, since their classification is unknown (i.e. they are classified/marked as "Unknown"). 1612 documents are removed because they are marked as "Excised" (i.e. the classified text sections have been removed). The "PublicUse" documents could be considered an unclassified class and the "Restricted" document as classified, but since their numbers are so limited (3 docs and 1 doc, respectively), they are omitted. Furthermore, 688 documents fall within the borderline classification "LimitedOfficial" and are therefore not considered. Thus we end up with 2933 documents in total. How the remaining documents are distributed per class and per country (AF: Afghanistan, CH: China, PH: Philippines) is summarized in Table I.

TABLE I: Documents used in the experiments (before removal of short docs).

      Total docs   Unclassified   Confidential   Secret   Top Secret
AF    901          368            420            109      4
CH    972          327            253            299      93
PH    1060         443            528            89       0
Sum   2933         1138           1201           497      97

As part of our analysis algorithm, we also remove documents that are very short (only 30 word stems or less extracted). As a baseline, we remove 128 documents and end up with a corpus of 2805 documents, which constitute the 2805 rows of the DTM used for further analysis. The resulting document distribution is summarized in Table II. For the actual machine learning, 70% of the documents (i.e. 1964 documents) are used as the training set, while the remaining 30% (i.e. 841 documents) constitute the test set.

TABLE II: Documents used in the experiments (after removal of short documents).

      Total docs   Unclassified   Confidential   Secret   Top Secret
AF    834          333            395            102      4
CH    948          322            247            286      93
PH    1023         424            514            85       0
Sum   2805         1079           1156           473      97

However, if we remove some additional keywords (explained in [9]), a few more documents fall below the limit of 30 word stems. We then end up with 2793 documents (i.e. a DTM of 2793 rows), with 1955 documents in the training set and 838 documents in the test set. The resulting document distribution is summarized in Table III.

TABLE III: Documents used in the experiments (after removal of keywords).

      Total docs   Unclassified   Confidential   Secret   Top Secret
AF    823          325            394            100      4
CH    948          322            247            286      93
PH    1022         423            514            85       0
Sum   2793         1070           1155           471      97

The DNSA documents are scanned pdf documents of poor quality, and the content text extracted by OCR (optical character recognition) contains many errors.

We extracted the raw textual contents using the OCR service provided by Abbyy [17], combined with auto-correction to mend many of the OCR errors. The processed/auto-corrected text material is more concise and less verbose than the raw text (i.e. misspelled words are merged into the same word for the analysis).

B. Addressing a two-class classification problem

We reduced the security classification problem to one that only deals with two classes. In doing so, 1795 documents (classified as either "Confidential", "Secret" or "Top Secret") are aggregated into the "Classified" class of the experiment. The remaining 1138 documents (453 documents labelled as "NON-CLASSIFIED" and 685 labelled as "UNCLASSIFIED" in DNSA) are aggregated into the "Unclassified" class of our analysis.

C. Analysing the frequency of different word stems per document

The document texts are treated as bags of words [18], in which any word order is discarded and punctuation marks, white spaces etc. are removed. Each document gives a term vector with a dimensionality corresponding to the number of available words, with each vector component providing the frequency of the corresponding word; it is typically a sparse vector. The vectors of all documents comprise the Document-Term Matrix (DTM), where each row in the matrix is a word-frequency vector corresponding to one specific document. For the analysis we used word stemming, which does not influence performance considerably [4], ending up with a corpus of 23477 word stems. Infrequent word stems, occurring less than 15 times in the entire corpus, were also omitted to speed up calculations, reducing the number of word stems (or columns in the DTM) from 23477 to around 5840. Our analyses showed that this term-frequency limitation did not affect classification accuracy noticeably. Finally, each document (row in the DTM) was represented merely by a vector of term frequency-inverse document frequency (tf-idf) weights, which does not affect the performance significantly [4].

D. Results and comparisons

We apply three general classifiers, namely "k Nearest Neighbor" (kNN), "Naïve Bayes" (NB) and "Support Vector Machine" (SVM) [6], for security classification, and compare their classification accuracy with that of Lasso. Results are summarized in Table IV.

TABLE IV: Main results of the benchmark experiment, showing the classification performance of different machine learners. The second column indicates whether all the training documents precede the test documents in time.

Machine learner   Chronological   Accuracy   95% conf.int.
SVM               No              0.77       (0.74, 0.81)
5-kNN             No              0.66       (0.62, 0.69)
Naive Bayes       No              0.65       (0.61, 0.68)
SVM               Yes             0.70       (0.67, 0.73)
5-kNN             Yes             0.58       (0.55, 0.62)
Naive Bayes       Yes             0.58       (0.55, 0.62)
Lasso             No              0.78       (0.75, 0.82)
Lasso             Yes             0.74       (0.71, 0.77)

Our results confirm that SVM performs better than k-Nearest Neighbor (kNN) and Naïve Bayes (NB), even though the performance difference is larger here than in [4]. Based on these results, SVM will be used as a benchmark machine learner throughout the rest of this paper together with Lasso, while we will focus less on kNN and NB in the following. We also observe from Table IV that Lasso performs better than SVM in terms of classification accuracy: Lasso yields a classification accuracy of 78%, compared to 77% for SVM.

E. Chronological Order

In a real setting, the classification of new documents will be based on machine learning performed on historical documents. That is, a machine learner cannot learn from future documents that have not been created yet; it has to learn only from the documents that exist up until the current point in time. To test this, we arranged the documents in chronological order and performed training on the first part of the range. We then tested on the second part, which corresponds better to an implementation in a real scenario. Results are shown in the rows marked "Yes" in the Chronological column of Table IV. We observe that the classification accuracy of SVM drops from 77% down to only 70%. Lasso still outperforms SVM, and drops slightly less, i.e. from 78% down to 74%. This means that the topics of the documents change over time, which makes it harder to predict the classification of new documents when the machine learner in a real scenario only has historic documents to learn from. In the analyses presented below, we will mainly use the benchmark experiment variant with chronologically ordered documents, since we consider this to reflect the most realistic scenario. This is in contrast to previous work, which has not taken this important issue into account.
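The following is a minimal sketch of the pipeline described in this section (Sections II-B to II-E), assuming scikit-learn. The tiny corpus is invented for illustration, and stemming and OCR auto-correction are omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy stand-in corpus: (text, label, year), label 1 = "Classified".
# The real experiment uses the 2793 DNSA documents described above.
docs = [
    ("soviet troop movements near the border", 1, 1975),
    ("press release on the trade exhibition", 0, 1976),
    ("memorandum on sensitive negotiations in peking", 1, 1978),
    ("dear sir thank you for the market research report", 0, 1980),
    ("secret assessment of soviet intentions", 1, 1983),
    ("public statement on cultural exchange", 0, 1985),
]
docs.sort(key=lambda d: d[2])                    # chronological order
split = int(0.7 * len(docs))                     # roughly 70/30 split
train, test = docs[:split], docs[split:]

# Bag-of-words DTM with tf-idf weights. The paper drops stems occurring
# fewer than 15 times in the corpus; min_df is a close stand-in for that.
vec = TfidfVectorizer(min_df=1)                  # min_df=15 on the real corpus
X_train = vec.fit_transform(t for t, _, _ in train)
X_test = vec.transform(t for t, _, _ in test)
y_train = [y for _, y, _ in train]
y_test = [y for _, y, _ in test]

# L1-regularized logistic regression (Lasso-style) and the SVM benchmark.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
svm = LinearSVC()
for model in (lasso, svm):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```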

F. Other Issues for Further Work

From these analyses it is clear that the time dependency of topics is an important feature that needs to be addressed in order to make machine learning a promising tool for automatic security classification. Introducing the time dependency of the topics into the machine learning is addressed in a follow-on companion paper [19]. Furthermore, our analysis shows that the classification accuracy depends heavily on the actual machine learning method used. Thus, future work should undertake a wide analysis of the broad range of other available machine learning methods that are not analyzed in this paper.

Moreover, in our analysis above, the DTM contained around 5840 different words. The result is a list of words where each word is assigned a weight. This can work as an advanced classification list. However, with such a large number of words, the list is very hard for humans to inspect and interpret. The main idea in this paper is to compute sparse solutions that are easier to interpret. This goes beyond the analysis in [4], which focused mostly on increasing calculation speed without sacrificing performance. The result will be a sufficiently short - but still advanced and hopefully well-performing - classification list. This is the focus of Section IV of this paper.

III. DIMENSION REDUCTION

A. Dimensionality reduction before Machine Learning

Before the machine learning is carried out, we reduce the dimensionality of the DTM with different alternative dimensionality reduction methods: "Information Gain" (IG), "Chi-Square" (CHI) and "Document Frequency" (DF). The reader is referred to [20] for further information and details about these methods. We selected IG and CHI because they are known to perform well [20], and DF due to its simplicity. Results are shown in Figure 2; a minimal code sketch of this step is given below. For clarity, we have focused only on Lasso and SVM in the figure, given that these are the best performing methods at hand. In Figure 2 we observe that IG is the best dimensionality reduction method for both SVM and Lasso. Lasso preserves its classification performance under IG dimensionality reduction, and generally performs better than SVM also when the dimensionality is reduced.

B. Integrated dimensionality reduction with Lasso

With Lasso, we learn a text classifier, y = f(x), from a set of n training examples D = {(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n)}. For text categorization, the row vectors x_i = [x_{i1}, ..., x_{ij}, ..., x_{id}]^T of the DTM correspond to document i, and d is the number of terms. Usually d is a huge number. The values y_i ∈ {−1, +1} are class labels indicating non-membership or membership of a class. We shall explain how Lasso can be applied to solve a two-class classification problem. The concept can easily be generalized to a multi-class problem.
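Before turning to the Lasso model itself, here is the promised sketch of the IG-based pre-reduction from Section III-A. It is a minimal sketch assuming scikit-learn, with mutual information as a stand-in scorer for IG, and it reuses the X_train, y_train, X_test and y_test names from the pipeline sketch in Section II:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Score terms by mutual information (information gain) with the class
# label, and keep only ~1% of them before the Lasso learning step.
k = max(1, X_train.shape[1] // 100)
selector = SelectKBest(mutual_info_classif, k=k)
X_train_red = selector.fit_transform(X_train, y_train)
X_test_red = selector.transform(X_test)

lasso = LogisticRegression(penalty="l1", solver="liblinear")
lasso.fit(X_train_red, y_train)
print("accuracy:", lasso.score(X_test_red, y_test))
```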

Fig. 2: The upper pane shows the classification accuracy for SVM and Lasso, as functions of the share of the original 5837 words that remains after applying the IG, CHI or DF dimensionality reduction methods prior to the machine learning (logarithmic scale). The lower pane shows the same results for only a few words (metric scale).

Logistic regression is a conditional probability model of the form

$$p(y_i = +1 \mid \beta, x_i) = \frac{1}{1 + \exp(-\beta^T x_i)} \qquad (1)$$

To estimate the unknown parameters in the Lasso model, we compute the following minimization:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \ln\left(1 + \exp(-y_i \beta^T x_i)\right) + \lambda \sum_{j=1}^{d} |\beta_j| \right\} \qquad (2)$$

where $\hat{\beta}$ denotes the parameter estimates. As we see, the expression consists of two sums. The first sum maximizes the likelihood of the parameters β_j, j = 1, 2, ..., d, while the second sum controls the sparsity of the solution; λ is a regularization parameter controlling the degree of sparsity. Lasso is a regularized regression method based on the L1-norm (the second sum in equation (2)). L1 regularization is a compelling feature for creating brief classification lists, because it leads to sparse solutions in which an additional automatic dimensionality reduction is performed within the Lasso learning algorithm. Lasso maximizes the likelihood while constraining (i.e. penalizing) the sum of the absolute values (i.e. the L1-norm) of the regression coefficients, as shown above. The L1-based regularization is what makes Lasso interesting for our problem, since some of the β_j estimates become exactly zero and are removed from the solution. This leads to sparse solutions with many zero β values, effectively reducing the dimension of the solution. The size of the penalty is controlled by the λ parameter; the larger the penalty applied, the sparser the solution.

We first explored doing feature selection only with Lasso, by selecting a non-optimal λ parameter. The result is shown by the curve marked "Only Lasso Dim.Red" in Figure 3. As seen in the scatter plot, there are a number of other points in the plot with better performance than this curve. Thus, strategies other than relying entirely on the integrated feature selection of Lasso seem to be better choices. Let us discuss such other strategies.

Fig. 3: Each scatter point corresponds to one unique combination of a certain degree of IG dimensionality reduction and a certain degree of integrated dimensionality reduction by Lasso. The blue line uses no IG at all, while the red line uses only IG dimensionality reduction. The green points correspond to IG = 0.01 (99% reduction) and different degrees of Lasso dimensionality reduction. Some green points are above the red line (and might perform better).

We then performed dimensionality reduction with IG down to a certain share of the original number of words, ranging from 1 (i.e. no dimensionality reduction at all) to 0.001, and then performed Lasso machine learning. After the machine learning had been carried out, we adjusted the λ parameter of Lasso to do additional integrated dimensionality reduction. Figure 3 shows a scatter plot where each point corresponds to one combination of both IG and Lasso dimensionality reduction. The figure shows that there is little gain in selecting a non-optimal λ parameter to obtain sparser solutions, compared to doing IG-based dimensionality reduction in advance. In the figure, the line shows the combination of IG together with Lasso at the optimal value of λ. Some points in the scatter plot lie above this line, and thus indicate better performance. However, since the 95% confidence interval tends to be of almost the same size for all our analysis results (i.e. around +/- 0.03 to 0.04), we suspect these points might simply represent outliers within the confidence interval.

C. The Real List Length Produced by Lasso

We have inspected the zero-valued β values of the solutions displayed in Figures 2 and 3 above. These values correspond to words that are removed from the solution and are thus removed from the classification list. Thus, the Lasso solution provides an even shorter list length than what is given by an initial dimensionality reduction method, such as IG. The real list lengths of Lasso are shown in Table V.

TABLE V: The real list lengths of Lasso, since zero-valued beta values can be ignored. "FS" = dimensionality reduction (Information Gain). The 95% confidence intervals are +/- 0.03 for all the classification accuracy results of Lasso shown in the right-most column.

# words before FS   Share after FS   # words after FS   # words w/Lasso   Lasso accuracy
5837                0.00100          6                  6                 0.69
5837                0.00178          10                 10                0.72
5837                0.00316          18                 18                0.74
5837                0.00562          33                 29                0.75
5837                0.01000          58                 39                0.75
5837                0.01778          104                58                0.75
5837                0.03162          185                91                0.75
5837                0.05623          328                136               0.75
5837                0.10000          584                177               0.75
5837                0.17783          1038               201               0.75
5837                0.31623          1846               295               0.75
5837                0.56234          3282               306               0.74
5837                1.00000          5837               314               0.74

Thus, in terms of creating short classification lists (sparse solutions) without losing too much classification accuracy, we observe in Figure 2 that Lasso outperforms SVM. More generally, while other machine learners would easily create a classification list of 5837 words, Lasso automatically reduced the size to only 314 words (Table V). Lasso also demonstrates superb preservation of the classification accuracy under Information Gain (IG) based dimensionality reduction. As observed in the table, Lasso preserves its accuracy of around 75% down to a classification list length of only 29 words, and the accuracy drops to only around 74% with a list length of only 18 words. In fact, SVM also provides somewhat sparse solutions, in the sense that it uses only a subset of the sparse training vectors as support vectors. However, given that each of these easily contains hundreds of words, SVM will still not provide solutions as sparse as those of Lasso. We do not analyse this issue further here, since we use SVM mainly as a general benchmark for classification performance, as explained earlier.

IV. EXAMPLES OF SHORT CLASSIFICATION LISTS

A. Basic example

In Table VI we show the classification list that corresponds to the third row of results in Table V, i.e. a classification list of 18 words. Strictly speaking, the list shows word stems, since we applied word stemming prior to the machine learning. Every word is assigned a weight. Not surprisingly, documents containing words like "Moscow" and "Soviet" favour assessing the document as "Classified", while documents about "market", "research", "dollar" and "servic(es)" tend to be "Unclassified". We also speculate that letter documents tend to be "Unclassified", due to the word "dear", which is a strong indicator of a letter. The fact that the word "dear" has the strongest weight might also be an indication that the word "dear" perhaps only occurs once in letters, which is something the optimization algorithm might have taken into account. Furthermore, it seems that words used to estimate quantities, like "entir(e)", "unit" etc., might favour a "Classified" classification. Keep in mind, though, that we have shown an example of a very short classification list, for illustrative purposes.
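A sketch of how such a list can be read directly out of a fitted L1-regularized model is given below. It is a minimal sketch reusing the scikit-learn objects (vec, selector, lasso) from the earlier sketches; with glmnet in R, the non-zero coefficients would be extracted analogously:

```python
import numpy as np

# Words with non-zero beta values form the classification list; the
# remaining betas were zeroed out by the L1 penalty (cf. Table V).
terms = np.array(vec.get_feature_names_out())
terms = terms[selector.get_support()]  # account for the IG pre-reduction
betas = lasso.coef_.ravel()
keep = np.flatnonzero(betas)

for word, beta in sorted(zip(terms[keep], betas[keep]), key=lambda wb: wb[1]):
    print(f"{word:15s} {beta:+.2f}")
print(f"{'BIAS (beta_0)':15s} {lasso.intercept_[0]:+.2f}")
```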

TABLE VI: Example of an unbalanced classification list. Words with negative values contribute towards a security classification as "Classified", while words with positive values contribute towards a security classification as "Unclassified".

Word      β-      |  Word        β+
entir     -5.39   |  fund        0.04
comment   -2.62   |  market      0.10
unit      -2.01   |  salari      0.21
peke      -1.78   |  research    0.22
moscow    -0.57   |  laboratori  0.29
pint      -0.53   |  properti    0.64
soviet    -0.28   |  total       0.73
                  |  servic      0.85
                  |  dollar      1.20
                  |  salari      0.21
                  |  dear        2.19

BIAS (a.k.a. "α" or "β0"): -0.47

In a real scenario, such lists will probably be considerably longer. Moreover, the brief analysis presented here is only based on speculation. With these lists at hand, analysis tools can be developed in the future to assist humans in matching the listed words against the corpus to provide further understanding.

B. A multi-class example

While the analysis in this paper has focused on a classification problem with only two classes, the work in [9] shows how this analysis can easily be extended to a multi-class problem with more classes. Thus, classification lists can also be extended to multi-class scenarios, with one classification list per class. To generate a multi-class classification list example, we split all the classified classes (which were aggregated into one class above) into the separate classes "Confidential", "Secret" and "Top Secret", and ran multinomial regression with Lasso (in contrast to the binomial regression used thus far in the paper, including Section IV-A). Otherwise, we selected exactly the same parameters as in Section IV-A. That is, we used the Information Gain type of feature selection, and reduced the dimensionality down to a 0.00316 share of its original size before applying the Lasso regression. In Table VII we show the multi-class classification lists that were derived. Note that there is now one classification list per class, in contrast to the two-class classification problem in Section IV-A, which generated one classification list covering both classes (Table VI).

TABLE VII: Example of a multi-class classification list, with one list per class. Words with positive values contribute towards classifying a document into the class of the list in question, while words with negative values contribute against that class.

UNCLASSIFIED:
  dear 3.03, china 2.05, republ 1.03, washington 0.83, tag 0.39;
  henri -0.08, soviet -0.15, eye -0.26, chou -0.28, chines -0.35, memorandum -0.49, kuan -0.81, sensit -0.95, info -1.39, peke -1.66, might -1.75

CONFIDENTIAL:
  tag 4.50, eye 0.12;
  memorandum -0.12, might -0.16, republ -0.18, sensit -0.22, chines -0.39, peke -0.63, page -0.72, henri -0.79, your -0.94, washington -1.05, info -3.01

SECRET:
  info 2.66, page 1.91, chines 1.02, peke 0.63, sensit 0.22, might 0.16, memorandum 0.12, soviet 0.11, henri 0.08;
  eye -0.12, dear -0.18, tag -0.39, china -0.53

TOP SECRET:
  eye 1.53, info 1.39, kuan 1.36, your 1.22, peke 1.17, henri 1.01, sensit 0.84, might 0.56, chines 0.35, memorandum 0.30, chou 0.24;
  tag -4.23
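As with the two-class case, a minimal sketch of the multinomial variant follows, again using scikit-learn as a stand-in for the Lasso implementation. The label vector y_train_multi (four-class labels 0..3, in the order of class_names) is an assumption, and X_train_red and terms are reused from the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Multinomial logistic regression with an L1 penalty: one row of
# coefficients, i.e. one classification list, per security class.
class_names = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]
multi = LogisticRegression(penalty="l1", solver="saga",
                           multi_class="multinomial", max_iter=5000)
multi.fit(X_train_red, y_train_multi)  # y_train_multi: labels 0..3 (assumed)

for idx, row in enumerate(multi.coef_):
    kept = [(terms[j], row[j]) for j in np.flatnonzero(row)]
    top = sorted(kept, key=lambda wb: -wb[1])[:5]
    print(class_names[idx], top)  # strongest words for each class list
```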

C. The bias term, β0

Notice the bias term of -0.47 present in the basic example in Section IV-A (i.e. in Table VI). In logistic regression, the bias term is often referred to as the "α" or "β0" value. To understand the function of the bias, let p denote the share (e.g. percentage) of documents belonging to one class of the corpus. In the case where a document to be classified contains no words that are listed on the classification list, the best a classifier can do is to guess that the document belongs to the first class with probability p, or to the other class with probability (1 - p). With no words matching those on the classification list, all β values in equation (1) are zero, except the bias term, β0. Thus, there is a basic relation between the bias term, β0, and the share of documents in one of the two classes. The relation is given by equation (1), and is $p = (1 + \exp(-\beta_0))^{-1}$. Thus, β0 depends on p through the logit function, and is used to balance between the two classes that are both represented by one single classification list.

For multi-class classification lists (cf. the example in Table VII), on the other hand, there is no need for a bias term in any of the lists, because there is one classification list per class. The balance between the classes is instead ensured by the values in each classification list. In contrast, for the two-class classification problem in Section IV-A, we had one list for both classes, and the bias term ensures the right balance between the two classes.

To illustrate the importance of the bias, β0, we calculate the classification accuracy after first artificially setting β0 to zero in equation (1). The result is a substantial drop in the classification accuracy, from 74% down to only 65%. The drop must be seen against the fact that 62% of the documents in the corpus actually belong to the "Classified" class, and thus 62% forms an expected lower bound on the performance accuracy of any reasonable classification model, as explained in further detail in Section IV-D below. In other words, this example illustrates how much the β0 term might matter, even for a moderately skewed/unbalanced corpus.

The presence of the bias term in a two-class classification list (such as the one in Table VI) distinguishes this type of list from a generic sentiment list, such as the AFINN list, which is designed for a generic corpus [13]. Since p is unknown for a sentiment list designed for a generic corpus, there are no bias terms in such sentiment lists. The importance of the bias term discussed above is a takeaway for research on generic sentiment lists: the lack of a bias term in a generic sentiment list might be an important obstacle to achieving good performance unless the corpus is well balanced.
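As a quick back-of-the-envelope check of this relation (our own arithmetic, not taken from the experiments), take the bias of $\beta_0 = -0.47$ from Table VI and treat "Unclassified" as the positive class, since the positive weights in Table VI point towards "Unclassified":

$$p = \frac{1}{1 + \exp(-\beta_0)} = \frac{1}{1 + \exp(0.47)} \approx \frac{1}{1 + 1.60} \approx 0.38$$

which is roughly consistent with the share of "Unclassified" documents in the corpus (1138 of 2933, i.e. about 39%).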

D. A balanced corpus

It is clear that for research purposes it is a good idea to select a corpus with a balanced share of documents from each class. As the discussion above demonstrates, an unbalanced corpus might yield artificially good results. For instance, assume a corpus with p = 0.95 (i.e. 95% of the documents are from one of the two classes). Then, a simple approach is to construct a list with only a bias term, which will ensure that all documents are predicted to belong to the biggest class (i.e. by ensuring the correct sign of the bias term). Even though all other β values have been artificially set to zero, the classification accuracy will be 95%, since all documents are predicted to belong to the biggest class. In other words, the skewness of an unbalanced corpus might set an artificially high lower bound on the classification accuracy results, while the lower bound on a balanced corpus, on the contrary, is typically at only 50%.

In Table VIII we show the classification list where we have actively balanced the corpus, i.e. discarded a number of documents to ensure that the two classes contain the same number of documents, at the expense of having less information available for the machine learning. More specifically, balancing the corpus means that the original 1955 documents in the training set were reduced to 1270 documents, where 633 documents belong to the "Classified" class and 637 documents belong to the "Unclassified" class. Likewise, the 838 documents in the original test set are reduced to 807 documents, where 405 documents belong to the "Classified" class and 402 documents belong to the "Unclassified" class. We observe from Table VIII that the β0 value is now very close to zero. This is as expected, since the corpus is now balanced, as explained in the discussion above. Again, we calculate the classification accuracy after forcing β0 to zero in equation (1), in the same way as above. The result is an unchanged classification performance of 74%, as expected.
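A minimal sketch of such balancing by random undersampling is given below. This is our own illustration: the paper does not specify its exact balancing procedure, and its class counts differ slightly from an exact 50/50 split; indices and labels are assumed names:

```python
import numpy as np

rng = np.random.default_rng(1)

def balance(indices, labels):
    """Undersample the majority class so both classes are equally large.

    indices: positions of the documents in one split (train or test);
    labels:  0/1 class label per document.
    """
    idx0 = [i for i in indices if labels[i] == 0]
    idx1 = [i for i in indices if labels[i] == 1]
    n = min(len(idx0), len(idx1))
    keep = list(rng.choice(idx0, n, replace=False)) + \
           list(rng.choice(idx1, n, replace=False))
    return sorted(keep)  # preserve chronological order within the split
```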

TABLE VIII: Example of a balanced classification list. Words with negative values contribute towards a security classification as "Classified", while words with positive values contribute towards a security classification as "Unclassified".

Word      β-      |  Word        β+
entir     -3.48   |  research    0.04
comment   -3.04   |  market      0.08
peke      -1.05   |  fund        0.09
pint      -0.88   |  total       0.19
unit      -0.62   |  agenc       0.30
moscow    -0.58   |  laboratori  0.65
soviet    -0.22   |  properti    0.70
                  |  dollar      0.82
                  |  servic      0.99
                  |  salari      1.23
                  |  dear        1.65

BIAS (a.k.a. "α" or "β0"): 0.03

V. CONCLUSIONS AND FURTHER WORK

This paper demonstrates that Lasso is able to automatically create advanced classification lists that might partly replace the simple "Dirty Word Lists" used in guards today, and that are brief and easy for humans to interpret. Since Lasso is based on logistic regression with L1-norm regularization, it creates a linear predictor consisting of a relatively short list of words and corresponding weights. SVM, on the contrary, is known to have "notoriously poor interpretability", as noted by Kotsiantis [10] in his review of classifiers. Thus, we believe that SVM is not a good match for the problem targeted in this paper, and future work should search for alternative approaches that can be compared with Lasso. Nevertheless, in the absence of known analyses of alternatives to Lasso for this problem, we used SVM as a benchmark for evaluating the performance of Lasso, due to the generally good performance that has made SVM a common benchmark method in many areas.

We observed an indication that Lasso might outperform SVM in terms of classification performance (e.g. as shown in Table IV and Figure 2). More importantly, while other machine learners would easily create a classification list of 5837 words, Lasso automatically reduced the size to only 314 words (Table V). Lasso also showed superb preservation of the classification accuracy when reducing dimensions by Information Gain (IG) based dimensionality reduction: Lasso preserves its accuracy of around 75% down to a classification list length of only 29 words, and the accuracy drops to only around 74% with a list length of 18 words. We also tried to adjust the λ hyper-parameter of Lasso to enforce a higher dimensionality reduction within Lasso than what is given by the best value selected by Lasso. However, the combination of Information Gain (IG) feature selection and Lasso with its optimal λ value works about just as well (Figure 3).

Finally, examples of the generation of brief classification lists, both for a two-class and a multi-class security classification problem, were presented, and the importance of the bias term was discussed.

The paper identified that chronological ordering of documents between the training set and the test set is needed to give realistic results. This requirement reduced the classification accuracy of Lasso from 78% to around 74%. These results show that there is a temporal evolution of the topics, since requiring chronological ordering between training and test sets results in a considerable drop in classification accuracy. This means that introducing the time variable into the machine learning is a promising topic for future research, addressed in a follow-on companion paper [19]. Furthermore, the success of Lasso in the context of the work presented in this paper indicates that one should not be limited to looking at SVM, kNN and Naive Bayes. Indeed, there is a need for research that evaluates a much larger set of machine learning algorithms for this problem than has been done so far.

REFERENCES

[1] R. Kissel, Glossary of Key Information Security Terms. DIANE Publishing Company, 2011. [Online]. Available: http://books.google.no/books?id=k5H3NsBXIsMC
[2] W. Nicolls, "Implementing company classification policy with the S/MIME security label," RFC 3114, IETF, May 2002.
[3] UCDMO. (2011) UCDMO cross domain baseline list. See: http://www.crossdomain.org. Accessed: 2015-03-26.
[4] J. D. Brown and D. Charlebois, "Security classification using automated learning (SCALE)," DRDC Ottawa CR, Tech. Rep., 2010.
[5] B. Liu, Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers, 2012.
[6] R. Entezari-Maleki, A. Rezaei, and B. Minaei-Bidgoli, "Comparison of classification methods based on the type of attributes and sample size," Journal of Convergence Information Technology, vol. 4, no. 3, pp. 94-102, 2009.
[7] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Statist. Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1-22, 2010. [Online]. Available: http://www.jstatsoft.org/v33/i01/
[9] P. E. Engelstad et al., "Automatic security classification with lasso," Proceedings of the 16th International Workshop on Information Security Applications (WISA 2015), Jeju Island, Korea, August 20-22, 2015.
[10] S. B. Kotsiantis, "Supervised machine learning: A review of classification techniques," Informatica, vol. 31, pp. 249-268, 2007.
[11] H. Mathkour, A. Touir, and W. Al-Sanie, "Automatic information classifier using rhetorical structure theory," in Intelligent Information Processing and Web Mining, Advances in Soft Computing, vol. 31, 2005.
[12] K. Clark, "Automated security classification," Master's thesis, Vrije Universiteit, 2008.
[13] H. L. Hammer, A. Bai, A. Yazidi, and P. E. Engelstad, "Building sentiment lexicons applying graph theory on information from three Norwegian thesauruses," Norwegian Informatics Conference (NIK 2014).
[14] H. L. Hammer, A. Yazidi, A. Bai, and P. E. Engelstad, "Building domain specific sentiment lexicons combining information from many sentiment lexicons and a domain specific corpus," Proceedings of the 5th IFIP International Conference on Computer Science and its Applications (CIIA 2015), Saida, Algeria, 20-21 May 2015.
[15] A. Bai, H. L. Hammer, A. Yazidi, and P. E. Engelstad, "Constructing sentiment lexicons in Norwegian from a large text corpus," Proceedings of the 17th IEEE International Conference on Computational Science and Engineering (CSE 2014).
[16] Digital National Security Archive. http://nsarchive.chadwyck.com/home.do. Accessed: 2015-03-26.
[17] Abbyy. http://www.abbyy.com/. Accessed: 2015-03-26.
[18] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. ACM Press, New York, 1999, vol. 463.

[19] P. E. Engelstad, H. L. Hammer, A. Yazidi, and A. Bai, "Analysis of time-dependencies in automatic security classification," Proceedings of the 7th IEEE International Conference on Cyber-enabled Distributed Computing and Knowledge Discovery (CyberC 2015), Cyber Security and Privacy (CSP), Xi'an, China, Sept 17-19, 2015.
[20] Y. Yang and J. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, 1997.