Text Classification: Classifying Plain Source Files with Neural Network

Jaromir Veber
Department of Systems Analysis, Faculty of Informatics and Statistics
University of Economics in Prague
W. Churchilla 4, 130 67 Praha 3, Czech Republic

Abstract: Automated text file categorization has an important place in computer engineering, particularly in the process called data management automation. A lot has been written about text classification, and the methods allowing classification of these files are well known. Unfortunately, most studies are theoretical, and more research is needed before they can be put into practice. I decided to contribute with research focused on creating a classifier for different kinds of programs (source files, scripts…). This paper describes a practical implementation of a classifier for text files based on file content.

Key words: classification, neural network, text file, categorization, automation, programming language
1. Motivation

Nowadays scripts are a popular target for malicious code writers: a script is much more portable than a system-dependent binary, and most modern scripting languages are easy to learn. The growing number of malicious text files brings with it a demand for automated processing of these files. The aim of text document classification is to automatically assign an unknown document to one or more predefined classes based on its content. Given the growing number of documents, automated file processing is becoming interesting not only for the academic but also for the business environment. The purpose of this classification is the subsequent automation of file processing depending on file type (file class). During malware analysis all additional information is helpful; moreover, categorization of these files allows subsequent automated processing depending on their type. For this reason a categorizer for such files was requested by a security software developer. In this paper I present a practical solution to the categorization of program source files (source files, scripts…). The large majority of articles about text classification are heavily theoretical, so it is not easy to find the practical solution that is demanded in the end. I hope this paper will help others create a classifier for the purpose described above, but also for other purposes.
2. Background

In text classification the text is usually transformed into a feature vector, and then one of several statistical methods is used to assign the vector to one or more classes. The common process of categorization is shown in Figure 1: Categorization.
Figure 1: Categorization

The methods mentioned below, which serve for feature extraction, were designed for document categorization of natural-language texts, for example news content, so that newspaper articles could be classified as articles about agriculture or articles about economics. The data on which these methods were tested include, for example, the Reuters-22173 corpus and the OHSUMED collection. The first phase of classification, as introduced in the diagram in Figure 1, is transforming the text data into a feature vector. A common way of conducting this step is to use a “bag-of-words” representation, where each feature is a single token. In this phase the classifier needs to identify two unknowns; firstly
JOURNAL OF SYSTEMS INTEGRATION 2010/4
“What is a feature?” and secondly “How to quantify the feature?” Every classifier must solve these two issues. Methods used for feature selection include document frequency thresholding (DF), information gain (IG), mutual information (MI), the χ2 statistic (CHI) and term strength (TS). These methods are described, for example, in [9] or [5]. The first aim of such a tool is to reduce the number of words 1. We perform this step because irrelevant and redundant words often degrade the performance of classification algorithms in both speed and accuracy [6]. DF performs this task simply by counting the number of text files in which a word is found and then setting a threshold, so that words missing from most documents are removed. IG rates a word by its meaning for a specific class; words with low meaning for any of the given classes are removed. “IG measures a number of bits of information obtained for category prediction by knowing the presence or absence of a term in document.” [9]. MI and CHI rate words using a statistical approach, counting the probability of co-occurrence of a word and a class; see for example [9]. TS is originally a corpus statistic introduced by [10]; for text categorization it counts a word’s strength in related documents, so that weak words may be removed and the vocabulary reduced. Once the feature vector has been constructed by the feature extractor, the next stage in the process is to send this vector to the classifier. The classifier needs to use some method to obtain the classification result. A number of interesting methods have been introduced in recent years: Naive Bayes [2], Neural Network [8], k-nearest neighbor [11], Support Vector Machines [3], Boosting [7]. However, my task is not to test all known methods. I decided to identify a practical, working solution to source file classification. Therefore the aim of this research is not the identification of the best method to be applied.
I rather focused on searching for a method that will serve the purpose well. At the same time I describe an improved method for feature extraction specific to source files.
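To make the IG criterion above concrete, here is a minimal sketch in Python. This is my own illustration, not code from the tool described in this paper: it scores a term by the reduction in class entropy obtained from knowing whether the term is present in a document. The toy corpus and its token sets are invented for the example.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """IG of one term for category prediction: H(class) minus the
    entropy of the class conditioned on the term's presence/absence."""
    def entropy(lbls):
        counts = Counter(lbls)
        return -sum((c / len(lbls)) * math.log2(c / len(lbls))
                    for c in counts.values())

    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    conditional = sum(len(part) / len(docs) * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Toy corpus: token sets standing in for source files
docs = [{"def", "import"}, {"def", "class"}, {"function", "var"}, {"var", "let"}]
labels = ["python", "python", "javascript", "javascript"]
print(information_gain(docs, labels, "def"))    # 1.0: perfectly separates the classes
print(information_gain(docs, labels, "class"))  # lower: appears in only one file
```

Terms whose IG falls below a chosen threshold would be dropped from the vocabulary.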
3. Methodology

Referring to the classification diagram introduced above (Figure 1), the task was to construct it. The main parts are the feature extractor, which creates the feature vector, and the classifier, which makes the decision about the final result (class). I should mention at this point that before undertaking this project (IG + neural network) I worked for some time on the same problem with a DF + Naive Bayes classifier. The accuracy of that classifier was around 60-80% depending on the test set. Unfortunately, by my criteria, the DF + Naive Bayes classifier was insufficiently accurate. The reasons why the previous project was not successful may be debated, but I suppose the main one was the use of plain 2 DF, which is not an ideal feature extractor for programming languages. Therefore I now introduce the second solution: IG + neural network.
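For comparison, the earlier Naive Bayes baseline mentioned above can be sketched roughly as follows. This is a generic multinomial Naive Bayes with Laplace smoothing over bag-of-words counts, not the author's original code, and the toy corpus is invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit class priors and per-class word counts from tokenized documents."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocab

def classify_nb(doc, priors, word_counts, vocab):
    """Pick the class maximizing log P(class) + sum log P(word | class),
    with add-one (Laplace) smoothing."""
    n_docs = sum(priors.values())
    best, best_lp = None, float("-inf")
    for cls, prior in priors.items():
        total = sum(word_counts[cls].values())
        lp = math.log(prior / n_docs)
        for w in doc:
            lp += math.log((word_counts[cls][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cls, lp
    return best

docs = [["def", "import", "def"], ["sub", "my", "use"], ["def", "class"]]
labels = ["python", "perl", "python"]
model = train_nb(docs, labels)
print(classify_nb(["my", "sub"], *model))  # → perl
```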
3.1 Feature Extractor
An advantage of the introduced feature extraction methods is their universality: we may use them for most purposes. As I mentioned in chapter 2, they were designed for document categorization. But in my particular research I have more details available about the files I am trying to classify, so I may use this knowledge to improve and reduce the feature vector. I am trying to classify source files. These files are specific because of programming language keywords and signs. The other words in a file are somewhat less interesting, because they are just variable names and custom function names. In some cases standard library function names are interesting for feature extraction, but not the remaining words. Keywords are used for programming language commands and declarations. It is not difficult to find all keywords that may appear in a programming language, and I may use this knowledge to my benefit. The other unique characteristic of a programming language is its sign usage. Every programming language defines which signs may appear, and others are not allowed. It is common for keywords and signs to overlap between different languages; however, there are definitely
1 More accurate would be the term feature. A feature can be as simple as a single token, or a linguistic phrase, or a much more complicated syntax template [4]. But I will use word for better comprehension.
2 Without the feature vector filter described later.
candidates among them with the best attributes to be a document feature, and these are the interesting ones. This means I may use one of the methods mentioned in chapter 2; however, I should run the words (strings) through a filter to keep only names from the standard library, keywords, and signs. Particular classes may have different filters. The result of this consideration was the creation of an XML file containing the words that passed the filtration. As input for the filtration I decided to use IG; however, I expect that MI and CHI should not return worse results, and other feature extraction methods could be used as well 3. The main step of feature extraction is to use the mentioned filter to remove custom variable and function names and to gather features into groups with similar ratings. More precisely, my feature extractor works with string search, so the words are sequences of characters. To reduce the feature vector to a minimum, the strings were gathered into groups, where every group represents one programming language (class). Every group contains one or more strings that give a high IG for that class. I thought for a while about how many rating groups are needed. I started with one group, but it was definitely not enough, so now I have two groups: the first group contains words with low and middle ratings (normal words) and the second group contains words (strings) with high ratings (special words). Figure 2: Feature Extraction demonstrates the feature extraction for one class as described above.
Figure 2: Feature Extraction

The result of my research is a modified IG, which is able to select the string vocabulary so that the dimension of the final feature vector equals the number of classes. Because the dimension is small, the speed of categorization can be quite high. Classifier uncertainty is the next obstacle calling for a solution: in particular, how should the classifier handle files in which few known strings were found? For this purpose I added one more variable to the feature vector: the number of words found divided by the number of characters in the examined text. This variable should indicate to the classifier whether the examined file is text in a programming language or whether it merely contains a single word of a programming language.
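The grouping and the extra ratio variable can be sketched as follows. The per-class string groups here are hypothetical stand-ins (the real filtered strings live in the XML file mentioned above), and the last element of the vector is the matches-per-character ratio just described.

```python
# Hypothetical high-IG string groups, one per class; the paper stores the
# filtered strings (keywords, signs, standard-library names) in an XML file.
GROUPS = {
    "python": ["def ", "import ", "elif "],
    "perl":   ["my $", "use ", "elsif "],
}

def feature_vector(text, groups=GROUPS):
    """One count of matched strings per class group, plus the ratio of
    matches to characters examined, which tells the classifier whether
    the file is really source code or just contains a stray keyword."""
    counts = []
    total = 0
    for cls, strings in groups.items():
        n = sum(text.count(s) for s in strings)
        counts.append(n)
        total += n
    ratio = total / max(len(text), 1)
    return counts + [ratio]

vec = feature_vector("import math\ndef f(x):\n    return x")
print(vec)  # [2, 0, ...]: two python hits, no perl hits, plus the ratio
```

A whole file of Python source yields a high ratio; a prose document that happens to contain the word "import" once yields a ratio near zero, which is exactly the signal the extra variable is meant to carry.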
3.2 Classifier

Decision making is the main task of all neural networks. The artificial version tries to simulate the operation of a natural neural network. Classification is a decision about assigning a class to the feature vector. Classification using a neural network was introduced by [8]. Weiner describes three main components of a neural network: architecture, cost function and search algorithm. Regarding architecture, one or more hidden layers with nonlinear activation functions allow the network to model nonlinear relationships between input and output [8]. There are surely some nonlinear relationships between programming languages (classes); therefore one or more hidden layers should be in the neural network. I used just one, and it seems to be enough. Nonlinear functions often used for hidden neurons are the sigmoid or goniometric functions. I made a few experiments, and the best activation function seems to be the Elliott function (the fast sigmoid defined by David Elliott):

y = x / (1 + |x|)   (1)
3 Worse results could be expected.
This function is faster and sometimes 4 gives better results than the standard sigmoid:

y = 1 / (1 + e^-x)   (2)

So the architecture of my network is a two-layer feed-forward network with the number of input neurons equal to the number of classes plus one, and with the same number of output neurons. The number of hidden neurons is equal to 1.3 times the number of input neurons; this constant was discovered experimentally. For the output of the neural network I am using the standard sigmoid (0 < y < 1)
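A minimal sketch of the network just described: a two-layer feed-forward pass with the Elliott function on the hidden layer and the standard sigmoid on the outputs, input size equal to the number of classes plus one, hidden size 1.3 times the input size, and the same number of output neurons as inputs. The weights here are random and untrained, purely to illustrate the architecture; a real classifier would train them on labeled feature vectors.

```python
import math
import random

def elliott(x):
    """Elliott 'fast sigmoid': cheaper than the standard sigmoid
    because it avoids computing exp()."""
    return x / (1 + abs(x))

def sigmoid(x):
    """Standard sigmoid, output in (0, 1)."""
    return 1 / (1 + math.exp(-x))

def forward(vec, w_hidden, w_out):
    """Forward pass: Elliott activation on the hidden layer,
    sigmoid activation on the output layer."""
    hidden = [elliott(sum(w * v for w, v in zip(row, vec))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]

n_classes = 2
n_in = n_classes + 1          # one input per class group plus the ratio variable
n_hidden = round(1.3 * n_in)  # the experimentally found constant 1.3
n_out = n_in                  # same number of output neurons as inputs

random.seed(0)  # untrained, illustrative weights
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

print(forward([2, 0, 0.06], w_hidden, w_out))  # outputs, each in (0, 1)
```

The Elliott function keeps the same S-shape and (-1, 1) range behavior as a scaled sigmoid but costs only a division and an absolute value per neuron, which matters when many files are classified.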