Improving Feature Selection Techniques for Text Classification

Republic of Iraq Ministry of Higher Education and Scientific Research University of Kufa Faculty of Computer Science and Mathematics Department of Computer Science

Improving Feature Selection Techniques for Text Classification A Thesis Submitted to Faculty of Computer Science and Mathematics -University of Kufa in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science.

By

Esraa Hussein Abdul Ameer Alzuabidi Supervised by

Assist. Prof. Dr. Ahmed Hussein Aliwy

2018 A.D.

1439 A.H.

In the Name of Allah, the Most Gracious, the Most Merciful

"Allah is the Light of the heavens and the earth..."

Almighty Allah has spoken the truth
(Surah An-Nur, Verse 35)

Declaration

Aware of legal liability, I hereby declare that I have written this thesis myself and that all the contents of the thesis have been obtained by legal means.

Signature:
Date:   /   /2018
Name: Esraa Hussein Abdul Ameer Alzuabidi


Approval of scientific Supervisor

I certify that this thesis entitled "Improving Feature Selection Techniques for Text Classification" was prepared under my supervision at the University of Kufa in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Signature:
Supervisor's Name: Ahmed Hussein Aliwy
Title: Assist. Prof. Dr.
Date:   /   /2018

In view of the available recommendation, I forward this thesis for debate by the Examining Committee.

Signature Name: Dr. Asaad Noori Hashim Al-Shareefi Head of the Department of Computer Science Date: / / 2018


Certification of Linguistic Expert

I certify that I have read this thesis entitled "Improving Feature Selection Techniques for Text Classification" and corrected its grammatical mistakes. Therefore, it has become qualified for debate.

Signature:
Name: Haider Yousif Selman Al-Mansury
Title: Assist. Lect.
Date:   /   /2018


Certification of Scientific Expert

I certify that I have corrected the scientific content of this thesis entitled "Improving Feature Selection Techniques for Text Classification”. Therefore, it has become qualified for debate.

Signature:
Name: Lamia Abed Noor Mohammed
Title: Assist. Prof. Dr.
Date:   /   /2018


Certification of the Examination Committee

We, the chairman and members of the examination committee, certify that we have studied the thesis entitled (Improving Feature Selection Techniques for Text Classification) presented by the student (Esraa Hussein Abdul Ameer), examined the student in its contents and in what is related to it, and found it worthy of acceptance for the degree of Master in Computer Science.

Signature:
Name: Kadhim Bureihi Al-Janabi
Title: Prof. Dr.
Date:   /   /2018
(Chairman)

Signature:
Name: Mohammed Abbas Kadhim
Title: Assist. Prof. Dr.
Date:   /   /2018
(Member)

Signature:
Name: Ayad Rodhan Abbas
Title: Assist. Prof. Dr.
Date:   /   /2018
(Member)

Signature:
Name: Ahmed Hussein Aliwy
Title: Assist. Prof. Dr.
Date:   /   /2018
(Supervisor)

Approved by the Dean of the College of Computer Science and Mathematics, University of Kufa

Signature:
Name: Kadhim Bureihi Al-Janabi
Title: Prof. Dr.
Date:   /   /2018
(Dean of the College of Computer Science and Mathematics)

ACKNOWLEDGEMENTS

All praise is to ALLAH Almighty, who enabled me to complete this task successfully, and our utmost respect is to His last Prophet Mohammad (PBUH).

I would like to start by expressing my sincere appreciation to my teacher and supervisor Dr. Ahmed Hussein Aliwy for his guidance and encouragement during the preparation of this work. His expertise, insightful comments, and useful advice have decisively contributed to my work; words, really, are not enough to express my gratitude for all that he has done for me. I would like to express my sincere thanks to everyone who helped me in one way or another, particularly my teachers and the head of graduate studies in the faculty. Also, I would like to thank my dear father (Hussein Abdul Ameer), my mother (Sahera), my brother (Ali), and my sisters (Alyaa, Walaa and Duaa) for their support. Last but not least, and most important, I would like to thank my beloved husband (Zaid) for his unremitting patience and support.

Esraa H. Abdul Ameer


DEDICATION

To the voice of human justice, Amir Al-Mo'mineen Al-Imam Ali Ibn Abi Talib
To the envoy (safeer) of Al-Hussein, Muslim Ibn Aqeel
To the flowers of my life, my father (Hussein)
To the source of love and tenderness, my mother (Sahera)
To the source of my life, my dear brother (Ali) and my sisters (Alyaa, Walaa and Duaa)
To my love, my husband (Zaid) and my sons (Hussein and Zain Alabdeen)


Abstract

Document classification is one of the most important fields in Natural Language Processing and Text Mining. Decision Tree, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes are four of the most essential machine learning algorithms for this task, and many studies try to modify and improve them for text classification. In this thesis, the work is divided into three levels: (i) a comparative study of these four algorithms; (ii) a study of improving document classification with feature selection, where three existing feature selection methods are used and a new feature selection method is suggested; and (iii) a combination of the four feature selection methods in a newly suggested weighted-voting method.

Firstly, a comparative study of these four text classification algorithms, with almost all their amendments, was carried out. This study shows that feature selection is the easiest way to improve text classification: it reduces the time complexity and increases accuracy and efficiency.

Secondly, a new feature selection method was suggested, which can be seen as a general formulation and mathematical model of Recursive Feature Elimination (RFE). The adopted method was compared with three other well-known feature selection methods: Chi-square, Information Gain, and Threshold. The results proved that the new method is comparable with the other methods, where the tests were done with the four classification algorithms.


Thirdly, the four feature selection methods are combined in a newly suggested weighted-voting method, and the results are compared with the unweighted-voting technique. The best results were 83% when 60% of the features were used, 82% when 40% of the features were used, and 82% when 20% of the features were used. The results show that the suggested Improved Recursive Feature Elimination is comparable with standard methods such as Chi-square and Information Gain. Also, the suggested weighted voting is excellent for a small number of classes and better than unweighted voting. The used data set was the well-known English data set "20 newsgroups text", consisting of approximately 18,846 files.


Table of Contents

Chapter One: General Introduction
1.1 Introduction
1.2 Applications of document classification
1.3 Classification Approaches
1.3.1 Rules-based classifiers
1.3.2 Statistical approaches
1.3.3 Hybrid approaches
1.4 Problem Solving
1.5 Related Works
1.6 Thesis Objective
1.7 Thesis Layout

Chapter Two: Theoretical Background
2.1 Introduction
2.2 Techniques of classification
2.2.1 Decision Tree
2.2.2 Support Vector Machine
2.2.3 K-Nearest Neighbors
2.2.4 Naïve Bayes
2.3 Improvements
2.3.1 DT Improvement
2.3.2 SVM Improvement
2.3.3 KNN Improvement
2.3.4 NB Improvement
2.4 Overall the improvements
2.5 Feature reduction methods
2.5.1 Feature Extraction
2.5.2 Feature Selection
2.5.2.1 Threshold methods
2.5.2.2 Information Gain (IG) and Mutual Information (MI)
2.5.2.3 Chi-Square
2.5.2.4 Recursive Feature Elimination (RFE)
2.5.2.5 Voting method
2.6 Evaluation methods
2.7 The used dataset

Chapter Three: The Proposed Methods
3.1 Introduction
3.2 Text Classification Process
3.2.1 Documents Collection
3.2.2 Pre-Processing
3.2.4 Feature Selection
3.2.5 Classification Algorithms
3.2.6 Performance Evaluations
3.2.7 Training phase
3.2.8 Testing phase
3.3 Recursive Feature Elimination (RFE) with its limitations
3.4 Improved Recursive Feature Elimination (IRFE)
3.4.1 Formal definition
3.4.2 Complete example
3.5 Suggested weighted Voting method
3.5.1 Formal definition

Chapter Four: Results
4.1 Introduction
4.2 The adopted programming language
4.3 Implementation and results of classification without FS
4.3.1 Implementation of the classifiers without using FS
4.4 Implementation and results of classification with FS
4.4.1 Chi-square
4.4.2 Information Gain
4.4.3 Threshold
4.4.4 Suggested method
4.4.5 Voting method
4.4.6 Suggested weighted voting
4.4.7 Overall FS methods
4.5 The behavior of the classifiers

Chapter Five: Conclusions and Future Works
5.1 Conclusions and Discussion
5.2 Suggestions for Future Work
References

List of Figures

Fig 1.1 Block diagram of the system
Fig 2.2 An illustration of optimal separation of a single hyper plane and a support vector: a) only the hyper plane H (indicated by a solid blue line); b) a hyper plane and the support vectors
Fig 3.1 Stages of text classification
Fig 3.4 How to eliminate parts in two tests
Fig 4.1 Applying classification using the four classifiers with feature reduction methods for DataSet2 (20% FS + 4 categories)
Fig 4.2 Applying classification using the four classifiers with feature reduction methods for DataSet2 (40% FS + 4 categories)
Fig 4.3 Applying classification using the four classifiers with feature reduction methods for DataSet2 (60% FS + 4 categories)
Fig 4.4 Applying classification using the four classifiers with feature reduction methods for DataSet1 (20% FS + 20 categories)

List of Tables

Table 2.1 The improvements to text classification algorithms
Table 3.1 The errors for all 10 tests
Table 3.2 Mean square errors for each eliminated part (example)
Table 3.3 Rank of each feature (example)
Table 4.1 The results of the four algorithms for DataSet2 & DataSet1 (without FS)
Table 4.2 The results of Chi-square for all the classifiers for DataSet2 & DataSet1 (20% of total features were selected)
Table 4.3 The results of Chi-square for all the classifiers for DataSet2 (40% & 60% of total features were selected)
Table 4.4 The results of IG for all the classifiers for DataSet1 & DataSet2 (20% of total features were selected)
Table 4.5 The results of IG for all the classifiers for DataSet2 (40% & 60% of total features were selected)
Table 4.6 The results of Threshold for all the classifiers for DataSet2 & DataSet1 (20% of total features were selected)
Table 4.7 The results of Threshold for all the classifiers for DataSet2 (40% & 60% of total features were selected)
Table 4.8 The results of IRFE for all the classifiers for DataSet1 (20% of total features were selected)
Table 4.9 The results of IRFE for all the classifiers for DataSet2 (20% of total features were selected)
Table 4.10 The results of IRFE for all the classifiers for DataSet2, P=26, e=2 (40% & 60% of total features were selected)
Table 4.11 The results of Voting for all the classifiers for DataSet2 & DataSet1 (20% of total features were selected)
Table 4.12 The results of Voting for all the classifiers for DataSet2 (40% & 60% of total features were selected)
Table 4.13 The results of weighted Voting for all the classifiers for DataSet2 & DataSet1 (20% of total features were selected)
Table 4.14 The results of weighted Voting for all the classifiers for DataSet2 (40% & 60% of total features were selected)
Table 4.15 The average of each classifier output for the tests

List of Symbols and Abbreviations

CFS: Correlation-based Feature Selection
DT: Decision Tree
FBIS: Foreign Broadcast Information Service
FDT: Fast Decision-Tree induction
FFT: Fast Fourier Transform
IG: Information Gain
IR: Information Retrieval
IRFE: Improved Recursive Feature Elimination
KDD: Knowledge Discovery in Databases
KNN: K-Nearest Neighbors
MI: Mutual Information
ML: Machine Learning
NB: Naïve Bayes
NLP: Natural Language Processing
RFE: Recursive Feature Elimination
SVDD: Support Vector Domain Description
SVM: Support Vector Machine
TC: Text Classification

List of Algorithms

Algorithm 2.1 ID3 decision tree algorithm
Algorithm 3.2 The IRFE algorithm
Algorithm 3.3 The IFE algorithm
Algorithm 3.5 Improved weighted voting method

List of Publications

1. Aliwy, Ahmed H.; Abdul Ameer, Esraa H. "Comparative Study of Five Text Classification Algorithms with their Improvements." International Journal of Applied Engineering Research, vol. 12, no. 14, 2017, pp. 4309-4319.


Chapter One: General Introduction

1.1 Introduction

Classification is one of the most widely used techniques in machine learning. It can be a standalone application, as in Text Classification (TC), or a part of another field, as in data mining and text mining. It is the process of assigning data to predefined groups or classes; in most cases it uses labeled training data to obtain the rules for classifying test data into predetermined groups. Classification consists of two phases: a learning phase, which analyzes the training data and establishes the classification rules, and a classification phase, which uses the generated rules to classify the test data into groups defined by the data attribute values [1].

Text classification is achieved by classifying documents based on their content (or/and topic) into predefined categories [2]. TC is very important; therefore, many methods and algorithms, differing in their efficiency and accuracy, have been used for it. TC can be utilized for document indexing, web browsing, and e-mail filtering. Moreover, it is an important and active area where machine learning and information retrieval (IR) intersect [3]. The common implementation approach for a text classifier is to label documents manually with predefined classes, and then to use a learning algorithm to produce a classifier which allocates classes to future documents based on their contents. This implementation is called supervised learning because the training documents are labeled with predefined classes [4][5].

Many studies have been done on automatic text classification in the last few years. They also cover machine learning approaches such as Support Vector Machine (SVM), Bayesian classifiers, Decision Tree (DT), and K-Nearest Neighbor (KNN) [7]. A block diagram of the system is shown in Figure 1.1.

Fig 1.1 Block diagram of the system: training and test documents pass through preprocessing (text tokenization, stop-word removal, stemming, feature extraction and feature selection, and vector representation) and are then handled by one of the machine learning algorithms (Decision Tree, K-Nearest Neighbor, Support Vector Machine, or Naïve Bayes) to build a classifier that outputs the class.

1.2 Applications of document classification

As mentioned in the introduction, document classification can be applied to many NLP and text mining applications and their fields. A list of some of these applications is presented below [8]:

• Spam filtering, which attempts to discern e-mail spam messages from legitimate e-mails.
• Language identification, which determines the language of a text automatically.
• E-mail routing, to transmit an e-mail that was sent to a general address to a specific address or mailbox based on the topic.
• Readability assessment, which automatically determines the degree of readability of a text, either for reader types, to find suitable materials for different age groups, or as part of a larger text simplification system.
• Genre classification, which determines the genre of a text automatically.
• Sentiment analysis, to determine the attitude of a writer or a speaker with respect to the overall contextual polarity of a document or some topic.

1.3 Classification Approaches

Most of the known classification approaches are of one of three types: rule-based, statistical, or hybrid. These general types of classification are explained briefly in this section.

1.3.1 Rules-based classifiers

A rule-based classifier makes use of a set of IF-THEN rules for classification. Rules can be extracted directly from the data (direct method) or from other classification models (indirect method). In the direct method, or sequential covering algorithm, the rules are extracted for one class at a time, creating rules that cover many examples of a class C and none (or very few) of the other classes. In the indirect method, the rules are extracted from other classification models, for example decision trees or neural networks [9]. Rules-based classifiers are easy to generate, easy to interpret, and very fast in classification [9].

1.3.2 Statistical approaches

Statistical approaches have an explicit probability model, where each class has its own probability. Some human intervention is required [9]. This classification is often done with logistic regression or a similar procedure; the properties of the observations are termed independent variables, and the categories to be predicted are known as outcomes, which are considered to be the possible values of the dependent variable. The independent variables are the features and the outcome is the class [9].

Most of the classification techniques from Machine Learning (ML) are statistical approaches. Machine learning document classification is "the ability of a machine to improve its document classification performance based on previous results of document classification". There are two important factors in ML approaches: a good classification algorithm and a balanced corpus for the learning process. There are other limits in document classification, namely finding the important information in a document and reducing the features. Good classifiers should give more than 80% accuracy [9].

1.3.3 Hybrid approaches

Hybrid approaches rely on merging multiple approaches. In this case the new approach has the characteristics of the merged approaches. In some cases, a hybrid is a combination of classifiers with different approaches, like combining stacking and voting [9].

1.4 Problem Solving

Text mining has many limits compared to data mining. One of the most important is the high number of features; therefore, feature selection is studied widely in text mining. The main problem of this work is related to improving document classification, with a focus on the problem of feature selection for text classification.

1.5 Related Works

In recent years, the progress of web and social network technologies has led to a massive interest in the classification of text documents. Many studies on classification algorithms have been done, and almost all of them try to achieve high performance on the classification problem. In this section, a review of these works is given to show their focus points. The aim of this thesis is to improve classification by feature selection on a well-known English data set, "20 newsgroups text", which consists of approximately 18,846 files; therefore, the related works are presented with this aim in mind.

Ko, Park & Seo (2004) [10] introduced text summarization techniques to measure the importance of sentences, representing a document as a vector of features with different weights according to the importance of each sentence. They used two newsgroup data sets, one written in English and the other in Korean, and four kinds of classifiers: Naive Bayes, Rocchio, k-NN, and SVM.

Schneider (2005) [11] described some simple modifications of the Naive Bayes text classifier that address its problems: a simple transformation that effectively removes duplicate words in a document to account for burstiness phenomena in text; an information-theoretic framework to obtain more reliable confidence scores; and viewing a training corpus as a clustering of the training documents, with feature selection as a way to improve that clustering. The data set used was 20 Newsgroups.

Novovičová, Malík & Pudil (2004) [12] presented sequential forward selection methods based on improved mutual information. These methods significantly outperform the standard best-individual-features method based on information gain on the testing data set. They presented experimental results using a naive Bayes classifier based on the multinomial model on the Reuters data set.

Taheri, Mammadov & Bagirov (2011) [13] used conditional probabilities to find dependencies between features and applied them to the Naïve Bayes classifier. They offered results of numerical experiments on 10 data sets obtained from the UCI machine learning repository and LIBSVM, and showed that the proposed algorithm significantly improves the performance of the Naive Bayes classifier.

Yuan (2010) [14] optimized Naïve Bayes text classification by calculating the posterior probability and reducing the dimension of the feature words of the text. The results of the specified experiments show that the enhanced method has higher efficiency than the original algorithm. The used data set was "the Starter Edition text classification data made by Sogou laboratory, which has 17910 documents of 9 categories".

Harrag, El-Qawasmeh & Pichappan (2009) [15] used a decision tree algorithm for classifying Arabic text documents. They suggested a hybrid technique of document frequency thresholding with an embedded information gain criterion as the preferable feature selection criterion. They used two different corpora of 8 categories and 14 categories respectively, and obtained an accuracy of 0.93 for the scientific corpus and 0.91 for the literary corpus.

Yong, Youwen & Shixiong (2009) [16] improved the KNN text classification algorithm by: (i) compressing the given training sets and deleting the samples near the border, (ii) clustering the training sample sets of each category using the k-means clustering algorithm, (iii) introducing a weight value which reflects the significance of each training sample according to the number of samples in the cluster that contains this cluster center, and (iv) finally, using the modified samples to accomplish KNN text classification. They used a training corpus of 19637 documents with 20 categories.

Lewis & Ringuette (1994) [17] reported empirical results on the performance of a decision tree learning algorithm and a Bayesian classifier on two text categorization data sets (a set of Reuter's newswire stories of 21,450 documents and a U.S. Foreign Broadcast Information Service (FBIS) collection of 1,500 documents). Documents used a set of 8,876 binary features corresponding to English words occurring in 2 or more training documents, and the features were ranked for each category using the information gain measure. The performance of the algorithms was reasonable, and they showed that feature selection in the decision tree algorithm was particularly effective in dealing with the large feature sets common in text categorization.

1.6 Thesis Objective

The main objective of this work, in design and implementation, is to improve classification using feature selection. The proposed methods can overcome the large number of words (hence the large number of features) by taking advantage of an enhanced concept; they produce information with less prediction error and/or less execution time. In addition, this work uses multiple feature selectors and combines them to obtain high performance. An improvement to this combination is introduced and hence a new approach is discussed.

1.7 Thesis Layout

The rest of the thesis is organized as follows:

Chapter Two presents a wide explanation of the main four algorithms for text classification and a comprehensive study of almost all the amendments which were done on them. It also presents the feature selection methods which are used later in this work.

Chapter Three explains the main steps of the proposed techniques for improving text classification by using feature selection and its properties.

Chapter Four illustrates the implementation of the methods used in improving classification by feature selection on a well-known English data set, the 20 newsgroups text data set, and the experimental results of this implementation.

Chapter Five presents the conclusions of this work and suggestions for future work.

Chapter Two: Theoretical Background

2.1 Introduction

The main concepts of this thesis are related to three categories: the text mining techniques for classification and their improvements, the feature selection methods, and the evaluation methods, in addition to the used dataset. These concepts are explained in five parts in this chapter. The first part contains four well-known and well-evaluated classifiers, since there are many algorithms used in document classification; these four classifiers are Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB). The second part contains an explanation of almost all the improvements done on these algorithms and the types of these improvements. The third part contains an introduction to feature extraction and selection methods (feature reduction), where five feature selection methods are explained. The fourth part describes how to evaluate classifier performance, and the fifth part describes the used dataset. All these concepts represent a theoretical base for the next chapters.

2.2 Techniques of classification

There are many algorithms for document classification. As mentioned previously, four algorithms are used in this work: Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB).

2.2.1 Decision Tree

When a decision tree is used for text classification, it consists of a tree where each internal node is labeled by a term, branches represent weights, and each leaf represents a class. The tree classifies a document by running through the query structure from the root until it reaches a certain leaf, which represents the goal for the classification of the document. When most of the training data does not fit in memory, decision tree construction becomes inefficient [18].

Decision trees (DT) are widely utilized inductive learning methods. They are learned from labeled training documents. ID3 is one of the most well-known decision tree learning algorithms, and it has extensions such as C4.5 and C5. DT has advantages and drawbacks:
1) Advantages: decision trees are capable of learning disjunctive expressions, and their robustness to noisy data seems convenient for document classification.
2) Disadvantages: decision tree learning algorithms cannot guarantee returning the globally optimal decision tree [19].

Data comes in records of the form [20]:

(X, Y) = (x_1, x_2, x_3, \dots, x_k, Y)          (2.1)

The dependent variable Y is the target variable. The vector X is composed of the input variables x_1, x_2, x_3, etc. that are used for the task.

The decision tree can be linearized into decision rules [21], where the result is the content of the leaf node and the conditions along the path form the rule's conditions. In general, the rules have the form "If condition1 and condition2 and condition3 then outcome". A decision rule that predicts the value of a discrete decision attribute is called a classification rule, and a set of decision rules which predict the same decision attribute is called a rule group [22].

The most common approach for inducing a decision tree is to divide the labeled examples recursively until a stopping criterion is met. The division is defined by specifying a test that splits the examples into separate subsets, passing each example to the corresponding branch and treating each part as a sub-problem, for which a sub-tree is recursively constructed. The common stopping criterion for a subset of examples is that they all have the same class. The ID3 algorithm [15] is given in Algorithm 2.1.

Algorithm 2.1: ID3 decision tree algorithm
Input: the training samples
Output: a decision tree
Step 1: Method:
• The tree is constructed in a top-down recursive divide-and-conquer manner.
• At the start, all the training examples are at the root.
• Attributes are categorical (if continuous-valued, they are discretized in advance).
• Examples are partitioned recursively based on selected attributes.
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g. information gain).
Step 2: Conditions for stopping partitioning:
• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
• There are no samples left.

2.2.2 Support Vector Machine

The Support Vector Machine, which was proposed by Vapnik, provides "a maximal margin separating hyper plane" between two classes of data and has non-linear extensions [23]. It is a supervised classification algorithm which has recently been used successfully for many tasks of Natural Language Processing (NLP) such as text classification [24][25].

The SVM algorithm represents a text document as a vector whose dimension is the number of distinct keywords. If the document collection is large, then the dimensionality of the hyperspace in text classification is enormous, which causes a high computational cost; feature extraction and selection can be used to reduce the dimensionality [26].

SVM is a form of linear classifier. Linear classifiers, in the context of text documents, are models that make a classification decision based on the value of a linear combination of the document features: the output is y = w · x + b, where x = (x_1, x_2, ..., x_n) is the normalized document word frequency vector, w = (w_1, w_2, ..., w_n) is a vector of coefficients, and b is a scalar. The output can be interpreted as a separating hyper plane between different classes, as in Figure 2.2 [27].

Fig 2.2 An illustration of optimal separation of a single hyper plane and a support vector: a) only the hyper plane H (indicated by a solid blue line); b) a hyper plane and the support vectors [27].

Among the advantages of SVM, different kernel functions can be specified for the decision function, and it is quite robust to high dimensionality; that is, learning is almost independent of the dimensionality of the feature space. SVM methods have been widely used in many application domains such as pattern recognition, face detection, and spam filtering [28].

2.2.3 K-Nearest Neighbors

K-Nearest Neighbors (KNN) is known as a simple and effective classifier for text categorization. The KNN classifier has three defects: the complexity of computing its sample similarity is huge, its performance is easily affected by single training samples, and KNN does not build a classification model since it is a lazy learning method. The complexity of KNN can be reduced in three ways: reducing the dimension of the text vectors, reducing the number of training samples, and speeding up the process of finding the K nearest neighbors [29].

KNN classifies a document by calculating the distance between the document and all documents in the training set using a distance or similarity measure, then finding the nearest K neighbors among all training documents and assigning the document to the category which includes the largest number of documents in the k-nearest-neighbor set [30]. A popular distance measure such as the Euclidean distance is used to determine which of the K instances in the training dataset are most similar to a new input. The Euclidean distance is estimated using the following formula [31]:

d(x, y) = \sqrt{\sum_{i=1}^{n} (w_{ix} - w_{iy})^2}          (2.2)

where x and y are the documents between which the distance is calculated, and w_{ix} and w_{iy} are the weights of word i in document x and document y respectively.
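To make the classification rule concrete, the following is a minimal sketch of the nearest-neighbor vote over weighted document vectors using the Euclidean distance of equation (2.2). It is an illustration in Python with NumPy (an assumed toolset, not necessarily the thesis implementation), and knn_predict is a hypothetical helper name.

from collections import Counter
import numpy as np

def knn_predict(train_vectors, train_labels, query_vector, k=5):
    """Assign the label of the majority of the k nearest training documents."""
    # Euclidean distance between the query document and every training document
    distances = np.sqrt(((train_vectors - query_vector) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]            # indices of the k closest documents
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]              # majority class among the neighbours

In practice, libraries such as scikit-learn's KNeighborsClassifier provide an equivalent, optimized implementation of the same rule.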

2.2.4 Naïve Bayes

The Naïve Bayes classifier belongs to a family of simple probabilistic classifiers based on the common assumption that all the features are independent of each other, given the category variable [32]. Naive Bayes is fast and easy to implement, so it serves as a baseline in text classification [33]. Naïve Bayes is effective enough to classify text in many domains, although it is less accurate than discriminative methods such as SVM [34].

Naive Bayes models the distribution of the documents in each class using a probabilistic model with independence assumptions about the distributions of the different terms. It has been a very prevalent method in the text classification area, where the binary independence classifier, which uses binary-valued vector representations of documents, is one of the best-known approaches to Naive Bayes classification [35]. The NB classifier mostly utilizes Bayes' rule [45]:

P(c_i \mid d) = \frac{P(c_i)\, P(d \mid c_i)}{P(d)}          (2.3)

C_{map} = \underset{c_i \in C}{\arg\max}\ \frac{P(c_i)\, P(d \mid c_i)}{P(d)} = \underset{c_i \in C}{\arg\max}\ P(c_i)\, P(d \mid c_i)          (2.4)

where P(c_i | d) is the probability of class i given a document d, and P(c_i) is the prior probability of class i, which is calculated by:

P(c_i) = \frac{N_i}{N}          (2.5)

where N_i is the number of documents in class i, N is the number of documents in all classes, P(d | c_i) is the probability of a document d given a class i, and P(d) is the probability of document d.

If an attribute d_i is continuous-valued, a bit more work is needed, but the calculation is straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with mean µ and standard deviation σ, defined by [36]:

G(d, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d-\mu)^2}{2\sigma^2}}          (2.6)

\mu = \frac{1}{N} \sum_{i=1}^{N} d_i          (2.7)

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (d_i - \mu)^2          (2.8)
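The following is a minimal sketch of multinomial Naïve Bayes training and classification corresponding to equations (2.4) and (2.5). The Laplace smoothing of the word likelihoods and the function names train_nb and classify_nb are illustrative assumptions, not prescriptions of the text.

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class names.
    Returns the priors P(c_i) (equation 2.5) and Laplace-smoothed word likelihoods P(w|c_i)."""
    n = len(docs)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    vocab = {w for counts in word_counts.values() for w in counts}
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
            for w in vocab}
        for c in priors
    }
    return priors, likelihoods, vocab

def classify_nb(doc, priors, likelihoods, vocab):
    """c_map = argmax_c P(c) * prod_w P(w|c), computed in log space (equation 2.4)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in vocab:
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)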

2.3 Improvements

Many improvements have been made to the four used algorithms: DT, NB, KNN, and SVM. In this section, many of these improvements are explained for each algorithm.

2.3.1 DT Improvement

Many improvements have been made to the DT algorithm itself, to its learning, and to feature reduction. These improvements can be modifications of or additions to the algorithm itself, or extraction/selection (reduction) of the features. A few of these improvements to DT are introduced in this section.

FDT (Fast Decision-Tree induction) was introduced to reduce the cost for large-scale, multi-label data. FDT can be implemented using a two-part technique: (1) feature-set pre-selection and (2) induction of several trees [48]. This can also be combined with converting a decision tree to a rule set [49]. DT can be improved by feature selection when the feature set is very large [17], and many feature selection techniques can be combined to improve feature selection with DT [15]. The DT algorithm can be improved by using different learning techniques [50]; also, L'Hospital's rule can be introduced to improve the efficiency of the decision-making algorithm [51]. C5.0 can be introduced to reduce the error ratio of the decision tree by implementing feature selection, cross validation, and model-complexity control to solve the overfitting problem [52]; it was also utilized as the base classifier to classify with low memory usage and high accuracy [53]. DT can be improved by utilizing L'Hospital's rule [54], and the SVM-DT [55] was constructed to improve the efficiency and performance of multi-class classification.

2.3.2 SVM Improvement

Many improvements and modifications have been made to SVM, which increased its efficiency and hence its accuracy. Some of these improvements are selected and explained in this section. SVM was improved by using different parameters to increase the performance [56]. Two modifications, one-class SVMs and robust one-class SVMs for outliers, make one-class SVMs more suitable for unsupervised anomaly detection [57]. An enhanced SVM with a weighted kernel function based on features of the training data was proposed, where rough set theory was used to implement a feature ranking to obtain a better evaluation [58], and the Support Vector Machine was applied to the task of classifying multilayered text [59].

2.3.3 KNN Improvement

Many improvements have been made to the KNN algorithm; some of them are explained in this section.

The k-means clustering algorithm can be used to improve the KNN text classification algorithm [16]; also, using only the part of the training set that satisfies a relevance condition to classify a new instance [60] improves classification precision and speeds up classification. A two-phase hierarchical text classification algorithm, called EKNN, can be used to improve the KNN algorithm [61]. KNN classifiers have utilized varying similarity measures [62] to obtain the optimal parameters for predicting the categories. Also, N-grams (unigrams and bigrams) in document indexing, alongside the traditional single-term indexing method, can be introduced to improve the precision of the KNN classifier [63].

2.3.4 NB Improvement

Many improvements have been made to the NB classifier. Some of them are modifications of the probability calculations, feature reduction, and a few other characteristics; some of these improvements are shown in this section. The correlation-based feature selection (CFS) algorithm was utilized as preprocessing for the Naive Bayes classifier to improve its performance [64]. Also, conditional probabilities can be applied to the Naive Bayes classifier [13] and used to find dependencies between features [65] in order to improve the performance of Naive Bayes algorithms. NB can be improved by feature selection, using a simple transformation among other simple modifications of the Naive Bayes text classifier [66]. A mutual-information-weighted naive Bayes text classifier was proposed to increase the effect of highly informative words and thereby improve the performance of the naive Bayes text classifier [67].

Several smoothing methods were used to estimate the parameters of the naive Bayes text classifier in order to achieve better and more stable performance [68]. Naïve Bayes text classification was optimized by calculating the posterior probability and reducing the dimension of the feature words of the text, giving an enhanced method with higher efficiency [14].

2.4 Overall the improvements

As was shown, there are many improvements to the well-known algorithms for text classification. A summary of all the improvements presented in this work is shown in Table 2.1. The improvements are divided into two main classes: algorithm and features. The improvements in the algorithm can be modifications/additions to the algorithm or to the learning. The improvements on the features can be in the extraction or/and selection.

1- Algorithm improvements:
• Learning.
• Learning and the algorithm itself.
• The algorithm itself (only).

2- Improvements by features:
• Feature extraction and selection.
• Feature selection.
• Feature extraction.

From Table 2.1, we can see that each researcher has their own dataset for testing the improvement, which makes the comparison more difficult.

20

Chapter Two

Theoretical Background

Table (2.1) the improvements to text classification algorithms

Improvement Data Set algorithm

researcher

algorithm learning

Vateekul & Kubat [48].

modification

features addition

Extraction

Selection





















Different data subset.

Johnson, Oles, Zhang & Goetz [49].

The Reuters-21578 collection of categorized newswires

Two data sets: 1- Set of 21,450 Reuter's newswire Lewis & Ringuette [17].











stories. 2- 1,500 documents from the U.S. Foreign Broadcast Information Service (FBIS) two different corpora; 1- Arabic texts from Arabian

DT Harrag, El-Qawasmeh & Pichappan [15].

scientific encyclopedia of 373











documents from 8 categories. 2- Set of prophetic traditions or "Hadiths' collected from the Prophetic encyclopedia.

Badgujar & Sawant [50].











Data sets from UCI machine learning repository.

Multiple dataset: Galathiya, Ganatra & Bhensdadia [51].











Zoo dataset, Ionosphere, Contactlenses, Au1_1000, Breast Cancer, iris, Annealing and Weather nominal dataset.

21

Chapter Two Galathiya, Ganatra & Bhensdadia [52].

Pandya & Pandya [53].

Theoretical Background











Used RGUI with weka packages.











Used weka packages.

Used large amount of data Agrawal & Gupta [54].











collection, data mining tool WEKA was used.

Xu & Wang [55].



















Used Reuters-21578 collection.

 Ageev& Dobrov [56].

Amer, Goldstein & Abdennadher [57]. SVM

collection as FRF-10372 consists of 10372 documents

Datasets from the UCI machine











learning repository, ionosphere, shuttle and satellite and the breast-cancer dataset.

Yao, Zhao & Fan [58].

Rennie & Rifkin [59].











The KDD dataset and the UNM











Use two well-known data sets, 20

dataset.

Newsgroups and Industry Sector. Data from Chinese natural language processing group in

Yong, Youwen & Shixiong [16].











Department of Computer Information and Technology in Fudan University of 19637 documents. Used two different data sets:

KNN Barigou [61].











1- Reuters-21578 data set that 2- 20 Newsgroups data set that

Han, Liu, Shen & Miao [62].







22





Used Dmoz, Wikipedia small, and Wikipedia Large dataset.

Chapter Two Peterson, Doom & Raymer [63].

Theoretical Background











Al-Shalabi & Obeidat [64].

Used biological or medical data with four UCI datasets Private corpus collected from











online Arabic newspapers archives including Al-Jazera, AlNahar, Al-Hayat, and Al-Dostor, Used sample dataset ("Eucalyptus

ManeeshSinghal & RamashankarSharma [65].











Soil Conservation ") from the TunedIT repository of "machine learning databases".

Taheri, Mammadov& Bagirov [13].

Petre [66].

NB Schneider [67].

Kim, Rim, Yook & Lim [68].

He & Ding [69].











10 data sets from UCI machine











Used the dataset concerns from











learning repository and LIBSVM.

UCI Machine Learning Repository. Used four datasets: 20Newsgroups, WebKB, Ling-Spam and Reuters-21578".











Used the Reuters21578 and 20











Extracted 3,894,900 questions

Newsgroups collections.

from Yahoo! Webscope dataset. 17910 documents from the Starter

Yuan [14].











Edition text classification data by Sogou laboratory.

2.5 Feature reduction methods

The examples or documents in text classification tasks contain thousands of symbols; therefore, the classification problem is very difficult for many classifiers, and dimensionality reduction should be performed. It is a model in text mining that converts the data representation to a shorter, more compact, and more predictive space [37]. The new space is easier to deal with because of its size.

The purpose of selecting features is to obtain helpful patterns in text documents. Feature reduction converts the original features to new features by applying some conversion function. This new set of features has fewer features or dimensions than the original collection, which gives better results. The two main categories of reduction techniques are described below [43].

2.5.1 Feature Extraction

Feature extraction transforms the original feature space into a more compact space. All original features are converted to a lower-dimensional space without omitting them; the original features are substituted by a smaller representative group. This is done when the number of features in the input data is too large to be processed, so the input data is converted to a reduced representative group of features [37]. Feature extraction includes feature construction, space dimensionality reduction, feature selection, and sparse representations; all these techniques serve as preprocessing for machine learning. A number of new applications with very large input spaces critically need space dimensionality reduction. These applications include bioinformatics (e.g. DNA microarrays, mass-spectrometric data), combinatorial chemistry (e.g. high-throughput screening of drug candidates), text processing (e.g. spam filtering), decision making (e.g. oil drilling), and pattern recognition (e.g. handwriting recognition, speech processing, and vision) [47].

2.5.2 Feature Selection

Feature selection specifies a subset of the original features, and only these specific features are used for training and testing the classifiers. Features that have been removed are not used any more [38].

Feature selection techniques are classified into two basic categories: filtering techniques and wrapper techniques. Filtering techniques are independent of the learning algorithm: they use statistical methods for feature selection and have low complexity. Wrapper methods use the learning algorithm itself as the evaluation function; they have higher time complexity and accuracy than filter methods [10]. With the increasing number of features in text classification, wrapper methods often cannot be used because of their high complexity. Some filtering methods used in text classification are threshold methods, Information Gain (IG), Mutual Information (MI), Chi-square, etc. Some wrapper methods are Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), and neural networks [10].

In text classification, a text document may partially match many categories, so the best matching category for the text document needs to be found. The term (word) frequency / inverse document frequency (TF-IDF) approach is used to weight each word in the text document according to how unique it is; that is, the TF-IDF approach captures the relevancy among words, text documents, and particular categories [39]. After selecting feature subsets, all documents are represented by the feature vector with the normalized TF-IDF weighting function [40]. The formula for TF-IDF is given as:

TFIDF(t_i, d_j) = \frac{tf(t_i, d_j)\, \log\frac{n}{num(t_i)}}{\sqrt{\sum_{p=1}^{m} \left( tf(t_p, d_j)\, \log\frac{n}{num(t_p)} \right)^2}}          (2.9)

where tf(t_i, d_j) denotes the number of times t_i occurs in the document d_j, num(t_i) denotes the number of documents in which t_i occurs at least once, n is the total number of documents, and m is the size of the feature subset. In the next section, the five feature selection filtering methods which are used in our work are explained briefly.
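Before turning to them, the weighting of equation (2.9) is illustrated below as a short Python sketch that computes the length-normalized TF-IDF weights of one document. The helper name tfidf_vector and its arguments are illustrative; library implementations such as scikit-learn's TfidfVectorizer use a slightly different but analogous weighting.

import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs, feature_subset):
    """Length-normalised TF-IDF weights of equation (2.9) for one document.
    doc_freq[t] = num(t), the number of training documents containing term t."""
    tf = Counter(doc_tokens)
    raw = {t: tf[t] * math.log(n_docs / doc_freq[t])
           for t in feature_subset if doc_freq.get(t, 0) > 0}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0   # avoid division by zero
    return {t: w / norm for t, w in raw.items()}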

2.5.2.1 Threshold methods

Threshold methods are based on deleting the features whose weights/frequencies are greater or less than a specified threshold value. These methods are currently the most popular because they are fast and efficient. On the other hand, they fully disregard the presence of other features and evaluate each feature on its own [41].
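A minimal sketch of a document-frequency threshold filter of this kind is given below; the cut-off values min_df and max_df_ratio are illustrative assumptions, not values taken from the thesis.

from collections import Counter

def threshold_select(docs_tokens, min_df=3, max_df_ratio=0.5):
    """Keep terms whose document frequency lies between the two thresholds."""
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))   # document frequency per term
    return {t for t, f in df.items() if f >= min_df and f / n <= max_df_ratio}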

2.5.2.2 Information Gain (IG) and Mutual Information (MI)

Information gain (IG) is a popular feature selection method in text classification that measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. The IG feature selection method chooses the terms which have the highest information gain scores [42]. The information gain of the term t is defined as [45]:

IG(t) = -\sum_{i=1}^{|C|} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{|C|} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{|C|} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})          (2.6)

where:
P(c_i) represents the probability of occurrence of class c_i,
P(t) represents the probability of occurrence of t,
P(\bar{t}) represents the probability of non-occurrence of t.

Mutual information (MI) is derived from information theory and provides a formal way to model the mutual information between the features and the classes. The pointwise mutual information between the feature t_j and the category c_i is defined on the basis of the level of co-occurrence between the category c_i and the feature t_j; it is a standard measure commonly utilized in statistical language modeling of word associations and related applications [43]. MI measures the mutual dependence of a feature t_j and a category c_i. The formulation for MI is given as [33]:

MI(C; t_j) = \sum_{i=1}^{K} p(c_i, t_j) \log \frac{p(c_i \mid t_j)}{p(c_i)}          (2.7)

where K is the number of categories, c_i represents the i-th category, and t_j represents the j-th feature.
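The sketch below computes the MI score of equation (2.7) for a single term over a labeled document collection; it is a plain-Python illustration with a hypothetical function name, and library filters such as scikit-learn's mutual_info_classif serve the same purpose.

import math

def mutual_information(term, docs_tokens, labels):
    """Pointwise MI of equation (2.7) between the presence of `term` and the classes."""
    n = len(docs_tokens)
    has_term = [term in set(doc) for doc in docs_tokens]
    p_t = sum(has_term) / n
    mi = 0.0
    for c in set(labels):
        p_c = labels.count(c) / n
        p_ct = sum(1 for h, y in zip(has_term, labels) if h and y == c) / n  # p(c_i, t_j)
        if p_ct > 0 and p_t > 0:
            mi += p_ct * math.log((p_ct / p_t) / p_c)   # p(c_i|t_j) = p(c_i,t_j) / p(t_j)
        # pairs with p(c_i, t_j) = 0 contribute nothing
    return mi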

2.5.2.3 Chi-Square

Chi-square (χ²) [44] is a popular feature selection method which evaluates features individually by calculating the χ² statistic with respect to the classes. In other words, chi-square analyzes the dependency between the term and the class: a score near zero indicates independence, and a higher score indicates dependence, so a term with a higher chi-square score is more informative. The formulation for CHI is given as [35]:

\chi^2(c, t) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}          (2.8)

where A is the number of documents containing both t and c, B is the number containing t without c, C is the number containing c without t, D is the number containing neither c nor t, and N is the total number of documents.
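Equation (2.8) translates directly into code; the sketch below evaluates one (term, class) pair from the four counts. In practice, the per-class scores are combined (for example by maximum or average over the classes) and the top-scoring fraction of terms is kept; scikit-learn's chi2 with SelectKBest provides an equivalent filter.

def chi_square(A, B, C, D):
    """Chi-square statistic of equation (2.8) for one (term, class) pair.
    A: docs with t and c, B: t without c, C: c without t, D: neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom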

2.5.2.4 Recursive Feature Elimination (RFE)

The RFE selection method is a recursive process which ranks features according to their importance: in each iteration, feature importances are measured and the least relevant feature is removed. The recursion is needed because, for some measures, the relative importance of each feature can change when it is evaluated over a different subset of features during the elimination process. The inverse of the order in which features are eliminated is used to construct the final ranking, and the feature selection process takes only the first n features from this ranking [46]. It is a very slow method: the feature elimination takes a long time, so it is usually avoided in text mining. Therefore, a faster version with the same methodology is suggested in the next chapter.

2.5.2.5 Voting method

The notion behind the voting classifier is to combine different machine learning concepts and use a majority vote. Majority voting (MV) is the simplest possible rule: if two or three classifiers agree on a class for a test document, the result of the voting classifier is that class. In weighted majority voting, a weight is specified for each classifier; in this weighting method, the error of each classifier is calculated [18]. The same idea can benefit feature selection. Voting is divided into two types: weighted and unweighted. Unweighted voting means that if we have a number of selectors where each one gives a rating, then the features which have the highest vote are taken. A weighted vote means that if we have a number of selectors which differ in their accuracy according to previous tests, then each selector takes a weight according to its accuracy.
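To illustrate the difference, the sketch below combines the rankings of several feature selectors with a simple position-based score: setting all weights to 1 gives an unweighted vote, while accuracy-derived weights give a weighted vote. The scoring scheme and names here are illustrative only; the thesis's own weighted-voting formulation is defined in Chapter Three.

def weighted_vote_select(rankings, weights, n_keep):
    """rankings: dict selector_name -> list of features ordered best-first.
    weights:  dict selector_name -> weight (1.0 everywhere reproduces the plain vote).
    Returns the n_keep features with the highest combined score."""
    scores = {}
    for name, ranked in rankings.items():
        m = len(ranked)
        for pos, feat in enumerate(ranked):
            # higher-ranked features receive a larger share of the selector's weight
            scores[feat] = scores.get(feat, 0.0) + weights[name] * (m - pos) / m
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]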

2.6 Evaluation methods

To evaluate the utility of the various feature selection methods, the F-measure can be used: a measure that combines precision and recall. In a classification problem, the system can give four states for a class c:
• documents returned for class c that belong to c (true positive),
• documents returned for class c that do not belong to c (false positive),
• documents of class c that are not returned (false negative),
• documents of other classes that are not returned (true negative).

                     Correct class
Hypothesis        +ve        -ve
+ve               Tp         Fp
-ve               Fn         Tn

Precision is defined as the ratio of correct categorizations of documents into categories to the total number of attempted classifications [31]:

Precision = \frac{true\ positives}{true\ positives + false\ positives} = \frac{Tp}{Tp + Fp}          (2.10)

Recall is defined as the ratio of correct classifications of documents into categories to the total number of labeled data in the testing set [40]:

Recall = \frac{true\ positives}{true\ positives + false\ negatives} = \frac{Tp}{Tp + Fn}          (2.11)

The F1-measure is defined as the harmonic mean of precision and recall. Hence, a good classifier is assumed to have a high F1-measure, which indicates that the classifier performs well with respect to both precision and recall [40]:

F1\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}          (2.15)

Accuracy = \frac{Tp + Tn}{Tp + Fp + Tn + Fn}          (2.16)

Error = 1 - Accuracy          (2.17)
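The sketch below computes these measures from the four confusion counts; the counts in the usage line are illustrative numbers only.

def evaluate(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from the confusion counts (equations 2.10-2.16)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

print(evaluate(tp=80, fp=20, fn=10, tn=90))   # illustrative counts only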

2.7 The used dataset

A well-known English data set, the 20 newsgroups text data set, is used for testing the proposed approaches. It involves about 18,846 newsgroup posts on 20 topics, divided into two subsets: one for training or development and the other for testing or performance evaluation. The training set consists of 11,314 documents and the testing set consists of 7,532 documents, where the total number of features is 101,322 without reduction. In the implementation, the whole data set (DataSet1) was used in all tests, and then a part of the categories (DataSet2) was used for the same tests, in order to study classification for high and low numbers of classes. The partial data set consists of four classes, where the training set contains 2,034 documents, the testing set contains 1,353 documents, and the total number of features is 26,576 without reduction.
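The data set is publicly available; one way to load it is scikit-learn's fetch_20newsgroups, shown below as a hedged sketch. The four category names used for the smaller subset are an assumption for illustration; the thesis does not list which four categories form DataSet2.

from sklearn.datasets import fetch_20newsgroups

# Full data set (DataSet1, 20 classes)
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))        # roughly 11314 and 7532 documents

# A four-class subset comparable to DataSet2 (category names are illustrative)
cats = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
small_train = fetch_20newsgroups(subset="train", categories=cats)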

Chapter Three: The Proposed Methods

3.1 Introduction

Text documents have a large number of features in many applications, such as automatic text categorization, where the number of features ranges from hundreds to tens of thousands. High dimensionality raises big difficulties in understanding the data, building classifier systems, and machine learning; therefore, feature selection takes place. The goal of feature selection is to find a subset of the original feature set by removing irrelevant and/or redundant features.

As was shown in the previous chapter, improvements to classification can be made in the algorithm and in the features. Algorithm improvements can be in the learning or in modification of the algorithm itself. Feature improvements can be made in feature reduction (i.e. extraction and selection). Improvements of classification through feature reduction are more effective, since they speed up the classification process and neglect the least informative features. This is also easier than modifying a given algorithm or inventing a new one, and it is necessary for any classification algorithm. The improvement in feature selection is chosen in this work because improvements on the algorithms are done in a different way for each algorithm and there are no standard rules for modification, since each algorithm has its own specifications.

In this chapter, a suggested improvement of feature selection is introduced at two levels. The first level is a new approach for feature selection, which can be seen as a general formulation of the RFE explained in the previous chapter. The second level combines this method with other methods using a new weighted voting approach.


3.2 Text Classification Process
The stages of TC are discussed in the following points and illustrated in Fig. 3.1.

Fig. 3.1 Stages of Text Classification: Documents → Pre-Processing (Tokenization, Stop Words Removal, Word Stemming) → Feature Selection → Classification Algorithms → Performance Measure.
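As an informal illustration only, the stages of Fig. 3.1 can be chained programmatically; the sketch below assumes scikit-learn and picks one arbitrary combination of components (bag-of-words features, Chi-square selection, Naive Bayes classification):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# One possible instantiation of Fig. 3.1; the components and k are assumptions for illustration.
text_classification = Pipeline([
    ('preprocess', CountVectorizer(stop_words='english')),  # tokenization + stop-word removal
    ('select', SelectKBest(chi2, k=5000)),                  # feature selection
    ('classify', MultinomialNB()),                          # classification algorithm
])
```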


3.2.1 Documents Collection
This is the first step of the classification process, in which documents of different types (formats), such as web content, are collected.

3.2.2 Pre-Processing
Pre-processing is used to present the text documents in a clear word format; it prepares the documents for the next steps of text classification, in which each document is represented by a number of features. The steps commonly taken, sketched in the example below, are:
• Tokenization: a document is treated as a string and then partitioned into a list of tokens.
• Removing stop words: stop words such as "the", "a", "and", etc. occur frequently, so these insignificant words need to be removed.
• Word stemming: stemming converts different word forms into a similar canonical form. This step conflates tokens to their root form, e.g. "connection" to "connect", "computing" to "compute".
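A minimal sketch of these three steps, assuming the NLTK toolkit is available (the thesis does not prescribe a particular library), could look as follows:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer model
nltk.download('stopwords')  # English stop-word list

def preprocess(document):
    # Tokenization: treat the document as a string and split it into tokens.
    tokens = word_tokenize(document.lower())
    # Stop-word removal: drop frequent, insignificant words such as "the", "a", "and".
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stemming: conflate each token to its root form (e.g. "computing" -> "comput").
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Computing a connection between the documents"))
```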

3.2.4 Feature Selection
After pre-processing, which is considered an important step of text classification, feature selection is applied to construct the vector space; it improves the efficiency and accuracy of a text classifier. Feature Selection (FS) selects a subset of features from the original documents. FS is performed by keeping the words with the highest scores according to a predetermined measure of word importance. The major problem of text classification is the high dimensionality of the feature space. Many feature evaluation metrics are widely used, among which are information gain (IG), Chi-square, and mutual information, which were explained in Chapter Two.
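As an illustration of score-based selection, the sketch below keeps the k features with the highest Chi-square scores using scikit-learn; the library choice and the value of k are assumptions made only for the example.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

train = fetch_20newsgroups(subset='train')
X = CountVectorizer().fit_transform(train.data)   # bag-of-words term counts
y = train.target

k = 5000                                          # illustrative target number of features
selector = SelectKBest(score_func=chi2, k=k)
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], '->', X_reduced.shape[1])       # feature count before and after selection
```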


3.2.5 Classification Algorithms
Automatic classification assigns documents to predefined categories. Documents can be classified in three ways: unsupervised, supervised, and semi-supervised methods. In the last few years, the task of text classification has been extensively studied and rapid progress has been seen in this area, including machine learning approaches such as the Naive Bayes classifier, Decision Tree, K-nearest neighbor, and Support Vector Machines; these techniques were also explained in Chapter Two.
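A minimal end-to-end sketch, assuming TF-IDF features and a multinomial Naive Bayes classifier (one possible choice among the algorithms listed above), shows the training and testing phases described in the following subsections:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)   # vocabulary is learned on the training set only
X_test = vectorizer.transform(test.data)

# Training (learning) phase: fit the classifier on the labeled training documents.
clf = MultinomialNB()
clf.fit(X_train, train.target)

# Testing (classification) phase: predict labels for the held-out documents.
predicted = clf.predict(X_test)
```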

3.2.6 Performance Evaluations
This is the last stage of text classification. The evaluation of text classifiers is typically conducted experimentally rather than analytically. The experimental evaluation of a classifier usually tries to assess its effectiveness, i.e. its capability of taking the right categorization decisions. An important issue in text categorization is how to measure the performance of the classifiers. Many measures are used, such as Precision, Recall, F-score, Error, and Accuracy, which were described in Chapter Two.
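Continuing the classifier sketch above (it reuses the `test` and `predicted` variables from that example), these measures could be computed with scikit-learn as follows; macro averaging over the categories is an assumed choice, since the formulas in Chapter Two are stated per category:

```python
from sklearn import metrics

accuracy = metrics.accuracy_score(test.target, predicted)
precision = metrics.precision_score(test.target, predicted, average='macro')
recall = metrics.recall_score(test.target, predicted, average='macro')
f1 = metrics.f1_score(test.target, predicted, average='macro')
error = 1 - accuracy

print(f"Accuracy={accuracy:.3f}, Error={error:.3f}, Precision={precision:.3f}, "
      f"Recall={recall:.3f}, F1={f1:.3f}")
```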

3.2.7 Training Phase
This phase is also called the learning phase; the set of documents used for it is called the training set. It contains documents with predetermined labels, where each document in the training set is assumed to belong to a predefined class. This set is used to train the classifier to take appropriate categorization decisions: the classifier learns from the training data, hence the name learning phase. The resulting model may be represented as classification rules, decision trees, or mathematical formulae.

3.2.8 Testing Phase
After the training phase comes the testing phase, which is also called the classification phase. It is used for classifying future or unlabeled documents. The known label of each test


document is compared with the classified result to estimate the accuracy of the classifier.

3.3 Recursive Feature Elimination (RFE) with its Limitations
Recursive feature elimination (RFE), as its name suggests, removes features by considering smaller and smaller feature sets. The classifier is trained on all features in the data set, the features are ranked, and the lowest-ranked feature is eliminated. The training process is then repeated on the remaining features, and this continues until the desired number of features is reached. Researchers have avoided using RFE in text mining because of its complexity and because it is very slow. Therefore, a faster way is suggested here: improved recursive feature elimination (IRFE).

3.4 Improved Recursive Feature Elimination (IRFE)
Improved Recursive Feature Elimination (IRFE) is suggested to solve the slowness of RFE: instead of eliminating one feature per training run over many runs, a group of features is eliminated using only a few training runs. This is done by partitioning the features into p equal parts. Then p training runs are performed, and in each training and testing run a number of features equal to the size of one part is eliminated. For example, if we have 1000 features, they can be partitioned into 10 parts of 100 features each. Each training phase is then done on 9 parts (900 features), neglecting the remaining part (100 features). It is easy to see that 1000 tests are reduced to 10 tests.

In this form, the method has a problem with feature elimination, because the 100 features of a part all take the same error value (the same ranking). One group may contain both the most important and the least important feature; therefore, the overlap between two groups is used for elimination rather than a single part alone. Each part is examined more than once, but the number of tests remains the same. For this purpose, the number of parts per elimination (e) is introduced. For example, if e = 2, then the first and second parts are neglected in the first training run and the rest are taken,


but the second and third parts are neglected in the second training run, and so on. Each part is therefore neglected twice and its effect appears in two measurements.

In the explained method, neglecting the parts affects the performance of the system, and this effect is the error attributed to those parts; the worst (least informative) part gives a very small error when it is neglected. Each part takes its error as the summation over the e neglecting tests in which it was left out, and each part is then ranked according to this error. Another problem arises because all features in one group have the same error rate; for example, 26000 features with 100 parts result in groups of 260 features sharing the same error rate. This problem is solved by multiplying the value of each feature by its term frequency (tf), which breaks the equality within a group, so that different values are obtained for the features. The features with the lowest ranks are eliminated, and this is repeated until the specified number of features remains. The number of neglected features should be less than half the size of a segment and should be as small as possible.

3.4.1 Formal Definition
Suppose we have a set of features F to be reduced:
F = {f1, f2, …, fn}
These features are partitioned into p parts (groups) where: 1