A Computational Framework for Tamil Document Classification using Random Kitchen Sink

Sanjanasri J.P and Anand Kumar M
Center for Excellence in Computational Engineering and Networking
Amrita School of Engineering, Amrita Vishwa Vidyapeetham
Email: p [email protected] and m [email protected]

Abstract—With the rapid growth of the World Wide Web, the availability and accessibility of regional language content such as e-books, web pages, e-mails, and digital repositories has grown exponentially. As a result, automatic document classification has become a hotspot for fetching information from among millions of web documents. Text classification also forms the baseline for many NLP applications such as information extraction, query answering, and summarization. The main objective of this paper is to develop a computational framework for the supervised Tamil document classification task. We show that Random Kitchen Sink, a randomization algorithm available in the Grand Unified Regularized Least Squares (GURLS) machine learning library, performs comparably or better than conventional kernel-based classifiers in terms of accuracy. We therefore argue that Random Kitchen Sink can be an effective alternative to kernels in a classifier.

Keywords—Machine Learning, Natural Language Processing, Support Vector Machine, Proximal Support Vector Machine, Random Kitchen Sink, Regularized Least Squares.

I. INTRODUCTION

Over the past half decade, almost every regional language knowledge repository, and Tamil in particular, has moved its resources into online digital repositories. Alongside this exponential growth of online data, the demand for large-scale document classification systems has also risen, since specific information must be extracted from these collections. Document classification is the task of analyzing the information contained in a document and assigning it the appropriate class. Manual classification of Tamil documents is expensive, labor-intensive, dependent on domain specialists, and highly time-consuming, because the whole process must be repeated for every new set of documents. Because of these limitations, automatic classification has come into the limelight. Automated document classification lies at the crossroads of Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) [1]. Information Retrieval is the task of extracting, from a large collection, the documents that are relevant to a search query [2]; automatic summarization, web searching and information filtering are some applications of IR [2]. Natural Language Processing is used to represent the document semantically in order to improve the classification task; NLP techniques used in the pre-processing phase include stemming, lemmatization, stop-word removal and Part-Of-Speech tagging [3]. Machine learning algorithms train the classification system on previously seen example data so as to improve its expected future performance [4].

Classification algorithms largely fall into three categories: supervised, unsupervised and semi-supervised learning [5]. In supervised learning, the classifier is trained with labeled data; the labels are binary (zero-one) for a two-class problem, whereas for multiclass problems they are n-tuple vectors. Nearest Neighbors, Support Vector Machines and Decision Trees are the most widely used supervised classifiers. Unsupervised text classification, or clustering, trains on unlabeled data: it groups documents without any external label information, based on a similarity measure computed between documents, so that documents most similar to each other fall into the same cluster. Cosine similarity and Euclidean distance are widely used metrics. Semi-supervised algorithms use some unlabeled data in addition to the labeled data to train the classifier; Expectation Maximization and graph-based methods are typical examples.

Although many researchers work on automatic document classification, not much work has been carried out for regional languages. K. Rajan et al. used an Artificial Neural Network (ANN) and the Vector Space Model (VSM) for classifying Tamil texts and reported that the accuracy of the ANN (93.33%) is better than that of the VSM (90.33%); for the VSM, Term Frequency-Inverse Document Frequency was used as the weighting scheme, and documents were then classified using the cosine similarity measure [6]. Our approach is similar to the VSM framework of [6], but differs in that we use a newspaper corpus as the data set, the system is supervised, and classification is done with a simple linear classifier in GURLS built with the Random Kitchen Sink algorithm. The motivation of this paper is to develop a computational framework for document classification that gives better performance than kernel machine classifiers with a simpler architecture. To this end, the classification model built with RKS in GURLS is compared with kernel-based classifiers, namely the standard SVM, the Proximal Support Vector Machine (PSVM) and GURLS with an RBF kernel, to show its behavior when dealing with unbalanced data.


The paper is organized as follows. Section II describes the system, the process of selecting features from the raw data, and how each document is represented in vector form. Section III explains, in its first two subsections, how the Proximal Support Vector Machine (PSVM) differs from the classical Support Vector Machine. Subsection III-C introduces the Grand Unified Regularized Least Squares (GURLS) library and explains how the classifier in GURLS improves on the PSVM, and the idea of replacing the kernel optimization problem with randomized features (Random Kitchen Sink) is explained in subsection III-D. The statistics of the data set are given in Section IV, Section V reports the performance of the various classifiers used, and the paper concludes with Section VI.

II. DOCUMENT CLASSIFICATION

A document can be defined as a sequence of natural language words of varying length. Representing the document compactly, as numerical data of finite dimension, is the main issue in document classification, and the Vector Space Model is one possible solution. In the vector space representation, each document is represented over an array of words selected from all training documents. This array of words can easily reach the order of tens of thousands of unique words and phrases [7] when feature selection is not applied beforehand. After the documents are constructed as vectors, a machine learning algorithm is applied to train the classifier. The following subsections detail feature selection and the vector space representation; Figure 1 illustrates the generic framework of the document classification task.

A. Feature Selection

The words in the documents are termed features here; these features build the vocabulary used to represent each document in vector form. Features are selected based on term frequency: only frequent words are chosen as features, so that the probability of a feature occurring in the test cases is improved [8]. Selecting only the most relevant features also reduces the search space and the dimensional complexity. The lexicon generation (feature selection) algorithm is given below; an illustrative sketch follows the list.

1) All documents belonging to the same class are concatenated into a single large document.
2) Noisy tokens such as the newspaper name, date and time tags, and special characters, along with some stop words, are removed.
3) The single large document is parsed into tokens together with their frequencies.
4) The words (tokens) are sorted by frequency count.
5) The words with frequency above a threshold are added to the vocabulary. Here the threshold is fixed at 24, chosen arbitrarily.
6) The same procedure (Steps 1 to 5) is repeated for all other categories of documents.
7) The built vocabulary becomes the feature set for constructing the matrix.
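The following Python sketch illustrates the vocabulary-building procedure above under simplifying assumptions: the tokenizer, the crude date-tag pattern, the stop-word list and the threshold of 24 are placeholders rather than the exact settings used for the experiments.

```python
from collections import Counter
import re

def build_vocabulary(docs_by_class, stop_words, threshold=24):
    """Frequency-filtered vocabulary built class by class (Steps 1-7)."""
    vocabulary = set()
    for docs in docs_by_class.values():
        # Step 1: concatenate all documents of the class into one large document
        big_doc = " ".join(docs)
        # Step 2: strip noisy tokens (a crude date/time pattern here) before tokenizing
        big_doc = re.sub(r"\d{1,2}[-/:.]\d{1,2}[-/:.]\d{2,4}", " ", big_doc)
        tokens = [t for t in re.findall(r"\w+", big_doc) if t not in stop_words]
        # Steps 3-4: count token frequencies
        freq = Counter(tokens)
        # Step 5: keep only tokens whose frequency exceeds the threshold
        vocabulary.update(term for term, count in freq.items() if count > threshold)
    # Steps 6-7: the union over all classes becomes the feature set
    return sorted(vocabulary)
```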

B. Vector Space Representation

Every document is represented as a vector over the vocabulary built in the feature selection phase. Three statistical weighting schemes are applied to assign values to these selected features; accordingly, the documents are encoded into numerical vectors. The resulting sparse matrix has documents as rows and features (words) as columns:

Matrix = \begin{bmatrix} t_{11} & \cdots & t_{1n} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mn} \end{bmatrix}

We have used the following three weighting functions:

1) Bag of Words (BoW): a binary representation; for a specific document, the value of the ith entry is one if the ith vocabulary term is present in the document and zero otherwise:

t_i = \begin{cases} 1, & \text{if term } i \text{ occurs in the document,} \\ 0, & \text{otherwise.} \end{cases}  (1)

2) Term Frequency (TF): the number of occurrences of the term in the document, normalized by the length of the document [8]. In small documents words are unlikely to repeat, so term frequency is nearly equivalent to BoW [8]; in our case the collection contains several hundred documents and the resulting matrix is not equivalent to BoW:

t_i = tf(i, d) = \frac{\text{\# of occurrences of term } i \text{ in document } d}{\text{\# of terms in document } d}  (2)

3) Term Frequency-Inverse Document Frequency (TF-IDF): a measure that combines the frequency of the term in the given document with the number of documents in which the term occurs. If a term carries little weight in a document, it is likely to appear in many other documents with considerable frequency as well [9]. A smoothing term of 1 is added to the denominator of the inverse document frequency to avoid division by zero:

t_i = tf(i, d) \cdot idf(i, d)  (3)

where the inverse document frequency is

idf(i, d) = \log \frac{\text{Total \# of documents}}{\#\{d : \text{term } i \in d\} + 1}
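As a concrete illustration of the three weighting schemes, the sketch below encodes tokenized documents over a fixed vocabulary. It is a minimal sketch assuming pre-tokenized input and the smoothed idf of Eq. (3), not the implementation used in the experiments reported here.

```python
import math
from collections import Counter

def vectorize(tokenized_docs, vocabulary):
    """Return (bow, tf, tfidf) document-by-feature matrices as lists of lists."""
    n_docs = len(tokenized_docs)
    counts = [Counter(doc) for doc in tokenized_docs]

    # document frequency and smoothed idf; the +1 in the denominator avoids
    # division by zero, as in Eq. (3)
    df = [sum(1 for c in counts if term in c) for term in vocabulary]
    idf = [math.log(n_docs / (df_j + 1)) for df_j in df]

    bow, tf, tfidf = [], [], []
    for doc, c in zip(tokenized_docs, counts):
        doc_len = max(len(doc), 1)
        bow.append([1.0 if term in c else 0.0 for term in vocabulary])   # Eq. (1)
        tf_row = [c[term] / doc_len for term in vocabulary]              # Eq. (2)
        tf.append(tf_row)
        tfidf.append([t * w for t, w in zip(tf_row, idf)])               # Eq. (3)
    return bow, tf, tfidf
```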

III. MACHINE LEARNING ALGORITHM

The various machine learning algorithms applied to this task are discussed in this section. The input to the classifier is a pair (x, y), where x is the numerical document vector and y is the label of the document. A short description of the standard Support Vector Machine (SVM) is given in subsection III-A, after which subsection III-B explains how the Proximal Support Vector Machine (PSVM) differs from the SVM. For clarity, we assume a binary classifier for both the PSVM and the SVM; the discussion is later extended to kernel functions and multiclass classification in subsections III-B1 and III-B2. Subsection III-C gives an introduction to the GURLS library, and the effective computation of Random Kitchen Sink (RKS) is explained in subsection III-D.


[Figure: Unstructured Documents → Pre-Processing (stop word removal, time and date tag removal, etc.) → Feature Selection → Vector Space Representation (TF-IDF, BoW, TF) → Machine Learning Algorithm]

Fig. 1. Generic Framework for Document Classification

A. Support Vector Machine

Let us assume that the binary labelled data set D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\}, \; i = 1, 2, \ldots, m\} is given as input for training. The SVM uses the following hypothesis to design the best hyperplane separating the two disjoint spaces [10]:

f(x) = w^T \phi(x) + b  (4)

where w is the weight vector, whose size depends on the dimension of \phi(x), and \phi(x) : \mathbb{R}^n \to \mathbb{R}^d is the feature map associated with the kernel, which transforms the data points from the lower dimensional input space to a higher dimensional space so that non-linearly separable data points become linearly separable; for linearly separable data, \phi(x) is simply replaced by x. The term b \in \mathbb{R} is the bias. The following formulation defines the SVM classifier:

f(x_i) \geq +1, \quad \text{if } y_i = +1  (5)
f(x_i) \leq -1, \quad \text{if } y_i = -1  (6)

The pair of inequalities above can also be written as

y_i \left( w^T \phi(x_i) + b \right) \geq 1, \quad i = 1, 2, \ldots, m  (7)

The two parameters w and b decide the best separating hyperplane. A positive slack variable \xi is introduced to manage noisy data:

y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, 2, \ldots, m  (8)

The parameter w governs the maximization of the margin, the distance between the two bounding planes. It has to be estimated from the objective function under constraint (8), so that the resulting hyperplane has

minimal risk of incorrect classification. The objective function is given as

\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + c \sum_i \xi_i, \quad c > 0, \; i = 1, 2, \ldots, m  (9)

where \|w\|^2 controls the margin (the distance between the bounding planes), \xi_i is the error of the ith data point, and c is the regularization parameter that controls the trade-off between the margin and the training error [11]. In this quadratic optimization problem it is difficult to estimate w directly, since its dimension depends on \phi(x), which may be of infinite dimension, so the problem cannot be solved in the primal form. This is avoided by the dual formulation, which introduces a Lagrangian multiplier vector \alpha of m components; in the dual, learning is done in m dimensions rather than d dimensions [5]. The dual form of the classifier can be represented as

\max_{\alpha_i \geq 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \, k(x_j, x_k)  (10)

subject to

0 \leq \alpha_i \leq c, \quad i = 1, 2, \ldots, m, \qquad \sum_i \alpha_i y_i = 0
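For illustration only (this is not the configuration used in the experiments reported here), the following sketch trains a soft-margin SVM with an RBF kernel on document vectors using scikit-learn and then recomputes the decision function of the dual form from the learned support vectors and dual coefficients; the values of c and gamma are placeholder choices.

```python
import numpy as np
from sklearn.svm import SVC

def train_rbf_svm(X_train, y_train, c=1.0, gamma=0.5):
    """Fit a soft-margin SVM; the parameter C plays the role of c in Eq. (9)."""
    clf = SVC(C=c, kernel="rbf", gamma=gamma)
    clf.fit(X_train, y_train)
    return clf

def manual_decision_function(clf, X, gamma=0.5):
    """Recompute f(x) = sum_j (alpha_j * y_j) k(x_j, x) + b, cf. Eqs. (4) and (10)."""
    sv = clf.support_vectors_         # support vectors x_j
    coef = clf.dual_coef_.ravel()     # alpha_j * y_j for each support vector
    b = clf.intercept_[0]
    # RBF kernel k(x_j, x) = exp(-gamma * ||x_j - x||^2)
    sq_dist = ((X[:, None, :] - sv[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq_dist)
    return K @ coef + b               # agrees with clf.decision_function(X)
```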

B. Proximal Support Vector Machine (7)

The Proximal Support Vector Machine is a variant of the SVM [12]. In contrast to the classical SVM, the plane x^T w - b = \pm 1 is no longer a bounding plane but a proximal plane, around which the data points of the same class are clustered [12]. The PSVM assigns a class to a data point based on its distance from the proximal planes; for instance, the points closest to the positive-class proximal plane are assumed to be in the positive class. The proximal plane of each class is fixed based on the centroid of its data points, and the planes are pushed apart as far as possible [12]. All data points that do not lie on their class proximal plane


are considered as errors. The objective function is given as

\min_{w, b, \xi} \; \frac{1}{2} \left\| \begin{bmatrix} w \\ b \end{bmatrix} \right\|^2 + c \sum_i \xi_i^2, \quad c > 0, \; i = 1, 2, \ldots, m  (11)

subject to

y_i \left( w^T \phi(x_i) + b \right) = 1 - \xi_i

Here \left\| \begin{bmatrix} w \\ b \end{bmatrix} \right\|^2 controls the margin, c is the regularization parameter, and the squared error \xi_i^2 is used to handle this new notion of error. As in the Least Squares SVM (LS-SVM), the equality constraint allows the system to be solved as a linear system in the primal form, provided \phi(x) is of finite dimension; when \phi(x) is of infinite dimension, the dual of the above formulation has to be considered. This algorithm is faster than, and comparably effective to, the standard SVM when the data dimension is far smaller than the number of training samples (n \ll m).

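As an illustration, the following is a minimal sketch of the linear PSVM solved in the primal, not the implementation used in the experiments reported here: substituting the equality constraint into Eq. (11) yields an unconstrained regularized least-squares problem whose solution comes from a single linear system. Labels are assumed to be ±1 and the feature map is taken to be the identity.

```python
import numpy as np

def psvm_linear_train(X, y, c=1.0):
    """Linear PSVM: minimize 0.5*||[w; b]||^2 + c*sum(xi_i^2)
    subject to y_i * (x_i @ w + b) = 1 - xi_i, solved in closed form."""
    # Augment with a column of ones so that E @ z = X @ w + b, with z = [w; b]
    E = np.hstack([X, np.ones((X.shape[0], 1))])
    n = E.shape[1]
    # Normal equations of the regularized least-squares problem
    z = np.linalg.solve(np.eye(n) / (2.0 * c) + E.T @ E, E.T @ y)
    return z[:-1], z[-1]              # weight vector w and offset b

def psvm_predict(X, w, b):
    """Assign the class whose proximal plane is nearer."""
    return np.sign(X @ w + b)
```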