
Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 00 (2018) 000–000 www.elsevier.com/locate/procedia

International Conference on Computational Intelligence and Data Science (ICCIDS 2018)

Author Identification using Sequential Minimal Optimization with Rule-based Decision Tree on Indian Literature in Marathi

Kale Sunil Digamberrao a,*, Dr. Rajesh S. Prasad b

a Smt. Kashibai Navale College of Engineering, Vadgaon, Pune 411041, India.
a Pune Institute of Computer Technology, Dhankawadi, Pune 411043, India.
b Sinhgad Institute of Technology & Science, Narhe, Pune 411043, India.

Abstract

Authorship identification is the task of identifying who wrote a given piece of text from a given set of candidate authors (suspects). The increasingly large volume of text on the Internet makes authorship identification an urgent necessity. A large amount of work has already been done for the English language. Comparatively less research has been carried out for Indian regional languages such as Tamil, Telugu, Bengali and Punjabi, and no such experiment is available for Marathi. In this study, we present a strategy for authorship identification of documents written in the Marathi language. We adopt a set of fine-grained lexical and stylistic features for the analysis of the text and use them to develop two different models: a statistical similarity model and SMORDT (sequential minimal optimization with a rule-based decision tree). We then validate the feature extraction method, showing consistent significance in every model used in this experiment. The performance of the proposed approach is evaluated in terms of recall, precision, F-measure and accuracy.

Keywords: Author Identification; Decision Tree; Sequential Minimal Optimization; Feature Extraction; Stylometry; Marathi Language

* Corresponding author. E-mail address: [email protected] 1877-0509 © 2018 The Authors. Published by Elsevier B.V.

Peer-review under responsibility of the scientific committee of the International Conference on Computational Intelligence and Data Science (ICCIDS 2018).


1. Introduction

Modern society depends heavily on the internet, both at work and in private life, and updates posted on social media are often routed through anonymous or unknown e-mail sources. Such communication habits, informal and at times illegitimate, make it essential to identify the author of a source. Author Identification (AI) is the problem of investigating an author's identity for an anonymous text, using either a defined or an undefined list of candidate authors, through stylometry. Author identification is needed in various fields of application, including civil law, computer security, criminal law, cybercrime, electronics, text forensics, intelligence, literature, security and trading (Stamatatos [1]). Author profiling, the task of identifying the gender and age of an author, was carried out on short texts by Cheng et al. [2]. The framework and results of author profiling at PAN 2016 are provided in Rangel et al. [3]. Author profiling has been examined for Roman Urdu and English by Fatima et al. [4], for English, Dutch, Spanish and Italian by Kocher and Savoy [5], and for Arabic text by Alsmearat et al. [6]. Earlier author identification work focused mainly on long texts, whereas current research targets short texts, owing to the boom of social media platforms that rely heavily on text messages (Sapkal and Shrawankar [7]). Author identification can be categorized in two ways: closed-class and open-class. In the closed-class problem, the author to be identified belongs to a defined group, whereas in the open-class problem the author may lie outside the candidate list (Prasad et al. [8]). The majority of existing studies focus on closed-set problems (Stolerman et al. [9]), which are considered easier to solve than open-set problems (Potha and Stamatatos [10]).
Author identification can also be tackled through two related processes, namely Native Language Identification (NLI) and Language Variety Identification (LVI) (Franco et al. [11]). State-of-the-art research advances through stylometric analysis using Natural Language Processing (NLP), on the premise that every author unconsciously uses a consistent idiolect. Another approach, Bag-Of-Words (BOW), relies on pattern analysis, computing the frequencies of keywords used by an author or group of authors (Alsmearat et al. [6]). Author identification techniques fall into two directions: statistical approaches and machine-learning approaches built on standard NLP techniques. The statistical model uses three standard procedures: cosine similarity, the chi-square measure and Euclidean distance. The machine-learning approach comprises Formal Concept Analysis (FCA), Decision Trees, the Naive Bayesian classifier (Baron [12]), Nearest Neighbour, Neural Networks (Sboev et al. [13]) and the Support Vector Machine (SVM) algorithm (Alam and Kumar [14]). The decision-tree approach is a classification method over hierarchical data that is intuitive and simple to interpret. Sequential minimal optimization uses the cross-platform machine-learning library LIBSVM, whereas the decision-tree approach follows a top-down strategy (Barros et al. [15]). India officially recognizes 22 languages, while almost 200 unofficial languages have developed in various localities. Author identification has been reported for several of these languages as well as for international ones. Among all languages, reports on author identification for English (Juola [16]) outnumber those for Bengali (Das and Mitra [17]), Punjabi (Kaur and Varma [18]), Tamil (Pandian et al. [19]), Telugu in Prasad et al.
[8], Arabic (Alam and Kumar [14]), Chinese (Ma et al. [20]), Dutch (Kestemont [21]), Greek (Peng et al. [22]), Japanese (Tsuboi and Matsumoto [23]), Russian (Sboev et al. [13]), Turkish (Saygili et al. [24]), etc. Although Marathi is the official language of Maharashtra and is used by crores of people in India, no reports exist on author identification for Marathi. Marathi is written continuously from left to right, and the headline of one character joins the headline of the next or previous character, so that multiple characters with modified shapes appear in a word as a single connected component linked by the shirorekha (Kamble and Hegadi [25]). Similarly, among component characters, consonants, numerals and vowel modifiers


there are many similarly shaped characters (Cavanaugh [26]). Altogether, these characteristics make author identification for Marathi challenging. In this regard, the present study presents a strategy to construct a decision-tree structure that accurately predicts the author from a group of unknown candidates, using SMO with a rule-based decision tree (SMORDT), an approach not explored previously for Marathi.

The paper is organized as follows. Section 2 gives an overview of author identification and writing-style features. Section 3 discusses the linguistic characteristics of Marathi. Section 4 provides a detailed review of author identification methods and approaches. Section 5 describes the proposed work. Section 6 discusses the experimental environment, followed by Section 7 on the experimental evaluation and results. Lastly, the conclusion and future scope are presented.

2. Background

2.1 Author Identification

Author identification is classification based on the writing style of documents rather than their topic. The key research topics in author identification are the selection of writing-style features and the identification techniques themselves (Prasad et al. [8]). According to author identification studies, the taxonomy of proposed writing-style features can be categorized into four types: lexical, character, syntactic and semantic.

2.2 Writing Style Features

Lexical features - Lexical features are the most traditional features used for author identification. Examples are term size, special characters, misspellings, sentence length, text rates, most frequent types, spelling errors and preferred terminology. The main disadvantage is that some Asian languages, such as Chinese, Japanese and Korean, lack boundaries separating words, making these measures inflexible to apply without special tools (Altheneyan and Menai [27]).

Character features - From the character perspective, a word is a sequence of characters. Letter frequencies, character types and character n-grams are a few such structures. Character n-gram measures are significant for capturing both lexical and contextual information, and they can be applied to all languages without requiring any special tools. Conversely, the dimensionality of the representation is higher than for the lexical approach, owing to redundant information (Altheneyan and Menai [27]).

Syntactic features - Syntactic features are applied by authors instinctively, which makes them more reliable than lexical features. Various syntactic features have been used in attribution studies, including part-of-speech (POS) frequencies, rewrite-rule frequencies, syntactic errors, syntactic structures, POS tags, phrase types, punctuation, function words, verbal phrases and words per phrase. Extracting syntactic features requires accurate language-dependent tools (Prasad et al. [8]).

Semantic features - Semantic features comprise structures that emphasize the meaning of a word (Pandian et al. [19]). The NLP tools available for semantic analysis are neither widespread nor sufficient; consequently, few efforts have targeted semantic features such as synonyms, semantic dependencies and Systemic Functional Linguistics (SFL), which describe functional words in combination with POS features.

According to author identification reports, writing-style features including lexical, character, syntactic and semantic features can be extracted from scripts using suitable tools.

3. Marathi Characteristics

Marathi is the official language of Maharashtra.
Marathi uses the Devanagari script as its base, which comprises 15 vowels and 37 consonants. Occasionally, more than one vowel or consonant may combine, forming


new characters recognized as compound characters. Each character possesses a horizontal line at the top, known as the shirorekha. A basic character in Marathi may be inscribed uniquely using a mixture of accent marks, including circles, curves and lines, which may be engraved before, after, above or below the consonant (Kamble and Hegadi [25]). In Marathi, most sentences end with verbs (Wanjari et al. [28]); hence it is termed a verb-final language (Kulkarni et al. [29]).

4. Related works

Several studies on author identification have been conducted previously. Chakraborty [30] presented a strategy for authorship identification in Bengali using stylometry, comparing statistical similarity and machine-learning models of various types. The experiment revealed that SVM outperforms the other models, with an accuracy of 83.3% after ten-fold cross-validation using the same features. Altheneyan and Menai [27] examined different Naive Bayes classifier models for authorship attribution on texts written in Arabic. The results illustrate that MBNB provided the highest accuracy, 97.43%, and comparative results showed that MBNB and MNB are suitable for authorship attribution. Stamatatos [31] experimented with author identification on English and Arabic text corpora. Baron [12] conducted research on author identification using a Naive Bayes classifier on datasets based on texts of male and female authors through the WEKA data-mining software; the numerical results were observed, compared, formulated, discussed and concluded. Kaur and Varma [18] applied author identification to poetry written by five Punjabi poets; character n-gram and word n-gram measures were provided as inputs to a linear Support Vector Machine (SVM) classifier, with 856 and 465 poems for training and testing, respectively. The performance, validated on the basis of recall, precision, F-score and accuracy, was found notable, with 80% attribution. Prasad et al. [8] used the Support Vector Machine (SVM) algorithm as a classifier for author identification, assigning an anonymous text to one among the known authors. Kakade and Gulhane [32] implemented and compared Support Vector Machines (SVM) and Particle Swarm Optimization (PSO) for Devanagari character recognition in an Android application using MATLAB through PHP; the accuracy of PSO (90%) was higher than that of SVM (85%). Pandian et al. [19] classified a dataset containing various features for authorship identification on a Tamil classical poem (Mukkoodar Pallu) by constructing a decision tree with the C4.5 algorithm. Kale and Prasad [33] presented a systematic survey of author identification methods and discussed the impact of dataset size on authorship identification accuracy, comparing large datasets such as novels and essays with small ones such as SMS and tweets. Villar et al. [34] experimented on short messages from interactive online communication. Sboev et al. [13] performed a comparative study applying different machine-learning techniques to author gender identification in Russian; complex neural network models (CNN+LSTM) revealed the highest accuracy (86±0.03% F1 score). The main disadvantage is the difficulty of interpreting the features; hence future plans were made to derive extraction procedures from the internal neural network parameters and to examine the obtained models on Russian social-media texts. Alsmearat et al. [6] studied and compared Stylometry Features (SF) and Bag-of-Words (BOW) approaches for author gender identification in Arabic. The comparison, carried out in diverse settings, revealed that the SF approach is cheaper and more accurate than the BOW approach; the best in-house dataset accuracies obtained by the SF and BOW approaches were 80.4% and 73.9%, respectively. Saygili et al. [24] reviewed and analysed authorship attribution (AA) on Turkish through stylometric features using the support vector machine algorithm. In addition, Naive Bayes classifiers have been used for authorship attribution/identification in different languages, including English by Hoorn et al., Zhao and Zobel, Tan and Tsai, Pillay and Solorio, and Turkoglu et al. [35-39], and Mexican Spanish by Coyotl et al. [40]; however, these do not use an event model to extract the features. Kohilavani et al. [41] applied the Naive Bayes algorithm to automatically categorize text documents written in Tamil and obtain valuable information from them for a knowledge base. Kourdi et al. [42] used the Naive Bayes algorithm for automatically categorizing web documents written in Arabic; the average accuracy obtained was 68.78%.


Alsaleem [43] applied Support Vector Machine (SVM) and Naive Bayes algorithms for automatically categorizing Arabic text documents; for various Arabic datasets, the SVM algorithm provided better results than the NB algorithm (Vispute and Potey [44]). Prasad et al. [45] examined Gujarati character recognition by pattern matching, and Prasad et al. [46] proposed a template-matching algorithm for Gujarati character recognition. However, most models presented in advanced work in the domain still depend on the datasets they were trained and tested on, since they draw heavily on content aspects, mostly large numbers of frequent words or word combinations extracted from the training sets (Company and Wanner [47]). Previously in the Marathi language, sentence-boundary detection using a rule-based approach by Wanjari et al. [28] provided an accuracy of 70%. Similarly, Sahani et al. [48] categorized Marathi documents based on LINGO clustering. Gaikwad et al. [49] examined the performance of feature extraction using a combination of MFCC, LDA and DWT, which provided an accuracy of 95%. Patil and Bogiri [50] categorized Marathi documents using LINGO based on the VSM with a dataset of 200 documents, and the results show the algorithm is efficient (Vispute and Potey [44]). Kale and Prasad [51] provided a systematic review of author identification in different languages, including Indian regional languages, with the methods used and the accuracies achieved. Although past studies have considered Marathi documents, extracting features from Marathi text is more challenging than for other languages, and previous studies did not focus on the lexical, syntactic and semantic features of Marathi literature.
From the literature review, few studies have concentrated on feature extraction methods (Hoorn et al., Zhao and Zobel, Tan and Tsai, Pillay and Solorio, Turkoglu et al. [35-39]) such as lexical features [27], syntactic features (Prasad et al. [8]) and semantic features (Pandian et al. [19]) for author identification; among these, lexical and stylistic features provide the better results for text analysis. Some studies utilized machine-learning techniques for classification and feature extraction (Baron, Sboev et al., Alam and Kumar, Barros et al. [12-15], Kakade and Gulhane [32], Villar et al. [34], Company and Wanner [47], Kohilavani et al., Kourdi et al., Alsaleem, Vispute and Potey, Prasad et al. [41-46]) in author identification systems, but these still depend on the datasets they were trained and tested on. Overall, none of the studies combines all of these into a single framework for extracting the author name from dynamic input data. Therefore, this study proposes a feature extraction method, a statistical similarity model (cosine similarity, Euclidean distance) and a machine-learning technique (SMORDT: sequential minimal optimization with a rule-based decision tree) for author identification in the Marathi language. This hybrid framework improves accuracy through both feature extraction and classification when extracting the author name from user data.

5. Sequential Minimal Optimization with Rule-based Decision Tree Approach

In this work, we present a technique for identifying the author of text written in the Marathi language. We adopt a set of fine-grained lexical and stylistic aspects for the study of the text and use them to develop two different models: a statistical similarity model and the sequential minimal optimization with rule-based decision tree (SMORDT) approach. Section 2.2 gave an overview of the feature types used. SMO is used to solve the author identification problem for the Marathi language on the given dataset. The advantages of the SMO approach are that it reduces computation time and scales better than the standard SVM training process. SMO solves the problem by dividing it into a series of smallest-possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers αi, the smallest possible sub-problem involves two such multipliers. For any two multipliers α1 and α2, the constraints reduce to:

0 ≤ α1, α2 ≤ C,  y1α1 + y2α2 = k   (1)


This reduced problem can then be solved analytically: one needs to find the minimum of a one-dimensional quadratic function. Here k is the negative of the sum over the remaining terms of the equality constraint, which is fixed in every iteration. The algorithm is: (a) find a Lagrange multiplier α1 that violates the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem; (b) select a second multiplier α2 and optimize the pair (α1, α2); (c) repeat steps (a) and (b) until convergence. When all the Lagrange multipliers satisfy the KKT conditions, the problem is solved. Sequential minimal optimization is designed to solve the following quadratic programming problem, as shown in Figure 1:

max_α  Σ(i=1..n) αi − (1/2) Σ(i=1..n) Σ(j=1..n) yi yj K(xi, xj) αi αj   (2)

subject to  0 ≤ αi ≤ C, i = 1, 2, ..., n,  and  Σ(i=1..n) yi αi = 0   (3)

where xi is the input vector of a training example, yi is the class label for xi, n is the number of training examples, αi are the Lagrange multipliers, C is a hyperparameter, and K(xi, xj) is the kernel function.
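For illustration, the two-multiplier update that SMO performs under constraints (1)-(3) can be sketched as follows. This is a generic simplified-SMO sketch with a linear kernel and a toy dataset, not the authors' implementation; all names and values are illustrative:

```python
import random

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def smo_train(X, Y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    """Simplified SMO: optimize one pair (alpha_i, alpha_j) analytically,
    holding all other multipliers fixed, until no multiplier changes."""
    rng = random.Random(seed)
    n = len(X)
    alpha = [0.0] * n
    b = 0.0

    def f(x):  # decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
        return sum(alpha[i] * Y[i] * linear_kernel(X[i], x) for i in range(n)) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = f(X[i]) - Y[i]
            if (Y[i] * Ei < -tol and alpha[i] < C) or (Y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                Ej = f(X[j]) - Y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Bounds L, H keep the pair on the line y_i*a_i + y_j*a_j = k (eq. 1)
                if Y[i] != Y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = (2 * linear_kernel(X[i], X[j])
                       - linear_kernel(X[i], X[i]) - linear_kernel(X[j], X[j]))
                if eta >= 0:
                    continue
                alpha[j] = min(H, max(L, aj_old - Y[j] * (Ei - Ej) / eta))
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + Y[i] * Y[j] * (aj_old - alpha[j])
                # Recompute the threshold b from the KKT conditions
                b1 = (b - Ei - Y[i] * (alpha[i] - ai_old) * linear_kernel(X[i], X[i])
                      - Y[j] * (alpha[j] - aj_old) * linear_kernel(X[i], X[j]))
                b2 = (b - Ej - Y[i] * (alpha[i] - ai_old) * linear_kernel(X[i], X[j])
                      - Y[j] * (alpha[j] - aj_old) * linear_kernel(X[j], X[j]))
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b, f

# Toy linearly separable data: class +1 vs class -1
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
Y = [1, 1, -1, -1]
alpha, b, f = smo_train(X, Y)
print([1 if f(x) >= 0 else -1 for x in X])  # should recover the training labels
```

Each outer pass sweeps the multipliers, so the whole QP is solved as a sequence of analytically solvable two-variable sub-problems, which is the property the text credits to SMO.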

[Figure 1 shows the implementation flow: the test text data (input) descends a chain of SMO nodes ("Is author AB?", "Is author AD?", "Is author AK?", "Is author MM?", "Is author MV?"), each branching Yes (p = 0.8) or No (p = 0.2); a final No leads to "Author not part of corpus". The decision rules attached to the nodes are: Rule-1 (author known; error rate FN = 0.8, TP = 0.2; not the correct author), Rule-2 (author known; FN = 0.8, TP = 0.2; correct author), Rule-3 (author known; TN = 0.2, FP = 0.8; the correct author is not the targeted one, i.e., marked -1 in the training data set), Rule-4 (author known; TN = 0.2, FP = 0.8; the correct author is the one marked +1 in the training data set), Rule-5 (author unknown; FN = 0.8; author not in corpus).]

Figure 1. Implementation flow of Author Identification process.

To identify the author with SMO and a rule-based decision tree, we start from one of two situations: (a) start with a known author and confirm against the corpus, or (b) start with an unknown author and search the corpus. When we analyse a known author, two situations can occur: True Positive or False Negative. The model reports an error rate of 80% for False Negative and 20% for True Positive. In this situation, Rule-1 of classification keeps splitting the training data, excluding zero-marked cases, until the error rate reaches 20%, i.e., the True Positive situation. Rule-2: if the error rate is 20% in the True Positive situation, the author is identified as the one with the same +1 marking in the training dataset. When the analysis starts with an unknown author, again only two situations can occur: (a) True Negative (the correct author is negatively marked in the training data set) and (b) False Positive (the correct author is positively marked in the training data set). The model reports an error rate of 20% in the True Negative situation and 80% in the False Positive situation. Hence Rule-3: when an error rate of 20% is observed in the True Negative situation, the author is identified as the one marked negative in the training dataset. Rule-4: when an error rate of 80% is observed in the False Positive situation, the author is identified as the one marked positive in the training dataset. Note that Rules 1 to 4 apply only when the author being searched for is part of the corpus. When a known author is not present in the corpus, this can be stated without any analysis. For an unknown author not in the corpus, only one situation arises: False Negative, with an error rate of 80%. Rule-5 covers this: if the error rate is 80% in the False Negative situation and does not change even after fully splitting the training data, the author is not present in the corpus.

Algorithm - SMO with rule-based Decision Tree (SMORDT)
  Initialize αi = 0 for all i, b = 0; initialize p = 0
  while (p < max_passes)
      num_changed_alphas = 0
      for i = 1, ..., N
          Calculate Xi = f(A(i)) − B(i)
          if ((B(i)·Xi < −tol and αi < C) or (B(i)·Xi > tol and αi > 0))
              Select j ≠ i randomly
              Calculate Xj = f(A(j)) − B(j)              (build the layout feature set)
              Save old α's: αi(old) = αi, αj(old) = αj
              Compute and clip the new value of αj
              if (L == H) continue to next i             (extract the style feature set)
              if (|αj − αj(old)| < 10^-5) continue to next i
              num_changed_alphas = num_changed_alphas + 1
          end if
      end for
      if (num_changed_alphas == 0)
          p = p + 1        (construct the structure feature set; extract the logic feature set)
      else
          p = 0
  end while
  Employ SMORDT to classify the authorship of each text.

6. Experimental Environments

6.1 Dataset Preparation

In Author Identification (AI), classes are recognised from the written documents of the authors. The steps followed are as given here. The first step is to pre-process the given data: tokenize the input text and receive the token output. The second step is to extract the features that distinguish the author's writing style from the given data. The third step is to reduce the dimensionality of the feature vector space using feature-selection measures. The fourth step is to input these vectors to the classifier to obtain the learning model. The final step is to identify the author of an unknown text by giving the assessment document as input to the learning model.
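The final identification step, walking the Figure 1 chain of per-author decisions, can be sketched as a one-vs-rest cascade. The scoring functions below are invented stand-ins for the trained SMO models, and the profile values and acceptance threshold are illustrative assumptions:

```python
def identify_author(text_features, author_models, accept_threshold=0.5):
    """Walk the rule-based decision chain: ask each per-author model
    'is this author X?' in turn (Rules 1-4); if every model rejects the
    text, conclude the author is not in the corpus (Rule 5)."""
    for author, model in author_models:
        p_yes = model(text_features)      # confidence that the text is by `author`
        if p_yes >= accept_threshold:     # "Yes" branch of the tree node
            return author
    return "Author not part of corpus"    # Rule 5: fell through every node

def make_profile_model(profile):
    """Stand-in scorer: crude overlap with a stored author profile, in [0, 1],
    used here as a placeholder for an SMO decision value."""
    def score(features):
        hits = sum(1 for k, v in profile.items() if abs(features.get(k, 0) - v) < 0.1)
        return hits / max(len(profile), 1)
    return score

models = [
    ("AB", make_profile_model({"avg_word_len": 4.1, "halanta_rate": 0.05})),
    ("AK", make_profile_model({"avg_word_len": 3.6, "halanta_rate": 0.05})),
]
print(identify_author({"avg_word_len": 4.1, "halanta_rate": 0.05}, models))  # AB
print(identify_author({"avg_word_len": 9.9, "halanta_rate": 0.9}, models))   # Author not part of corpus
```

Chaining binary "is it author X?" decisions in a fixed order is what turns the per-author SMO classifiers into the tree of Figure 1, with the fall-through branch realizing the open-set case.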


The Devanagari script forms the base for several north Indian languages, including Marathi. The test corpus contains 15 Marathi philosophical articles written by 5 different authors. Raw text documents are unsuitable for pattern generation; hence they should be converted into suitable input formats, which requires pre-processing. The main phases of the technique are text pre-processing, feature selection and extraction, classification, author identification and performance evaluation. The architecture is illustrated in Figure 2. The amounts of training and testing data are given in Table 1, which lists the total numbers of Marathi articles and words. The training data set contained records of all the authors (140 records in total) with two classes (codes: 1 - target author, 0 - others). Similarly, the testing data set contained records of all the authors (70 records in total) with the same two classes. Two data files are used per author, one for training and one for testing (e.g. AB_Train.txt and AB_Test.txt). The data files use the following general format: CLASS IDENTIFIER (+1, -1) FEATURE 1 INDEX: FEATURE 1 VALUE FEATURE 2 INDEX: FEATURE 2 VALUE.
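This layout matches the sparse LIBSVM convention of a class label followed by index:value pairs. A small helper for emitting and parsing such records might look like the following sketch (the helper names and sample values are illustrative, not from the paper's code):

```python
def to_svm_line(label, features):
    """Render one record: class identifier followed by index:value pairs."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

def from_svm_line(line):
    """Parse a record back into (label, {index: value})."""
    head, *pairs = line.split()
    feats = {}
    for p in pairs:
        i, v = p.split(":")
        feats[int(i)] = float(v)
    return int(head), feats

line = to_svm_line(+1, {1: 4856, 2: 294.0})
print(line)                  # 1 1:4856 2:294.0
print(from_svm_line(line))   # (1, {1: 4856.0, 2: 294.0})
```

One file per author in this format is exactly what a one-vs-rest setup needs: the target author's records carry +1 and everyone else's carry -1.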

[Figure 2 shows the architecture: a Marathi articles corpus is split into training and testing data; Step 1: text data collection and pre-processing; Step 2: extraction of features from the text document (lexical and stylistic methods); Step 3: text classification with the SMO-with-rule-based-decision-tree classifier; Step 4: author identification; Step 5: evaluation with performance measures (distance measurement, precision, recall, F-measure, accuracy).]
Figure 2. Author identification of Marathi articles.

Table 1. Statistics of Marathi articles corpus.

Data set            Number of Marathi articles   Number of words
Training dataset    10                           58,495
Testing dataset     5                            24,462
Total               15                           82,957

A detailed description of the input data sets is given in Table 2, which includes the name of the author, the article, the training and testing split, the number of articles and the total number of words.


6.2 Pre-processing

To ensure that the data, mostly downloaded from internet sources, are suitable, extensive pre-processing is essential. Initially, duplicate documents were deleted from the raw data, and the remaining documents were tokenized into sets of tokens so that a set-similarity function could be applied to each pair. This is followed by noise removal over all residual documents, which covers mark-up tags, non-words and digits, as well as whitespace normalization, which converts line breaks into spaces. After noise removal, every document is a long string in which every token is separated by exactly one space. Stop-word removal, tokenization and stemming are further data pre-processing steps.

Table 2. Detailed description of dataset.

Sr. No.  Name of author              Name of article       Articles for training  Articles for testing  Total articles  Total words
1        Manashi Vaidh (MV)          Sri Dasbodh           1                      -                     1               4856
2        Ananadini (AD)              Goraksh Shatakam      -                      1                     1               3853
3        Ananadini (AD)              Suta Samhita          1                      -                     1               3827
4        Ananadini (AD)              Chakavaa              1                      -                     1               8027
5        Anirddha Banhatti (AB)      Vishamrut             1                      -                     1               8358
6        Anirddha Banhatti (AB)      Tahani                -                      1                     1               6783
7        Arun Kulkarni (AK)          Atra Tatra Sarvatra   -                      1                     1               5564
8        Arun Kulkarni (AK)          Saakal                1                      -                     1               11674
9        Arun Kulkarni (AK)          Ekate Jeev            1                      -                     1               6761
10       Machchhindranath Mali (MM)  Inamdarin             1                      -                     1               6568
11       Machchhindranath Mali (MM)  Maitrabandh           -                      1                     1               6530
12       Machchhindranath Mali (MM)  Ghabaad               1                      -                     1               2915
13       Machchhindranath Mali (MM)  Gaavgappa             1                      -                     1               3413
14       Manashi Vaidh (MV)          Kabuli Vasuli         1                      -                     1               2096
15       Manashi Vaidh (MV)          Bali                  -                      1                     1               1732
Total                                                      10                     5                     15              82,957

Tokenization - A token is a representation of a sequence of characters in a document that is semantically grouped together as a unit for processing. In tokenization, the text is divided into small units called tokens using symbols and punctuation such as bullets, colons, exclamation marks, hyphens, parentheses, numbers and semicolons, thereby yielding units with useful semantic meaning.

Stop-word removal - Stop words, also known as function words or non-content words, are excluded during pre-processing since they do not form valuable features. Every text document contains a list of commonly repeated tokens, such as auxiliary verbs, prepositions, conjunctions, grammatical articles and pronouns, which are removed because they have almost no effect on the classification process.

Stemming - Removing prefixes and suffixes from tokens is called stemming; it is applied to reduce the number of variant forms of tokens. Stemming methods divide into two classes: root-based and stem-based.

Feature extraction - Feature extraction is the most significant step of any author identification method; transforming the input information into a set of features is termed feature extraction. In author identification analysis, stylometry and other characteristic features of a writer matter more than the text itself. Token-level features include trivial characteristics of each of the considered clusters, punctuation, and the count of hapax legomena, i.e., words appearing a single time in a document. Phrase-level features include chunks from the parsed corpus and counts of selected POS tags. In Marathi, sentences are separated by symbols such as the danda ('।'), the question mark ('?') or the exclamation mark ('!'). In the second step, different types of features are extracted and categorized into four types: lexical, character, syntactic and semantic. The lexical features taken for this analysis are the numbers of sentences and words, word length, sentence length and syllables per word. The features extracted via the proposed SMO with rule-based decision tree are given in Table 3.

Table 3. Feature extraction.

Sr. No  Features                                               Author-AB  Author-AK  Author-AD  Author-MM  Author-MV  Feature extraction tool / method
1       Article- No. of Lines                                  3319       10622      7080       4829       11474      Unix cat, grep and wc; Excel pivot
2       Article- No. of Words                                  23466      25078      24949      14198      15759      Unix cat, grep and wc; Excel pivot
3       Article- No. of Characters                             403527     454238     385131     267418     348934     Unix cat, grep and wc; Excel pivot
4       Word Size / Format Size                                294        441        441        588        441        Tokenizer, Unix grep, wc
5       No. of Consonants per Word                             3908       4354       3123       2809       3078       Tokenizer, Character dictionary, JavaScript
6       No. of Vowels per Word                                 104        89         63         65         85         Tokenizer, Character dictionary, JavaScript
7       No. of Matras per Word                                 2627       2870       2178       1802       1457       Tokenizer, Character dictionary, JavaScript
8       No. of Diacritics per Word                             0          0          0          0          0          Tokenizer, Character dictionary, JavaScript
9       No. of Halantas per Word                               1207       1373       1192       1223       2077       Tokenizer, JavaScript
10      Article-level Average Word Size                        4122       896        686        1316       1008       Tokenizer, Excel grouping, pivot
11      Article-level Average Word Frequency                   2717       3290       2982       2198       2184       Tokenizer, Excel grouping, pivot
12      Word Size level Frequency                              5415       5506       4890       3455       3194       Tokenizer, Excel grouping, pivot
13      No. of different Words of particular Size              3763       4260       3449       2889       2517       Tokenizer, Excel grouping, pivot
14      Different words level Frequency                        11550      11998      12194      6720       5040       Tokenizer, Excel grouping, pivot
15      Number of Words above Word Size                        10811      13144      9454       9461       13865      Tokenizer, Excel grouping, pivot
16      Number of Words below Word Size                        41958      55466      40591      34396      27410      Tokenizer, Excel grouping, pivot
17      Article-level Average Word Format Size (CVMDH Format)  378        588        489        882        1149       Tokenizer, Character dictionary
18      Article-level Average Word Format Frequency (CVMDH)    1988       1680       1778       1399       1248       Tokenizer, Character dictionary
19      Word CVMDH Format Size level Frequency                 2917       3149       2512       2057       1954       Tokenizer, Character dictionary
20      No. of different CVMDH Formats of particular Size      939        1174       943        1115       1313       Tokenizer, Character dictionary, JavaScript
21      Different formats level Frequency                      6314       6299       6244       4536       3030       Tokenizer, JavaScript
22      Number of CVMDH formats above this Format Size         2661       4020       2607       3130       5723       Tokenizer, JavaScript, Excel grouping, pivot
23      Number of CVMDH formats below this Format Size         10946      13342      11654      13577      13005      Tokenizer, JavaScript, Excel grouping, pivot
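Rows 1-3 of Table 3 are plain token-level counts obtained with Unix cat, grep and wc plus Excel pivoting; a minimal Python sketch of the same wc-style counting (an illustration only, not the authors' tooling, and run here on a toy string rather than the Marathi corpus) could look like:

```python
def article_counts(text: str) -> dict:
    """Count lines, whitespace-separated words and non-whitespace
    characters, mirroring rows 1-3 of Table 3 (wc-style counts)."""
    lines = text.splitlines()
    words = text.split()
    chars = sum(len(w) for w in words)  # characters, whitespace excluded (an assumption)
    return {"lines": len(lines), "words": len(words), "chars": chars}

# Toy illustration on a two-line snippet:
print(article_counts("ab cd\nef gh"))
```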


Feature Selection- Removing irrelevant measures from a set of measures is termed feature selection. It is performed to overcome difficulties such as computational cost and imprecise classifier accuracy caused by irrelevant data. In the feature selection process, the rule-based decision tree approach is applied to choose the best aspects from the available set of features. The decision tree generated from the root node segregates nodes into subsets containing homogeneous items; the tree is constructed with the SMO algorithm using two parameters: (1) entropy and (2) information gain. Entropy quantifies the degree of homogeneity among the nodes in a subset; information gain rises as entropy declines, so the two are inversely related. The tree built from all features is pruned down to the number of features that delivers maximum information gain and classification accuracy. Training and testing procedure- Decision trees characterize the classification on the basis of the tree structure. The root node holds a feature test that segregates the data samples according to their value for the tested feature, each test yielding the possible category subsets. Each terminal node carries a class label; in the present scenario, the terminal node is the name of the author. To break the problem into several sub-problems in the decision tree, SMO is used, which decomposes the complex classification problem into several small quadratic programming problems. The training and testing data related to true positives, true negatives, false positives and false negatives are discussed in Table 4A and Table 4B. Table 4A. Test data classification for TP, TN, FP and FN.

                  Target author class    Other author class
TRUE POSITIVE           +1                     -1
TRUE NEGATIVE           -1                     +1
FALSE POSITIVE          +1                     -1
FALSE NEGATIVE          -1                     +1

Table 4B. Training data classification for TP, TN, FP and FN.

                  Target author class    Other author class
TRUE POSITIVE           +1                     -1
TRUE NEGATIVE           +1                     -1
FALSE POSITIVE          -1                     +1
FALSE NEGATIVE          -1                     +1
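The entropy and information-gain criterion used in the feature-selection step can be sketched as follows (a generic illustration, not the authors' exact implementation; the author labels and the split are hypothetical toy data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    """IG = H(parent) - weighted sum of child entropies; a split with
    higher gain (lower remaining entropy) is preferred when growing the tree."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Toy split of author labels by a hypothetical feature threshold:
parent = ["AB", "AB", "MV", "MV"]
print(information_gain(parent, [["AB", "AB"], ["MV", "MV"]]))
```

A perfectly separating split removes the whole 1 bit of parent entropy, which is why gain-maximizing splits are chosen first.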

The subsequent comparison ranges from -1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually representing independence and in-between values representing intermediate similarity or dissimilarity.
6.3 Performance Evaluation
Distance measurement (Cosine Similarity, Euclidean Distance). In this study, two statistical similarity-based metrics are used to assess their individual effect on classifying documents, and their combined effect has also been compared with the others. These two measures are highlighted briefly in this section. Cosine Similarity (COS). Cosine similarity is a measure of similarity between two vectors of n dimensions obtained by finding the cosine of the angle between them. It is often applied to relate documents in text mining. Given two vectors R and T with the same number of attributes, the cosine similarity is represented using a dot product and magnitudes as:

similarity = cos(θ) = (R · T) / (|R| |T|) = Σ_{i=1}^{n} r_i t_i / ( √(Σ_{i=1}^{n} r_i²) · √(Σ_{i=1}^{n} t_i²) )      (4)

Euclidean Distance (ED). The Euclidean distance is the length of the line segment between two points p and q. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q is given by:

d(p, q) = √( Σ_{i=1}^{n} (p_i − q_i)² )      (5)

where n indicates the number of features, i.e., the dimension of a point, p indicates the reference point (the mean vector) of each cluster and q is the testing vector. For each test vector, the distances to the reference points are measured, and the smallest distance defines the probable cluster. From the experimental results, the indexed feature values of cosine similarity, the distance related to cosine similarity and the Euclidean distance for the authors are discussed in Table 5 and Figure 3. The averages of the predicted values of cosine similarity, distance related to cosine similarity and Euclidean distance are 0.0615, 0.9384 and 0.3618, respectively. Table 5. Computation based on indexed feature values.

Author name    Cosine similarity    Distance related to cosine similarity    Euclidean distance
AB             0.0896               0.9103                                   0.4600
AD             0.0519               0.9480                                   0.3957
AK             0.0257               0.9742                                   0.5727
MM             0.0279               0.9720                                   0.1364
MV             0.1126               0.8873                                   0.2444

Figure 3. Computation based on Indexed Feature Values (distance measurement per author name).
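Equations (4) and (5) can be sketched directly (a generic illustration; the vectors below are toy values, not the indexed feature vectors of Table 5):

```python
from math import sqrt

def cosine_similarity(r, t):
    """Eq. (4): dot(R, T) / (|R| * |T|)."""
    dot = sum(ri * ti for ri, ti in zip(r, t))
    return dot / (sqrt(sum(ri * ri for ri in r)) * sqrt(sum(ti * ti for ti in t)))

def euclidean_distance(p, q):
    """Eq. (5): sqrt(sum_i (p_i - q_i)^2)."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

p, q = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(cosine_similarity(p, q), euclidean_distance(p, q))
```

Note that each pair of values in the first two columns of Table 5 sums to approximately 1, i.e., the "distance related to cosine similarity" column is 1 minus the cosine similarity.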

Precision (P). Precision measures the ratio of relevant output instances to all instances returned in the output; intuitively, it is the ability of the classifier not to label as positive a sample that is negative. With TP the number of true positives and FP the number of false positives, precision is calculated as:

P = TP / (TP + FP)

Recall (R). Recall is the ratio of relevant output instances to all relevant instances; intuitively, it is the ability of the classifier to find all the positive samples. This parameter is included to observe the nature of the output. With FN the number of false negatives, recall is defined as:

R = TP / (TP + FN)

F-Measure (F). The overall performance of the proposed technique is summarized by the F-measure, also called the F-score, the harmonic mean of precision and recall. High precision means that more authors are correctly recognised than improperly, while high recall means the AA model has returned most of the relevant results. The F-measure, used to represent the correctness of the AA model, is:

F = 2 · P · R / (P + R)

Accuracy. Accuracy is the proportion of test results that are true positives and true negatives among all cases:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
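These four measures follow directly from the confusion counts; the sketch below (standard textbook definitions, not the authors' code) reproduces the proposed method's reported F-measure from its precision (80%) and recall (50%):

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(p, r):
    """F = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Reported values for the proposed SMORDT approach:
print(round(100 * f_measure(0.80, 0.50), 2))  # 61.54, matching Table 7
```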

7. Experimental Evaluations 7.1 Marathi Corpus The Marathi articles corpus is a collection of Marathi philosophical articles, each associated with a particular author. For the present research, 15 texts of 5 Marathi authors were considered. The philosophical articles were collected from the Marathi corpus, and the texts of these authors were converted into PDF files for ease of use and consistency. For each author, 2 articles (PDF files) were taken as training data, and hence 10 PDF files were used as training data, while one article of each author was kept as test data. 7.2 Sensitivity to Stemming and Normalization Stemming is a morphological normalization technique that identifies word roots through the elimination of affixes (generally suffixes); as such, it is pure feature conflation. Word segmentation and morphological normalization are language-dependent: techniques such as stemming or lemmatization require language-specific dictionaries or stripping rules. In the present paper, stemming is done on the basis of word size, wherein only words whose length lies within 4–17 characters are taken for the analysis.
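The word-length restriction described above (keeping only tokens of 4-17 characters) amounts to a simple filter; a minimal sketch, assuming whitespace tokenization and hypothetical English tokens for readability:

```python
def length_filter(tokens, lo=4, hi=17):
    """Keep only tokens whose length falls within [lo, hi], as in the
    size-based stemming step (bounds per Section 7.2)."""
    return [t for t in tokens if lo <= len(t) <= hi]

print(length_filter(["ab", "word", "x" * 20, "philosophy"]))  # → ['word', 'philosophy']
```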


7.3 Performance Comparison The performance of the proposed method has been estimated from the measured values of distance measurement, precision, recall, F-measure and accuracy. The outcomes of this experiment indicate that the method proposed here prominently improves the overall performance of authorship identification. Table 7 depicts the comparison of the performance of the proposed technique with the different algorithms used. 7.4 Comparison with other methods The confusion matrices obtained by performing the classification algorithms are discussed in Table 6A and Table 6B. For the proposed classifier, 80% of the 140 instances are correctly classified while 20% are incorrectly classified. The parameters confidence factor and minimum number of objects are varied to improve the classifier accuracy; by varying the confidence factor, the classifier accuracy obtained is 80%. Using SVM classification, 50% of the 140 instances are correctly classified. The experimental results thus reveal that the proposed approach provides better classification than the existing SVM approach. Table 6A. Confusion matrix, SVM.

N = 140           Actual: No    Actual: Yes    Total
Predicted: No         14            14           28
Predicted: Yes        56            56          112
Total                 70            70          140
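Reading the diagonal of Table 6A gives SVM's share of correctly classified instances (a derivation from the table, assuming rows are predictions and columns are actual classes):

```python
# Table 6A, SVM: rows = predicted (No, Yes), columns = actual (No, Yes)
matrix = [[14, 14],
          [56, 56]]
correct = matrix[0][0] + matrix[1][1]     # predicted class == actual class
total = sum(sum(row) for row in matrix)   # 140 instances
print(correct / total)  # 0.5 -> the 50% correctly classified cited for SVM
```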

Table 6B. Confusion matrix SMO with decision tree.

N = 140           Actual: No    Actual: Yes    Total
Predicted: No        20%           20%          40%
Predicted: Yes       80%           80%         160%
Total               100%          100%

The experimental results show that the predicted values of recall, precision, F-measure and accuracy are 50%, 80%, 61.54% and 80%, respectively (refer to Table 7). From the literature, the predicted accuracy using SVM on the Bengali and Panjabi data sets is 83.33% and 79%, respectively, and on Tamil, using a decision tree approach, it is 88.23%. The proposed classification approach on the Marathi data set provides 80% accuracy. As revealed in Table 7, the results cannot be compared directly because they relate to data sets in different languages used by former researchers. The present paper provides a novel approach and a strategy for author recognition in Marathi text, which has not been previously attempted. Table 7. Experimental results of Precision, Recall, F-Measure with Accuracy.

Reference         Data set language   Method                                    Recall         Precision      F-measure      Accuracy
[30]              Bengali             SVM                                       Not reported   Not reported   Not reported   83.33%
[18]              Panjabi             SVM                                       72%            68%            39%            79%
[19]              Tamil               Decision Tree C4.5                        Not reported   Not reported   Not reported   88.23%
Proposed method   Marathi             SMO with rule-based decision tree         50%            80%            61.54%         80%

8. Conclusion and Future work A novel author recognition technique (a combination of feature extraction, a statistical similarity model and a machine learning technique) is presented in this paper and developed for text written in Marathi, using various articles written by five authors. For the design of the technique, Sequential Minimal Optimization combined with a rule-based Decision Tree (SMORDT) is used, which has been tested on the Marathi philosophical articles considered in this research. The performance of the proposed method, evaluated by the measures of recall, precision, F-measure and accuracy, is 50%, 80%, 61.54% and 80%, respectively, although accuracy decreases when a small training size is applied. The rules for author recognition are considered to be the same in English and Marathi, and some further steps can be performed to achieve better recognition for texts in the Marathi language. The suggested methodology is also applicable to author identification in other languages such as English, Bengali, Punjabi, Arabic and so on. However, future work should be concerned with the implementation of other classifiers, which could improve the recognition score. Furthermore, accuracy can be enhanced by increasing the feature set with character n-grams and word n-grams, and the researcher further recommends the use of topic modelling to enhance the performance of the proposed technique.

References
[1] Stamatatos E. (2011) "Plagiarism detection using stopword n-grams." Journal of the Association for Information Science and Technology 62: 2512–2527.
[2] Cheng N, Chandramouli R and Subbalakshmi K P. (2011) "Author gender identification from text." Digit Investig. 8: 78–88.
[3] Rangel F, Rosso P, Verhoeven B, Daelemans W, Potthast M and Stein B. (2016) "Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations." CEUR Workshop Proc. 1609: 750–784.
[4] Fatima M, Hasan K, Anwar S and Nawab R M A. (2017) "Multilingual author profiling on Facebook." Inf Process Manag. 53: 886–904.
[5] Kocher M and Savoy J. (2017) "Distance measures in author profiling." Inf Process Manag. 53: 1103–1119.
[6] Alsmearat K, Al-Ayyoub M, Al-Shalabi R and Kanaan G. (2017) "Author gender identification from Arabic text." J Inf Secur Appl. 35: 85–95.
[7] Sapkal K and Shrawankar U. (2016) "Transliteration of Secured SMS to Indian Regional Language." Procedia Comput Sci. 78: 748–755.
[8] Prasad S N, Narsimha V B, Reddy P V and Babu A V. (2015) "Influence of Lexical, Syntactic and Structural Features and their Combination on Authorship Attribution for Telugu Text." Procedia Comput Sci. 48: 58–64.
[9] Stolerman A, Overdorf R, Afroz S and Greenstadt R. (2014) "Classify, but Verify: Breaking the Closed-World Assumption in Stylometric Authorship Attribution." Tenth Annu IFIP WG 11.9 Int Conf Digit Forensics 433: 185–205.
[10] Potha N and Stamatatos E. (2014) "A Profile-Based Method for Authorship Verification." SETN 313–326.
[11] Franco-Salvador M, Kondrak G and Rosso P. (2017) "Bridging the Native Language and Language Variety Identification Tasks." Procedia Comput Sci. 112: 1554–1561.
[12] Baron G. (2014) "Influence of Data Discretization on Efficiency of Bayesian Classifier for Authorship Attribution." Procedia Comput Sci. 35: 1112–1121.
[13] Sboev A, Litvinova T, Gudovskikh D, Rybka R and Moloshnikov I. (2016) "Machine Learning Models of Text Categorization by Author Gender Using Topic-independent Features." Procedia Comput Sci. 101: 135–142.
[14] Alam H and Kumar A. (2013) "Multi-lingual author identification and linguistic feature extraction - A machine learning approach." IEEE International Conference on Technologies for Homeland Security (HST) 386–389.
[15] Barros R C, Basgalupp M P, de Carvalho A C P L F and Quiles M G. (2012) "Clus-DTI: improving decision-tree classification with a clustering-based decision-tree induction algorithm." J Brazilian Comput Soc. 18: 351–362.
[16] Juola P. (2012) "Large-Scale Experiments in Authorship Attribution." English Stud. 93: 275–283.
[17] Das S and Mitra P. (2011) "Author Identification in Bengali Literary Works." International Conference on Pattern Recognition and Machine Intelligence 220–226.
[18] Kaur N and Verma A. (2015) "Authorship Attribution of Punjabi Poetry using SVM Classifier." Int J Adv Res Comput Sci Softw Eng. 5: 1055–1061.
[19] Pandian A, Ramalingam V V and Preet R V. (2016) "Authorship Identification for Tamil Classical Poem Mukkoodar Pallu using C4.5 Algorithm." Indian J Sci Technol. 9: 1–5.
[20] Ma J, Xue B and Zhang M. (2016) "A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages." Pacific-Asia Workshop on Intelligence and Security Informatics 33–52.
[21] Kestemont M. (2016) "Stylometric Authorship Attribution for the Middle Dutch Mystical Tradition from Groenendaal." Dutch Crossing 1–35.
[22] Peng F, Schuurmans D, Wang S and Keselj V. (2003) "Language independent authorship attribution using character level language models." Proc Tenth Conf Eur Chapter Assoc Comput Linguist. 1: 267–274.
[23] Tsuboi Y and Matsumoto Y. (2002) "Authorship Identification for Heterogeneous Documents." IPSJ SIG Notes 17–24.
[24] Saygili N S, Amghar T, Levrat B and Acarman T. (2017) "Taking advantage of Turkish characteristic features to achieve authorship attribution problems for Turkish." IEEE 25th Signal Processing and Communications Applications Conference (SIU) 1–4.
[25] Kamble P M and Hegadi R S. (2015) "Handwritten Marathi Character Recognition Using R-HOG Feature." Procedia Comput Sci. 45: 266–274.
[26] Cavanaugh C. (2009) "Effectiveness of cyber charter schools: A review of research on learnings." TechTrends 53: 28–31.
[27] Altheneyan A S and Menai M E. (2014) "Naïve Bayes classifiers for authorship attribution of Arabic texts." J King Saud Univ - Comput Inf Sci. 26: 473–484.
[28] Wanjari N, Dhopavkar G M and Zungre N B. (2016) "Sentence Boundary Detection For Marathi Language." Procedia Comput Sci. 78: 550–555.
[29] Kulkarni S B, Deshmukh P D and Kale K V. (2013) "Syntactic and Structural Divergence in English-to-Marathi Machine Translation." IEEE International Symposium on Computational and Business Intelligence 191–194.
[30] Chakraborty T. (2012) "Authorship Identification in Bengali Literature: a Comparative Analysis." Proc COLING Demonstration Papers 41–50.
[31] Stamatatos E. (2008) "Author identification: Using text sampling to handle the class imbalance problem." Inf Process Manag. 44: 790–799.
[32] Kakde P M and Gulhane S M. (2016) "A Comparative Analysis of Particle Swarm Optimization and Support Vector Machines for Devnagri Character Recognition: An Android Application." Procedia Comput Sci. 79: 337–343.
[33] Kale S D and Prasad R S. (2017) "A Systematic Review on Author Identification Methods." Int J Rough Sets Data Anal. 4: 81–91.
[34] Villar-Rodriguez E, Del Ser J, Bilbao M N and Salcedo-Sanz S. (2016) "A feature selection method for author identification in interactive communications based on supervised learning and language typicality." Eng Appl Artif Intell. 56: 175–184.
[35] Hoorn J, Frank S, Kowalczyk W and van der Ham F. (1999) "Neural network identification of poets using letter sequences." Lit Linguist Comput. 14: 311–338.
[36] Zhao Y and Zobel J. (2005) "Effective and Scalable Authorship Attribution Using Function Words." AIRS 174–189.
[37] Tan R H R and Tsai F S. (2010) "Authorship Identification for Online Text." IEEE International Conference on Cyberworlds 155–162.
[38] Pillay S R and Solorio T. (2010) "Authorship attribution of web forum posts." IEEE eCrime Researchers Summit 1–7.
[39] Türkoglu F, Diri B and Amasyali M F. (2007) "Author Attribution of Turkish Texts by Feature Mining." International Conference on Intelligent Computing 1086–1093.
[40] Coyotl-Morales R M, Villaseñor-Pineda L, Montes-y-Gómez M and Rosso P. (2006) "Authorship Attribution Using Word Sequences." Lecture Notes in Computer Science 844–853.
[41] Kohilavani S, Mala T and Geetha T V. (2009) "Automatic Tamil Content Generation." IEEE International Conference on Intelligent Agent & Multi-Agent Systems 1–6.
[42] El Kourdi M, Bensaid A and Rachidi T. (2004) "Automatic Arabic document categorization based on the Naïve Bayes algorithm." Association for Computational Linguistics 51–58.
[43] Alsaleem S. (2011) "Automated Arabic Text Categorization Using SVM and NB." Int Arab J e-Technology 2: 124–128.
[44] Vispute S R and Potey M A. (2013) "Automatic text categorization of Marathi documents using clustering technique." IEEE International Conference on Advanced Computing Technologies 1–5.
[45] Prasad J R, Kulkarni U V and Prasad R S. (2009) "Offline Handwritten Character Recognition of Gujrati script using Pattern Matching." IEEE International Conference on Anti-counterfeiting, Security, and Identification in Communication 611–615.
[46] Prasad J R, Kulkarni U V and Prasad R S. (2009) "Template Matching Algorithm for Gujrati Character Recognition." IEEE International Conference on Emerging Trends in Engineering & Technology 263–268.
[47] Company J S and Wanner L. (2014) "How to use less features and reach better performance in author gender identification." Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC) 1315–1319.
[48] Sahani A, Sarang K, Umredkar S and Patil M. (2016) "Automatic Text Categorization of Marathi Language Documents." Int J Comput Sci Inf Technol. 7: 2297–2301.
[49] Gaikwad S, Gawali B and Mehrotra S. (2012) "Novel Approach Based Feature Extraction for Marathi Continuous Speech Recognition." International Conference on Advances in Computing, Communications and Informatics 795–804.
[50] Patil J J and Bogiri N. (2015) "Automatic text categorization: Marathi documents." IEEE International Conference on Energy Systems and Applications 689–694.
[51] Kale S D and Prasad R S. (in press) "Author Identification on Literature in Different Languages: A Systematic Survey." IEEE 2018 International Conference on Advances in Communication and Computing Technology 174–181.
