
STYLOMETRIC AUTHORSHIP BALANCED ATTRIBUTION PREDICTION METHOD

By TAREEF KAMIL MUSTAFA

Thesis Submitted to the School of Graduate Studies, University Putra Malaysia, in Fulfillment of the Requirement for the Degree of Doctor of Philosophy

May 2011

Dedicated to Professor Kamil Alshaibi God rest his soul …


Abstract of thesis presented to the Senate of University Putra Malaysia in fulfillment of the requirement for the degree of Doctor of Philosophy

STYLOMETRIC AUTHORSHIP BALANCED ATTRIBUTION PREDICTION METHOD

By TAREEF KAMIL MUSTAFA

May 2011

Chairman: Norwati Mustapha, PhD

Faculty: Computer Science and Information Technology

Stylometric authorship attribution is an important approach in the text mining field that has received growing attention because of the subtlety of the problem. It analyzes texts such as novels and plays written by famous authors and tries to measure each author's writing style by selecting attributes that belong uniquely to that author, on the assumption that every author has a distinctive artistic way of writing that no other author shares.

Two major problems hold back progress in this field: the accuracy of the predictions and the reliance on human expert judgment. Existing prediction techniques use either statistical attributes, such as frequent words, or more sophisticated semantic techniques, such as lexicons. Nonetheless, their results are still not accurate enough.

In this research, we propose a new stylometric method, Stylometric Authorship Balanced Attribution (SABA), which overcomes these problems by giving higher prediction accuracy while remaining independent of human judgment; that is, the method does not rely on domain experts. The new method is implemented by merging three methods: the computational approach, the Winnow algorithm and the Burrows-delta method.

The computational approach places all the useful attributes at hand in one equation, yielding a single value for automated decision-making. The Winnow algorithm inspires this work through the potential of its weighting technique, while the Burrows-delta method is effective at measuring the differences between the training attribute parameters and the testing attribute parameters, using measures such as the Pearson correlation.
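As a rough illustration of how a training profile can be compared against a testing profile, the sketch below implements the Pearson correlation coefficient and the classic Burrows' Delta (the mean absolute difference of z-scored frequencies). It is not the improved Burrows-delta or the SABA equation of this thesis; the function names, the use of the population standard deviation, and the toy frequency profiles are assumptions made only for illustration.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def burrows_delta(test_profile, author_profile, corpus_profiles):
    """Classic Burrows' Delta between a test text and one author.

    Every profile is a list of relative frequencies over the same
    attribute list. corpus_profiles (assumed: one profile per training
    text) supplies the mean and spread used to z-score each attribute.
    """
    n = len(test_profile)
    total = 0.0
    for i in range(n):
        column = [profile[i] for profile in corpus_profiles]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # attribute never varies, so it cannot discriminate
        z_test = (test_profile[i] - mu) / sigma
        z_author = (author_profile[i] - mu) / sigma
        total += abs(z_test - z_author)
    return total / n


# Toy profiles (hypothetical numbers, not thesis data).
corpus = [[1, 2], [3, 4]]
print(burrows_delta([1, 2], [3, 4], corpus))  # 2.0
print(pearson([1, 2, 3], [3, 2, 1]))          # close to -1
```

A smaller Delta, or a Pearson coefficient closer to 1, would point to the candidate author whose training profile most resembles the test text.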

The proposed method (SABA) also uses a set of attributes that is more effective than the frequent words method. These attributes give more evidence of an author's artistic writing style for authorship recognition and prediction, which results in higher stylometric prediction accuracy than achieved thus far. The effective attributes are the word pair and the trio, both of which are multiple-word attributes.
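To make the word pair and trio attributes concrete, here is a minimal sketch that counts adjacent two-word and three-word sequences and expresses each as a percentage of all such sequences in a text. The tokenizer, the function names and the percentage normalization are assumptions for illustration, not the extraction procedure of the thesis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into runs of word characters.

    Assumption: words are separated by whitespace or punctuation, which
    holds for many scripts but not all. Apostrophes split words.
    """
    return re.findall(r"\w+", text.lower())


def ngram_percentages(text, n, top=None):
    """Map each n-word sequence to its share (in percent) of all n-grams.

    n=1 gives frequent words, n=2 word pairs, n=3 trios. If `top` is
    given, only the `top` most frequent sequences are returned.
    """
    tokens = tokenize(text)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: 100.0 * c / total for g, c in counts.most_common(top)}


sample = "the cat sat on the cat sat"
pairs = ngram_percentages(sample, 2)
trios = ngram_percentages(sample, 3)
print(trios["the cat sat"])  # 40.0 (2 of the 5 trios)
```

Because the sketch relies only on word boundaries and counts, it needs no lexicon or language-specific settings, which is in the spirit of the unguided design described here.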

The method is also designed to handle any language without human intervention, such as manual settings or a language expert; hence it is an unguided model. Meanwhile, the source texts under investigation could be in Arabic, English, Malay or any other language, which shows that the method is language independent.

The experiment is performed using a dataset gathered from the Gutenberg website, which hosts a large collection of literary books. This research limits its scope to a collection of 50 novels by 5 famous authors who published their works in English during the 19th century. In the experiment, 10 books are assigned to each author: 9 are used for training and the 10th is used for testing.
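The split described above can be sketched as follows. The function name, the dictionary layout and the placeholder author and book names are hypothetical; the sketch only shows the 9 training books and 1 testing book per author.

```python
def split_train_test(books_by_author, n_train=9):
    """Split each author's ordered book list into training and testing.

    The first n_train books train the author's profile; the rest
    (the 10th book in the setup described here) are held out for testing.
    """
    splits = {}
    for author, books in books_by_author.items():
        if len(books) <= n_train:
            raise ValueError(f"{author}: need more than {n_train} books")
        splits[author] = (books[:n_train], books[n_train:])
    return splits


# Placeholder corpus: 5 authors with 10 books each (names are made up).
corpus = {
    f"author_{a}": [f"author_{a}_book_{b}" for b in range(1, 11)]
    for a in range(1, 6)
}
for author, (train, test) in split_train_test(corpus).items():
    print(author, len(train), len(test))  # 9 training books, 1 testing book
```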

The proposed SABA method is compared against three other methods: the computational approach, the Winnow algorithm and the Burrows-delta method. The results show that the proposed method produces superior prediction accuracy and even gives a completely correct result in the final stage of the experiment.


Abstract of thesis presented to the Senate of University Putra Malaysia in fulfillment of the requirement for the degree of Doctor of Philosophy

STAILOMETRIK PEREKAYASAAN KAEDAH TEKNIK RAMALAN PENULISAN

By TAREEF KAMIL MUSTAFA

May 2011

Chairman: Norwati Mustapha, PhD

Faculty: Computer Science and Information Technology

Stylometric authorship attribution is one of the important approaches in the field of text mining and has received much attention because of its precision. The approach is concerned with analyzing texts such as novels and plays written by prominent authors, trying to measure their writing style by selecting attributes that belong uniquely to an author, on the assumption that every author has a special artistic way of writing that other authors do not have.

There are two major problems that constrain the development of this field, namely the accuracy of prediction results and the judgment of individual experts. The techniques that handle prediction use either statistical attributes such as frequent words or sophisticated semantic techniques such as lexicons. Nevertheless, all results are still considered insufficiently accurate.

In this research, we propose a new stylometric algorithm that is able to overcome these problems by producing higher prediction accuracy without depending on human judgment, which means that the algorithm does not need to rely on domain experts. The new algorithm is developed by combining three methods, namely the computational approach, the Winnow algorithm and the Burrows-delta method.

The computational method places all the available attributes into one equation, which yields a single value for automated decision-making. The Winnow algorithm inspires this research through the potential of its weighting technique, while the Burrows-delta technique is effective in distinguishing the training attribute parameters from the testing attribute parameters, for example through the Pearson correlation.

The proposed method also uses a set of attributes that is more effective than the frequent words method. This gives higher stylometric prediction results than achieved so far, supporting the recognition and prediction of an author's artistic writing style. These effective attributes are represented by the word pair and the trio, both of which are multiple-word attributes.


The proposed algorithm is also designed to accommodate any language without requiring human intervention such as manual settings or a language expert. It is therefore an unguided model. At the same time, the source texts under study may be in Arabic, English, Malay or any other language, which shows that the algorithm is language independent.

The experiment was carried out using a dataset from the Gutenberg website, which collects a large number of literary books. However, this study limits its scope to a collection of 50 novels by 5 prominent authors who published their works in English during the 19th century. In conducting the experiments, 10 books were assigned to each author, of which 9 were used for training and the tenth was used for testing.

The proposed Stylometric Authorship Balanced Attribution (SABA) is compared with three other models, namely the computational approach, the Winnow algorithm and the Burrows-delta algorithm. The results show that the proposed algorithm produces superior prediction accuracy and even gives a fully accurate result in the final stage of the experiment.


ACKNOWLEDGEMENT

I would like to take this opportunity to thank my supervisor, Dr. Norwati Mustapha, for her support, guidance and understanding. Her comments and suggestions for further development, as well as her assistance during the writing of this thesis, are invaluable to me. Her patience, humility, tutorship, interest, and teaching and research style have given me an exceptional opportunity to learn and become a better researcher.

I would also like to thank the committee members, Dr. Masrah Azrifah Azmi Murad and Associate Professor Dr. Md. Nasir Sulaiman for their help and valuable suggestions.

My deepest appreciation goes to my family for their utmost support and encouragement, without which all of this would not have been possible. I wish good health to my wife.

For the others who have directly or indirectly helped me in the completion of my work, I thank you all.

Finally, my deepest appreciation goes to University Putra Malaysia and beautiful Malaysia for their support and encouragement, for accepting me into their community, and for giving me the feeling that I am not far from home.


APPROVAL

I certify that a Thesis Examination Committee has met on ________ to conduct the final examination of Tareef Kamil Mustafa on his thesis entitled “Stylometric Authorship Balanced Attribution Prediction Method” in accordance with the Universities and University Colleges Act 1971 and the Constitution of the University Putra Malaysia [P.U.(A) 106] 15 March 1998. The committee recommends that the student be awarded the Doctor of Philosophy. Members of the Thesis Examination Committee were as follows:

_____________________, PhD Professor Faculty of Computer Science and Information Technology University Putra Malaysia (Chairman)

_____________________, PhD Professor Faculty of Computer Science and Information Technology University Putra Malaysia (Internal Examiner)

_____________________, PhD Professor Faculty of Computer Science and Information Technology University Putra Malaysia (Internal Examiner)

_____________________, PhD Professor (External Examiner)

___________________________________ SHAMSUDDIN SULAIMAN, PhD Professor and Deputy Dean School of Graduate Studies University Putra Malaysia Date:

This thesis was submitted to the Senate of University Putra Malaysia and has been accepted as fulfillment of the requirement for the degree of Doctor of Philosophy. The members of the Supervisory Committee were as follows:

Norwati Mustapha, PhD Senior Lecturer Faculty of Computer Science and Information Technology University Putra Malaysia (Chairman)

Masrah Azrifah Azmi Murad, PhD Senior Lecturer Faculty of Computer Science and Information Technology University Putra Malaysia (Member)

Md. Nasir Sulaiman, PhD Associate Professor Faculty of Computer Science and Information Technology University Putra Malaysia (Member)

_____________________________________ HASANAH MOHD GHAZALI, PhD Professor and Dean School of Graduate Studies University Putra Malaysia Date:


DECLARATION

I hereby declare that the thesis is based on my original work except for quotations and citations which have been duly acknowledged. I also declare that it has not been previously or concurrently submitted for any other degree at UPM or other institutions.

_____________________________________ TAREEF KAMIL MUSTAFA Date:


TABLE OF CONTENTS

DEDICATION
ABSTRACT
ABSTRAK
ACKNOWLEDGEMENT
APPROVAL
DECLARATION
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER

1 INTRODUCTION
1.1 Background
1.2 Problem statement
1.3 Objectives
1.4 Scope and limitations of research
1.5 Contribution
1.6 Thesis organization

2 STYLOMETRIC AUTHORSHIP ATTRIBUTION
2.1 Introduction
2.2 Authorship attribution analysis
  2.2.1 Authorship identification
  2.2.2 Authorship characterization
  2.2.3 Similarity detection
2.3 Authorship attribution (AA)
2.4 Stylometric authorship attribution (SAA)
2.5 Real-world projects in SAA
  2.5.1 The Federalist Papers
  2.5.2 The Book of Mormons
  2.5.3 Literature with authorship
2.6 Statistical measures applied in the proposed SAA method
  2.6.1 Pearson correlation coefficient
  2.6.2 Coefficient of variance
2.7 Summary

3 LITERATURE REVIEWS
3.1 Introduction
3.2 Related methods and algorithms in SAA
  3.2.1 Content analysis
  3.2.2 Computational stylistic approach
  3.2.3 Exponentiated Gradient learning algorithm
  3.2.4 Winnow regularized algorithm
  3.2.5 Modeling long canons as Markov chains
  3.2.6 Burrows-delta method
3.3 Summary

4 METHODOLOGY
4.1 Introduction
4.2 System requirements for running experiment
4.3 Stylometric authorship attribution methodology
  4.3.1 Dataset
  4.3.2 Transforming text into stylometric database map
  4.3.3 Data preprocessing
  4.3.4 SAA feature extraction
  4.3.5 Clustering stylometric map in learning path
  4.3.6 Feature selection
  4.3.7 Classification stylometric attributes
4.4 Evaluation Method
4.5 Summary

5 STYLOMETRIC AUTHORSHIP BALANCED ATTRIBUTION METHOD
5.1 Introduction
5.2 Proposed improvements in the stylometric method
  5.2.1 Enhancement of SAA feature selection and feature extraction
  5.2.2 Clustering improvement using pair and trio attributes
  5.2.3 Building the SABA algorithm for result accuracy measurement
  5.2.4 Histogram graphical results for the SAA map
5.3 Running example
5.4 Summary

6 RESULTS AND DISCUSSION
6.1 Introduction
6.2 Model 1: The computational approach
  6.2.1 Frequent words
  6.2.2 Frequent pairs
  6.2.3 Trio words
  6.2.4 Result summary for Model 1
6.3 Model 2: The Winnow algorithm
  6.3.1 Frequent words
  6.3.2 Frequent pairs
  6.3.3 Trio words
  6.3.4 Result summary for Model 2
6.4 Model 3: Improved Burrows-delta
  6.4.1 Frequent words
  6.4.2 Word pairs
  6.4.3 Trio words
  6.4.4 Result summary for Model 3
6.5 Model 4: Stylometric Authorship Balanced Attribution
  6.5.1 Frequent words
  6.5.2 Frequent pairs
  6.5.3 Trio words
  6.5.4 Result summary for Model 4
6.6 Discussions
6.7 Summary

7 CONCLUSIONS AND FUTURE WORKS
7.1 Conclusions
7.2 Future works

REFERENCES
BIODATA OF THE AUTHOR
LIST OF PUBLICATIONS


LIST OF TABLES

Table 2.1: Results of Pearson correlation
Table 4.1: Dataset
Table 4.2: Details of the dataset
Table 5.1: Calculating SABA using Pearson
Table 6.1: Pearson coefficient in frequent words for each stylometric map
Table 6.2: Pearson coefficient in frequent pairs for each stylometric map
Table 6.3: Pearson coefficient in trio words for each stylometric map
Table 6.4: Winnow results for frequent words
Table 6.5: Winnow results for word pair
Table 6.6: Winnow results for trio words
Table 6.7: Burrows-delta method results for frequent word
Table 6.8: Burrows-delta method results for word pair
Table 6.9: Burrows-delta method results for trio words
Table 6.10: SABA results for frequent word
Table 6.11: SABA results for word pair
Table 6.12: SABA results for trio words


LIST OF FIGURES

Figure 1.1: Text mining in relation to other fields
Figure 4.1: Methodology of stylometric authorship attribution
Figure 4.2: Proposed dataset plan
Figure 4.3: Transforming text into database
Figure 4.4: Snapshot for a stylometric database map
Figure 4.5: Stylometric map compared with two testing maps
Figure 4.6: The Winnow algorithm
Figure 4.7: The experiment design
Figure 5.1: Improvements on the stylometric method
Figure 5.2: Example of proposed stylometric map database
Figure 5.3: Enhancement steps in extraction and selection of SAA method
Figure 5.4: SABA algorithm
Figure 5.5: Result of the SQL statements
Figure 5.6: Replace the frequency (Cnt) with percentage (Prcnt)
Figure 5.7: Repeating the steps in Figure 5.6 for the 9 training files
Figure 5.8: Showing the extraction of CV
Figure 5.9: The excluded maximum itemset attributes in the stylometric map
Figure 5.10: Calculating Pearson
Figure 6.1: Frequent words for Shakespeare map
Figure 6.2: Frequent words for Wilde map as a noisy case
Figure 6.3: Frequent pairs for Shakespeare map as a clear case
Figure 6.4: Frequent pairs for Wilde map as a noisy case
Figure 6.5: Trio words for Shakespeare map as a clear case
Figure 6.6: Trio words for Wilde map as a noisy case
Figure 6.7: Frequent words for London map as the best case
Figure 6.8: Frequent words for Wilde map as the worst case
Figure 6.9: Word pairs for London map as the best case
Figure 6.10: Word pairs for Wilde map as the worst case
Figure 6.11: Trio words for Shakespeare map as a clear case
Figure 6.12: Trio words for Wilde map as the worst case
Figure 6.13: Frequent words for London map as the best case
Figure 6.14: Frequent words for Wilde map as the worst case
Figure 6.15: Word pair for Shakespeare map as the best case
Figure 6.16: Word pairs for Wilde map as the worst case
Figure 6.17: Trio words for Twain map as a clear case
Figure 6.18: Trio words for Wilde map as the worst case
Figure 6.19: Frequent words for Dickens map as the best case
Figure 6.20: Frequent words for Wilde map as the worst case
Figure 6.21: Word pairs for Shakespeare map as the best case
Figure 6.22: Word pairs for Wilde map as the worst case
Figure 6.23: Trio words for Twain map as a clear case
Figure 6.24: Attribute prediction error percentages for research models
Figure 6.25: Twain prediction improvement through models
Figure 6.26: Frequent words attribute predictions
Figure 6.27: Word pairs attribute predictions
Figure 6.28: Trio words attribute predictions


LIST OF ABBREVIATIONS

AA: Authorship Attribution
CV: Coefficient of Variance
SAA: Stylometric Authorship Attribution
SABA: Stylometric Authorship Balanced Attribution
