LINK DISCOVERY FOR CHINESE/ENGLISH CROSS-LANGUAGE WEB INFORMATION RETRIEVAL

Ling-Xiang Tang BEng, MIT

Principal Supervisor: Associate Professor Shlomo Geva
Associate Supervisor: Associate Professor Andrew Trotman
Associate Supervisor: Associate Professor Yue Xu

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Electrical Engineering & Computer Science
Science and Engineering Faculty
Queensland University of Technology
July 2012

To my wife Min and my son Jerry

Keywords

Anchor Identification, Algorithm, Assessment, Assessment Tool, BM25, Chinese Segmentation, Cross-lingual Information Retrieval, Cross-lingual Link Discovery, Cross-lingual Question Answering, Development, Evaluation, Evaluation Framework, Evaluation Metrics, Evaluation Tool, Experimentation, INEX, Information Retrieval, Link Discovery, Link Recommendation, Link Probability, Machine Translation, N-Gram Mutual Information, Named Entity Translation, NTCIR, Page Name Matching, Search Engine, Software, Translation, Validation Tool, VMNET, Wikipedia, XML


Abstract

Nowadays people rely heavily on the Internet for information and knowledge. Wikipedia is an online multilingual encyclopaedia that contains a very large number of detailed articles covering most written languages, and it is often considered to be a treasury of human knowledge. It includes extensive hypertext links between documents of the same language for easy navigation. However, pages in different languages are rarely cross-linked except for direct equivalent pages on the same subject. This can pose serious difficulties for users seeking information or knowledge from sources in other languages, or where there is no equivalent page in one language or another.

In this thesis, a new information retrieval task, cross-lingual link discovery (CLLD), is proposed to tackle the lack of cross-lingual anchored links in a knowledge base such as Wikipedia. In contrast to traditional information retrieval tasks, cross-lingual link discovery algorithms actively recommend a set of meaningful anchors in a source document and establish links to documents in another language. In other words, cross-lingual link discovery is a way of automatically finding hypertext links between documents in different languages, which is particularly helpful for knowledge discovery across language domains. This study focuses specifically on Chinese / English link discovery (C/ELD), a special case of the cross-lingual link discovery task that involves natural language processing (NLP), cross-lingual information retrieval (CLIR) and cross-lingual link discovery.

To assess the effectiveness of CLLD, a standard evaluation framework is also proposed. The evaluation framework includes topics, document collections, a gold standard dataset, evaluation metrics, and toolkits for run pooling, link assessment and system evaluation. With this framework, the performance of CLLD approaches and systems can be quantified.

This thesis contributes to the research on natural language processing and cross-lingual information retrieval in CLLD: 1) a new, simple but effective Chinese segmentation method, n-gram mutual information, is presented for determining the boundaries of Chinese text; 2) a voting mechanism for named entity translation is demonstrated to achieve high precision in English / Chinese machine translation; and 3) a link mining approach that mines the existing link structure for anchor probabilities achieves encouraging results in suggesting cross-lingual Chinese / English links in Wikipedia. This approach was examined in experiments, carried out as part of the study, on the automatic generation of cross-lingual links.

The overall major contribution of this thesis is the provision of a standard evaluation framework for cross-lingual link discovery research. It is important in CLLD evaluation to have this framework, which helps in benchmarking the performance of various CLLD systems and in identifying good CLLD realisation approaches. The evaluation methods and the evaluation framework described in this thesis have been utilised to quantify system performance in the NTCIR-9 Crosslink task, the first information retrieval track of its kind.


Table of Contents

Keywords ......... i
Abstract ......... ii
Table of Contents ......... iv
List of Figures ......... x
List of Tables ......... xi
List of Abbreviations ......... xiii
Statement of Original Authorship ......... xiv
Acknowledgements ......... xv

CHAPTER 1: INTRODUCTION ......... 1
1.1 Background ......... 1
1.2 Research Problem ......... 5
1.3 Research Challenges ......... 6
1.4 Research Objectives ......... 7
1.5 Research Significance ......... 8
1.6 Research Methodology ......... 9
1.7 Thesis Outline ......... 10

CHAPTER 2: LITERATURE REVIEW ......... 13
2.1 Natural Language Processing ......... 13
2.1.1 Chinese Segmentation Overview ......... 13
2.1.1.1 Supervised Segmentation Method ......... 14
2.1.1.2 Unsupervised Segmentation Method ......... 15
2.1.1.3 Segmentation Evaluation ......... 18
2.1.2 Named Entity Recognition ......... 18
2.1.3 Chinese / English Translation ......... 20
2.2 Cross-lingual Information Retrieval ......... 20
2.2.1 Indexing ......... 20
2.2.2 Documents Ranking ......... 21
2.2.3 Performance Deciding Factors ......... 23
2.2.4 Out of Vocabulary Word Recognition ......... 23
2.2.5 Cross-language Information Retrieval ......... 24
2.3 Link Discovery ......... 24
2.3.1 Overview ......... 24
2.3.2 Best Entry Point ......... 25
2.3.3 Anchor Identification ......... 26
2.3.4 Assessment and Evaluation ......... 26
2.3.5 Mono-lingual Link Discovery ......... 27
2.3.6 Cross-lingual Link Discovery ......... 28
2.4 Other Relevant Tasks ......... 28
2.5 Summary and Implications ......... 30

CHAPTER 3: NATURAL LANGUAGE PROCESSING ......... 31
3.1 Introduction ......... 32
3.2 Chinese Segmentation with N-Gram Mutual Information ......... 35
3.2.1 A Boundary-Oriented Segmentation Method ......... 35
3.2.2 Definition of N-Gram Mutual Information ......... 35
3.3 Unsupervised N-Gram Mutual Information ......... 36
3.3.1 Definitions ......... 36
3.3.2 Segmentation Algorithm ......... 40
3.3.3 Experiments ......... 41
3.3.3.1 String Pattern Frequency Table ......... 41
3.3.3.2 Stop Words ......... 42
3.3.3.3 Test Data ......... 42
3.3.3.4 Experimental Runs ......... 42
3.3.4 Results ......... 44
3.3.4.1 Runs on the In-house Test Data ......... 44
3.3.4.2 Runs on the Bakeoff Test Data ......... 46
3.4 Supervised N-Gram Mutual Information ......... 48
3.4.1 Segmentation System Design with Supervised NGMI ......... 49
3.4.1.1 Frequency Table and Its Alignment ......... 49
3.4.1.2 Segmentation with Supervised NGMI ......... 49
3.4.1.3 OOV Word Recognition ......... 51
3.4.2 Supervised NGMI at CIPS-SIGHAN 2010 ......... 52
3.4.2.1 The Chinese Word Segmentation Bakeoff ......... 52
3.4.2.2 Experimental Setup ......... 53
3.4.3 Results and Discussions ......... 53
3.5 Summary ......... 55

CHAPTER 4: CHINESE / ENGLISH CROSS-LINGUAL INFORMATION RETRIEVAL ......... 56
4.1 Introduction ......... 57
4.2 Natural Language Processing for Question Answering ......... 59
4.2.1 Named Entity Identification ......... 59
4.2.2 A Voting Mechanism for Named Entity Translation (VMNET) ......... 60
4.2.2.1 Observations ......... 60
4.2.2.2 VMNET ......... 60
4.3 Query Generation Algorithm with VMNET ......... 61
4.4 Information Retrieval ......... 65
4.4.1 Chinese Document Processing ......... 65
4.4.2 Weighting Model ......... 65
4.4.3 English-Chinese CLIR System Design ......... 66
4.5 CLIR Experiment ......... 67
4.5.1 Test Collection and Topics ......... 67
4.5.2 Evaluation Measures ......... 67
4.5.3 CLIR Experiment Runs ......... 68
4.6 Results and Discussion ......... 68
4.6.1 Translation Evaluation ......... 68
4.6.2 IR Evaluation ......... 72
4.7 Summary ......... 74

CHAPTER 5: CROSS-LINGUAL LINK DISCOVERY EVALUATION FRAMEWORK ......... 77
5.1 Introduction ......... 78
5.2 Methodology ......... 79
5.3 An Overview of the Framework ......... 81
5.4 Task Definition ......... 82
5.5 Topics ......... 83
5.5.1 Topics for NTCIR English-to-CJK Tasks ......... 83
5.5.2 Topics for Chinese-to-English Task ......... 84
5.6 Document Collections ......... 84
5.7 Submission ......... 86
5.7.1 Rules ......... 86
5.7.2 Run Specification ......... 87
5.7.3 Anchor Offset ......... 87
5.8 Validation ......... 89
5.8.1 Validation in Run Submission ......... 89
5.8.2 Validation Tool ......... 89
5.9 Assessment ......... 90
5.9.1 Automatic Assessment ......... 91
5.9.2 Manual Assessment ......... 92
5.9.2.1 Criteria for Manual Assessment ......... 92
5.9.2.2 Manual Assessment Tool ......... 93
5.10 Evaluation Methods ......... 96
5.10.1 Link Precision and Recall ......... 96
5.10.2 System Evaluation Metrics ......... 101
5.10.3 Evaluation Tool ......... 103
5.10.3.1 User Interface ......... 103
5.10.3.2 Interpolated Precision-Recall Plot ......... 104
5.11 Summary ......... 106

CHAPTER 6: CHINESE / ENGLISH CROSS-LINGUAL LINK DISCOVERY ......... 107
6.1 Overview ......... 108
6.1.1 Chinese and English Collections ......... 108
6.1.2 A Two-Step Approach ......... 109
6.1.3 Natural Language Processing in Chinese-to-English Link Discovery ......... 109
6.1.4 CLLD Realisation ......... 109
6.2 Chinese / English Link Discovery Methods ......... 110
6.2.1 Finding Links with Link Mining ......... 110
6.2.1.1 Mono-lingual Link Probability ......... 110
6.2.1.2 Cross-lingual Link Probability ......... 111
6.2.2 Finding Links with Page Name Matching ......... 112
6.2.3 Finding Links with Cross-Lingual Information Retrieval ......... 113
6.2.4 Comparison of CLLD Methods ......... 114
6.3 Implementation of Chinese / English Link Discovery ......... 114
6.3.1 Cross-lingual Link Probability ......... 114
6.3.1.1 Chinese-to-English Link Probability ......... 115
6.3.1.2 English-to-Chinese Link Probability ......... 117
6.3.2 Cross-lingual Name Matching ......... 119
6.3.3 Cross-lingual Information Retrieval ......... 120
6.4 Experimental Runs ......... 120
6.4.1 Chinese-to-English Runs ......... 121
6.4.2 English-to-Chinese Runs ......... 122
6.5 Results and Discussion ......... 123
6.5.1 Chinese to English Link Discovery ......... 123
6.5.2 English to Chinese Link Discovery at NTCIR-9 ......... 127
6.5.2.1 Evaluation of Link Mining Runs ......... 127
6.5.2.2 Evaluation of Page Name Matching Runs ......... 131
6.5.2.3 Evaluation of CLIR Runs ......... 131
6.5.2.4 Comparison with Other Teams ......... 131
6.6 Summary ......... 136

CHAPTER 7: THE EFFECTIVENESS OF CROSS-LINGUAL LINK DISCOVERY ......... 138
7.1 Introduction ......... 139
7.2 Manual Assessment ......... 140
7.2.1 Link Pooling for Assessment ......... 140
7.2.2 Human Assessors ......... 141
7.2.3 Overlapping Anchors ......... 142
7.2.4 Assessment Tool ......... 142
7.2.5 The Wikipedia Ground-Truth Run ......... 143
7.2.6 Links Found in Manual Assessment ......... 144
7.2.7 The Validity of Automatic Assessment ......... 144
7.3 Link Evaluation ......... 146
7.3.1 Evaluation Types and Measures ......... 146
7.3.2 Evaluation Results ......... 146
7.3.3 Comparison of CLLD Algorithms ......... 148
7.3.4 Unique Relevant Links ......... 151
7.4 Discussion: CLLD in Action ......... 153
7.5 Summary ......... 154

CHAPTER 8: CONCLUSIONS ......... 155
8.1 Conclusions ......... 156
8.2 Contributions ......... 158
8.3 Limitations and Future Work ......... 159

BIBLIOGRAPHY ......... 161

List of Figures

Figure 1-1: Cross-lingual Linking in Wikipedia ......... 2
Figure 1-2: The Wikipedia Pages on "flower crab" ......... 3
Figure 1-3: Lost in Translation ......... 4
Figure 1-4: The Research Method Employed for This Thesis ......... 9
Figure 4-1: The CLIR System Design ......... 66
Figure 5-1: The Cross-Lingual Link Discovery Evaluation Methodology ......... 80
Figure 5-2: Crosslink Run Validation Tool ......... 90
Figure 5-3: Crosslink Manual Assessment Tool (Chinese-to-English) ......... 94
Figure 5-4: Crosslink Manual Assessment Tool (English-to-Chinese) ......... 94
Figure 5-5: Crosslink Manual Assessment Tool (English-to-Japanese) ......... 95
Figure 5-6: Crosslink Manual Assessment Tool (English-to-Korean) ......... 95
Figure 5-7: Crosslink Evaluation Tool ......... 103
Figure 5-8: An Example of Interpolated Precision-Recall Plot ......... 104
Figure 6-1: The size of Chinese and English Wikipedia ......... 108
Figure 6-2: Cross-lingual triangulation (English-to-Chinese) ......... 111
Figure 6-3: Cross-lingual triangulation (Chinese-to-English) ......... 112
Figure 6-4: A Flowchart of Proposed CLLD Methods ......... 113
Figure 6-5: An Example of Chinese-to-English Document Linking ......... 122
Figure 6-6: The Interpolated Precision-Recall Curves for the Different Methods ......... 124
Figure 6-7: The Interpolated Precision-Recall Curves of Runs ......... 130
Figure 7-1: The NTCIR-9 Crosslink Manual Assessment Tool ......... 143
Figure 7-2: Interpolated precision-recall graph showing F2F evaluation of Wikipedia ground-truth runs on their own manual assessments ......... 145
Figure 7-3: Interpolated Precision-Recall Graph Showing En-2-Zh F2F Evaluation against the Automatic Assessments ......... 147
Figure 7-4: Interpolated Precision-Recall Graph Showing En-2-Zh A2F Evaluation against the Manual Assessments ......... 147
Figure 7-5: A Prospective Anchor Found for the Topic "Croissant" ......... 152


List of Tables

Table 3-1: The Results of Chinese Segmentation of Different Systems ......... 33
Table 3-2: Information of Segmentation Runs ......... 43
Table 3-3: The Segmentation Results on the In-House Test Data ......... 45
Table 3-4: Recall of Segmentation Runs on In-House Test Data ......... 46
Table 3-5: The Segmentation Results on the Bakeoff Test Data ......... 47
Table 3-6: Recall of Segmentation Runs on the Bakeoff Test Data ......... 47
Table 3-7: System Settings for both Closed and Open Training Evaluations ......... 53
Table 3-8: Segmentation Results for Four Domains in both Closed and Open Evaluations ......... 54
Table 4-1: Question Templates in Question Answering ......... 58
Table 4-2: Statistics of Test Corpus and Topics ......... 67
Table 4-3: The Misspelled Terms in Topics ......... 68
Table 4-4: The Descriptions of CLIR Experimental Runs ......... 69
Table 4-5: Translation Evaluation Results ......... 69
Table 4-6: The Topics with OOV Phrases ......... 70
Table 4-7: Translations of OOV Phrases ......... 71
Table 4-8: Results of All Experimental Runs ......... 72
Table 4-9: Alternative Translations ......... 74
Table 5-1: Training Topics for Crosslink Task ......... 83
Table 5-2: Test Topics for Crosslink Task ......... 83
Table 5-3: Chinese Topics for Chinese-to-English Link Discovery ......... 85
Table 5-4: Article Statistics of CJK Corpora ......... 86
Table 5-5: Submission XML File DTD ......... 88
Table 6-1: Pros and Cons of Three Link Discovery Approaches ......... 114
Table 6-2: Extracts from Tlang ......... 115
Table 6-3: Extracts from Tlink-chinese ......... 116
Table 6-4: Extracts from Tlink-english ......... 118
Table 6-5: Extracts from Tlang with Title Mapping of all CJK Languages ......... 119
Table 6-6: Information of Chinese-to-English Runs ......... 121
Table 6-7: Information of English-to-Chinese Runs ......... 123
Table 6-8: Performance of Chinese-to-English Experimental Runs ......... 124
Table 6-9: Example Translation Errors in the Runs ......... 125
Table 6-10: LMAP and R-Prec Scores of English-to-Chinese Runs in both F2F and A2F Evaluations ......... 128
Table 6-11: P@N Scores of English-to-Chinese Runs in Both F2F and A2F Evaluation ......... 129
Table 6-12: LMAP Scores and Rankings of Run QUT_LinkProb_ZH in Evaluations for English-to-Chinese Link Discovery ......... 132
Table 6-13: The Top Six Runs in F2F Evaluation with Wikipedia Ground-Truth (Measured with Precision-at-5) ......... 133
Table 6-14: The Top Six Runs in F2F Evaluation with Manual Assessment Results (Measured with Precision-at-5) ......... 133
Table 6-15: The Top Six Runs in F2F Evaluation with Manual Assessment Results (Measured with Precision-at-5) ......... 134
Table 6-16: A Comparison of Number of Unique Links Found in En-2-Zh Evaluation ......... 135
Table 6-17: Number of Unique Relevant Links Found by Each QUT Run in Automatic Assessment Link Set ......... 135
Table 6-18: Number of Unique Relevant Links Found by Each QUT Run in Manual Assessment Link Set ......... 136
Table 7-1: Average Number of Links in Pooling ......... 140
Table 7-2: Assessors Information ......... 141
Table 7-3: Links in the Result Sets of Two Different Assessments ......... 144
Table 7-4: LMAPs of Teams in Two Evaluations ......... 148
Table 7-5: Comparison of different implementations of four CLLD systems ......... 150
Table 7-6: Unique Relevant English-to-Chinese Links ......... 152


List of Abbreviations

The following table lists the abbreviations and acronyms used in the thesis. The third column gives the number of the page where the abbreviation or acronym is defined or first used.

Abbreviation   Meaning                                          Page
A2F            anchor to file                                   96
ASP            adjoining segment pairs                          51
BC             boundary confidence                              35
BEP            best entry point                                 25
BMM            backward maximum match                           51
C/ELD          Chinese / English link discovery                 4
CLIR           cross-lingual information retrieval              5
CLLD           cross-lingual link discovery                     4
CLQA           cross-lingual question answering                 57
CRF            conditional random field                         15
DTD            document type definition                         26
F2F            file to file                                     96
INEX           initiative for the evaluation of XML retrieval   25
IR             information retrieval                            57
LMAP           link mean average precision                      101
MAP            mean average precision                           67
MT             machine translation                              30
NE             named entity                                     18
NGMI           n-gram mutual information                        31
NTCIR          NII test collection for IR systems               24
OOV            out of vocabulary                                23
PNM            page name matching                               104
QUT            Queensland University of Technology              141
TREC           text retrieval conference                        25
VMNET          voting mechanism of named entity translation     60
WSD            word sense disambiguation                        26

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.


Signature:

_________________________

Date:

_________________________


Acknowledgements

First of all, my deepest gratitude goes to my principal supervisor, Dr. Shlomo Geva, for his guidance, encouragement and patience throughout the journey of my thesis. I am also much indebted to my associate supervisor Dr. Andrew Trotman, who inspired my enthusiasm for practical programming for information retrieval and helped me improve my skills in academic writing. I am also grateful to my associate supervisor Dr. Yue Xu, who provided generous comments and corrections for all my work. I have been extremely fortunate to have such an excellent supervisory team, who provided a tremendous amount of support in helping me finish this research.

I would like to thank the HPC team of QUT, especially Ashley Wright and Mark Dwyer, who provided generous support in assisting me to create a very large English Wikipedia corpus with millions of articles. Without their help, there would be no experiments on cross-lingual link discovery.

I really appreciate the opportunity I had to run the cross-lingual link discovery task at NTCIR-9. The evaluation framework used for system performance benchmarking in the task was partially funded by the National Institute of Informatics (Japan). I would like to acknowledge the people behind the NTCIR project for providing such an excellent evaluation platform and making the Crosslink task possible. In particular, I am very grateful to Ms. Sugimoto Miho, Dr. Noriko Kando, Dr. Tetsuya Sakai and Dr. Hideo Joho for their assistance in the organisation of the Crosslink task.

I also thank Mrs. Diane Josey, who helped proof-read this thesis and corrected many grammatical errors in my writing. Writing has always been a challenge for me, and her valuable comments have certainly helped raise my academic writing to a higher level.

I would like to express my thanks to my friends for their friendship, especially Zhu-Feng Chen, who has faith in me. It has been fun to work with my colleagues and fellow students too. Special thanks go to Kelly Y. Itakura, Darren Huang, Elly Liang, and Chris De Vries for providing valuable advice and help. Huge thanks also go to Wei Song, Yan Shen, Bin Liu, Tony Wang and the Korean Student Association of QUT (particularly Yong Jae Lee, Tina Son, and Hye Jung Yang), who helped assess the cross-lingual link discovery submissions, a very important part of my research.

I am grateful to the many QUT staff members who have been so kind and helpful in dealing with my inquiries regarding my research and other issues. In this respect many thanks are due to Andrea Russo, Nicole Levi, Matt Williams, Ann Ahlberg and Susan Teare.

Finally and most importantly, I would like to thank my family, especially my wife, Min, who has supported me all these years. Her encouragement and love always give me hope. I thank my son Jerry; he is the joy of my life and always gave me a cuddle when I was down. I thank my parents Mei-Lan Li and Rong-Shi Tang, my aunts Ai-Lan Li and Yu-Lan Li, my uncles Zhang-Xin Huang and Tu-Jin Li, and other family members. They are always on my side.


Chapter 1: Introduction

This chapter gives the background (section 1.1), the research problem (section 1.2), the challenges (section 1.3), and the objectives (section 1.4) of the research. Section 1.5 describes the significance of this research. Section 1.6 then presents the research methodology and design. Finally, section 1.7 outlines the remaining chapters of the thesis.

1.1 BACKGROUND

Nowadays people rely heavily on the Internet for information and knowledge. Wikipedia is a multilingual knowledge base that contains a very large number of detailed articles covering most written languages. It is often considered to be a treasury of human knowledge. As the largest free online encyclopaedia, Wikipedia has a tremendous influence on people's daily lives. People search and discover information on Wikipedia, and many contribute to the wiki community by adding content based on their knowledge.

For easy navigation and management, Wikipedia includes extensive hypertext links between documents of the same language. However, pages in different languages are rarely cross-linked except for direct equivalent pages (on the same subject) in different languages. Wikipedia generally only inter-links pages in the same language. Articles on the same topic but in different language domains are sometimes connected through a single page-to-page language link. These links are not specifically anchored in the text, since the respective pages refer to the same general topic and the link is general and generic. Even so, sometimes those links are missing, because maintaining such a huge multilingual knowledge base is very difficult.


A few scenarios of cross-lingual knowledge discovery are shown below:

Scenario 1. A martial artist would like to find the Chinese / Japanese / Korean words for the names of various martial arts.

Figure 1-1 shows a snippet of the Martial Arts Wikipedia article in which anchors are linked only to related English articles about different types of martial arts; direct links to other related Chinese/Japanese/Korean articles do not exist in Wikipedia.

Figure 1-1: Cross-lingual Linking in Wikipedia



Scenario 2. Visitors to Hong Kong may notice that the word 花蟹 ("flower crab", but also a colloquialism for the ten-dollar note) is often mentioned with respect to money, and consequently would like to find a detailed English explanation as to why.


Figure 1-2: The Wikipedia Pages on "flower crab"

Figure 1-2 shows English and Chinese Wikipedia pages on the Hong Kong ten-dollar note. From the figure, it can be seen that there should be bi-directional language links, but that they have not yet been created. The boxed texts in the Chinese page could be used to further generate anchored links for multi-lingual users to explore the English counterparts of those anchors.



Scenario 3. A gourmet in China likes reading English Wikipedia articles about exotic foods. "Crema pasticcera" is a term that often appears in such articles, but what is "Crema pasticcera", and what is the corresponding word in Chinese?

Figure 1-3 shows several different language versions of the page on “Crema pasticcera”. Note that: 1) anchors are largely linked to articles in the source languages; 2) not all cross-language equivalent links exist – the Italian custard article “Crema pasticcera” is not linked to the English article “Custard”, and vice versa; 3) some cross-language equivalent links are incorrect – the Chinese custard article “奶黄” is not very precisely linked to the Italian pudding article “Budino”, and vice versa.


Figure 1-3: Lost in Translation

As shown in the above examples, the lack of cross-lingual links in Wikipedia can pose serious difficulties for multi-lingual users seeking information or knowledge from sources in different languages, or where there is no equivalent page in one language or another. Such users are forced to use one language version of the resource and cannot easily switch languages where appropriate. Specifically, because Wikipedia content is solicited from users, the different language versions have evolved at different rates and are unbalanced in coverage (and sometimes differently biased in content). A user may prefer multiple explanations, or just the one in their preferred language, or the richer content, or to extend their understanding of a language through reading translations.

With this background in mind, this thesis explores solutions for cross-lingual link discovery (CLLD) that can give users easy access to cross-lingual information and break language barriers in knowledge sharing. This thesis particularly focuses on Chinese / English link discovery (C/ELD).


1.2 RESEARCH PROBLEM

A link, often referred to as a hyperlink in the language of the web, comes with a piece of anchor text and is a navigation entity in a document that points to an entry point either within the same document or in another document. Links in the articles of a knowledge base are provided for easy access to other related information that may also interest users. For Wikipedia users especially, links are the most powerful elements for information navigation: information access can be facilitated via a simple click on a link. However, maintaining links in Wikipedia is difficult because most links are manually inserted by content contributors.

Link discovery is a way of automatically finding hyperlinks between documents. For English, there are several mono-lingual link discovery tools. These help topic curators of knowledge bases discover and maintain appropriate anchors and targets that can be added to a given document. No such tools yet exist to support linking across multiple languages; the example in Figure 1-3 (above) shows a need for this.

To achieve automatic cross-lingual link discovery in knowledge bases, the process of cross-lingual document linking can be divided into two phases: 1) a natural language processing phase, detecting prospective anchors in the source document; and 2) a cross-linking phase, identifying relevant articles in the target language. In contrast to traditional cross-lingual information retrieval (CLIR) tasks, cross-lingual link discovery can be viewed as a process that takes each identified anchor as a "query" and searches for relevant links in the target document collection. The research therefore involves studies on natural language processing, cross-lingual information retrieval and cross-lingual link discovery, and can be considered a three-stage study. For each stage of the research, the performance of the realisation methods for Chinese segmentation, cross-lingual information retrieval, and cross-lingual link discovery is measured separately.

To quantify the performance of CLLD systems, an evaluation framework must first be developed for performance measurement. This evaluation framework should include: 1) experiment data; 2) evaluation metrics; and 3) evaluation tools. A quantified score for an experimental run is computed against the gold standard dataset with the defined evaluation metric(s).
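To make the two-phase process described above concrete, the sketch below treats each already-identified anchor (phase 1 output) as a bag of translated query terms and ranks target-language documents by simple term overlap (phase 2). This is an illustrative toy only, not the system built in this thesis: a real implementation would use proper segmentation, machine translation, and a weighting model such as BM25, and all names and data here are hypothetical.

from collections import Counter

def discover_links(anchors, target_docs, top_k=3):
    """Phase 2 of cross-lingual link discovery as retrieval: score each
    target-language document by how often the anchor's translated terms
    occur in it, then keep the top_k targets per anchor."""
    links = {}
    for anchor, translated_terms in anchors.items():
        scored = []
        for doc_id, text in target_docs.items():
            term_freq = Counter(text.lower().split())
            score = sum(term_freq[t] for t in translated_terms)
            if score > 0:
                scored.append((score, doc_id))
        links[anchor] = [doc for _, doc in sorted(scored, reverse=True)[:top_k]]
    return links

# Hypothetical phase-1 output: anchors found in a Chinese source article,
# each with candidate English translations attached.
anchors = {"武术": ["martial", "arts"], "花蟹": ["flower", "crab"]}
target_docs = {
    "en_martial_arts": "martial arts are codified systems of combat",
    "en_ten_dollar_note": "the flower crab nickname refers to the hong kong ten-dollar note",
}
print(discover_links(anchors, target_docs))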


The research problems of this thesis are: 1) to find out how natural language processing and cross-lingual information retrieval affect cross-lingual link discovery; 2) to develop an assessment and evaluation framework to quantify the performance of CLLD approaches and systems; and 3) to suggest an algorithm for high quality automated Chinese / English cross-lingual link generation in Wikipedia.

1.3 RESEARCH CHALLENGES

Chinese / English link discovery focuses on solving the cross-lingual knowledge discovery problems between Chinese and English Wikipedia. To achieve the goal of being able to provide easy access to cross-lingual information between these two knowledge domains, the following challenges must be considered.

Natural Language Processing Issue

Chinese Wikipedia is a mix of traditional, simplified, and classical Chinese writing, as well as of several Chinese variants and dialects. This may cause serious problems in Chinese text processing, such as in Chinese segmentation and Chinese-to-English translation, because most state-of-the-art segmentation and translation software can handle only one form of Chinese writing at a time. For Chinese / English link discovery, identifying a set of meaningful anchors in Chinese articles can be problematic because word boundaries are not marked in Chinese text. Even once prospective anchors in a source document are identified, properly ranking them and finding relevant articles for them in a different language is a further challenge.
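As a concrete illustration of the missing-boundary problem, the sketch below scores the gap between each pair of adjacent characters with pointwise mutual information estimated from corpus frequencies; a low score suggests a likely word boundary. This is a deliberate simplification of the n-gram mutual information (NGMI) method developed in Chapter 3, not the thesis's actual formula, and the corpus counts are made up.

import math
from collections import Counter

def boundary_scores(text, unigrams, bigrams, total):
    """Return a PMI score for each gap between adjacent characters in
    text; the lower the score, the more likely the gap is a boundary."""
    scores = []
    for a, b in zip(text, text[1:]):
        p_ab = bigrams[a + b] / total
        if p_ab == 0:
            scores.append(float("-inf"))  # never seen together: a clear boundary
            continue
        pmi = math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))
        scores.append(pmi)
    return scores

# Tiny made-up "corpus"; real counts would come from a large collection.
corpus = "花蟹花蟹香港花蟹香港"
unigrams = Counter(corpus)
bigrams = Counter(a + b for a, b in zip(corpus, corpus[1:]))
# The gap between 港 and 花 gets the lowest score, signalling the
# boundary between the words 香港 and 花蟹.
print(boundary_scores("香港花蟹", unigrams, bigrams, total=len(corpus)))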

Chinese / English Cross-lingual Link Discovery Realisation Issue

Mono-lingual link discovery has been studied for many years, and state-of-the-art realisation methods have been developed. Adapting these approaches for cross-lingual link discovery is possible. However, unlike in mono-lingual link discovery, a CLLD system must not only identify missing anchors in source documents but also jump across the language boundary to recommend relevant target links.


Cross-lingual Link Discovery Evaluation Issue

For Chinese segmentation and cross-lingual information retrieval, standard evaluation datasets and tools have already been developed by researchers in those fields. However, no standard datasets or tools exist for CLLD evaluation. As this is a new research field in knowledge discovery, providing a standardised evaluation framework, including criteria and procedures, for cross-lingual link discovery research is the greatest challenge in this study.

The evaluation of traditional information retrieval tasks required the construction of a standard set of experiment data. Similarly, the research into providing a standardised evaluation framework for cross-lingual link discovery needs to construct its own specific set of experiment data: topics, document collections (corpora) and a gold standard dataset (qrel). To create a gold standard dataset for the performance evaluation of CLLD methods and systems, the links recommended for the given topics in the experiments must be assessed. Link assessment in cross-lingual link discovery is a very difficult task because of the number of degrees of freedom available to the assessors when judging the relevance of identified anchors and their associated target links. When defining the CLLD experiment for system evaluation, the choice of the numbers of anchors and links should be reasonable, because it must be feasible to have sufficient recommended anchors and links assessed for proper system evaluation when the CLLD experiments are conducted.

Importantly, to be able to benchmark a CLLD system, user-friendly tools must be developed to facilitate the evaluation process, that is, to support run validation, link assessment and run evaluation. With all these supporting tools available, the evaluation framework can then be completed.
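The core of such an evaluation can be sketched in a few lines: a run (the links a system recommends per topic) is scored against the gold standard (qrel) with set-based precision and recall at the file-to-file level. This is a minimal illustration under assumed data structures, not the framework's actual implementation; the metrics really used, such as LMAP, are rank-based and are defined in Chapter 5, and all identifiers below are hypothetical.

def f2f_precision_recall(run, qrel):
    """File-to-file scoring: a recommended link counts as relevant if
    its target document is in the gold-standard set for the topic."""
    results = {}
    for topic, recommended in run.items():
        gold = qrel.get(topic, set())
        hits = sum(1 for target in recommended if target in gold)
        precision = hits / len(recommended) if recommended else 0.0
        recall = hits / len(gold) if gold else 0.0
        results[topic] = (precision, recall)
    return results

# Hypothetical run and gold standard for a single topic.
run = {"topic_001": ["zh_12", "zh_34", "zh_99"]}
qrel = {"topic_001": {"zh_12", "zh_56"}}
print(f2f_precision_recall(run, qrel))  # {'topic_001': (0.33..., 0.5)}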

1.4 RESEARCH OBJECTIVES

Cross-language link discovery in this research aims to resolve the research problems and deal with the challenges discussed in the previous sections. The primary research objectives of this thesis are:

• Objective 1. Realisation of Chinese / English bi-directional cross-lingual link discovery.


This research objective is to suggest a good algorithm for realising Chinese / English bi-directional cross-lingual link discovery in Wikipedia, and to determine the effect of NLP and CLIR on CLLD. A prototype cross-language link discovery (CLLD) system will also be developed.

• Objective 2. Development of an evaluation framework for cross-lingual link discovery.

Without a standardised evaluation framework, the effectiveness of cross-lingual link discovery approaches cannot be properly assessed. The evaluation framework will include test topics, document collections, a gold standard dataset, evaluation metrics, and toolkits for run pooling, link assessment and system evaluation.

1.5 RESEARCH SIGNIFICANCE

This thesis brings cross-lingual link discovery to knowledge base management and helps break language barriers in knowledge sharing. With a special focus on Chinese / English link discovery, it improves the understanding of the effects of Chinese / English natural language processing and information retrieval in cross-lingual link discovery. Practically, the outcomes of this research can be used to develop real-life CLLD applications that facilitate cross-lingual knowledge discovery in Wikipedia, which will be especially beneficial for Chinese / English bi-lingual users. Users are then able to discover documents in languages with which they are familiar, or which have a richer set of documents than their language of choice.

Another major contribution of this thesis is the provision of a standard evaluation framework for cross-lingual link discovery research. It is important in CLLD evaluation to have this framework, which can help in benchmarking the performance of various CLLD systems and in identifying good CLLD realisation approaches. With the evaluation framework, a CLLD information retrieval track can be run to provide a standardised evaluation procedure to CLLD research groups. The developed evaluation framework can also be adapted for the evaluation of other cross-lingual information retrieval tasks.


1.6 RESEARCH METHODOLOGY

As the aim is to achieve Chinese / English bi-directional cross-lingual link discovery, various realisation approaches will be experimented with. The effectiveness, efficiency and feasibility of these approaches must be supported by the results of properly designed CLLD experiments. Overall, the experimental method for accomplishing the research goals is illustrated in Figure 1-4.

Figure 1-4: The Research Method Employed for This Thesis


1.7 THESIS OUTLINE

The remainder of this thesis is organised as follows:



Chapter 2:

This chapter reviews the literature on the three main aspects of this research: 1) natural language processing; 2) cross-lingual information retrieval; and 3) link discovery.



Chapter 3:

This chapter discusses the natural language processing required in realising Chinese / English link discovery. The relevant publications for this part of the work are listed below:

Tang, L.-X., Geva, S., Xu, Y., & Trotman, A. (2009). Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information. Proceedings of the 14th Australasian Document Computing Symposium (ADCS 2009) (pp. 82-99). University of New South Wales, Sydney: School of Information Technologies, University of Sydney.



Tang, L.-X., Geva, S., Trotman, A., & Xu, Y. (2010). A Boundary-Oriented Chinese Segmentation Method Using N-Gram Mutual Information. Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (pp. 234-239). Chinese Information Processing Society of China.




Chapter 4:

This chapter discusses cross-lingual information retrieval (CLIR), which is another important part of overall cross-lingual link discovery. The relevant publications for this part of the work are listed below:

Tang, L.-X., Trotman, A., Geva, S., & Xu, Y. (2010). Wikipedia and Web document based Query Translation and Expansion for Cross-language IR. Proceedings of NTCIR-8 (pp. 121-125). Tokyo, Japan.



Tang, L.-X., Geva, S., Trotman, A., & Xu, Y. (2010). A Voting Mechanism for Named Entity Translation in English-Chinese Question Answering. Proceedings of the 4th Workshop on Cross Lingual Information Access (pp. 43-51). Beijing, China: Coling 2010 Organizing Committee.



• Chapter 5: This chapter details an assessment and evaluation framework for the evaluation of cross-lingual link discovery approaches. The relevant publication for this part of the work is:

Tang, L.-X., Geva, S., Trotman, A., Xu, Y., & Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. Proceedings of NTCIR-9 (pp. 437-463). Tokyo, Japan.




• Chapter 6: This chapter examines automated cross-lingual hypertext identification in Wikipedia, focusing on the particular case of Chinese / English bi-directional link discovery. The study includes the effects of Chinese natural language segmentation and translation on CLLD. The relevant publication for this part of the work is:

Tang, L.-X., Cavanagh, D., Trotman, A., Geva, S., Xu, Y., & Sitbon, L. (2011). Automated Cross-lingual Link Discovery in Wikipedia. Proceedings of NTCIR-9 (pp. 512-519). Tokyo, Japan.



• Chapter 7: This chapter discusses the effectiveness of cross-lingual link discovery in assisting easy cross-lingual information access in Wikipedia, and emphasises the importance of manual evaluation in quantifying the performance of CLLD systems. The relevant publication for this part of the work is:

Tang, L.-X., Itakura, K. Y., Geva, S., Trotman, A., & Xu, Y. (2011). The Effectiveness of Cross-lingual Link Discovery. Proceedings of the Fourth International Workshop on Evaluating Information Access (EVIA) (pp. 1-8). Tokyo, Japan.



• Chapter 8: This chapter concludes the thesis and points to directions for future work.


Chapter 2: Literature Review

This chapter is structured as literature reviews of the three main aspects of this research: 1) natural language processing; 2) cross-lingual information retrieval; and 3) link discovery. In section 2.1, Chinese segmentation and Chinese / English translation methods are reviewed. Section 2.2 reviews cross-lingual information retrieval approaches. Studies on current link discovery research are then outlined in section 2.3. Other relevant information retrieval tasks are covered in section 2.4. Section 2.5 summarises a conceptual framework for cross-lingual link discovery that can be built from previous studies of these three aspects.

2.1 NATURAL LANGUAGE PROCESSING

2.1.1 CHINESE SEGMENTATION OVERVIEW

Chinese segmentation is the process of breaking a string of Chinese text into its meaningful component words. Machine text processing tasks such as machine translation or query analysis may find it difficult to handle Chinese documents because there are no word boundaries in written Chinese. Without proper segmentation of the Chinese text beforehand, capturing the exact meanings of sentences in automatic text processing tasks is impossible, so text segmentation is often the first step in Chinese text processing. Chinese segmentation is also vital to the cross-lingual link discovery methods proposed in this thesis: it is probable that neither meaningful anchors nor their associated relevant links can be created if a Chinese article is not properly segmented.

The Chinese segmentation problem has been an interesting research topic for decades, and many sophisticated methods have been developed to provide good Chinese segmentation for natural language processing. In general, on the basis of the required human effort, Chinese word segmentation approaches can be classified into two categories:

• Supervised methods

These are often training-based or rule-based methods, which require specific language knowledge. Normally, a pre-segmented corpus is employed to train the segmentation models, e.g. PPM (Teahan, McNab, Wen, & Witten, 2000), or word lexicons need to be prepared for dictionary-based methods, e.g. CRF (Peng, Feng, & McCallum, 2004). Supervised segmentation methods can achieve very high precision in the targeted knowledge domain with the help of a training corpus (a collection of manually segmented text). However, the drawbacks of supervised methods are also obvious. The effort needed for preparing the manually segmented corpus and for parameter tuning is extensive. Also, the selected corpus, drawn mainly from modern Chinese text sources, may cover only a small portion of Wikipedia Chinese text. Out-of-vocabulary words are also problematic for dictionary-based methods: the combinations of different writing forms and different variants can lead to different representations of the same word. Furthermore, according to the result summary of the 2nd International Chinese Word Segmentation Bake-off (SIGHAN, 2005), the rankings of participants' results on different corpora are not very consistent, which may indicate that the supervised methods used in their segmentation systems are sensitive to the writing form (simplified or traditional).

• Unsupervised methods

Unsupervised methods are less complicated than supervised methods, and commonly need only simple statistical data derived from known text to perform the segmentation. For instance, statistical methods using different mutual information formulas to extract two-character words rely on the bi-gram statistics of a corpus (Dai, Loh, & Khoo, 1999; Sproat & Shih, 1990). Unsupervised methods are suitable for general Chinese segmentation where there is no, or only limited, training data available. The resulting segmentation accuracy may not be very satisfactory, but the human effort of creating a training data set is avoided.

2.1.1.1 SUPERVISED SEGMENTATION METHOD

Teahan et al. (2000) use an adaptive language model, a standard modelling method in text compression, to infer the positions of word boundaries. This model uses prediction by partial matching (PPM), a symbol-wise compression scheme, to predict the upcoming character given the known previous ones. A collection of manually segmented texts is used to train the model before it is applied to new text. The new text can then be broken into separate words where the maximum compression is achieved. They believe this approach could also be applied to name recognition with good outcomes.

Peng, Feng and McCallum (2004) show that using linear-chain Conditional Random Fields (CRFs) and integrating them with lexicon-based domain knowledge produces strong results in Chinese word segmentation. The performance can be further improved by introducing a probabilistic new-word detection method. Conditional random fields are probabilistic models that are widely used for labelling and segmenting sequential data (Lafferty, McCallum, & Pereira, 2001).

Kit, Pan and Chen (2002) examine a supervised case-based approach to resolving ambiguity in Chinese word segmentation. Their experimental results show that more than 90% of the ambiguities found were recognised. They further conclude that segmentation accuracy could theoretically reach 99.5% if their approach were combined with a segmentation method with precision greater than 95%.

2.1.1.2 UNSUPERVISED SEGMENTATION METHOD

Purely statistical methods for word segmentation are relatively less studied. Statistical approaches make use of statistical information extracted from text to identify words; the text itself is the only "training" corpus used by the segmentation models. Generally, the statistical methods used in Chinese segmentation can be classified into the following groups:

• Information theory (e.g. entropy and mutual information)
• Accessor variety
• T-score
• Others

Among the methods using a mutual information formula, the most prominent, and also the one with the longest history, is the original mutual information formula. Mutual information and its derived work are mainly used in finding bigram words; it is believed that 99% of Chinese words are two-character words (Sproat & Shih, 1990).

Mutual information is used to measure the strength of association of two adjoining characters: the stronger their association, the more likely they are to form a word. The formula used for calculating the association score of two adjacent characters is:

A(xy) = \log_2 \frac{p(xy)}{p(x)\,p(y)} = \log_2 \frac{N \cdot freq(xy)}{freq(x) \cdot freq(y)}   (2.1)

where A(xy) is the association score of the bi-gram characters xy; freq(x) is the frequency of character x occurring in the given corpus; freq(xy) is the frequency of the two-character sequence (x followed by y) occurring in the corpus; N is the size of the given corpus; and p(x) is the probability of character x occurring in the corpus, calculated as freq(x)/N.

Based on Sproat's work, Dai and Loh (1999) further developed an improved mutual information (IMI) formula, equation (2.2), to segment bigram words using regression analysis. Their experimental results indicate that this formula provides precision similar to that of the original mutual information formula. They also developed a contextual information (CI) formula, equation (2.3), which additionally considers the frequencies of the character preceding and the character following the bigram: given a character sequence vxyz, the association strength of the bigram xy is calculated from probabilities that are weighted by the number of documents in which the given character or bigram appears. The contextual information formula proved to have better precision than the IMI formula, with a 7% improvement on average.
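As an illustration of equation (2.1), the following minimal sketch computes the association score from corpus frequencies; the frequencies used here are made up, not taken from this study's corpus:

```python
import math

def association_score(freq_x, freq_y, freq_xy, corpus_size):
    """Mutual information association score of two adjacent
    characters x and y, as in equation (2.1):
    A(xy) = log2(p(xy) / (p(x) * p(y)))."""
    p_x = freq_x / corpus_size
    p_y = freq_y / corpus_size
    p_xy = freq_xy / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical corpus statistics for two candidate bigrams.
N = 27_360_399  # corpus size in characters
print(association_score(12_000, 9_500, 4_200, N))  # strongly associated pair
print(association_score(12_000, 9_500, 15, N))     # weakly associated pair
```

A high score suggests the two characters belong to the same word; a low score suggests a word boundary between them.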

There are many other ways in which researchers use text statistics in different formulae to perform word segmentation. Yang et al. (1998) proposed a method for Chinese language modelling by constructing a segment pattern lexicon. First, the lexicon is built from string patterns selected according to the average Kullback-Leibler distance of characters; the segmentation model can then be developed based on this lexicon.

Sun et al. (1998) tried to segment words without using a lexicon or hand-crafted training data. Their segmentation algorithm works by way of location decision: whether two adjacent characters should be "bound" or "separated" is decided through a series of judgement rules based on values of mutual information and the difference of t-scores. The number of correctly identified locations, where the words are separated, is used to measure the performance of this method. In their testing, 1456 out of 1587 (91.75%) locations for 1588 characters from 100 sentences were correctly identified.

Statistical methods are often used to recognise unknown out-of-vocabulary words, but the identified word candidates often have an extra unwanted character due to the strong statistical association between word and character. Ma and Chen (2003) proposed a special algorithm to resolve this problem: a set of general morphological rules with linguistic and statistical constraints is used to separate the extra character from the associated word.

In a recent study, Feng et al. (2005) propose accessor variety (AV), which segments words in an unsupervised manner. Accessor variety is a measure of the likelihood of a sequence of characters being a word. A word is separated from the input text by judging the independence of the candidate word from the rest of the text, using the accessor variety criterion, which considers the number of distinct preceding and trailing characters. The AV value of a candidate word is the minimal number of

distinct preceding or trailing characters; the higher the AV value, the more independent the word.

It has also long been recognised that information theory can help in deciding which characters group together into a recognisable word. Lua and Gan (Lua, 1990; Lua & Gan, 1994) used an entropy formula in their application to perform word segmentation: a sequence of characters might be a word if its overall entropy is less than the sum of the entropies of its individual characters. Using entropy for word judgement differently, Tung and Lee (1994) considered the relationship of candidate words with all possible preceding and trailing single characters appearing in the corpus. Entropy values are calculated for the characters that occur on either the left-hand side or the right-hand side of the candidate word; if the summed entropy for each side is high, the candidate could be an actual word.

2.1.1.3 SEGMENTATION EVALUATION

Segmentation is commonly evaluated using simple recall and precision measures:

R = \frac{c}{N}   (2.4)

and

P = \frac{c}{n}   (2.5)

where R is the recall rate of the segmentation; P is the precision rate of the segmentation; c is the number of correctly identified segmented words; N is the number of unique correct words in the test data; and n is the number of segmented words in the test data.
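A minimal sketch of equations (2.4) and (2.5) on hypothetical word lists, treating c as the number of segmented words that appear in the gold standard:

```python
def segmentation_recall(segmented, gold):
    """Equation (2.4): R = c / N, with N the number of unique
    correct words in the test data."""
    correct = set(segmented) & set(gold)
    return len(correct) / len(set(gold))

def segmentation_precision(segmented, gold):
    """Equation (2.5): P = c / n, with n the number of words
    produced by the segmenter."""
    gold_set = set(gold)
    correct = [w for w in segmented if w in gold_set]
    return len(correct) / len(segmented)

gold = ["湖中", "盛產", "西藏弓魚", "和", "高原裸裂團魚"]
seg = ["湖中", "盛產", "西藏", "弓", "魚", "和", "高原裸裂團魚"]
print(segmentation_recall(seg, gold))     # 0.8
print(segmentation_precision(seg, gold))  # ~0.571
```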

2.1.2 NAMED ENTITY RECOGNITION

Ye, Chua and Liu (2002) suggest a rationality model in a multi-agent framework for extracting named entities (NEs) such as person, location and organisation names in Chinese text. The heuristic and statistical models used in most English name recognition systems are not sufficient for Chinese name recognition, owing to the lack of surface features (such as capitalisation) and the difficulty of segmenting Chinese words. Their solution is to use a greedy strategy and NE rationality measures to detect all possible Chinese NEs in the text, and then to treat the choice of the most appropriate NEs among the candidates as a multi-agent negotiation process.

Zhang et al. (2002) introduce a unified solution based on role tagging for the automatic recognition of Chinese unknown words. Initially, text is segmented using a dictionary to produce a sequence of tokens: monosyllabic characters and short words. Tokens are then tagged with their most probable roles using the Viterbi algorithm. Role tagging is similar to part-of-speech tagging, and role knowledge can be predefined or acquired from the corpus by training. It is known that person and location names are formed in certain role patterns, so unknown words are identified through maximum pattern matching.

Wang, Li and Chang (1992) demonstrate a mechanism for personal name recognition based on a sub-language concept. As observed in the United News corpus, most unknown words are named entities (persons, locations and organisations), and a personal name usually comes with a title or a role noun. Based on this observation, they treat the named entity as a sub-language of Chinese with its own grammar, syntax and semantics. Candidate names are selected from unknown words by judging them against the summarised name sub-language rules for name and title (or role noun) combinations; this process is called title-driven name recognition. If a candidate name appears again in a news article, the name is considered correctly identified; this step is called adaptive dynamic word formation.

Lin and Huang (2002) design a probabilistic verification model for verifying the correctness of named entity candidates. The correctness of named entity recognition is regarded as a factor affecting the performance of word segmentation in East Asian language processing. Both of the commonly used approaches, the handcrafted approach and the statistical approach, have problems in named entity recognition; however, the statistical approach is more adaptive because of its portability and scalability. To reduce recognition errors within the statistical model, the 'confidence' (which could be either positive or negative) of identified name candidates is considered. The confidence level is assessed based on both the candidate word's

structure and its context. With the confidence gained in the identified name entity, whether it is a correct word can then be decided.

2.1.3 CHINESE / ENGLISH TRANSLATION

An easy way to translate query terms or anchor terms from one language to another is to use a transfer dictionary: translation can be done effectively and efficiently by mapping a word to its corresponding translations in the dictionary. However, problems appear when the word is not part of the standard vocabulary, which is often the case with named entities and new medical, technological or political terms. A web-based English-Chinese translation method was proposed to resolve this problem in a recent report by Lu, Xu and Geva (2007). The idea behind it is that a word and its corresponding translation normally co-exist on the same page, because when a word is new, authors and editors often provide the translation for easy reading. A modern web search engine can be set to return results in the target language for a query term that is not yet translated. The retrieved results are then broken into candidate words, and a good translation is identified using a new-word extraction method, since the retrieved results are relatively small snippets and traditional supervised methods cannot be used in such a situation.

Chen et al. (1998) discuss several important issues in cross-language information retrieval. Their work mainly targets a query translation issue, namely proper Chinese-to-English name translation. According to their findings, a large number of search queries contain names, but both dictionary-based and corpus-based approaches have limitations in proper name translation. To overcome these limitations, a two-step method is introduced: first, two sets of Chinese and English proper names are collected from the Internet as translation candidates; then a mate-matching process based on a phonetic string matching technique is conducted to locate the appropriate translation.
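A minimal sketch of this snippet-mining idea follows. The search_snippets() call is hypothetical, and a real system would apply a proper new-word extraction method rather than the raw substring counting used here:

```python
import re
from collections import Counter

def mine_translation_candidates(snippets, min_len=2, max_len=6):
    """Collect candidate translations by counting the Chinese
    substrings that co-occur with the untranslated source term
    in target-language search snippets."""
    counts = Counter()
    for snippet in snippets:
        # Keep only runs of CJK characters as candidate material.
        for run in re.findall(r"[\u4e00-\u9fff]+", snippet):
            for n in range(min_len, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    return counts.most_common(10)

# snippets = search_snippets("Slumdog Millionaire", lang="zh")  # hypothetical helper
snippets = ["电影贫民百万富翁获得奥斯卡", "贫民百万富翁是一部英国电影"]
print(mine_translation_candidates(snippets))
```

Substrings that recur across many snippets (here 贫民百万富翁) become the top-ranked translation candidates.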

2.2 CROSS-LINGUAL INFORMATION RETRIEVAL

2.2.1 INDEXING

Zobel and Moffat (2006) present a tutorial on inverted files for text search engines. Relying heavily on underlying indexing techniques, search engines can return a large set of information and provide easy access to it. Among those techniques, the inverted file is the most efficient index structure for query evaluation and document ranking. Sometimes the volume of data is too large to fit in main memory when constructing an index; in-memory inversion, sort-based inversion and merge-based inversion are the three major techniques for dealing with this problem. For more efficient indexing and query processing, compression is used: both space usage and disk access times can be reduced through appropriate compression algorithms. Many integer coding techniques are used in inverted file compression, such as compact storage of integers, parameterless codes, Golomb and Rice codes, and binary codes.
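To make the indexing discussion concrete, here is a minimal sketch of an uncompressed, in-memory inverted file; real systems apply the inversion and compression techniques described above:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency)."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(p.items()) for term, p in postings.items()}

docs = ["cross lingual link discovery",
        "link discovery in wikipedia",
        "cross lingual information retrieval"]
index = build_inverted_index(docs)
print(index["link"])  # [(0, 1), (1, 1)]
```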

2.2.2 DOCUMENT RANKING

Robertson and Jones (1976) discuss a theoretical framework that makes use of relevance information to weight query terms for document ranking. Their theory of probabilistic weighted retrieval makes two assumptions: 1) terms in a query are distributed independently; and 2) retrieved documents should be ranked by their probable relevance to the query. Robertson and Jones's experiments indicate that a formula that considers both the presence and the absence of query terms is the most effective at finding relevant documents.

Robertson and Walker (1994) suggest a 2-Poisson model that takes into account three variables (within-document term frequency, document length and within-query term frequency) that can affect the weights of search terms in probabilistic document retrieval. With a theoretical analysis of the possible effects of the different variables, effective query term weighting functions can be developed. Robertson and Walker's experiments show that weighting functions considering these three variables can increase retrieval precision.

Ponte and Croft (1998) combine document indexing and document retrieval into a single model, the language model for information retrieval. Unlike other probabilistic models, their language modelling approach to document retrieval consists of the following steps: first, a language model is inferred for each document; next, the probability of producing the query under each of the inferred models is computed; then the documents are ranked by the calculated probabilities.
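Robertson and Walker's line of work led to the BM25 weighting function used later in this thesis. The following is a minimal sketch of a common BM25 variant combining the three variables above; the k1 and b values are conventional, assumed defaults:

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document, combining
    within-document term frequency and document length."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# The same term frequency scores lower in a longer document.
print(bm25_score(tf=3, doc_len=120, avg_doc_len=200, df=40, num_docs=10_000))
print(bm25_score(tf=3, doc_len=400, avg_doc_len=200, df=40, num_docs=10_000))
```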


Zhai and Lafferty (2001) study smoothing methods for the language modelling approach to information retrieval. Smoothing is very important in solving the sparse data problem of document retrieval with language models, and a general retrieval formula is derived in which smoothing plays the roles of tf.idf weighting and document length normalisation. Three smoothing methods (the Jelinek-Mercer method, Dirichlet priors, and absolute discounting) are examined in particular. Although the results of Zhai and Lafferty's experiments could not determine which smoothing method performs best, some interesting observations were made: smoothing methods perform differently as the types of queries vary.
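A minimal sketch of one of the three methods examined, Dirichlet-prior smoothing, in query-likelihood scoring; the value mu = 2000 is an assumed, conventional setting:

```python
import math

def dirichlet_log_likelihood(query_terms, doc_tf, doc_len,
                             collection_prob, mu=2000.0):
    """log P(query | document) with Dirichlet-prior smoothing:
    p(w|d) = (tf(w,d) + mu * p(w|C)) / (|d| + mu)."""
    score = 0.0
    for w in query_terms:
        p_w = (doc_tf.get(w, 0) + mu * collection_prob[w]) / (doc_len + mu)
        score += math.log(p_w)
    return score

collection_prob = {"link": 0.001, "discovery": 0.0005}
print(dirichlet_log_likelihood(["link", "discovery"],
                               {"link": 4, "discovery": 2},
                               doc_len=300, collection_prob=collection_prob))
```

Unseen query terms still receive non-zero probability from the collection model, which is exactly the sparse-data problem smoothing is meant to solve.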

Dong and Watters (2004) address the system execution efficiency issue, which is rarely discussed in IR research. They propose an approach to reducing the complexity of both similarity computation and ranking procedures. The Boolean model and the Vector Space Model (VSM) are the two common retrieval strategies widely used by large-scale IR systems. With the VSM, document ranking scores are computed based on the similarity of documents to the query in n-dimensional vector space. By adjusting the traditional VSM-based algorithm to one with linear complexity, similarity computation and document ranking can be made very efficient.

Yang, Ji and Tang (2004) demonstrate a method that can improve retrieval precision by re-ranking the top N documents from the initial search. Common key terms extracted from the initially retrieved document set are used to help locate the locally important phrases in each document or topic, which are then used for re-ordering the documents. These are the terms that are shared by the top N ranked documents and can represent the main concept of the document set; they can be identified in a single document or in the query, and they are also the terms that make each document unique. Their research was mainly done on Chinese IR, but it could be used as a generic method applied to information retrieval tasks in other languages.

Wu et al. (2007) discuss the top 10 algorithms in data mining: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These are the most popular and influential algorithms in the data mining research community. Although these algorithms are data mining focused, they can still be adopted in information retrieval, and may inspire new variant algorithms for this cross-lingual link discovery research.

2.2.3 PERFORMANCE-DECIDING FACTORS

Cross-language information retrieval is about retrieving information in a language that differs from the language of the query. Many factors can affect retrieval precision. Chen, Jiang and Gey (2000) examine various factors that may impact the performance of Chinese-English retrieval: segmentation dictionary coverage, segmentation algorithm, transfer dictionary coverage, transfer dictionary quality, and translation disambiguation are all taken into consideration. They created two sets of experiments: monolingual English retrieval and Chinese-to-English cross-lingual retrieval. The cross-language retrieval experiments achieved only 56% of the overall precision of monolingual retrieval. Their failure analysis finds that incorrect segmentation, limited dictionary size and inappropriate translation are the major causes of poor retrieval results. They also find that the wider the transfer dictionary coverage, the higher the precision that can be achieved.

2.2.4 OUT-OF-VOCABULARY WORD RECOGNITION

In CLIR, retrieving documents with a cross-lingual query containing out-of-vocabulary phrases has always been difficult. To resolve this problem, an external resource such as the Web or Wikipedia is often used to discover possible translations for out-of-vocabulary (OOV) terms. Wikipedia and other Web documents are regarded as treasure troves for OOV problem solving because they potentially cover the most recent OOV terms. A Web-based translation method has been shown to be an effective way to solve the OOV phrase problem (Chen, et al., 2000; Lu, et al., 2007; Zhang & Vines, 2004; Zhang, Vines, & Zobel, 2005). The idea behind this method is that a term or phrase and its corresponding translation normally co-exist in the same document, because authors often provide translations of new terms for easy reading.

In Wikipedia, the language links provided for each entry cover most popular written languages. Wikipedia was therefore used to solve a low-coverage issue on named entities in EuroWordNet (Ferrández, Toral, Ferrández, Ferrández, & Muñoz, 2007), and a number of research groups (Chan, Chen, & Lu, 2007; Shi, Nie, & Cao, 2008; Su, Lin, & Wu, 2007; Mori, 2007) employed Wikipedia to tackle OOV problems in the NTCIR (NII Test Collection for IR Systems) evaluation forum.

2.2.5 CROSS-LANGUAGE INFORMATION RETRIEVAL

Kwok, Dinstl and Deng (2001) present a CLIR system named PIRCS (Probabilistic Indexing and Retrieval Components System) with a GUI for English-Chinese cross-language information retrieval. The strategy used in PIRCS for CLIR is based on automatic translation of the query into the document language using an LDC Chinese-English bilingual word list. To remove the ambiguity caused by multiple interpretations, PIRCS employs four disambiguation methods: dictionary structure-based, phrase-based, corpus frequency-based and weight-based techniques. PIRCS is built on probabilistic indexing and retrieval methods, and can best be viewed as "activation spreading in a three-layer network": queries and documents form two different layers, connected via a middle layer of term nodes, and retrieval is done by spreading activation from a query node to a document node through the common term nodes in the middle layer.

2.3 LINK DISCOVERY

2.3.1 OVERVIEW

Link prediction, also called link discovery, is described by Getoor and Diehl (2005b) as one of the sub-tasks of link mining. It is thought of as a new data mining challenge (Getoor, 2003), and among the emerging link mining themes, link discovery is one of the core tasks (Getoor & Diehl, 2005a). In the link mining context, link discovery mainly means predicting the existence of a link between two objects based on their attributes and other observed links (Junlin, Le, & Jinming, 2005). In this project, link discovery has a broader meaning: not just predicting links, but also discovering the missing, unknown and isolated links in a document collection. Link discovery here is thus an automated process that identifies anchor texts and links them to target documents at the anchor-to-document or anchor-to-BEP level.

Wikipedia has become a popular place for reference work because it is an open platform for collaborative knowledge sharing and holds vast collections of articles in different languages. However, there is limited functionality for automated link discovery between documents in Wikipedia, and the quality of existing links is never quantified. Adafre and De Rijke (2005) found that some of these valuable links in Wikipedia are actually missing: for example, only 32 of the selected 44 tennis players' pages are referred to in the article on "tennis", and 34 of 65 chosen singers' pages are linked to the "singer" article. To promote the study of link discovery, an INEX track called Link-the-Wiki was proposed by Geva and Trotman in 2007 to "produce a standard procedure and metrics for the evaluation of link discovery between documents" (INEX, 2009).

2.3.2 BEST ENTRY POINT

A best entry point (BEP) is a position in a document that is the ideal starting point for obtaining the right information (Reid, Lalmas, Finesilver, & Hertzum, 2006a, 2006c); optimal access to a document means giving users a best entry point from which to start reading. The basic characteristics and types of best entry points in structured document retrieval have been analysed, and their usage and effectiveness examined (Reid, et al., 2006a, 2006c). There are three interpretations of BEP for its uses in different contexts, such as TREC, INEX and hypermedia or multimedia retrieval. Reid et al. (2006a) define a BEP as "a document component from which the user can obtain optimal access, by browsing, to relevant document components"; it has the same meaning in this research.

Lalmas and Reid (2003) report two small-scale studies on the automatic identification of BEPs for focused structured document retrieval. The studies use a Shakespeare user study as the experimental data to develop and evaluate different BEP identification approaches. Two criteria are followed in designing the BEP selection algorithm: (1) the parent component is returned if many of its child nodes are thought relevant to the query; (2) the first node is chosen from the closely related ordered nodes resulting from retrieval based on criterion one. Study 1 involves the design of such algorithms to identify the BEP from a ranked list of retrieved document components, and study 2 tries to deduce BEP identification rules from the known relevant document components based on statistical analysis.


2.3.3 ANCHOR IDENTIFICATION

Mihalcea and Csomai (2007) present their findings on "wikifying" text, that is, automatically identifying the important anchors in an input document and linking them to the corresponding Wikipedia pages, by combining automatic keyword extraction and word sense disambiguation (WSD) methods. First, an unsupervised algorithm extracts keywords in two steps: candidate extraction and keyword ranking. The candidate keywords are the n-gram words that occur in the article and also in the controlled vocabulary. They then evaluate three keyword ranking methods: tf.idf, the χ2 independence test, and keyphraseness. Lastly, a word sense disambiguation algorithm is used to determine the most appropriate meaning of each extracted keyword in the given context among its many possible senses; both knowledge-based and data-driven approaches to keyword disambiguation are evaluated.
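A minimal sketch of the keyphraseness statistic follows; the counts here are made up, whereas in Mihalcea and Csomai's work they are mined from a Wikipedia dump:

```python
def keyphraseness(phrase, anchor_count, occurrence_count):
    """Estimate P(keyword | phrase): how often the phrase is marked
    up as a link anchor, relative to how often it appears at all."""
    occurrences = occurrence_count.get(phrase, 0)
    if occurrences == 0:
        return 0.0
    return anchor_count.get(phrase, 0) / occurrences

# Hypothetical counts mined from a Wikipedia dump.
anchor_count = {"machine translation": 900, "first step": 3}
occurrence_count = {"machine translation": 1200, "first step": 25000}
for phrase in anchor_count:
    print(phrase, keyphraseness(phrase, anchor_count, occurrence_count))
```

Phrases that editors almost always link ("machine translation") rank far above common phrases that are almost never linked ("first step").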

2.3.4 ASSESSMENT AND EVALUATION

INEX1 ran the mono-lingual Link-the-Wiki track for several years, and while doing so explored various evaluation frameworks including automatic and manual assessment. The most surprising outcome of the INEX track was that the human (manual) assessors did not generally consider all the pre-existing Wikipedia links to be relevant (Huang, Xu, Trotman, & Geva, 2008; Huang, Geva, & Trotman, 2009, 2010).

Huang, Trotman and Geva (2007) present a discussion of automated link discovery evaluation for the newly launched INEX 2007 Link-the-Wiki track. In particular, a preliminary assessment procedure is proposed for use in this track and beyond; it includes topic selection, submission, pooling and evaluation. Various specific techniques required for a unified link discovery evaluation, such as the anchor text specification, the result file specification, and its Document Type Definition (DTD), are discussed in detail.

1 https://inex.mmci.uni-saarland.de


Huang et al. (2008) give an overview of the INEX 2007 Link-the-Wiki track. The tasks of Link-the-Wiki include finding both outgoing links and incoming links for search topics nominated by the participants. As this was the first year of the track, link discovery and its evaluation operated only at the document level (file-to-file). The Link-the-Wiki track of INEX 2010 focuses on linking the Te Ara2 encyclopaedia with Wikipedia (Trotman, Alexander, & Geva, 2011).

2 http://www.teara.govt.nz

2.3.5 MONO-LINGUAL LINK DISCOVERY

Several methods of mono-lingual link discovery for Wikipedia have been proposed by, for example, Adafre & De Rijke (2005). Mihalcea & Csomai (2007) proposed the Wikify system for automatic link generation in Wikipedia, while Milne & Witten (2008) tested a learning-to-link method that can be used to link any document to Wikipedia. Yeung et al. (2011) proposed a framework for assisting cross-lingual editing in Wikipedia.

Fachry et al. (2008) record their work in the Link-the-Wiki track. First, relevant documents are retrieved from the corpus using the Vector Space Model; next, the results are filtered based on page names, from which links can then be created.

Geva (2008) tackled the Link-the-Wiki task using the GPX search engine. Incoming links were created from the results of a standard NEXI query, //article[about(., name)], which searches for elements that are about the topic's name element. For outgoing links, a simple approach is adopted: a potential link is identified by matching anchor text against all the existing page names. The selection of anchor text is done by sliding a window of 1 to 12 words; longer anchors that match an article title are ranked higher.

Itakura and Clarke (2008) also demonstrate their findings in the Link-the-Wiki track. In their work, a simple algorithm was implemented to produce incoming and outgoing links, and the result was then used as a baseline. Incoming links were generated only at the document level: the first 250 documents returned from the Wumpus3 search engine containing the topic title, but without an intra-corpus link, were selected as the incoming links for the topic. To generate outgoing links, a ratio score computed for each candidate term is used to decide which terms to pick as anchor text, and the corresponding pages containing these most frequent terms are taken as outgoing links.

3 http://www.wumpus-search.org

Jenkinson and Trotman (2008) present their work on the Link-the-Wiki track. They decided that terms can be used as anchors if they are over-represented. Regarding the creation of outgoing and incoming links, it was thought that both types of links should be reciprocal, which means the incoming links can be inferred from the outgoing links by reversing them.
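A minimal sketch of the window-based anchor identification idea described above; the page titles and text here are made up, whereas a real system would match against all Wikipedia page names:

```python
def find_anchors(words, page_titles, max_window=12):
    """Slide windows of 1..max_window words over the text and keep
    the longest window at each position that matches a page title."""
    anchors = []
    i = 0
    while i < len(words):
        match = None
        for n in range(min(max_window, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in page_titles:
                match = candidate
                break  # prefer the longest match
        if match:
            anchors.append((i, match))
            i += len(match.split())
        else:
            i += 1
    return anchors

titles = {"link discovery", "wikipedia"}
text = "automated link discovery in wikipedia".split()
print(find_anchors(text, titles))  # [(1, 'link discovery'), (4, 'wikipedia')]
```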

2.3.6 CROSS-LINGUAL LINK DISCOVERY

Automatic link discovery between documents in different languages is a new research field. Sorg & Cimiano (2008) tackle the German and English Wikipedia language-link problem using a classification-based approach. Their study particularly examines missing language-links between Wikipedia articles on the same topic: they discovered in their experiments that only a small portion of English and German Wikipedia articles are cross-linked. However, cross-lingual natural language processing applications require reasonable coverage and accuracy of cross-lingual links in an ever-growing knowledge base like Wikipedia, so a way of generating cross-lingual links automatically is necessary. To achieve this goal, they used a classification-based approach to produce new cross-lingual links, inferred from the already existing links by way of clustering. Melo & Weikum (2010) do the opposite: they examine incorrect Wikipedia language-links between articles on the same topic.

2.4 OTHER RELEVANT TASKS

After conducting an experiment at INEX 2005, Trotman and Lalmas (2006) present findings showing that structural hints in a query do not help much in improving the precision of XML retrieval. Analysing the runs submitted by INEX 2005 participants, they found improvements in some cases, but these were not significant enough to prove a definite precision enhancement from adding structural hints to the query. The researchers suggest that this could be because the users were not good at providing structural hints.

Kamps, Lalmas and Pehcevski (2007) propose a generalised average precision measure for the evaluation of the Relevant in Context (RiC) task at INEX 2006. In the RiC task, in addition to standard relevant document retrieval, the relevant information within each document should also be accurately located. They state that, to evaluate this combined retrieval job, the calculated evaluation score should reflect not only the ranked list of retrieved documents but also the correctness of the location of the relevant elements within each document.

Trotman, Geva, and Kamps (2007) report on the four sessions of the SIGIR 2007 workshop addressing issues in focused information retrieval theory and methodology. Question answering, passage retrieval and element retrieval (XML-IR) are the three kinds of focused retrieval tasks. These tasks have been studied for many years in information retrieval forums such as TREC, INEX, CLEF and NTCIR. The focused retrieval tasks defined in those evaluation forums differ in many aspects, such as data format and evaluation metrics, but have a lot in common; solving their shared problems and providing a systematic study of focused retrieval were the major purposes of this workshop.

In a recent study by Vercoustre, Thom and Pehcevski (2008), entity ranking in Wikipedia was investigated. The task of entity ranking aims to retrieve entities as answers to a query, rather than to tag named entities in documents. It is a new INEX track focused on entity ranking, using Wikipedia as the document collection; the proposed tasks of the track provide a target category and example entities for deciding the expected entity answers. Their realisation of entity ranking in Wikipedia is based on three ideas: (1) using the full-text pages retrieved from an IR system that answer the query; (2) using popular links, which are extracted and selected from the above highest-ranked pages; and (3) using category similarity, calculated by comparing the categories of popular links with those of the entity examples. Overall scores are then calculated from these three aspects to decide the ranking of the entities.


2.5 SUMMARY AND IMPLICATIONS

This chapter reviewed the literature covering the three main aspects of this research: natural language processing, cross-lingual information retrieval and cross-lingual link discovery. Cross-lingual link discovery for Chinese / English Wikipedia will require integrating and extending existing work on Chinese word segmentation, Chinese / English translation, Chinese / English cross-lingual information retrieval and Chinese / English link discovery.

Chinese segmentation has been studied for a long time, but existing high-precision methods may not satisfy the segmentation needs of Wikipedia, as Wikipedia is a collaborative work which may contain different forms of characters, different word variants and, often, errors. Also, the effect of Chinese segmentation on anchor identification in cross-lingual link discovery is as yet unknown, so a proper Chinese segmentation method for Wikipedia would need to be simple, flexible, and easily adapted from existing methods.

As CLIR has been claimed to be a solved problem (Oard, 2002) and current statistical machine translation (MT) can achieve high translation precision, this study will emphasise employing existing mature technologies in cross-lingual information retrieval and applying them in the realisation of cross-lingual link discovery. In particular, this research will need to identify a good indexing strategy for Chinese text and choose a reliable machine translation service for translation between Chinese and English.

Good algorithms for recommending mono-lingual anchored links were seen at the first INEX Link-the-Wiki track in 2007 (Geva, 2008; Huang, et al., 2008; Itakura & Clarke, 2008), and these algorithms could potentially be adapted to a multi-lingual environment to suggest cross-lingual links. Also, the evaluation metrics and tools (Huang, et al., 2010) used at INEX 2009 can be reused, but will be adapted to the peculiarities of CLLD. The following chapters will discuss the employment, integration and improvement of existing work in the realisation of Chinese / English bi-directional link discovery and the development of a cross-lingual link discovery evaluation framework.


Chapter 3: Natural Language Processing

This chapter discusses the natural language processing required in realising Chinese / English link discovery: Chinese segmentation and Chinese / English translation. The emphasis of this chapter is on Chinese segmentation; a detailed translation approach will be discussed in section 4.2.2.2.

When working with a heterogeneous document collection such as the Chinese Wikipedia, one encounters a mix of traditional, simplified and classical Chinese writing, as well as several Chinese variants and dialects. Experiments reveal that even state-of-the-art Chinese segmentation systems cannot produce high-precision segmentation in such a linguistically inconsistent collection. An unsupervised segmentation approach named "n-gram mutual information" (NGMI) is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training purposes, and to manually maintain ever-expanding lexicons.


3.1 INTRODUCTION

Due to the lack of word boundaries in Chinese text, word segmentation is often the first step in Chinese text processing, for instance when indexing text for information retrieval or identifying anchors for link discovery. Below is a piece of text extracted from a blog article about travelling in Tibet4:

"班公措最奇特的是,在中國境內的湖水是淡水,水草豐茂,清甜甘美,魚類繁多,湖中盛產西藏弓魚和高原裸裂團魚,高原裸裂團魚是西藏特有魚類,此魚的排泄生殖孔和臀鰭兩側,有排列成行的大鱗片,乍看好像腹部裂開一條口,故名「裂腹魚」"

The above Chinese text describes a type of fish, 西藏弓魚 (snowtrout)5, living in Pangong Tso6, Rutog, Tibet. In order to suggest Wikipedia links automatically for this article, the characters should be broken into words so that proper anchors can be recommended. For example, 西藏弓魚 is a term unfamiliar to readers, who may be interested in finding out what kind of fish it is. However, automatically generating a link from 西藏弓魚 to the Wikipedia "snowtrout" article (in either Chinese or English) is not easy, because identifying this term in the text can be problematic even for a state-of-the-art segmentation system, for various reasons. Table 3-1 shows the segmentation results (data retrieved on 7 March 2011) of three state-of-the-art systems for the sentence "湖中盛產西藏弓魚和高原裸裂團魚" taken from the above Chinese example text, and all systems

4 http://tw.myblog.yahoo.com/hsiang1025/article?mid=5284
5 http://www.discoverlife.org/mp/20q?search=Racoma+labiata; http://en.wikipedia.org/wiki/Snowtrout; http://www.bioinfo.cn/db05/BjdwSpecies.php?action=view&id=6226
6 http://en.wikipedia.org/wiki/Pangong_Tso


segment "西藏弓魚" into "西藏" (Tibet), "弓" (bow) and "魚" (fish). With these segmentation results, neither the anchor for 西藏弓魚 nor the links to the corresponding Wikipedia pages can be created.

Also, modern Chinese has two forms of writing: simplified and traditional. For instance, the word "China" is written as "中国" in simplified Chinese, but as "中國" in traditional Chinese. Furthermore, a few variants of the Chinese language exist in different locales, including Mainland China, Taiwan, Hong Kong, Macau, Singapore and Malaysia. For instance, a laser printer is called 激光打印机 in mainland China, but 鐳射打印機 in Hong Kong, and 雷射印表機 in Taiwan.

Table 3-1: The Results of Chinese Segmentation of Different Systems

System: Result
SCWS7: 湖中 盛產 西藏 弓 魚 和 高原 裸 裂 團 魚
CKIP Chinese Parser8: 湖(Na) 中(Ng) 盛產(VC) 西藏(Nc) 弓(Na) 魚(Na) 和(Caa) 高原(Na) 裸裂(VH) 團魚(Na)
ICTCLAS9: 湖/n 中/f 盛產/v 西藏/ns 弓/n 魚/n 和/cc 高原/n 裸/ag 裂/v 團魚/n

For digital representations of Chinese text, different encoding schemes have been adopted to encode Chinese characters; however, most encoding schemes are incompatible with one another. To avoid conflicts between encoding standards and to cater for people's linguistic preferences, Unicode is often used in collaborative work, for example in Wikipedia articles. With Unicode, Chinese articles can be composed collaboratively by people from all the above Chinese-speaking areas without encoding difficulties. As a result, these different forms of Chinese writing and variants may co-exist within the same pages.

7 http://www.ftphp.com/scws/demo/v4.php
8 http://parser.iis.sinica.edu.tw/
9 http://ictclas.org/ictclas_demo.html

Besides this, Wikipedia also has a Chinese collection written only in Classical Chinese10 (also called Literary Chinese, or 文言文), and other versions for a few Chinese dialects, for example 贛語 (Gan) Wikipedia11, 粵語 (Cantonese) Wikipedia12 and others. However, the study of segmentation for Chinese dialects such as Gan and Cantonese is not included in this thesis.

Moreover, in this Internet age more and more new Chinese terms are coined at a faster than ever rate, and new Chinese Wikipedia pages are created to explain such terms. It is difficult to keep a traditional dictionary up to date, given the rate of creation and the extent of new terms.

All these issues could lead to serious segmentation problems in Wikipedia text processing when attempting to recognise meaningful words and anchors in a Chinese article, as text will be broken down into single-character words whenever the actual n-gram word cannot be recognised. In order to extract n-gram words from a Wikipedia page, the following problems must be solved:

• Mix of Chinese writing forms: simplified and traditional
• Mix of Chinese variants
• Mix of Classical Chinese and Modern Chinese
• Out-of-vocabulary (OOV) words

To tackle these issues, and to avoid both the effort of preparing and maintaining segmented text and lexicons for different corpora and the potential issues of applying existing methods to the segmentation of Chinese Wikipedia articles, a new statistical method called n-gram mutual information (NGMI) is presented in this chapter.

10 http://zh-classical.wikipedia.org
11 http://gan.wikipedia.org
12 http://zh-yue.wikipedia.org


3.2 CHINESE SEGMENTATION WITH N-GRAM MUTUAL INFORMATION

3.2.1 A BOUNDARY-ORIENTED SEGMENTATION METHOD

N-gram mutual information (NGMI) is a simple statistical method particularly designed for Chinese segmentation in Wikipedia. It can be used as either an unsupervised or a supervised segmentation method, depending on the availability of training resources. It relies on the mutual information of text segments, which can be computed from statistical data obtained by text mining the Chinese Wikipedia corpus.

The core idea of the NGMI segmentation method is to identify words by looking for word boundaries using contextual information, rather than by looking for the words directly. A similar idea was tried by Sun et al. (1998), where two adjacent characters are "bound" or "separated" through a series of judgement rules based on values of mutual information and the difference of t-scores. In NGMI, however, no hand-crafted rules are involved, and the mutual information not only of characters but also of segments is considered.

NGMI introduces a new concept named boundary confidence (BC), which is used to determine how the text should be segmented. As any place between two adjacent Chinese characters could be a boundary between two different words, it is difficult to find the right ones. By measuring the confidence of a place being a boundary, spaces are inserted at the places with the highest confidence, and n-gram words can thus be separated. The estimation of boundary confidence is based on the mutual information of the adjoining segments; in other words, BC measures the association level of the left and right characters around each possible boundary to decide where a boundary should actually be placed. Since n-gram mutual information looks for boundaries in the text rather than directly for words, it overcomes the limitation of traditional uses of mutual information, which recognise bi-gram words only.

3.2.2 DEFINITION OF N-GRAM MUTUAL INFORMATION

Considering at most two characters on each side of a boundary, the NGMI score of a given sentence is defined as the sum of the boundary confidence (BC) scores of all the boundaries it contains:

NGMI(s) = \sum_{i} BC(L_i \mid R_i)   (3.1)

where L_i and R_i are the segments to the left and right of the i-th boundary. For the boundary confidence of boundaries at the beginning or end of an input string, only one character from one side of the boundary is retrieved:

BC(c_1 \mid c_2 c_3) = MI(c_1, c_2 c_3)   (3.2)

BC(c_{n-2} c_{n-1} \mid c_n) = MI(c_{n-2} c_{n-1}, c_n)   (3.3)

where MI is the mutual information of the adjacent segments. Detailed explanations of the notation in the above equations, and of how these equations are derived, are given in section 3.3. Basically, the lower the boundary confidence of a possible boundary, the more likely it is to be an actual boundary, and a threshold can be set to decide whether a boundary should be placed. So even without a lexicon it is still possible to segment text with a certain precision, which simply means that the suggested words are all out-of-vocabulary; hence, NGMI can subsequently be used for OOV word recognition.
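A minimal sketch of this thresholding idea, simplified to single-character contexts and assuming a precomputed pattern-frequency table; the helper below is illustrative and not the exact formulas of section 3.3:

```python
import math

def boundary_confidence(left, right, freq, corpus_size):
    """MI of the segments adjoining a candidate boundary; a low
    value suggests a good place to insert a word boundary."""
    p_l = freq.get(left, 0) / corpus_size
    p_r = freq.get(right, 0) / corpus_size
    p_lr = freq.get(left + right, 0) / corpus_size
    if p_l == 0 or p_r == 0 or p_lr == 0:
        return float("-inf")  # unseen pattern: very confident boundary
    return math.log2(p_lr / (p_l * p_r))

def propose_boundaries(text, freq, corpus_size, threshold=0.0):
    """Mark a boundary wherever the adjoining characters' MI
    falls below the threshold."""
    return [i for i in range(1, len(text))
            if boundary_confidence(text[i - 1], text[i],
                                   freq, corpus_size) < threshold]

# Made-up frequency table: 中国 is a strong word, 国很 is not.
freq = {"中": 900, "国": 500, "中国": 450, "很": 700, "国很": 1}
print(propose_boundaries("中国很", freq, corpus_size=10_000))  # [2]
```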

3.3 UNSUPERVISED N-GRAM MUTUAL INFORMATION

3.3.1 DEFINITIONS

Suppose there is an input Chinese string σ, and a possible segmentation of it:

\bar{S} = S_1 S_2 \cdots S_m   (3.4)

where S_x is a single segment of the Chinese sequence σ, and could be a candidate word. Whether this segmentation separates all the correct words needs to be decided by a model that can make the best choice based on ranking scores over all possible segmentations. For each possible segmentation, the confidence of having correct boundaries is therefore measured. The boundary | of a sub-string (L|R) = S_x|S_{x+1} in \bar{S}, consisting of a left substring L = S_x and a right substring R = S_{x+1}, is determined based on this confidence. In other words, the boundary confidence

measures the association level of the left and right substrings. The BC of any adjoining segments is defined as:

BC(S_x \mid S_{x+1}) = MI(S_x, S_{x+1})   (3.5)

where

MI(S_x, S_{x+1}) = \log_2 \frac{p(S_x S_{x+1})}{p(S_x)\, p(S_{x+1})}   (3.6)

is the association score of segment S_x and segment S_{x+1}. The lower the mutual information score of L and R, the more confident we can be that the boundary is correctly placed. Generally speaking, characters that appear together more frequently have a higher mutual information value, which suggests a strong association between them; it is therefore unlikely that there is a boundary between them. How exactly to compute the association scores of the adjacent segments remains open, so in order to identify an appropriate method for determining the boundary confidence of segments, a few BC variants are tried. Considering at most two characters of segments S_x and S_{x+1}, BC can be computed in several ways: BC_pair, BC_sum, BC_min, BC_max and BC_mean. They are defined as:

BC_{pair}(S_x \mid S_{x+1}) = MI(c_r, c_l)   (3.7)

where c_r is the rightmost character of S_x, and c_l is the leftmost character of S_{x+1}.

The remaining variants, BC_sum (3.8), BC_min (3.9), BC_max (3.10) and BC_mean (3.11), combine the mutual information scores of the adjoining segment and character pairs around the boundary by taking their sum, minimum, maximum and mean respectively, with the underlying probability estimates defined in equation (3.12).

An NGMI value of a sentence decides whether the prospective boundaries inside it are acceptable: the lower the NGMI score of a segmentation, the more likely it is to have split the right words. To find the most appropriate splits for a sentence, the confidence scores of all boundaries are summed for every possible segmentation. For the different BC variants, there are corresponding sentence scores, equations (3.13) to (3.17):

NGMI_v(\bar{S}) = \sum_{x=1}^{m-1} BC_v(S_x \mid S_{x+1}), \quad v \in \{pair, sum, min, max, mean\}

For any particular split, if the sum of the boundary confidences of a sentence has a negative value, it is unlikely to split correct words in the middle. A detailed example may help to explain this: given a sentence se = abcdef, consider two possible segmentations,

\bar{S}_1 = ab \mid cdef   (3.18)

\bar{S}_2 = ab \mid cd \mid ef   (3.19)

Then for the NGMI scores of \bar{S}_1 and \bar{S}_2 of se, there are:

NGMI(\bar{S}_1) = BC(ab \mid cdef)   (3.20)

NGMI(\bar{S}_2) = BC(ab \mid cd) + BC(cd \mid ef)   (3.21)

3.3.2 SEGMENTATION ALGORITHM

For the input string σ given in section 3.3.1, the NGMI word segmentation process has the following steps:

1. Retrieve the first x characters (in the experiments, x was set to 11) from the beginning of string σ.

2. Build a list of all possible segmentations, Slist, for the x characters. The upper bound of the size of Slist equals 2^{x-1} - 1; for 11 characters there will be 1023 permutations, but segmentations are removed from the list if they contain any substring that is not in the frequency table.

3. For each boundary, apply the boundary confidence calculation, and sum all the BC scores for each segmentation.

4. Sort Slist by NGMI score in ascending order: the lower the score, the more likely the segmentation is to have the correct boundaries.

5. Choose the first segmentation (having the highest ranking) as the best segmentation. Then, only if this is a stop-word elimination run (NGMI_MIN_SW, NGMI_MIN_SW_AS, etc., which will be described in Table 3-2), any word W_i in the best segmentation with a length of more than two characters is further broken down, as in the previous segmentation steps (steps 2 to 5).

6. This further segmentation is accepted only if it meets the following conditions: both the first segment and the last segment of the further segmentation are not one-character words; or, if either of them is a one-character word, it is in the stop-word list. For example, if W_i contains a stop word at the beginning, the best segmentation from step 5 is revised so that the leading stop word is split off as its own segment.

7. Accept all the segments of the best segmentation as words, except for the last word, which is put back into the unsegmented text. The last segmented word is returned to the unsegmented text because the split at x characters was arbitrary, and the best segmentation may have split a long word in the middle.

8. Repeat the segmentation process from step 1 until all the remaining characters are consumed.
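A minimal sketch of this loop follows (unoptimised, as in the thesis; the stop-word handling of steps 5 and 6 is omitted, and the MI helper is illustrative rather than the exact variant formulas):

```python
import math
from itertools import product

def mi(left, right, freq, n):
    """Mutual information of two adjoining segments (cf. equation (3.6))."""
    p = lambda s: freq.get(s, 0) / n
    if p(left) == 0 or p(right) == 0 or p(left + right) == 0:
        return float("-inf")
    return math.log2(p(left + right) / (p(left) * p(right)))

def ngmi_score(segments, freq, n):
    """Sum boundary confidences over a segmentation (cf. (3.13)-(3.17))."""
    return sum(mi(a, b, freq, n) for a, b in zip(segments, segments[1:]))

def all_segmentations(chars, freq):
    """All splits whose segments appear in the frequency table (step 2)."""
    for cuts in product([False, True], repeat=len(chars) - 1):
        segs, start = [], 0
        for i, cut in enumerate(cuts, 1):
            if cut:
                segs.append(chars[start:i])
                start = i
        segs.append(chars[start:])
        if all(s in freq for s in segs):
            yield segs

def ngmi_segment(text, freq, n, x=11):
    words = []
    while text:
        window = text[:x]
        candidates = list(all_segmentations(window, freq)) or [[window]]
        best = min(candidates, key=lambda s: ngmi_score(s, freq, n))  # steps 4-5
        if len(text) > x and len(best) > 1:
            words.extend(best[:-1])           # step 7: hold the last word back
            text = best[-1] + text[x:]
        else:
            words.extend(best)
            text = text[x:]
    return words
```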

This segmentation algorithm requires computing NGMI scores for all possible segmentations, so it has high computational complexity and its running time will not be satisfactory. As the focus here is the feasibility of the NGMI segmentation method and the determination of the best NGMI variant, optimisation of this algorithm is not considered.

3.3.3 EXPERIMENTS

3.3.3.1 STRING PATTERN FREQUENCY TABLE

NGMI relies on a frequency table of n-gram string patterns, which can be built from any selected text. General statistical information for the Chinese language was therefore obtained through text mining of the Chinese Wikipedia XML corpus (Denoyer & Gallinari, 2006), which contains 56,662 documents, 27,360,399 Chinese characters, and 11,464 unique Chinese characters in total. For every character sequence with length less than 12, the corresponding frequency is recorded in a string pattern frequency table. Since the size of the complete table is very large, only those string patterns appearing in the corpus more than 216 times are kept; 216 is a number arbitrarily chosen to ensure that the frequency table fits into the experimental machine's memory.
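A minimal sketch of building such a table with plain in-memory counting; the hypothetical wikipedia_texts iterable stands in for the parsed XML corpus, and the length limit and frequency cut-off follow the description above:

```python
from collections import Counter

def build_pattern_table(documents, max_len=11, min_freq=217):
    """Count all character n-grams (n <= max_len) and keep only those
    occurring more than min_freq - 1 times, per the cut-off above."""
    counts = Counter()
    for text in documents:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {pattern: f for pattern, f in counts.items() if f >= min_freq}

# freq = build_pattern_table(wikipedia_texts)  # wikipedia_texts: iterable of article strings (assumed)
```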


3.3.3.2 STOP WORDS

It is recognised that an extra character such as a preposition or a postposition often cannot be separated from the actual word, because of the strong statistical association between them. From the string pattern frequency table, the top 20 single-character words with the highest frequency (over 100,000 occurrences) were selected as "stop words".

3.3.3.3 TEST DATA

In-house Test Data

The following articles were randomly chosen as segmentation test topics from the Chinese version of Wikipedia: 本草纲目 (Bencao Gangmu), 马可·波罗 (Marco Polo), 張仲景 (Zhang Zhongjing), 贫民百万富翁 (Slumdog Millionaire), 网络评论员 (50 Cent Party), and 风水 (Feng Shui). The text of these articles may mix classical Chinese, simplified and traditional Chinese, and Chinese language variants.

Bake-off 2005 Test Data

In the Second International Chinese Word Segmentation Bake-off test set there are four groups of data (each with training, testing and gold-standard parts), provided by Academia Sinica, City University of Hong Kong, Peking University and Microsoft Research respectively (Second International Chinese Word Segmentation Bakeoff - Result Summary, 2005). The gold-standard data is manually segmented text following the word specifications defined by each corpus creator. Each data set contains only one form of Chinese writing: simplified or traditional.

3.3.3.4 EXPERIMENTAL RUNS

The Chinese segmentation experiments were performed using different NGMI variants and test data. Their run names and descriptions are listed in Table 3-2.


Table 3-2: Information of Segmentation Runs

| Run Name | Description |
|---|---|
| ICTCLAS | The ICTCLAS Chinese word segmentation system (online demonstration version¹³) from the Chinese Academy of Sciences, developed on a multi-layer hidden Markov model (Institute of Computing Technology), applied to the in-house test data |
| MI | The original mutual information formula (Sproat & Shih, 1990) on the in-house test data |
| IMI | The improved mutual information formula proposed by Dai et al. (Dai, et al., 1999) on the in-house test data |
| NGMI_PAIR | The NGMIpair formula on the in-house test data |
| NGMI_SUM | The NGMIsum formula on the in-house test data |
| NGMI_MIN | The NGMImin formula on the in-house test data |
| NGMI_MAX | The NGMImax formula on the in-house test data |
| NGMI_MEAN | The NGMImean formula on the in-house test data |
| NGMI_MIN_SW | The NGMImin formula combined with stop-word judgement on the in-house test data. The segmentation process repeats on already segmented words longer than two characters; such words are further split if the stipulated conditions are met (see step 6 of the segmentation algorithm) |
| NGMI_MIN_SW_AS | Same as NGMI_MIN_SW but using the Academia Sinica test data |
| NGMI_MIN_SW_CU | Same as NGMI_MIN_SW but using the City University of Hong Kong test data |
| NGMI_MIN_SW_PU | Same as NGMI_MIN_SW but using the Peking University test data |
| NGMI_MIN_SW_MSR | Same as NGMI_MIN_SW but using the Microsoft Research test data |

13 http://ictclas.org/ictclas_demo.html


3.3.4 RESULTS

In this section the performance of the different segmentation runs on both the in-house test data and the bake-off test data is compared. The precision on the in-house test data is used as the major performance measurement for identifying the best NGMI variant; the overall recall rate on all the data is also given. The precision and recall of a segmentation run are computed using the evaluation measures (equations (2.4) and (2.5)) discussed in section 2.1.1.3.

3.3.4.1 RUNS ON THE IN-HOUSE TEST DATA

The precision values and the corresponding numbers of correctly identified words for each run on the in-house test data are given in Table 3-3; the recall values of all runs are given in Table 3-4. Table 3-3 shows that the mutual information runs are inherently limited to selecting bi-gram words, while the NGMI runs are able to extract words of up to seven characters, even though the NGMI runs achieve only around 50% precision overall. The number of correctly identified words is similar for all runs on the in-house test data: the mutual information runs accurately identify a high number of bi-gram words, and the NGMI_MIN_SW run produces similar results but with more higher-gram words correctly identified. The NGMI_MIN_SW run increases the overall precision from 53.5% (NGMI_MIN) to 62.6%, but the numbers of correctly identified higher-gram words drop, as some correct n-gram words are split and lost in the further segmentation step.

Table 3-4 shows that, despite the loss of correctly identified n-gram words, the NGMI_MIN_SW run still has the highest recall rate of all runs. The NGMI_PAIR and IMI runs, with recall rates of 64.83% and 64.73% respectively, come second and third; the other NGMI runs have recall rates slightly over 60%. The recall rate of ICTCLAS (56.58%) is low considering its relatively high precision. Given that the number of single-character words identified by the ICTCLAS run on the in-house test data is significantly higher than in the other runs, but with low precision, this suggests that the ICTCLAS online word segmentation system is accurate at recognising one form of written Chinese (either simplified or traditional) but fails on the other; in mixed-form documents the use of ICTCLAS could be problematic.


Table 3-3: The Segmentation Results on the In-House Test Data

| Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Overall |
|---|---|---|---|---|---|---|---|---|
| ICTCLAS | 36.4% (434) | 87.2% (1042) | 83.6% (97) | 75.7% (28) | 100.0% (3) | – (0) | – (0) | 63.0% (1604) |
| MI | 42.1% (410) | 79.0% (1318) | – (0) | – (0) | – (0) | – (0) | – (0) | 65.4% (1728) |
| IMI | 42.9% (406) | 71.1% (1429) | – (0) | – (0) | – (0) | – (0) | – (0) | 62.0% (1835) |
| NGMI_PAIR | 49.4% (365) | 71.6% (1329) | 18.5% (113) | 16.3% (29) | 9.1% (2) | 0.0% (0) | 0.0% (0) | 53.9% (1838) |
| NGMI_SUM | 49.0% (350) | 63.4% (1279) | 18.7% (124) | 13.5% (30) | 6.3% (2) | 0.0% (0) | 0.0% (0) | 48.7% (1785) |
| NGMI_MIN | 48.8% (362) | 69.7% (1307) | 19.5% (104) | 14.2% (27) | 4.8% (1) | 0.0% (0) | 50.0% (1) | 53.5% (1802) |
| NGMI_MAX | 48.7% (343) | 61.8% (1279) | 18.5% (123) | 13.3% (29) | 6.1% (2) | 0.0% (0) | 0.0% (0) | 48.0% (1776) |
| NGMI_MEAN | 50.0% (342) | 64.4% (1318) | 19.2% (123) | 13.4% (29) | 6.1% (2) | 0.0% (0) | 0.0% (0) | 49.9% (1814) |
| NGMI_MIN_SW | 49.4% (368) | 72.8% (1520) | 29.9% (96) | 33.3% (8) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 62.6% (1992) |

Each cell shows the segmentation precision, with the number of correctly identified n-gram words in parentheses; "–" marks cells where no words of that length were detected. Overall precision = # of correctly identified words / # of all detected words.


Table 3-4: Recall of Segmentation Runs on In-House Test Data

| Run Name | Recall |
|---|---|
| ICTCLAS | 56.58% |
| MI | 60.95% |
| IMI | 64.73% |
| NGMI_PAIR | 64.83% |
| NGMI_SUM | 62.96% |
| NGMI_MIN | 63.56% |
| NGMI_MAX | 62.65% |
| NGMI_MEAN | 63.99% |
| NGMI_MIN_SW | 70.26% |

Overall, the supervised methods normally restrict themselves to choosing words from the lexicon only, so their segmentation results contain a relatively small number of found words; this explains why ICTCLAS has a high precision but a low recall. In contrast, as there is no finite set of correct words for the NGMI runs, the number of identified words can be very large, which lowers precision because of the larger denominator.

3.3.4.2 RUNS ON THE BAKE-OFF TEST DATA

It has to be noted that all the segmentation runs on the bake-off test data were created directly using the string frequency table obtained from the Chinese Wikipedia corpus, without any knowledge of the bake-off training data; the training text is in fact completely independent of the test corpus. The precision values and the corresponding numbers of correctly identified words on the bake-off test data are given in Table 3-5, and the recall figures for all bake-off runs are given in Table 3-6. The maximum length of n-gram words identified by the bake-off runs is 10, from the run on the Microsoft Research corpus (NGMI_MIN_SW_MSR). Achieving a certain level of precision for n-gram word segmentation on the bake-off test data demonstrates the corpus-independent ability of NGMI in segmenting n-gram words.

From Table 3-5 it can be seen that the overall precision rates of the four bake-off runs are much lower than that of the NGMI_MIN_SW run on the in-house test data. The precision for lower-gram words is similar on both the in-house and the bake-off test data, but the precision for higher-gram words on the bake-off data is significantly lower than on the in-house test data, which results in a lower overall precision rate.

Table 3-5: The Segmentation Results on the Bake-off Test Data

| n-gram | NGMI_MIN_SW_AS | NGMI_MIN_SW_CU | NGMI_MIN_SW_PU | NGMI_MIN_SW_MSR |
|---|---|---|---|---|
| 1 | 60.8% (1384) | 54.7% (776) | 76.1% (1276) | 74.0% (1208) |
| 2 | 65.3% (9659) | 69.1% (4895) | 67.6% (6669) | 63.9% (6013) |
| 3 | 18.7% (1237) | 21.8% (523) | 15.1% (790) | 12.5% (681) |
| 4 | 9.5% (108) | 12.0% (52) | 13.0% (205) | 14.9% (235) |
| 5 | 2.2% (3) | 3.5% (2) | 7.1% (18) | 11.2% (29) |
| 6 | 6.3% (1) | 0.0% (0) | 1.5% (1) | 14.7% (11) |
| 7 | 0.0% (0) | 0.0% (0) | 2.4% (1) | 28.1% (9) |
| 8 | 0.0% (0) | 0.0% (0) | 0.0% (0) | 27.3% (3) |
| 9 | 0.0% (0) | 0.0% (0) | 0.0% (0) | 28.6% (2) |
| 10 | 0.0% (0) | 0.0% (0) | 0.0% (0) | 25.0% (1) |
| Overall | 49.6% (12392) | 54.8% (6248) | 47.7% (8960) | 44.4% (8192) |

Each cell shows the segmentation precision, with the number of correctly identified n-gram words in parentheses. Overall precision = # of correctly identified words / # of all detected words.

Table 3-6: Recall of Segmentation Runs on the Bake-off Test Data

| Run Name | Recall |
|---|---|
| NGMI_MIN_SW_AS | 68.96% |
| NGMI_MIN_SW_CU | 72.85% |
| NGMI_MIN_SW_PU | 72.26% |
| NGMI_MIN_SW_MSR | 69.86% |

Table 3-6 shows that the recall rates of all bake-off runs are around 70%, which again indicates the corpus-independent ability of NGMI in segmenting n-gram words. Of course, this can partly be attributed to the fact that the Chinese Wikipedia corpus is a mixed-language corpus and hence covers the language of the bake-off text.


Both Table 3-4 and Table 3-6 show that the recall rate of the NGMI method using the NGMImin formula with stop-word elimination is the highest, and that this is consistent (all around 70%) across all runs. Therefore, NGMImin is used as the official NGMI calculation formula, which also explains the definition of NGMI given in section 3.2.2.

3.4 SUPERVISED N-GRAM MUTUAL INFORMATION

NGMI can also be used as a supervised Chinese segmentation method when manually segmented text for a specific knowledge domain is available for training. Supervised methods are good at segmenting text similar to what they have seen segmented before, but segmenting out-of-domain text is often a challenge for them; unsupervised segmentation methods, on the other hand, can help discover words even when they are not in any vocabulary. To meet the goal of segmenting out-of-domain text while still taking advantage of a training corpus, n-gram mutual information can be adjusted to become trainable for cross-domain text segmentation.

The idea of making NGMI trainable is to turn the segmented training text into a word-based frequency table: a table that records only words, adjoining word pairs, and their frequencies. For example, given a piece of training text "A B C E B C A B" (where A, B, C and E are n-gram Chinese words), its frequency table looks like the following:

A|B  2
B|C  2
C|E  1
E|B  1
C|A  1
A    2
B    3
C    2
E    1

Also, when doing the boundary confidence computation, any substrings (children) of the words (parents) in this table are set to have the same frequency as their parents'.


3.4.1 SEGMENTATION SYSTEM DESIGN WITH SUPERVISED NGMI

3.4.1.1 FREQUENCY TABLE AND ITS ALIGNMENT

In order to resolve ambiguity and also recognise OOV terms, statistical information about the n-gram string patterns in the test files must be collected. In total, two groups of frequency information are used in the segmentation: one from the training data, recording the frequencies of the actual words and the adjoining word pairs; the other from the unsegmented text, containing frequency information for all possible n-gram patterns. However, the statistics collected from the unsegmented test file contain many noise patterns, which must be removed to avoid a negative impact on the final BC computation. Therefore, an alignment of the pattern frequencies obtained from the test file is performed to reduce noise.

The frequency alignment is conducted in a few steps. First, a frequency table of all string patterns in the unsegmented text, including those with a frequency of one, is built. Second, the table is sorted by pattern length and frequency: longer patterns rank higher than shorter ones, and among patterns of the same length the more frequent rank higher. Next, starting from the top of the table, where the longest and most frequent patterns rank highest, one record is retrieved at a time and all of its sub-patterns that have the same frequency as their parent are removed from the table. After this frequency alignment is done, the two frequency tables are merged into one, ready for the final boundary confidence calculation.
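The alignment can be sketched as follows (illustrative names; the final merge with the training-data word table is only indicated in a comment):

```python
def align_frequencies(raw):
    """Drop every proper sub-pattern whose frequency merely equals that of
    a longer parent pattern, working from the longest / most frequent
    patterns downwards."""
    ordered = sorted(raw, key=lambda p: (-len(p), -raw[p]))
    removed = set()
    for parent in ordered:
        if parent in removed:
            continue
        freq = raw[parent]
        for n in range(1, len(parent)):           # proper sub-patterns only
            for i in range(len(parent) - n + 1):
                child = parent[i:i + n]
                if raw.get(child) == freq:
                    removed.add(child)
    # (the merge with the training-data word table happens afterwards)
    return {p: f for p, f in raw.items() if p not in removed}
```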

3.4.1.2 SEGMENTATION WITH SUPERVISED NGMI

In the training and system testing stages, the segmentation results using boundary confidence alone for word disambiguation were found to be unsatisfactory. To achieve as high a performance as possible, the overall word segmentation for the bake-off is done with a hybrid algorithm: NGMI for general word segmentation, combined with the backward maximum match (BMM) method for final word disambiguation.

Since it is common for a Chinese document to contain various types of characters (Chinese, digits, alphabetic characters, and characters from other languages),


segmentation needs to be considered for two particular forms of Chinese words: 1) words containing non-Chinese characters such as numbers or letters; and 2) words containing purely Chinese characters. In order to simplify the overall segmentation process, boundaries are automatically added wherever a Chinese character precedes a non-Chinese character. Additionally, for words containing numbers or letters, only those beginning with numbers or letters and ending with Chinese character(s) are searched against the given lexicons; if the search fails, the all-non-Chinese part remains unchanged and a boundary is added between the non-Chinese character and the Chinese character. For example, segmenting the sentence "一万多人喜迎1998年新春佳节" consists of the following main steps:

• First, because a boundary is automatically inserted between "迎" and "1", only "一万多人喜迎" requires initial segmentation.

• Next, find the matched word "1998年" in a given lexicon.

• Last, segment "新春佳节".

So the critical part of the segmentation algorithm is segmenting strings of purely Chinese characters. During the computation of the possible segmentations, each boundary can be assigned a score. By combining the information about already known words (i.e. a vocabulary from the labelled training data), a boundary can further fall into one of the following three special BC categories:

• inseparable

• threshold

• absolute boundary

A boundary marked as inseparable means the characters around it are part of an actual word; an absolute boundary means the adjoining segment pairs are not seen in any words or string patterns; and a threshold is given to a possible boundary for which only one of the adjoining segment pairs (ASPs) can be found in the word pair table and whose length is greater than two characters.

After all BC computations for an input string are finished, the string is broken into segments separated by the boundaries whose BC score is lower than or equal to the threshold value. For each segment it can then be checked whether it is a word in the vocabulary or an OOV term, using the OOV judgement formula discussed in section 3.4.1.3. If a segment is neither a word nor an OOV term, there is an ambiguity in that segment. For example, given a sentence "…送行李…", the substring "送行李" can be segmented into either "送行 | 李" or "送 | 行李". To disambiguate, a segment is divided into two chunks at the place with the lowest BC score. If one of the chunks is a word or an OOV term, this two-chunk breaking-down operation continues on the remaining non-word chunk until both newly divided chunks are words, or neither of them is a word or an OOV term. After this recursive operation finishes, any remaining non-word chunks are further segmented using the backward maximum match (BMM) method.

The overall segmentation algorithm for an all-Chinese string can be summarised as follows:

1) Compute the BC for each possible boundary.

2) The input string becomes segments separated by the boundaries with a low BC score (not higher than the threshold).

3) Each remaining non-word segment from step 2 is recursively broken down into two chunks at the place with the lowest BC in that segment, based on the scores from step 1 (sketched below). This breaking-down loop continues on the non-word chunk while the other chunk is a word or an OOV term; otherwise, all remaining non-word chunks are further segmented using the backward maximum match method.
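The recursive two-chunk disambiguation can be sketched as follows, assuming helper predicates is_word and is_oov and a bmm fallback, none of which are defined here; this is an approximation of the procedure described above rather than the exact implementation:

```python
def disambiguate(segment, bc, is_word, is_oov, bmm):
    """Recursively split an ambiguous segment in two at its lowest-BC
    boundary; fall back to backward maximum matching when neither half
    is a known word or an OOV term. bc[i] is the score of the boundary
    after character i+1 of the segment."""
    if len(segment) < 2 or is_word(segment) or is_oov(segment):
        return [segment]
    cut = min(range(1, len(segment)), key=lambda i: bc[i - 1])
    left, right = segment[:cut], segment[cut:]
    if is_word(left) or is_oov(left):
        return [left] + disambiguate(right, bc[cut:], is_word, is_oov, bmm)
    if is_word(right) or is_oov(right):
        return disambiguate(left, bc[:cut - 1], is_word, is_oov, bmm) + [right]
    return bmm(segment)  # neither half is a word: backward maximum match
```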

3.4.1.3 OOV WORD RECOGNITION

Here, a simple strategy is used for OOV word detection. It is assumed that an n-gram string pattern qualifies as an OOV word if it repeats frequently within only a short span of text or across a few documents. To recognise OOV words, the statistical data extracted from the unsegmented text therefore needs to contain not only pattern frequency information but also document frequency information. However, the documents in the test data are boundary-less, so to obtain document frequencies for string patterns, the test files were separated into a set of virtual documents by splitting them according to size; the virtual document size (VDS) is adjustable. For a given non-word string pattern S, its likelihood of being an OOV term is computed using:

$$ \mathrm{OOV}(S) = \frac{tf}{df} \qquad (3.22) $$

where tf is the term frequency of the string pattern S and df is the virtual document frequency of the pattern. S is then considered an OOV candidate if it satisfies:

$$ \mathrm{OOV}(S) > \mathrm{OOV\_THRES} \qquad (3.23) $$

where OOV_THRES is an adjustable threshold that filters out the n-gram patterns with a lower likelihood of being OOV words. However, this strategy can have side effects on segmentation performance, because not all of the suggested OOV words will be correct.
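Under the tf/df reading of equation (3.22) given above, the OOV judgement can be sketched as follows (names are illustrative, and the exact formula reading is an assumption):

```python
VDS = 10_000      # virtual document size in bytes (Table 3-7)
OOV_THRES = 2.3   # threshold value (Table 3-7)

def virtual_documents(data, size=VDS):
    """Split the boundary-less test data into fixed-size virtual documents."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def is_oov_candidate(tf, df, threshold=OOV_THRES):
    """Flag patterns that recur densely within few virtual documents:
    the tf/df reading of equation (3.22)."""
    return df > 0 and tf / df > threshold
```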

3.4.2 SUPERVISED NGMI AT CIPS-SIGHAN 2010

3.4.2.1 THE CHINESE WORD SEGMENTATION BAKEOFF

The Chinese word segmentation task of CIPS-SIGHAN 2010 focuses on the performance of Chinese segmentation on cross-domain text. There are two types of evaluation: closed and open. Experiments were conducted for both the closed and open training evaluations, and for both the simplified and traditional Chinese segmentation subtasks. Under the rules of the closed training evaluation, the resources provided for system training are limited: using external resources such as trained segmentation software, corpora, dictionaries or lexicons is forbidden, and in particular, human-encoded rules in the segmentation algorithm are not allowed.


3.4.2.2 EXPERIMENTAL SETUP

Parameter Settings

Table 3-7 shows the parameters used in the system for segmentation and OOV recognition.

Table 3-7: System Settings for both Closed and Open Training Evaluations

| Parameter | Value |
|---|---|
| N | # of words in training corpus |
| THRESHOLD | log(1/N) |
| VDS | 10,000 bytes |
| OOV_THRES | 2.3 |

Closed and Open Training

For both the closed and open training evaluations, the algorithm and the parameters used for segmentation and OOV detection are exactly the same, except that an extra dictionary, cc-cedict (MDBG), was used in the segmentation system for the open training evaluation.

3.4.3 RESULTS AND DISCUSSIONS

In the Chinese word segmentation task of CIPS-SIGHAN 2010, system performance is measured by five metrics: recall (R), precision (P), F-measure (F1), the recall rate of OOV words (RROOV), and the recall rate of in-vocabulary words (RRIV). The official results of the implemented system with the NGMI segmentation method, for both the open and closed training evaluations, are given in Table 3-8. The recall, precision and F1 scores of all tasks show promising results for this system in segmenting cross-domain text; however, the gaps between these scores and the bake-off bests also suggest that there is still plenty of room for performance improvement.

The OOV recall rates (RROOV) shown in Table 3-8 demonstrate that the proposed OOV recognition strategy achieves a certain level of OOV word discovery in the closed training evaluation, although the overall OOV recognition result is not very satisfactory compared with the best results of the other bake-off participants. For the open training evaluation the OOV recall rate picked up significantly, which indicates that the extra dictionary, cc-cedict, covers a fair number of terms from various domains.

Table 3-8: Segmentation Results for Four Domains in both Closed and Open Evaluations

Simplified Chinese Task

| | R | P | F1 | ROOV | RROOV | RRIV |
|---|---|---|---|---|---|---|
| A (c) | 0.907 | 0.862 | 0.884 | 0.069 | 0.206 | 0.959 |
| A (o) | 0.869 | 0.873 | 0.871 | 0.069 | 0.657 | 0.885 |
| B (c) | 0.876 | 0.844 | 0.860 | 0.152 | 0.457 | 0.951 |
| B (o) | 0.859 | 0.878 | 0.868 | 0.152 | 0.668 | 0.893 |
| C (c) | 0.885 | 0.804 | 0.842 | 0.110 | 0.218 | 0.967 |
| C (o) | 0.865 | 0.846 | 0.855 | 0.110 | 0.559 | 0.903 |
| D (c) | 0.904 | 0.865 | 0.884 | 0.087 | 0.321 | 0.960 |
| D (o) | 0.853 | 0.850 | 0.851 | 0.087 | 0.438 | 0.893 |

Traditional Chinese Task

| | R | P | F1 | ROOV | RROOV | RRIV |
|---|---|---|---|---|---|---|
| A (c) | 0.864 | 0.789 | 0.825 | 0.094 | 0.105 | 0.943 |
| A (o) | 0.804 | 0.722 | 0.761 | 0.094 | 0.234 | 0.863 |
| B (c) | 0.868 | 0.850 | 0.859 | 0.094 | 0.316 | 0.926 |
| B (o) | 0.789 | 0.736 | 0.761 | 0.094 | 0.350 | 0.834 |
| C (c) | 0.871 | 0.815 | 0.842 | 0.075 | 0.115 | 0.932 |
| C (o) | 0.811 | 0.740 | 0.774 | 0.075 | 0.254 | 0.856 |
| D (c) | 0.875 | 0.834 | 0.854 | 0.068 | 0.169 | 0.926 |
| D (o) | 0.811 | 0.753 | 0.781 | 0.068 | 0.235 | 0.853 |

(c) – closed; (o) – open; A – Literature; B – Computer; C – Medicine; D – Finance. ROOV is the OOV rate in the test file.


3.5 SUMMARY

This chapter described a novel hybrid boundary-oriented segmentation method, NGMI. It overcomes the limitation of the original mutual-information-based methods, which recognise only bi-gram words, by introducing the boundary confidence of adjacent segments. NGMI can be used as either an unsupervised or a supervised segmentation method, and it can also be used for OOV word recognition. The unsupervised version of NGMI can segment n-gram words with reasonable precision using purely Chinese text statistics drawn from the Wikipedia corpus. To improve segmentation precision further, NGMI can be adapted into a trainable segmentation method; the evaluation in the Chinese word segmentation task of CIPS-SIGHAN 2010 shows that the supervised version of NGMI achieves promising results in cross-domain text segmentation.


Chapter 4: Chinese / English Cross-Lingual Information Retrieval

Chapter 4 discusses cross-lingual information retrieval (CLIR), another important part of overall cross-lingual link discovery. Traditionally, CLIR has been used for cross-lingual information searching or question answering: users can access cross-lingual information with the help of a CLIR system, which returns relevant documents for a query posed in a language different from that of the target corpus. Such a process can be thought of as identifying virtual links between the query and the retrieved documents. The task of cross-lingual link discovery is therefore not dissimilar from traditional CLIR, because the anchors in a given document can be thought of as queries: the objective of link discovery is to choose the most suitable anchors and, for each anchor, to identify the most suitable documents. The named entity identification technique and the machine translation method used in this chapter can also be applied to cross-lingual link discovery.

This chapter describes CLIR experiments that use a voting mechanism for high-precision machine translation in cross-lingual question answering. Background information on CLIR is given in section 4.1. Section 4.2 discusses natural language processing for question answering. Section 4.3 provides the implementation of the voting mechanism for translation. Section 4.4 shows the design of a CLIR system. The CLIR experiments and results are given in sections 4.5 and 4.6 respectively. Section 4.7 summarises this chapter.


4.1 INTRODUCTION

Nowadays, it is easy for people to access multilingual information on the Internet, and key-term searching on an information retrieval (IR) system is common for information lookup. However, when people look for answers in a different language, it is more natural and comfortable for them to give the IR system questions in their own natural language (e.g. looking for a Chinese answer with the English question "What is Taiji?"). Cross-lingual question answering (CLQA) tries to satisfy such needs by directly finding the correct answer for a question posed in a different language.

In order to return a cross-lingual answer, a CLQA system needs to understand the question, choose proper query terms, and then extract correct answers. Cross-lingual information retrieval (CLIR) plays a very important role in this process, because the relevance of the retrieved documents (or passages) affects the accuracy of the answers. A simple approach to CLIR is to translate the query into the language of the target documents and then use a monolingual IR system to locate the relevant ones. It is essential, but difficult, to translate the question correctly. Machine translation (MT) can achieve very high accuracy when translating general text, but the complex phrases and possible ambiguities present in a question challenge general-purpose MT approaches, and out-of-vocabulary (OOV) terms are particularly problematic. So the key to successful CLQA is being able to correctly translate all the terms in the question, especially the OOV phrases.

This chapter discusses an approach to accurate question translation that targets the OOV phrases and uses a translation voting mechanism involving translations from three different sources: machine translation, an online encyclopaedia, and web documents. The translation with the highest number of votes is selected. To demonstrate this mechanism, the experiments use Google Translate (GT)¹⁴ as the MT source, Wikipedia as the encyclopaedia source, and the Google web search engine to retrieve Wikipedia links and relevant web document snippets.

14 http://translate.google.com.


English questions on a Chinese corpus for CLQA are used to illustrate this approach. Finally, the approach is examined and evaluated in terms of translation accuracy and the resulting CLIR performance, using the test collection, topics and assessment results from NTCIR-8¹⁵.

Table 4-1: Question Templates in Question Answering

English Question Templates (EQTs): who [is | was | were | will], what is the definition of, what is the [relationship | interrelationship | inter-relationship] [of | between], what links are there, what link is there, what [is | was | are | were | does | happened], when [is | was | were | will | did | do], where [will | is | are | were], how [is | was | were | did], why [does | is | was | do | did | were | can | had], which [is | was | year], please list, describe [relationship | interrelationship | inter-relationship] [of | between], could you [please | EMPTY] give short description[s] to, who, where, what, which, how, describe, explain

Chinese Question Templates (CQTs): 之间有什么关系, 的定义是什么, 的关系是什么, 发生了什么事, 是什么关系, 是什么时候, 的关系如何, 之间有什么, 请简短简述, 请简单简述, 什么时候, 什么关系, 的关系是, 有何关系, 关系如何, 有何相关, 有何渊源, 为什么会, 为什么要, 为什么能, 是哪一年, 什么时候, 位于哪里, 什么样的, 你能不能, 相互之间, 代表什么, 简短简述, 简单简述, 简短描述, 简单描述, 为什么, 是什么, 什么是, 的关系, 在哪里, 怎么样, 有哪些, 什么事, 是哪个, 是哪家, 有什么, 请列出, 请列举, 请描述, 哪一年, 请简述, 能不能, 的定义, 何时, 谁是, 是谁, 如何, 哪个, 列举, 请问, 何谓, 何以, 为何, 描述, 有何, 简述, 哪些, 什么, 之间, 有关, 定义, 解释

15 http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html.


4.2 NATURAL LANGUAGE PROCESSING FOR QUESTION ANSWERING

4.2.1 NAMED ENTITY IDENTIFICATION

Questions for CLQA can be very complex. Consider the following, for example: "What is the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou?". In this example it is important to recognise the two named entities (NEs) ("Riding Alone for Thousands of Miles" and "ZHANG Yimou") and to translate them precisely. To recognise the NEs in a question, first, the English question template phrases in Table 4-1 are removed from the question; next, the Stanford NLP POS tagger (The Stanford Natural Language Processing Group, 2010) is used to identify the named entities; the recognised entities are then translated accordingly. The Chinese question template phrases are also pruned from the translated question at the end, to reduce noise words in the final query.

There are three scenarios in which a term or phrase is considered a named entity. First, it is consecutively labelled NNP or NNPS (University of Pennsylvania, 2010). Second, term(s) are grouped by quotation marks. For example, extracting a named entity from the example question above takes three steps:

• Remove the question template phrase "What is the relationship between" from the question.

• Process the remainder using the POS tagger, giving "the_DT movie_NN ``_`` Riding_NNP Alone_NNP for_IN Thousands_NNS of_IN Miles_NNP ''_'' and_CC ZHANG_NNP Yimou_NNP ?_.".

• "Riding Alone for Thousands of Miles" lies between the two quotation tags and so is an entity, and the phrase "ZHANG Yimou", as indicated by two consecutive NNP tags, is also a named entity.

Third, if a named entity recognised in the two scenarios above is followed in the question by a phrase enclosed in a bracket pair, this phrase is used as a tip term providing additional information about the named entity. For instance, in the question "Who is David Ho (Da-i Ho)?", "Da-i Ho" is the tip term of the named entity "David Ho".
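The first two scenarios can be sketched over Penn-Treebank-style (token, tag) pairs as follows; the tip-term scenario and the actual Stanford tagger call are omitted, and all names are illustrative:

```python
def extract_named_entities(tagged):
    """Scan (token, tag) pairs for the two main scenarios: spans wrapped
    in Penn Treebank quote tokens, and runs of consecutive NNP/NNPS."""
    entities, quoted, run = [], None, []
    for token, tag in tagged:
        if token in ('``', "''"):
            if quoted is None:
                quoted = []                        # opening quote
            else:
                entities.append(' '.join(quoted))  # closing quote
                quoted = None
            continue
        if quoted is not None:
            quoted.append(token)
        elif tag in ('NNP', 'NNPS'):
            run.append(token)
        elif run:
            entities.append(' '.join(run))
            run = []
    if run:
        entities.append(' '.join(run))
    return entities
```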


4.2.2 A VOTING MECHANISM FOR NAMED ENTITY TRANSLATION (VMNET)

4.2.2.1 OBSERVATIONS

The following observations can be made:

• Wikipedia has over 100,000 Chinese entries describing various up-to-date events, people, organisations, locations, and facts. Most importantly, there are links between English articles and their Chinese counterparts.

• When people post information on the Internet, they often provide a translation (where necessary) in the same document. Such pages contain bilingual phrase pairs: for example, if an English term or phrase is used in a Chinese article, it is often followed by its Chinese translation enclosed in parentheses.

• A web search engine such as Google can identify Wikipedia entries and return popular bilingual web document snippets that are closely related to the query.

• Statistical machine translation relying on parallel corpora, such as Google Translate, can achieve very high translation accuracy.

Given these observations, there can be up to three different sources from which translations for a named entity are obtained, and the task is to find the best one.

4.2.2.2 VMNET

A Google search on the extracted named entity is performed to return related Wikipedia links and bilingual web document snippets. From the results of the web search and MT, three different kinds of translation can be acquired.

Wikipedia Translation

The equivalent Chinese Wikipedia pages can be found by following the language links in the English pages. The title of the discovered Chinese Wikipedia page is used as the Wikipedia translation.


Bilingual Clue Text Translation

The Chinese text contained in the snippets returned by the search engine is processed for bilingual clue text translation. A phrase in the other language, enclosed in parentheses directly after the named entity, is used as a candidate translation. For example, from the web document snippet "YouTube - Sean Chen (陳信安) dunks on Yao Ming…", "陳信安" can be extracted and used as a candidate translation of "Sean Chen", a basketball player from Taiwan.

Machine Translation

In the meantime, translations for the named entity and its tip term (if there is one) are also retrieved using Google Translate.

Regarding the translation of a named entity using Wikipedia, there may be more than one result because of ambiguity, so for a given named entity there could be at least one, but possibly more than three, candidate translations. From all the candidate translations obtained from these sources, the best one is then selected. All translations are equally weighted and each contributes one vote; votes for identical translations are accumulated, and the best translation is the one with the highest number of votes. A translation can thus receive up to three votes if all sources return the same result. In the case of a tie, the first choice for the best translation is the Wikipedia translation if only one Wiki entry is found; otherwise, the priority is the bilingual clue text translation, then the machine translation.
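The voting step can be sketched as follows, with wiki a list of Wikipedia candidate translations and clue/mt the clue-text and machine translations (None when unavailable); this is a sketch under the tie-breaking rules just described, not the system's actual code:

```python
from collections import Counter

def vote_translation(wiki, clue, mt):
    """Majority vote over the candidate translations, breaking ties with
    the priority: single Wikipedia entry > clue text > machine translation."""
    candidates = list(wiki) + [c for c in (clue, mt) if c]
    if not candidates:
        return None
    votes = Counter(candidates)
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    if len(wiki) == 1 and wiki[0] in tied:  # unique Wikipedia translation
        return wiki[0]
    if clue in tied:                        # then bilingual clue text
        return clue
    return mt                               # finally machine translation
```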

4.3 QUERY GENERATION ALGORITHM WITH VMNET

Because terms can have multiple meanings, ambiguity often arises when only a single term is given to machine translation. A state-of-the-art MT toolkit or service performs better when more contextual information is provided, so a better translation is possible if the whole sentence (e.g. the question) is given. For this reason, the machine translation of the question is performed on the whole question, not on the version with the templates removed.


However, two issues arise: 1) how can it be known whether all the named entities in a question have been translated correctly? and 2) if there is an error in a named entity translation, how can it be fixed? Particularly for case 2, the translations of all words other than the named entity are considered acceptable, and only the named entity is incorrectly translated. A fix can be made by keeping most of the translation and replacing the poor named entity translation with a good one. However, finding an incorrect named entity translation is difficult, because the translation of a named entity can differ with context, and the missing word boundaries in Chinese sentences make the problem harder. To solve this, when a translation error is detected, the question is reformatted by replacing all the named entities with nonsense strings containing special characters as place holders; these place holders remain unchanged during translation, and the good NE translations can then be put back into the newly translated question. Given an English question Q, the detailed steps of the Chinese query generation are as follows:


1. Retrieve the machine translation Tmt for the whole question from Google Translate.

2. Remove the question template phrase from the question.

3. Process the remainder using the POS tagger.

4. Extract the named entities from the tagged words using the method discussed in section 4.2.1.

5. Replace each named entity in question Q with a special string Si (i = 0, 1, 2, ...) that is meaningless to translation and is formed from a few non-alphabet characters (a sketch of this placeholder scheme follows the list). In the experiments, Si is created by joining a double quote character, a ^ character and the named entity id (a number starting from 0 and increasing by 1 in order of occurrence of the named entities), followed by another double quote character; the final Si is "^id". The resulting question is used as Qs.

6. Retrieve the machine translation Tqs for Qs from Google Translate. Since Si consists of special characters, it remains unchanged in Tqs.

7. Start the VMNET loop over the named entities.

8. With an option set to return both English and Chinese results, Google the named entity and its tip term (if there is one).

9. If there are any English Wikipedia links in the top 10 search results, retrieve them all; else jump to step 12.

10. Retrieve all the corresponding Chinese Wikipedia articles by following the language links in the English pages. If there are none, jump to step 12.

11. Save the title NETwiki(i) of each Chinese Wikipedia article Wiki(i).

12. Process the search results again to locate a bilingual clue text translation candidate, NETct, as discussed in section 4.2.2.2.

13. Retrieve the machine translations NETmt and NETtip for this named entity and its tip term (if there is one).

14. Gather all candidate translations NETwiki(*), NETct, NETtip and NETmt for voting. The translation with the highest number of votes is considered the best (NETbest). If there is a tie, NETbest is assigned the tied translation with the highest priority; the priority order is NETwiki(0) (if sizeof(NETwiki(*)) = 1) > NETct > NETmt. That is, when a tie occurs and there is more than one Wikipedia translation, all the Wikipedia translations are skipped.

15. If Tmt does not contain NETbest, Tmt is considered a faulty translation.

16. Replace Si in Tqs with NETbest.

17. If NETbest is different from every NETwiki(i) but can be found in the content of a Wikipedia article Wiki(i), the corresponding NETwiki(i) is used as an additional query term and appended to the final Chinese query.

18. Continue the VMNET loop, jumping back to step 8 until no named entities remain in the question.

19. If Tmt was considered a faulty translation, use Tqs as the final translation of Q; otherwise use Tmt. The Chinese question template phrases are pruned from the translation for the final query generation.
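Steps 5 and 16 can be sketched as a pair of helpers (illustrative; as the example in this section shows, real MT output may insert spaces inside the placeholders, so production matching would need to be more tolerant):

```python
def protect_entities(question, entities):
    """Step 5: swap each named entity for a placeholder such as "^0"
    that passes through machine translation unchanged."""
    for i, ne in enumerate(entities):
        question = question.replace(ne, '"^%d"' % i)
    return question

def restore_entities(translation, best):
    """Step 16: put the voted NE translations back in place of the markers."""
    for i, ne_translation in enumerate(best):
        translation = translation.replace('"^%d"' % i, ne_translation)
    return translation
```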

The translation of a question includes two parts: named entity translation and translation of the whole question with the named entities excluded. As state-of-the-art machine translation systems perform very well in general, only the named entity translation needs to be focused on and improved. The core function of VMNET is its ability to refer to other sources (Wikipedia and the Web) to improve translation accuracy on top of machine translation results. To elaborate on this mechanism for question translation, an example is given below:

• For the question "What is the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou?", retrieving its Chinese translation from an MT service gives the following: 之间有什么电影“利民为千里单独的关系”和张艺谋.

• The translation of the movie name "Riding Alone for Thousands of Miles" is, however, incorrect.

• Since the question is also reformatted into "What is the relationship between the movie "^0" and "^1"?", machine translation returns a second translation: 什么是电影之间的关系“^ 0”和“^ 1”?

• VMNET obtains the correct translations, 千里走单骑 and 张艺谋, for the two named entities "Riding Alone for Thousands of Miles" and "ZHANG Yimou" respectively.

• Replacing the place holders in the second translation with the correct translations gives the final Chinese translation: 什么是电影之间的关系“千里走单骑”和“张艺谋”?

4.4 INFORMATION RETRIEVAL

4.4.1 CHINESE DOCUMENT PROCESSING

Approaches to Chinese text indexing vary: unigrams, bigrams and whole words are all commonly used as tokens, and the performance of IR systems using different segmentation algorithms or techniques varies as well (Chen, He, Xu, Gey, & Meggs, 1997; Robert & Kwok, 2002). Prior experiments showed that an indexing technique requiring no dictionary can perform similarly to word-based indexing (Chen, et al., 1997), and that using unigrams together with bigrams that exhibit high mutual information as index terms can achieve good results. Motivated by indexing efficiency, and to avoid the need for Chinese text segmentation, both bigrams and unigrams are used as indexing units in the Chinese IR experiments.
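A sketch of this dictionary-free tokenisation (illustrative):

```python
def index_terms(text):
    """Tokenise Chinese text into unigrams plus overlapping bigrams,
    the dictionary-free indexing units used in these experiments."""
    chars = [c for c in text if not c.isspace()]
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams

# index_terms("千里走单骑")
# -> ['千', '里', '走', '单', '骑', '千里', '里走', '走单', '单骑']
```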

4.4.2 WEIGHTING MODEL

A slightly modified BM25 ranking function was used for document ordering. When calculating the inverse document frequency, the following is used:

$$ \mathrm{idf}(q_i) = \log\frac{N}{n} \qquad (4.1) $$

where N is the number of documents in the corpus, and n is the document frequency of query term q_i. The retrieval status value of a document d with respect to query Q is given as:

$$ RSV(d, Q) = \sum_{q_i \in Q} \mathrm{idf}(q_i) \cdot \frac{tf(q_i, d)\,(k_1 + 1)}{tf(q_i, d) + k_1 \left(1 - b + b\,\frac{|d|}{avgdl}\right)} \qquad (4.2) $$

where tf(q_i, d) is the term frequency of term q_i in document d, |d| is the length of document d in words, and avgdl is the mean document length. The number of bigrams is included in the document length. The values of the tuneable parameters k_1 and b set in the information retrieval system are 0.7 and 0.3 respectively.
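A sketch of the resulting scoring function (argument names are illustrative, and the idf form follows the reconstruction of equation (4.1) above):

```python
import math

K1, B = 0.7, 0.3  # tuned parameter values reported above

def rsv(query_terms, doc_tf, doc_len, avgdl, df, N):
    """Score one document with equations (4.1)-(4.2); doc_tf and df map
    terms to term frequency and document frequency respectively."""
    score = 0.0
    for t in query_terms:
        n = df.get(t, 0)
        tf = doc_tf.get(t, 0)
        if n == 0 or tf == 0:
            continue
        idf = math.log(N / n)
        score += idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avgdl))
    return score
```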


4.4.3 ENGLISH-CHINESE CLIR SYSTEM DESIGN

The CLIR system architecture that adopts the VMNET algorithm for question translation is depicted in Figure 4-1. After the final Chinese query is obtained with the VMNET algorithm as described in section 4.3, the query is segmented into unigrams and bigrams, which are used to search the corpus and retrieve relevant documents.

[Figure 4-1: The CLIR System Design. The question is processed by a POS tagger; for each named entity, VMNET gathers candidate translations from Google search results (bilingual clue text), Chinese Wikipedia page titles reached via English Wikipedia, and Google Translate, then votes for the correct translation; incorrect translations (if any) and question template phrases are cleared from the question translation; the resulting Chinese query terms are submitted to the IR system, which searches the indexed Chinese corpus and returns relevant documents for QA processing.]


4.5 CLIR EXPERIMENT

4.5.1 TEST COLLECTION AND TOPICS

Table 4-2 gives the statistics of the test collection and the topics used in the experiments. The collection contains 308,845 documents in simplified Chinese from Xinhua News, and there are in total 100 topics consisting of both English and Chinese questions. This is the NTCIR-8 collection for the ACLIA task.

Table 4-2: Statistics of Test Corpus and Topics

| Corpus | #docs | #topics |
|---|---|---|
| Xinhua Chinese (simplified) | 308,845 | 100 |

4.5.2 EVALUATION MEASURES

The evaluation of VMNET covers two main aspects: translation accuracy and CLIR performance. As the task focuses on named entity translation, translation accuracy is measured by the precision of translated named entities at the topic level. The translation precision P is defined as:

$$ P = \frac{c}{N} \qquad (4.3) $$

where c is the number of topics in which all the named entities are correctly translated, and N is the number of topics evaluated.

The effectiveness of the different translation methods can be further measured by the resulting CLIR performance. In NTCIR-8, CLIR performance is measured using mean average precision (MAP). The MAP values are obtained by running the ir4qa_eval2 toolkit with the assessment results¹⁶ on the experimental runs (NTCIR Project, 2010). MAP is computed using only 73 topics, because an insufficient number of relevant documents was found for the other 27 topics (Sakai et al., 2010); this is the case for all NTCIR-8 ACLIA submissions.

16 http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html.


It must also be noted that five topics have misspelled terms in their English questions; these are given in Table 4-3. It is interesting to see how the different translation methods cope with misspelled terms and how this affects the CLIR results.

Table 4-3: The Misspelled Terms in Topics

| Topic ID | Misspelling | Correction |
|---|---|---|
| ACLIA2-CS-0024 | Qingling | Qinling |
| ACLIA2-CS-0035 | Initials D | Initial D |
| ACLIA2-CS-0066 | Kasianov | Kasyanov |
| ACLIA2-CS-0074 | Northern Territories | northern territories |
| ACLIA2-CS-0075 | Kashimir | Kashmir |

4.5.3 CLIR EXPERIMENT RUNS

Several experimental runs were created for VMNET and CLIR system performance evaluation; their details are listed in Table 4-4. Runs named *CS-CS* are Chinese monolingual IR runs, and runs named *EN-CS* are English-to-Chinese CLIR runs. The monolingual runs are used for benchmarking the CLIR system performance.

4.6 RESULTS AND DISCUSSION

4.6.1 TRANSLATION EVALUATION

The translations in the experiments using Google Translate reflect only the results retrieved at the time of the experiments, because Google Translate is believed to improve over time. The result of the final translation evaluation on the 100 topics is given in Table 4-5. Google Translate had difficulties with 13 topics. If all thirteen named entities in the topics where Google Translate failed are considered OOV terms, the proportion of topics with OOV phrases is relatively small. Regardless, VMNET achieves an 8% improvement over Google Translate, reaching 95% precision.


Table 4-4: The Descriptions of CLIR Experimental Runs

| Run Name | Indexing Units | Query Processing |
|---|---|---|
| VMNET-CS-CS-01-T | U+B | Manually segment the question and remove all the noise words |
| VMNET-CS-CS-02-T | U+B | Prune the question template phrase |
| VMNET-CS-CS-03-T | U+B | Use the whole question without any extra processing |
| VMNET-CS-CS-04-T | U | As VMNET-CS-CS-01-T |
| VMNET-CS-CS-05-T | B | As VMNET-CS-CS-01-T |
| VMNET-EN-CS-01-T | U+B | Use Google Translate on the whole question and use the entire translation as the query |
| VMNET-EN-CS-02-T | U+B | Use the VMNET translation result without any further processing |
| VMNET-EN-CS-03-T | U+B | As above, but prune the Chinese question template from the translation |
| VMNET-EN-CS-04-T | U+B | Use Google Translate on the whole question and prune the Chinese question template phrase from the translation |

(For indexing units, U means unigrams; B means bigrams.)

Table 4-5: Translation Evaluation Results

| Method | c | N | P |
|---|---|---|---|
| Google Translate | 87 | 100 | 87% |
| VMNET | 95 | 100 | 95% |

There are in total 14 topics in which Google Translate or VMNET failed to correctly translate all the named entities; these topics are listed in Table 4-6, and the correct and erroneous translations of the named entities in those topics are given in Table 4-7. Interestingly, for the topic (ACLIA2-CS-0066) with the misspelled term "Kasianov", VMNET still managed to find a correct translation (米哈伊尔·米哈伊洛维奇·卡西亚诺夫). This has to be attributed to the search engine's capability in handling misspellings. On the other hand, Google Translate was correct in its translation of the "Northern Territories" of Japan, but VMNET incorrectly chose "Northern Territory" (of Australia). For the rest of the misspelled phrases (Qingling, Initials D, Kashimir), neither Google Translate nor VMNET could pick the correct translation.

Table 4-6: The Topics with OOV Phrases

| Topic ID | Question with OOV Phrases |
|---|---|
| ACLIA2-CS-0002 | What is the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou? |
| ACLIA2-CS-0008 | Who is LI Yuchun? |
| ACLIA2-CS-0024 | Why does Qingling build "panda corridor zone" |
| ACLIA2-CS-0035 | Please list the events related to the movie "Initials D". |
| ACLIA2-CS-0036 | Please list the movies in which Zhao Wei participated. |
| ACLIA2-CS-0038 | What is the relationship between Xia Yu and Yuan Quan. |
| ACLIA2-CS-0048 | Who is Sean Chen (Chen Shin-An)? |
| ACLIA2-CS-0049 | Who is Lung Yingtai? |
| ACLIA2-CS-0057 | What is the disputes between China and Japan for the undersea natural gas field in the East China Sea? |
| ACLIA2-CS-0066 | What is the relationship between two Russian politicians, Kasianov and Putin? |
| ACLIA2-CS-0074 | Where are Japan's Northern Territories located? |
| ACLIA2-CS-0075 | Which countries have borders in the Kashimir region? |
| ACLIA2-CS-0088 | What is the relationship between the Golden Globe Awards and Broken-back Mountain? |
| ACLIA2-CS-0089 | What is the relationship between Kenneth Yen (K. T. Yen) and China? |


Table 4-7: Translations of OOV Phrases

| OOV Phrase | Correct | GT | VMNET |
|---|---|---|---|
| Riding Alone for Thousands of Miles | 千里走单骑 | 利民为千里单独 | 千里走单骑 |
| LI Yuchun | 李宇春 | 李玉春 | 李宇春 |
| Qingling (panda corridor zone) | 秦岭 | 宋庆龄 | 宋庆龄 |
| Initials D | 头文字 D | 缩写 D 的事件 | 缩写 D 的事件 |
| Zhao Wei | 赵薇 | 照委 | 赵薇 |
| Yuan Quan | 袁泉 | 袁区广 | 袁泉 |
| Sean Chen (Chen Shin-An) | 陈信安 | 肖恩陈(陈新的) | 陳信安 |
| Lung Yingtai | 龙应台 | 龙瀛台 | 龙应台 |
| East China Sea | 东海 | 东中国海域 | 东海 |
| Kasianov | 卡西亚诺夫 | Kasianov | 米哈伊尔·米哈伊洛维奇·卡西亚诺夫 |
| Northern Territories | 北方领土 | 北方领土 | 北领地 |
| Kashimir | 克什米尔 | Kashimir | Kashimir |
| Broken-back Mountain | 断臂山 | 残破的背山 | 断臂山 |
| Kenneth Yen (K. T. Yen) | 严凯泰 | 肯尼思日元(观塘日元) | 裕隆汽车 |


Table 4-8: Results of All Experimental Runs

| Run Name | MAP |
|---|---|
| NTCIR-8 CS-CS BEST | 0.4488 |
| VMNET-CS-CS-01-T | 0.4681 |
| VMNET-CS-CS-02-T | 0.4419 |
| VMNET-CS-CS-03-T | 0.4189 |
| VMNET-CS-CS-04-T | 0.3406 |
| VMNET-CS-CS-05-T | 0.4653 |
| NTCIR-8 EN-CS BEST | 0.4209 |
| VMNET-EN-CS-01-T | 0.3161 |
| VMNET-EN-CS-02-T | 0.3408 |
| VMNET-EN-CS-03-T | 0.3756 |
| VMNET-EN-CS-04-T | 0.3449 |

4.6.2 IR EVALUATION

The MAP values of all experimental runs, corresponding to each query processing technique and Chinese indexing strategy, are given in Table 4-8. The results of the monolingual runs give benchmarking scores for the CLIR runs. As expected, the highest MAP, 0.4681, is achieved by the monolingual run VMNET-CS-CS-01-T, in which the questions were manually segmented and all noise words were removed. It is encouraging to see that the automatic run VMNET-CS-CS-02-T, with only question template phrase removal, has a MAP (0.4419) only slightly lower than that of the best-performing CS-CS run (0.4488) in the NTCIR-8 evaluation forum (Sakai, et al., 2010).

When unigrams were used as the only indexing units, the MAP of VMNET-CS-CS-04-T dropped from 0.4681 to 0.3406. On the other hand, all runs using bigrams as indexing units, either exclusively or jointly, performed very well. The MAP of run VMNET-CS-CS-05-T, using bigrams only, is 0.4653, slightly lower than that of the top performer VMNET-CS-CS-01-T, which used both forms of indexing units. Retrieval performance can therefore be maximised by using both unigrams and bigrams as indexing units.

The highest MAP of a CLIR run (0.3756) is achieved by VMNET-EN-CS-03-T, which used VMNET for translation. Compared with the manual run VMNET-CS-CS-01-T, there is around 9% performance degradation, the joint result of noise words in the questions and possible information loss or added noise from English-to-Chinese translation, even though the named entity translation precision is relatively high. The best EN-CS CLIR run (MAP 0.4209) among all submissions to the NTCIR-8 ACLIA task used the same indexing technique (bigrams and unigrams) and ranking function (BM25) as VMNET-EN-CS-03-T, but with "query expansion based on RSV" (Sakai, et al., 2010). The 4.5% MAP difference between the forum's best run and the best VMNET run suggests that query expansion is an effective way to improve CLIR system performance.

Runs VMNET-EN-CS-01-T and VMNET-EN-CS-04-T, which both used Google Translate, provide direct comparisons with runs VMNET-EN-CS-02-T and VMNET-EN-CS-03-T respectively, which employed VMNET for translation. All runs using VMNET performed better than the runs using Google Translate. The performance difference between the CLIR runs using Google Translate and VMNET is the joint result of the translation improvement and other translation differences. As shown in Table 4-7, VMNET found the correct translations for eight more topics than Google Translate. It should be noted that two topics (ACLIA2-CS-0008 and ACLIA2-CS-0088) were not included in the final CLIR evaluation (Sakai, et al., 2010). Also, there is one phrase, "Kenneth Yen (K. T. Yen) (严凯泰)", for which VMNET could not find the correct translation, but it detected a highly associated term, "Yulon (裕隆汽车)", an automaker company in Taiwan of which Kenneth Yen is the CEO. Although Yulon is not a correct translation, it is still a good query term because it makes it possible to find the correct answer to the question "Who is Kenneth Yen?". However, this topic was not included in the NTCIR-8 IR4QA evaluation.


Moreover, a term can have multiple valid translations. In order to discover as many question-related documents as possible, alternative translations found by VMNET are also used as additional query terms; they are shown in Table 4-9. For example, 丁克 is the Chinese term for DINK in Mainland China, but 顶客族 is used in Taiwan. Furthermore, because VMNET gives the Wikipedia translation the highest priority when only one entry is found, a person's full name is used in person name translation rather than the short, commonly used name. For example, Cheney (former vice president of the U.S.) is translated into 迪克·切尼 rather than just 切尼.

Table 4-9: Alternative Translations

| NE | VMNET | Wiki Title |
|---|---|---|
| Princess Nori | 纪宫公主 | 黑田清子 |
| DINK | 丁克 | 顶客族 |
| BSE | 疯牛病 | 牛海绵状脑病 |
| Three Gorges Dam | 三峡大坝 | 三峡工程 |

The biggest difference between runs that used different translations, 3.07%, is between VMNET-EN-CS-03-T and VMNET-EN-CS-04-T, which both pruned the question template phrase for simple query processing. Although the performance improvement is modest, the correct translations and the additional query terms found by VMNET are still very valuable.

4.7 SUMMARY

Although question answering is not directly related to link discovery, the techniques used here for named entity identification in a question can also be used to recommend meaningful anchors in Wikipedia articles. General machine translation already achieves very good results, but VMNET can further improve translation accuracy. The results of the CLIR experiments indicate that VMNET is capable of providing high-quality translations for query terms or anchor candidates.


Also, since cross-lingual link discovery allows multiple target links for each anchor, it matches the nature of cross-lingual information retrieval well: a CLIR system can return a list of relevant cross-lingual documents for a given query, which in this case is an identified anchor. CLIR can accordingly be adapted for cross-lingual link discovery, and the accuracy of anchor translation is critical to both the performance of CLIR and the quality of the recommended links. Consequently, a CLIR system embedded as a sub-component of a CLLD system can achieve good link discovery results by using VMNET for translation and a simple indexing strategy (bigrams and unigrams for Chinese text).


Chapter 5: Cross-lingual Link Discovery Evaluation Framework Chapter 5 details an assessment and evaluation framework for the evaluation of cross-lingual link discovery approaches. There could be many methods to realise cross-lingual link discovery in Wikipedia. However, 

How effective are they in identifying meaningful anchors and also suggesting relevant links for Wikipedia articles?



How do the methods used in Chinese natural language processing and cross-lingual information retrieval affect the accuracy of anchor identification?



How does cross-lingual information retrieval affect the performance of a cross-lingual link discovery system?



How is the performance of one CLLD system compared with others?

Without a proper evaluation, such questions cannot be answered. To effectively benchmark CLLD system performance, it is necessary to have a standardised evaluation framework. The framework includes standard topics and document collections, evaluation metrics, and toolkits for run pooling, link assessment and system evaluation. This chapter discusses this framework in several parts: introduction (section 5.1), evaluation methodology (section 5.2), an overview of the evaluation framework (section 5.3), cross-lingual link discovery task definition (section 5.4), topics for evaluation (section 5.5), document collections (section 5.6), run submission specification (section 5.7), run validation (section 5.8), link assessment (section 5.9), and evaluation methods (section 5.10). Section 5.11 summarises this chapter.


5.1 INTRODUCTION

CLLD is a broad research topic aimed at better cross-lingual information viewing and editing in a knowledge base, regardless of which language pairs are involved. However, cross-lingual information searching for different language pairs may have special features and needs to be treated differently. Also, to promote research on CLLD and attract international collaboration, a proposal for a new pilot CLLD task was submitted by the author of this thesis to the NTCIR workshop17, an evaluation forum focused on cross-lingual information access between English and East Asian languages. The new Cross-lingual Link Discovery (Crosslink)18 task, the research presented here, was accepted at NTCIR-919 in 2011.

The goal of the pilot CLLD task is to create a reusable resource for evaluating automated cross-lingual link discovery approaches. The results of this research can be used in building and refining systems for automated link discovery. This NTCIR pilot task focuses on linking between English source documents and Chinese, Korean, and Japanese target documents. The task organisers need to build and deliver a standard platform for the evaluation of cross-lingual link discovery approaches. Participants need to build and test CLLD systems and evaluate them in an independent, reproducible, and methodologically sound manner.

Unlike the traditional cross-lingual links in Wikipedia, the CLLD task extends the page-to-page language links to anchored cross-lingual links. With the help of anchored cross-lingual links in articles, knowledge base users can have easy access to multi-lingual information via a simple link click. To identify good or bad CLLD algorithms, an assessment and evaluation framework was proposed for benchmarking CLLD system performance.

17 http://research.nii.ac.jp/ntcir/index-en.html
18 http://ntcir.nii.ac.jp/CrossLink/
19 http://research.nii.ac.jp/ntcir/ntcir-9/index.html


The framework includes standard topics, document collections, a gold standard dataset (qrels), evaluation metrics, and toolkits for run pooling, link assessment and system evaluation. This chapter presents the construction and the critical components of a general cross-lingual link discovery evaluation framework. Chapter 6 discusses the realisation of Chinese / English bidirectional link discovery and the evaluation of the proposed approaches using this framework. The results of the NTCIR-9 Crosslink task experiments that utilised the framework for system performance benchmarking are analysed in Chapter 7, where the effectiveness of cross-lingual link discovery in helping knowledge discovery in Wikipedia is discussed.

5.2 METHODOLOGY

The effectiveness, the efficiency and the feasibility of various CLLD approaches should be supported by the results of properly designed experiments. Effectiveness is the most important factor in benchmarking a CLLD system and must be considered when designing the evaluation framework. The proposed evaluation methodology for cross-lingual link discovery system benchmarking is illustrated in Figure 5-1. Similar to the setup of the Cranfield methodology (Ellen, 2007), there are four parts in Figure 5-1: Inputs, Systems, Outputs, and Evaluation.


Figure 5-1: The Cross-Lingual Link Discovery Evaluation Methodology

The Inputs part contains a test collection that is used as the input to various CLLD systems. The test collection includes topics and target document sets.

- Topics are a small set of orphaned Wikipedia articles (with links removed) in one language (e.g., English). These articles should be processed by CLLD systems for cross-lingual link recommendation.

- Target document sets comprise entire sets of Wikipedia articles in other language domains (e.g., Chinese Wikipedia). A CLLD system should identify relevant links from these target document sets for each topic. It needs to be noted that the target document sets contain prior link knowledge: the pre-existing links in articles.

The Systems section of the methodology consists of a set of CLLD systems to be assessed by the framework. Each CLLD system may be implemented with different approaches for realising cross-lingual link discovery.


The Outputs part of the methodology consists of the experimental runs that are produced by the different CLLD systems. An experimental run contains sets of recommended anchors and links for all topics. The Evaluation part of the methodology includes assessment and evaluation.

- Assessment: there can be two kinds of assessment: automatic assessment and manual assessment. As an advantage of using Wikipedia as the target document set, the already existing links (called the Wikipedia ground-truth) can be used for automatic assessment. Manual assessment is the link relevance judgement done by human assessors. The details of these two kinds of assessment are discussed in section 5.9.

- Evaluation: to evaluate a CLLD system or an approach, properly designed system evaluation metric(s), an assisting software tool and gold standard datasets from relevance assessment are needed. The gold standard datasets, also called relevance judgements (qrels), include relevant sets of anchors and their associated target links for the given topics.

With the evaluation results, a CLLD system can then be further refined to provide better automated cross-lingual linking for knowledge management and sharing.

5.3 AN OVERVIEW OF THE FRAMEWORK

The complete evaluation framework includes topics, document collections, a gold standard dataset, evaluation metrics, and toolkits for run pooling, link assessment and system evaluation. The tool set of the cross-lingual link discovery framework should include these three core parts: validation, assessment, and evaluation.

For an effective evaluation, submission validation is a very important part of the evaluation framework because it ensures the validity of submitted anchors, which will be pooled for correct manual assessment by human assessors. With run validation, all submissions generated by various CLLD systems with different approaches will be pooled for manual assessment in the style of Cranfield. With an evaluation tool and the assessment results from either the manual assessment or the Wikipedia ground-truth (both detailed in section 5.9), the performance of an experimental run can then be quantified. Therefore, the evaluation results can be used for refining the CLLD system.

The tool set involved in this CLLD evaluation framework and presented in this thesis, including the validation tool, the assessment tool, the evaluation tool, and others, is a multi-lingual adaptation of the tools used in the INEX Link-the-Wiki track (Huang, et al., 2010). However, these tools (now open sourced20) were restructured and improved by the author of this thesis according to the peculiarities of CLLD.

5.4 TASK DEFINITION

The task of cross-lingual link discovery is to identify prospective anchors for the provided topics and recommend cross-lingual links for them in the standard document collections. A prospective anchor is a piece of text that is relevant to the topic and worthy of being linked to other documents for further reading. Although there is no hard limit to the number of anchors that may be inserted into a document, a user will become overwhelmed if every single term in an article is also an anchor, and so for evaluation purposes a limit of 250 anchors per document was imposed. Wikipedia currently supports 282 languages, but for evaluation purposes only up to 5 targets were allowed per anchor. In total this makes up to 1,250 outgoing links per article. Although Wikipedia is a constantly evolving collection, for evaluation purposes a snapshot in a small number of languages (Chinese, Japanese, and Korean (CJK)) was taken. It is important to stress that these are the first experiments in CLLD and so such restrictions are not unreasonable. The evaluated links list for a document can be symbolised as:

$$a_i \rightarrow (d_1, d_2, \ldots, d_n), \quad i \le 250, \; n \le 5$$

where $a_i$ is the ith anchor in the source document, and $d_j$ is the jth cross-lingual target document for the anchor. Both anchors and targets should be ordered on relevancy.
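As a concrete illustration, a run's links for one topic can be held in a simple mapping that respects these limits. This is only a toy sketch (the actual submission format is the XML specified in section 5.7), and the anchor texts and document IDs below are invented.

# Toy representation of the evaluated links list for one topic document.
MAX_ANCHORS, MAX_TARGETS = 250, 5

links = {
    # anchor text -> ordered list of target document IDs (invented values)
    "Bruce Lee": [30007, 51344],
    "Knockout": [17320],
}

def within_limits(run: dict) -> bool:
    """Check the per-topic limits: at most 250 anchors, at most 5 targets each."""
    return (len(run) <= MAX_ANCHORS
            and all(len(targets) <= MAX_TARGETS for targets in run.values()))

assert within_limits(links)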

20 http://code.google.com/p/crosslink/


5.5 TOPICS

5.5.1 TOPICS FOR NTCIR ENGLISH-TO-CJK TASKS

For the Crosslink task at NTCIR-9, English was chosen as the source language, and the task was to identify the most relevant anchors and, for each anchor, the most relevant targets in the standard CJK document collections.

Table 5-1: Training Topics for Crosslink Task

#   Title          ID
1   Australia      4689264
2   Femme fatale   299098
3   Martial arts   19501

Table 5-2: Test Topics for Crosslink Task

#   Title            ID        #   Title               ID
1   Zhu Xi           244468    14  Cretaceous          5615
2   Kimchi           178952    15  Croissant           164372
3   Kim Dae-jung     152967    16  Crown prince        236210
4   Fiat money       22156522  17  Cuirassier          504031
5   Boot             191574    18  Dew point           54912
6   Asian Games      39205     19  Oracle bone script  480299
7   Spam (food)      28367     20  Ivory               15165
8   Source code      27661     21  Jade                191395
9   Puzzle           86368     22  Kiwifruit           17363
10  Pasta            23871     23  African Wild Ass    3696045
11  Abdominal pain   593703    24  Mohism              21032
12  Cuttlefish       20976520  25  Sushi               28271
13  Credit risk      372603

Only three topics were used for system training; another set of 25 articles was randomly chosen from the English Wikipedia and used as test topics for the evaluation. The training topic details are given in Table 5-1, and the test topics are given in Table 5-2. All test topics had their pre-existing links removed, a process known at INEX as orphaning.

5.5.2 TOPICS FOR CHINESE-TO-ENGLISH TASK

Besides participating in the English-to-Chinese CLLD subtask at the NTCIR-9 Crosslink track, Chinese-to-English experiments were also conducted to complete the bi-directional link discovery between Chinese and English documents. For the Chinese-to-English link discovery experiments, a set of 36 topics21 (including 香港十元紙幣 (Hong Kong ten-dollar note)) was created. The list of Chinese topics is given in Table 5-3. It needs to be noted that the number of Chinese topics is different from the number of English topics, and not all Chinese counterparts of the English topics are used as Chinese topics in the Chinese-to-English document linking experiments. Generally, most Chinese counterparts of the English topics are short. In order to have a proper system evaluation using the Wikipedia ground-truth, it is important for each topic to include a reasonable number of links. So those Chinese counterparts of English topics which were too short were discarded, and additional, alternative Chinese Wikipedia articles of proper length were chosen as topics.

5.6 DOCUMENT COLLECTIONS

Wikipedia is an excellent collection to use for CLLD because it is a (mostly) closed hypertext collection and exists in several languages. Articles can be re-distributed for experiments under the Creative Commons Attribution-Share-Alike License 3.022, and so copyright issues are minimal.

21 http://crosslink.googlecode.com/files/zh-topics-36.zip
22 http://creativecommons.org/licenses/by-sa/3.0/


Table 5-3: Chinese Topics for Chinese-to-English Link Discovery

#   Title        ID        #   Title          ID
1   朱熹         7300      19  孙悟空         11200
2   韓國泡菜     16885     20  象牙           176642
3   金大中       49406     21                 155918
4   法定貨幣     118690    22  奇異果         193518
5                217829    23  非洲野驴       41923
6   亚洲运动会   111369    24  墨家           18322
7   午餐肉       268147    25  黄山           135989
8   源代码       99224     26  汇率           146192
9   智力游戏     47565     27  寿司           40425
10  伪代码       17935     28  甲骨文         11616
11  肚痛         962949    29  意式麵食       84315
12  乌贼         129496    30  女人香         331605
13  信用風險     117819    31  功利主义       75635
14  白垩纪       42661     32  黑田清子       140559
15  羊角麵包     326551    33  桑拿           394771
16  储君         1119970   34  僧伽           750323
17  胸甲骑兵     1099245   35  旧金山         23829
18  露点         2129      36  香港十元紙幣   583701

June 2010 dumps of Wikipedia (English, Chinese, Japanese, and Korean versions) were downloaded and converted into XML using the YAWN system (Schenkel, Suchanek, & Kasneci, 2007). A multi-lingual adaptation of the Schenkel et al. (2007) Java YAWN program was used to insert the XML structure. The document collection is summarised in Table 5-4. Column 1 lists the language, column 2 the number of articles in that collection, and column 3 the number of links to English. For example, after conversion there were 316,251 Chinese articles, of which 170,637 contained links to English articles.

Table 5-4: Article Statistics of CJK Corpora

Corpus     #Articles   Cross-lingual links
English    3,484,250   169,974 (en→zh, 4.9%); 292,548 (en→ja, 8.4%); 87,367 (en→ko, 2.5%)
Chinese    316,251     170,637 (zh→en, 54.0%)
Japanese   715,911     289,579 (ja→en, 40.4%)
Korean     201,512     89,230 (ko→en, 44.3%)
Total      4,717,924   200,825

5.7 SUBMISSION

5.7.1 RULES

Topic files, including their CJK counterparts, must be removed from the document collections either physically or virtually. Special case links (numbers, years, dates, and century links) are not recommended for inclusion in the runs. There were several reasons for this decision. First, they are special cases that can be handled uniformly. Second, Wikipedia has special rules on their use; chronological items should not be linked "unless their content is germane"23. If those links are included in the submissions, they will be rejected and not considered in either assessment or evaluation.

5.7.2 RUN SPECIFICATION

XML is used for formatting run results. The specification of the submission is similar to that of the INEX Link-the-Wiki task (INEX, 2010). The document type declaration (DTD) for the experimental run XML file is given in Table 5-5. Only submissions complying with this DTD can be recognised by the assessment and evaluation tools. The root element crosslink-submission should contain information about the participant's ID, the run ID (which should include university affiliation), the task (which should be either A2F or A2B) and the default target language, given as a language abbreviation (zh, ja, or ko). The linking algorithm should be described in the description node. The collections element contains a list of document collections used in the run. Generally, each collection element should contain text from one of the following: Chinese Wikipedia, Japanese Wikipedia or Korean Wikipedia. Each topic should be contained in a topic element, which should contain an anchor element for each anchor-text that should be linked. Each anchor element should include offset, length and name attributes giving detailed information about the recommended anchor, and should also have one or more tofile subelements with the target document ID contained within them. The tofile element should contain the following information: language id, title and bep (specified in the lang, title and bep_offset attributes respectively) of the linked document.

5.7.3 ANCHOR OFFSET

The position (zero-based offset) of an anchor is calculated by counting the number of bytes from the beginning of the provided topic file in binary form. The offset of an anchor would be different if it were calculated by counting the number of characters, because a character can occupy up to several bytes of storage depending on the encoding scheme. Anchors with incorrect offsets will not be properly recognised by the assessment and evaluation framework. Similarly, the length of an anchor is the number of bytes occupied by that anchor text.
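Both the byte-based offsets and the run structure described above can be sketched with Python's standard library. This is a rough illustration only: the element and attribute names follow the prose description of the DTD rather than Table 5-5 itself, and the topic text, identifiers and titles are invented.

import xml.etree.ElementTree as ET

def byte_span(topic_text, anchor):
    """Zero-based byte offset and byte length of `anchor` in the UTF-8 topic file."""
    char_pos = topic_text.index(anchor)
    offset = len(topic_text[:char_pos].encode("utf-8"))  # count bytes, not characters
    return offset, len(anchor.encode("utf-8"))

topic_text = "... a member of the Commonwealth of Nations ..."
offset, length = byte_span(topic_text, "Commonwealth of Nations")

root = ET.Element("crosslink-submission", {
    "participant-id": "1", "run-id": "QUT_Sample_01",    # invented identifiers
    "task": "A2F", "default-lang": "zh"})
ET.SubElement(root, "description").text = "sample run built for illustration"
collections = ET.SubElement(root, "collections")
ET.SubElement(collections, "collection").text = "Chinese Wikipedia"
topic = ET.SubElement(root, "topic", {"id": "19501"})
anchor = ET.SubElement(topic, "anchor", {
    "offset": str(offset), "length": str(length), "name": "Commonwealth of Nations"})
tofile = ET.SubElement(anchor, "tofile", {"lang": "zh", "title": "英联邦", "bep_offset": "0"})
tofile.text = "5557"                                     # invented target document ID

ET.ElementTree(root).write("run.xml", encoding="utf-8", xml_declaration=True)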

23 http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_(linking)#Specific_cases


Table 5-5: Submission XML File DTD






5.8 VALIDATION

5.8.1 VALIDATION IN RUN SUBMISSION

Unlike other information retrieval tasks, in cross-lingual link discovery the recommended anchors must have correct offsets so that they can be displayed properly in the manual assessment tool. It is therefore important that all anchor offsets are correctly specified. Participants are required to verify their experimental results before submitting their runs, and only runs that pass the validity check are accepted.

For the purpose of checking run validity, a submission validation tool was developed as a part of the evaluation framework for the CLLD evaluation task. A GUI snapshot of the tool is given in Figure 5-2. With the validation tool, the recommended anchors can be self-verified by link discovery task participants. By going through all the recommended anchors, which are displayed in the pane on the left hand side, participants of the CLLD evaluation task can easily tell whether they have been mispositioned. If an anchor is specified with an incorrect offset, the highlighted text will include incomplete word(s). For example, a mispositioned anchor "Commonwealth of Nat" contains only the first three characters of the word "Nations". In the figure, the highlighted anchor "Commonwealth of Nations" (in yellow) is correctly positioned, and one (英联邦) of its associated candidate links is displayed in the pane on the right hand side.

Figure 5-2: Crosslink Run Validation Tool

5.9 ASSESSMENT

Evaluating link discovery is awkward because of the number of degrees of freedom in judging the relevance of suggested anchors and target links. The algorithm must identify relevant anchors and relevant target articles. Each anchor might occur several times within the source article, and in subtly different linguistic forms. It is unreasonable to score each instance in a single run, but also unreasonable not to score different linguistic variants in different runs. The best approach to measuring this imprecision is currently unclear, but it has been studied in the INEX Link Discovery Track (Huang, et al., 2008; Huang, Geva, et al., 2009; Huang, et al., 2010), where it changed from year to year. However, what was discovered at INEX was that performance could be measured in two possible ways: automatically (using the Wikipedia itself as the ground truth); and manually (in the TREC paradigm).

5.9.1 AUTOMATIC ASSESSMENT

In automatic assessment, the ground-truth (query relevance) is a set of links derived from the existing link structure in Wikipedia articles through triangulation. These come from two sources: first, all the mono-lingual links out of the translation of the source article are considered relevant; second, the cross-lingual counterparts of the mono-lingual links out of the source article are considered relevant. So in total there are two sets of Wikipedia ground-truth links: one from the source article; and the other from the article's counterpart in the target language. For instance, if the source article is the English "Martial arts" article and it needs to be linked to Chinese pages, then:

1) The Chinese set of ground-truth links comprises the links out of the Chinese 武術 article (the English article's counterpart). For example:

anchor (武術) → article(武術)
anchor (保全) → article(保全)
anchor (八极拳) → article(八极拳)
...

2) The English set of ground-truth links comprises the Chinese counterparts (through triangulation) of all links out of the English article. For example:

anchor (Systema) → article(Systema) → article(西斯特瑪)
anchor (Bruce Lee) → article(Bruce Lee) → article(李小龙)
anchor (Knockout) → article(Knockout) → article(擊倒)
...

It is accepted that the ground truth may be incomplete. For example, it may contain links from the English version for which there is no appropriate anchor in the CJK versions. It may also not contain the kinds of links that users click. However, it is reasonable to believe that this will not adversely affect the relative rank order of CLLD systems. The Wikipedia ground truth is easy to obtain, but it does not necessarily reflect user preferences optimally. Much of it is automatically generated. Even Wikipedia links manually created by content contributors may be disliked by most assessors. So evaluation with automatic assessment may not really reflect the true performance of CLLD systems. Evaluation using automatic assessment is also likely biased towards links already in Wikipedia. Huang et al. (2009) suggest that manual assessment of monolingual link discovery could produce substantially different results.
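The triangulation described above reduces to two dictionary look-ups once the link tables have been mined. The following sketch assumes plain dictionaries for the language links and the mono-lingual links; all entries are the illustrative ones from the example.

# Sketch of deriving the Wikipedia ground-truth links by triangulation.
lang_links = {"Martial arts": "武術", "Bruce Lee": "李小龙", "Knockout": "擊倒"}
en_links = {"Martial arts": {"Bruce Lee", "Knockout"}}   # English mono-lingual links
zh_links = {"武術": {"保全", "八极拳"}}                   # Chinese mono-lingual links

def ground_truth(topic_en):
    """Relevant Chinese targets for an English topic article."""
    relevant = set()
    counterpart = lang_links.get(topic_en)
    if counterpart:                                      # 1) links out of the Chinese counterpart
        relevant |= zh_links.get(counterpart, set())
    for target in en_links.get(topic_en, set()):         # 2) Chinese counterparts of English links
        if target in lang_links:
            relevant.add(lang_links[target])
    return relevant

print(ground_truth("Martial arts"))   # {'保全', '八极拳', '李小龙', '擊倒'} (in some order)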

5.9.2 MANUAL ASSESSMENT

5.9.2.1 CRITERIA FOR MANUAL ASSESSMENT

An alternative approach to automatic assessment is manual assessment in the style of TREC. Human assessors are employed to examine each recommended link and to make a judgement call on the relevance (or otherwise) of the target. Human assessors can, of course, judge not only the link but can individually assess the relevance of the anchor and the target: a target might be relevant but the anchor not so (and vice versa). In addition to the usual criticisms of the imperfection of evaluation using manual assessment in traditional information retrieval tasks (e.g. various tracks at INEX or TREC), there are new issues with cross-lingual link discovery. For example, it is difficult to find assessors skilled enough to read the source articles in one language (e.g. English) while also having a good understanding of the target document in the alternative language. Note that the skill level (in both languages) required to do this is higher than that of an ordinary user reading both documents.


For manual assessment the anchors and targets are shown to a human. For this reason it is important to include the offset and length of the anchor in the run. For the experiments herein the zero-based byte offset was used (lengths were also specified in bytes). As a validity check this was compared against the text that was also specified in the run (invalid matches were discarded). As can be expected, pooling was used.

5.9.2.2 MANUAL ASSESSMENT TOOL

For manual assessment, an assessment tool was developed for the task. This GUI tool is provided to assessors for efficient anchor and link assessment. All the prospective anchors and the corresponding links are pooled. An assessor inspects each anchor and its corresponding prospective links, accepting or rejecting them one by one. This is not dissimilar to the assessment approaches used in CLIR evaluations. Given an anchor and its context (the "query"), the assessor judges the relevance of the target document. The assessment can be done by designated independent assessors. The design of the assessment tool is similar to that of the submission validation tool. The GUI snapshots of the Chinese-to-English, English-to-Chinese, English-to-Japanese and English-to-Korean assessment tools are given in Figure 5-3, Figure 5-4, Figure 5-5, and Figure 5-6 respectively. There is no functional difference in the assessment of any language pair for cross-lingual link discovery.


Figure 5-3: Crosslink Manual Assessment Tool (Chinese-to-English)

Figure 5-4: Crosslink Manual Assessment Tool (English-to-Chinese)


Figure 5-5: Crosslink Manual Assessment Tool (English-to-Japanese)

Figure 5-6: Crosslink Manual Assessment Tool (English-to-Korean)


5.10 EVALUATION METHODS

Evaluation of CLLD approaches and systems was performed according to the already accepted INEX methods of file-to-file (F2F) and anchor-to-file (A2F). In file-to-file evaluation, the performance of the link discovery algorithm in finding articles that should be linked-to is measured regardless of the relevancy of the anchors themselves. For instance, if the word "cooked" in the Custard article is marked irrelevant in assessment but the anchor is correctly linked to "pressure cooker", which is in the qrels, this link will still be considered relevant. F2F evaluation is well suited to automatic assessment of CLLD because appropriate anchors cannot necessarily be extracted from the corpus whereas appropriate target articles can.

In anchor-to-file evaluation, the correctness of anchors is also considered. Any given anchor can either be relevant or not relevant to the article; if an anchor is not relevant then under evaluation the link is considered non-relevant even if the target article is relevant. The term non-relevant in reference to anchor text means that a user will not see a need for the anchor text to be linked, either because it does not describe an important concept in the context where it appears, or because it is simply considered trivial. Manual assessment should be used to get a good ground truth for anchor-to-file evaluation, and this ground truth can also be used for file-to-file evaluation.

5.10.1 LINK PRECISION AND RECALL

As with other information retrieval evaluations, precision and recall are the two underlying key metrics used to measure system performance. Precision and recall (Baeza-Yates & Ribeiro-Neto, 1999) in traditional information retrieval tasks are calculated for each information request. In CLLD precision and recall have similar definitions, but they are computed for each article-based topic with the relevancy of anchors considered. Also, their definitions differ subtly between the two evaluation methods (F2F and A2F).


File-to-File Evaluation

$$P_{F2F} = \frac{\text{number of relevant links retrieved}}{\text{number of links retrieved}} \quad (5.1)$$

and

$$R_{F2F} = \frac{\text{number of relevant links retrieved}}{\text{number of relevant links in the qrels}} \quad (5.2)$$

The precision and recall are computed at each anchor (recall that 5 targets per anchor were permitted).

Anchor-to-File Evaluation

For anchor-to-file evaluation a similar precision definition to that used in INEX 2009 (Huang, et al., 2010) was used. Both precision of anchor and precision of target are considered. The score of the anchor is defined as:

$$f_{anchor}(i) = \begin{cases} 1 & \text{if anchor } i \text{ is relevant and has at least one relevant target} \\ 0 & \text{otherwise} \end{cases} \quad (5.3)$$

That is, if anchor i is relevant and it has at least one relevant target, then $f_{anchor}(i) = 1$. Otherwise the score is 0. For each target of an anchor, if that target document, j, is relevant to the anchor then it receives a link score $f_{link}$ of 1, otherwise 0. Thus:

$$f_{link}(j) = \begin{cases} 1 & \text{if target } j \text{ is relevant to anchor } i \\ 0 & \text{otherwise} \end{cases} \quad (5.4)$$


Finally, the anchor-to-file precision and recall with respect to an article are:

$$P_{A2F} = \frac{1}{n} \sum_{i=1}^{n} \left( f_{anchor}(i) \cdot \frac{1}{k_i} \sum_{j=1}^{k_i} f_{link}(j) \right) \quad (5.5)$$

$$R_{A2F} = \frac{1}{N} \sum_{i=1}^{n} f_{anchor}(i) \quad (5.6)$$

where n is the number of identified anchors; N is the number of relevant anchors in the qrels; and ki is the number of targets recommended for anchor i. Although the computation of precision and recall in the CLLD evaluation is similar to that in traditional information retrieval evaluation, the meaning of the computed values should be interpreted differently, especially for anchor-to-file evaluation. For example, in traditional information retrieval evaluation, a precision of 0.3 for the top 10 retrieved documents returned for a search request means that there are three correct hits in total; but a precision of 0.3 in the anchor-to-file link discovery evaluation could mean that, among the top 10 identified anchors, there are at least three relevant anchors, each with at least one relevant link.

As discussed previously in the task definition (section 5.4), all anchors and links should be sorted according to their degree of importance or relevance to the topic article. The ranked anchor list can then be examined from the top entry with the highest relevance score to the bottom entry with the least. Therefore, the precision and recall scores can vary at each cut-off point. Assume that there is a topic article Atopic which needs to be linked to the target collection, and that the query relevance set of good anchors obtained from the manual assessment for the topic contains the following anchors and links:


{a2 → (d131, d234), a3 → (d314), a5 → (d1, d33, d352), a8→ (d3), a10 → (d13, d23), a11 → (d41, d389), a12 → (d88)}

In the above query relevance set, there are a total of eight relevant anchors and twelve relevant links. Also, assume that a newly designed cross-lingual link discovery algorithm named Algorclld is used to identify a set of candidate anchors for the topic article Atopic and, for each anchor, to suggest a few links. To distinguish the relevant links (anchor and target pairs), three markers with different meanings are used to identify relevant anchors or links. They are shown below:

* — where only the anchor is relevant
† — where only the target is relevant, but the associated anchor is not
‡ — where the target is relevant, and the associated anchor is also relevant

The recommended anchors and their associated links are sorted according to the relevance degree. They are given as follows:


{a1 → (d131), a2 → (d13, d234, d350), a3 → (d323), a4 → (d123, d315), a5 → (d1, d33, d235), a6 → (d13, d23, d35), a7 → (d12, d24, d36, d231, d389), a8→ (d3), a9 → (d19, d99, d101, d203, d450), a10 → (d13, d23), a11 → (d4, d39, d375, d399), a12 → (d88, d293)}

This set of anchors with links can be symbolised as Ra. In total, twelve anchors and thirty-two links are recommended in the set. Matching the set against the query relevance set, eleven links in total are relevant to the topic (their targets are relevant, even though for some the associated anchors are not), but only seven of them are associated with anchors that are also considered relevant to the topic. Therefore, for file-to-file evaluation, the precision and recall for Atopic using the CLLD algorithm Algorclld are calculated as below:
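As a worked illustration, with eleven relevant links among the thirty-two recommended and twelve relevant links in the query relevance set, equations 5.1 and 5.2 give:

$$P_{F2F} = \frac{11}{32} \approx 0.34, \qquad R_{F2F} = \frac{11}{12} \approx 0.92$$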


For anchor-to-file evaluation, the precision and recall should be computed as below:
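As a worked illustration under the reconstructed equations 5.5 and 5.6, matching the recommended set against the query relevance set anchor by anchor gives: a2 has one relevant target out of three recommended, a5 two of three, a8 one of one, a10 two of two, and a12 one of two. Hence:

$$P_{A2F} = \frac{1}{12}\left(\frac{1}{3} + \frac{2}{3} + 1 + 1 + \frac{1}{2}\right) \approx 0.29, \qquad R_{A2F} = \frac{5}{8} = 0.625$$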

5.10.2 SYSTEM EVALUATION METRICS

For the experiments herein, Precision-at-N, R-Prec, and Link Mean Average Precision (LMAP) were the metrics used to quantify performance of cross-lingual link discovery. For both evaluation types, Link Mean Average Precision (LMAP) is defined as:

$$\text{LMAP} = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{m} \sum_{k=1}^{m} P_{k}^{t} \quad (5.7)$$

where n is the number of topics (source articles used in evaluation); m is the number of identified items (articles for F2F or anchors for A2F); and $P_{k}^{t}$ is the precision of the top k items for topic t. This metric (LMAP) is analogous to the traditional MAP that is used for measuring the performance of document retrieval, but in cross-lingual link discovery the precision $P_{k}^{t}$ at the top k items is computed using equation 5.1 or 5.5 according to the evaluation type (either F2F or A2F). Of particular note is that the top K items in F2F evaluation are the links selected from the ranked anchors one by one until the number of links reaches K. For instance, the top 10 links from the example anchor set Ra presented in the previous section 5.10.1 are listed below:

a1 → d131, a2 → d13, a2 → d234, a2 → d350, a3 → d323, a4 → d123, a4 → d315, a5 → d1, a5 → d33, a5 → d235

R-Prec is defined as:

$$\text{R-Prec} = \frac{1}{n} \sum_{t=1}^{n} P_t@R \quad (5.8)$$

where n is the number of topics; and $P_t@R$ is the precision at R, where R is the number of unique items in the qrels of topic t. Similarly, Precision-at-N is computed as the average precision over all topics (source articles) at a pre-defined position N in the results list. The values of N were chosen as: 5, 10, 20, 30, 50, and 250.


5.10.3 EVALUATION TOOL

5.10.3.1 USER INTERFACE

A GUI program is employed for evaluation. It can easily be used to compute the performance scores (Precision-at-N, R-Prec, and LMAP) of different CLLD systems using qrels either from the Wikipedia ground-truth or from manual assessment. With this evaluation tool, a comparative plot of precision-recall curves for different systems can also be generated. The GUI snapshot of this tool is given in Figure 5-7.

Figure 5-7: Crosslink Evaluation Tool


5.10.3.2 INTERPOLATED PRECISION-RECALL PLOT

Putting the interpolated precision-recall curves of different CLLD systems or approaches together provides an easy way to compare performance. An example of an interpolated precision-recall plot for the sample runs provided to participants is given in Figure 5-8. The sample runs were generated by the author of this thesis using a simple CLLD approach with the page name matching (PNM) algorithm, which will be further discussed in section 6.2.2.

Figure 5-8: An Example of Interpolated Precision-Recall Plot

The computation of interpolated link precision is similar to that of interpolated precision used in the evaluation of a traditional information retrieval task (Manning, Raghavan, & Schütze, 2008). For each experimental run, the interpolated link precision is measured at 20 recall levels: {0.05, 0.10, 0.15, 0.20, ..., 1.00}. The interpolated precision-recall curve of a run can reflect the relationship between precision and recall through the anchor / link ranking.


As the recommended anchors or links should be ordered based on the degree of relevance, for each topic the ranked anchor / link list can be denoted as:

$$\{l_1, l_2, \ldots, l_k, \ldots, l_n\} \quad (5.9)$$

where $l_k$ is the kth anchor / link in the ranked link list and n is the number of identified anchors / links in the list. Generally, from link $l_1$ to $l_k$, the link precision and recall of the top K links can be computed and used to plot a precision-recall curve (but with jiggles). The "saw-tooth shape" (Manning, et al., 2008) of the curve results from the fact that the precision can go either up or down when the (k+1)th link comes in. With the interpolated precisions taken at a series of predefined recall points, a relatively smooth precision-recall curve can be produced. The interpolated link precision for each experimental run, denoted as $P_I(R)$ (used for both types of evaluation, F2F and A2F) at a certain recall level R, is defined as:

$$P_I(R) = \frac{1}{n} \sum_{t=1}^{n} \max_{R' < r \le R} P_t(r) \quad (5.10)$$

where R is the predefined recall level, one of the values in {0.05, 0.10, 0.15, 0.20, ..., 1.00}; R' is the recall level immediately below R (when R = 0.05, R' = 0); r is an already computed recall value with $R' < r \le R$; n is the number of topics used for evaluation; and $P_t(r)$ is the precision at recall r for topic t. Interpolated link precision is thus the highest precision found at each recall level. After such precision normalisation with equation 5.10, if there is no precision found at a certain recall level, the precision for that recall level is assigned zero.
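The interpolation can be sketched in a few lines of Python. The function below assumes that each topic contributes a list of (recall, precision) points computed down its ranked link list; it follows the reconstructed equation 5.10 rather than any official tool.

def interpolated_precision(points_per_topic, levels=None):
    """points_per_topic: one list of (recall, precision) pairs per topic."""
    if levels is None:
        levels = [i / 20 for i in range(1, 21)]           # 0.05, 0.10, ..., 1.00
    curve = []
    for idx, level in enumerate(levels):
        prev = levels[idx - 1] if idx > 0 else 0.0
        total = 0.0
        for points in points_per_topic:
            # highest precision observed in the recall interval (prev, level]
            in_range = [p for r, p in points if prev < r <= level]
            total += max(in_range) if in_range else 0.0   # zero when nothing falls here
        curve.append(total / len(points_per_topic))
    return curve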


5.11 SUMMARY

In this chapter, a CLLD evaluation framework for link assessment and system benchmarking was described in detail, including the evaluation methodology, the cross-lingual link discovery task definition, the run submission specification, the evaluation metrics, and the assessment and evaluation tools. It is important for the study of CLLD to have a standardised evaluation framework which can help benchmark the performance of various CLLD systems and identify good CLLD realisation approaches. The evaluation methods and the evaluation framework described in this chapter were utilised to quantify system performance at the NTCIR-9 Crosslink task. Furthermore, this framework can easily be adapted for the evaluation of cross-lingual link discovery for any other language pair. With the standard test collections and the evaluation data set developed for the CLLD task, systems or applications realising CLLD can be built or further refined to provide better automated cross-lingual linking for knowledge management and sharing.


Chapter 6: Chinese / English Cross-lingual Link Discovery

Chapter 6 examines automated cross-lingual hypertext identification for Wikipedia, and focuses on a particular case of cross-lingual link discovery: Chinese / English link discovery (C/ELD). The work includes a study of the effects of natural language processing (Chinese segmentation and machine translation) on CLLD. The system performance was benchmarked using the evaluation framework discussed in Chapter 5. The experimental results show that a link mining approach that mines the existing link structure for anchor probabilities performs the best. The outcome of this research will significantly help improve the user experience of knowledge sharing in a multi-lingual environment. The techniques described here for automated cross-lingual link generation can assist multi-lingual users in cases where a particular topic is not covered in all languages, is not equally covered in all languages, or is biased in one or more languages; they can also assist language learning. This chapter covers: an overview of Chinese / English link discovery (section 6.1), C/ELD methods (section 6.2), C/ELD implementations (section 6.3), C/ELD experiments (section 6.4), and results and discussion (section 6.5). The chapter is summarised in section 6.6.


6.1 OVERVIEW

6.1.1 CHINESE AND ENGLISH COLLECTIONS

The Chinese and English document collections presented in section 5.6 of the evaluation framework are used for the Chinese / English link discovery experiments. From Table 5-4 in section 5.6, it can be seen that articles in Wikipedia are not necessarily symmetrically cross-linked across languages because most links are hand inserted. Consequently, the number of English-to-Chinese cross-linked articles is not the same as the number of Chinese-to-English cross-linked articles. After merging the language links from both corpora, there are in total 200,825 unique Chinese / English cross links, which can be used to produce a document title mapping table, Tlang. Based on these statistics, the relative size of the two collections and the proportion that are cross-linked are shown in Figure 6-1.
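Merging the two directions of language links amounts to taking the union of two mappings. A toy sketch, with invented entries:

en_to_zh = {"Citibank": "花旗银行"}                            # links found in the English corpus
zh_to_en = {"椰子蟹": "Coconut crab", "花旗银行": "Citibank"}  # links found in the Chinese corpus

t_lang = dict(en_to_zh)
for zh, en in zh_to_en.items():
    t_lang.setdefault(en, zh)        # keep pairs only present on the Chinese side
print(sorted(t_lang.items()))
# [('Citibank', '花旗银行'), ('Coconut crab', '椰子蟹')]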

Figure 6-1: The size of Chinese and English Wikipedia

Given the relatively low level of cross-language links in the Chinese Wikipedia, it is reasonable to assume that many good cross-language links are missing from the collections.


6.1.2 A TWO-STEP APPROACH

A straightforward way of providing anchored cross-lingual links in Wikipedia can consist of two phases: 1) detecting prospective anchors in the source document; and 2) for each anchor, identifying relevant documents in the target language. Once the anchor is identified, a link, a→d, is created (where a is the anchor in the source document and d is the target document).

6.1.3 NATURAL LANGUAGE PROCESSING IN CHINESE-TO-ENGLISH LINK DISCOVERY

Chinese Wikipedia is a collaborative effort of contributors with different knowledge backgrounds from the different regions where Chinese is spoken. They cite modern and ancient sources, combining simplified and traditional Chinese text as well as regional variants. Therefore, in order to link Chinese documents to English documents while considering the linguistic complexity of the Chinese Wikipedia articles, it is necessary to break the Chinese into separate words (to segment the Chinese). Chinese segmentation breaks the long strings of characters into n-gram words. Presumably, this is a particularly critical step in Chinese-to-English cross-lingual link discovery because it affects not only the identification of the anchors but also the ability to translate an anchor into English. The error rate of the translation is dependent on the quality of the segmentation (Chang, Galley, & Manning, 2008).

6.1.4 CLLD REALISATION

Inspired by the mono-lingual link discovery approaches used at INEX, the mono-lingual English link discovery methods were adapted to the cross-lingual environment. The methods used for the cross-lingual link discovery experiments are the link mining method (Itakura & Clarke, 2008) and the page name matching method (Geva, 2008). With the same anchor identification strategy as the link mining method, a cross-lingual information retrieval approach is also attempted. The methods experimented with for realising bidirectional Chinese / English link discovery are thus as follows:

- Link Mining (LM)
- Page Name Matching (PNM)
- Cross-lingual Information Retrieval (CLIR)


The details of each method in realising Chinese / English link discovery are explained in the following sections.

6.2 CHINESE / ENGLISH LINK DISCOVERY METHODS

6.2.1 FINDING LINKS WITH LINK MINING

6.2.1.1 MONO-LINGUAL LINK PROBABILITY

Wikipedia contains a rich set of existing anchored links. These links contain pairs of specified anchor texts and the associated target documents. Hence, the existing link information can be used to recommend new links. Itakura & Clarke (2008) calculate the anchor weight (or link probability) using:

$$\gamma_{a \to d} = \frac{lf_{a \to d}}{df_a} \quad (6.1)$$

where the numerator is the link frequency, lf, of anchor a pointing to document d; and the denominator is the document frequency (df) of anchor a in the corpus. The computed γ score indicates the probability that a given phrase is an anchor linked to a specific target document. So with this method, when an anchor is identified, the target document is also determined. Mihalcea & Csomai (2007) and Milne & Witten (2008) also use a similar method to weight phrases.

With the computed link probabilities of anchor candidates, better links can be created for Wikipedia, or any document can be linked with Wikipedia. Generally, to link a document of the same language: first, all possible n-gram substrings in the source document are computed; next, for each n-gram its score is looked up; then, these anchor candidates are sorted on the score; last, an arbitrary number (based on a threshold, or alternatively a density) of highly ranked links is chosen. In the case of overlapping anchors, the longest anchor is chosen. All scores of existing anchored links can be pre-calculated and stored in a link table, Tlink. This table of mono-lingual anchor-to-target (a→d) pairs can be created by mining the existing link structure of Wikipedia.
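The anchor weight is a simple ratio over pre-mined corpus statistics. A minimal sketch, with invented counts taken loosely from Table 6-3:

link_freq = {("椰子蟹", 536691): 10, ("花旗银行", 53090): 42}  # (anchor, target id) -> lf
doc_freq = {"椰子蟹": 10, "花旗银行": 46}                      # anchor -> df in the corpus

def gamma(anchor, target):
    """Probability that the phrase `anchor` is a link to document `target` (eq. 6.1)."""
    df = doc_freq.get(anchor, 0)
    return link_freq.get((anchor, target), 0) / df if df else 0.0

def rank_anchors(ngrams):
    """Score candidate n-grams from a source document and sort them by gamma."""
    scored = [(gamma(a, d), a, d) for (a, d) in link_freq if a in ngrams]
    return sorted(scored, reverse=True)

print(rank_anchors({"椰子蟹", "花旗银行"}))  # 椰子蟹 scores 1.0, 花旗银行 about 0.91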


6.2.1.2 CROSS-LINGUAL LINK PROBABILITY

To make the link mining method work with cross-lingual linking, a bridge needs to be built between anchor candidates and prospective documents in another language. One way to use the link mining approach discussed previously is: first, mine in one language to create a list of candidate anchors; second, "translate" those anchors into the second language; then, with the translations, identify target documents using the mono-lingual link recommendation method. To build such a language bridge, a table of documents existing in both Chinese and English can be used. Such a table, Tlang, can be generated from the page-to-page language links present in Wikipedia. This is a form of cross-lingual document name triangulation in Wikipedia. For example, a Chinese page is a good target for an English anchor if there exists a link from the anchor to the English document and from the English document to the Chinese document. The relationship of the Chinese / English triangulation is illustrated in Figure 6-2 and Figure 6-3.

Figure 6-2: Cross-lingual triangulation (English-to-Chinese)


Figure 6-3: Cross-lingual triangulation (Chinese-to-English)

6.2.2 FINDING LINKS WITH PAGE NAME MATCHING

An alternative approach to cross-lingual link discovery is title matching (also known as name matching or entity matching). Geva (2008) first used the page name matching algorithm for mono-lingual link discovery. The core idea of the PNM algorithm is: first, a page title table (a list of the titles of all documents in Wikipedia) is built; second, for a new document a list of all possible n-gram substrings is created; then, from that list, the longest substrings that are also in the page title table are chosen as the anchors, and the targets are the documents with the given titles. To use this in cross-lingual link discovery, it is necessary to first construct a table of documents for a language pair (e.g. the English / Chinese pair). Then, for a new document in one language, all substrings that match document titles in the source language are identified as the anchors. The targets are the corresponding documents in the target language.

The name matching method is simple, but has been proven to be effective. It is particularly useful if no pre-existing links exist in the document collection (a scenario in which link mining cannot be used). If the link-graph contains no islands, each document contains only one incoming link, and all anchors are document titles, then the link mining link table covers all document titles and all weights are equal. That is, it is the page title table from the title matching algorithm. So title matching is a special case of link mining.
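A rough sketch of the cross-lingual variant follows, assuming Tlang is a dictionary from source-language titles to target-language documents; the entries and the greedy longest-first matching policy are illustrative.

t_lang = {"椰子蟹": "Coconut crab", "花旗银行": "Citibank"}   # invented extract of Tlang

def pnm_links(text, max_n=10):
    """Greedy longest-first matching of substrings against the title table."""
    links, i = [], 0
    while i < len(text):
        match = None
        for n in range(min(max_n, len(text) - i), 0, -1):  # try the longest first
            candidate = text[i:i + n]
            if candidate in t_lang:
                match = (candidate, t_lang[candidate])
                break
        if match:
            links.append(match)
            i += len(match[0])        # skip past the matched anchor
        else:
            i += 1
    return links

print(pnm_links("椰子蟹生活在海边"))   # [('椰子蟹', 'Coconut crab')]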

6.2.3 FINDING LINKS WITH CROSS-LINGUAL INFORMATION RETRIEVAL

The cross-lingual information retrieval approach to cross-lingual link discovery involves identifying anchors in one language, translating them into the target language, and then using them as search terms in a ranking search engine. The top ranked documents are chosen as targets for the anchors. To identify the anchors, the same anchor detection strategies discussed previously might be used; this includes anchor mining and document titles. Alternatively, a dictionary might be used. A flowchart of the above proposed CLLD methods is given in Figure 6-4.
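The pipeline can be expressed schematically as below. Both helper functions are stand-ins: translate() for a machine translation service such as Google Translate, and search() for a ranking search engine over the target Wikipedia corpus; their toy dictionaries are invented.

def translate(text):
    """Stand-in for a translation service."""
    return {"乌贼": "cuttlefish"}.get(text, text)

def search(query):
    """Stand-in for a BM25-style search engine; returns ranked document IDs."""
    index = {"cuttlefish": [20976520, 15165]}   # invented document IDs
    return index.get(query, [])

def clir_links(anchors, top_k=5):
    """Translate each anchor and take the top ranked documents as its targets."""
    return [(a, search(translate(a))[:top_k]) for a in anchors]

print(clir_links(["乌贼"]))   # [('乌贼', [20976520, 15165])]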

Figure 6-4: A Flowchart of Proposed CLLD Methods


6.2.4 COMPARISON OF CLLD METHODS

A comparison of the different link discovery approaches is presented in Table 6-1. The cross-lingual information retrieval approach can find links never seen before. The link mining method produces more accurate results. Page name matching is particularly helpful when there are no pre-existing links, but the links available for recommendation are very limited.

Table 6-1: Pros and Cons of Three Link Discovery Approaches

Method   Pros                                           Cons
LM       More accurate, less noisy                      Only finds links already in the corpus
PNM      Simple, effective                              Only finds links matched with the page title
CLIR     Finds links not seen elsewhere in the corpus   May be noisy

LM—Link Mining; PNM—Page Name Matching; CLIR—Cross-lingual Information Retrieval

6.3 IMPLEMENTATION OF CHINESE / ENGLISH LINK DISCOVERY

6.3.1 CROSS-LINGUAL LINK PROBABILITY

To realise Chinese / English link discovery, two kinds of tables are needed. The first is the table of anchor-to-target pairs from link mining (Tlink-chinese for Chinese and Tlink-english for English). The second, named Tlang, holds the corresponding document titles in both languages extracted from the language links. Due to different natural language processing needs, the implementations for Chinese-to-English and English-to-Chinese link discovery are discussed separately. Chinese-to-English implementations are numbered in the (C2E-#) pattern, and English-to-Chinese implementations in the (E2C-#) pattern.


6.3.1.1 CHINESE-TO-ENGLISH LINK PROBABILITY

For Chinese-to-English cross-lingual document linking, two additional issues, segmentation and translation, were also examined. Several different link discovery methods were proposed based on these tables, segmentation, translation, and link mining:

(C2E-1). Linking without machine translation and segmentation

First, a mono-lingual Chinese link table, Tlink-chinese, is built by mining the Chinese corpus for all Chinese anchor-target pairs. All rows for which there is no English document corresponding to the Chinese target are removed from this table. This information comes from Tlang.

Table 6-2: Extracts from Tlang

花旗银行      Citibank
椰子蟹        Coconut crab
奧黛莉·朵杜   Audrey Tautou
米高佐敦      Michael Jordan
迈克尔·乔丹   Michael Jordan

Several entries from Tlang are given in Table 6-2. From the table it can be seen that more than one page exists for some entries, for example for Michael Jordan; however, after redirection, both end on the same page. Then the link frequency and document frequency of all n-gram anchors in Tlink-chinese are computed. The document frequency of an n-gram anchor is the number of documents that contain the n-gram regardless of whether or not it is seen as an anchor. Several entries from Tlink-chinese are given in Table 6-3. Each document has a unique id, a link frequency, and a document frequency.


Finally, a candidate list of links is produced from this using the technique of Itakura & Clarke.

Table 6-3: Extracts from Tlink-chinese

Anchor         ID      lf    df
花旗银行       53090   42    46
德国邮政银行   167203  4     4
椰子蟹         536691  10    10
奧黛莉·朵杜    80578   9     11
英国           39793   5212  6866
英國           39793   7938  11134

(C2E-2). Linking with segmentation

This approach uses Implementation (C2E-1), linking without machine translation and segmentation, but first segments the source document using a Chinese word segmentation algorithm. This prevents anchors from crossing Chinese word boundaries. The NGMI segmentation approach discussed in section 3.4 is used.

(C2E-3). Linking with machine translation

In this approach two link mining tables are generated: one for Chinese, Tlink-chinese, and one for English, Tlink-english. The anchors are then translated into English using Google Translate24. This method is similar to Implementation (C2E-1), but the candidate target link of an anchor is not from Tlang, but from Tlink-english. Only those anchors that have translations in Tlink-english are used.

24 http://code.google.com/apis/language/translate/overview.html


The link probability for the Chinese anchors is taken from Tlink-english. That is, the English link probabilities are used for calculating γ scores for the Chinese anchor texts.

(C2E-4). Linking with both translation and segmentation

Implementation (C2E-3) is used; however, the text in the source document was segmented before anchor identification was performed.

(C2E-5). Link mining with machine translation (2)

Implementation (C2E-3) is used; however, anchors are sorted with the γ scores computed using the Chinese link probability table, Tlink-chinese.

6.3.1.2 ENGLISH-TO-CHINESE LINK PROBABILITY

As with Chinese-to-English link discovery, to implement English-to-Chinese document linking two tables are also needed. The first is the table of anchor-to-target pairs from link mining, Tlink-english. The second is the table of corresponding document titles in both languages extracted from the language links, Tlang. To generate the table Tlink-english, the link mining technique is utilised: first, the English Wikipedia is trawled and all anchor-target pairs are extracted; then the collection is re-trawled, looking for the frequency of the anchor phrases used either as a link or in plain text. Note that the same anchor text may be linked to different destinations in the different instances where it appears, so it is necessary to identify the most likely link. Three different implementations of English-to-Chinese document linking are given below.

(E2C-1). Linking with triangulation

First, the table Tlink-english is built; it is created using the English document collection obtained from the evaluation forum INEX 200925.

25 http://www.inex.otago.ac.nz/


All rows for which there is no Chinese document corresponding to the English target are removed from this table. This information comes from Tlang, which is the same table as used in Chinese-to-English link discovery. Then the link frequency and document frequency of all n-gram anchors in Tlink-english are computed. The document frequency of an n-gram anchor is the number of documents that contain that n-gram regardless of whether or not it is seen as an anchor. Several entries from Tlink-english are given in Table 6-4. Each document has a unique id, a link frequency, and a document frequency.

Table 6-4: Extracts from Tlink-english

Anchor          ID      lf   df
Citybank        231026  431  485
Enya            9482    287  420
Coconut Crab    810400  5    5
Audrey Tautou   342753  69   90
Wombat          33864   35   35

Finally, a candidate list of links is produced from this table.

(E2C-2). Linking with machine translation

In this implementation two link mining tables, Tlink-chinese and Tlink-english, are used. English anchors are translated into Chinese using Google Translate. This method is similar to Implementation (E2C-1), but the candidate target link of an anchor is not from Tlang, but from Tlink-chinese. Only those anchors that have translations in Tlink-chinese are used. The link probability for the English anchors is taken from Tlink-chinese. That is, the Chinese link probabilities are used for calculating γ scores for the English anchor texts.


(E2C-3). Linking with translation (2)

Implementation (E2C-2) is used; however, anchors are sorted with the γ scores computed using the English link probability table, Tlink-english.

6.3.2 CROSS-LINGUAL NAME MATCHING

Thanks to the simplicity of the page name matching method, it is easy to adapt it to discover documents in different languages. With this method only the cross-lingual page name mapping table, Tlang, is needed, and it can be further expanded by including the mappings of other languages. For the submissions to the NTCIR-9 Crosslink task, table Tlang was extended to include Japanese and Korean document titles. Table 6-5 shows extracts of the page title mapping for the CJK and English Wikipedia documents.

Table 6-5: Extracts from Tlang with Title Mapping of all CJK Languages

English            Chinese              Japanese                          Korean
Citibank           花旗银行             シティバンク、エヌ・エイ          한국씨티은행
Coconut            椰子                 ココナッツ                        코코넛
Scent of a Woman   女人香 (1992年電影)  セント・オブ・ウーマン/夢の香り   여인의 향기
Michael Jordan     米高佐敦             マイケル・ジョーダン              마이클 조던
Enya               恩雅                 エンヤ                            엔야

For both Chinese-to-English and English-to-CJK link discovery, the candidate list of links is produced by searching for n-grams in the source document that are also in the source language column of Tlang, then linking to the corresponding target language document. However, if the number of cross-linked pages in Wikipedia is low, then this method will have low recall.

Chinese-to-English Link Discovery

(C2E-6). Name matching without translation or segmentation

This implementation of Chinese-to-English link discovery using the page name matching method involves neither translation nor Chinese segmentation.


English-to-Chinese Link Discovery

(E2C-4). Name matching for English anchors

This implementation of English-to-Chinese link discovery using the page name matching method requires no translation of the candidate anchors. Additionally, two more runs for NTCIR-9 were produced using this method for the English-to-Japanese and English-to-Korean subtasks.

6.3.3 CROSS-LINGUAL INFORMATION RETRIEVAL

Anchors are identified with the link mining approach, using either the Tlink-chinese or the Tlink-english link table. The anchors are translated into the target language using Google Translate. Then an information retrieval system is used to identify candidate target documents in the target Wikipedia corpus. The settings of the cross-lingual information retrieval system are the same as discussed in section 4.4.2.
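A minimal sketch of this pipeline, with `translate` and `search_engine` as placeholders for the machine translation service and the retrieval system configured as in section 4.4.2 (neither is a real API binding):

```python
def clir_link_discovery(anchors, translate, search_engine, top_k=5):
    """Sketch of the CLIR method: each identified anchor is translated
    into the target language and submitted as a query to a search
    engine over the target-language Wikipedia corpus."""
    links = {}
    for anchor in anchors:
        query = translate(anchor)
        # the top-ranked target documents become candidate links;
        # CLLD allows several targets per anchor, so keep the top k
        links[anchor] = search_engine(query)[:top_k]
    return links
```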

Chinese-to-English Link Discovery

(C2E-7). Cross-lingual information retrieval for Chinese anchors

Anchor candidates are found by searching the table Tlink-chinese.

(C2E-8). Translation, segmentation, and information retrieval

This implementation is the same as Implementation (C2E-7) except that the text is segmented before anchors are identified.

English-to-Chinese Link Discovery

(E2C-5). Cross-lingual information retrieval for English anchors

This is a plain implementation of the cross-lingual information retrieval method for English-to-Chinese link discovery without any extra natural language processing steps.

6.4 EXPERIMENTAL RUNS

Two sets of Chinese / English experimental runs were generated with the implementations of the CLLD methods discussed in section 6.3.


6.4.1 CHINESE-TO-ENGLISH RUNS

For the Chinese-to-English link discovery experiments, eight runs were generated, plus one additional run: a search-engine-based run that searched only document titles rather than the full text. These runs are outlined in Table 6-6.

Table 6-6: Information of Chinese-to-English Runs

Run Name          Description
LinkProb          Implementation (C2E-1) (anchor weight)
PNM               Implementation (C2E-6) (name matching)
LinkProb_S        Implementation (C2E-2) (Implementation (C2E-1) with segmentation)
LinkProbEn        Implementation (C2E-3) (translation)
LinkProbIR        Implementation (C2E-7) (IR, full text)
LinkProbEn_S      Implementation (C2E-4) (Implementation (C2E-3) with segmentation)
LinkProbIR_S      Implementation (C2E-8) (Implementation (C2E-7) with segmentation)
LinkProbEn2       Implementation (C2E-5) (ranking with Tlink-chinese)
LinkProbIR_Title  Implementation (C2E-7), title only

Figure 6-5 shows a snippet of the Chinese article on the Hong Kong ten-dollar note from run LinkProb (Implementation (C2E-1)). The anchors and the titles of the linked English documents are shown. It is clear that the algorithms do identify topically relevant target documents and appropriate anchors.


Figure 6-5: An Example of Chinese-to-English Document Linking

Of particular note is that none of the algorithms specifically handles Chinese language variants. The link mining algorithms do, however, learn which terms link to which target documents; that is, the management of language variants is implicit in the algorithms. As an example, run LinkProb identified both 英国 (simplified Chinese) and 英國 (traditional Chinese) as suitable links to “United Kingdom”.

6.4.2 ENGLISH-TO-CHINESE RUNS

For English-to-Chinese link discovery experiments, five runs were prepared and submitted to the NTCIR-9 Crosslink Task. The runs’ information and their system descriptions are given in Table 6-7.


Table 6-7: Information of English-to-Chinese Runs

Run ID               Description
QUT_PNM_ZH           A run using the PNM algorithm given in section 6.2.2; the cross-lingual title-to-target table is generated from the NTCIR-9 Crosslink Chinese Wikipedia corpus.
QUT_LinkProbIR_ZH    Uses the anchors recommended by link probability, and retrieves relevant links using a search engine with the anchors as query terms.
QUT_LinkProbZh2_ZH   Same as QUT_LinkProbZh_ZH, except that anchors are sorted on the Chinese link probability table.
QUT_LinkProbZh_ZH    Uses two link probability tables (one Chinese; one English, mined from the INEX English Wikipedia corpus), connected by translation; anchors are sorted on the English link probability table.
QUT_LinkProb_ZH      Uses link probability for anchor sorting and link recommendation.

6.5 RESULTS AND DISCUSSION

The results of the experiments are discussed separately for Chinese-to-English and English-to-Chinese link discovery. Precision-at-N (P@N), R-Prec, and Link Mean Average Precision (LMAP) are the main metrics used to quantify the performance of the different methods.

6.5.1 CHINESE-TO-ENGLISH LINK DISCOVERY

The LMAP, R-Prec, and P@N scores for the different methods are given in Table 6-8. Runs are scored on the extracted Wikipedia ground-truth and sorted on LMAP. Precision and recall curves are given in Figure 6-6. Overall, all runs except PNM use the same anchor identification strategy (that of Itakura & Clarke), so the difference in the performance of those runs can be attributed to the segmentation, translation, and lookup (IR) methods. The best performing run, LinkProb, has the best combination of strategies (and not a different method of choosing anchors).


Table 6-8: Performance of Chinese-to-English Experimental Runs

Run ID            LMAP   R-Prec  P@5    P@10   P@20   P@30   P@50   P@250
LinkProb          0.168  0.250   0.800  0.694  0.546  0.471  0.386  0.138
PNM               0.123  0.204   0.667  0.567  0.499  0.426  0.351  0.113
LinkProbEn2       0.095  0.195   0.456  0.428  0.338  0.300  0.247  0.117
LinkProbEn        0.085  0.154   0.489  0.394  0.315  0.263  0.211  0.103
LinkProb_S        0.059  0.141   0.411  0.322  0.268  0.234  0.201  0.080
LinkProbEn_S      0.033  0.101   0.233  0.186  0.144  0.133  0.118  0.066
LinkProbIR_Title  0.029  0.101   0.100  0.081  0.096  0.105  0.107  0.071
LinkProbIR        0.028  0.100   0.122  0.100  0.121  0.111  0.112  0.065
LinkProbIR_S      0.010  0.056   0.061  0.081  0.064  0.070  0.066  0.032

Figure 6-6: The Interpolated Precision-Recall Curves for the Different Methods


Segmentation

In all cases the non-segmented runs outperformed their segmented variants. Contrary to intuition, segmentation interferes with anchor identification. This reflects both the imperfect performance of any segmentation algorithm and the fact that the links themselves are unlikely to be ambiguous in context (because they are named entities).

Translation

The five runs that used translation performed worse than LinkProb and PNM. Run LinkProbEn2 and run LinkProb identified the same set of initial candidate Chinese anchors and used the same link ranking strategy. LinkProbEn2, however, performs worse than LinkProb. This suggests that the performance deteriorates as a consequence of the translation process. A failure analysis of the runs suggests that the problem is caused by translation error. Table 6-9 lists some of the anchor candidates (column 1) that were incorrectly machine translated (column 2) and the preferred target document seen through link mining (column 3).

Table 6-9: Example Translation Errors in the Runs

Anchor     MT                  Wiki
兵形       Bing-shaped         Pawn structure
黑田清子   Black 田清子        Sayako Kuroda
資治通鑑   Mirror              Zizhi Tongjian
社稷       Boat                Soil and grain
白骨精     White-Boned Demon   Bai Gu Jing
手枪骑兵   Pistol cavalry      Reiter

The failure in translation is similar to that caused by segmentation. Without perfect knowledge of all entities the translation software cannot produce perfect results, and such knowledge cannot be expected because the entity list cannot be closed. The incorrect translations seen in Table 6-9 are not seen in runs LinkProb and PNM because in those runs the translation is done using the language links in Wikipedia (through triangulation). Recall that the triangulation used to generate the runs is quite different from the triangulation used in the assessments. In the former case candidate links are mined from the Chinese Wikipedia and the English target documents are then found from the Chinese target documents; here the mining occurred after the topic was removed from the collection. In the latter case the candidate document (topic) is found in both the Chinese and English Wikipedia and links are followed and triangulated, or alternatively the English version is found and links are extracted. The result suggests that the mined mapping table Tlang, used in runs LinkProb and PNM, is a better translation table than classical machine translation. This is hardly surprising as it is domain specific and an entity list (rather than phrasal text). An alternative, not tested here, is a combination of the two approaches: using machine translation only when an entity cannot be translated through triangulation.
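The run-generation triangulation amounts to a two-step table lookup, sketched below. The sketch is illustrative only: `t_link_chinese` and `t_lang` are simplified dict stand-ins for the mined tables.

```python
def triangulate_c2e(chinese_anchor, t_link_chinese, t_lang):
    """Sketch of the run-generation triangulation: a Chinese anchor is
    first resolved to a Chinese target document via link mining, then
    the English counterpart of that document is found through the
    cross-lingual title mapping table.

    t_link_chinese: dict {chinese_anchor: chinese_target_title}
    t_lang: dict {chinese_title: english_title}  (simplified Tlang)
    """
    zh_target = t_link_chinese.get(chinese_anchor)
    if zh_target is None:
        return None                   # anchor never seen as a link
    return t_lang.get(zh_target)      # None if no English counterpart
```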

Chinese-to-English Document Linking

It can be seen from both Table 6-8 and Figure 6-6 that run LinkProb performed best when scored using LMAP, R-Prec, and P@N. Given that the number of candidate links in Tlang, used by the cross-lingual page name matching algorithm, is much smaller than in Tlink-chinese, used by the link mining method, the good performance of PNM is surprising but encouraging. The cross-lingual information retrieval approaches (LinkProbIR, LinkProbIR_S and LinkProbIR_Title) have the lowest performance scores on all metrics. This is because the search engine is good at identifying documents, not entities (document titles). The performance of the title-only run, LinkProbIR_Title, is higher than that of the other two IR runs, but remains lower than the other approaches because many anchors are exact phrases but the search engine treated them as bags of words.


Run LinkProbEn2 ranked third, performing better than LinkProbEn. The difference between the two runs was the source of the link probability score: in the former the probability came from the Chinese language corpus, in the latter from the English corpus. This suggests that Chinese is a better predictor than English of which English documents to link to. Through a statistical analysis (t-test) on the Average Precision, P@5, and P@10 of all topics for runs LinkProb and PNM, p-values of 0.00004, 0.003, and 0.0007 respectively were obtained. The t-test indicates that run LinkProb, using Implementation (C2E-1), significantly outperforms the second best run, PNM. So Implementation (C2E-1) was the best algorithm tested for Chinese-to-English cross-lingual link discovery. As these experiments are the first to address the Chinese-to-English document linking problem, the LMAP, R-Prec, and P@N scores of run LinkProb are the best results to date.
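This per-topic significance test can be reproduced along the following lines. The sketch assumes a paired t-test over per-topic scores (the variant of t-test is not specified above) and uses SciPy.

```python
from scipy import stats

def significance_test(scores_run_a, scores_run_b):
    """Paired t-test over per-topic scores of two runs on the same
    topics; applicable to Average Precision, P@5, or P@10 alike.

    scores_run_a, scores_run_b: equal-length lists of per-topic scores.
    """
    t_statistic, p_value = stats.ttest_rel(scores_run_a, scores_run_b)
    return t_statistic, p_value
```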

6.5.2 ENGLISH-TO-CHINESE LINK DISCOVERY AT NTCIR-9

The team name for the experimental runs submitted to the Crosslink task of NTCIR-9 was “QUT”. The scores of the seven different runs, computed using the evaluation tool with the official qrels, are given in Table 6-10 and Table 6-11. In Table 6-10 runs are sorted on LMAP, and in Table 6-11 runs are sorted on P@5. The scores in both tables are separated into two groups (file-to-file and anchor-to-file evaluations). Precision and recall curves are given in Figure 6-7.

6.5.2.1 EVALUATION OF LINK MINING RUNS

The rankings of the English-to-Chinese runs in the two types of evaluation (F2F and A2F) are different. Even so, it can be seen from Table 6-10, Table 6-11 and Figure 6-7 that run QUT_LinkProb_ZH performed the best in all evaluations, indicating that this run has the best combination of anchor ranking, translation and link recommendation strategies. For the other runs, the Wikipedia ground-truth evaluation prefers the cross-lingual page name matching method for automatic link discovery, but the evaluation with the manual assessment results finds that the link mining methods, using either a Chinese or an English source as link predictor, can contribute more relevant links.


Table 6-10: LMAP and R-Prec Scores of English-to-Chinese Runs in both F2F and A2F Evaluations

Run ID           LMAP   R-Prec
metric scores computed with qrel from Wikipedia ground-truth (F2F)
LinkProb_ZH      0.179  0.244
PNM_KO           0.122  0.208
PNM_ZH           0.088  0.166
PNM_JA           0.076  0.143
LinkProbZh2_ZH   0.069  0.154
LinkProbZh_ZH    0.059  0.148
LinkProbIR_ZH    0.023  0.067
metric scores computed with qrel from manual assessment (A2F)
LinkProb_ZH      0.115  0.133
LinkProbZh_ZH    0.094  0.119
LinkProbZh2_ZH   0.090  0.117
PNM_JA           0.087  0.016
PNM_KO           0.043  0.043
PNM_ZH           0.030  0.033
LinkProbIR_ZH    0.008  0.026

Table 6-11: P@N Scores of English-to-Chinese Runs in both F2F and A2F Evaluations

Run ID           P@5    P@10   P@20   P@30   P@50   P@250
metric scores computed with qrel from Wikipedia ground-truth (F2F)
LinkProb_ZH      0.776  0.588  0.480  0.404  0.319  0.132
PNM_JA           0.624  0.504  0.394  0.333  0.262  0.079
PNM_ZH           0.592  0.472  0.362  0.307  0.242  0.064
PNM_KO           0.552  0.460  0.384  0.321  0.244  0.062
LinkProbZh2_ZH   0.360  0.284  0.248  0.221  0.187  0.082
LinkProbZh_ZH    0.304  0.208  0.168  0.161  0.156  0.082
LinkProbIR_ZH    0.184  0.160  0.118  0.109  0.084  0.044
metric scores computed with qrel from manual assessment (A2F)
LinkProb_ZH      0.336  0.308  0.294  0.288  0.277  0.172
LinkProbZh_ZH    0.320  0.244  0.260  0.273  0.269  0.158
LinkProbZh2_ZH   0.312  0.312  0.304  0.299  0.271  0.155
PNM_ZH           0.208  0.204  0.214  0.220  0.187  0.045
PNM_KO           0.136  0.200  0.220  0.217  0.193  0.047
PNM_JA           0.128  0.124  0.108  0.096  0.077  0.020
LinkProbIR_ZH    0.104  0.104  0.072  0.073  0.070  0.033


Figure 6-7: The Interpolated Precision-Recall Curves of Runs (plot a: the F2F evaluation using Wikipedia ground-truth; plot b: the A2F evaluation using manual assessment results)


There is no obvious performance difference between runs QUT_LinkProbZh2_ZH and QUT_LinkProbZh_ZH in either the F2F or the A2F evaluation. The different rankings of these two runs in the two evaluations suggest that either the English or the Chinese corpus can serve as a good source of link prediction. However, the relatively low ranking of these two runs indicates that the machine translation adopted to connect the identified anchors with the cross-lingual target documents results in worse link finding performance than that of the best run, QUT_LinkProb_ZH, which translates using Wikipedia cross-lingual page name triangulation.

6.5.2.2 EVALUATION OF PAGE NAME MATCHING RUNS

Given the limited number of page-to-page cross-lingual links existing in Wikipedia, and the resulting relatively small size of the Tlang table used by the cross-lingual page name matching algorithm, the reasonable performance of all PNM runs (PNM_ZH, PNM_JA and PNM_KO) across all three language subtasks is surprising but encouraging.

6.5.2.3 EVALUATION OF CLIR RUNS

The cross-lingual information retrieval approach (run QUT_LinkProbIR_ZH) has the lowest performance scores on all metrics in both evaluations. This is because the search engine is good at identifying relevant documents, not entities (document titles). However, it is interesting that run QUT_LinkProbIR_ZH, even with the worst performance, contributed the highest number of unique relevant documents in the English-to-Chinese subtask when evaluated with the qrels from manual assessment, according to the official assessment results of the NTCIR-9 Crosslink task. This result is encouraging but not surprising: the cross-lingual information retrieval approach may not accurately locate the exact target link for a suggested anchor, but it gives other interesting and relevant links an opportunity to be seen by information seekers.

6.5.2.4 COMPARISON WITH OTHER TEAMS

The performance scores of all participating teams on the three system evaluation metrics (LMAP, R-Prec and Precision-at-N) can be obtained from the official evaluation results26 of the NTCIR-9 Crosslink task. The evaluation results show that the rankings of all submitted runs when sorted on LMAP and when sorted on R-Prec are very similar, so the comparison of the QUT runs with the others herein uses only the LMAP and Precision-at-5 scores. As expected, QUT’s best run in all evaluations is QUT_LinkProb_ZH. The LMAP scores and rankings of run QUT_LinkProb_ZH are given in Table 6-12, together with the LMAP scores of the best run in each evaluation. It should be noted that the best run in different evaluations (file-to-file or anchor-to-file) with different qrels (from automatic or manual assessment) can be different.

Table 6-12: LMAP Scores and Rankings of Run QUT_LinkProb_ZH in Evaluations for English-to-Chinese Link Discovery

                  Automatic F2F      Manual F2F         Manual A2F
Run               LMAP    Ranking    LMAP    Ranking    LMAP    Ranking
Best run          0.373   1st        0.308   1st        0.157   1st
QUT_LinkProb_ZH   0.179   13th       0.202   10th       0.115   4th

Table 6-12 shows that run QUT_LinkProb_ZH did not score very well in the file-to-file evaluation with either the Wikipedia ground-truth or the manual assessment results, but achieved fourth position in the anchor-to-file evaluation, in which the relevance of anchors is taken into consideration. Even with relatively poor performance in the evaluations using the LMAP metric, the QUT runs showed encouraging results in early link prediction when measured with the Precision-at-5 metric. Detailed discussion of the QUT system performance measured with Precision-at-5 follows.

26 http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/Evaluations/CrossLink/ntc9CROSSLINK-eval.html


File-to-File Evaluation with Wikipedia Ground-Truth

Table 6-13 lists the top six runs sorted on Precision-at-5 scores in the file-to-file evaluation with Wikipedia ground-truth. Ranked by team, QUT is second, and run QUT_LinkProb_ZH is ranked fourth, only 0.056 below the best run, HITS_E2C_A2F_01.

Table 6-13: The Top Six Runs in F2F Evaluation with Wikipedia Ground-Truth (Measured with Precision-at-5)

Run-ID               P@5
HITS_E2C_A2F_01      0.832
HITS_E2C_A2F_02      0.808
HITS_E2C_A2F_03      0.784
QUT_LinkProb_ZH      0.776
KMI_SVM_ESA          0.728
KMI_SVM_ESA_TERMDB   0.712

File-to-File Evaluation with Manual Assessment Results

Table 6-14 shows the top six runs sorted on Precision-at-5 scores in the file-to-file evaluation with manual assessment results. In this evaluation run QUT_LinkProb_ZH rises to first place.

Table 6-14: The Top Six Runs in F2F Evaluation with Manual Assessment Results (Measured with Precision-at-5)

Run-ID            P@5
QUT_LinkProb_ZH   0.808
HITS_E2C_A2F_03   0.752
HITS_E2C_A2F_02   0.752
HITS_E2C_A2F_01   0.752
KMI_SVM_TERMDB    0.752
WUST_A2F_E2C_03   0.744


Anchor-to-File Evaluation with Manual Assessment Results

Table 6-15 shows the top six runs sorted on Precision-at-5 scores in the anchor-to-file evaluation with manual assessment results. Ranked by team, QUT is second, and run QUT_LinkProb_ZH is ranked fourth among all runs.

Table 6-15: The Top Six Runs in A2F Evaluation with Manual Assessment Results (Measured with Precision-at-5)

Run-ID               P@5
KMI_SVM_TERMDB       0.376
KMI_SVM_ESA_TERMDB   0.368
KMI_SVM_ESA          0.360
QUT_LinkProb_ZH      0.336
QUT_LinkProbZh_ZH    0.320
QUT_LinkProbZh2_ZH   0.312

Overall, the QUT English-to-Chinese runs have mid-range performance compared with the other good runs submitted to the task. However, the relatively high scores of run QUT_LinkProb_ZH when measured with the Precision-at-5 metric suggest that the QUT CLLD system with the link probability method is particularly effective at early link precision.

Unique Relevant Links Evaluation

The QUT runs contributed the largest number of unique relevant links, links that users might consider worth further reading. A comparison of the numbers of unique links found by the top three teams is given in Table 6-16. With the automatic assessment using Wikipedia ground-truth, the QUT runs found 95 unique links, only two fewer than team UKP. Of particular note is that with the manual assessment QUT discovered 1103 unique links, approximately 79% of the total unique links found by all teams, and close to eight times the number contributed by the second-ranked team, KMI. The numbers of unique links discovered by the QUT runs in the automatic and manual assessment link sets are given in Table 6-17 and Table 6-18 respectively. From Table 6-17 it can be seen that all runs except the page name matching run contribute a good number of unique links when counted against the relevant link set from automatic assessment. Table 6-18 shows that run QUT_LinkProbIR_ZH, generated by the cross-lingual information retrieval approach, found the largest number of unique links when counted against the relevant link set from manual assessment. This indicates that the cross-lingual information retrieval system used in the CLLD experiments is capable of providing high quality link retrieval. The lowest LMAP, R-Prec, and P@N scores also suggest that both the identified anchors and the recommended links in run QUT_LinkProbIR_ZH were poorly ranked, even though a large number of good links were discovered. Therefore, it is necessary to adjust the strategies for anchor weighting and noise link removal to achieve better performance with the CLIR method.

Table 6-16: A Comparison of the Number of Unique Links Found in En-2-Zh

Evaluation Type   Total Unique Links Found   1st          2nd        3rd
Automatic         245                        UKP (97)     QUT (95)   KMI (27)
Manual            1397                       QUT (1103)   KMI (152)  UKP (88)

Table 6-17: Number of Unique Relevant Links Found by Each QUT Run in the Automatic Assessment Link Set

Run-ID               Unique Link #
QUT_LinkProb_ZH      48
QUT_LinkProbIR_ZH    35
QUT_LinkProbZh2_ZH   32
QUT_LinkProbZh_ZH    29
QUT_PNM_ZH           2
Total                95

Table 6-18: Number of Unique Relevant Links Found by Each QUT Run in the Manual Assessment Link Set

Run-ID               Unique Link #
QUT_LinkProbIR_ZH    562
QUT_LinkProb_ZH      526
QUT_LinkProbZh_ZH    261
QUT_LinkProbZh2_ZH   256
QUT_PNM_ZH           48
Total                1103

6.6 SUMMARY

In this chapter, several methods for automatically identifying anchors in Chinese / English documents and targeting documents in a different language were proposed and tested. These methods include link mining, page name matching, and cross-lingual information retrieval. For Chinese-to-English link discovery, the realisation methods also cover testing the effect of word segmentation on Chinese-to-English document linking. In the Chinese-to-English link discovery experiments, the best algorithm used link mining to compute a list of anchor / target probabilities, and it used neither Chinese language segmentation nor machine translation. Although Chinese segmentation and machine translation are two essential steps in Chinese to other-language information retrieval, the experimental results suggest that they are not absolutely necessary for link discovery. This is because segmentation is implicit in anchor mining and translation is implicit in cross-language triangulation. In the English-to-Chinese link discovery experiments, the evaluation results show that the link mining method with Wikipedia cross-lingual document name triangulation (run QUT_LinkProb_ZH) also performed the best among all implementations, and achieved encouraging results in the overall evaluations of the Crosslink task. This method requires pre-mining of the existing link structure of the English Wikipedia.


In order to compute a list of English anchor / target probabilities, an additional English Wikipedia corpus from INEX 201027 was employed for link mining. Also, the official results of the NTCIR-9 Crosslink task show that the English-to-Chinese link discovery submissions presented in this chapter contributed the most unique links for the test topics in the evaluation using manual assessment results. In future the performance of the QUT system could be further improved if the links from different implementations were properly combined and re-ranked with a better anchor weighting strategy.

27 http://www.inex.otago.ac.nz


Chapter 7: The Effectiveness of Cross-lingual Link Discovery

This chapter discusses the effectiveness of cross-lingual link discovery in assisting easy cross-lingual information access in Wikipedia, and emphasises the importance of manual evaluation in quantifying the performance of CLLD systems. Section 7.1 outlines the background of effective cross-lingual link discovery. Section 7.2 describes the assessment difficulties encountered when performing evaluation for the CLLD task. Section 7.3 presents the official evaluation results of the NTCIR-9 Crosslink task. Section 7.4 discusses the effectiveness of cross-lingual link discovery. The chapter is summarised in section 7.5.


7.1 INTRODUCTION

An anchor is a snippet of text that is relevant to the topic of the article and should be linked to a related article so that the reader can gather further information (or receive an explanation). Wikipedia anchors are often manually chosen and can target only one destination page. But there could be four types of links in Wikipedia:

• Mono-lingual article-to-article (see also) links;
• Mono-lingual anchor-to-article links;
• Cross-lingual article-to-article (language) links;
• Cross-lingual anchor-to-article links.

Wikipedia links are usually monolingual; the target page is in the same language as the source page and the anchor. Although article-to-article cross-lingual links are not uncommon (they are listed on the left hand side of Wikipedia web pages as “languages” links), cross-lingual links from anchor to destination are rare. HTML supports linking independent of language (indeed, it knows nothing about the language (or otherwise) of the target), but there are two fundamental problems that have inhibited the evaluation of cross-language linking. First, considerably greater effort is required to find link targets in a second (or subsequent) language (recall that most links are inserted manually by humans). Second, none of the dominant web browsers directly supports multiple links per anchor (although they can be programmed to do so), so each anchor is linked to a single target, typically in the source language, and there is no native way to add alternative (language) targets to the same anchor; monolingual linking is preferred on the assumption that the reader would prefer to stay in a single language. Currently, adding multiple language targets to a single anchor requires human support to find the links and browser support to display them, neither of which exists. Looking beyond these existing limitations, it is clear that multiple targets per link are beneficial; this can be seen on many social media websites such as Facebook28,

28 http://www.facebook.com/


where a user clicks an icon and is presented with a pop-up menu. It is also clear that cross-lingual hypertext linking is beneficial for (at the very least) language versions of Wikipedia that are sparse in coverage: although there are currently 3.8 million English Wikipedia articles, there are only 6,859 in Māori29. It is unreasonable (even unethical) to restrict access to knowledge simply on a lingual basis. The study of anchored cross-lingual linking with multiple targets (one-to-many linking) is an essential addition to Wikipedia. In Chapter 6, a number of approaches to realising English / Chinese cross-lingual link discovery were proposed and experimented with. Also, at the NTCIR-9 Crosslink task, submissions from different CLLD systems were received and pooled for manual assessment. Together with the Wikipedia ground-truth, the performances of the various CLLD systems were then evaluated. However, the importance of manual assessment in effective CLLD evaluation must be stressed.

7.2 MANUAL ASSESSMENT

7.2.1 LINK POOLING FOR ASSESSMENT

In total 57 runs from 11 teams were submitted to the NTCIR-9 Crosslink task. General statistics of the runs, broken down by language, are given in Table 7-1. The first column shows the task, the second the number of submitted runs, and the third the average number of links per topic. For example, in the English-to-Chinese task (En-2-Zh) there were 25 runs averaging 2969 links per topic. The other tasks were English-to-Japanese (En-2-Ja) and English-to-Korean (En-2-Ko).

Table 7-1: Average Number of Links in Pooling

Task      Runs   Average links per topic
En-2-Zh   25     2969
En-2-Ja   11     666
En-2-Ko   21     924

29 http://mi.wikipedia.org/wiki/Hau_K%C4%81inga


All the prospective anchors and their corresponding targets were pooled. Unfortunately, anchors in some submitted runs did not pass the anchor validity check; they were subsequently discarded from the pool. The pool was assessed to completion (there were no valid but unassessed links in any run).

7.2.2 HUMAN ASSESSORS

In the manual assessment stage, recruiting enough volunteers with different language backgrounds is very challenging. Conveniently, there are many students with differing backgrounds at Queensland University of Technology (QUT), many possessing (at least) bilingual skills. Students make good assessors for two reasons: 1) they are highly educated and so can reasonably be expected to understand Wikipedia articles; and 2) they possess good language skills in their native language and in English, and so can reasonably be expected to read articles in multiple languages. All assessors were compensated with movie tickets. Summary information on the assessors is given in Table 7-2: column 1 lists the task; column 2 the number of assessors; and column 3 their level of education. For example, the one assessor for the English-to-Japanese task was a post-doctoral scholar.

Table 7-2: Assessors Information

Task      Assessors   Description
En-2-Zh   15          PhD students, some undergraduates
En-2-Ja   1           Postdoc
En-2-Ko   5           Undergraduate students

It was initially thought that finding assessors for the English-to-Chinese task would be easy because of the relatively large proportion of Chinese students at QUT. However, assessing all English-to-Chinese links was a challenge because:

• the English-to-Chinese task saw the largest average number of links per topic, requiring more time to assess (approximately three hours per topic);
• students considered themselves too busy to help;
• motivation was low as compensation was low.


Due to the shortage of assessors, three of the original topics were never assessed. Asking for runs on more topics than are finally assessed is a normal part of the INEX paradigm. Finding assessors for the Japanese topics was also difficult, due to the lack of available English-Japanese bilingual speakers. An organiser of the NTCIR-9 Crosslink task was left as the only assessor after the two additional volunteers initially identified eventually dropped out. Korean assessors were readily available in the form of undergraduate students. They were given instructions and assessed topics under the supervision of at least one task organiser. It took an assessor approximately one hour to assess one thousand links to completion. Each topic was assessed by a single assessor, but assessors could complete multiple topic assessments. During assessment the assessor could mark either the anchor or the target as relevant or non-relevant. If an anchor was assessed as non-relevant then that anchor’s target articles were assessed as non-relevant.

7.2.3 OVERLAPPING ANCHORS

Because different systems use different methods for anchor identification, pooled anchors might overlap. There is no hard specification with respect to the relevance of overlapping anchors. All overlapping anchors are still judged one by one, and it is up to the assessor(s) to decide whether an overlapping anchor, or all of the overlapping anchors, are relevant. The decision of whether or not to present overlapping anchors in articles is left to the applications that realise cross-lingual link discovery in a knowledge base, according to users’ own preferences; this is not the focus of this thesis.

7.2.4 ASSESSMENT TOOL

As discussed in section 5.9.2.2, a tool with a friendly GUI, included in the evaluation framework, was developed for the manual assessment of pooled links. It is shown in Figure 7-1 with an English-to-Chinese link. In the left pane the English source document (the assessment topic) is shown with its highlighted anchors; in the right pane the Chinese target document is shown. The assessor clicked either the left or the right mouse button to mark the link as relevant or non-relevant.

Figure 7-1: The NTCIR-9 Crosslink Manual Assessment Tool

With the tool, assessors inspected each anchor and its corresponding links, accepting or rejecting each. This method of assessment is not dissimilar to the assessment approaches used in CLIR evaluations, and is similar to that used at INEX.

7.2.5 THE WIKIPEDIA GROUND-TRUTH RUN

In the INEX (mono-lingual) Link-the-Wiki track, links in the source (topic) article were added to the assessment pool and manually assessed (Huang, et al., 2010). Due to the way the automatic assessments were generated in the NTCIR-9 Crosslink task, it was not possible to do this in these experiments. For example, manually assessing targets extracted through the triangulation methods outlined in section 5.9.1 would require assessment without anchors – something of questionable utility since anchor selection is part and parcel of CLLD.


7.2.6 LINKS FOUND IN MANUAL ASSESSMENT

A summary of the assessments is presented in Table 7-3. The first column lists the assessment type, the second column the number of relevant links, and the third the overlap between the manual and automatic sets. For example, there were 1,681 links found through automatic triangulation of the 25 topics in the English-to-Korean Wikipedia, but 2,786 relevant links in the pool; the same pattern can be seen for English-to-Chinese. In the English-to-Japanese assessments the number of relevant links in the manual set is smaller than the number automatically identified; this is most likely because the average number of links per topic was smaller, due to the smaller pool size and the smaller number of submitted runs than in the other tasks.

Table 7-3: Links in the Result Sets of the Two Different Assessments

Assessment set      Relevant links   Overlapping
En-2-Zh automatic   2116             1134
En-2-Zh manual      4309
En-2-Ja automatic   2939             781
En-2-Ja manual      1118
En-2-Ko automatic   1681             821
En-2-Ko manual      2786

7.2.7 THE VALIDITY OF AUTOMATIC ASSESSMENT

As two types of evaluation methods were used in the NTCIR-9 Crosslink task, questions may arise as to which type of assessment delivers more reliable evaluation results. From Table 7-3 above, it can be seen that many links from the Wikipedia ground-truth and the manual assessment overlap and the qrels share a number of the same links, which means that the two kinds of assessment correlate to a certain degree as to which target documents should be linked.

Figure 7-2: Interpolated Precision-Recall Graph Showing F2F Evaluation of Wikipedia Ground-Truth Runs on Their Own Manual Assessments

So the remaining question is: what would the human assessors think of the links from the Wikipedia ground-truth? Figure 7-2 shows the interpolated precision-recall curves (corresponding to the English-to-Chinese, English-to-Korean, and English-to-Japanese subtasks, from top to bottom) of the manual assessments of the Wikipedia ground-truth when assessed at the file-to-file level. In this case, the Wikipedia ground-truth itself was used to create runs for evaluation. As the set of Wikipedia ground-truth links taken from the counterparts of topics in the other language contains no associated anchors, the assessment was done without considering the relevance of anchors (using the title as the anchor for all links). The resulting qrel was then used to score the Wikipedia ground-truth runs and generate the interpolated precision-recall curves. The assessments for each language subtask were done separately by three assessors from different language backgrounds. Interestingly, the large gaps between the curves indicate that the Chinese assessor considered most Wikipedia ground-truth links relevant to the topics, whereas both the Japanese and Korean assessors considered a significant number of links irrelevant. The links marked as irrelevant may have suffered from a lack of context, as no anchors were provided during the assessment. In any case, the above results demonstrate two things: relying simply on existing links in Wikipedia for information browsing cannot be satisfactory; and a good CLLD algorithm that can recommend meaningful anchors and relevant links dynamically will certainly be helpful for knowledge discovery in Wikipedia.

7.3 LINK EVALUATION

7.3.1 EVALUATION TYPES AND MEASURES

The English-to-Chinese subtask saw the most runs, the largest number of links per topic, and the largest number of relevant links; consequently the effectiveness of cross-lingual link discovery is discussed in that context herein. For a full account of the experiments the reader is referred to the proceedings of NTCIR-930. File-to-file (F2F) evaluation was performed using both the automatic and manual assessment sets, but for reasons already given, anchor-to-file (A2F) evaluation could only be performed using the manual assessments. LMAP was the preferred metric as it is analogous to the traditional MAP, which is well understood by the IR community.
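As a rough illustration of how these metrics behave, the sketch below computes P@N and per-topic Average Precision over a ranked list of recommended links; averaging the latter over topics gives a MAP-style score. This is a simplified stand-in for the official evaluation toolkit (in the A2F setting, LMAP additionally takes anchor relevance into account).

```python
def precision_at_n(ranked_links, relevant, n):
    """P@N over one topic's ranked list of recommended links."""
    top = ranked_links[:n]
    return sum(1 for link in top if link in relevant) / n

def average_precision(ranked_links, relevant):
    """Average Precision for one topic: precision is accumulated at the
    rank of each relevant link, then normalised by the number of
    relevant links in the qrel."""
    hits, total = 0, 0.0
    for rank, link in enumerate(ranked_links, start=1):
        if link in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```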

7.3.2 EVALUATION RESULTS

Precision-recall curves for the English-to-Chinese runs are shown in Figure 7-3 and Figure 7-4. The scores of the top runs are shown in Table 7-4. From these figures and this table it can be seen that the top performing run under automatic file-to-file assessment was the HITS run (Fahrni, Nastase, & Strube, 2011); however, HITS was outperformed by runs from UKP (Kim & Gurevych, 2011) and QUT when manual assessment was used. This does not necessarily mean that one run is better than another; it means that under different evaluation paradigms the runs exhibit different orderings. The HITS run is the best at identifying links similar to those already present in Wikipedia, but the UKP run is better at identifying links with topical relevance to the source article.

30 http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-OV-CROSSLINKTangL.pdf


Figure 7-3: Interpolated Precision-Recall Graph Showing En-2-Zh F2F Evaluation against the Automatic Assessments

Figure 7-4: Interpolated Precision-Recall Graph Showing En-2-Zh A2F Evaluation against the Manual Assessments

Table 7-4: LMAPs of Teams in the Two Evaluations

Automatic F2F evaluation    Manual A2F evaluation
Run ID   LMAP               Run ID   LMAP
HITS     0.373              UKP      0.157
UKP      0.314              QUT      0.115
KMI      0.260              HITS     0.102
IASL     0.225              KMI      0.097
QUT      0.179              IASL     0.037
WUST     0.108              WUST     0.012
ISTIC    0.032              ISTIC    0.000

However, the topical relevance of both the anchors and the targets is considered in the anchor-to-file evaluation, so it is reasonable to conclude that it is a more rigorous evaluation method than the file-to-file evaluation with Wikipedia ground-truth (hence the lower LMAP scores). But automatic evaluation does not require assessors and can be conducted on a large number of topics (source documents), making it more comprehensive.

7.3.3 COMPARISON OF CLLD ALGORITHMS

Good CLLD algorithms were seen in the NTCIR-9 Crosslink task. There are similarities between some approaches, but each adopts unique features in its own implementation. As seen in Table 7-4, four CLLD approaches from the top three teams in the two kinds of evaluations are chosen for discussion. These methods are summarised as follows:

• Team HITS (Fahrni, et al., 2011) implement their CLLD system with a strong disambiguation process. Candidate anchors are recognised by matching all possible n-grams in the topic document against the phrases in a pre-constructed lexicon. A threshold on prior probability is then used to filter out those with low keyphraseness. Next, a graph-based disambiguation process is performed to remove those anchor candidates that might cause ambiguity. Finally, the remaining anchors are sorted on the relevance decided by a trained binary classifier. Their approaches achieved very good results in the overall competition with other teams. A unique feature of the system is that the multilingual index is expanded with other unofficial resources (Wikipedia dumps from different dates, different language corpora, etc.).

• Team UKP (Kim & Gurevych, 2011) follow a common four-step procedure to cross-link documents: 1) anchor selection; 2) anchor ranking; 3) anchor translation; 4) target discovery. They find that a strategy combining word n-grams for anchor selection, anchor probability for anchor ranking, and Wikipedia page triangulation for translation delivers the most effective cross-lingual link recommendation in Wikipedia. Their approach is the top performer in the evaluations using manual assessment results, particularly in the English-to-Chinese and English-to-Korean subtasks.

• Team QUT (Tang et al., 2011) use a link mining approach that mines the existing link structure for anchor probabilities and relies on cross-lingual article title triangulation for anchor “translation”. Their best implementation performs well when measured using the Precision-at-5 metric, and they also contributed the largest number of unique relevant links, links that users might consider worth further reading, in the English-to-Chinese subtask (Tang, Geva, Trotman, Xu, & Itakura, 2011).

• Team KMI (Knoth, Zilka, & Zdrahal, 2011) attract articles having concepts similar to the topics as target documents through a document attracting strategy that can utilise Explicit Semantic Analysis (ESA), Terminology, or both. Discovered links are discarded if there are no anchors that can be mapped to English Wikipedia articles through triangulation. Links are then ranked by a Support Vector Machine (SVM). Their approach proved able to discover unique links.


Table 7-5: Comparison of the Different Implementations of the Four CLLD Systems

HITS
  Anchor source:   Wikipedia titles
  Anchor ranking:  Prior probability
  Disambiguation:  Maximum weighted algorithm
  Translation:     Translation edge; triangulation clique
  Key features:    Very large multilingual concept repository with a rich index; anchors are disambiguated

UKP
  Anchor source:   Noun phrases; named entities; Wikipedia anchors; titles; n-grams
  Anchor ranking:  BM25; anchor probability; anchor strength
  Disambiguation:  N/A
  Translation:     Triangulation; bilingual dictionary; machine translation
  Key features:    Various anchor selection methods and various anchor text translation methods were tested

QUT
  Anchor source:   Wikipedia links
  Anchor ranking:  Link probability; page name matching; CLIR
  Disambiguation:  N/A
  Translation:     Triangulation; machine translation
  Key features:    Mainly relies on link probability for anchor selection and ranking

KMI
  Anchor source:   Wikipedia titles
  Anchor ranking:  Link ranking
  Disambiguation:  N/A
  Translation:     Targets with similar concepts in a different language are grouped through ESA or Terminology
  Key features:    Finds relevant cross-lingual target documents for the topic first; links are ranked with an SVM classifier

A direct comparison of the above CLLD approaches is given in Table 7-5. The algorithms used by teams HITS, UKP and QUT discover links by first recognising candidate anchors in the source articles and then finding the associated links for them; KMI's implementation (Knoth, et al., 2011), by contrast, initialises link recommendation by locating the relevant target documents first, and then finds the appropriate anchors for the already discovered links. Although many different names are given to the anchor weighting method in these approaches, they share a similar meaning. For example, prior probability, link probability and anchor probability as described in the different approaches all refer to a similar computation weighting a phrase by its probability of being a good anchor in the source document. With prior knowledge of the link structure in Wikipedia, the probability equation is defined as:

$$P(a \rightarrow d) = \frac{lf_{a \rightarrow d}}{df_{a}} \qquad (7.1)$$

where the numerator is the link frequency (lf) of anchor a pointing to document d, and the denominator is the document frequency (df) of anchor a in the corpus. This anchor weighting method has been widely adopted in mono-lingual link insertion in Wikipedia, namely wikification or wikifying (Itakura & Clarke, 2008; Mihalcea & Csomai, 2007; Milne & Witten, 2008).
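For illustration only, reading the Citybank entry of Table 6-4 (lf = 431, df = 485) and assuming its lf counts links to a single preferred target, equation (7.1) gives:

$$P(\text{Citybank} \rightarrow d) = \frac{431}{485} \approx 0.889$$

so nearly nine out of ten occurrences of the phrase appear as a link, marking it as a strong anchor candidate.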

7.3.4 UNIQUE RELEVANT LINKS

The performance in terms of precision has been discussed in the previous sections. In this section the breadth of the algorithms is discussed. That is, an algorithm that correctly identifies links similar to those already in the collection is of interest, but so too are algorithms that identify novel (and relevant) links not already present, even at the expense of some precision. This is the traditional precision / recall trade-off well known in IR, but it is of particular interest in CLLD because it allows the hyperlink graph to grow in new, unconstrained ways and (perhaps) in ways more useful to a user than simply taking them to the same familiar places. Table 7-6 presents the statistics of the unique relevant links discovered in the NTCIR-9 Crosslink runs. The first column lists the assessment set (automatic or

manual), the second column lists the total number of unique links discovered across all runs, and the third column lists the most diverse team and the number of unique relevant links it identified. For example, UKP found 97 of the 245 unique relevant links found across all the runs, which in turn amounts to 11.6% of the relevant links in the automatically extracted assessment set. Of particular note is that the QUT runs contributed 1103 unique relevant links to the manual assessment set, about 79% of the total unique relevant links found. This suggests that the QUT runs are the most diverse, preferring their own links to those already present in Wikipedia. Without manual assessment, the relevance of these links would never have been revealed.

Table 7-6: Unique Relevant English-to-Chinese Links

Assessment   Unique relevant links (% of total)   Team with highest # of unique relevant links
Automatic    245 (11.6%)                          UKP (97)
Manual       1397 (32.4%)                         QUT (1103)

Figure 7-5: A Prospective Anchor Found for the Topic "Croissant"


7.4 DISCUSSION: CLLD IN ACTION

In sections 7.3.2 and 7.3.4 the performance of the runs was quantified in various ways, but how good are these systems? Can users satisfy their information needs? Figure 7-5 shows a snippet of the English Wikipedia article “Croissant”. The boxed text reads “crema pasticcera”, which is linked neither to the English “Custard” article nor to the Italian “Crema pasticcera” article. When a multi-lingual user encounters such a term they may ask:

• What is “crema pasticcera”?
• Is it in my own language, or a language I read?

These questions cannot easily be answered without further manual searching of Wikipedia or translation using a translation service. Recall from Figure 1-3 (Section 1.1) that finding information for “crema pasticcera” in Wikipedia is very difficult because:

• users may not know this is Italian, and even if they did,
• the Italian article “Crema pasticcera” is not linked to the English article “Custard” (or the Chinese article “奶黄”).

Among all the submitted runs, KMI’s runs (Knoth, et al., 2011) uniquely recommended “crema pasticcera” and correctly linked it to the Chinese article “奶黄”. KMI have developed an algorithm that solved the problem of the running example (Figure 1-3: Lost in Translation). In doing so they have also demonstrated the power (and importance) of CLLD in correctly expanding Wikipedia to include cross-lingual links. However, it should also be noted that without manual assessment this link would not have been assessed as relevant, and so their run additionally demonstrates the importance of manual assessment. In fact, the LMAP scores of the official evaluation results (Tang, Geva, et al., 2011) show that their run KMI_SVM_ESA_TERMDB was not very effective when measured using automatic file-to-file assessments (scoring 7th) but was effective (placing 3rd) when measured using manual file-to-file assessments.


7.5 SUMMARY

It is important to have manual assessment because: 1) the Wikipedia ground-truth has only one link per anchor; and 2) there are many relevant links that can be discovered by participants using different methods. This chapter presented the assessment challenges and the difficulties of assessor recruitment, along with the lessons learned and the evaluation results from running the NTCIR-9 Crosslink task. In particular, the focus was on effective system evaluation, and both automatic and manual evaluations were explored. For automatic evaluation the ground-truth qrels were extracted from Wikipedia through triangulation. Manual assessments were performed by multi-lingual assessors with a high level of education. Evaluation was both file-to-file and anchor-to-file. It is suggested that manual assessment results in a more thorough evaluation, whereas automatic assessment suits a broader evaluation. Evaluation of the runs shows that some of the algorithms used at NTCIR-9 were effective, finding links already in Wikipedia as well as previously unseen links; however, no single algorithm was best at both. Some algorithms produced a disproportionately large number of unique relevant links, suggesting that the teams responsible focused on diversification in their result sets.


Chapter 8: Conclusions

This chapter concludes the research on Chinese / English cross-lingual link discovery. Section 8.1 draws overall conclusions based on the conducted experiments and research outcomes. A brief description of the academic contributions of this research is given in Section 8.2. Section 8.3 discusses the limitations of the finished work and proposes future work, including applications that can be developed from the findings of this study.


8.1 CONCLUSIONS

Encyclopaedia of Web 2.0 such as Wikipedia, which covers over ten millions articles in almost every written language, gives the public open access to the free information and knowledge in multiple language domains. Cross-lingual link discovery is a new way of automated link recommendation between documents in different languages. If multiple target links for each anchor in articles can be provided, they can assist information seekers and knowledge contributors with easy access to cross-lingual materials in a knowledge base, especially in Wikipedia. With a particular emphasis on Chinese / English bi-directional link suggestions, this thesis studied the approaches to realising cross-lingual link discovery and the possible problems that might occur when linking Chinese to English documents. The experimentation for testing the effectiveness and the feasibility of the proposed CLLD methods was conducted using Wikipedia itself as the experimental document collection. In order to promote research on cross-lingual link discovery and attract international collaborations in this research filed, a new and the first cross-lingual link discovery evaluation track known as Cross-lingual Link Discovery (Crosslink) task was proposed, and formally accepted at NTCIR-9. To quantify the performances of various realisation methods and CLLD systems, an evaluation framework was developed, and subsequently utilised at NTCIR-9 Crosslink task. The major findings of this research are given as follows: During the study of the needs to handle the missing word boundaries issue in anchor identification when linking Chinese documents to English ones, a novel hybrid boundary-oriented segmentation method—NGMI was demonstrated. It can be used as either an unsupervised or a supervised segmentation method. The evaluation at the Chinese word segmentation task of CIPS-SIGHAN 2010 shows the supervised version of NGMI can achieve promising results in cross-domain text segmentation. However, in the later Chinese-to-English link discovery experiments, it is suggested that Chinese segmentation is not absolutely needed for Chinese-to-English link discovery. Even though, a proper Chinese segmentation method is still required in English-to-Chinese link discovery when the cross-lingual information retrieval method is used to suggest links, because segmentation is critical to Chinese text 156

Chapter 8: Conclusions

indexing and a proper segmentation strategy can ensure high document retrieval precision. As cross-lingual link discovery allows multiple target links for each anchor, it perfectly matches the nature of cross-lingual information retrieval which can return a list of relevant cross-lingual documents for a given query which in this case is an identified anchor. So CLIR can accordingly be adapted for cross-lingual link discovery, and the accuracy of anchor translation will critical to the performance of CLIR and the quality of the recommended links. General machine translation can already achieve very good translation results, but VMNET can further improve the translation accuracy. The results from the CLIR experiments indicate that VMNET is capable of providing high quality translated query terms. A CLIR system which can be embedded into a CLLD system as a sub-component could achieve good results for link discovery by using the VMNET for translation and simple technique (bigrams and unigrams for Chinese text) for text indexing. Subsequently, a few methods that cover the study of the effects of natural language processing and cross-lingual information retrieval on cross-lingual link discovery were experimented with. The approaches to CLLD that were tested include: link mining, page name matching, and cross-lingual information retrieval. The link mining method with Wikipedia cross-lingual document name triangulation performs the best among all implementations. The CLLD system implemented with the link mining method achieved encouraging results in both in-house evaluation and the official evaluation of NTCIR-9 Crosslink task. The cross-lingual information retrieval method used in CLLD performed the worse in most evaluations, but among all the submissions to the NTCIR-9 Crosslink task this method contributes the most unique links for the provided topics in the evaluation using manual assessment results. Although Chinese segmentation and machine translation are two essential steps in Chinese to other-language information retrieval, the experiment results suggest that they are not absolutely necessary for cross-lingual link discovery. This is because segmentation is implicit in anchor mining and translation is implicit in cross-language triangulation. The evaluation of cross-lingual link discovery uses a gold standard dataset from Wikipedia ground-truth and manual assessment results at either file-to-file or anchor-to-file level. To have a complete judgement of the effectiveness of various Chapter 8: Conclusions


The evaluation of cross-lingual link discovery uses a gold standard dataset built from the Wikipedia ground truth and from manual assessment results, at either the file-to-file or the anchor-to-file level. To reach a complete judgement of the effectiveness of the various CLLD methods and the implemented systems, manual assessment is essential: 1) one set of Wikipedia ground truth has only one link per anchor; 2) the other set of Wikipedia ground truth contains existing links only; and 3) participants discovered many relevant links beyond the ground-truth links. It can therefore be inferred that manual assessment yields a more thorough evaluation of CLLD approaches and systems, while automatic assessment is suitable for a broader evaluation.
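To make the file-to-file level concrete, the sketch below scores one hypothetical run against a ground-truth set: a recommended link counts as correct if its target document is in the gold standard, regardless of anchor placement. Plain precision and recall are used here as a deliberately simplified stand-in for the official Crosslink measures.

```python
def f2f_precision_recall(recommended, ground_truth):
    """File-to-file evaluation: a recommended link is correct when its
    target document appears in the ground-truth set."""
    hits = sum(1 for target in recommended if target in ground_truth)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# A hypothetical run for one topic: three suggested English target
# documents, judged against ground-truth links taken from Wikipedia.
p, r = f2f_precision_recall(
    ["Sydney", "Opera house", "Australia"],
    {"Sydney", "Australia", "Landmark"},
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```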

8.2 CONTRIBUTIONS

This thesis contributes to research on cross-lingual information access with a new cross-lingual link discovery task, and brings cross-lingual link discovery into the information management of a knowledge base. With an emphasis on Chinese / English cross-lingual document linking, a CLLD system implemented from the proposed realisation methods can help break language barriers in knowledge sharing, particularly for easy cross-lingual information access between the Chinese and English Wikipedia. Specifically, this thesis:

• provides a simple and generic alternative segmentation method that can be used for n-gram Chinese text segmentation;

• provides an improved machine translation method for high-precision named entity translation, using a voting mechanism over multiple translation sources;

• develops a standard evaluation dataset, submission specifications, evaluation metrics, and assessment and evaluation toolkits, which together form a cross-lingual evaluation framework; and

• applies the link probability and page name matching link discovery methods in a multilingual environment, and experiments with the proposed CLLD methods at the NTCIR-9 Crosslink task (a sketch of the link probability estimate follows this list).
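Link probability, as commonly used for anchor identification in link discovery, can be estimated as the fraction of a phrase's occurrences in the collection that are marked up as anchors. The sketch below shows that estimate with hypothetical counts; the function name and the example statistics are illustrative, not taken from the thesis implementation.

```python
def link_probability(phrase, anchor_counts, occurrence_counts):
    """Probability that `phrase` acts as an anchor when it occurs:
    linked occurrences divided by all occurrences in the collection."""
    occurrences = occurrence_counts.get(phrase, 0)
    if occurrences == 0:
        return 0.0
    return anchor_counts.get(phrase, 0) / occurrences

# Hypothetical statistics mined from a Wikipedia dump: the phrase occurs
# 1000 times in total and is an anchor in 420 of those occurrences.
anchor_counts = {"machine translation": 420}
occurrence_counts = {"machine translation": 1000}
print(link_probability("machine translation", anchor_counts, occurrence_counts))
# 0.42
```

Phrases whose link probability exceeds a tuned threshold are kept as candidate anchors; the rest are discarded before target documents are sought.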

The major contribution of this thesis is the provision of a standard evaluation framework for cross-lingual link discovery research. Such a framework is important in CLLD research because it helps in benchmarking the performance of various CLLD systems and in identifying good CLLD realisation approaches.


The evaluation methods and the evaluation framework described in this thesis were utilised to quantify system performance at the NTCIR-9 Crosslink task, the first information retrieval track of its kind.

8.3 LIMITATIONS AND FUTURE WORK

As this is the first study of cross-lingual link discovery, limitations in the depth of this CLLD research are inevitable.

First, although the Chinese segmentation method NGMI can effectively segment n-gram words in Wikipedia, the algorithm that uses NGMI for text segmentation is not efficient enough, because it requires computing a confidence score for every possible boundary. This problem could be alleviated by restricting the confidence computation to ambiguous boundaries only.

Secondly, the cross-lingual information retrieval method used in CLLD relied only on a traditional machine translation service, without using VMNET. As VMNET draws on a translation source from Wikipedia itself and can deliver accurate translation results, the performance of the CLIR method for CLLD could potentially be improved by adopting VMNET in the CLLD realisation.

Thirdly, as no word sense disambiguation (WSD) process is involved in the proposed CLLD methods, the accuracy of both anchor identification and translation may be unsatisfactory; the performance comparison with other participating teams at the NTCIR-9 Crosslink task suggests this. Moreover, there is a lack of semantic analysis of content when cross-linking documents in different languages. The cross-lingual linking algorithms, including the link mining method that performed the best among all implementations, did not carry out semantic analysis of article content, so their performance could be less effective; for example, in Chinese articles non-Chinese people's names are often referred to or mentioned by surname only. There is certainly room for performance improvement if the senses of identified anchors can be disambiguated with appropriate WSD methods and if semantic analysis is involved in determining the similarity between the source document and the target document.


Fourthly, focused information retrieval for more efficient cross-lingual information access is yet to be studied in this research. Focused information retrieval can help identify the appropriate anchor text in a web document and link it straight to the cross-language content with optimal access. Optimal access within a document means giving users the best entry point (BEP) from which to start reading; a BEP is a position in the document that is the ideal place to begin reading for the right information (Reid, Lalmas, Finesilver, & Hertzum, 2006b). In this thesis, both the implementation of cross-lingual document linking and the evaluation of the CLLD approaches are limited to the file-to-file or anchor-to-file levels, which means the beginning of an article is assumed to be the default BEP. In future, BEP recommendation for cross-lingual link discovery in the Chinese / English Wikipedia should be realised for more efficient information access: research by Kamps et al. (2007) shows that the majority of best entry points chosen by assessors in a structured document retrieval experiment were not the beginning of the article, meaning a focused entry point is preferred over merely the start of the document.

Furthermore, due to resource and time constraints, manual assessment of the Chinese-to-English link discovery experiments could not be completed in this study. It is hoped that manual assessment of the Chinese-to-English runs can be finished, or that a Chinese-to-English link discovery task will be held to provide manual assessment of the pooled runs from all participating teams. Accordingly, Crosslink tasks that link CJK documents to English ones are planned for the evaluation round at NTCIR-10 in 2012.

Lastly, the evaluation of the runs submitted to the NTCIR-9 Crosslink task shows that some of the algorithms used were effective, finding links already in Wikipedia as well as previously unseen links; however, no single algorithm was best at both. Some algorithms produced a disproportionately large number of unique relevant links, suggesting that the teams responsible focused on diversification in their result sets. Further work is needed by these groups, including the QUT team, to produce algorithms capable of reliably identifying large numbers of diverse and relevant cross-language links, and such work is expected to appear at NTCIR and other evaluation forums in the near future.

Bibliography

Adafre, S. F., & De Rijke, M. (2005). Discovering missing links in Wikipedia. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 90-97): ACM.

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern Information Retrieval: ACM Press / Addison-Wesley.

Chan, Y.-C., Chen, K.-H., & Lu, W.-H. (2007). Extracting and Ranking Question-Focused Terms Using the Titles of Wikipedia Articles. In NTCIR-6 (pp. 210-215): NTCIR.

Chang, P.-C., Galley, M., & Manning, C. D. (2008). Optimizing Chinese word segmentation for machine translation performance. Paper presented at Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio.

Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. (1997). Chinese text retrieval without using a dictionary. In SIGIR '97: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 42-49): ACM.

Chen, A., Jiang, H., & Gey, F. (2000). Combining multiple sources for short query translation in Chinese-English cross-language information retrieval. In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (pp. 17-23): ACM.

Chen, H.-H., Huang, S.-J., Ding, Y.-W., & Tsai, S.-C. (1998). Proper name translation in cross-language information retrieval. In Proceedings of the 17th International Conference on Computational Linguistics (pp. 232-236): Association for Computational Linguistics.

Dai, Y., Loh, T. E., & Khoo, C. S. G. (1999). A new statistical formula for Chinese text segmentation incorporating contextual information. In SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 82-89): ACM.

Denoyer, L., & Gallinari, P. (2006). The Wikipedia XML Corpus. SIGIR Forum, 40(1), 64-69.

Dong, L., & Watters, C. (2004). Improving Efficiency and Relevance Ranking in Information Retrieval. In WI '04: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 648-651): IEEE Computer Society.

Voorhees, E. M. (2007). TREC: Continuing information retrieval's tradition of experimentation. Commun. ACM, 50(11), 51-54. doi:10.1145/1297797.1297822

Fachry, K., Kamps, J., Koolen, M., & Zhang, J. (2008). Using and Detecting Links in Wikipedia. In Focused Access to XML Documents (pp. 388-403).

Fahrni, A., Nastase, V., & Strube, M. (2011). HITS’ Graph-based System at the NTCIR-9 Cross-lingual Link Discovery Task. In Proceedings of NTCIR-9.

Feng, H., Chen, K., Kit, C., & Deng, X. (2005). Unsupervised Segmentation of Chinese Corpus Using Accessor Variety. In Natural Language Processing - IJCNLP 2004: Springer Berlin / Heidelberg.

Ferrández, S., Toral, A., Ferrández, Ó., Ferrández, A., & Muñoz, R. (2007). Applying Wikipedia’s Multilingual Knowledge to Cross–Lingual Question Answering. In Natural Language Processing and Information Systems (pp. 352-363).

Getoor, L. (2003). Link mining: a new data mining challenge. SIGKDD Explor. Newsl., 5(1), 84-89.

Getoor, L., & Diehl, C. P. (2005a). Introduction to the special issue on link mining. SIGKDD Explor. Newsl., 7(2), 1-2.

Getoor, L., & Diehl, C. P. (2005b). Link mining: a survey. SIGKDD Explor. Newsl., 7(2), 3-12.

Geva, S. (2008). GPX: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia. In Focused Access to XML Documents (pp. 404-416).

Huang, D., Xu, Y., Trotman, A., & Geva, S. (2008). Overview of INEX 2007 Link the Wiki Track. In Focused Access to XML Documents (pp. 373-387).


Huang, W., Geva, S., & Trotman, A. (2009). Overview of the INEX 2008 Link the Wiki Track. In S. Geva, J. Kamps & A. Trotman (Eds.), Advances in Focused Retrieval (Vol. 5631, pp. 314-325): Springer Berlin / Heidelberg.

Huang, W., Geva, S., & Trotman, A. (2010). Overview of the INEX 2009 Link the Wiki Track. In S. Geva, J. Kamps & A. Trotman (Eds.), Focused Retrieval and Evaluation (Vol. 6203, pp. 312-323): Springer Berlin / Heidelberg.

Huang, W. C., Trotman, A., & Geva, S. (2007). Collaborative knowledge management: Evaluation of automated link discovery in the Wikipedia. In SIGIR 2007 Workshop on Focused Retrieval.

Huang, W. C., Trotman, A., & Geva, S. (2009). The importance of manual assessment in link discovery. Paper presented at Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, Boston, MA, USA.

INEX. (2009). INEX 2009 Link-The-Wiki Track. Retrieved from http://www.inex.otago.ac.nz/tracks/wiki-link/wiki-link.asp

INEX. (2010). INEX 2010 Link-The-Wiki Task and Result Submission Specification. Retrieved from http://www.inex.otago.ac.nz/tracks/wikilink/runsubmission.asp?action=specification

Institute of Computing Technology, Chinese Academy of Sciences. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).

Itakura, K., & Clarke, C. (2008). University of Waterloo at INEX2007: Adhoc and Link-the-Wiki Tracks. In Focused Access to XML Documents (pp. 417-425).

Jenkinson, D., & Trotman, A. (2008). Wikipedia Ad Hoc Passage Retrieval and Wikipedia Document Linking. In Focused Access to XML Documents (pp. 426-439).

Junlin, Z., Le, S., & Jinming, M. (2005). Using the Web corpus to translate the queries in cross-lingual information retrieval. In Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE '05) (pp. 493-498).

Kamps, J., Koolen, M., & Lalmas, M. (2007). Where to start reading a textual XML document? In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 723-724): ACM.

Kamps, J., Lalmas, M., & Pehcevski, J. (2007). Evaluating relevant in context: document retrieval with a twist. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 749-750): ACM.

Kim, J., & Gurevych, I. (2011). UKP at CrossLink: Anchor Text Translation for Cross-lingual Link Discovery. In Proceedings of NTCIR-9.

Kit, C., Pan, H., & Chen, H. (2002). Learning case-based knowledge for disambiguating Chinese word segmentation: a preliminary study. 1-7.

Knoth, P., Zilka, L., & Zdrahal, Z. (2011). KMI, The Open University at NTCIR-9 CrossLink. In Proceedings of NTCIR-9.

Kwok, K. L., Dinstl, N., & Deng, P. (2001). English-Chinese CLIR using a simplified PIRCS system. In HLT '01: Proceedings of the first international conference on Human language technology research (pp. 1-4): Association for Computational Linguistics.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning (pp. 282-289).

Lalmas, M., & Reid, J. (2003). Automatic identification of best entry points for focused structured document retrieval. In CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management (pp. 540-543): ACM.

Lin, Y.-C., & Hung, P.-H. (2002). Probabilistic named entity verification. In COLING-02 on COMPUTERM 2002 (pp. 1-7): Association for Computational Linguistics.


Lu, C., Xu, Y., & Geva, S. (2007). Translation disambiguation in web-based translation extraction for English-Chinese CLIR. In SAC '07: Proceedings of the 2007 ACM Symposium on Applied Computing (pp. 819-823): ACM.

Lua, K. T. (1990). From character to word - An application of information theory. Computer Processing of Chinese & Oriental Languages, 4(4), 304-312.

Lua, K. T., & Gan, G. W. (1994). An application of information theory in Chinese word segmentation. Computer Processing of Chinese & Oriental Languages, 8(1), 115-124.

Ma, W.-Y., & Chen, K.-J. (2003). A bottom-up merging algorithm for Chinese unknown word extraction. In Proceedings of the second SIGHAN workshop on Chinese language processing (pp. 31-38): Association for Computational Linguistics.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval: Cambridge University Press.

MDBG. CC-CEDICT download. Retrieved from http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

Melo, G. d., & Weikum, G. (2010). Untangling the cross-lingual link structure of Wikipedia. Paper presented at Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

Mihalcea, R., & Csomai, A. (2007). Wikify!: linking documents to encyclopedic knowledge. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (pp. 233-242): ACM.

Milne, D., & Witten, I. H. (2008). Learning to link with wikipedia. Paper presented at Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA.

NTCIR Project. (2010). Tools. Retrieved from http://research.nii.ac.jp/ntcir/tools/tools-en.html

Oard, D. (2002). When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. In Proceedings of "Cross-Language Information Retrieval: A Research Roadmap", a SIGIR-2002 workshop: ACM.

Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004): Association for Computational Linguistics.

Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. Paper presented at Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, Melbourne, Australia.

Reid, J., Lalmas, M., Finesilver, K., & Hertzum, M. (2006a). Best entry points for structured document retrieval--Part I: Characteristics. Information Processing & Management, 42(1), 74-88. Retrieved from http://www.sciencedirect.com/science/article/B6VC8-4G0YTNB-1/1/09f3e2d3aa0a2238f9a6afc9d329e9f6.

Reid, J., Lalmas, M., Finesilver, K., & Hertzum, M. (2006b). Best entry points for structured document retrieval--Part I: Characteristics. Information Processing & Management, 42(1), 74-88. Retrieved from http://www.sciencedirect.com/science/article/B6VC8-4G0YTNB-1/1/09f3e2d3aa0a2238f9a6afc9d329e9f6.

Reid, J., Lalmas, M., Finesilver, K., & Hertzum, M. (2006c). Best entry points for structured document retrieval--Part II: Types, usage and effectiveness. Information Processing & Management, 42(1), 89-105. Retrieved from http://www.sciencedirect.com/science/article/B6VC8-4FXV770-1/1/b4ce74be8f30e2cc8ecac15c89d55a4a.

Luk, R. W. P., & Kwok, K. L. (2002). A comparison of Chinese document indexing strategies and retrieval models. ACM Transactions on Asian Language Information Processing (TALIP), 1(3), 225-268. doi:10.1145/772755.772758

Robertson, S. E., & Jones, K. S. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.


Robertson, S. E., & Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Paper presented at Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland.

Sakai, T., Shima, H., Kando, N., Song, R., Lin, C.-J., Mitamura, T., & Sugimoto, M. (2010). Overview of NTCIR-8 ACLIA IR4QA. In Proceedings of NTCIR-8 (pp. 63-93).

Schenkel, R., Suchanek, F., & Kasneci, G. (2007). YAWN: A Semantically Annotated Wikipedia XML Corpus. In 12. GI- Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW 2007).

Second International Chinese Word Segmentation Bakeoff - Result Summary. (2005). Retrieved from http://www.sighan.org/bakeoff2005/

Shi, L., Nie, J.-Y., & Cao, G. (2008). RALI Experiments in IR4QA at NTCIR-7. In NTCIR-7 (pp. 115-124).

SIGHAN. (2005). Second International Chinese Word Segmentation Bakeoff Result Summary. Retrieved from http://www.sighan.org/bakeoff2005/data/results.php.htm

Sorg, P., & Cimiano, P. (2008). Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach. In AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese & Oriental Languages, 4(4), 336-351.

Su, C.-Y., Lin, T.-C., & Wu, S.-H. (2007). Using Wikipedia to Translate OOV Terms on MLIR. In NTCIR-6 (pp. 109-115): NTCIR Project.

Sun, M., Shen, D., & Tsou, B. K. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In Proceedings of the 17th international conference on Computational linguistics (pp. 1265-1271): Association for Computational Linguistics.

Tang, L.-X., Cavanagh, D., Trotman, A., Geva, S., Xu, Y., & Sitbon, L. (2011). Automated Cross-lingual Link Discovery in Wikipedia. In Proceedings of NTCIR-9 (pp. 512-519).

Tang, L.-X., Geva, S., Trotman, A., Xu, Y., & Itakura, K. Y. (2011). Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery. In Proceedings of NTCIR-9 (pp. 437-463).

Mori, T. (2007). A method of Cross-Lingual Question-Answering Based on Machine Translation and Noun Phrase Translation using Web documents. In NTCIR-6 (pp. 182-189): NTCIR.

Teahan, W. J., McNab, R., Wen, Y., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Comput. Linguist., 26(3), 375-393.

The Stanford Natural Language Processing Group. (2010). Stanford Log-linear Part-Of-Speech Tagger. Retrieved from http://nlp.stanford.edu/software/tagger.shtml

Trotman, A., Alexander, D., & Geva, S. (2011). Overview of the INEX 2010 Link the Wiki Track. In S. Geva, J. Kamps, R. Schenkel & A. Trotman (Eds.), Comparative Evaluation of Focused Retrieval (Vol. 6932, pp. 241-249): Springer Berlin / Heidelberg.

Trotman, A., Geva, S., & Kamps, J. (2007). Report on the SIGIR 2007 workshop on focused retrieval. SIGIR Forum, 41(2), 97-103.

Trotman, A., & Lalmas, M. (2006). Why structural hints in queries do not help XML-retrieval. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 711-712): ACM.

Tung, C. H., & Lee, H. J. (1994). Identification of unknown words from a corpus. Computer Processing of Chinese & Oriental Languages, 8(1), 115-124.

University of Pennsylvania. (2010). POS tags. Retrieved from http://bioie.ldc.upenn.edu/wiki/index.php/POS_tags

Vercoustre, A.-M., Thom, J. A., & Pehcevski, J. (2008). Entity ranking in Wikipedia. In SAC '08: Proceedings of the 2008 ACM symposium on Applied computing (pp. 1101-1106): ACM.

Wang, L.-J., Li, W.-C., & Chang, C.-H. (1992). Recognizing unregistered names for Mandarin word identification. In Proceedings of the 14th conference on Computational linguistics (pp. 1239-1243): Association for Computational Linguistics.

Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., . . . Steinberg, D. (2007). Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1), 1-37.

Yang, K.-C., Ho, T.-H., Chien, L.-F., & Lee, L.-S. (1998). Statistics-based segment pattern lexicon - a new direction for Chinese language modeling. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 169-172).

Yang, L., Ji, D., & Tang, L. (2004). Document re-ranking based on automatically acquired key terms in Chinese information retrieval. In COLING '04: Proceedings of the 20th international conference on Computational Linguistics: Association for Computational Linguistics.

Ye, S., Chua, T.-S., & Liu, J. (2002). An agent-based approach to Chinese named entity recognition. In Proceedings of the 19th international conference on Computational linguistics (pp. 1-7): Association for Computational Linguistics.

Yeung, C.-m. A., Duh, K., & Nagata, M. (2011). Assisting cross-lingual editing in collaborative writing. SIGWEB Newsl., (Spring), 1-5. doi:10.1145/1942800.1942804

Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to Ad Hoc information retrieval. Paper presented at Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, Louisiana, United States.

Zhang, K., Liu, Q., Zhang, H., & Cheng, X.-Q. (2002). Automatic recognition of Chinese unknown words based on roles tagging. In Proceedings of the first SIGHAN workshop on Chinese language processing (pp. 1-7): Association for Computational Linguistics.

Zhang, Y., & Vines, P. (2004). Using the web for automated translation extraction in cross-language information retrieval. Paper presented at Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, United Kingdom.

Zhang, Y., Vines, P., & Zobel, J. (2005). Chinese OOV translation and post-translation query expansion in Chinese-English cross-lingual information retrieval. ACM Transactions on Asian Language Information Processing (TALIP), 4(2), 57-77. doi:10.1145/1105696.1105697

Zobel, J., & Moffat, A. (2006). Inverted files for text search engines. ACM Comput. Surv., 38(2).
