LEARNING SEMANTIC ROLE LABELING VIA BOOTSTRAPPING WITH UNLABELED DATA

RASOUL SAMAD ZADEH KALJAHI

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY UNIVERSITY OF MALAYA KUALA LUMPUR

DECEMBER 2010

 

LEARNING SEMANTIC ROLE LABELING VIA BOOTSTRAPPING WITH UNLABELED DATA

RASOUL SAMAD ZADEH KALJAHI

SUBMITTED TO THE FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY, UNIVERSITY OF MALAYA, IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE DEGREE OF MASTER OF COMPUTER SCIENCE

DECEMBER 2010

 

 

ABSTRACT

Semantic role labeling (SRL) has recently attracted a considerable body of research due to its utility in several natural language processing tasks. Current state-of-the-art semantic role labeling systems use supervised statistical learning methods, which rely heavily on hand-crafted corpora. Creating these corpora is tedious and costly, and the resulting corpora are not representative of the language due to the extreme diversity of natural language usage. This research investigates self-training and co-training, two semi-supervised algorithms that aim to address this problem by bootstrapping a classifier from a smaller amount of annotated data via a larger amount of unannotated data. Due to the complexity of semantic role labeling and the high number of parameters involved in these algorithms, several problems are associated with this task. One major problem is the propagation of classification noise into successive bootstrapping iterations. The experiments show that the selection balancing and preselection methods proposed here are useful in alleviating this problem for self-training (e.g. a 0.8-point improvement in F1 for the best setting). In co-training, a main concern is the split of the problem into distinct feature views from which classifiers can be derived that effectively co-train with each other. This work utilizes constituency-based and dependency-based views of semantic role labeling for co-training and verifies three variations of these algorithms with three different feature splits based on these views. Balancing the feature split to eliminate the performance gap between the underlying classifiers proved to be important and effective. Also, co-training with a common training set for both classifiers performed better than with separate training sets for each of them: the latter degraded the base classifier, while the former improved it by 0.9 F1 points for the best setting. All the results show that much more unlabeled data is needed for these algorithms to be practically useful for SRL.

 

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the great support of my kind supervisor, Professor Mohd Sapiyan Baba. I would like to express my gratitude to him; he patiently taught me how to do research, tolerated my faults, and guided me in the right direction. I can never forget him. Many thanks also to Assoc. Professor Diljit Singh for his fruitful classes on Research Foundation, and to Dr. Rohana Mahmud, who taught me AI and NLP in her memorable classes. I should also thank the scholars outside the University of Malaya who answered my emails: Mihai Surdeanu (Stanford University), Martha Palmer (University of Colorado), and Xavier Carreras (MIT) for their valuable advice, and Richard Johansson (University of Trento) and Joakim Nivre (Uppsala University) for their great support with the LTH converter and MaltParser. Nothing will repay my parents' patience and help during these two years of my study abroad. But I owe this degree to my younger brother Riza; he did more than what a brother does, and I may never be able to repay him. My wonderful uncles have always been a great support in all aspects; many thanks to all of them. I will never forget the help and encouragement I received from my friends Mustafa, Amin, Rasul, and Nasir. My special thanks go to my friend Babak, who paved my way to master's study here. And thanks to all whose names were missed, as it is not possible to include everyone on this page.

To all who share this dissertation with me, and wish to deserve all this kindness.


TABLE OF CONTENTS

ABSTRACT ......................................................................... i
ACKNOWLEDGEMENTS ............................................................... ii
TABLE OF CONTENTS .............................................................. iii
LIST OF FIGURES ................................................................ vi
LIST OF TABLES ................................................................. viii
LIST OF APPENDICES ............................................................. ix

CHAPTER 1 Introduction ......................................................... 1
1.1. Background ................................................................ 1
1.2. Problem Statement ......................................................... 5
1.3. Research Objectives ....................................................... 7
1.4. Research Method ........................................................... 8
1.5. Significance of the Research .............................................. 9
1.6. Structure of Thesis ....................................................... 10

CHAPTER 2 Semantic Role Labeling ............................................... 11
2.1. Introduction .............................................................. 11
2.2. History ................................................................... 11
2.3. Corpora ................................................................... 15
  2.3.1. FrameNet .............................................................. 15
  2.3.2. PropBank .............................................................. 17
  2.3.3. Other Corpora ......................................................... 19
2.4. Syntax in SRL ............................................................. 21
  2.4.1. Constituency Syntax ................................................... 21
  2.4.2. Dependency Syntax ..................................................... 23
2.5. Architecture of SRL Systems ............................................... 26
  2.5.1. Predicate Identification .............................................. 26
  2.5.2. Argument Identification ............................................... 27
  2.5.3. Pruning ............................................................... 29
  2.5.4. Argument Classification ............................................... 30
  2.5.5. Combined Argument Identification and Classification .................. 30
  2.5.6. Global Optimization ................................................... 31
  2.5.7. System Integration .................................................... 32
2.6. Learning SRL .............................................................. 34
  2.6.1. Classification Algorithms ............................................. 34
  2.6.2. Learning Features ..................................................... 35
    2.6.2.1. Syntactic Features ................................................ 37
    2.6.2.2. Semantic Features ................................................. 38
2.7. Summary ................................................................... 39

CHAPTER 3 Learning Semantic Role Labeling Using Unlabeled Data ................. 41
3.1. Introduction .............................................................. 41
3.2. Learning with Unlabeled Data .............................................. 42
3.3. Bootstrapping ............................................................. 43
  3.3.1. Self-training ......................................................... 44
  3.3.2. Co-training ........................................................... 46
3.4. Using Unlabeled Data in SRL ............................................... 48
  3.4.1. Unsupervised Approaches ............................................... 48
  3.4.2. Semi-supervised Approaches ............................................ 50
  3.4.3. Self-training and Co-training ......................................... 51
3.5. Conclusion ................................................................ 54

CHAPTER 4 Self-training and Co-training for Semantic Role Labeling ............. 56
4.1. Introduction .............................................................. 56
4.2. Syntactic Formalism ....................................................... 57
4.3. SRL Corpora ............................................................... 58
  4.3.1. Labeled Corpus ........................................................ 58
  4.3.2. Unlabeled Corpora ..................................................... 61
    4.3.2.1. In-domain Unlabeled Corpus ........................................ 62
    4.3.2.2. Out-of-domain Unlabeled Corpus .................................... 62
4.4. Architecture .............................................................. 64
4.5. Learning .................................................................. 67
  4.5.1. The Classifier ........................................................ 67
  4.5.2. Features .............................................................. 69
    4.5.2.1. Constituent-based Features ........................................ 70
    4.5.2.2. Dependency-based Features ......................................... 71
    4.5.2.3. General Features .................................................. 72
  4.5.3. Self-training Algorithm ............................................... 74
    4.5.3.1. Explicit Parameters ............................................... 74
    4.5.3.2. Implicit Parameters ............................................... 76
  4.5.4. Co-training Algorithm ................................................. 78
    4.5.4.1. The Views and Feature Splits ...................................... 78
    4.5.4.2. The Algorithm ..................................................... 79
4.6. Evaluation ................................................................ 81
  4.6.1. The Test Data ......................................................... 82
  4.6.2. Evaluation Metrics .................................................... 83

CHAPTER 5 Experiments and Analysis of Results .................................. 85
5.1. Introduction .............................................................. 85
5.2. Characteristics of the Data ............................................... 85
5.3. Characteristics of the Classifiers ........................................ 88
5.4. The Effect of Supervision ................................................. 90
5.5. Self-training ............................................................. 92
  5.5.1. The Effect of Selection Balancing ..................................... 94
  5.5.2. The Effect of Preselection ............................................ 96
    5.5.2.1. Random Preselection ............................................... 96
    5.5.2.2. Simplicity-based Preselection ..................................... 97
  5.5.3. The Effect of Base Classifier Performance ............................. 98
    5.5.3.1. The Effect of Improved Syntax ..................................... 99
    5.5.3.2. The Effect of Global Optimization ................................. 101
  5.5.4. Out-of-domain Self-training ........................................... 102
    5.5.4.1. Original Size Dataset ............................................. 103
    5.5.4.2. Extended Dataset .................................................. 104
  5.5.5. Constituent-based vs. Dependency-based Self-training .................. 106
    5.5.5.1. In-domain Self-training ........................................... 106
    5.5.5.2. Out-of-domain Self-training ....................................... 108
5.6. Co-training ............................................................... 110
  5.6.1. Co-training with Common Training Set .................................. 111
    5.6.1.1. Agreement-based Selection ......................................... 111
    5.6.1.2. Confidence-based Selection ........................................ 114
  5.6.2. Co-training with Separate Training Sets ............................... 118
  5.6.3. Out-of-domain Co-training ............................................. 121
    5.6.3.1. Original Size Dataset ............................................. 121
    5.6.3.2. Extended Dataset .................................................. 123
5.7. Comparison and Discussion ................................................. 124

CHAPTER 6 Conclusion ........................................................... 128
6.1. Introduction .............................................................. 128
6.2. Summary and Conclusion .................................................... 128
  6.2.1. Semantic Role Labeling ................................................ 128
  6.2.2. Self-training and Co-training ......................................... 130
  6.2.3. A Bootstrapping Framework for Learning SRL ............................ 130
  6.2.4. Self-training and Co-training for SRL ................................. 131
6.3. Contributions ............................................................. 134
6.4. Future Work ............................................................... 135

LIST OF FIGURES

Figure 1.1   Semantic arguments of "wash" .......................................... 2
Figure 1.2   Semantic roles consistency against syntactic variation ................ 2
Figure 1.3   Semantic roles consistency across different languages ................. 2
Figure 2.1   The REMOVING frame .................................................... 16
Figure 2.2   Frameset for "wash" ................................................... 19
Figure 2.3   The parse tree of a PropBank sentence annotated with semantic roles .. 23
Figure 2.4   The partial parse of a PropBank sentence annotated with semantic roles 23
Figure 2.5   The dependency parse tree of a sentence annotated with semantic roles  24
Figure 4.1   Employed SRL architecture ............................................. 64
Figure 4.2   Pruning a Sample Parse Tree ........................................... 65
Figure 4.3   Parameterized Self-training algorithm ................................. 75
Figure 4.4   Parameterized Co-training algorithm ................................... 80
Figure 5.1   The flow of experiments ............................................... 85
Figure 5.2   Learning curve of constituent-based classifier on test.wsj ............ 90
Figure 5.3   Learning curve of constituent-based classifier on test.brown .......... 90
Figure 5.4   Learning curve of dependency-based classifier on test.wsj ............. 90
Figure 5.5   Learning curve of dependency-based classifier on test.brown ........... 90
Figure 5.6   The effect of selection balancing (test.wsj) .......................... 94
Figure 5.7   The effect of selection balancing (test.brown) ........................ 94
Figure 5.8   The effect of random preselection (test.wsj) .......................... 96
Figure 5.9   The effect of random preselection (test.brown) ........................ 96
Figure 5.10  The effect of simplicity-based preselection (test.wsj) ................ 97
Figure 5.11  The effect of simplicity-based preselection (test.brown) .............. 97
Figure 5.12  The effect of improved syntax (test.wsj) .............................. 99
Figure 5.13  The effect of improved syntax (test.brown) ............................ 99
Figure 5.14  The effect of global optimization with cha parses (test.wsj) .......... 101
Figure 5.15  The effect of global optimization with cha parses (test.brown) ........ 101
Figure 5.16  The effect of global optimization with chare parses (test.wsj) ........ 101
Figure 5.17  The effect of global optimization with chare parses (test.brown) ...... 101
Figure 5.18  Out-of-domain self-training (test.wsj) ................................ 102
Figure 5.19  Out-of-domain self-training (test.brown) .............................. 102
Figure 5.20  Out-of-domain self-training with extended unlabeled dataset (test.wsj) 104
Figure 5.21  Out-of-domain self-training with extended unlabeled dataset (test.brown) 104
Figure 5.22  In-domain dependency-based vs. constituent-based self-training (test.wsj) 106
Figure 5.23  In-domain dependency-based vs. constituent-based self-training (test.brown) 106
Figure 5.24  Out-of-domain dependency-based self-training (test.wsj) ............... 107
Figure 5.25  Out-of-domain dependency-based self-training (test.brown) ............. 107
Figure 5.26  Out-of-domain dependency-based self-training with extended unlabeled dataset (test.wsj) 108
Figure 5.27  Out-of-domain dependency-based self-training with extended unlabeled dataset (test.brown) 108
Figure 5.28  Agreement-based co-training with UBUS feature split (test.wsj) ........ 110
Figure 5.29  Agreement-based co-training with UBUS feature split (test.brown) ...... 110
Figure 5.30  Agreement-based co-training with UBS feature split (test.wsj) ......... 112
Figure 5.31  Agreement-based co-training with UBS feature split (test.brown) ....... 112
Figure 5.32  Agreement-based co-training with BS feature split (test.wsj) .......... 113
Figure 5.33  Agreement-based co-training with BS feature split (test.brown) ........ 113
Figure 5.34  Confidence-based co-training with UBUS feature split (test.wsj) ....... 115
Figure 5.35  Confidence-based co-training with UBUS feature split (test.brown) ..... 115
Figure 5.36  Confidence-based co-training with UBS feature split (test.wsj) ........ 116
Figure 5.37  Confidence-based co-training with UBS feature split (test.brown) ...... 116
Figure 5.38  Confidence-based co-training with BS feature split (test.wsj) ......... 117
Figure 5.39  Confidence-based co-training with BS feature split (test.brown) ....... 117
Figure 5.40  Co-training with separate training sets with UBUS feature split (test.wsj) 118
Figure 5.41  Co-training with separate training sets with UBUS feature split (test.brown) 118
Figure 5.42  Co-training with separate training sets with UBS feature split (test.wsj) 119
Figure 5.43  Co-training with separate training sets with UBS feature split (test.brown) 119
Figure 5.44  Co-training with separate training sets with BS feature split (test.wsj) 120
Figure 5.45  Co-training with separate training sets with BS feature split (test.brown) 120
Figure 5.46  Out-of-domain agreement-based co-training with original size unlabeled dataset (test.wsj) 121
Figure 5.47  Out-of-domain agreement-based co-training with original size unlabeled dataset (test.brown) 121
Figure 5.48  Out-of-domain confidence-based co-training with original size unlabeled dataset (test.wsj) 121
Figure 5.49  Out-of-domain confidence-based co-training with original size unlabeled dataset (test.brown) 121
Figure 5.50  Out-of-domain confidence-based co-training with extended unlabeled dataset (test.wsj) 123
Figure 5.51  Out-of-domain confidence-based co-training with extended unlabeled dataset (test.brown) 123

 

LIST OF TABLES

Table 2.1  Commonly Used Thematic Roles .......................................... 12
Table 2.2  List of Adjunctive Argument Labels in PropBank ........................ 18
Table 4.1  A typical training sentence with annotations .......................... 61
Table 4.2  Domain Distribution of Original and Selected OANC Corpus .............. 63
Table 4.3  Constituent-based Features ............................................ 70
Table 4.4  Dependency-based Features ............................................. 72
Table 4.5  General Features ...................................................... 73
Table 4.6  A typical in-domain test sentence with annotations .................... 82
Table 4.7  A typical out-of-domain test sentence with annotations ................ 82
Table 5.1  Characteristics of the Data ........................................... 86
Table 5.2  Performance of the Base Classifiers ................................... 88
Table 5.3  Characteristics of the Training Data .................................. 92

 

LIST OF APPENDICES

APPENDIX A  List of Abbreviations
APPENDIX B  Feature Sets
APPENDIX C  Supervised Learning Curves
APPENDIX D  The Flow and Processing Time of Experiments


CHAPTER 1 Introduction

1.1 Background

Understanding is tied to meaning, and understanding natural language primarily relies on its meaning. Semantics is the study of the meaning of natural language, which forms the basis of meaning-oriented linguistic analysis. Computational semantics aims to automate this analysis at different levels and can be considered the basis of meaning interpretation in natural language processing (NLP). While some approaches concentrate on deeper analysis of language meaning to represent it in a machine-executable format, others focus on the shallow structure of the meaning to provide other related tasks with this kind of knowledge. The former usually narrow their scope to specific domains to employ related domain knowledge in the representation of meaning (e.g. translating natural language queries into SQL), whereas the latter target more general domains to cover a broader range of applications. Shallow approaches are concerned with the meaning of the words forming the sentence and the relations between them, and deal with lexical knowledge and the syntax of the language. One important such relation lies between a predicate and its arguments; it forms the predicate-argument structure and is represented by the semantic roles of the arguments. For example, in Figure 1.1, the roles of "Rasul", "the dishes", and "last night" in the event of "washing" expressed by the predicate "washed" are WASHER, WASHEDTHING, and TIME, respectively.



Figure 1.1: Semantic arguments of "wash"
  [Rasul](WASHER) washed(predicate) [the dishes](WASHEDTHING) [last night](TIME).

Figure 1.2: Semantic roles consistency against syntactic variation
  [Rasul](WASHER) washed(predicate) [the dishes](WASHEDTHING) [at home](LOCATION) [last night](TIME).
  [Last night](TIME), [at home](LOCATION), [the dishes](WASHEDTHING) were washed(predicate) [by Rasul](WASHER).

Figure 1.3: Semantic roles consistency across different languages
  [Rasul](WASHER) washed(predicate) [the dishes](WASHEDTHING) [at home](LOCATION) [last night](TIME).
  [Rasul](WASHER) [dünən gecə](TIME) [evdə](LOCATION) [qabları](WASHEDTHING) yudu(predicate).

This representation is consistent across various grammatical formations of the sentence. For example, Figure 1.2 shows that modifying the structure of the sentence by changing the position of the adjuncts and the voice of the verb (predicate), which changes the syntactic subject (from "Rasul" to "the dishes") and object (from "the dishes" to "Rasul"), does not affect the roles of the arguments. Moreover, this consistency is retained when the sentence is translated into other languages. As Figure 1.3 depicts, despite the change of the words and their positions, the semantic roles of corresponding arguments are still identical in the sentence translated into Azerbaijani Turkic; corresponding phrases have the same roles in both languages. In addition, as can be observed, semantic roles determine the structure of the event expressed by the predicate, including WHO, WHAT, WHERE, WHEN, WHOM, etc. Extracting such information from a sentence is a significant step towards its meaning.


These characteristics of semantic roles make them useful in applications such as question answering, information extraction, summarization, and machine translation. To make use of semantic roles in practice, given a sentence, all the arguments of the predicates forming that sentence must be identified and assigned a role. This task is known as semantic role labeling (SRL) or shallow semantic parsing. An SRL system first identifies all arguments of a predicate in the sentence, which are sequences of words. A semantic argument corresponds to a word or a phrase in the sentence, and thus argument candidates are usually chosen from the output of a syntactic analyzer. Then, the role-bearing arguments are determined from among these candidates. This stage of semantic role labeling is known as argument identification. After identifying the arguments of the predicate, each of them is assigned a label representing its semantic role with respect to that predicate. This stage of SRL is called argument classification, since it classifies the identified arguments into classes of semantic roles. Recent successful work on applying statistical machine learning methods to NLP tasks motivated researchers to utilize these techniques in SRL. These methods automatically extract the knowledge encoded in corpora of examples through a training process and then apply this knowledge to the corresponding application. When the corpus is manually enriched with the related linguistic knowledge by human experts, it is known as a labeled or annotated corpus. Learning methods relying solely on such labeled corpora are known as supervised methods, and their output is a classifier ultimately used for classification in the underlying problem. The availability of FrameNet (Baker et al. 1998) as the first corpus annotated with predicate-argument relations was the main supporting factor in the introduction of statistical SRL by Gildea and Jurafsky (2002). The promising results of this seminal work attracted several researchers, resulting in a new area of NLP research.
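As an illustration of the two-stage process described above, the following minimal sketch outlines how an SRL system labels the arguments of one predicate. The helper objects (an identification classifier and a classification model) and their method names are hypothetical stand-ins, not the components of any particular system.

```python
# Minimal sketch of the two-stage SRL pipeline: argument identification
# followed by argument classification. Candidates are assumed to come from
# a syntactic analyzer (e.g. the constituents of a parse tree).

def label_semantic_roles(candidates, predicate, identifier, classifier):
    # Argument identification: keep only the role-bearing candidates.
    arguments = [c for c in candidates if identifier.is_argument(c, predicate)]
    # Argument classification: assign each identified argument a semantic role.
    return {argument: classifier.predict_role(argument, predicate)
            for argument in arguments}
```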


These studies identified new problems and requirements, which led to the development of new SRL corpora, the most prominent being PropBank (Palmer et al. 2005). Since the emergence of the SRL research field, the majority of work in this field has employed and studied supervised methods for learning semantic role labeling. Although supervised learning techniques have shown propitious results, they rely on hand-crafted corpora, which are costly and difficult to develop, or sometimes simply not available. On the other hand, unsupervised methods do not require labeled data but use a large amount of unlabeled data, which is usually available at much lower cost. However, applying these methods has not proven to be straightforward, especially for complicated problems. Between supervised and unsupervised learning lies the semi-supervised learning paradigm. These methods use a large amount of unlabeled data in addition to labeled data. The unlabeled data is mainly used to compensate for the scarcity of labeled data mentioned above. However, it can also be used to further improve learning performance even when a sufficient amount of labeled data is available. An interesting aspect of semi-supervised learning methods is the claim that this is the way humans learn, i.e. by interpreting previously aggregated unlabeled data upon the availability of labeled data (Zhu 2005). Self-training (Yarowsky 1995) is a semi-supervised algorithm which has been well studied in the NLP domain and has gained promising results. It first trains a base classifier on an initial training set of manually labeled data known as the seed. Then it iteratively labels unlabeled data and selects some of the newly labeled examples to be added to the current training set. This process is known as bootstrapping. Co-training (Blum & Mitchell, 1998) is another bootstrapping algorithm in the same spirit, with the difference that two or more classifiers, instead of an individual classifier, cooperate in labeling unlabeled data and extending their training sets. Each classifier is trained on a set of features derived from a specific view of the training examples.
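As an illustration, the generic bootstrapping loop behind self-training can be sketched as follows. This is a simplified outline under assumed classifier interfaces (fit, label, confidence); the parameterized algorithm actually used in this work is defined in Chapter 4.

```python
# Generic self-training: train on the labeled seed, then repeatedly label the
# unlabeled pool and move the most confidently labeled examples into the
# training set. Selection size and stopping criterion are simplified here.

def self_train(learner, seed, unlabeled, iterations=10, grow_size=100):
    train_set = list(seed)                      # (example, label) pairs
    pool = list(unlabeled)
    for _ in range(iterations):
        model = learner.fit(train_set)
        if not pool:
            break
        # Label the pool and rank the predictions by classifier confidence.
        scored = sorted(((model.confidence(x), model.label(x), i)
                         for i, x in enumerate(pool)), reverse=True)
        chosen = scored[:grow_size]
        # Bootstrapping step: extend the training set with the new labels.
        train_set.extend((pool[i], y) for _, y, i in chosen)
        chosen_idx = {i for _, _, i in chosen}
        pool = [x for i, x in enumerate(pool) if i not in chosen_idx]
    return learner.fit(train_set)
```

Co-training follows the same skeleton, except that two classifiers trained on different feature views label the pool, and each classifier's selections are used to grow the training data of the other (or a common training set shared by both).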


1.2 Problem Statement

Semi-supervised learning can utilize unlabeled data to address various problems. These problems can be classified as follows:
• Scarcity of labeled data: a sufficient amount of labeled data to train a state-of-the-art classifier is not available.
• Improvement of the state of the art: a sufficient amount of labeled data to train a state-of-the-art classifier is available, but higher performance is required.
• Domain generalizability: the available labeled data is limited to a specific domain, and unlabeled data from a different domain or a balanced set of domains is used to increase the generalizability of the classifier.
Self-training and co-training are employed in this work as semi-supervised learning algorithms to target the first and last problems described above. Although these algorithms are theoretically straightforward, they involve a large number of parameters and several possible variations, highly influenced by the specifics of the underlying task. There is no deterministic way of tuning these parameters for the task at hand (Ng & Cardie 2003). Thus, to find the best-performing variation and parameter set, or even to investigate the usefulness of these algorithms for a learning task such as SRL, thorough experimentation is required. Previously, He and Gildea (2006) and Lee et al. (2007) used these two algorithms in semi-supervised learning of SRL and identified some problems hampering the improvement gained from unlabeled data. One problem originates from the unbalanced distribution of semantic roles in a typical SRL corpus, which biases the classifier towards more frequent



roles. We propose and verify a method that selects a more balanced set of labeled examples in each self-training iteration than previous work does. Another problem related to the selection of newly labeled data is that, due to the lower quality of the initial classifiers, their labeling confidence is not a reliable measure for selection. We propose and inspect a preselection method that helps the selection process by preselecting simpler examples to be labeled in early iterations. Splitting the SRL problem into views, and extracting features for co-training such that those views satisfy the underlying assumptions, is not straightforward. The seemingly reasonable splits used by both previous works suffer from a large performance gap between the classifiers of each view. We propose a new split based on two different syntactic formalisms and further investigate the effect of this gap besides the level of separation between the views. The performance of the base classifier appears to be an influential factor in the quality of the training set extended by the bootstrapping process. We investigate whether any kind of performance improvement can benefit the process or whether other parameters are involved. Several variations of the co-training algorithm have been used in terms of the way the co-training classifiers collaborate with each other. Each of the two previous works uses a different method for this purpose: the former uses a separate training set for each classifier, and the latter uses a common training set for both classifiers. We examine both of these methods, together with different approaches to selecting newly labeled data, to determine their advantages and disadvantages.
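The selection-balancing idea mentioned above can be illustrated with a small schematic: instead of taking the globally most confident newly labeled samples, the number of samples admitted per predicted role is capped, so that frequent roles do not crowd out rare ones. The function below is only an illustration under assumed data structures and an illustrative cap value; the actual method and its parameters are specified in Chapter 4.

```python
from collections import defaultdict

def balanced_selection(scored_samples, per_role_cap=20):
    """scored_samples: (confidence, predicted_role, sample) tuples,
    assumed to be sorted by confidence in descending order."""
    taken = defaultdict(int)   # how many samples of each role selected so far
    selected = []
    for confidence, role, sample in scored_samples:
        if taken[role] < per_role_cap:   # cap keeps the label distribution balanced
            taken[role] += 1
            selected.append((sample, role))
    return selected
```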



1.3 Research Objectives

The main interest of this research is to address the problem of learning semantic role labeling with the support of unlabeled data using semi-supervised methods, namely self-training and co-training. The objectives of the research are as follows:
• To investigate the self-training and co-training algorithms, their use in learning SRL, and the related problems identified by previous research
• To design a framework for learning semantic role labeling via these algorithms based on the outcome of the investigations
• To examine the effect of the parameters and variations of these algorithms on problems associated with learning semantic role labeling via unlabeled data when only a small amount of labeled data is available and:
  - both labeled and unlabeled data are from the same domain, and the aim is to achieve state-of-the-art SRL performance (in-domain bootstrapping)
  - the labeled data is from a particular domain and the unlabeled data is from multiple domains, and the aim is to achieve a state-of-the-art SRL system robust to domain variation (out-of-domain bootstrapping)
• To test the solutions proposed by this research for some of the identified problems



1.4 Research Method

To carry out the research, an appropriate framework for SRL is chosen in terms of syntactic input, architecture, classification algorithm, and learning features to fit the bootstrapping problem. Next, the proposed variations of the self-training and co-training algorithms are implemented. The classifiers are then trained on the labeled data and bootstrapped via unlabeled data. The resulting classifiers are tested on development test data to tune and improve the system, mainly in terms of learning feature selection. The final classifiers are run on the test data to evaluate the system and the algorithms. The labeled training data is a random portion of the dataset used for the CoNLL 2005 shared task on semantic role labeling (Carreras & Marquez 2005), which has been the dominant dataset in SRL research. Two types of data are used as unlabeled training data: the rest of the CoNLL 2005 data as in-domain data and the free version of the American National Corpus (OANC) as out-of-domain unlabeled data. The test data are exactly those used in the CoNLL 2005 shared task to evaluate systems and consist of one development dataset and two test datasets, one in-domain and one out-of-domain. After labeling the test data with the trained classifiers, the three popular measures used in the SRL context are used to evaluate the labeling performance: precision, recall, and their harmonic mean F1. In addition, one important measure in evaluating bootstrapping performance is the trend according to which the performance of the bootstrapped classifiers changes (improves or degrades) in each iteration. This measure is visual and is judged from the graph rendering the trend.
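For reference, these measures are computed in the standard way over the predicted and gold argument labels:

P = \frac{\text{correctly labeled arguments}}{\text{arguments predicted by the system}}, \qquad
R = \frac{\text{correctly labeled arguments}}{\text{arguments in the gold annotation}}, \qquad
F_1 = \frac{2PR}{P + R}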



1.5 Significance of the Research

In recent decades, the growth of information has been accelerating and its amount has been multiplying. The majority of this information is recorded as written text, especially with the emergence and exponential growth of the Web. Utilizing this overload of information is undoubtedly beyond human ability, and it will be worth little more than an archive of text unless enough processing power with human-like intelligence in manipulating natural language is devised. Question answering, information extraction, summarization, machine translation, textual entailment, and coreference resolution are applications of natural language processing that target this problem. As discussed in the previous section, the information conveyed by semantic role labeling has been shown to be effective for these applications. Consequently, SRL has become an active field of NLP research, with several conferences and competitions dedicated to it. The major drawback of current semantic role labeling systems is their reliance on costly annotated corpora. Considering the wide variety of natural languages around the world and the heterogeneity of text domains for each of these languages, besides the evolving nature of vocabulary, it is apparent that the currently existing corpora are only a small fragment of what is required. The problem of domain generalization observed in previous SRL research attests to this fact. On the other hand, a vast amount of raw text is now digitally available due to the recent development of computers and the widespread use of their applications, especially the World Wide Web. Being able to effectively exploit these easily available resources in learning an important task like semantic role labeling via semi-supervised approaches would be a significant contribution to the problem of natural language understanding and, consequently, information utilization.


1.6 Structure of Thesis

The rest of this thesis is organized as follows: Chapter 2 reviews the semantic role labeling literature in terms of its components, architecture, and related problems. Chapter 3 reviews methods of learning with unlabeled data, focusing on the self-training and co-training algorithms and their use and problems in semantic role labeling. Chapter 4 describes the SRL framework employed in this work, the variations and parameters of the self-training and co-training algorithms together with the problems and proposed solutions investigated here, and the evaluation method used to analyze the results. Chapter 5 presents the experimental settings, explains the experiments, and analyzes and discusses the results; it finally compares the results with previous work. Chapter 6 summarizes and concludes the research according to the objectives defined in this chapter. It then describes the contributions of the research and suggests directions for future work.



CHAPTER 2 Semantic Role Labeling

2.1 Introduction

Semantic role labeling has been an active research field of computational linguistics in the last decade. It reveals the event structure encoded in a sentence. This information has been shown to be useful for several NLP tasks and applications such as information extraction, question answering, summarization, coreference resolution, and machine translation. Consequently, a vast body of research has been dedicated to SRL studies aiming at achieving the state of the art in this field. The promising results of these studies have been followed by several workshops and shared tasks devoted to semantic role labeling, affirming the increasing attention to this area of research. This chapter first briefly reviews the history of semantic roles, their applications, and semantic role labeling. It then investigates the SRL task from various points of view and reviews the related work focused on each of these aspects. The corpora in use, syntactic views in SRL, the architecture of SRL systems, and learning approaches in SRL are discussed in turn. The chapter is summarized at the end.

2.2 History

The introduction of semantic roles dates back to two independent works by Gruber (1965) and Fillmore (1968). Gruber proposed thematic roles such as Agent, Theme, Source, and Goal, and discussed methods for identifying the patterns carrying these roles. Fillmore introduced case grammar, which analyzes the surface structure of the


sentence represented by deep cases like Agent, Object, Benefactor, and Instrument, with their correlation to grammatical functions such as subject and object. According to case grammar, a verb selects the deep cases it can take from its case frame, and a certain case can occur only once for each verb in a sentence. Schank (1972) proposed conceptual cases for the semantics of his Conceptual Dependency theory, including OBJECTIVE, RECIPIENT, DIRECTIVE, and INSTRUMENTAL, and stated that each sense of a verb requires a set of conceptual cases that determine its conceptual structure. There has never been a consensus on a standard set of semantic roles; they vary from verb-specific roles, like WASHER and WASHEDTHING in the above examples, to very general roles consisting of PROTO-AGENT and PROTO-PATIENT or PROTO-THEME (Dowty 1991). In the middle of this range lie the thematic roles (Gruber 1965, Fillmore 1968), which include AGENT, THEME, INSTRUMENT, etc. More general roles abstract over more specific roles that carry the same semantic functionality. For example, AGENT is a generalization of WASHER, BREAKER, and CLEANER, representing an animate entity volitionally causing the event. Table 2.1 lists a set of thematic roles along with brief definitions taken from Jurafsky and Martin (2008).

Table 2.1: Commonly Used Thematic Roles
  AGENT        The volitional causer of an event
  EXPERIENCER  The experiencer of an event
  FORCE        The non-volitional causer of the event
  THEME        The participant most directly affected by an event
  RESULT       The end product of an event
  CONTENT      The proposition or content of a propositional event
  INSTRUMENT   An instrument used in an event
  BENEFICIARY  The beneficiary of an event
  SOURCE       The origin of the object of a transfer event
  GOAL         The destination of an object of a transfer event


One of the earliest studies dedicated to the assignment of thematic roles as an aspect of sentence comprehension was carried out by McClelland and Kawamoto (1986). They used a connectionist model and a set of semantic micro-features to train a model that learned the joint role of word order and semantic constraints in role assignment. A work that followed them in using a connectionist architecture for thematic role assignment was Rosa and Francozo (1999), which proposed a system called the Hybrid Thematic Role Processor (HTRP). It takes a sentence represented by the features of its words as input and identifies its predicate-argument structure, which they called the thematic grid of the sentence. Another early corpus-based approach to thematic knowledge acquisition was taken by Liu and Soo (1993). They used syntactic clues extracted from the Penn Treebank corpus (Marcus et al. 1993) and four other treebank corpora (ATIS, MARI, MUC1, and MUC2) to hypothesize thematic role assignments for arguments and, in the next stage, employed heuristics based on linguistic constraints to further resolve the ambiguity of role assignment. The earliest works that employed semantic roles as part of a text understanding mechanism were rule-based systems. KERNEL (Palmer et al. 1993) was a text understanding system that used thematic roles such as agent, theme, and instrument to specify the Lexical Conceptual Clauses and Semantic Class Restrictions for verbs. This system used mapping rules, resembling Fillmore's Thematic Hierarchy, to map syntactic functions into thematic roles. The need for a machine-readable lexical resource for English in computational corpus-based studies of language motivated the creation of the FrameNet lexical database (Baker et al. 1998). The availability of this resource provided the support and inspiration for the first study on a statistical machine learning approach to semantic role labeling by Gildea and Jurafsky (2002). This work not only initiated a new trend of


research in the field, but also reinforced the importance of predicate-argument structure in natural language processing tasks and applications. Around the same time, the development of the PropBank (proposition bank) corpus by Kingsbury and Palmer (2002) provided another foundation that current research on semantic role labeling has been built upon. PropBank explicitly labels the Penn English Treebank (Marcus et al. 1993) with semantic roles and has since been the dominant corpus used in SRL studies. These seminal works were immediately followed by other research on semantic role labeling. Gildea and Palmer (2002) studied the effect of the syntactic parse on SRL accuracy using the early release of PropBank. Surdeanu et al. (2003) designed a new method for identifying predicate-argument structure and used it for their novel information extraction system. Gildea and Hockenmaier (2003) used Combinatory Categorial Grammar (CCG) instead of Treebank-style phrase-structure parses. Chen and Rambow (2003) used a decision tree learning paradigm and analyzed deeper syntactic features based on the syntactic dependency structure resulting from the extraction of a Tree Adjoining Grammar (TAG) from the Penn Treebank. Pradhan et al. (2003; 2004) examined the use of Support Vector Machines (SVM) to improve the learning method. While these works experimented on PropBank, Fleischman et al. (2003) followed Gildea and Jurafsky (2002) in using FrameNet and employed a Maximum Entropy approach as the learning model to improve classification accuracy. Other studies on semantic role labeling that used FrameNet were O'Hara and Wiebe (2003) and Thompson et al. (2003); the former investigated the semantic role annotation of prepositional phrases in both FrameNet and the Penn Treebank, and the latter examined a generative model for learning semantic role assignment. Since these early studies, a significant body of research has been dedicated to SRL, and several evaluation tasks and workshops have been established to compare the


performance of different approaches to the field. The CoNLL 2004 and 2005 shared tasks (Carreras & Marquez 2004; 2005), the SENSEVAL-3 task on SRL (Litkowski 2004), several tasks in SemEval-2007 (Pradhan et al. 2007; Marquez et al. 2007; Baker et al. 2007; Litkowski & Hargraves 2007), and more recently the CoNLL 2008 and 2009 shared tasks (Surdeanu et al. 2008; Hajic et al. 2009) all cover semantic role labeling and the comparative evaluation of the performance of SRL systems with different targets. In addition to these studies on semantic role labeling systems, the applications of semantic roles have also attracted scholars in fields such as question answering (Narayanan & Harabagiu 2004), information extraction (Surdeanu et al. 2003; Frank et al. 2007), summarization (Melli et al. 2005), and machine translation (Boas 2002; Gimenez & Marquez 2007; 2008; Wu & Fung 2009a; 2009b), due to their characteristics described in Section 1.1 of Chapter 1.

2.3 Corpora

As with other NLP tasks based on statistical learning approaches, automatic labeling of semantic roles strongly relies on corpora. The domain coverage of the corpus, the level of annotation and its representation formalism, and the amount of example data determining the representativeness of the corpus all influence the performance of the task and its potential applications. Therefore, different corpora impose different limitations on SRL systems and affect their performance in different ways.

2.3 Corpora As other NLP tasks being based on statistical learning approaches, automatic labeling of semantic roles strongly relies on corpora. The domain coverage of the corpus, the level of annotation and its representation formalism, and the extent of example data determining the representativeness of the corpus all influence the performance of the task and its potential applications. Therefore, different corpora would impose different limitations on SRL systems and affect their performance in different ways.

2.3.1

FrameNet

FrameNet (Baker et al. 1998) consists of a database of semantic frames and a corpus of example sentences annotated based on these semantic frames. Each semantic frame is related to a concept (e.g. REMOVING) and describes the structures in which that concept appears in language. These structures in fact represent the predicate-argument 15

Chapter 2

Semantic Role Labeling

structure of the predicates that their meaning is associated with the concept underlying the frame (e.g. "wash" in relation to REMOVING). Each of these predicates (verbal or nominal) pairing with each of its senses that invoke a related frame is called a lexical unit. The description inside a frame includes possible semantic roles for arguments of these lexical units known as frame elements (e.g. Agent). Frame elements of a frame are divided into core and non-core groups based on their centrality to the frame. The last release (Version 1.3) of FrameNet contains about 960 frames and 11,600 lexical units. Figure 2.1 shows a sample frame including some of its frame elements and lexical units. FrameNet annotates sentences from British National Corpus (BNC) with frame elements as semantic roles. BNC is a collection of written and spoken British English consisting of 100 million words, which covers a balanced range of domains. It is segmented into sentences and annotated with part of speech (POS) tags of phrases. In addition, in FrameNet II, more data from LDC North American Newswire corpora in American English is added into the corpus. For each FrameNet frame, first a number of these sentences containing a lexical unit associated with that frame are selected. Then the arguments of the predicate (lexical unit) and their boundary in the sentence are manually identified, and each argument labeled with the appropriate frame element, which is in fact its frame-specific semantic role in relation to the predicate. The current

Figure 2.1: The REMOVING frame
  Frame Elements (Core): Agent, Cause, Source, Theme
  Frame Elements (Non-Core): Cotheme, Degree, Distance, Goal, Manner, Means, ...
  Lexical Units: clear.v, confiscate.v, discard.v, dislodge.v, drain.v, dust.v, eject.v, ejection.n, removal.n, remove.v, roust.v, skim.v, snatch.v, take.v, wash.v, ...


release of FrameNet (Version 1.3) contains about 150,000 annotations. In addition to Gildea and Jurafsky (2002) and the other initial studies employing FrameNet noted above, some other researchers have found it more appropriate for particular facets of their experiments. FrameNet's support for nominal predicates enables an SRL system to label the argument structure of such predicates in addition to verb predicates (Pradhan 2006). Moreover, the hierarchical organization of semantic frames and frame elements in FrameNet promises more generalization capability across its frames and roles (Matsubayashi et al. 2009). The SENSEVAL-3 (Litkowski 2004) and SemEval-2007 (Baker et al. 2007) shared tasks on SRL aimed to demonstrate the advantages of FrameNet and encourage its use in similar research, arguing that SRL results based on FrameNet have more potential for use in the formal representation of language.

2.3.2

PropBank

The proposition bank, PropBank (Palmer et al. 2005), annotates the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al. 1993) with predicate-argument structure. PropBank views each sentence as a number of propositions constructing that sentence. It only considers propositions expressed by verb predicates, ignoring copula verbs. In annotating sentences, PropBank uses the Penn Treebank syntactic nodes (constituents) as semantic argument boundaries. Based on the VerbNet (Kipper 2005) categorization of verbs, it tries to retain consistency in the annotation of verbs from the same classes as a measure of semantic similarity. PropBank uses two groups of labels for semantic roles. The first group consists of six core role labels numbered from 0 to 5 (Arg0-Arg5) and an extra label, ArgA (for the agent of an induced action). Arg0 and Arg1 usually label the Agent and the Patient/Theme respectively. However, the other numbered labels carry different roles for different


Table 2.2: List of Adjunctive Argument Labels in PropBank
  Argument Label   Description               Example
  AM-LOC           location                  at home
  AM-EXT           extent                    at $1000
  AM-DIS           discourse connectives     in addition
  AM-ADV           general-purpose           maybe
  AM-NEG           negation marker           never
  AM-MOD           modal verb                would
  AM-CAU           cause                     because of the hurricane
  AM-TMP           time                      last night
  AM-PNC           purpose                   for efficiency
  AM-MNR           manner                    strongly
  AM-DIR           direction                 to the left
  AM-REC           reciprocals               together
  AM-PRD           predication               to one lane

verbs. The idea behind using numbers can be attributed to the difficulty of defining general role labels for all possible verbs while at the same time retaining a coherent representation across different predicates (Carreras & Marquez 2004). The second group comprises general adjunct-like argument labels, AMs, which are consistent across verbs and specify the function tag of the arguments ("M" is replaced by the proper tag), such as location, extent, cause, etc. A list of adjunct labels is shown in Table 2.2. An example annotation for core and adjunctive arguments was given in Figure 1.4. The predicate-argument structure of a verb varies with its senses; thus the arguments of each sense may take a different set of semantic roles, which is called a roleset. PropBank represents each of these rolesets in a Frameset. Every Frameset is accompanied by a number of annotated examples illustrating syntactic variations of the associated verb usage. All the Framesets of a verb are collected in a Frame File. Figure 2.2 shows an example Frameset for one usage of the verb wash.


Figure 2.2: Frameset for "wash"
  Frameset: wash.01, (make) become clean
  Roles:
    Arg0: agent of washing
    Arg1: thing being washed, dirt
    Arg2: washing liquid
    Arg3: source
    Arg4: end state of Arg1
  Example (with liquid): [John](Arg0) washes(Rel) [his hair](Arg1) [in cheap beer](Arg2).

annotation (Marquez et al. 2008). However, PropBank is limited to a specific genre of text, newswire, the genre of the WSJ, which causes problems when crossing to another text domain. Although Arg2 to Arg5 do not often share a common representation across verbs, Arg0 and Arg1 and the adjunctive argument roles are more generalizable than the situation-specific frame elements of FrameNet. These characteristics have inspired the major body of research on statistical SRL to use PropBank as the underlying corpus, and the focus of the CoNLL shared tasks on SRL (Carreras & Marquez 2004; 2005; Surdeanu et al. 2008; Hajic et al. 2009) has been on this corpus.

2.3.3

Other Corpora

In addition to the corpora introduced in previous sections, there are other corpora annotating the predicate-argument structures with different characteristics. • Nombank (Meyers et al. 2004) aims to annotate the predicate-argument

structure of instances of about 5000 common nouns in the Penn Treebank II corpus consistent with PropBank annotation of verbs in the same corpus. It was



used in CoNLL 2008 and 2009 shared tasks (Surdeanu et al. 2008; Hajic et al. 2009) together with PropBank for SRL of nominal predicates. • Brown (Kucera & Francis, 1967) is the Standard Corpus of American English

covering a wider range of English text genres. It has recently been annotated with syntactic trees as part of Penn Treebank III and with predicate-argument structure as part of the PropBank project. A portion of it (426 sentences) was used by the CoNLL 2005 shared task for the out-of-domain test of SRL systems and has since been adopted by most studies for this purpose. • VerbNet (Kipper 2005) is an English verb lexicon, which uses the Levin

(1993) verb classes to build its lexicon entries. Although it does not provide any annotated text, it is mapped to other annotated corpora such as FrameNet and PropBank by SemLink (Loper et al. 2007). The results of task 17 of SemEval-2007 (Pradhan et al. 2007) showed that labeling with VerbNet roles could be as accurate as labeling with PropBank, where VerbNet annotation was more informative. • OntoNotes (Hovy et al. 2006) is a corpus annotating a wide variety of text genres with syntactic structure, predicate-argument structure, and word senses, which covers three languages: English, Chinese, and Arabic. It is built upon Penn Treebank and PropBank in syntax and semantics respectively and provides the word sense disambiguation for nouns and verbs by linking each sense to an ontology and coreference. For other languages, Chinese PropBank (Xue & Palmer 2003) and Nombank (Xue 2006) and Korean PropBank (Palmer et al. 2006) are Chinese and Korean versions of PropBank and Nombank. AnCora (Aparicio et al. 2008) is a corpus of Catalan and Spanish, annotating various linguistic information including semantic argument 20


structure and thematic roles. Moreover, the CoNLL-2009 shared task on multilingual joint syntactic dependency parsing and semantic role labeling chose several corpora from seven languages (English, Chinese, Catalan, Spanish, German, Czech, and Japanese) and adapted them for use in the SRL subtask.

2.4 Syntax in SRL

Semantic role labeling is built upon syntax. The usage of syntax in SRL is twofold: it is used to determine the boundaries of argument candidates when they are selected as learning samples, and it is used to extract learning features for argument identification, argument classification, and other stages of SRL. Due to this strong dependency, the quality of the syntactic input strongly influences the output quality of semantic role labeling. Several studies have explored this effect and confirmed that the major source of performance drop in SRL is erroneous syntactic input. Gildea and Palmer (2002) investigated the effect of the hand-annotated (gold-standard) syntactic parses available in PropBank on SRL by comparing them with automatic parses. The results show that the lower accuracy of automatic parses directly affects the result of semantic role labeling. Various syntactic formalisms have been used and verified for semantic role labeling, including constituency syntax, dependency syntax, Combinatory Categorial Grammar (CCG), and Tree Adjoining Grammar (TAG). The following sections describe constituency and dependency as the two main formalisms used in SRL.

2.4.1

Constituency Syntax

Constituency syntax represents the structure of a sentence in terms of constituents, where each constituent is a phrase comprised of a sequence of words as they appear in the sentence. This representation can be in the form of a flat sequence of those constituents


(phrase chunks) or in the form of a tree whose nodes are those constituents, each of which, in turn, consists of smaller constituents. The former is known as a partial or shallow parse and is derived by a partial parser or phrase chunker (Carreras & Marquez 2003). The latter is known as a full parse and is derived by a full parser (Charniak 2001). Constituency syntax is the most widely used formalism in semantic role labeling. The SRL framework using this syntactic representation is known as the constituent-based framework, since semantic arguments are assigned to syntactic constituents, although there may not be a one-to-one mapping between an argument and a constituent. The PropBank corpus is built upon the full parse trees of the Penn Treebank corpus. It assumes that there is a one-to-one relation between a semantic argument and a syntactic constituent (node) in the parse tree and assigns role labels to argument-bearing constituents. Figure 2.3 shows the parse tree for a sentence in the PropBank corpus labeled with the semantic roles of the arguments of the predicate sold. When partial parses are used, semantic arguments are assigned to non-overlapping sequences of phrase chunks. Figure 2.4 depicts the same sentence as Figure 2.3, parsed by the UPC chunker (Carreras & Marquez 2003) and annotated with semantic roles. As can be seen, there is no one-to-one relation between phrases and arguments. Attempts have been made to compare these two representations in terms of their effect on SRL performance. The idea behind using full parse trees is that they encode more information than partial parses, which can be more useful for SRL. On the other hand, partial parses seem to be less harmful to SRL performance, given the fact that state-of-the-art partial parsers or chunkers are more accurate than state-of-the-art full parsers.



Figure 2.3: The parse tree of a PropBank sentence annotated with semantic roles
  [The luxury auto maker](Arg0) [last year](AM-TMP) sold(predicate) [1,214 cars](Arg1) [in the U.S.](AM-LOC)

Figure 2.4: The partial parse of a PropBank sentence annotated with semantic roles
  [The luxury auto maker](NP: Arg0) [last year](NP: AM-TMP) [sold](VP: predicate) [1,214 cars](NP: Arg1) [in](PP) [the U.S.](NP) — the last two chunks together form AM-LOC

Despite the considerable difference between the performance of best systems in CoNLL 2004 shared task (Carreras & Marquez 2004), where all systems used only partial parses, and CoNLL 2005 shared tasks (Carreras & Marquez 2005), where full parse trees were used but with much larger training corpus, Surdeanu et al. (2007) show that when using the same amount of data for both full and partial syntax-based experiments, the difference is much less. They conclude that a comparable performance can be achieved using only partial parses. Moreover, experiments reveal that full and partial information affect the semantic role labeling differently. In the comparative study by Punyakanok et al. (2008), information in full syntactic trees was found more crucial in their heuristic-based pruning stage but not significant in argument classification.

2.4.2

Dependency Syntax

Dependency syntax represents the syntactic dependency relations between the words in a sentence. A dependency is a head-dependent link, in which a word acting as the head


Figure 2.5: The dependency parse tree of a sentence annotated with semantic roles
  The luxury auto maker last year sold 1,214 cars in the U.S.
  (arcs into the root verb "sold" carry labels such as SBJ/Arg0, ADV/AM-TMP, OBJ/Arg1, and ADV/AM-LOC; the remaining arcs are NMOD/PMOD modifier links)

is modified by another word as dependent. For example, in the sentence "The luxury auto maker last year sold 1,214 cars in the U.S.", the word "year" is an adverbial modifier of the verb "sold". Thus "year" is the dependent here and "sold" is its head. A dependency parser produces the dependency tree of the input sentence. The tree is in fact a directed acyclic graph: the nodes are words, as heads or dependents, and the arcs are the types of dependencies between nodes. The main verb of the sentence is considered the root node. Abney (1989) argues that dependency parsing resembles the parser in the human mind. Dependency structure seems to be closer to the semantic relations inside the sentence. In addition, it is able to capture long-distance dependencies in the sentence, which is useful in semantic role labeling. The utility of this formalism in SRL was affirmed by Johansson and Nugues (2007), who observed that exploiting syntactic dependencies could improve the performance of semantic role labeling. The SRL framework using dependency syntax is known as the dependency-based framework. Unlike constituent-based SRL, in which semantic roles are assigned to constituents, in this framework labels are assigned to words or, more precisely, to dependency relations (Hacioglu 2004). This granularity is usually enough for some SRL-based applications such as information extraction. Therefore, considering the efficiency of dependency parsers (Nivre et al. 2007), dependency-based semantic representation tends to better suit such applications. These characteristics have recently raised the dependency formalism as a rival to the traditional, well-studied constituent-based

Chapter 2

Semantic Role Labeling

These characteristics have recently raised the dependency formalism as a rival to the traditional, well-studied constituent-based formalism for SRL. Figure 2.5 shows a dependency parse tree annotated with the semantic roles of the predicate sold, derived by MaltParser (Nivre et al. 2007).

There is no hand-crafted English corpus for dependency syntax; it is automatically derived from constituent-based corpora such as the Penn Treebank using conversion algorithms, and the semantic annotation is projected from PropBank. In a general sense, the transformation is done by locating the head word of each constituent and setting the other words as its dependents, with dependency relations extracted from the information encoded in the original tree (Magerman 1994; Johansson & Nugues 2007). In the first SRL system using a dependency representation, Hacioglu (2004) observed that, although this approach outperformed the traditional approach, projecting the annotation result back to the PropBank data set significantly degraded the performance. The reason was the mismatch (about 8%) between the PropBank constituency, on which argument boundaries are based, and the converted dependency treebank structure. The use of dependency syntax in semantic role labeling attracted more attention when two consecutive shared tasks, CoNLL 2008 and 2009 (Surdeanu et al. 2008; Hajic et al. 2009), were dedicated to dependency-based semantic role labeling. The aim was to identify the effect of this formalism on the performance of SRL and to initiate further studies toward answering the question of which formalism is more appropriate for the task. Johansson (2008) shows that the dependency-based approach performs somewhat better than the constituent-based approach. More investigation is required to better exploit both formalisms, especially considering the absence of an independently annotated corpus for dependency syntax and its consequent problems explained earlier.
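
To make this representation concrete, the following minimal sketch (a toy encoding written for illustration only, not the data format used in this work) stores the parse of Figure 2.5 as head-dependent-relation triples and attaches a semantic role to each dependent of the predicate "sold":

# Sentence: "The luxury auto maker last year sold 1,214 cars in the U.S."
words = ["The", "luxury", "auto", "maker", "last", "year",
         "sold", "1,214", "cars", "in", "the", "U.S."]

# A dependency parse as (dependent_index, head_index, relation) triples;
# the main verb "sold" (index 6) is the root. The NMOD/PMOD links of the
# remaining words are omitted for brevity.
dependencies = [
    (3, 6, "SBJ"),    # maker <- sold
    (5, 6, "ADV"),    # year  <- sold
    (6, -1, "ROOT"),  # sold
    (8, 6, "OBJ"),    # cars  <- sold
    (9, 6, "ADV"),    # in    <- sold
]

# In dependency-based SRL a role labels the relation between the predicate
# and one of its dependents, rather than a phrase span.
roles = {3: "Arg0", 5: "AM-TMP", 8: "Arg1", 9: "AM-LOC"}

for dep, head, rel in dependencies:
    if head == 6:
        print(f"{words[dep]:>6} --{rel}--> sold : {roles[dep]}")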


2.5 Architecture of SRL Systems

From a practical point of view, an SRL system consists of a pipeline of steps to identify all the propositions in a sentence and label them with their predicate-argument structure. First, all of the predicates are determined. Next, for each predicate, its arguments are identified (argument identification). Finally, each of these arguments is assigned the semantic role it fills in relation to the predicate (argument classification or labeling). However, there have been variations of this methodology in the literature, such as ignoring, adding, or joining some steps. In this section, the different stages of the common SRL architecture are reviewed along with architectural variations.
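
The flow of this pipeline can be sketched as follows. The component functions are passed in as parameters because they are hypothetical stand-ins for the concrete stages discussed in the rest of this section, not an actual implementation:

def srl_pipeline(sentence, identify_predicates, prune, is_argument, classify):
    """Schematic SRL pipeline: predicate identification, pruning,
    argument identification, and argument classification."""
    propositions = []
    for predicate in identify_predicates(sentence):            # section 2.5.1
        candidates = prune(sentence, predicate)                 # section 2.5.3
        arguments = [c for c in candidates
                     if is_argument(c, predicate)]              # section 2.5.2
        labeled = {arg: classify(arg, predicate)                # section 2.5.4
                   for arg in arguments}
        propositions.append((predicate, labeled))
    return propositions

# Toy usage with stub components, just to show the data flow.
sent = "The luxury auto maker last year sold 1,214 cars in the U.S.".split()
result = srl_pipeline(
    sent,
    identify_predicates=lambda s: ["sold"],
    prune=lambda s, p: ["The luxury auto maker", "last year",
                        "1,214 cars", "in the U.S."],
    is_argument=lambda c, p: True,
    classify=lambda c, p: {"The luxury auto maker": "Arg0", "last year": "AM-TMP",
                           "1,214 cars": "Arg1", "in the U.S.": "AM-LOC"}[c],
)
print(result)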

2.5.1 Predicate Identification

Most of the studies have considered the predicates as given, with or without their senses, but some have included this step fully in the process. For example, participants of the CoNLL-2005 shared task (Carreras & Marquez 2005) were given the predicates as target verbs, but their senses were present only in the training and development data. In the CoNLL-2009 shared task (Hajic et al. 2009) the predicates were provided as input in the test data, but the disambiguation of their senses was done by the participants. On the other hand, in the shared task of CoNLL-2008 (Surdeanu et al. 2008), systems had to identify both noun and verb predicates in the input sentences and assign the appropriate PropBank or NomBank rolesets by recognizing their senses. When the part-of-speech (POS) tagging of the sentence is given as input, distinguishing verb predicates is not a difficult task, though it is influenced by the quality of the tagging. However, since not all nouns in a sentence are predicates, identifying noun predicates is not straightforward. In both cases, the disambiguation of the sense is non-trivial. This can impose an additional difficulty on semantic role
labeling systems that include a predicate identification phase, especially for nouns, thus degrading their performance.

2.5.2 Argument Identification

Given a sentence and a predicate inside it, to identify the semantic arguments of that predicate, it should first be decided at which token granularity the sentence is processed to extract argument samples. This depends largely on the role labeling scheme of the underlying corpus. In constituent-based SRL, which relies on the PropBank labeling scheme, semantic arguments are assigned to parse tree constituents or phrase chunks. The granularities used for this SRL framework and the corresponding methods of argument identification are described as follows.

• Constituent-by-constituent: Samples are selected from parse tree constituents, with the assumption that there is a one-to-one relationship between arguments and constituents. Argument identification in this case is formulated as a binary classification problem, where the classifier identifies a sample as argument or non-argument (Gildea & Jurafsky 2002). When using automatic parses, the resulting constituents do not completely match PropBank constituents. For example, there is only about a 91% match with Charniak parses (Surdeanu et al. 2007). This mismatch affects the identification output and is a disadvantage of this method.

• Phrase-by-phrase: Samples are selected from phrase chunks (or clauses) produced by a partial parser. However, since there is not always a one-to-one relationship between phrase chunks and arguments, each sample is treated as a part of an argument. Therefore, argument identification is formulated as an IOB tagging problem in which the sample is assigned I if it is inside, O if it is
outside, and B if it is at the beginning of the argument boundary (Hacioglu et al. 2004). The same mismatch problem exists with automatic chunkers, but it is less severe than with full parsers. For example, there is about a 95.5% match between PropBank constituents and CoNLL chunks (Surdeanu et al. 2007).

• Word-by-word: Samples are selected from the individual words of the sentence without the need for a syntactic parse. The identification method is the same as that of the phrase-by-phrase tokenization (Punyakanok et al. 2004). Although this method is devoid of chunker errors, it is computationally more costly, since there are more words than phrase chunks in a sentence.

As explained in section 2.4.2, in dependency-based SRL, semantic arguments are assigned to words or dependency relations. Therefore, the tokenization granularity can be considered relation-by-relation (Hacioglu 2004). Since a one-to-one relation exists between a relation and an argument, argument identification is formulated as a binary classification, as in the constituent-by-constituent approach. When the dependency-based corpus is created by conversion from a constituent-based corpus produced from automatic parses, the same mismatch problem described above exists.

One problem in argument identification, especially with the constituent-by-constituent approach, is that the number of constituents in a parse tree that are not arguments is much larger than the number of those mapping to a semantic argument. For example, about 90% of the full parse tree constituents are not arguments (Pradhan 2006). This significant imbalance between non-arguments (negative samples) and arguments (positive samples) substantially degrades the performance of the classification algorithm as well as its efficiency. To overcome this problem, most researchers have successfully implemented a pruning step to filter out the obvious non-argument
constituents and lighten the load of the argument identification task. The next section describes this step.
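
To make the constituent-by-constituent formulation and this imbalance concrete, the toy sketch below (hypothetical spans encoded as word-index intervals) labels each constituent of a parse tree as a positive or negative argument sample for a predicate; in realistic trees the negatives outnumber the positives far more heavily than in this small example:

# Constituents and gold arguments as (start, end) word-index spans (end exclusive).
constituents = [(0, 12), (0, 4), (0, 1), (1, 4), (4, 6), (6, 12),
                (7, 9), (9, 12), (10, 12)]
gold_arguments = {(0, 4): "Arg0", (4, 6): "AM-TMP", (7, 9): "Arg1", (9, 12): "AM-LOC"}

# Binary samples for argument identification: 1 if the constituent is an argument.
samples = [(span, 1 if span in gold_arguments else 0) for span in constituents]

positives = sum(label for _, label in samples)
print(f"{positives} positive vs. {len(samples) - positives} negative samples")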

2.5.3 Pruning

Pruning algorithms are usually based on heuristics which, given the parse tree of the sentence, traverse it and collect the most appropriate nodes as candidate arguments. Xue and Palmer (2004) proposed an algorithm that, starting from the predicate node itself and iterating up to the root, considers all the sisters of the current node as argument candidates except those in coordination with that node. This strategy has since been used by many other researchers, as it considerably reduces the training time of the learning algorithm (Punyakanok et al. 2008). An alternative pruning strategy was employed by Pradhan (2006). He pruned the highly probable non-arguments by classifying the constituents into NULL and non-NULL and then pruning away the NULL-class instances with probabilities higher than 0.9. The remaining candidates were then used as input to a combined identification and classification stage, where non-arguments are again classified as NULL (see section 2.5.5). He named this method soft pruning, and it was also used by Johansson and Nugues (2005). In spite of its usefulness in improving learning efficiency, incorrectly pruning away some positive arguments, due to the heuristic nature of the algorithms, negatively affects the performance of the SRL system (Pradhan 2006; Xue 2008). Due to this observation, Pradhan (2006) used the pruning stage only in the training of the classifiers, to decrease the training time.

In addition to constituent-based SRL, pruning has been adopted by dependency-based SRL systems. Ciaramita et al. (2008) perform a filtering step on each sentence token
to validate it as a candidate argument, using the length of the dependency paths between the argument and other constituents such as the predicate or common ancestors.
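
The spirit of the Xue and Palmer (2004) heuristic can be illustrated with the following sketch over a toy constituency tree (a simplification: the coordination check and other special cases are omitted, and the tree encoding is hypothetical):

class Node:
    def __init__(self, label, children=None, parent=None):
        self.label, self.children, self.parent = label, children or [], parent
        for child in self.children:
            child.parent = self

def xue_palmer_candidates(predicate_node):
    """Collect the sisters of the predicate node and of each of its ancestors,
    walking up to the root (coordination cases omitted for brevity)."""
    candidates = []
    node = predicate_node
    while node.parent is not None:
        for sister in node.parent.children:
            if sister is not node:
                candidates.append(sister)
        node = node.parent
    return candidates

# Toy tree for "The luxury auto maker last year sold 1,214 cars in the U.S."
vbd = Node("VBD(sold)")
vp = Node("VP", [vbd, Node("NP(1,214 cars)"), Node("PP(in the U.S.)")])
s = Node("S", [Node("NP(The luxury auto maker)"), Node("NP(last year)"), vp])

print([c.label for c in xue_palmer_candidates(vbd)])
# -> ['NP(1,214 cars)', 'PP(in the U.S.)', 'NP(The luxury auto maker)', 'NP(last year)']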

2.5.4 Argument Classification

After identifying the arguments, or more precisely the argument boundaries, each of them must be assigned the semantic role it fills in relation to the predicate. This task is formulated as a multiclass classification problem, with the classes being the semantic role labels. Some machine learning methods, like support vector machines (SVM), treat it as several binary classification problems, one for each argument role, such as in the one-vs-all scheme (Pradhan et al. 2008), whereas other methods, like Maximum Entropy (ME), train a single multiclass classifier for all roles (Toutanova et al. 2008).
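
The two strategies can be illustrated briefly with scikit-learn, used here purely as an example toolkit rather than the setup of this work: an SVM is wrapped in a one-vs-all scheme, whereas a maximum entropy (logistic regression) model covers all role labels in a single classifier.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy argument samples encoded as feature dictionaries (see section 2.6.2).
X = [{"phrase_type": "NP", "position": "before"},
     {"phrase_type": "NP", "position": "after"},
     {"phrase_type": "PP", "position": "after"},
     {"phrase_type": "NP", "position": "before"}]
y = ["Arg0", "Arg1", "AM-LOC", "Arg0"]

# SVM: one binary classifier per role, combined in a one-vs-all scheme.
ova_svm = make_pipeline(DictVectorizer(), OneVsRestClassifier(LinearSVC()))
ova_svm.fit(X, y)

# Maximum entropy (logistic regression): a single model over all roles.
maxent = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(X, y)

print(ova_svm.predict(X[:1]), maxent.predict(X[:1]))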

2.5.5 Combined Argument Identification and Classification

An alternative to performing argument identification and classification separately is to combine these steps by adding an extra class for non-arguments to the set of target classes. In other words, the classifier assigns a label (e.g. NULL) to the samples that it does not identify as arguments, and assigns a semantic role label to the other samples. The approach taken by Pradhan et al. (2005), as explained in the previous section, is an example of this strategy. In addition, Liu et al. (2005) used this architecture by considering syntactic constituents as argument candidates without any pruning or identification. They cited the speed and efficiency of their maximum entropy classifier, regardless of the number of classes, as their motivation for choosing this approach. In dependency-based SRL, Riedel and Meza-Ruiz (2008) combined all three stages of predicate identification, argument identification, and argument classification into one by developing a global probabilistic model of SRL using Markov Logic. They compared
the results of this method and the isolated-stages method and reported a significant improvement with the former approach, but in twice the time. It is worth noting that their best-performing architecture employed layers of inference and global scoring, but it is not stated whether those were also used in the isolated-stages architecture. Nevertheless, there are reasons to separate the identification and classification subtasks, as was the case for the top-performing CoNLL 2005 systems (Koomen et al. 2005; Haghighi et al. 2005). One rationale is the computational cost of classifying the large number of potential arguments with the additional class, especially when no pruning is performed. Furthermore, studies show that different learning features have different impacts on each of these subtasks; for example, identification is more affected by syntactic features while classification is more affected by semantic features (Pradhan et al. 2008). In other words, using different sets of features tuned specifically for each subtask can yield better performance.

2.5.6 Global Optimization

The preliminary approach to classifying the arguments of a predicate considers them independently of each other and assigns role labels to arguments one by one, regardless of what is assigned to the others. The outcome of this local classification may violate some linguistic constraints on the structure of the arguments of a predicate. For example, arguments cannot overlap with each other or with the predicate, and a core argument cannot be duplicated for a predicate except in the case of conjunctions (Punyakanok et al. 2008).

To alleviate this shortcoming and to further exploit the relationships between arguments, global optimization of the labeling results has been proposed in the literature. These approaches jointly score all arguments of a predicate using various
methods, such as formulating it as a constraint satisfaction problem, inference, and re-ranking. Punyakanok et al. (2008) use an inference procedure to find the best global role assignment from among the results of the local classification, subject to some linguistic constraints such as those mentioned above. They formulate the inference procedure as a constraint optimization problem to be solved by integer linear programming. Their analysis of the effects of this global strategy shows a performance gain over the original approach. Toutanova et al. (2008) exploit global features derived from the inherent dependency among the arguments of a predicate to learn a log-linear re-ranking model. This model takes the top n most likely joint (global) role assignments, which are generated by a dynamic programming algorithm from the output of local classification, and selects the best assignment among them. They report a substantial performance improvement when comparing their global and local classification results. Several other studies on SRL with global consideration of semantic role assignments have been carried out using a variety of algorithms (Moschitti et al. 2008; Sun et al. 2009). While some of the systems employing these approaches report only trivial gains over the original architecture (Ciaramita et al. 2008), the experiments show that most of them obtain positive results regardless of the algorithm (Surdeanu et al. 2007).
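
The effect of such constraints can be seen in a small greedy re-scoring sketch, which stands in for the full ILP inference cited above; the spans and scores below are hypothetical local classifier outputs:

def constrained_decode(candidates):
    """Greedily select the highest-scoring role assignments that do not
    overlap and do not duplicate a core role (a stand-in for full ILP inference)."""
    selected, used_core = [], set()
    for span, role, score in sorted(candidates, key=lambda c: -c[2]):
        overlaps = any(span[0] < s_end and s_start < span[1]
                       for (s_start, s_end), _, _ in selected)
        duplicate_core = role.startswith("Arg") and role in used_core
        if not overlaps and not duplicate_core:
            selected.append((span, role, score))
            if role.startswith("Arg"):
                used_core.add(role)
    return sorted(selected)

# Local classifier output: (word-index span, role, confidence).
local = [((0, 4), "Arg0", 0.9), ((1, 4), "Arg0", 0.6),   # overlapping Arg0 candidates
         ((4, 6), "AM-TMP", 0.8), ((7, 9), "Arg1", 0.85),
         ((9, 12), "Arg1", 0.4), ((9, 12), "AM-LOC", 0.35)]
print(constrained_decode(local))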

2.5.7 System Integration

Semantic role labeling is a complex task composed of various subtasks, each of which can employ different methods. Accordingly, an SRL approach or system may perform better than others from a specific point of view while suffering disadvantages in other aspects. For example, a system may be robust to out-of-domain data but brittle against syntactic parse errors. Or, a system may perform well in
identifying arguments but suffer from low argument classification accuracy. It seems that by integrating such systems, the strengths of one system could cover the weaknesses of the others, leading to higher performance gains. The evaluations in the CoNLL 2005 shared task (Carreras & Marquez 2005) indicated a positive result for such an integration strategy, as all of the top-ranked systems used some sort of combination.

Surdeanu et al. (2007) performed a thorough study of combining the output of several independent SRL systems. In their methods, the outputs of three individual SRL systems, each exploiting features extracted from a different syntactic input (partial or full), are collected in a candidate argument pool. These candidates are then re-scored by a learning procedure using local or global features. Finally, the best solution is selected from among them through an inference procedure based on some linguistic constraints. In addition to their own SRL systems, they experimented with the top 10 SRL systems of the CoNLL 2005 shared task. They found that in all of these combination strategies, regardless of the advantages and drawbacks of each, the combined system performed better than the individual systems.

Punyakanok et al. (2008) studied the integration of two SRL systems, one using the syntactic parse trees produced by the Collins (1999) parser and the other using those of the Charniak (2001) parser. They performed a joint inference using integer linear programming, the same approach they used for global inference in the individual systems, on the equally weighted outputs of these two systems and reported a significant improvement over each of those systems.

Finally, Tsai et al. (2005) integrated two SRL systems, each using a different learning framework, SVM (Support Vector Machines) and ME (Maximum Entropy). They used a weighted summation of the argument classification probabilities
produced by these two classifiers as the objective coefficients of their ILP inference and gained an improvement over the individual systems.

2.6 Learning SRL

Chapter 1 introduced supervised learning and discussed that most SRL studies have formulated the task as a supervised classification problem in order to utilize existing hand-crafted corpora like FrameNet or PropBank. The aim of supervised classification in semantic role labeling is to learn, as accurately as possible, a function that maps training examples to their semantic role labels. This function is then used to classify each test example as a semantic argument with the role it fills. The training and test examples are argument samples encoded as features. The following sections describe, in turn, the supervised classification algorithms and the learning features used in the SRL context.

2.6.1 Classification Algorithms

Semantic role labeling studies either investigate the application of a specific supervised classification method to the SRL problem, or simply use one of these methods as part of the system to explore other dimensions of the problem. A variety of supervised classification algorithms have been used, among which Support Vector Machines (SVM) and Maximum Entropy (ME) appear to attract more attention than others. These two classification algorithms are also seen as the learning component of the top-performing systems in the CoNLL competitions (Carreras & Marquez 2004; 2005; Surdeanu et al. 2008; Hajic et al. 2009). However, these reports reveal that the choice of algorithm alone is not the dominant factor, since similar algorithms also appear at the bottom of the rankings.

Support vector machines are binary classifiers; thus, in order to be used for learning SRL, which is a multiclass classification problem, the problem must be decomposed
into several binary classifiers. One approach is to train a set of one-vs-all (OVA) classifiers, each of which learns to distinguish one role from all others, and then combine the results. SVMs are computationally expensive, especially when the amount of data is large. Pradhan et al. (2005) report a significant improvement in SRL performance achieved by using support vector machines. Maximum entropy, on the other hand, is inherently a multiclass classifier. Another feature of this algorithm is that it outputs the probability with which each class is predicted for a given instance. In addition, it is more efficient in terms of training and classification time (Liu et al. 2005). Fleischman et al. (2003) were the first to use maximum entropy for SRL, and they credited this algorithm as an effective factor in their improvement over previous work. Other classifiers found in the SRL literature are SNoW (Sparse Network of Winnows), decision trees, and the perceptron. It is worth noting that SNoW was used by the best-performing system (Koomen et al. 2005) in the CoNLL 2005 shared task.

2.6.2 Learning Features

Feature selection and design is a crucial part of every statistical learning approach. In fact, features encode the knowledge in the problem domain, derived from the underlying examples, as input to the learning algorithm. Considering a decision list as an example learning method (classifier), a decision rule is formed by a conjunction of feature-value pairs as the antecedent and the target class as the consequent. For example, in the context of semantic role labeling, a rule can be as follows:

(Phrase Type = NP) and (Position = before verb) => Arg0


Here, Phrase Type and Position are two typical features, and "NP" and "before verb" are their respective values. The rule is derived from training examples and classifies test samples (argument candidates) whose phrase type is NP and whose position is before the verb as Arg0. This is only a simplified example; in practice, the learning and classification process is not as simple.

Several works have been dedicated to the investigation of feature selection and its relation to other aspects of SRL. Xue and Palmer (2004) argued that the same set of features does not have an identical effect on the identification and classification stages. They found, for example, that the tree path between the candidate and the predicate is a very informative feature for identification, but not very important for classification. This observation was confirmed by others such as Pradhan (2006), who thoroughly studied the features and their salience for various aspects of the SRL problem.

The baseline set of machine learning features for semantic role labeling was introduced by the seminal work of Gildea and Jurafsky (2002). These features can be classified as those representing the structure and context of the argument candidate, those representing the structure and context of the predicate, and those capturing the structure between the candidate and the predicate. Following the same view of the problem, other researchers have experimented with novel features to cover the weaknesses of these preliminary features or to provide the learning algorithm with more effective knowledge (Surdeanu et al. 2003). Regardless of the context from which features are extracted, they encode either the syntactic or the semantic information contained in that context. The following sections describe syntactic and semantic features, with some common features from each category.


2.6.2.1 Syntactic Features

Syntax is the major source of information for semantic role labeling. Syntactic features are derived from the information conveyed by various syntactic formalisms, some of which were discussed in section 2.4. The quality of syntactic features is affected by the quality of the underlying syntactic parses. The baseline set of syntactic features was first introduced by Gildea and Jurafsky (2002) and then extended and revised by other scholars. The common syntactic features are described here.

• Phrase Type: the syntactic category of the argument sample, such as NP (noun phrase) or PP (prepositional phrase), derived from partial or full parses. Xue and Palmer (2004) argue that this feature is more useful when accompanied by the predicate itself.

• Path: the sequence of phrase types, POS tags, or dependency relations located on the path from the argument sample to the predicate, derived from the constituency or dependency parse tree (e.g. "VP↑S↓NP"). This feature is among the most useful, especially for the identification subtask (Xue & Palmer 2004); a small computation sketch follows this list.

• Voice: the voice of the verb predicate, which can be either active or passive, derived from constituency or dependency parses (Zhao & Kit 2008).

• Subcategorization: the CFG rule expanding the parent of the predicate in the constituency parse tree (e.g. "VP -> VBD NP PP"). Gildea and Jurafsky (2002) used this feature to capture the transitive or intransitive usage of verbs such as "open" in the proposition, which affects its predicate-argument structure.

• Governing Category: the syntactic category of the earliest simple clause ("S") or verb phrase ("VP") dominating the argument sample in the constituency
parse tree. A noun phrase with a governing category of "S" or "VP" tends to be the subject or the object of the verb, respectively (a correspondence between grammatical functions and semantic roles).

• POS Tags: proposed by Surdeanu et al. (2003), the POS tags of word tokens in various relations with the sample token or the predicate, derived from the output of a POS tagger or parser. These features are utilized as a backoff solution for sparse lexical features (explained in the next section).

• Syntactic Frame: the sequence of NPs and the predicate in the order in which they appear in the sentence, with the argument sample differentiated from the others (e.g. "NP-np-v-np", where the first NP represents the sample itself).
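
As an illustration of how the path feature can be extracted, the sketch below walks from a candidate constituent up to the lowest common ancestor and down to the predicate in a toy tree (a simplification of what real extractors do on full parser output; here '^' and 'v' stand in for the up and down arrows):

def ancestors(node):
    chain = [node]
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain

def path_feature(candidate, predicate):
    """Concatenate the labels from the candidate up to the lowest common
    ancestor (marked '^') and down to the predicate (marked 'v')."""
    up, down = ancestors(candidate), ancestors(predicate)
    common = next(n for n in up if n in down)
    up_part = [n.label for n in up[:up.index(common) + 1]]
    down_part = [n.label for n in reversed(down[:down.index(common)])]
    return "^".join(up_part) + ("v" + "v".join(down_part) if down_part else "")

class Node:
    def __init__(self, label, children=()):
        self.label, self.parent = label, None
        for child in children:
            child.parent = self

np_arg0 = Node("NP")
vbd = Node("VBD")
vp = Node("VP", [vbd, Node("NP"), Node("PP")])
s = Node("S", [np_arg0, Node("NP"), vp])

print(path_feature(np_arg0, vbd))   # -> "NP^SvVPvVBD"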

2.6.2.2 Semantic Features

Semantic features are often derived from different forms of lexicalization of the predicate, the argument sample, or any other token in relation with them, such as their surface forms or lemmas. Studies show that while syntactic features are more important for the identification of arguments, semantic features contribute more to classifying them (Pradhan et al. 2008). Unlike syntactic features, which are affected by parser inaccuracy, these features are often more robust to such extraction errors. However, these two kinds of features seem to be most useful for machine learning when exploited together (Surdeanu et al. 2008).

A problem with lexical features is their limited capability to generalize to new domains. Because of their sparsity, it often happens that data from a new domain contains many values for these features that were unseen in training, which degrades the learning performance. Recently, Zapirain et al. (2009) addressed this problem by employing even deeper semantic features. They modeled the selectional preference between predicate
and argument to generalize over lexical features and gained a significant improvement. Some of the most widely used semantic features in the literature are described here.

• Head Word: the surface form or lemma of the argument sample's head word (e.g. "in" for "in the U.S."), derived using heuristics such as the Collins (1999) rules. In dependency-based SRL it can be the form or lemma of the dependency head of the argument sample. Gildea and Jurafsky (2002) argue that it captures selectional restrictions on assigning semantic roles.

• Content Word: a feature complementary to the head word, derived using a heuristic different from the head word extraction rules, introduced by Surdeanu et al. (2003). For example, in a prepositional phrase like "in the U.S.", the head word is "in", but the content word is "U.S.", which is more informative (see the sketch after this list).

• Predicate: the surface form or lemma of the predicate itself (e.g. "sell" for "sold"), which is as important as the head word. While Gildea and Jurafsky (2002) use the surface form of the predicate in their work, Surdeanu et al. (2003) use both the surface form and the lemma, since the latter relieves the sparsity of the former by generalizing over it.

• Named Entity Tags: the category of the argument sample when it contains a named entity (e.g. "U.S." is a named entity of type location), derived from the output of a named entity recognizer; this feature is especially useful for adjunct roles like AM-TMP and AM-LOC (Surdeanu et al. 2003).
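
The head word/content word distinction can be illustrated with a toy heuristic for prepositional phrases (written in the spirit of the cited rules, not the exact rule sets):

def head_and_content_word(phrase_type, words, pos_tags):
    """Toy heuristic: for a PP the syntactic head is the preposition, while the
    content word is taken from the embedded NP (here, its last noun)."""
    nouns = [w for w, t in zip(words, pos_tags) if t.startswith("NN")]
    if phrase_type == "PP" and pos_tags[0] == "IN":
        return words[0], (nouns[-1] if nouns else words[-1])
    # For NPs and most other phrases both features fall on the same word;
    # a common approximation is the rightmost noun (or the last word).
    head = nouns[-1] if nouns else words[-1]
    return head, head

print(head_and_content_word("PP", ["in", "the", "U.S."], ["IN", "DT", "NNP"]))
# -> ('in', 'U.S.')
print(head_and_content_word("NP", ["the", "luxury", "auto", "maker"],
                            ["DT", "NN", "NN", "NN"]))
# -> ('maker', 'maker')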

2.7 Summary

This chapter reviewed the semantic role labeling literature in terms of its structure, the steps involved, and the related problems. It introduced and analyzed the corpora used for learning SRL. Despite the high costs and thorough efforts devoted to their development, none of
the current corpora were found sufficient for training an SRL system whose output is adequate for real-world applications in any domain of language. The role of syntax in SRL and several syntactic formalisms used as input for SRL systems were reviewed. It was shown that syntactic errors are a major source of SRL defects. Constituency and dependency syntax were examined in detail.

The steps forming the pipeline architecture of SRL, and various related problems and proposed solutions, were explained. In addition to the main stages of this pipeline, namely argument identification and classification, different pre- and post-processing steps, such as pruning and global optimization, and other variations of the architecture employed by researchers were analyzed.

Moreover, this chapter investigated the learning aspect of semantic role labeling in terms of classification approaches and learning features. Support vector machines (SVM) and Maximum Entropy (ME), the most popular classification algorithms used by SRL systems, were compared. According to the reviews, the same algorithms under different SRL approaches have led to significantly different performances, showing that the choice of algorithm is correlated with other specifications of the approach. The chapter also introduced the most common learning features used in SRL studies and discussed their association with the architecture of the system. The next chapter further explores the learning problem, focusing on learning with unlabeled data.


CHAPTER 3
Learning Semantic Role Labeling Using Unlabeled Data

3.1 Introduction

The previous chapter introduced supervised learning methods and reviewed their application to semantic role labeling. It was discussed that FrameNet, PropBank, and other labeled corpora have been the main support for supervised semantic role labeling. FrameNet only annotates some example sentences selected for each semantic frame. This annotation approach calls into question its representativeness of the language, which is necessary for statistical learning. PropBank, on the other hand, annotates all the sentences of the WSJ corpus and remedies that problem to some extent. However, unlike FrameNet, its coverage is limited to the newswire text of the WSJ. This domain dependence affects the performance of systems using PropBank on any different domain of text (Carreras and Marquez, 2005).

The reason for all the above defects is the inadequacy of labeled data, which causes the classifier to face new events (encoded as features) in new data that were unseen in the training data. Considering the cost and difficulty of manually creating such resources, it seems infeasible to build a comprehensive hand-crafted corpus of natural language for training robust SRL systems. This issue is not specific to SRL; it is the major problem of statistical learning that relies on hand-crafted training data, which is often insufficient.

This chapter first introduces unlabeled data and methods of using it to relieve this issue. It then focuses on bootstrapping among those methods and describes two
bootstrapping algorithms, namely self-training and co-training. Next, it reviews semantic role labeling works that have tried to utilize unlabeled data; the end of this review concentrates on the two bootstrapping algorithms mentioned above. Finally, the chapter concludes with a discussion of the problems of applying these algorithms to SRL.

3.2 Learning with Unlabeled Data

The availability of unlabeled, or raw, data at larger scale and lower cost is a stimulus for learning methods that replace labeled data with it. Natural language processing (especially of written language) is a domain in which this ambition is even more appealing, since a vast amount of digitized written text is now available and can be made ready to use after some trivial preprocessing steps. Although it may seem impracticable to learn from raw examples without any trace of labels associated with them, attempts have been made and various methods have been introduced towards this goal.

Unsupervised methods are techniques that formulate the learning problem as identifying how examples are similar or related to each other and grouping them accordingly. They then assign a class or value to each group. This process is called clustering and is the counterpart of classification in supervised learning.

While unsupervised techniques avoid any labeled data, there are other methods of learning with unlabeled data that use both labeled and unlabeled data, normally at different scales. These methods are known as semi-supervised methods and have been inspired by the small gains and difficulties of applying unsupervised learning. In one approach, this learning scheme tries to enlarge its training set by extending the annotation of the labeled dataset, known as the seed, into the unlabeled dataset. In an alternative approach, the knowledge acquired from unlabeled data is used as additional
features for normal supervised learning with the seed labeled data (Deschacht & Moens 2009).

There are different objectives in using semi-supervised methods for learning. These objectives can be categorized as follows:

• Exploiting unlabeled data for training when there is not enough labeled data available to achieve acceptable performance. In such cases, the amount of unlabeled data is much larger than the amount of labeled data that would be required to obtain the same result.

• Exploiting extra unlabeled data to further improve the learning performance when enough labeled data is available to achieve state-of-the-art performance (Lee et al. 2007).

• Exploiting unlabeled data from a different domain than the labeled data for domain adaptation, or from multiple genres to improve domain generalizability. The usual scenario in such cases is that the learner achieves satisfactory performance on data from the same domain as the labeled training data but falters when tested on out-of-domain data. In other words, enough labeled data is available for a specific domain but not for a different target domain or a multi-genre domain. As explained earlier, this is the major flaw of current SRL systems.

3.3 Bootstrapping

As explained in the previous section, one approach in semi-supervised learning is based on extending the training set by labeling unlabeled data using the annotation of the labeled data. One method is to propagate the annotation from labeled data to unlabeled data based on similarities between them (Furstenau & Lapata 2009). In another method, a
base classifier is trained on the labeled data and then annotates the unlabeled data. The newly annotated data is then added to the training set, and the classifier is re-trained on this augmented set. This procedure is known as bootstrapping and is usually performed iteratively until a stop criterion is met. In a variation of bootstrapping, the seed labeled data is replaced with a seed classifier (Collins & Singer 1999; Abney 2002). Self-training and co-training are two well-known bootstrapping algorithms, both of which were introduced by NLP research.

3.3.1 Self-training

The work by Yarowsky (1995) has been cited as the introduction of self-training, although similar algorithms had previously been used by Hearst (1991) and Hindle and Rooth (1991). Yarowsky uses self-training for the problem of word sense disambiguation, where he describes his approach as an unsupervised method; however, because it uses labeled data as a seed, it is considered semi-supervised learning. Yarowsky first annotates some occurrences of the word plant in a raw corpus of English text with its two senses, based on their collocation with the words life or manufacturing. This annotation covers 82 examples of the living sense and 106 examples of the manufacturing sense and leaves 7,350 examples unlabeled. He then trains a decision list classifier on these labeled data and uses it to classify the remaining unlabeled data. From among the newly classified data, the most reliable examples, labeled with a probability higher than a certain threshold, are selected and moved to the training set. The process is repeated until the algorithm converges, i.e. the training parameters stop changing.

According to the above description, several parameters are involved in the algorithm. These parameters are usually treated differently by different works, based on
the characteristics of the underlying task or on conclusions drawn from empirical parameter tuning experiments. The selection of the labeled data to be added to the training set in each iteration is the crucial point of self-training, in which the propagation of labeling noise into subsequent iterations is the major concern. One can select all of the labeled examples, but usually only a number of them, known as the growth size, meeting a confidence threshold are selected in order to take the steps more cautiously. While no growth size was set by Yarowsky, Steedman et al. (2003) added only 20 of the 30 sentences labeled by their syntactic parsers.

The selected labeled examples can be retained in the unlabeled set, to be labeled again in subsequent iterations, or moved out of it, so that they are labeled only once. Abney (2002) names the former delibility and the latter indelibility. He states that indelibility can prevent the self-trained classifiers from drifting away from the base classifier as the algorithm progresses. Indelibility is the dominant strategy in bootstrapping research.

In each iteration, either all of the unlabeled data or only a portion of it can be labeled. In the latter case, a specific number of unlabeled examples are selected, loaded into a pool, and then labeled. In addition to making labeling in each iteration more efficient, using a pool can be beneficial for self-training performance. The idea is that when all data is labeled, since the growth size is often much smaller than the labeled size, a uniform set of examples preferred by the classifier is chosen in each iteration, leading to a biased classifier (Abney 2008). Limiting the labeling size to a pool and, at the same time, (pre)selecting diverse examples into it can remedy this problem. Unlike the original self-training work (Yarowsky 1995), which did not use a pool, Clark et al. (2003) examined various pool sizes for the bootstrapping of POS taggers.


Different stop criteria have been used in the bootstrapping literature. One can iterate for a specific number of steps (Steedman et al. 2003) or until the entire unlabeled data set has been used (McClosky et al. 2008). As described in Yarowsky's work, one can also continue until the algorithm converges (no further improvement is observed). Self-training has been applied to several NLP problems: coreference resolution (Ng & Cardie 2003), POS tagging (Clark et al. 2003), and parsing (McClosky et al. 2008) have been shown to benefit from self-training. These studies show that the performance of self-training is tied to its several parameters and to the specifics of the underlying task.
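
Putting these parameters together, a generic self-training loop with a pool, a growth size, a confidence threshold, and indelible selection can be sketched as follows; the classifier interface (build_classifier and predict_with_confidence) is a hypothetical stand-in for any probabilistic learner:

import random

def self_train(train, unlabeled, build_classifier,
               pool_size=1000, growth_size=100, threshold=0.9, max_iterations=50):
    """Generic self-training with a pool and indelible selection:
    selected examples are moved out of the unlabeled set and never relabeled."""
    unlabeled = list(unlabeled)
    classifier = build_classifier(train)
    for _ in range(max_iterations):
        if not unlabeled:
            break
        pool = random.sample(unlabeled, min(pool_size, len(unlabeled)))
        predictions = [(x, *classifier.predict_with_confidence(x)) for x in pool]
        confident = sorted((p for p in predictions if p[2] >= threshold),
                           key=lambda p: -p[2])[:growth_size]
        if not confident:
            break
        for x, label, _ in confident:
            train.append((x, label))          # grow the training set
            unlabeled.remove(x)               # indelibility: never relabel
        classifier = build_classifier(train)  # retrain on the augmented set
    return classifier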

3.3.2 Co-training

As a bootstrapping algorithm, co-training is intrinsically similar to self-training. The difference is that in co-training, instead of one classifier, two or more different classifiers cooperate with each other in labeling the unlabeled data. The aim is to build higher-quality training data from the unlabeled data than any one of them could accomplish alone. Co-training was introduced by Blum and Mitchell (1998), who applied the algorithm to the problem of web page classification.

The major concern in co-training is deciding how the base classifiers differ from each other. Blum and Mitchell use two different views of the problem to train two different classifiers with the same Naïve Bayes algorithm. Specifically, one classifier is trained using features extracted from hyperlinks pointing to the sample web page, and the other is trained using bag-of-words features extracted from the content of that web page. The target is to classify web pages as being a course home page or not. Blum and Mitchell (1998) hypothesize, and give theoretical justification, that in order for co-training to be profitable, the two selected views should satisfy two assumptions:
• Each view should by itself be sufficient for the classification with reasonable performance (the redundancy assumption).

• The views should be conditionally independent given the classification (the conditional independence assumption).

However, Abney (2002) argues that the independence assumption they propose is very strong and often does not hold for the data. He suggests a weaker independence assumption under which co-training can still be beneficial.

There are two main parameters determining the way the classifiers cooperate. One is the selection of newly labeled data to be added to the training sets of the classifiers, and the other is the form of the co-training training set. In terms of labeled data selection, Blum and Mitchell (1998) select and add the most confident labelings of each view to their common training set. A conflict may arise when the same labeled samples are selected by both views, since the unlabeled data selected for labeling in each iteration is the same for both classifiers; however, this issue is not mentioned in the paper. In a more sophisticated approach, Clark et al. (2003) first select a random subset of the examples newly labeled by one of the classifiers and calculate the agreement rate of the classifiers on that subset. They repeat this process n times to arrive at a subset that maximizes the agreement rate. The chosen subset is then added to the training set of the other classifier.

In terms of the training set, there can be two variations of the algorithm. In one variation, all classifiers use a common training set, so in each iteration it is augmented with a set of newly labeled data shared between the classifiers. In the other variation, each classifier uses a separate training set. The collaboration prerequisite in this case is that the examples labeled by one classifier are added to the training set of the other classifier in each iteration.


All the other parameters and algorithm variations discussed for self-training in the previous section apply to co-training too. All the studies agree that each of these parameters, such as the growth size, pool size, and confidence threshold, strongly influences the performance of bootstrapping. However, no deterministic way of setting these parameters has been suggested (Ng & Cardie 2003). Therefore, often extensive parameter tuning experiments are required to arrive at an appropriate set of parameter values. Considering the computational cost of the iterative bootstrapping procedure, this issue is a pitfall of such algorithms. All in all, whereas some NLP research applying co-training reports successful results (Sarkar 2001), other work has preferred different algorithms over it (Ng & Cardie 2003) or suggested that the algorithm needs further study due to the large scale of the target problem (Pierce & Cardie 2001; Muller et al. 2002).
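
The common-training-set variant of co-training can then be sketched as a small extension of the self-training loop above (again with a hypothetical classifier interface; view1 and view2 map a raw example to the features of each view, and examples are assumed hashable in this toy version):

import random

def co_train(train, unlabeled, build_classifier, view1, view2,
             pool_size=1000, growth_size=100, threshold=0.9, max_iterations=50):
    """Co-training with a common training set: each classifier is trained on its
    own feature view, and the confident labelings of both are added to the
    shared training set (conflicts resolved in favor of the more confident view)."""
    unlabeled = list(unlabeled)
    for _ in range(max_iterations):
        c1 = build_classifier([(view1(x), y) for x, y in train])
        c2 = build_classifier([(view2(x), y) for x, y in train])
        if not unlabeled:
            break
        pool = random.sample(unlabeled, min(pool_size, len(unlabeled)))
        selected = {}
        for classifier, view in ((c1, view1), (c2, view2)):
            scored = [(x, *classifier.predict_with_confidence(view(x))) for x in pool]
            scored = [s for s in scored if s[2] >= threshold]
            for x, label, conf in sorted(scored, key=lambda s: -s[2])[:growth_size]:
                if x not in selected or conf > selected[x][1]:
                    selected[x] = (label, conf)   # keep the more confident label
        if not selected:
            break
        for x, (label, _) in selected.items():
            train.append((x, label))
            unlabeled.remove(x)                   # indelible, as in self-training
    return c1, c2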

3.4 Using Unlabeled Data in SRL

There have been only a few attempts at utilizing unlabeled data for learning semantic role labeling; most of the research has focused on other aspects and problems of this complex task using supervised methods. This section first introduces the unsupervised and semi-supervised works that pursue learning SRL from unlabeled data and then focuses on the application of the self-training and co-training algorithms to this task.

3.4.1 Unsupervised Approaches

The first attempt to use unlabeled data for learning SRL appeared in the first statistical study of SRL by Gildea and Jurafsky (2002). They used a probabilistic word co-occurrence model to cluster nouns in a large unannotated corpus. The co-occurring words in each cluster were considered to fill the same roles due to their semantic similarity.
Therefore, unseen lexical items could be generalized by referring to their clusters and using the information of known words in the same clusters. This method more than doubled lexical coverage.

In an effort to avoid using any annotated SRL corpus, Swier and Stevenson (2004) used a bootstrapping approach to iteratively label unlabeled data and increase the amount of training data in each step. They used a seed classifier based on VerbNet syntactic frames and verb-argument structures to initially assign labels to a set of unlabeled data. Although they introduced it as an unsupervised method, this way of using unlabeled data is generally considered a semi-supervised approach, since it uses a seed classifier as a weak form of supervision (Abney 2008).

Grenager and Manning (2006) developed a generative probabilistic model for unsupervised semantic role labeling and, furthermore, for automatically building a lexical resource of verb-argument structure from unlabeled data. This model was learned using the Expectation Maximization (EM) algorithm on a mass of unlabeled training data extracted from three corpora, each encoding syntactic information from a different source: the WSJ section of the Penn Treebank with gold-standard syntactic parses, the Brown corpus of WSJ parsed by the Charniak (2000) parser, and the Gigaword corpus of raw newswire text parsed by the Stanford parser (Klein & Manning 2003). They suggested that much more unlabeled data may help to improve such a system.

More recently, Abend et al. (2009) investigated a purer unsupervised method for the argument identification stage, avoiding any level of annotation except POS tags. Their method was based on pointwise mutual information to gather collocation statistics between the predicate and its arguments, derived from a set of 35 million raw English sentences. They reported successful results for both languages experimented with, namely English and Spanish.

3.4.2 Semi-supervised Approaches

Since semantic role labeling is a relatively new field of research, the application of semi-supervised learning to this area has only recently received attention from scholars. In another attempt to increase the coverage of the head word feature, Gildea and Jurafsky (2002) tried to generalize the head words seen in the labeled training set to a broader range of vocabulary using bootstrapping. The algorithm they used for this purpose was a variation of Expectation Maximization (EM). They used a set of unlabeled data 6 times bigger than their seed and iterated the algorithm only once. Their results showed an increase in coverage, but at the cost of labeling accuracy. Overall, the performance, which they defined as the product of coverage and accuracy, improved.

EM was also used by Thompson (2004) to estimate the parameters of his generative model of the semantic role labeling problem. Due to negative results with the first setting, he focused on only the seven most common frames of FrameNet. He reported that the first iteration of EM was beneficial for accuracy, but the following rounds were detrimental, although always better than the baseline. He attributed the small gain to the shortcomings of the underlying generative model.

Furstenau and Lapata (2009) used a graph-alignment method to discover the semantic and structural similarity between labeled and unlabeled instances, so that they could automatically project the annotation onto the unlabeled data set based on this similarity. Their aim was to expand the FrameNet labeled data with those unlabeled instances that included verbs unseen in the FrameNet annotated sentences. The extended corpus was finally used to train an SVM-based classifier. The results showed a performance improvement for medium- and high-frequency verbs that were unseen in the labeled data.

In the same spirit, Deschacht and Moens (2009) employed a Latent Words Language Model to extract word similarities from unlabeled data. In addition to
expanding the small labeled data set taken from PropBank, they also carried out another experiment that used the learned similarities as additional features for semantic role labeling. Moreover, they varied the size of the seed labeled data by randomly selecting samples from 5% to 100% of the whole data set. The learning paradigm was an extension of the Maximum Entropy Hidden Markov Model. They were able to outperform the baseline state-of-the-art supervised system, especially when using smaller amounts of labeled training data.

3.4.3 Self-training and Co-training

There have been two works in SRL employing self-training and co-training as their semi-supervised learning strategy. Since this research specifically studies self-training and co-training for bootstrapping, these works are described here in detail.

The first application of self-training and co-training to SRL was reported by He and Gildea (2006). They addressed the problem of domain generalizability and unseen FrameNet frames by selecting a labeled dataset of 462,634 examples (argument samples) and an unlabeled dataset of 156,921 examples (about one third as many). The unlabeled examples were selected from frames different from those in the labeled data. They also compared the performance of Maximum Entropy (ME) and Decision List (DL) classifiers in self-training, whereas they used only DL for co-training. In this comparison, ME outperformed the decision list. Their co-training algorithm with decision lists did not gain a significant improvement, revealing the bottlenecks of using DL for co-training semantic role classifiers. They used release 1.1 of FrameNet, which consists of 474 frame elements, the argument roles of the FrameNet annotation scheme as described in Chapter 2. To reduce the complexity of the task, they generalized these roles to 15 thematic roles.


The feature split for co-training was based on syntactic and semantic (lexical) views of the problem, where only the tree path was selected for the syntactic view and only the head word form for the semantic view. Self-training used both of these features together. The reason for using only two features was to reduce the training time. In addition to the low performance of their base classifiers, the big performance gap between the two classifiers is worth noting. The F1 of the ME base classifier for self-training was about 44 points on seen frames, whereas that of the DL base classifier was less than 10 points. The head word based and path based DL classifiers performed at around 5 and 33 F1 points respectively on seen frames (a gap of about 28 points). As expected, the performance of the base classifiers on unseen frames was worse: about 33 F1 points for the ME and 27 F1 points for the DL. However, the gap between the head word based and path based classifiers was smaller; their F1 scores were 8 and 27 points respectively, a gap of 19 points.

Their self-training algorithm labels all the unlabeled data in each round, meaning that no pool is used. However, it is not clear whether examples are relabeled or not (delibility versus indelibility). With the DL classifier, it adds the 100 most confident predictions to the training set in each iteration and repeats for 900 rounds. They increase the growth size to 1,000 and set the total number of iterations to 90 for the ME classifier due to its higher precision. The co-training is done similarly, and a separate training set is used for each co-training view.

The improvement gained as the ME-based self-training progressed was small and inconsistent. They reported that the NULL label (non-argument) had often dominated the other labels in the examples added to the training set, which could be the main reason for this behavior. On the other hand, DL-based self-training did not improve the base classifier. They argue that this occurs because the low recall of the base classifier hampers the performance improvement.


Shortly after He and Gildea, Lee et al. (2007) studied these algorithms to verify their applicability to the problem of semantic role labeling. Their approach differed in that they used PropBank as the labeled corpus and annotation scheme and the Associated Press Worldstream (APW) corpus, containing around 7 million argument samples (7 times more than the labeled samples), as their unlabeled corpus. Considering the ratio of these data sets, their target was to investigate the effect of extra unlabeled data in amplifying state-of-the-art SRL performance. Moreover, given the domains of the datasets, where APW comes from a more general newswire domain compared to PropBank's business and finance newswire, it seems that they were only partly interested in domain generalization experiments.

An ME classifier was used for both algorithms. Similarly to He and Gildea, they simplified the task by addressing only the core PropBank roles. The feature split was based on syntactic and lexical views, as in the previous work. However, they used a more comprehensive set of features in order to obtain sufficient classification performance. Moreover, the two views had three features in common: the predicate lemma, suffix, and POS tag. The performance of their classifiers is substantially higher than in the previous work, but the big gap still exists: the F1 scores of the syntax-based and lexical-based base classifiers are 70.86 and 51.67 points (a gap of about 19 points). The F1 of the base classifier for self-training, which uses the full feature set, was 75.28 points. However, it is not clear why they report the co-training results against the performance of the full feature set instead of the individual sets. They give a theoretical justification for this split based on the human method of understanding sentence meaning, but state that the conditional independence assumption only weakly holds under this feature split. In contrast, the split used by He and Gildea seems to better obey this assumption, since there is no
feature in common between the views. On the other hand, according to the classifiers' performances, the redundancy assumption seems to be better satisfied in the latter work.

These two works also differ in some algorithmic aspects. Unlike the former, the latter incorporates a probability threshold in the selection of newly labeled data. It also explicitly mentions the indelibility of the labeled data during the process. Finally, for co-training the latter uses a common training set for both classifiers.

Although not compelling, their results show a slight improvement over the base classifier for both algorithms, with self-training achieving very slightly better results than co-training. They argue that this result can be traced back to their poor base classifier, which performed substantially worse than the state of the art, and to an insufficient amount of unlabeled data, which was only about 7 times more than the labeled data. They verified the effect of varying the size of the labeled data in supervised learning of their classifier and observed that after about half of the data, the F1 improves by only 1 point. They concluded that co-training can be used for SRL, but that it requires a vast amount of unlabeled data.

3.5 Conclusion

The cost and difficulty of corpus annotation and the infeasibility of a comprehensive corpus representative of the language have steered the attention of the statistical NLP community toward easily and immensely available unlabeled data. This is not only a concern of this research field but of all statistical learning problems relying on data.

While unsupervised learning with only unlabeled data has been shown to be difficult to apply, with little gain, semi-supervised learning tries to remedy this problem by utilizing some manually labeled data along with the unlabeled data. Bootstrapping is a family of such algorithms that has been well studied in the NLP area. Self-training and co-training
are two bootstrapping algorithms introduced by this area of research itself and have produced promising results.

Previous works applying these algorithms have identified several problems regarding their application to the underlying task. One apparent observation is that their effectiveness is strongly dependent on the underlying task, along with the several parameters involved in the algorithms, where finding the best matching parameters is laborious and time-consuming, considering the large number of possible combinations and the computational cost of bootstrapping. For a complex task like SRL, there are other major problems that make the application of these algorithms more difficult:

• The number of target classes (roles) is high, and there is no balance between them in a typical dataset, which makes the selection process awkward.

• Designing a feature split that satisfies the co-training assumptions is difficult.

The next chapter describes the methods that we use to further understand these problems and to investigate some solutions to relieve them.


CHAPTER 4 Self-training and Co-training for Semantic Role Labeling

4.1 Introduction

We employ a semantic role labeling framework that uses machine learning methods to learn to identify the semantic arguments of a given proposition in a sentence (argument identification) and the roles filled by those arguments (argument classification). The learning algorithm is based on self-training and co-training, which are semi-supervised bootstrapping methods and use both manually labeled and unlabeled example sentences to train the classifier. This framework will be used to study self-training and co-training methods for SRL from the following points of view:
• Investigating variations of the self-training and co-training algorithms for semantic role labeling
• Experimenting with the effect of the various parameters involved in the algorithms
• Investigating a novel feature split for co-training based on two different views of SRL, namely constituent- and dependency-based, which at the same time can be regarded as a system combination
These experiments will target two common problems addressed by semi-supervised learning using unlabeled data:
• The utility of in-domain unlabeled data in training a classifier when the hand-annotated data is scarce
• The utility of out-of-domain unlabeled data in training a classifier with improved domain generalizability
For co-training, we employ two different SRL frameworks, based on:
• the well-studied constituent-based formalism (Carreras & Marquez 2005)
• the dependency-based formalism, which has recently attracted researchers (Surdeanu et al. 2008; Hajic et al. 2009)
Rather than being a simple feature split for co-training, these two distinct views of the SRL problem constitute a system combination based on two different syntactic formalisms.

The rest of this chapter is organized as follows: first, the syntactic formalisms used are introduced; then the SRL corpora and the annotation formalism chosen for the study are described; next, the architecture of the semantic role labeling frameworks employed for the experiments is explained; the following section describes the features and the process of selecting them for the learning component; after that, the classifier used and the self-training and co-training algorithms designed are explained; finally, the evaluation framework and measures are discussed.

4.2 Syntactic Formalism

It was mentioned that, to implement co-training with two views of the problem, we chose to use two different SRL frameworks. These two frameworks differ in the underlying syntactic formalism and labeling granularity.

In constituent-based SRL, the units processed as argument candidates are either phrases and chunks derived from partial phrase-structure parsing or parse tree constituents derived from full phrase-structure parsing. In this work, we use the latter option; thus the syntactic formalism for our constituent-based SRL is full Treebank-style parses. In dependency-based SRL, on the other hand, the unit is the dependency relation between a word (called the dependent or modifier) and the predicate (called the head), derived from dependency parses of the sentence. The next section explains the process of producing these syntactic parses for the corpora chosen for the study.

4.3 SRL Corpora

As a semi-supervised learning task, two types of corpora are used in this work: a labeled corpus and unlabeled corpora, which in turn consist of in-domain and out-of-domain corpora. The next sections describe these corpora.

4.3.1 Labeled Corpus

Chapter 2 introduced the corpora which have been used for statistical semantic role labeling. Focusing on English and only verb predicates, two of them, namely PropBank and FrameNet, are the obvious candidates to be used as the labeled portion of the training data in this work. We chose PropBank of the two for the following reasons:
• As previously mentioned in section 2.2.2, PropBank has been studied more than other SRL corpora, and thus more comparative analysis is possible using this corpus. Moreover, the extensive work on this corpus, mainly introduced in chapter 2, has identified and investigated more problems and solutions related to PropBank, which can be exploited in a complex experimental setting like semi-supervised semantic role labeling.


• Since FrameNet uses frame-specific semantic roles (frame elements), the number of outcome classes for the classification problem is large. In a bootstrapping approach, this is computationally prohibitive for the learning task and increases the complexity of the process, which can in turn hinder the iterative improvement. In such studies, usually a limited set of frames is used or frame elements are generalized to more abstract thematic roles (He & Gildea 2006). On the other hand, by generalizing semantic roles to a few core roles encoded by numbered labels (Arg0, Arg1, …), PropBank includes only 52 semantic roles and adjuncts, which is more reasonable for the task.
• Because this work deals with both constituent-based and dependency-based SRL, it needs appropriate annotated data for both of these formalisms. Two CoNLL shared tasks, 2005 and 2008, provide data built upon the PropBank corpus and fulfil the requirements of these formalisms respectively.
We use the PropBank annotation paradigm and data, but indirectly, through its CoNLL 2005 and 2008 shared task editions, described here. As labeled data for the constituent-based framework, the data of the CoNLL 2005 shared task is used. This dataset consists of 39,832 sentences, each of which includes the following annotations:
• POS tags and partial syntactic parses derived by a partial parser
• POS tags and full syntactic parses of Collins (1999), including lexical heads
• POS tags and full syntactic parses of Charniak (2000)
• Named Entity tags derived by Chieu and Ng (2003)
• PropBank propositions, including target verb predicates and their argument boundaries along with their semantic roles

From these annotations, we use the POS tags and constituency syntactic parses of Charniak (2000) as the syntactic input, since it has been the choice of most SRL systems and has been shown to be more useful for performance (Carreras & Marquez 2005). We ignore the Named Entity tags as input to avoid depending on an extra level of annotation as much as possible. The PropBank proposition annotation is required as the major annotation level for semantic role labeling; this is the only gold-standard annotation used in this work. We reparsed the sentences using the newer reranking parser of Charniak and Johnson (2005) due to the higher quality of its output, which consequently affects the performance of the system. The detailed comparisons are presented in chapter 5.

As labeled data for the dependency-based framework, we do not use the CoNLL 2008 data; instead, we derive the same annotations for the CoNLL 2005 data. The main reason is that the co-training algorithm requires the two frameworks to have compatible input and output, but these datasets identify the semantic arguments at different granularities, which causes a difficulty affecting the performance of co-training. To avoid this issue, we use the CoNLL 2005 proposition annotation as the base and derive the dependency-based annotations using the methods described here. The issue and its treatment are explained in detail later, when the co-training algorithm is presented.

The specific annotation for the dependency-based system is the dependency parses of the sentences. The CoNLL 2008 shared task data provides two analogous kinds of dependency parses, for the closed and open challenges respectively:
• parses derived by converting the gold-standard WSJ constituency parses into dependencies using the algorithm introduced by Johansson and Nugues (2007)
• parses generated by MaltParser (Nivre et al. 2007)

Table 4.1: A typical training sentence with annotations

| No | Word | POS | Constituency Parse | Dep. Head | Dep. Relation | Predicate | Semantic Role |
| 1 | The | DT | (S1(S(NP* | 4 | NMOD | - | (A0* |
| 2 | luxury | NN | * | 4 | NMOD | - | * |
| 3 | auto | NN | * | 4 | NMOD | - | * |
| 4 | maker | NN | *) | 7 | VMOD | - | *) |
| 5 | last | JJ | (NP* | 6 | NMOD | - | (AM-TMP* |
| 6 | year | NN | *) | 7 | VMOD | - | *) |
| 7 | sold | VBD | (VP* | 0 | ROOT | sell | (V*) |
| 8 | 1,214 | CD | (NP* | 9 | NMOD | - | (A1* |
| 9 | cars | NNS | *) | 7 | VMOD | - | *) |
| 10 | in | IN | (PP* | 7 | VMOD | - | (AM-LOC* |
| 11 | the | DT | (NP* | 12 | NMOD | - | * |
| 12 | U.S. | NNP | *))))) | 10 | PMOD | - | *) |

For comparison purposes, we use both of these kinds of parses. However, we reproduce them instead of directly using the CoNLL 2008 shared task data, to match the other levels of annotation used here. To avoid relying on an extra level of hand-crafted annotation other than the semantic roles, we use the automatic constituency parses as input to the converter. The conversion tool is the LTH converter (Johansson & Nugues 2007), based on the same algorithm. Also, the automatic POS tags are provided as input to MaltParser. Table 4.1 depicts a sample training sentence with its annotations. The POS tags and constituency parses are the output of the Charniak (2000) parser, and the dependency parses are the output of the LTH converter. The column "Dep. Head" gives the head of each word as the number of the head word in the sentence (see the first column), and the column "Dep. Relation" gives the dependency relation of the word to its head. The predicate and semantic roles are gold-standard PropBank annotations.

4.3.2 Unlabeled Corpora

Several unlabeled but assorted text corpora are available and have been used in weakly supervised semantic role labeling and other NLP studies. Most of these corpora cover multiple domains of text at different levels, though they are often acquired from newswire text. The North American News Text Corpus (NANC) is a 350-million-word news corpus, while the BNC and Brown are two other widely used large corpora drawn from a more balanced distribution of text genres. As mentioned in section 4.1, we experiment with two kinds of unlabeled data for bootstrapping the semantic role labeler: in-domain and out-of-domain unlabeled data.

4.3.2.1 In-domain Unlabeled Data

Since our target in using the in-domain unlabeled data is the situation in which only a small amount of labeled data is available, we use only a small fraction of the CoNLL data as the labeled seed. This enables us to exploit the remaining part of the corpus as unlabeled data. The only annotations used in this case are the same constituent-based and dependency-based syntactic parses described for the labeled data.

4.3.2.2 Out-of-domain Unlabeled Data

In this work, we use the free version of the American National Corpus (ANC). ANC is a 22-million-word cross-genre corpus and is considered the American analogue of the BNC. It comprises written English text from a broad range of domains such as news, technical, government, fiction, travel, journal, and blog, as well as spoken English transcribed from telephone and face-to-face communications and academic discourse. The free release, known as the Open ANC (OANC), consists of a 15-million-word portion of ANC, which largely retains the domain variety of the original ANC. Table 4.2 shows the domain distribution of the written portion of the corpus.

To prepare the data, we first pre-processed the raw OANC text to filter out inconsistent and erroneous sentences. This procedure selects only sentences with a length between 3 and 100 words.

Table 4.2: Domain Distribution of the Original and Selected OANC Corpus

| Source | Domain | Original Word Count | Final Selected Sentence Count | Ratio |
| 911 Report | government, technical | 281,093 | 11,519 | 3.8% |
| Berlitz | travel guides | 1,012,496 | 38,818 | 12.8% |
| Biomed | technical | 3,349,714 | - | - |
| Eggan | fiction | 61,746 | - | - |
| ICIC | letters | 91,318 | - | - |
| OUP | non-fiction | 330,524 | 13,473 | 4.5% |
| PLoS | technical | 409,280 | 13,896 | 4.5% |
| Slate | journal | 4,238,808 | 174,456 | 57% |
| Verbatim | journal | 582,384 | 19,496 | 6.4% |
| Web Data | government | 1,048,792 | 33,632 | 11% |
| Total | | 11,406,155 | 305,290 | |

In addition, as can be seen in Table 4.2, the Biomed text forms a large portion of the data (~30%) and harms the domain balance of the corpus. (Note that the other dominating source, Slate, is a journal that covers a broad range of domains rather than a specific one like biomedicine.) Also, Eggan and ICIC contain only very small excerpts of text from two specific domains (~1% together), which is insignificant for balancing the domains. Therefore, we omitted these portions of the corpus in preparing our unlabeled data.

After the pre-processing step, we parsed the remaining sentences using the constituent-based parser. During parsing, some sentences which could not be parsed were removed. After parsing, according to the POS tags assigned by the parser, a post-processing step filtered out all sentences without any kind of verb POS tag. Finally, these parses were used as input to the dependency converter and MaltParser to generate the two kinds of dependency parses for the out-of-domain unlabeled data (as was done for the labeled and in-domain data). The last column of Table 4.2 shows the final status of the corpus after all the above procedures.
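The following is a minimal sketch of the sentence filtering just described, assuming the raw sentences are already tokenized and tagged by the constituency parser; the function name and the (tokens, pos_tags) data layout are illustrative, not the actual implementation.

```python
def filter_oanc_sentences(sentences):
    """Keep only sentences of 3-100 words (pre-processing) and, after
    parsing, drop sentences containing no verb POS tag (post-processing).
    `sentences` is assumed to be a list of (tokens, pos_tags) pairs."""
    kept = []
    for tokens, pos_tags in sentences:
        if not (3 <= len(tokens) <= 100):
            continue  # length filter on the raw text
        if not any(tag.startswith("VB") for tag in pos_tags):
            continue  # no verb POS tag, hence no predicate candidate
        kept.append((tokens, pos_tags))
    return kept
```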


Figure 4.1: Employed SRL architecture (a pipeline of Predicate Identification, Pruning, Joint AI & AC, and Global Optimization)

4.4 Architecture

As discussed in chapter 2, semantic role labeling systems generally use a pipeline architecture involving two or more stages, where each stage uses the output of the previous one as its input. The SRL architecture used in this study includes four steps: predicate identification, pruning, joint argument identification (AI) and classification (AC), and global optimization. Figure 4.1 graphically depicts this architecture.

I. Predicate Identification: A simple predicate identification is performed only for the unlabeled data, based on the POS tags assigned by the constituent-based parser: all constituents with a POS tag starting with "VB" are selected as predicates. For the labeled data, we use the gold-standard predicates provided with the PropBank proposition annotation described in the previous section. It should be noted that predicate identification is done for both in-domain and out-of-domain unlabeled data; that is, the in-domain unlabeled data, which is derived from the CoNLL labeled data, omits the gold predicates. This step is common to both frameworks.

II. Pruning: In this step, those argument samples (candidates) which are less likely to be arguments of the predicate are pruned out using a widely used heuristic introduced by Xue and Palmer (2004). The algorithm starts with the verb predicate in hand and collects all its siblings except those coordinated with the predicate by a coordinating conjunction such as "and" or "or". In this variation of the algorithm, we also ignore all punctuation siblings, identified by their POS tag being one of "(", ")", ",", ":", "``", "''", and ".", because we found this very effective: it prunes more non-arguments with almost no additional loss of actual arguments. If a sibling is a prepositional phrase (PP), all its direct children are also collected. This step is repeated for the parents of the current node by climbing up the parse tree toward the top node. Figure 4.2 shows an example application of this algorithm on a sample parse tree; the constituents selected by the algorithm as argument candidates for the verb "sold" (i.e. not pruned) are circled. The pruning step is common to both frameworks, because our argument sample generation is based on constituency, as described in the syntactic formalism section.

Figure 4.2: Pruning a Sample Parse Tree (the parse tree of "The luxury auto maker last year sold 1,214 cars in the U.S.", with the argument candidates of "sold" circled)
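A minimal sketch of this pruning heuristic is given below, assuming a simple parse-tree node structure with label, parent, and children fields; the class, the coordination check, and the punctuation set are illustrative simplifications rather than the actual implementation.

```python
PUNCT_TAGS = {"(", ")", ",", ":", "``", "''", "."}

class Node:
    """Minimal parse-tree node: a label (phrase label or POS tag),
    a parent pointer, and a list of children."""
    def __init__(self, label, children=None, parent=None):
        self.label = label
        self.children = children or []
        self.parent = parent
        for child in self.children:
            child.parent = self

def prune(predicate_node):
    """Collect argument candidates for a verb predicate in the spirit of
    Xue & Palmer (2004), with the punctuation filter used in this work.
    Coordination handling is simplified to skipping CC siblings."""
    candidates = []
    current = predicate_node
    while current.parent is not None:
        for sibling in current.parent.children:
            if sibling is current:
                continue
            if sibling.label in PUNCT_TAGS:
                continue              # ignore punctuation siblings
            if sibling.label == "CC":
                continue              # simplified coordination handling
            candidates.append(sibling)
            if sibling.label == "PP":
                candidates.extend(sibling.children)  # direct children of PPs
        current = current.parent      # climb towards the top node
    return candidates
```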

III. Joint Argument Identification and Classification: Since semi-supervised SRL is both computationally and algorithmically more complex than supervised SRL, these two generally separate subtasks are combined. As described in chapter 2, this is done by adding an extra class, NULL, representing non-argument candidates, to the set of role labels used in the classification subtask. This step is performed in the same way for both frameworks.

IV. Global Optimization: A simple global optimization process is performed after labeling all arguments, based on two SRL constraints:
• a core argument role cannot be repeated for a predicate
• arguments of a predicate cannot overlap
This procedure is done by a simple algorithm without using any constraint optimization technique such as ILP (Punyakanok et al. 2008) or re-ranking (Toutanova et al. 2008). Whenever a constraint is violated, the argument about which the classifier is less confident is reassigned the NULL label. Despite its simplicity, this step boosts the performance of the system; the effect is shown later in the experiments chapter.

Since the two employed SRL frameworks are used to co-train with each other, the labeled samples of one are used as training samples of the other. However, as described in the syntactic formalism section, these frameworks work at different granularities of argument sample. Therefore, a procedure is required to convert labeled samples between the two platforms in both directions. Previous work (Hacioglu 2004) shows that this conversion is error-prone and significantly affects SRL performance. To address this problem, we base our sample generation on constituency and derive one dependency-based argument sample from every constituent-based sample, according to the rules used for preparing the CoNLL 2008 shared task data (Surdeanu et al. 2008). The rules nominate as the dependency-based argument sample the word token inside the argument boundary (tree constituent) whose dependency head is outside that boundary. This one-to-one relation is recorded in the system and helps avoid the flaw caused by the conversion process.
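A minimal sketch of this constituent-to-dependency conversion rule follows; the indices and the dictionary of heads are an illustrative layout based on the example in Table 4.1, not the system's actual data structures.

```python
def dependency_argument_token(span_start, span_end, heads):
    """Given a constituent-based argument spanning word indices
    span_start..span_end (1-based, inclusive) and heads[i] = dependency
    head index of word i (0 = root), return the index of the token whose
    head lies outside the span; it serves as the dependency-based sample."""
    inside = range(span_start, span_end + 1)
    external = [i for i in inside if not (span_start <= heads[i] <= span_end)]
    # For a well-formed constituent there is normally exactly one such token.
    return external[0] if external else None

# Example from Table 4.1: "in the U.S." covers words 10-12; word 10 ("in")
# has its head (7, the predicate) outside the span and is selected.
heads = {10: 7, 11: 12, 12: 10}
print(dependency_argument_token(10, 12, heads))  # -> 10
```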


4.5 Learning

The two bootstrapping algorithms described in the previous chapter are investigated in this work as semi-supervised learning paradigms: self-training and co-training. The successful application of these algorithms to other NLP tasks such as syntactic parsing (Sarkar 2001; McClosky et al. 2008), and the problems identified by their application to semantic role labeling (He & Gildea 2006; Lee et al. 2007), are the motivations behind the choice of these algorithms. Furthermore, the major motivation for applying co-training to SRL is the promising and competitive results of the two CoNLL shared tasks, 2005 and 2008: constituent-based SRL built upon full phrase-structure syntax in the former and dependency-based SRL built upon dependency syntax in the latter were used with comparable success. These results provide a reasonable inspiration to exploit these two different views of the SRL problem in a co-training approach.

Despite their straightforward nature, several variations of these algorithms are possible and used, and many parameters are involved. The following sections first introduce the base classifier used for the implementation of self-training and co-training and the learning features selected for the classifier. Then, the variations of the algorithms and the parameters studied in this work are introduced.

4.5.1 The Classifier

In chapter 2, we explained Maximum Entropy (ME) and Support Vector Machines (SVM) as the two most-preferred classification algorithms in the context of semantic role labeling and compared them. While SVM is known to be more precise in classification, ME offers some other features as advantages over SVM. Due to the special requirements of self-training and co-training, which appear as limitations imposed on the experimental setting, Maximum Entropy is selected as the classifier. One limitation is that bootstrapping is computationally expensive, since it involves several iterations over a large amount of unlabeled data; ME needs less training time than SVM for an accurate classification. Furthermore, SVM is a binary classifier, and for the SRL problem several such classifiers (equal to the number of semantic roles) would have to be trained, which is infeasible because of the same limitation. ME, on the other hand, is a multiclass classifier and better fits the semantic role labeling problem with bootstrapping. Another requirement is the need for an appropriate confidence measure to select the most reliable labelings of unlabeled data to add to the training set. SVM outputs the distance to the margin for each instance, which needs to be converted to a probability, while ME directly outputs the probability of each label being assigned to the argument sample.

Maximum Entropy is a statistical learning framework which integrates contextual information from the underlying problem domain and uses it in the classification of given contexts. While statistical physics is the originating domain of ME, it has been widely used in statistical NLP (Berger et al. 1996; Ratnaparkhi 1998). The contextual information is encoded as features. Features in maximum entropy are binary-valued functions that map a context-class pair to 0 or 1. For example, given "PhraseType = NP" as a context and the semantic role Arg0 as a class, a feature function (or simply an ME feature) can be written as:

f(x, y) = 1 if the phrase type of x is NP and y = Arg0; 0 otherwise

For a training argument sample whose phrase type is NP and whose argument label is Arg0, the value of this feature function will be 1. Feature functions are empirically extracted from the training corpus and are used as constraints to be satisfied by the model of the context-class distribution which the maximum entropy algorithm develops.

It is clear that each feature has a different influence on the distribution model. The importance of each feature in the model is represented by its weight, which is a model parameter and must be estimated by numerical optimization methods due to the exponential nature of the task. Two iterative algorithms are mostly used for this purpose: Generalized Iterative Scaling (GIS) and Limited-Memory BFGS (L-BFGS).

We use the Maxent Toolkit (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html) for training the classifier and classifying argument samples. The toolkit offers the L-BFGS algorithm for parameter estimation in addition to GIS. L-BFGS is applied here with 350 iterations for all classifiers, including the base and self-trained ones, and a Gaussian prior smoothing variance of 1 is selected. These parameters were tuned by a set of preliminary experiments.
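For illustration only, the sketch below shows an equivalent training setup using scikit-learn's multinomial logistic regression, which is mathematically the same model (a maximum entropy classifier with the Gaussian prior expressed as L2 regularization). It is not the Maxent Toolkit interface used in this work, the mapping of the prior variance to the C parameter is only approximate, and the feature values are invented.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical argument samples encoded as categorical contexts.
samples = [
    {"PhraseType": "NP", "Position+Voice": "ba", "PredicateLemma": "sell"},
    {"PhraseType": "PP", "Position+Voice": "aa", "PredicateLemma": "sell"},
]
labels = ["A0", "AM-LOC"]

vec = DictVectorizer()              # expands each context into binary features
X = vec.fit_transform(samples)

clf = LogisticRegression(
    solver="lbfgs",                 # L-BFGS parameter estimation, as here
    max_iter=350,                   # iteration budget used in this work
    C=1.0,                          # roughly a Gaussian prior variance of 1
)
clf.fit(X, labels)
print(clf.predict_proba(X))         # per-class probabilities used for selection
```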

4.5.2 Features

One of the main problems directly related to the objectives of this research is the selection of the features used by the underlying learning algorithm. From one point of view, having an accurate, high-performance base classifier for bootstrapping is important. Another, more specific reason is the nature of the co-training approach, in which the feature split plays an important role in the cooperation of the underlying learning components. As mentioned in section 4.3, in selecting features we have tried to avoid features needing an extra level of annotation, such as named entities. This reduces the dependence on additional hand-made resources, which is the aim of semi-supervised learning, and minimizes the effect of errors embodied in the automatic annotators. The following sections describe the features used here by categorizing them into three groups: constituent-based features, dependency-based features, and general features.


Table 4.3: Constituent-based Features

| Feature Name | Description |
| Phrase Type | Phrase type of the constituent, e.g. "NP" for "last year" |
| Path | The tree path of non-terminals and direction from predicate to constituent, e.g. "VP↑S↓NP" for "last year" |
| Content Word Lemma * | Content word of the constituent, e.g. "U.S." for "in the U.S." |
| Content Word POS | POS tag of the content word of the constituent, e.g. "NNP" for "in the U.S." |
| Governing Category | The first VP or S ancestor of the constituent if it is an NP, e.g. "S" for "last year" |
| Predicate Subcategorization | Rule expanding the predicate's parent, e.g. "VP->VBD NP PP" for "sold" |
| Constituent Subcategorization * | Rule expanding the constituent's parent, e.g. "S->NP NP VP" for "last year" |
| Clause+VP+NP Count in Path * | Number of clauses, NPs and VPs in the path, e.g. "3" for "last year" ("VP↑S↓NP") |
| Constituent and Predicate Distance | Number of words between constituent and predicate, e.g. "0" for "last year" |
| Head Word Location in Constituent * | Location of the head word inside the constituent, based on the number of words to its right and left, e.g. "2/0" for "in the U.S." (head word = "U.S.") |

Features from these groups are selected into various feature sets based on the strategies described in section 4.5.4.1. Appendix B presents the configuration of these feature sets.

4.5.2.1 Constituent-based Features

Constituent-based features were selected by first nominating features gathered from the SRL literature described in the previous chapter, and then running several experiments to choose the most effective feature set. We also modified the original definitions of some of these features and added some relatively novel features based on the observations of these preliminary experiments. Table 4.3 lists these features along with a brief description of each. Examples are given for each feature for the input sentence "The luxury auto maker last year sold 1,214 cars in the U.S." with "sold" as the predicate. Some features, marked with "*" in the table, are described here:
• As explained in the previous chapter, the content word is extracted using the heuristic stated in Surdeanu et al. (2003).
• Predicate subcategorization, introduced by Gildea and Jurafsky (2002), has been a widely used SRL learning feature. We found the same feature for the constituent itself useful as well. Also, we add the number of VPs and NPs to the number of clauses in the path, previously used in SRL studies (Surdeanu & Turmo 2005), as it improves the performance.
• A novel feature introduced in this work (to the best of our knowledge) is the location of the head word inside its constituent. This feature gives the distances of the head word from the beginning and end of the constituent, separated by a delimiter. The feature engineering experiments showed it to be very useful for adjuncts and R-Core roles.
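A small sketch of two of these features is given below, following the examples in Table 4.3; the input representations (token lists and a path given as a list of non-terminal labels) are simplified and hypothetical.

```python
def head_word_location(constituent_tokens, head_index):
    """'Head Word Location in Constituent': number of words to the left and
    right of the head word inside the constituent, joined by '/'.
    E.g. ["in", "the", "U.S."] with head index 2 ('U.S.') -> "2/0"."""
    left = head_index
    right = len(constituent_tokens) - head_index - 1
    return f"{left}/{right}"

def clause_vp_np_count(path_labels):
    """'Clause+VP+NP Count in Path': number of clause (S*), VP and NP
    non-terminals in the tree path, e.g. ["VP", "S", "NP"] -> 3.
    Clause labels are approximated here as those starting with 'S'."""
    return sum(1 for label in path_labels
               if label.startswith("S") or label in ("VP", "NP"))

print(head_word_location(["in", "the", "U.S."], 2))  # -> "2/0"
print(clause_vp_np_count(["VP", "S", "NP"]))         # -> 3
```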

4.5.2.2 Dependency-based Features

Dependency-based features were chosen using the same feature selection process described in the previous section; only dependency-based features were involved in this selection. Table 4.4 lists these features along with an example for each feature for the input sentence "The luxury auto maker last year sold 1,214 cars in the U.S." with "sold" as the predicate. The syntactic and semantic annotation of this sentence is given in Figure 2.5 in chapter 2. The argument word is the argument sample of the dependency-based system, derived from the constituent as described in section 4.4.

Table 4.4: Dependency-based Features

| Feature Name | Example |
| Dependency Relation of Argument Word with Its Head | e.g. "ADV" for "year" |
| Dependency Relation of Predicate with Its Head | e.g. "ROOT" for "sold" |
| Lemma of Dependency Head of Argument Word | e.g. "sell" for "sold" |
| POS Tag of Dependency Head of Argument Word | e.g. "VBD" for "year" |
| Relation Pattern of Predicate's Children | e.g. "SBJ-ADV-OBJ-ADV" |
| Relation Pattern of Argument Word's Children | e.g. "NMOD" for "year" |
| POS Pattern of Predicate's Children | e.g. "NN-NN-NNS-IN" |
| POS Pattern of Argument Word's Children | e.g. "JJ" for "year" |
| Relation Path from Argument Word to Predicate | e.g. "ADV/>" for "year" |
| POS Path from Argument Word to Predicate | e.g. "NN/>" for "year" |
| Family Relationship between Argument Word and Predicate | e.g. "CHILD" for "year" |
| POS Tag of Least Common Ancestor of Argument Word and Predicate | e.g. "VBD" for "year" (the predicate itself is the least common ancestor here) |
| POS Path from Argument Word to Least Common Ancestor | e.g. "NN/>" for "year" (the predicate itself is the least common ancestor here) |
| Word Lemma Path from Argument Word to Predicate | e.g. "year/" for "year" (the predicate itself is not included) |
| Dependency Path Length from Argument Word to Predicate | e.g. "1" for "year" |
| Whether Argument Word Starts with a Capital Letter | e.g. "NO" for "year" |
| Whether Argument Word is a WH Word | e.g. "NO" for "year" |

It is worth noting that POS tag features are considered common between the two syntactic frameworks here, and the context from which they are extracted determines their category. Therefore, the POS Pattern of Predicate's Children, for example, is extracted based on the dependents of the predicate and is considered a dependency-based feature.

4.5.2.3 General Features

General features are those which are extracted independently of the constituency and dependency syntax. As mentioned earlier, POS tag features are considered common between the two syntactic frameworks, so they belong to this feature group as long as their extraction does not depend on either of these frameworks. Note that the head word POS and lemma are considered general features, because the head word itself is the basis of the constituency-to-dependency conversion and is common between the frameworks.

Table 4.5: General Features

| Feature Name | Description |
| Position + Predicate Voice * | Concatenation of constituent position (before or after the verb) and verb voice, e.g. "ba" for "last year" (before, active) |
| Predicate Lemma | Lemma form of the verb predicate, e.g. "sell" for "sold" |
| Predicate POS | Part-of-speech tag of the predicate, e.g. "VBD" for "sold" |
| Head Word * | Head word of the constituent, e.g. "in" for "in the U.S." |
| Head Word POS | POS tag of the head word of the constituent, e.g. "IN" for "in the U.S." |
| Compound Predicate Identifier * | Verb predicate structure type: simple, compound, or discontinuous compound, e.g. "S" (simple) for "sold" |

On the other hand, content word features are categorized as constituent-based features, since their extraction rules are different and based on the structure of the constituent. Table 4.5 lists the general features with an example for each of them. The features marked with "*" are described in more detail:
• To identify predicate voice, we use a heuristic introduced by Igo and Riloff (2008), which consists of two sets of rules, one for recognizing ordinary passive verbs and the other for reduced passive verbs (those without an accompanying passive auxiliary). Preliminary experiments show that joining position and predicate voice is more effective than using them separately.
• The head words of constituents are found using a set of rules introduced by Magerman (1994) as a head percolation table and modified by Collins (1999), Yamada and Matsumoto (2003), and Johansson and Nugues (2007). We chose to use the most recent modification, since it was used in the CoNLL 2008 shared task data preparation (Surdeanu et al. 2008), which is our resource for the dependency-based system. However, there are inconsistencies between these rule sets for a few rules; the problem is resolved by choosing the best-fitting rule from either of them for our setting.
• As a relatively new feature, we use a compound predicate identifier to distinguish between simple, compound (e.g. "get up"), and discontinuous compound (e.g. "turn it over") verb predicates.

4.5.3 Self-training Algorithm

The general scheme of the self-training algorithm is almost identical across different implementations. Variations of the algorithm are developed based on the characteristics of the task at hand, mainly by customizing the several parameters involved. These parameters may be explicit in the body of the algorithm or implicit in the form of other effective factors. The major body of experimental settings in studying self-training is constituted by these parameters. Figure 4.3 shows the self-training algorithm including its parameters, which are described here and specified as they are used in this research.

4.5.3.1 Explicit Parameters

Explicit parameters are those which appear in the body of the algorithm as variables. They are described here according to Figure 4.3.
• L, U, T: The labeled (L) and unlabeled (U) examples used in this work were described earlier in the corpora section of this chapter. T is initially filled with L and then gradually loaded from U. The sizes of L and T, along with their ratio, directly affect the performance and are of particular interest in this work.
• C: The classifier was discussed in section 4.5.1. In addition to the choice of classifier, setting its own parameters is crucial for the learning task, especially in terms of the computational feasibility of such a costly semi-supervised algorithm. A preliminary experiment was carried out and the best-performing parameter set was chosen, as explained in that section.

INPUT
- A set of labeled examples L as seed.
- A set of unlabeled examples U.
BODY
1- Add the seed example set L to the currently empty training set T.
2- Train the base classifier C with training set T.
3- Iterate the following steps until the stop criterion S is met:
   a- Select p examples from U into pool P.
   b- Label pool P with classifier C.
   c- Select n labeled examples with the highest confidence score whose score meets a certain threshold t and add them to training set T.
   d- Retrain the classifier C with the new training set.
OUTPUT
A self-trained classifier SC

Figure 4.3: Parameterized Self-training Algorithm

• S: Various stop criteria were described in the previous chapter. In this work, we opt to continue the process until all unlabeled data is used or the algorithm converges before exhausting the data (no further improvement is observed).
• p, P: The pool and its usage are of special interest in the current research. We experiment both with and without a pool. The pool size (p) was selected by a set of parameter-tuning experiments.
• n, t: The growth size (n) and confidence threshold (t) were tuned through a set of preliminary self-training experiments. Although we do not report the effect of these parameters here, the preliminary experiments show that, together with the pool size, they significantly affect the performance of the algorithm.
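The following is a minimal sketch of the loop in Figure 4.3 with these parameters made explicit; the train, label, and select callables are placeholders for the ME classifier and the selection strategies discussed in the next section, so this is an outline under those assumptions rather than the actual implementation.

```python
def self_train(L, U, train, label, select, pool_size, growth_size, threshold):
    """Sketch of the parameterized self-training loop (Figure 4.3).
    train(T) returns a classifier, label(C, pool) returns scored labeled
    examples, and select(...) implements the selection policy."""
    T = list(L)                                   # step 1: seed the training set
    C = train(T)                                  # step 2: base classifier
    while U:                                      # step 3: stop when U is exhausted
        pool, U = U[:pool_size], U[pool_size:]    # 3.a: load the pool
        labeled = label(C, pool)                  # 3.b: label the pool
        chosen = select(labeled, growth_size, threshold)  # 3.c: select
        if not chosen:                            # convergence: nothing confident enough
            break
        T.extend(chosen)                          # indelible addition to T
        C = train(T)                              # 3.d: retrain
    return C
```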


4.5.3.2 Implicit Parameters

Implicit parameters do not appear in the algorithm as variables, but they are effective factors involved in various steps of the algorithm.

Preselection
While using a pool can improve the efficiency of the self-training process, there can be two other motivations behind it, concerned with the performance of the process. One idea is that when all the data is labeled, since the growth size is often much smaller than the size of the labeled set, a uniform set of examples preferred by the classifier is chosen in each iteration. This leads to a biased classifier like the one discussed in the previous section. Limiting the labeling size to a pool and at the same time (pre)selecting (Abney 2008) divergent examples into it (step 3.a) can remedy the problem. The other motivation originates from the fact that the base classifier is relatively weak due to the small seed size, so its predictions, used as the measure of confidence in the selection process, may not be reliable. Preselecting a set of unlabeled examples that are more likely to be labeled correctly by the classifier in the initial steps seems to be a useful strategy against this. We examine both ideas here, using random preselection for the first case and a measure of simplicity for the second. Random preselection is built into our system, since we use randomized training data. As the measure of simplicity, we propose the number of samples extracted from each sentence; that is, we sort the unlabeled sentences in ascending order of the number of samples and load the pool from the beginning.

Selection
The selection of newly labeled data to be added to the training set in each iteration (step 3.c) is the crucial point of self-training, in which the propagation of labeling noise into upcoming iterations is the major concern. In addition to n and t, selection is influenced by other implicit parameters. One of these is the balance between selected labels. Most previous self-training problems involve binary classification; semantic role labeling is a multi-class classification problem with an unbalanced distribution of classes in a given text. For example, the frequency of Arg1, the most frequent role in the CoNLL training set, is 84,917, while the frequency of 21 roles is less than 20. The situation becomes worse when the dominant label NULL (for non-arguments) is added for argument identification purposes in a joint architecture. This biases the classifier towards the frequent classes, and the impact is magnified as self-training proceeds. In previous work, He and Gildea (2006) and Lee et al. (2007) do not discriminate between roles when selecting high-confidence labeled samples, although they use a reduced (yet still unbalanced) set of roles. The former study reports that the majority of labels assigned to samples were NULL and that argument labels appeared only in the last iterations. To attack this problem, we propose a natural way of balancing in which, instead of labeling and selecting individual argument samples, we perform sentence-based selection and labeling. The idea is that argument roles are distributed over the sentences. As the measure for selecting a labeled sentence, the average of the probabilities assigned by the classifier to all argument samples extracted from the sentence is used.

Addition
Delibility and indelibility were introduced in the previous chapter as two approaches to adding the newly labeled data to the training set. We chose indelibility, so the data are moved from the unlabeled dataset to the training set upon labeling and selection.
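A minimal sketch of the simplicity-based preselection and the sentence-based selection described above follows, assuming each sentence is represented as a dict holding its extracted argument samples and, after labeling, the probability the classifier assigned to each; this data layout is hypothetical.

```python
def preselect_by_simplicity(unlabeled_sentences):
    """Simplicity-based preselection: order unlabeled sentences by the
    number of argument samples extracted from them (ascending), so the
    pool is loaded with 'easier' sentences first."""
    return sorted(unlabeled_sentences, key=lambda s: len(s["samples"]))

def select_sentences(labeled_sentences, n, threshold):
    """Sentence-based selection: score each labeled sentence by the average
    probability the classifier assigned to its argument samples, then keep
    the n highest-scoring sentences above the confidence threshold."""
    def score(sentence):
        probs = [sample["prob"] for sample in sentence["samples"]]
        return sum(probs) / len(probs) if probs else 0.0
    ranked = sorted(labeled_sentences, key=score, reverse=True)
    return [s for s in ranked if score(s) >= threshold][:n]
```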


4.5.4 Co-training Algorithm

As explained earlier, the co-training algorithm is similar to the self-training algorithm, except for its redundant views of the problem. Therefore, most of the specifications and parameters of self-training apply to co-training as well. The following sections describe the two co-training views and the co-training algorithm employed in this study.

4.5.4.1 The Views and Feature Splits

There are a variety of possible views of the problem of semantic role labeling. As explained in chapter 2, the features used for SRL can be categorized from different points of view, such as the extracting context, the linguistic level (syntactic or semantic features), or the underlying syntactic formalism. Chapter 3 introduced the views and feature splits used by the two previous works on co-training for SRL (He & Gildea 2006; Lee et al. 2007), which exploit the division between syntactic and semantic (lexical) features.

The two views from which we look at the problem are constituent-based and dependency-based SRL. A theoretical justification of how these two views are conditionally independent is beyond the scope of this work; instead, we study the effect of the level of feature separation between the views. The second assumption (redundancy) has been shown to hold for these views by the results of previous work (Carreras & Marquez 2005; Surdeanu et al. 2008). We also investigate the effect of the performance balance between the classifiers of the views by varying their feature sets.

To implement the views, three feature splits are designed. Each feature split includes two feature sets (one per view) having a different relation to each other in terms of balance and separation:
• Unbalanced and unseparated feature split (UBUS): the feature sets have some features in common, and a performance gap exists between the classifiers employing each feature set. The feature set of the constituent-based view (CG) includes all constituent-based features (Table 4.3) plus all general features (Table 4.5). The feature set of the dependency-based view (DGUB) includes all dependency-based features (Table 4.4) plus three general features: head word lemma, compound verb identifier, and position + predicate voice.
• Unbalanced and separated feature split (UBS): the feature sets have no features in common, but there is a big performance gap between the classifiers employing each feature set. The feature set of the constituent-based view (CG) includes all constituent-based features plus all general features. The feature set of the dependency-based view (D) includes only dependency-based features.
• Balanced and separated feature split (BS): the feature sets have no features in common, and the classifiers employing each feature set have similar performances. The feature set of the constituent-based view (CGB) includes all constituent-based features plus two general features: head word POS tag and predicate POS tag. The dependency-based feature set (DGB) includes all dependency-based features plus four general features: head word lemma, compound verb identifier, position + predicate voice, and predicate lemma.
The complete configuration of the feature sets can be found in Appendix B. The performances of the classifiers employing each feature set are presented in Table 5.2.

4.5.4.2 The Algorithm

Figure 4.4 shows the co-training algorithm with its explicit and implicit parameters highlighted. The algorithm is explained here with reference to the self-training algorithm, focusing on where it differs from self-training or has other variations.


INPUT
- A set of labeled examples L as seed.
- A set of unlabeled examples U.
BODY
1- Add the seed example set L to the currently empty training sets T1 and T2.
2- Train the base classifiers C1 and C2 with training sets T1 and T2 respectively.
3- Iterate the following steps until the stop criterion S is met:
   a- Select p examples from U into pool P.
   b- Label pool P with classifiers C1 and C2 separately.
   c- Select n labeled examples whose score meets a certain threshold t and add them to training sets T1 and T2.
   d- Retrain the classifiers C1 and C2 with the new training sets.
OUTPUT
Two co-trained classifiers CC1 and CC2

Figure 4.4: Parameterized Co-training Algorithm
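Below is a minimal sketch of this loop for the common-training-set variant (T1 = T2). As with the self-training sketch earlier, the callables stand in for the view-specific ME classifiers and the selection strategies sketched in the next section, so this is an outline under those assumptions.

```python
def co_train_common(L, U, train1, train2, label_pool, select,
                    pool_size, growth_size, threshold):
    """Sketch of Figure 4.4 with a common training set: both view-specific
    classifiers are retrained on the same growing set, and `select` merges
    their predictions (agreement- or confidence-based)."""
    T = list(L)                                  # step 1: common training set
    C1, C2 = train1(T), train2(T)                # step 2: one classifier per view
    while U:                                     # step 3
        pool, U = U[:pool_size], U[pool_size:]   # 3.a: load the pool
        labeled1 = label_pool(C1, pool)          # 3.b: label with each view
        labeled2 = label_pool(C2, pool)
        chosen = select(labeled1, labeled2, growth_size, threshold)  # 3.c
        if not chosen:
            break
        T.extend(chosen)
        C1, C2 = train1(T), train2(T)            # 3.d: retrain both views
    return C1, C2
```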

4.5.4.2.1 Explicit Parameters

• T1, T2: As discussed in section 3.3.2 in chapter 3, co-training can be performed using separate training sets for the classifiers of each view (T1 and T2 here), or with a common training set shared by the classifiers (T1 = T2 here). We are interested in the comparison of common and separate training sets, especially because, of the two previous SRL co-training works, one was based on a common training set (Lee et al. 2007) and the other on separate training sets (He & Gildea 2006).
• C1, C2: C1 is the constituent-based classifier and C2 is the dependency-based one. They differ only in their learning feature sets. Different feature sets are used for each classifier to study the effect of performance balancing and feature separation between the co-training views, as explained in section 4.5.4.1.


4.5.4.2.2 Implicit Parameters

Selection: The same problems with the selection of newly labeled data (step 3.c), such as class balancing, exist for co-training, but selection here is even more complex than in self-training. To decide on a selection method, it must first be determined whether both views use a common training set or each uses a separate training set during co-training. Then it must be decided how the classifiers collaborate.

With a common training set, selection can be done based on the predictions of both classifiers together. In one approach, only samples assigned the same label by both classifiers are selected (agreement-based selection). Another way is to select the most confidently labeled samples. Some works select the most confident labelings from each view (Blum & Mitchell 1998); in this method, a sample may be selected by both views, so this conflict needs to be resolved. We select, for each sample, the label with the highest confidence among the two views (confidence-based selection) to avoid conflicts. Both approaches are investigated and compared in this work.

With separate training sets, selection is done among the samples labeled by each classifier individually (usually confidence-based). In this case, the selected samples of one view are added to the training set of the other for collaboration. This selection approach is also the subject of experiments and comparison here.
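A small sketch contrasting the two selection strategies for the common-training-set case is given below. Samples are assumed to be dicts carrying each view's predicted label and its probability, in the same pool order for both views; in the actual system the same logic is applied at the sentence level, as described for self-training.

```python
def agreement_based(labeled1, labeled2, n, threshold):
    """Keep samples labeled identically by both view classifiers,
    ranked by the average of the two confidences."""
    chosen = []
    for a, b in zip(labeled1, labeled2):          # same pool order assumed
        if a["label"] == b["label"]:
            conf = (a["prob"] + b["prob"]) / 2
            if conf >= threshold:
                chosen.append((conf, a))
    chosen.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in chosen[:n]]

def confidence_based(labeled1, labeled2, n, threshold):
    """For each sample, keep the labeling of whichever view is more
    confident, which also resolves conflicting selections of a sample."""
    chosen = []
    for a, b in zip(labeled1, labeled2):
        best = a if a["prob"] >= b["prob"] else b
        if best["prob"] >= threshold:
            chosen.append((best["prob"], best))
    chosen.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in chosen[:n]]
```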

4.6 Evaluation

The evaluation data and setting of the CoNLL 2005 shared task have been the most common evaluation framework in the SRL literature. This consistency allows better comparison between different works. For this reason, we use this framework, including its test data and evaluation metrics, for evaluating the results of the experiments.


Table 4.6: A typical in-domain test sentence with annotations

| No | Word | POS | Constituency Parse | Dep. Head | Dep. Relation | Predicate | Semantic Role |
| 1 | At | IN | (S1(S(PP* | 11 | VMOD | - | (AM-TMP* |
| 2 | the | DT | (NP(NP* | 3 | NMOD | - | * |
| 3 | end | NN | *) | 1 | PMOD | - | * |
| 4 | of | IN | (PP* | 3 | NMOD | - | * |
| 5 | the | DT | (NP* | 6 | NMOD | - | * |
| 6 | day | NN | *)))) | 4 | PMOD | - | *) |
| 7 | , | , | * | 11 | P | - | * |
| 8 | 251.2 | CD | (NP(QP* | 9 | DEP | - | (A1* |
| 9 | million | CD | *) | 10 | NMOD | - | * |
| 10 | shares | NNS | *) | 11 | VMOD | - | *) |
| 11 | were | AUX | (VP* | 0 | ROOT | - | * |
| 12 | traded | VBN | (VP*)) | 11 | VC | trade | (V*) |
| 13 | . | . | *)) | 11 | P | - | * |

Table 4.7: A typical out-of-domain test sentence with annotations

| No | Word | POS | Constituency Parse | Dep. Head | Dep. Relation | Predicate | Semantic Role |
| 1 | He | PRP | (S1(S(NP*) | 2 | VMOD | - | (A0*) |
| 2 | waved | VBD | (VP* | 0 | ROOT | wave | (V*) |
| 3 | his | PRP$ | (NP* | 4 | NMOD | - | (A1* |
| 4 | arm | NN | *) | 2 | VMOD | - | *) |
| 5 | around | RP | (PRT* | 2 | VMOD | - | (AM-ADV*) |
| 6 | at | IN | (PP* | 2 | VMOD | - | (A2* |
| 7 | the | DT | (NP* | 8 | NMOD | - | * |
| 8 | furnishings | NNS | *))) | 6 | PMOD | - | *) |
| 9 | . | . | *)) | 2 | P | - | * |

4.6.1 The Test Data

The test data consists of three datasets for development, in-domain, and out-of-domain testing purposes:
• Development data includes 1,346 PropBank sentences from section 24 of the WSJ and will be called devel.24.
• In-domain test data includes 2,416 PropBank sentences from section 23 of the WSJ and will be called test.wsj.

• Out-of-domain test data includes 426 PropBank sentences selected from the Brown corpus and will be called test.brown.
Tables 4.6 and 4.7 show typical test sentences with all levels of annotation, taken from the in-domain and out-of-domain test sets respectively. The annotations are the same as those for the sample training sentence presented in Table 4.1.

4.6.2 Evaluation Metrics

The evaluation metrics include precision, recall, and F1, which are measured by the evaluation script prepared for the shared task.
• Precision (P) is the ratio of correctly labeled arguments to all tokens identified as arguments. For example, if the system identifies and labels 80 arguments as Arg1, of which only 60 are correctly Arg1, the precision for the Arg1 role is 60/80, or 75%. The overall precision is calculated by averaging the precisions of the individual roles.
• Recall (R) is the ratio of correctly labeled arguments to all arguments in the test set. In the above example, where 60 arguments are correctly labeled as Arg1, if the number of Arg1 arguments in the test dataset is 100, the recall will be 60/100, or 60%. Again, the overall recall is calculated by averaging the individual recalls.
• F1 is the harmonic mean of precision and recall and is used to assess the overall performance. The script calculates F1 for each individual role and then obtains the overall F1 over all roles. The formula for calculating F1 is:

F1 = (2 × P × R) / (P + R)


Note that NULL is not considered a role label in the above calculations. The script also measures the percentage of perfect propositions, which is the ratio of propositions annotated completely correctly to the number of all propositions in the test dataset. The assessments here are based on the three former measures, with an emphasis on F1.
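For concreteness, the following is a simplified sketch of these per-role measures; it is not the official CoNLL-2005 scorer, and the argument identifiers used as dictionary values are a hypothetical representation.

```python
def precision_recall_f1(predicted, gold):
    """Per-role precision, recall and F1. `predicted` and `gold` map a role
    label to the set of argument identifiers (e.g. (sentence, predicate,
    span) tuples) carrying that label; NULL is excluded beforehand. The
    overall scores described above are averages of these per-role values."""
    scores = {}
    for role in set(predicted) | set(gold):
        pred = predicted.get(role, set())
        ref = gold.get(role, set())
        correct = len(pred & ref)
        p = correct / len(pred) if pred else 0.0
        r = correct / len(ref) if ref else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[role] = (p, r, f1)
    return scores
```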

Another aspect of the results in this work is the trend of the bootstrapping process, that is, how the above metrics vary as the algorithm progresses. Since there is no established method for this kind of evaluation in the literature, the assessment is based on visual observation of the bootstrapping trend graphs.


CHAPTER 5 Experiments and Analysis of Results

5.1 Introduction

The previous chapter introduced the method used in this research and described the underlying experimental frameworks in detail. It also presented the objectives of the research and the parameters of interest to be investigated to achieve these objectives. This chapter presents the experiments carried out accordingly and discusses the results where required. The flow of experiments is presented in Figure 5.1; a more detailed diagram including the processing time for each experiment can be found in Appendix D. Before proceeding to the bootstrapping experiments, the characteristics of the data and classifiers used for the experiments are explained.

5.2 Characteristics of the Data

The sources of the labeled and unlabeled data used for training and testing were introduced in the previous chapter. Table 5.1 lists some characteristics of each of these datasets when parsed with two constituency parsers: the original parses of the CoNLL 2005 shared task data (Charniak 2000), represented by cha, and the parses generated by the reranking parser of Charniak and Johnson (2005), represented by chare. The 3rd column shows the number of syntactic constituents derived by the parsers. Potentially, each constituent is an argument sample, but the pruning process reduces this to the number given in the 4th column. The 5th column shows the number of gold-standard (i.e. hand-labeled) arguments existing in the labeled datasets as positive samples.

Figure 5.1: The flow of experiments (supervised learning curves varying the size of the labeled seed; the effect of selection balancing, preselection, and base classifier performance on self-training; co-training with common training sets using confidence-based and agreement-based selection and different feature splits; co-training with separate training sets; out-of-domain self-training and co-training with original-size and extended unlabeled datasets; constituent-based vs. dependency-based self-training, in-domain and out-of-domain)

Some of these arguments are not matched to any constituent generated by the parser, mainly due to parser errors; this is presented in the 6th column. The number of positive samples finally used for training is the number of gold-standard arguments (5th column) minus the number of mismatched arguments (6th column), and is shown in the last column.

Table 5.1: Characteristics of the Data

| Dataset | Parser | Constituent No. | After-Pruning Sample No. | Gold Argument No. | Mismatched Argument No. | Positive Sample No. |
| train.wsj | cha | 1,691,135 | 718,478 | 242,812 | 17,921 | 224,891 |
| devel.24 | cha | 58,517 | 25,584 | 8,451 | 730 | 7,721 |
| test.wsj | cha | 101,008 | 40,637 | 14,322 | 1,170 | 13,152 |
| test.brown | cha | 13,252 | 5,655 | 2,197 | 242 | 1,955 |
| train.wsj | chare | 1,687,358 | 709,699 | 242,812 | 7,223 | 235,589 |
| devel.24 | chare | 58,191 | 25,803 | 8,451 | 588 | 7,863 |
| test.wsj | chare | 100,600 | 40,506 | 14,322 | 932 | 13,390 |
| test.brown | chare | 13,134 | 5,559 | 2,197 | 205 | 1,992 |

The number of negative samples can be calculated by subtracting the last column from the 4th column. Since the base of sample generation is the constituent-based framework, all of these values are shared between the constituent-based and dependency-based frameworks.

Comparing columns 3 and 4 shows that the pruning stage filters out between 55% and 60% of the parse constituents and makes the learning process more efficient. A closer study of the effect of pruning on the data shows that about 6.7% of the gold-standard arguments (positive samples) of the labeled dataset (train.wsj) are lost by pruning when the original parses (cha) are used. This imposes an upper bound on our classifier performance, both in the classification of the test data and of the unlabeled data. This amount is surprisingly much lower (2.7%) with the re-ranked parses (chare), which remedies the issue to some extent and leads to higher performance.

Another interesting analysis is the comparison of the number of mismatched arguments between the two parsers in column 6. It is apparent that this number is lower for the re-ranked parses, especially for the labeled dataset, which is larger than the test datasets. Since a major source of performance drop in SRL is the mismatch between arguments and parse constituents (Marquez et al. 2008), this is another reason for the higher performance of the classifier when trained and tested with the output of the reranking parser. The performance difference is presented in the next section.


It should be noted that the characteristics of the unlabeled data are given in Table 5.3 in section 5.4, since only a portion of it is used here for the out-of-domain experiments.

5.3 Characteristics of the Classifiers

The previous chapter described the base classifiers and stated their influence on the bootstrapping process. Table 5.2 shows the performance of the constituent-based and dependency-based classifiers with each feature set when trained on the entire labeled data (fully supervised) and tested on the development, WSJ, and Brown test data. For each framework, the measures are given for the two different syntactic inputs described in the previous chapter (cha and chare, malt and converter). In addition, to show its effect, the performance with and without global optimization is listed in the table. These values are referred to later when the experiments are explained and discussed.

For each feature set, two kinds of performance improvement can be seen. One is larger and due to the improved syntactic input, for both the constituent-based and the dependency-based framework. It is interesting that for both frameworks, the improved syntactic input (chare for constituency and converter for dependency) has changed the performance in the same direction and on almost the same scale: the recall has increased and the precision has slightly decreased. This can be caused by the increase in the confidence of the classifier in assigning role (non-NULL) labels to samples. The other improvement is slight and due to the global optimization of the labeling. In the opposite direction to the previous one, global optimization improves the precision at the cost of recall. This is expected considering the nature of our optimization process, which assigns NULL to any sample whose predicted label violates the constraints.


Table 5.2: Performance of the Base Classifiers

                                                  devel.24                 test.wsj                 test.brown
Classifier/Feature Set   Synt. Input  Glob. Opt.  P      R      F1        P      R      F1        P      R      F1
Constituent-based (CG)   cha          -           76.02  65.86  70.58     77.41  68.10  72.46     69.53  57.23  62.79
                         cha          √           77.31  65.31  70.81     78.94  67.60  72.83     70.93  56.82  63.10
                         chare        -           75.26  71.84  73.51     77.24  74.03  75.60     66.30  61.46  63.79
                         chare        √           76.99  71.16  73.96     79.19  73.33  76.14     68.81  60.82  64.57
Constituent-based (CGB)  cha          -           68.96  61.20  64.85     70.35  63.04  66.49     61.78  51.68  56.28
                         cha          √           70.72  60.72  65.34     72.21  62.63  67.08     63.80  51.40  56.93
                         chare        -           69.02  66.19  67.58     71.33  68.85  70.07     60.67  56.68  58.61
                         chare        √           70.56  65.56  67.97     73.14  68.26  70.62     62.25  56.27  59.11
Dependency-based (D)     malt         -           70.63  53.76  61.05     72.12  55.71  62.86     65.08  47.18  54.70
                         malt         √           72.70  53.29  61.50     74.63  55.14  63.42     67.36  46.35  54.91
                         converter    -           71.94  58.96  64.81     73.22  61.23  66.69     67.48  53.47  59.66
                         converter    √           74.17  58.57  65.45     75.52  60.67  67.28     69.72  52.87  60.14
Dependency-based (DGB)   malt         -           72.75  56.54  63.63     74.64  58.51  65.59     67.95  49.38  57.20
                         malt         √           75.03  55.98  64.12     77.22  57.87  66.16     71.04  48.78  57.84
                         converter    -           75.20  61.87  67.89     76.74  64.52  70.10     68.76  55.40  61.36
                         converter    √           77.25  61.37  68.40     78.82  63.96  70.62     71.18  54.57  61.78
Dependency-based (DGUB)  malt         -           71.39  55.75  62.61     73.04  57.74  64.50     66.11  48.83  56.17
                         malt         √           73.53  55.25  63.09     75.51  57.17  65.07     68.96  48.37  56.86
                         converter    -           73.84  60.71  66.64     75.14  63.27  68.70     67.83  54.34  60.34
                         converter    √           75.74  60.29  67.14     77.07  62.80  69.21     69.86  53.56  60.63
Punyakanok               cha          √           80.05  74.83  77.35     82.28  76.78  79.44     73.38  62.93  67.75

The last row (Punyakanok) shows the measures of the state-of-the-art SRL system (Punyakanok et al. 2008) for comparison. There is clearly a large gap between the performance of our SRL systems and the state of the art when the same syntactic input is used (72.83 vs. 79.44 on test.wsj). However, the gap is largely narrowed by using the re-ranked parses (76.14 vs. 79.44 on test.wsj). The main reason for this gap is that, due to the high cost of the bootstrapping process in terms of training time, we have traded some performance for efficiency. This trade-off has been made in:
• the choice of base classifier, where we prefer ME to SVM due to its efficiency
• the SRL architecture, by joining the identification and classification steps (see the sketch after this list)



• the choice of learning features, where we avoid using bag-of-words features and also Named Entity features, as discussed in chapter 4
• the global optimization stage, where we use a simple strategy based on only two constraints, compared to the state-of-the-art systems
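As an illustration of the first two trade-offs, the joint identification+classification architecture with a maximum entropy classifier amounts to a single multi-class learner in which NULL is just another class. The sketch below uses scikit-learn's logistic regression (a maximum entropy model) with made-up feature names and values; it is not the thesis's feature set or implementation:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One sample per candidate constituent; NULL is an ordinary class, so
# identification and classification happen in a single step.
train_X = [
    {"phrase_type": "NP",   "position": "before", "predicate": "give", "path": "NP^S!VP!VBD"},
    {"phrase_type": "PP",   "position": "after",  "predicate": "give", "path": "PP^VP!VBD"},
    {"phrase_type": "ADVP", "position": "after",  "predicate": "give", "path": "ADVP^VP!VBD"},
]
train_y = ["A0", "A2", "NULL"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_X, train_y)

probs = model.predict_proba([{"phrase_type": "NP", "position": "after",
                              "predicate": "give", "path": "NP^VP!VBD"}])
print(dict(zip(model.classes_, probs[0].round(3))))
```

Treating NULL as an ordinary class avoids a separate identification pass, which is the efficiency gain referred to above; the per-sample probabilities are also what the selection step of self-training later relies on.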

5.4 The Effect of Supervision
A major factor determining the performance of a classifier is the labeled data used for its training. In a semi-supervised learning approach, the labeled data (seed) determines the amount and quality of the supervision of the learning process. The learning curve of the classifier over varying amounts of supervision is a useful source for estimating the amount of seed and unlabeled data required for bootstrapping.
Figures 5.2 to 5.5 display the learning curves for both classifiers under different settings, in accordance with Table 5.2. The first two figures show the curves for the constituent-based classifier using the original syntactic input and feature set CG without global optimization (row 5 of the performance columns in Table 5.2), tested on the in-domain (test.wsj) and out-of-domain (test.brown) test sets respectively. The last two figures correspond to row 16 of the performance columns in Table 5.2.
The curve starts with training on 50 sentences. Then, from 100 to 1,000 sentences, the step size for the training set is 100 sentences. From 1,500 to 10,000 sentences (one quarter of the whole labeled data) the step size is 500 sentences. The remaining three quarters of the data are added with a step size of 1,000. We increase the step size as the amount of training data grows, since its effect decreases. To obtain reliable learning curves, we randomly shuffled the labeled data three times and repeated the experiments; the results are consistent across all these datasets. A similar set of curves for another random shuffle of the dataset can be found in Appendix C.
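The curve-generation procedure can be summarized in a short sketch. Here `train_and_eval` is a hypothetical stand-in for training the SRL classifier on a subset and scoring it on a test set; the step schedule follows the description above, with the exact upper limit assumed:

```python
import random

def step_sizes():
    """Training-set sizes for the curve: 50, then 100-1,000 by 100,
    1,500-10,000 by 500, and the rest by 1,000 (capped at the data size)."""
    yield 50
    yield from range(100, 1001, 100)
    yield from range(1500, 10001, 500)
    yield from range(11000, 40001, 1000)

def learning_curve(sentences, test_set, train_and_eval, shuffles=3, seed=0):
    rng = random.Random(seed)
    curves = []
    for _ in range(shuffles):                       # repeat over random shuffles
        data = sentences[:]
        rng.shuffle(data)
        points = []
        for n in step_sizes():
            if n > len(data):
                break
            p, r, f1 = train_and_eval(data[:n], test_set)   # hypothetical call
            points.append((n, p, r, f1))
        curves.append(points)
    return curves
```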


[Figure 5.2: Learning Curve of Constituent-based Classifier on test.wsj (P, R, and F1 vs. number of training sentences)]

[Figure 5.3: Learning Curve of Constituent-based Classifier on test.brown (P, R, and F1 vs. number of training sentences)]

[Figure 5.4: Learning Curve of Dependency-based Classifier on test.wsj (P, R, and F1 vs. number of training sentences)]

[Figure 5.5: Learning Curve of Dependency-based Classifier on test.brown (P, R, and F1 vs. number of training sentences)]

The logarithmic shape of the curves shows that the main contribution of the supervision comes from roughly the first 2,000 sentences. The improvement rate changes within the range of 2,000 to 4,000 sentences; beyond this range it decreases steadily, and additional data becomes less and less useful. This implies similar behavior for unlabeled data, and thus the need for much more unlabeled data to gain an improvement over the base classifier.
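One way to make this observation concrete is to fit the curve as F1 ≈ a + b·ln(n) and extrapolate. The sketch below uses illustrative numbers, not the measured values, to show how the amount of data needed for a target F1 could be estimated under this assumption:

```python
import numpy as np

# Illustrative (n_sentences, F1) points, not the measured values.
n = np.array([500, 1000, 2000, 4000, 8000, 16000, 32000])
f1 = np.array([58.0, 61.5, 64.0, 66.0, 67.5, 68.5, 69.2])

b, a = np.polyfit(np.log(n), f1, 1)          # fit F1 ≈ a + b * ln(n)
target = 71.0
needed = np.exp((target - a) / b)            # sentences needed to reach the target
print(f"fit: F1 ≈ {a:.1f} + {b:.2f}*ln(n); about {needed:,.0f} sentences for F1 = {target}")
```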



As Figures 5.3 and 5.5 show, the out-of-domain test (test.brown) benefits more than the in-domain test (test.wsj) from increasing the training data. The main reason is that out-of-domain classification suffers more from unseen events, so new data is more useful for it. Another observation is that, for the dependency-based classifier (Figures 5.4 and 5.5), the second half of the data has a very small effect on performance. Inspection of the training models for the two classifiers shows that, whereas the number of training feature-value pairs for feature set CG (containing 17 features) is 106,684, it is 579,949 for feature set D (containing 16 features). This suggests that the dependency-based features are much sparser and thus require more data for the classifier to improve.
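The sparsity comparison amounts to counting distinct feature-value pairs in the training samples. A minimal sketch, assuming each sample is a dict mapping feature names to values (the feature names below are illustrative, not the thesis feature sets):

```python
from collections import Counter

def feature_value_stats(samples):
    """Count distinct (feature, value) pairs and how often each occurs."""
    counts = Counter((f, v) for s in samples for f, v in s.items())
    return len(counts), counts

# Hypothetical tiny example: path-like dependency features take many more
# distinct values, so they contribute far more feature-value pairs.
const_samples = [{"phrase_type": "NP", "path": "NP^S!VP"},
                 {"phrase_type": "NP", "path": "NP^VP"}]
dep_samples = [{"relation": "SBJ", "rel_path": "SBJ^ROOT!OBJ!NMOD"},
               {"relation": "OBJ", "rel_path": "OBJ^ROOT!SBJ!PMOD"}]
print(feature_value_stats(const_samples)[0], feature_value_stats(dep_samples)[0])
```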

5.5 Self-training
Section 4.6.2 described the self-training algorithm together with its parameters and variations. In this section, experiments are conducted with the constituent-based framework using feature set CG for the classifier, and, at the end, a comparison is made between the self-training of the constituent-based and dependency-based frameworks. Initially, the original syntactic parses of the CoNLL dataset (cha) are used; they are then compared with the new parses (chare) in section 5.5.3.1 and replaced afterwards.
The default amount of seed labeled data is 4,000 sentences randomly selected from the labeled dataset. This amount is motivated by the analysis of the learning curves generated in the previous section. As discussed, the change in the slope of the curves occurs in the range of 2,000 to 4,000 sentences. We chose the upper bound, which is one tenth of the whole dataset, to obtain an acceptable base classifier performance while keeping the amount of seed data small enough to simulate labeled-data scarcity.
The remaining portion of the labeled dataset (35,832 sentences) is used as in-domain unlabeled data. As out-of-domain unlabeled data, two sets have been randomly


Table 5.3: Characteristics of the Training Data

Dataset           Parser  Constituent No.  After-Pruning  Gold Argument  Mismatched     Positive
                                           Sample No.     No.            Argument No.   Sample No.
4,000 (s)         cha     168,088          70,615         24,224         1,740          22,484
4,000 (s)         chare   167,771          69,868         24,224         670            23,554
35,832 (u.wsj)    cha     1,523,047        698,741        -              -              -
35,832 (u.wsj)    chare   1,519,587        628,298        -              -              -
35,832 (u.oanc)   chare   1,632,873        745,585        -              -              -
70,000 (u.oanc)   chare   3,185,208        1,461,156      -              -              -

extracted from the OANC corpus: one set is as large as the in-domain unlabeled data, and the other is about twice as large (70,000 sentences). The characteristics of these portions of labeled (seed) and unlabeled data, analogous to those of all labeled data in Table 5.1, are presented in Table 5.3 ("s" in the table stands for labeled seed and "u" for unlabeled data). Note that the number of gold-standard arguments and the subsequent statistics (columns 5 to 7) cannot be identified for unlabeled data. A small sketch of this split and the role-coverage check is given after the list below.
It is worth noting that only 38 semantic roles out of a total of 52 are seen in this seed, meaning that the remaining 14 are unknown to the successive self-trained classifiers. This affects our evaluation results negatively, since we do not distinguish between known and unknown roles. In other words, all results in this research are pessimistic and could be improved by excluding unknown roles from the evaluation.
The list of experiments is as follows:
• The effect of selection balancing
• The effect of preselection
• The effect of base classifier performance
• Out-of-domain self-training
• Constituent-based vs. dependency-based self-training
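The seed/unlabeled split and the role-coverage check mentioned above can be reproduced along the following lines. This is a hypothetical sketch; the sentence representation and the `roles` field are assumptions, not the thesis's data structures:

```python
import random

def split_seed(labeled_sentences, seed_size=4000, seed=0):
    """Randomly pick seed_size sentences as the labeled seed; the rest become
    the in-domain unlabeled pool (their labels are simply ignored)."""
    rng = random.Random(seed)
    shuffled = labeled_sentences[:]
    rng.shuffle(shuffled)
    return shuffled[:seed_size], shuffled[seed_size:]

def role_coverage(seed_sentences, all_roles):
    """Return the roles that occur in the seed and those that never do."""
    seen = {role for sent in seed_sentences for role in sent["roles"]}
    return seen, set(all_roles) - seen

# Hypothetical usage:
# seed, unlabeled = split_seed(train_wsj)            # 4,000 vs. 35,832 sentences
# seen, unseen = role_coverage(seed, corpus_roles)   # e.g. 38 seen, 14 unseen
```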

5.5.1 The Effect of Selection Balancing

The process of selecting newly labeled data in each iteration of self-training and the problem of keeping a balance between the selected role labels were explained in section 4.5.3, where the proposed solution for balancing the selection was introduced. To assess the effect of this solution, we self-train the classifier once with unbalanced selection following previous work, in which the selection is sample-based, and once with balancing by sentence-based selection. To be comparable with previous work (He & Gildea 2006), the growth size (n) for the unbalanced method is 7,000 samples (roughly 1/100th of all unlabeled samples, similar to that work), and for the balanced method it is 350 sentences, since each sentence contains roughly 20 samples. A probability threshold (t) of 0.70 is used in both cases. This value was obtained by a limited set of parameter-tuning experiments.
The results on the WSJ and Brown test sets are presented in Figures 5.6 and 5.7 respectively. The figures depict the precision (P), recall (R), and F1 when the self-trained classifier with each of the mentioned methods is tested after each iteration. The horizontal axis shows the number of training samples in thousands for each iteration, starting from the 70,615 seed samples (contained in the 4,000 seed sentences) for the base classifier. The F1 of the base classifier, the best-performing classifiers, and the final classifiers for each method are marked on the graphs.
The balanced selection method outperforms the unbalanced method on both the WSJ (68.38 vs. 68 points of F1) and Brown test sets (59.16 vs. 58.39 points of F1) in terms of the best classifier performance. To verify the statistical significance, i.e., to check whether this improvement occurred by chance or not, a two-tailed t-test based on three different random selections of training data and the three test datasets was performed. The results show that this improvement is statistically significant at p
