Vietnamese POS Tagging for Social Media Text
Ngo Xuan Bach+*, Nguyen Dieu Linh+, Tu Minh Phuong+*
+ Posts and Telecommunications Institute of Technology, Vietnam
* FPT Software Research Lab, Vietnam
ICONIP2016, Kyoto Japan, October 2016
Part-of-Speech (POS) Tagging
The process of assigning to each word in a sentence the proper POS tag for the context in which it appears
o Input: Book that flight .
o Output: Book/VB that/DT flight/NN ./.
A fundamental task in natural language processing (NLP)
o Provides useful information for many other NLP tasks
  ▪ Word sense disambiguation, text chunking, named entity recognition, syntactic parsing, semantic role labeling, semantic parsing, and so on
Challenges
o How to find POS tags of new words and how to disambiguate multi-sense words
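The word/TAG output format in the example can be read back into (word, tag) pairs with a few lines of Python. This is only an illustrative sketch: the function name parse_tagged is hypothetical, and splitting on the last slash is an assumption made so that a token such as "./." is handled.

```python
def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so "./." becomes (".", ".")
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("Book/VB that/DT flight/NN ./.")
```

Here "Book" receives VB (verb) rather than a noun tag, which is the kind of context-dependent decision a tagger has to make.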
POS Tagging
Has been studied intensively for several decades
o English (Brill, 1995; Ratnaparkhi, 1996; Toutanova et al., 2003)
o Japanese (Nakagawa et al., 2002; Nakagawa and Uchimoto, 2007)
o Arabic (Aldarmaki and Diab, 2015)
o Vietnamese (Nghiem et al., 2008; Tran et al., 2009; Bach et al., 2013)
State-of-the-art POS taggers are statistical or machine learning based models trained on annotated corpora of conventional text
o Penn Treebank for English
o Kyoto corpus for Japanese
o Viet Treebank for Vietnamese
Social Media Text
Web 2.0 platforms such as blogs, forums, wikis, and social networks have facilitated the generation of a huge volume of user-generated text
o Have become an important source for both data mining and NLP communities
o Require appropriate tools for text analysis
POS Tagging for Social Media Text
Social media text poses several challenges
Facebook Sentences | Expected Sentences | Translated Sentences | Problems
Em đọc đc ấy mà a | Em đọc được ấy mà anh | I can read it | abbreviation
Nó good vậyyyy | Nó giỏi vậy | He is so good | foreign word, typo
Ng đàn ông mặc áo đen cơ. :)) | Người đàn ông mặc áo đen cơ. :)) | Must be the guy with a black shirt | abbreviation, emoticon
Toi thich cái màu trắng | Tôi thích cái màu trắng | I like the white one | word without tone mark
A POS tagger developed for conventional, edited text would perform poorly on such noisy data
This Work
We develop a Vietnamese tagging model with various types of linguistic features and a new POS tagset
o empirically show the effectiveness of the method on data from Vietnamese Facebook
We construct an annotated corpus for Vietnamese POS tagging
o consisting of 4150 sentences collected from Facebook
Both annotated corpus and trained POS tagger are made available to the research community
Outline
o Introduction
o Tagging Method
o Experiments
o Summary
A POS Tagset for Social Media Text
We extended the conventional POS tagset to cover the variations of social media text
Annotation Procedure
We extracted textual content of posts and comments from Vietnamese Facebook
Raw text -> Preprocessing -> Automatic tagging -> Manual tagging -> Annotated corpus
o Preprocessing: split sentences, remove noisy sentences
o Automatic tagging: tag sentences using vnTagger
o Manual tagging: two annotators manually correct sentences
We used the Cohen’s kappa coefficient to measure the inter-annotator agreement
o The score was 0.84, which can be interpreted as almost perfect agreement
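For readers unfamiliar with the measure, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration on toy labels, not the authors' evaluation script; the function name and the example tags are assumptions.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(tags_a)
    # Observed agreement: fraction of tokens given the same tag by both annotators.
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement, computed from each annotator's own tag frequencies.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data (not from the corpus): the annotators disagree on one of five tokens.
annotator_1 = ["N", "V", "N", "A", "PUN"]
annotator_2 = ["N", "V", "V", "A", "PUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On the commonly used Landis and Koch scale, values from 0.81 to 1.00 are read as almost perfect agreement, which is the band the reported 0.84 falls into.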
Statistical Information of the Corpus
#sentences | #unique words | #tokens | #tokens/sentence
4150 | 6416 | 38498 | 9.3
POS tag distribution (%, bar chart)
V 19.2 | N 18.46 | PUN 8.86 | R 8.04 | A 6.68 | P 5.89
AB 5.23 | E 3.8 | Np 3.54 | T 3.52 | C 3.17 | M 2.41
HP 2.36 | X 1.75 | I 1.55 | FL 1.31 | Nc 1.15 | L 0.93
CF 0.79 | CC 0.49 | Nu 0.32 | SD 0.3 | IL 0.21 | AR 0.05
Tagging Model
Corpus -> Feature Extraction -> Conditional Random Fields -> Tagging Model
Feature Type | Description
Basic features | unigrams, bigrams, and trigrams of words
Enhanced features | special character, icon or emoticon, digits, capitalization, hashtags and URLs
METAPH feature | used the Metaphone algorithm to create a coarse phonetic normalization of words to simpler keys
GENTAG features | the output (the predicted POS tags) of vnTagger: unigrams, bigrams, trigrams of POS tags
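To make the feature types more concrete, the sketch below builds one feature dictionary per token, a representation that many CRF toolkits can consume. It is a sketch under stated assumptions, not the authors' implementation: the feature names, the one-word context window, the regular expressions, and the small emoticon pattern are all hypothetical. The METAPH key and the vnTagger output are passed in as optional arguments (phonetic_key, base_tags) because neither tool is reproduced here.

```python
import re

URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
HASHTAG_RE = re.compile(r"^#\w+$")
# A deliberately small emoticon pattern; a real icon list would be much longer.
EMOTICON_RE = re.compile(r"^[:;=][-']?[()DPp3]+$")


def token_features(words, i, base_tags=None, phonetic_key=None):
    """Feature dict for words[i]; base_tags and phonetic_key are optional."""
    w = words[i].lower()
    prev_w = words[i - 1].lower() if i > 0 else "<s>"
    next_w = words[i + 1].lower() if i < len(words) - 1 else "</s>"
    feats = {
        # Basic features: word unigram, bigrams and trigram around position i.
        "w0": w,
        "w-1|w0": f"{prev_w}|{w}",
        "w0|w+1": f"{w}|{next_w}",
        "w-1|w0|w+1": f"{prev_w}|{w}|{next_w}",
        # Enhanced features: surface clues typical of social media text.
        "is_url": bool(URL_RE.match(words[i])),
        "is_hashtag": bool(HASHTAG_RE.match(words[i])),
        "is_emoticon": bool(EMOTICON_RE.match(words[i])),
        "has_digit": any(ch.isdigit() for ch in words[i]),
        "is_capitalized": words[i][:1].isupper(),
        "has_special_char": any(not ch.isalnum() for ch in words[i]),
    }
    if phonetic_key is not None:
        # METAPH-style feature: caller supplies the phonetic normalization.
        feats["metaph"] = phonetic_key(words[i])
    if base_tags is not None:
        # GENTAG-style features: tags predicted by an existing tagger.
        prev_t = base_tags[i - 1] if i > 0 else "<s>"
        feats["t0"] = base_tags[i]
        feats["t-1|t0"] = f"{prev_t}|{base_tags[i]}"
    return feats


# Whitespace-style tokens for illustration; real Vietnamese words may span syllables.
words = ["Ng", "đàn", "ông", "mặc", "áo", "đen", "cơ", ".", ":))"]
first = token_features(words, 0)
last = token_features(words, len(words) - 1)
```

The list of per-token dictionaries for a sentence, paired with its gold tags, is the kind of input a linear-chain CRF trainer typically expects.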
Experiments
Experimental Setup
Conducted 10-fold cross-validation
Measured the performance by Accuracy, Precision, Recall, and the F1 score
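The evaluation protocol can be sketched as follows. This is a minimal illustration, not the authors' scripts: the function names are hypothetical, the folds are contiguous and unshuffled by assumption, and precision, recall, and F1 are computed per tag because the averaging scheme is not stated here.

```python
from collections import Counter


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_items // k + (1 if f < n_items % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def evaluate(gold, pred):
    """Token accuracy plus per-tag (precision, recall, F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the gold tag but was missed
    accuracy = sum(tp.values()) / len(gold)
    per_tag = {}
    for tag in set(gold) | set(pred):
        precision = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        recall = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_tag[tag] = (precision, recall, f1)
    return accuracy, per_tag


# 4150 is the corpus size given in the slides; the tags below are toy data.
folds = list(k_fold_indices(4150, k=10))
accuracy, per_tag = evaluate(["N", "V", "N", "A"], ["N", "V", "V", "A"])
```

In each of the ten rounds, a model would be trained on the sentences indexed by train and scored on those indexed by test, with the final figures averaged over the rounds.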
Model | Characteristics
Baseline1 | used the output of vnTagger
Baseline2 | used a list of icons to automatically correct the output of vnTagger
CRF1 | used CRFs with basic features
CRF2 | CRF1 + enhanced features
CRF3 | CRF2 + METAPH features
CRF4 | CRF3 + features from Baseline1
CRF5 | CRF3 + features from Baseline2
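The second baseline, which corrects vnTagger output with a list of icons, can be pictured as a simple post-processing step. The sketch below is an assumption about how such a correction might look: the icon list, the tag name "ICON", and the function name are placeholders, since the authors' actual list and tag are not given in these slides.

```python
# Hypothetical icon list; the authors' actual list is not reproduced here.
ICONS = {":)", ":))", ":(", ":D", ";)", "<3"}
ICON_TAG = "ICON"  # placeholder tag name, not confirmed by the source


def correct_icon_tags(words, tags, icons=ICONS, icon_tag=ICON_TAG):
    """Return a new tag list where every token found in `icons` gets icon_tag."""
    return [icon_tag if w in icons else t for w, t in zip(words, tags)]


# Toy example: a general-purpose tagger labels the emoticon as punctuation.
words = ["Nó", "giỏi", "vậy", ":))"]
base_tags = ["P", "A", "R", "PUN"]
corrected = correct_icon_tags(words, base_tags)
```

A rule of this kind only repairs tokens on the list, which is one reason a learned model with emoticon features may generalize better to unseen variants.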
Tagging Accuracy
[Bar chart: tagging accuracy (%) of Baseline1, Baseline2, CRF1, CRF2, CRF3, CRF4, and CRF5; y-axis from 70 to 90. Bar labels as extracted: 88.16, 83.53, 80.69, 88.26, 83.86, 81.39, 76.99]
Tagging Results in Detail
Confused Tags
Summary
We have developed a Vietnamese part-of-speech tagging system for social media text
o an annotated corpus from Facebook
o outperformed a state-of-the-art Vietnamese POS tagger trained on general text by a large margin
The tagger as well as the annotated data can be useful for further research not only on POS tagging but also other NLP tasks for Vietnamese social media text