Development of Indonesian Natural Language Processing Tools and Its Usage in Text Applications
Ayu Purwarianti
Computational Linguistics (word cloud): named entity, word sense, text mining, syntactic, semantic, information extraction, social media, dialogue, spam filtering, text translation, essay scoring, text categorization, sentiment analysis
Information Extraction
unstructured text → structured text
Text Categorization
Giving a label or category to a text/document automatically
• Sentiment analysis. Label: positive, negative, neutral
• Plagiarism detection. Label: plagiarized vs. not plagiarized
• Essay scoring. Label: score
Question Answering & Dialogue
A system able to return an answer or a snippet for a user's question
question → answer
Indonesia
Source: https://en.wikipedia.org/wiki/Languages_of_Indonesia
Languages in Indonesia
Language | Number (millions) | Year surveyed | Main areas where spoken
Indonesian | 250 | 2014 | throughout Indonesia
Javanese | 84.3 | 2000 (census) | Northern Banten, Northern West Java, Yogyakarta, Central Java and East Java
Sundanese | 34.0 | 2000 (census) | West Java, Banten
Madurese | 13.6 | 2000 (census) | Madura Island (East Java)
Minangkabau | 5.5 | 2007 | West Sumatra, Riau
Musi (Palembang Malay) | 3.9 | 2000 (census) | South Sumatra
Manado Malay (Minahasan) | 3.8 | 2001 | Minahasa, North Sulawesi
Bugis | 3.5 | 1991 | South Sulawesi
Banjarese | 3.5 | 2000 (census) | South Kalimantan, East Kalimantan, Central Kalimantan
Betawi | 2.7 | 1993 | Jakarta
etc.
Source: https://en.wikipedia.org/wiki/Languages_of_Indonesia
Text is …
• a string
• a list of tokens
• having structure
• representing an intention
Basic Tools in NLP
• Lexical: Tokenization, Sentence Splitter, Stemming, Lemmatization, POS Tagger
• Syntactic: Named Entity Tagger, Phrase Tagger, Parser
• Semantic: Word Sense Disambiguation, Semantic Analysis
• Pragmatic: Reference Resolution
lexical → syntactic → semantic → pragmatic
STEMMING FOR INDONESIAN
Stemming
• Indonesian Morphological Rules
Morpheme | Forms | Examples | English
Prefix | me-, di-, be-, pe-, ke-, ter-, se- | membaca, belajar, pekerja | read, study, worker
Infix | -em-, -el-, -er- | jemari, telunjuk, gerigi | finger, index finger, serration
Suffix | -kan, -an, -i, -isme, -isasi | tuliskan, makanan, tandai | write, food, mark
Possessive pronoun | -ku, -mu, -nya | bukuku, bukunya | my book, his/her book
Particle | -lah, -kah, -tah, -pun | bacalah, benarkah, sayapun | Read!, Is it true?, I am too
memperadilankan (English: to sue someone)
[pref-1] mem + [pref-2] per + [root] adil + [suff-1] an + [suff-2] kan
kenyataannyalah (English: the truth)
[pref-1] ke + [root] nyata + [suff-1] an + [poss] nya + [particle] lah
Stemming with Rule-Based Approach
• NFA for Indonesian stemming rules
Problems:
- More than one candidate: add rules or a language model of stemmed words
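A minimal sketch of rule-based stemming, to make the candidate problem concrete. It is not the NFA from the slide: it strips affixes in a fixed order (particle, possessive, suffixes, prefixes) and collects every intermediate form as a candidate. The `ROOTS` dictionary, the extra allomorphs `mem`, `ber`, `per`, and the "longest root in the dictionary wins" tie-break are all assumptions made for illustration, standing in for the rules or language model mentioned above.

```python
# Affix lists follow the morphology table; "mem", "ber", "per" are assumed
# allomorphs added so that the slide examples resolve.
PARTICLES = ("lah", "kah", "tah", "pun")
POSSESSIVES = ("ku", "mu", "nya")
SUFFIXES = ("kan", "an", "i")
PREFIXES = ("mem", "me", "di", "ber", "be", "per", "pe", "ke", "ter", "se")

# Hypothetical root dictionary, just large enough for the examples.
ROOTS = {"adil", "nyata", "baca", "buku", "kerja", "tulis", "makan", "tanda"}


def strip_ending(word, endings):
    """Return the word plus every variant with one listed ending removed."""
    variants = {word}
    for ending in endings:
        if word.endswith(ending) and len(word) > len(ending) + 1:
            variants.add(word[: -len(ending)])
    return variants


def strip_prefix(word):
    """Return the word plus every variant with one listed prefix removed."""
    variants = {word}
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 1:
            variants.add(word[len(prefix):])
    return variants


def candidates(word):
    """Collect all stem candidates reachable by the stripping rules."""
    forms = {word.lower()}
    # Suffix side, outermost first: particle, possessive, up to two suffixes.
    for endings in (PARTICLES, POSSESSIVES, SUFFIXES, SUFFIXES):
        forms = set().union(*(strip_ending(f, endings) for f in forms))
    # Prefix side: up to two stacked prefixes.
    for _ in range(2):
        forms = set().union(*(strip_prefix(f) for f in forms))
    return forms


def stem(word, roots=ROOTS):
    """Pick the longest candidate found in the root dictionary (assumed rule)."""
    found = [c for c in candidates(word) if c in roots]
    return max(found, key=len) if found else word.lower()


print(stem("memperadilankan"), stem("kenyataannyalah"), stem("makanan"))
```

For "makanan" the rules produce several candidates (makanan, makan, mak, ma); only the dictionary lookup selects "makan", which is the ambiguity the slide points at.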
POS TAGGER FOR INDONESIAN
POS Tag
• Open class
  • Noun, verb, adjective, adverb
• Closed class words (~ function words)
  • Preposition, determiner, conjunction, pronoun, auxiliary verbs, particles (up, down, on), numerals
Cases:
1. Polysemy
  • One word might have more than one POS tag candidate
2. OOV
  • The word is not in the training data or dictionary
POS Tag List
No | POS | POS Name | Example
1 | OP | Open Parenthesis | ({[
2 | CP | Close Parenthesis | )}]
3 | GM | Slash | /
4 | ; | Semicolon | ;
5 | : | Colon | :
6 | " | Quotation | " '
7 | . | Sentence Terminator | .!?
8 | , | Comma | ,
9 | - | Dash | -
10 | … | Ellipsis | …
11 | JJ | Adjective | Kaya, Manis
12 | RB | Adverb | Sementara, Nanti
13 | NN | Common Noun | Mobil
14 | NNP | Proper Noun | Bekasi, Indonesia
15 | NNG | Genitive Noun | Bukunya
16 | VBI | Intransitive Verb | Pergi
17 | VBT | Transitive Verb | Membeli
18 | IN | Preposition | Di, ke, dari
19 | MD | Modal | Bisa
20 | CC | Coor-Conjunction | Dan, atau, tetapi
21 | SC | Subor-Conjunction | Jika, ketika
22 | DT | Determiner | Para, ini, itu
23 | UH | Interjection | Wah, aduh, oi
24 | CDO | Ordinal numeral | Ketiga, keempat
25 | CDC | Collective numeral | Berlima, berempat
26 | CDP | Primary numeral | Satu, sepuluh
27 | CDI | Irregular numeral | Beberapa
28 | PRP | Personal pronouns | Saya, kamu
29 | WP | WH-pronoun | Apa, siapa
30 | PRN | Number Pronoun | Kedua-duanya
31 | PRL | Locative Pronoun | Sini, sana
32 | NEG | Negation | Bukan, tidak
33 | SYM | Symbols | &, %, $
34 | RP | Particles | Pun, kah
35 | FW | Foreign Words | Foreign, computer
Alternatives
• Table of words and their POS tags
  • Rules should be added to select the correct POS tag
• Statistical approach
  • Affix rules are added to handle OOV
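HMM Model
[Figure: hidden Markov model. The hidden states X1 … X5 are POS tags, linked by the initial probability π and the transition probabilities a12, a23, a34, a45. Each state Xi emits an observed word Yi with emission probability bii.]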
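A minimal sketch of how an HMM tagger decodes a sentence with the Viterbi algorithm. The probabilities below are invented toy values, not trained estimates, and the `floor` smoothing for unseen events is an assumption. The example reuses the polysemy case: "bisa" can be a modal (MD, "can") or a common noun (NN, "venom").

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under an HMM.

    start_p[tag], trans_p[(prev_tag, tag)] and emit_p[(tag, word)] are
    probabilities; missing entries fall back to `floor` (assumed smoothing).
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Backtrace from the best final state.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (hypothetical numbers).
TAGS = ["PRP", "MD", "NN", "VBI"]
START = {"PRP": 0.7, "NN": 0.2}
TRANS = {("PRP", "MD"): 0.5, ("PRP", "NN"): 0.1,
         ("MD", "VBI"): 0.8, ("NN", "VBI"): 0.3}
EMIT = {("PRP", "saya"): 0.9, ("MD", "bisa"): 0.5,
        ("NN", "bisa"): 0.1, ("VBI", "pergi"): 0.6}

tagged = viterbi(["saya", "bisa", "pergi"], TAGS, START, TRANS, EMIT)
print(tagged)
```

With these toy values the transition PRP → MD outweighs PRP → NN, so "bisa" is tagged as a modal in "saya bisa pergi" (I can go).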
• Affix Tree for OOV
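One possible reading of an affix tree for OOV words, sketched below: a trie over the first few characters of the training words, where each node keeps tag counts. An unknown word is tagged with the most frequent tag at the deepest node its prefix reaches. The class name `AffixTree`, the depth limit of three characters, and the training pairs are assumptions for illustration; the slide does not specify the structure.

```python
from collections import Counter


class AffixTree:
    """Prefix trie that stores POS tag counts at every node (illustrative)."""

    def __init__(self, max_len=3):
        self.max_len = max_len
        self.root = {"counts": Counter(), "children": {}}

    def add(self, word, tag):
        """Record `tag` along the path spelled by the first characters of `word`."""
        node = self.root
        node["counts"][tag] += 1
        for ch in word.lower()[: self.max_len]:
            node = node["children"].setdefault(
                ch, {"counts": Counter(), "children": {}}
            )
            node["counts"][tag] += 1

    def guess(self, word):
        """Follow the longest known prefix and return its most frequent tag."""
        node = self.root
        for ch in word.lower()[: self.max_len]:
            child = node["children"].get(ch)
            if child is None:
                break
            node = child
        return node["counts"].most_common(1)[0][0]


# Hypothetical training pairs using tags from the tag list.
tree = AffixTree()
for word, tag in [("membeli", "VBT"), ("membaca", "VBT"), ("memukul", "VBT"),
                  ("berlari", "VBI"), ("berjalan", "VBI"),
                  ("mobil", "NN"), ("meja", "NN")]:
    tree.add(word, tag)

print(tree.guess("membawa"), tree.guess("bermain"))
```

Neither "membawa" nor "bermain" is in the training pairs; the prefixes "mem" and "ber" carry the guess. `guess` assumes at least one word has been added.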
NAMED ENTITY TAGGER FOR INDONESIAN
Named Entity Tagger (NER)
Unstructured text
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID:
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more ….
Structured Text
computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a superimportant shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
PEOPLE
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
Select Name From PEOPLE Where Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
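Once the entities are extracted into a table, the query on the slide can be run as ordinary SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column types are an assumption, and the third organization name is taken from the news text itself.

```python
import sqlite3

# Rows as they would come out of the extraction step.
rows = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "founder", "Free Software Foundation"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", rows)

# Same query as on the slide, with the organization passed as a parameter.
cursor = conn.execute(
    "SELECT Name FROM PEOPLE WHERE Organization = ?", ("Microsoft",)
)
result = sorted(name for (name,) in cursor)
print(result)
```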
Named Entity Tagger (NER)
• Statistical approach, with features:
  • POS tag
  • Language model of NE class
  • Knowledge (rules)
    • Manual
      • List of entities (common)
      • List of entity cues
    • Automatic
      • Lexical surface of current token and previous token
      • Semi-supervised (seed data)
• Algorithm/model:
  • Single machine learning algorithm
  • Ensemble machine learning algorithm
  • Different model for each class
Collaboration with TU Wien & University of Indonesia
INDONESIAN NER (NAMED ENTITY RECOGNITION) FOR STRIKE
Process Flow
Strike Twitter input → Feature Extraction → Named Entity Classification → NE of each tweet → Topic Clustering
• Named Entity Types:
  • Location, Date, Org-People
• Problems:
  • Size of training data
  • OOV (Out of Vocabulary)
  • As language independent as possible
• Strategy:
  • Semi-supervised
  • Without entity cue words
  • Gazetteer (entity list) taken from available resources
Result Example
• Input:
  • Mahasiswa di Jombang Bentrok dengan Polisi Saat Demo Tolak Ijazah Palsu: Aksi unjuk rasa puluhan mahasiswa Uni. http://t.co/7HfJTqyncf
  • English: Students in Jombang clashed with police during a strike rejecting fake diplomas: a strike by dozens of university students
• Output:
  • Location: jombang
  • Org-people: mahasiswa (students)
  • Date:
• Input:
  • RT @opikro: Aksi demo mhsswa cirebon,hari ini @frans_surya @af1_ @malakmalakmal @suryadelalu http://t.co/EOhJnyuJpT
  • English: Strike of Cirebon students, today
• Output:
  • Location:
  • Org-people: mahasiswa cirebon (cirebon students)
  • Date: hari ini (today)
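A minimal sketch of the gazetteer part of the strategy, applied to the first example tweet. It is a plain lookup, not the semi-supervised classifier described in the slides, and the three small entity lists are hypothetical stand-ins for gazetteers taken from available resources.

```python
import re

# Hypothetical gazetteers (entity lists).
LOCATIONS = {"jombang", "cirebon", "bandung"}
ORG_PEOPLE = {"mahasiswa"}
DATE_PHRASES = ("hari ini", "kemarin", "besok")


def tokenize(tweet):
    """Lowercase, drop URLs and @mentions, keep alphanumeric tokens."""
    cleaned = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z0-9]+", cleaned)


def tag_tweet(tweet):
    """Return the entities found by gazetteer lookup, without duplicates."""
    tokens = tokenize(tweet)
    text = " ".join(tokens)
    return {
        "Location": list(dict.fromkeys(t for t in tokens if t in LOCATIONS)),
        "Org-people": list(dict.fromkeys(t for t in tokens if t in ORG_PEOPLE)),
        "Date": [p for p in DATE_PHRASES if p in text],
    }


tweet = ("Mahasiswa di Jombang Bentrok dengan Polisi Saat Demo Tolak Ijazah "
         "Palsu: Aksi unjuk rasa puluhan mahasiswa Uni. http://t.co/7HfJTqyncf")
entities = tag_tweet(tweet)
print(entities)
```

A lookup like this cannot group multi-word entities such as "mahasiswa cirebon" or expand abbreviations such as "mhsswa", which is where the learned classifier and the OOV handling come in.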
Similar Research with Indonesian NER
• Shift of Indonesian Political People
  • Source: Indonesian news articles
  • Collaboration with KITLV and Univ. of Amsterdam
• Shopping Pattern of Indonesian People
  • Source: Twitter, Facebook, Kaskus
• E-Commerce Product Information
  • Source: individual e-commerce sites
Complaint Management System (for Bandung City)
Collaboration with Bandung City Government and PT. NoLimit
Research Background
• Bandung ranks 8th among cities with the highest Twitter traffic
• The Bandung government opened a service for citizens to submit complaints through Twitter
Automatic Complaint Management System
Twitter input → Authority Classification → Information Extraction → Complaint Clustering → Summary of Complaints → Monitoring
• :'( RT @yudamuhamad21: @urmyfr13nd @infobdg @PRFMnews @dbmpkotabdg di bdg selatan, sawahna tos jd hotel. alus pisan nya mang #infokabbandung
• Final result:
  • Complaint: sawah jadi hotel (rice field turned into a hotel)
  • Location: bandung selatan (south Bandung)
suarabdg.com
• Enhance the response from the authority
Analysis of Twitter Text Input
• Characteristics of Twitter text (complaints in Bandung):
  • Abbreviation
  • Informal terms
  • Regional language
  • Foreign language
Pa punten, lapangan supratman tong janten lahan bangunan
(English translation: Excuse me, Sir, Supratman field should not become a construction area)
Kesadaran msyrkt kita u/ buang sampah pd temptnya msh sgt kurang pa @ridwankamil
(English translation: Our society's awareness of disposing of garbage in its place is still very lacking, Mr. @ridwankamil)
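The abbreviation problem is typically handled in preprocessing with a normalization dictionary. The sketch below is a hypothetical, hand-made dictionary covering only the abbreviations in the second example tweet; a real system would need a much larger list plus handling for regional and foreign words.

```python
# Hypothetical normalization dictionary: abbreviation -> standard Indonesian form.
ABBREVIATIONS = {
    "msyrkt": "masyarakat",
    "u/": "untuk",
    "pd": "pada",
    "temptnya": "tempatnya",
    "msh": "masih",
    "sgt": "sangat",
    "pa": "pak",
}


def normalize(tweet):
    """Expand known abbreviations token by token; leave @mentions untouched."""
    out = []
    for token in tweet.split():
        if token.startswith("@"):
            out.append(token)
        else:
            out.append(ABBREVIATIONS.get(token.lower(), token))
    return " ".join(out)


raw = ("Kesadaran msyrkt kita u/ buang sampah pd temptnya msh sgt kurang "
       "pa @ridwankamil")
print(normalize(raw))
```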
Analysis of Twitter Text Input (2)
• Disagreement with a government policy
• Answer to a question, mostly given by the government agency
• Complaint status (done or not), given by the government agency
• New complaint and new question
• Explanation of a government policy
• Greetings or congratulations
• Support for or agreement with a government policy
Example tweets:
Untuk pengawasan dilokasi ex-pkl lebih baik dgn cctv, petugas @Satpolppbdg bersiap di t4 lain, ada yg nge-tes langsung tindak. @ridwankamil
(English translation: For surveillance in the location of former PKL, better to be done with CCTV, @Satpolppbdg officers prepare in another location, test and act directly)
Perlu bu.. ini sesuai dg kurikulum 2013 tematik - integratif RT @EtreeMamito @ridwankamil @infobdg @disdik_bandung sparuh pelajaran kls 2 ... http://t.co/zykpVo6e1b
(English translation: Necessary, Miss. This is in accordance with the 2013 thematic integrative curriculum RT @EtreeMamito @ridwankamil @infobdg @disdik_bandung halfway lesson in class two ... http://t.co/zykpVo6e1b)
@dishub_kotabdg @ridwankamil pak, depan kantor imigrasi suci byk yg parkir (liar) smp bikin macet sparuh jalan stiap hari. #lapor keskian x
(English translation: @dishub_kotabdg @ridwankamil Sir, many cars park (illegally) in front of the Suci immigration office, jamming half the road every day. #reported for the umpteenth time)
RT @dadangtibum: @ridwankamil @bkdkotabandung Razia PNS di BIP berkeliaran pd jam kerja http://t.co/gIczn5cQFs
(English translation: RT @dadangtibum: @ridwankamil @bkdkotabandung Raid the PNS wandering around BIP during working hours)
Diinformasikan @Satpolppbdg : Sudah ada petugas pemadam kebakaran di lokasi..
(English translation: Being informed: firefighters are already at the location)
RT @jawaracinambo: @ridwankamil @OdedMD @PemkotBandung Selamat Atas Dilantiknya http://t.co/REkGzaZzEY #BandungJuara
(English translation: RT @jawaracinambo: @ridwankamil @OdedMD @PemkotBandung Congratulations on the inauguration http://t.co/REkGzaZzEY #BandungChampion)
@Satpolppbdg @dishub_kotabdg @OdedMD @ridwankamil mantap terus lanjutkan penertiban PKL .. BRAVO GOOD JOB !!
(English translation: @Satpolppbdg @dishub_kotabdg @OdedMD @ridwankamil solid, continue controlling PKL .. BRAVO GOOD JOB !!)
Authority Classification
Twitter input → Authority Classification → Complaints / Not Complaints
• Complaint authorities (21)
  • Local staffing agency
  • Fire dept.
  • Information & communication services
  • Youth and sports services
  • Transportation agency
  • Dept. of layout and copyrighted works
  • Local market company services
  • Environmental management agency
  • Dept. of culture & tourism
  • Tax services office
  • Education office
  • Dept. of agriculture and food security
  • Local water company Tirtawening
  • State power company
  • Dept. of highways & irrigation
  • Health dept.
  • Funeral and landscaping services
  • Financial and asset management services
  • Social services
  • Dept. of hygiene
  • Police department
Complaint Classification
Twitter input → Preprocessing → Feature Extraction → Classification → Complaints / Not Complaints
• Characteristics of Twitter text in the Bandung Complaint Management System
  • Affect the preprocessing component
• Features needed for authority classification on Twitter (text content)
• Multi-label classification
  • The classification algorithm
Complaint Information Extraction
Complaint Twitter input → Feature Extraction → Named Entity Classification → Relation Extraction → Summary of each tweet
• Named entity types:
  • Location, condition, cause, advice, etc.
• Features:
  • Lexical, orthography, clue word, gazetteer, previous word
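A minimal sketch of what per-token feature extraction with these five feature groups might look like. The feature names, the clue-word list, and the gazetteer below are hypothetical; the slides name the feature groups but not their exact definitions.

```python
# Hypothetical resources for illustration only.
CLUE_WORDS = {"depan", "di", "jalan", "jl"}          # words that often precede a location
GAZETTEER = {"suci", "supratman", "bandung"}         # known place names


def token_features(tokens, i):
    """Build a feature dictionary for tokens[i], ready for a classifier."""
    token = tokens[i]
    previous = tokens[i - 1] if i > 0 else "<s>"
    return {
        # Lexical: the lowercased surface form.
        "lexical": token.lower(),
        # Orthography: simple shape cues.
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_mention": token.startswith("@"),
        # Clue word: the token itself or its left neighbour is a cue.
        "is_clue_word": token.lower() in CLUE_WORDS,
        "prev_is_clue_word": previous.lower() in CLUE_WORDS,
        # Gazetteer lookup.
        "in_gazetteer": token.lower() in GAZETTEER,
        # Previous word.
        "previous_word": previous.lower(),
    }


tokens = ["pak", "depan", "kantor", "imigrasi", "suci"]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[2])
```

Each dictionary can be fed to any classifier that accepts sparse named features; which classifier the system uses is not stated here.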
@dishub_kotabdg @ridwankamil pak, depan kantor imigrasi suci byk yg parkir (liar) smp bikin macet sparuh jalan stiap hari. #lapor keskian x
(English translation: @dishub_kotabdg @ridwankamil Sir, many cars park (illegally) in front of the Suci immigration office, jamming half the road every day. #reported for the umpteenth time)
• Location: depan kantor imigrasi suci (in front of the Suci immigration office)
• Condition: macet (traffic jam)
INDONESIAN QUESTION ANSWERING SYSTEM
User input → Question Analyzer → Passage Retriever → Answer Finder → Answer
Corpus → Passage Retriever
QUESTION ANALYZER
Question: Dimana (where) Alexander Graham Bell meninggal (died)?
EAT (expected answer type): LOCATION
Keywords: Alexander, Graham, Bell, meninggal (died)
ANSWER FINDER
Sentence: Alexander Graham Bell dilahirkan (was born) pada 3 Maret 1847 di Edinburgh, Skotlandia, Britania Raya dan meninggal (died) pada 2 Agustus 1922 di Beinn Bhreagh, Nova Scotia, Kanada.
Closest location: Edinburgh, Skotlandia, Britania Raya
Correct answer: Beinn Bhreagh, Nova Scotia, Kanada
Answer Finder
• Phrase distance
Answer finder: classifying the phrase using a machine learning algorithm
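A minimal sketch of why phrase distance alone is not enough, using the Alexander Graham Bell sentence. The distance definition (token gap between a candidate span and each keyword, summed) and the `follows_verb` feature are assumptions chosen for illustration; the candidate spans are given by hand rather than produced by a named entity tagger.

```python
TOKENS = ("Alexander Graham Bell dilahirkan pada 3 Maret 1847 di Edinburgh "
          "Skotlandia Britania Raya dan meninggal pada 2 Agustus 1922 di "
          "Beinn Bhreagh Nova Scotia Kanada").split()

# Candidate LOCATION phrases as (first token index, last token index).
CANDIDATES = {
    "Edinburgh, Skotlandia, Britania Raya": (9, 12),
    "Beinn Bhreagh, Nova Scotia, Kanada": (20, 24),
}
KEYWORDS = ["Alexander", "Graham", "Bell", "meninggal"]


def span_distance(span, position):
    """Token gap between a span and a single position (0 if inside the span)."""
    start, end = span
    if position < start:
        return start - position
    if position > end:
        return position - end
    return 0


def total_distance(span, tokens=TOKENS, keywords=KEYWORDS):
    """Sum of the gaps between the span and every keyword occurrence."""
    positions = [i for i, tok in enumerate(tokens) if tok in keywords]
    return sum(span_distance(span, p) for p in positions)


def closest_candidate(candidates=CANDIDATES):
    """Baseline answer finder: pick the candidate with the smallest distance."""
    return min(candidates, key=lambda name: total_distance(candidates[name]))


def candidate_features(span, verb="meninggal", tokens=TOKENS):
    """Features a learned classifier could use instead of raw distance."""
    verb_position = tokens.index(verb)
    return {
        "total_distance": total_distance(span),
        "distance_to_verb": span_distance(span, verb_position),
        "follows_verb": span[0] > verb_position,
    }


print(closest_candidate())
for name, span in CANDIDATES.items():
    print(name, candidate_features(span))
```

The baseline returns the Edinburgh phrase, matching the "Closest location" on the slide, while the correct answer is the only candidate that follows the verb "meninggal". A classifier trained on features like these can learn that preference.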
INDONESIAN MIND MAP GENERATOR
Indonesian sentence → POS Tagger → Syntactic Parser → Parse tree → Semantic Analyzer + Reference Resolution → First Order Logic → MindMap Symbol Generator → MindMap Representation
Parse tree:
(S (NP (NNP Kartini)) (VP (VP (VBI lahir)) (ADVP (IN di) (NP (NNP Jepara)))))
Semantics of each word:
Kartini: λa object(a,Kartini)
lahir (was born): λb λc event(b,lahir) ∧ agent(b,c)
di (in): λd λe event(d) ∧ place(d,e)
Jepara: λa object(a,Jepara)
• Kartini lahir di Jepara (Kartini was born in Jepara)
Mind map nodes: lahir, kartini, jepara
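A heavily simplified sketch of the path from a tagged sentence to logic facts and then to mind map nodes, using the Kartini example. It does not perform lambda-calculus composition over a parse tree; it is a hand-written pattern for "proper noun, verb, di, proper noun" sentences only, and the function names and the fact format are assumptions.

```python
def to_facts(tagged):
    """Turn a POS-tagged sentence into (predicate, event, argument) facts.

    Handles only the pattern: proper noun, verb, preposition "di", proper noun.
    """
    facts = []
    event = None
    subject = None
    pending_preposition = None
    for word, tag in tagged:
        w = word.lower()
        if tag == "NNP" and event is None:
            subject = w                      # noun before the verb: the agent
        elif tag in ("VBI", "VBT"):
            event = "e1"
            facts.append(("event", event, w))
            if subject is not None:
                facts.append(("agent", event, subject))
        elif tag == "IN":
            pending_preposition = w
        elif tag == "NNP" and pending_preposition == "di":
            facts.append(("place", event, w))
            pending_preposition = None
    return facts


def to_mind_map(facts):
    """Use the event word as the central node and link each role to it."""
    center = next(arg for pred, _, arg in facts if pred == "event")
    branches = [(pred, arg) for pred, _, arg in facts if pred != "event"]
    return {center: branches}


tagged = [("Kartini", "NNP"), ("lahir", "VBI"), ("di", "IN"), ("Jepara", "NNP")]
facts = to_facts(tagged)
mind_map = to_mind_map(facts)
print(facts)
print(mind_map)
```

The facts correspond to event(e1,lahir) ∧ agent(e1,kartini) ∧ place(e1,jepara); placing "lahir" at the center of the map is a choice made in this sketch.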
Indonesian Text Understanding Evaluation System
• Generate questions and answers from the FOL of an Indonesian sentence
• Compare the reader's answer with the correct generated answer
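A minimal sketch of both steps, assuming the logic form of "Kartini lahir di Jepara" is available as simple facts. The question templates ("Dimana …?", "Siapa yang …?") and the case-insensitive exact-match comparison are assumptions for illustration, not the system's actual templates or scoring.

```python
# Assumed fact format: (predicate, event id, argument).
FACTS = [("event", "e1", "lahir"), ("agent", "e1", "kartini"), ("place", "e1", "jepara")]


def generate_qa(facts):
    """Create (question, answer) pairs by hiding one role of the event at a time."""
    event = next(arg for pred, _, arg in facts if pred == "event")
    roles = {pred: arg for pred, _, arg in facts if pred != "event"}
    pairs = []
    if "agent" in roles and "place" in roles:
        # Hide the place: a "where" question.
        pairs.append((f"Dimana {roles['agent']} {event}?", roles["place"]))
        # Hide the agent: a "who" question.
        pairs.append((f"Siapa yang {event} di {roles['place']}?", roles["agent"]))
    return pairs


def is_correct(reader_answer, expected):
    """Compare the reader's answer with the generated answer, ignoring case."""
    return reader_answer.strip().lower() == expected.lower()


qa_pairs = generate_qa(FACTS)
for question, answer in qa_pairs:
    print(question, "->", answer)
print(is_correct(" Jepara ", qa_pairs[0][1]))
```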
THANK YOU Terima Kasih
ขอบคุณ
ありがとう