Computational Linguistics text mining

Development of Indonesian Natural Language Processing Tools and Its Usage in Text Applications
Ayu Purwarianti

[Word cloud: computational linguistics, text mining, named entity, word sense, syntactic, semantic, information extraction, social media, dialogue, spam filtering, text translation, essay scoring, text categorization, sentiment analysis]

2

Information Extraction

unstructured text → structured text

3

Text Categorization
Giving a label or category to a text/document automatically

• Sentiment analysis. Label: positive, negative, neutral
• Plagiarism detection. Label: plagiarized vs. not plagiarized
• Essay scoring. Label: score

4

Question Answering & Dialogue

A system that returns an answer or a snippet for the user's question

question

answer

5

Indonesia

Source: https://en.wikipedia.org/wiki/Languages_of_Indonesia

6

Languages in Indonesia

Language | Number (millions) | Year surveyed | Main areas where spoken
Indonesian | 250 | 2014 | throughout Indonesia
Javanese | 84.3 | 2000 (census) | Northern Banten, Northern West Java, Yogyakarta, Central Java and East Java
Sundanese | 34.0 | 2000 (census) | West Java, Banten
Madurese | 13.6 | 2000 (census) | Madura Island (East Java)
Minangkabau | 5.5 | 2007 | West Sumatra, Riau
Musi (Palembang Malay) | 3.9 | 2000 (census) | South Sumatra
Manado Malay (Minahasan) | 3.8 | 2001 | Minahasa, North Sulawesi
Bugis | 3.5 | 1991 | South Sulawesi
Banjarese | 3.5 | 2000 (census) | South Kalimantan, East Kalimantan, Central Kalimantan
Betawi | 2.7 | 1993 | Jakarta
etc.

Source: https://en.wikipedia.org/wiki/Languages_of_Indonesia

7

Text is …
• a string
• a list of tokens
• having structure
• representing an intention
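The step from "a string" to "a list of tokens" can be made concrete with a short sketch. The regular expressions below are illustrative assumptions, not the rules used by the tools in these slides; a real Indonesian tokenizer would also need to handle abbreviations, URLs and similar cases.

```python
import re

def split_sentences(text):
    # Naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    # A token is a word (optionally hyphenated, e.g. "kedua-duanya")
    # or a single punctuation mark.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

text = "Kartini lahir di Jepara. Bukunya terkenal!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
```

Keeping reduplicated forms such as "kedua-duanya" as one token is a design choice made here so that a later POS tagger sees them as single words.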

8

Basic Tools in NLP

• Lexical: Tokenization, Sentence Splitter, Stemming, Lemmatization, POS Tagger
• Syntactic: Named Entity Tagger, Phrase Tagger, Parser
• Semantic: Word Sense Disambiguation, Semantic Analysis
• Pragmatic: reference resolution

9

lexical, syntactic, semantic, pragmatic

10

STEMMING FOR INDONESIAN

11

Stemming
• Indonesian Morphological Rules

Morpheme | Affixes | Example words | English
Prefix | me-, di-, be-, pe-, ke-, ter-, se- | membaca, belajar, pekerja | read, study, worker
Infix | -em-, -el-, -er- | jemari, telunjuk, gerigi | finger, index finger, serration
Suffix | -kan, -an, -i, -isme, -isasi | tuliskan, makanan, tandai | write, food, mark
Possessive pronoun | -ku, -mu, -nya | bukuku, bukunya | my book, his/her book
Particle | -lah, -kah, -tah, -pun | bacalah, benarkah, sayapun | Read!, Is it true?, I am too

12

memperadilankan (English: to sue someone)
[pref-1] mem + [pref-2] per + [root] adil + [suff-1] an + [suff-2] kan

kenyataannyalah (English: the truth)
[pref-1] ke + [root] nyata + [suff-1] an + [poss] nya + [particle] lah

13

Stemming with a Rule-Based Approach
• NFA for Indonesian stemming rules

Problems:
- More than one candidate stem: add language-model rules for the stemmed word

14
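A minimal sketch of rule-based affix stripping, in the spirit of the affix order shown for memperadilankan and kenyataannyalah. It is not the NFA from the slides: it tracks a set of possible forms, peeling off particle, possessive, up to two suffixes and up to two prefixes, and keeps the forms found in a root lexicon. The affix lists are abbreviated and the tiny `ROOTS` lexicon is a hypothetical stand-in for a real dictionary.

```python
PARTICLES = ["lah", "kah", "tah", "pun"]
POSSESSIVES = ["ku", "mu", "nya"]
SUFFIXES = ["kan", "an", "i"]
PREFIXES = ["mem", "me", "di", "ber", "be", "per", "pe", "ke", "ter", "se"]
ROOTS = {"adil", "nyata", "baca", "rasa", "asa"}  # hypothetical mini-lexicon

def strip_endings(word, endings):
    # The word itself plus every variant with one listed ending removed.
    out = {word}
    for e in endings:
        if word.endswith(e) and len(word) > len(e) + 1:
            out.add(word[: -len(e)])
    return out

def strip_starts(word, starts):
    # The word itself plus every variant with one listed prefix removed.
    out = {word}
    for s in starts:
        if word.startswith(s) and len(word) > len(s) + 1:
            out.add(word[len(s):])
    return out

def candidates(word):
    forms = {word.lower()}
    # Outermost morphemes first: particle, possessive, suffix-2, suffix-1.
    for endings in (PARTICLES, POSSESSIVES, SUFFIXES, SUFFIXES):
        forms = set().union(*(strip_endings(f, endings) for f in forms))
    # Then prefix-1 and prefix-2.
    for _ in range(2):
        forms = set().union(*(strip_starts(f, PREFIXES) for f in forms))
    return forms

def stem(word):
    # Every candidate that is a known root, longest first.
    return sorted(candidates(word) & ROOTS, key=lambda r: (-len(r), r))

print(stem("memperadilankan"))
print(stem("kenyataannyalah"))
print(stem("perasa"))
```

With this lexicon, "perasa" yields two candidates (pe + rasa and per + asa), which is the "more than one candidate" situation that needs an extra selection step.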

POS TAGGER FOR INDONESIAN

15

POS Tag
• Open class
  • Noun, verb, adjective, adverb
• Closed class words (~ function words)
  • Preposition, determiner, conjunction, pronoun, auxiliary verbs, particles (up, down, on), numerals

Cases:
1. Polysemy
  • One word might have more than one POS tag candidate
2. OOV
  • The word is not in the training data or dictionary

16

POS Tag List

No | POS | POS Name | Example
1 | OP | Open Parenthesis | ( { [
2 | CP | Close Parenthesis | ) } ]
3 | GM | Slash | /
4 | ; | Semicolon | ;
5 | : | Colon | :
6 | “ | Quotation | “ ‘
7 | . | Sentence Terminator | . ! ?
8 | , | Comma | ,
9 | - | Dash | -
10 | … | Ellipsis | …
11 | JJ | Adjective | Kaya, Manis
12 | RB | Adverb | Sementara, Nanti
13 | NN | Common Noun | Mobil
14 | NNP | Proper Noun | Bekasi, Indonesia
15 | NNG | Genitive Noun | Bukunya
16 | VBI | Intransitive Verb | Pergi
17 | VBT | Transitive Verb | Membeli
18 | IN | Preposition | Di, ke, dari
19 | MD | Modal | Bisa
20 | CC | Coor-Conjunction | Dan, atau, tetapi
21 | SC | Subor-Conjunction | Jika, ketika
22 | DT | Determiner | Para, ini, itu
23 | UH | Interjection | Wah, aduh, oi
24 | CDO | Ordinal numeral | Ketiga, keempat
25 | CDC | Collective numeral | Berlima, berempat
26 | CDP | Primary numeral | Satu, sepuluh
27 | CDI | Irregular numeral | Beberapa
28 | PRP | Personal pronouns | Saya, kamu
29 | WP | WH-pronoun | Apa, siapa
30 | PRN | Number Pronoun | Kedua-duanya
31 | PRL | Locative Pronoun | Sini, sana
32 | NEG | Negation | Bukan, tidak
33 | SYM | Symbols | &, %, $
34 | RP | Particles | Pun, kah
35 | FW | Foreign Words | Foreign, computer

17

Alternatives
• Table of words and their POS tags
  • Rules should be added to select the correct POS tag
• Statistical approach
  • Supplemented with affix rules to handle OOV words

18

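HMM Model

[Figure: hidden Markov model. The POS tags are the hidden states X1 to X5, with initial distribution π and transition probabilities a12, a23, a34, a45; each state Xi emits a word Yi with emission probability bii (b11 to b55).]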
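To show how an HMM resolves the polysemy case, here is a minimal Viterbi decoder over POS tags as hidden states and words as observations. It is a sketch only: every probability in the toy model is invented for illustration, and the `floor` value used for unseen events is an assumption rather than a proper smoothing scheme.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence for `words` under a first-order HMM (log space)."""
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (lp(start_p, t) + lp(emit_p[t], words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p], t), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t], words[i]), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model with invented numbers. "bisa" is ambiguous: modal "can" or noun "venom".
TAGS = ["PRP", "MD", "NN", "VBI"]
START = {"PRP": 0.5, "NN": 0.3, "MD": 0.1, "VBI": 0.1}
TRANS = {
    "PRP": {"MD": 0.5, "VBI": 0.3, "NN": 0.1, "PRP": 0.1},
    "MD": {"VBI": 0.7, "NN": 0.1, "PRP": 0.1, "MD": 0.1},
    "NN": {"VBI": 0.4, "NN": 0.3, "MD": 0.2, "PRP": 0.1},
    "VBI": {"NN": 0.4, "PRP": 0.2, "MD": 0.2, "VBI": 0.2},
}
EMIT = {
    "PRP": {"saya": 1.0},
    "MD": {"bisa": 1.0},
    "NN": {"bisa": 0.2, "buku": 0.8},
    "VBI": {"pergi": 1.0},
}

tagged = viterbi(["saya", "bisa", "pergi"], TAGS, START, TRANS, EMIT)
print(tagged)
```

Working in log space avoids numeric underflow on long sentences, which is why the products of probabilities become sums here.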

19

• Affix Tree for OOV
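One possible reading of an affix tree, sketched under stated assumptions: a trie over the first few letters of known words, where each node counts the POS tags of the words passing through it, so an unseen word is tagged by its longest known prefix. The slides do not specify the actual structure; the prefix-only design, the depth of 3 and the small training list are all illustrative choices.

```python
from collections import Counter

class AffixTree:
    """Prefix trie whose nodes count the tags of the words that pass through."""

    def __init__(self):
        self.children = {}
        self.tags = Counter()

    def add(self, word, tag, depth=3):
        node = self
        node.tags[tag] += 1
        for ch in word[:depth]:
            node = node.children.setdefault(ch, AffixTree())
            node.tags[tag] += 1

    def guess(self, word):
        # Follow the word as deep as the trie allows, then return the most
        # frequent tag at that node. Assumes at least one word was added.
        node = self
        for ch in word:
            if ch not in node.children:
                break
            node = node.children[ch]
        return node.tags.most_common(1)[0][0]

# Hypothetical training lexicon; tags follow the tag list in these slides.
tree = AffixTree()
for word, tag in [
    ("membeli", "VBT"), ("membaca", "VBT"), ("menulis", "VBT"),
    ("pergi", "VBI"), ("berlari", "VBI"), ("berlima", "CDC"),
    ("ketiga", "CDO"), ("keempat", "CDO"), ("mobil", "NN"),
]:
    tree.add(word, tag)

print(tree.guess("membawa"))  # unseen word starting with mem-
print(tree.guess("kelima"))   # unseen word starting with ke-
```

A fuller version would presumably also index suffixes such as -kan or -nya, and would combine the guess with the HMM rather than use it alone.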

20

NAMED ENTITY TAGGER FOR INDONESIAN

21

Named Entity Tagger (NER)

Unstructured text

Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID:

SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more ….

Structured Text

computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5

22

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a superimportant shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..

Select Name From PEOPLE Where Organization = 'Microsoft'

Result: Bill Gates, Bill Veghte

23
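The payoff of information extraction is that the extracted rows can be queried like any other table. The sketch below loads the three extracted people into an in-memory SQLite database and runs the query from the slide; the lowercase table and column names are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")

# Rows as they would come out of the named entity and relation extraction step.
rows = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "founder", "Free Software Foundation"),
]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

cursor = conn.execute(
    "SELECT name FROM people WHERE organization = ? ORDER BY name",
    ("Microsoft",),
)
names = [row[0] for row in cursor]
print(names)
conn.close()
```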

Named Entity Tagger (NER)
• Statistical approach, with features:
  • POS tag
  • Language model of NE class
  • Knowledge (rules)
    • Manual
      • List of entities (common)
      • List of entity cues
    • Automatic
      • Lexical surface of current token and previous token
      • Semi supervised (seed data)
• Algorithm/model:
  • Single machine learning algorithm
  • Ensemble machine learning algorithm
  • Different model for each class

24

Collaboration with TU Wien & University of Indonesia

INDONESIAN NER (NAMED ENTITY RECOGNITION) FOR STRIKE

25

Process Flow Strike

Twitter input → Feature Extraction → Named Entity Classification → NE of each tweet → Topic Clustering

• Named entity types:
  • Location, Date, Org-People
• Problems:
  • Size of training data
  • OOV (out of vocabulary)
  • As language independent as possible
• Strategy:
  • Semi supervised
  • Without entity cue words
  • Gazetteer (entity list) taken from available resources

26

Result Example

• Input:
  • Mahasiswa di Jombang Bentrok dengan Polisi Saat Demo Tolak Ijazah Palsu: Aksi unjuk rasa puluhan mahasiswa Uni. http://t.co/7HfJTqyncf
  • English: Students in Jombang clashed with police during a strike rejecting fake diplomas: strike by dozens of university students
• Output:
  • Location: jombang
  • Org-people: mahasiswa (students)
  • Date:

• Input:
  • RT @opikro: Aksi demo mhsswa cirebon,hari ini @frans_surya @af1_ @malakmalakmal @suryadelalu http://t.co/EOhJnyuJpT
  • English: Strike by Cirebon students, today
• Output:
  • Location:
  • Org-people: mahasiswa cirebon (cirebon students)
  • Date: hari ini (today)
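As a point of comparison for the two results above, here is a gazetteer-only baseline. It is not the semi-supervised classifier described in these slides: it only looks up phrases from small entity lists, and the lists themselves are hypothetical.

```python
import re

# Hypothetical mini-gazetteers, one list per named entity type.
GAZETTEER = {
    "Location": ["jombang", "cirebon"],
    "Org-people": ["mahasiswa", "buruh"],
    "Date": ["hari ini", "kemarin"],
}

def tag_entities(tweet):
    # Lowercase, then drop URLs and @mentions before matching.
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    found = {label: [] for label in GAZETTEER}
    for label, phrases in GAZETTEER.items():
        for phrase in phrases:
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                found[label].append(phrase)
    return found

tweet1 = ("Mahasiswa di Jombang Bentrok dengan Polisi Saat Demo Tolak Ijazah "
          "Palsu: Aksi unjuk rasa puluhan mahasiswa Uni. http://t.co/7HfJTqyncf")
tweet2 = ("RT @opikro: Aksi demo mhsswa cirebon,hari ini @frans_surya @af1_ "
          "@malakmalakmal @suryadelalu http://t.co/EOhJnyuJpT")

print(tag_entities(tweet1))
print(tag_entities(tweet2))
```

On the second tweet the lookup misses the abbreviation "mhsswa" and labels "cirebon" as a location on its own, whereas the output shown above groups "mahasiswa cirebon" as Org-people. That gap illustrates the OOV problem and why context features are used instead of lists alone.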

27

Similar Research with Indonesian NER
• Shift of Indonesian Political People
  • Source: Indonesian news articles
  • Collaboration with KITLV and Univ. of Amsterdam
• Shopping Pattern of Indonesian People
  • Source: Twitter, Facebook, Kaskus
• E-Commerce Product Information
  • Source: individual e-commerce sites

28

Complaint Management System (for Bandung City) Collaboration with Bandung City Government and PT. NoLimit

29

Research Background
• Bandung is the city with the 8th-highest Twitter traffic
• The Bandung government has opened a service for citizens to submit their complaints through Twitter

30

Automatic Complaint Management System

Twitter input → Authority Classification → Information Extraction → Complaint Clustering → Summary of Complaints → Monitoring

• Example tweet: ':\'( RT @yudamuhamad21: @urmyfr13nd @infobdg @PRFMnews @dbmpkotabdg di bdg selatan, sawahna tos jd hotel. alus pisan nya mang #infokabbandung'
• Final result:
  • Complaint: sawah jadi hotel (rice field turned into a hotel)
  • Place: bandung selatan (south Bandung)

31

suarabdg.com

• Enhance the response from the authority 32

Analysis of Twitter Text Input
• Characteristics of Twitter text (complaints in Bandung):
  • Abbreviation
  • Non formal terms
  • Regional language
  • Foreign language

Examples:
• Pa punten, lapangan supratman tong janten lahan bangunan
  (English translation: Excuse me, Sir, Supratman field should not become a construction area)
• Kesadaran msyrkt kita u/ buang sampah pd temptnya msh sgt kurang pa @ridwankamil
  (English translation: Our society's awareness of disposing of garbage in its proper place is still very lacking, Mr. @ridwankamil)
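Abbreviations like those in the second example are commonly handled with a normalization dictionary during preprocessing. The sketch below expands the abbreviations seen in that tweet; the dictionary is a hand-written illustration, not a resource from the project, and it does nothing for the regional-language (Sundanese) example.

```python
# Hypothetical abbreviation dictionary covering the example tweet only.
ABBREVIATIONS = {
    "msyrkt": "masyarakat",
    "u/": "untuk",
    "pd": "pada",
    "temptnya": "tempatnya",
    "msh": "masih",
    "sgt": "sangat",
    "pa": "pak",
}

def normalize(tweet):
    out = []
    for token in tweet.split():
        # Leave @mentions and #hashtags untouched.
        if token.startswith(("@", "#")):
            out.append(token)
        else:
            out.append(ABBREVIATIONS.get(token.lower(), token))
    return " ".join(out)

raw = "Kesadaran msyrkt kita u/ buang sampah pd temptnya msh sgt kurang pa @ridwankamil"
normalized = normalize(raw)
print(normalized)
```

Whitespace splitting is enough for this example; tokens with attached punctuation would need the tokenizer to run first.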

33

Analysis of Twitter Text Input (2)

Tweet categories:
• Disagreements to a government policy
• Answer to a question, mostly given by the government agency
• Complaint status (done or not), given by the government agency
• New complaint and new question
• Explanation to a government policy
• Greetings or congratulations
• Supports or agreements to a government policy

Example tweets:
• Untuk pengawasan dilokasi ex-pkl lebih baik dgn cctv, petugas @Satpolppbdg bersiap di t4 lain, ada yg nge-tes langsung tindak. @ridwankamil
  (English translation: For surveillance in the location of former PKL, better be done with CCTV, officers @Satpolppbdg prepare in another location, test and act directly)
• Perlu bu.. ini sesuai dg kurikulum 2013 tematik - integratif RT @EtreeMamito @ridwankamil @disdik_bandung pelajaran kls 2 ... http://t.co/zykpVo6e1b
  (English translation: Necessary, Miss. This is in accordance with the 2013 thematic integrative curriculum RT @EtreeMamito @ridwankamil @disdik_bandung lesson in class two ... http://t.co/zykpVo6e1b)
• @dishub_kotabdg @ridwankamil pak, depan kantor imigrasi suci byk yg parkir (liar) smp bikin macet sparuh jalan stiap hari. #lapor keskian x
  (English translation: @dishub_kotabdg @ridwankamil Mr, a lot of cars park (illegally) in front of suci immigration office that make halfway jammed every day. #reported nth times)
• RT @dadangtibum: @ridwankamil @infobdg @bkdkotabandung Razia PNS di BIP berkeliaran pd jam kerja http://t.co/gIczn5cQFs
  (English translation: RT @dadangtibum: @ridwankamil @infobdg @bkdkotabandung PNS raid on BIP that wandering on working hours)
• @Satpolppbdg : Diinformasikan Sudah ada petugas pemadam kebakaran di lokasi..
  (English translation: Being informed: There has been a fireman in the location)
• RT @jawaracinambo: @ridwankamil @OdedMD @PemkotBandung Selamat Atas Dilantiknya http://t.co/REkGzaZzEY #BandungJuara
  (English translation: RT @jawaracinambo: @ridwankamil @OdedMD @PemkotBandung Congratulations on the inauguration http://t.co/REkGzaZzEY #BandungChampion)
• @Satpolppbdg @dishub_kotabdg @OdedMD @ridwankamil mantap terus lanjutkan penertiban PKL .. BRAVO GOOD JOB !!
  (English translation: @Satpolppbdg @dishub_kotabdg @OdedMD @ridwankamil solid continue controlling PKL .. BRAVO GOOD JOB !!)

34

Authority Classification

Twitter input → Authority Classification → Complaints / Not Complaints

• Complaints → authorities (21):
  • Local staffing agency
  • Fire dept.
  • Information & communication services
  • Youth and sports services
  • Transportation agency
  • Dept. of layout and copyrighted works
  • Local market company services
  • Environmental management agency
  • Dept. of culture & tourism
  • Tax services office
  • Education office
  • Dept. of agriculture and food security
  • Local water company Tirtawening
  • State power company
  • Dept. of highways & irrigation
  • Health dept.
  • Funeral and landscaping services
  • Financial and asset management services
  • Social services
  • Dept. of hygiene
  • Police department

35

Complaint Classification

Twitter input → Preprocessing → Feature Extraction → Classification → Complaints / Not Complaints

• Characteristics of Twitter text in Bandung Complaint Management System
  • Affect the preprocessing component
• Features needed for Authority Classification on Twitter (text content)
• Multi label classification
  • The classification algorithm

36

Complaint Information Extraction

Complaint Twitter input → Feature Extraction → Named Entity Classification → Relation Extraction → Summary of each tweet

• Named entity type:
  • Location, condition, cause, advice, etc.
• Features:
  • Lexical, orthography, clue word, gazetteer, previous word

37

@dishub_kotabdg @ridwankamil pak, depan kantor imigrasi suci byk yg parkir (liar) smp bikin macet sparuh jalan stiap hari. #lapor keskian x
(English translation: @dishub_kotabdg @ridwankamil Sir, many cars park (illegally) in front of the Suci immigration office, jamming half of the road every day. #reported for the nth time)

• Location: depan kantor imigrasi suci (in front of the Suci immigration office)
• Condition: macet (traffic jam)

38

INDONESIAN QUESTION ANSWERING SYSTEM

39

[Figure: User input → Question Analyzer → Passage Retriever → Answer Finder → Answer; the Passage Retriever reads from the Corpus.]

40

QUESTION ANALYZER
Question: Dimana (where) Alexander Graham Bell meninggal (died)?
EAT (expected answer type): LOCATION
Keywords: Alexander, Graham, Bell, meninggal (died)

ANSWER FINDER
Sentence: Alexander Graham Bell dilahirkan (was born) pada 3 Maret 1847 di Edinburgh, Skotlandia, Britania Raya dan meninggal (died) pada 2 Agustus 1922 di Beinn Bhreagh, Nova Scotia, Kanada.
Closest location: Edinburgh, Skotlandia, Britania Raya
Correct answer: Beinn Bhreagh, Nova Scotia, Kanada

41

Answer Finder
• Phrase distance

Answer finder: classifying each candidate phrase using a machine learning algorithm

42
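The Alexander Graham Bell example can be reproduced with a small phrase-distance scorer. This is a sketch under assumptions: the candidate location phrases are given by hand (in the pipeline they would come from the named entity tagger), and distance is taken as the average token distance from each question keyword to the candidate phrase.

```python
def find_span(tokens, phrase):
    # First and last token index of `phrase` inside `tokens`.
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n - 1
    raise ValueError("phrase not found")

def distance(pos, span):
    # Token distance from one keyword position to a phrase span.
    start, end = span
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

def closest_candidate(tokens, keywords, candidates):
    key_pos = [i for i, tok in enumerate(tokens) if tok in keywords]

    def average_distance(candidate):
        span = find_span(tokens, candidate.split())
        return sum(distance(p, span) for p in key_pos) / len(key_pos)

    return min(candidates, key=average_distance)

sentence = ("Alexander Graham Bell dilahirkan pada 3 Maret 1847 di Edinburgh , "
            "Skotlandia , Britania Raya dan meninggal pada 2 Agustus 1922 di "
            "Beinn Bhreagh , Nova Scotia , Kanada .").split()
keywords = {"Alexander", "Graham", "Bell", "meninggal"}
candidates = [
    "Edinburgh , Skotlandia , Britania Raya",
    "Beinn Bhreagh , Nova Scotia , Kanada",
]

answer = closest_candidate(sentence, keywords, candidates)
print(answer)
```

The scorer picks the birthplace, because that phrase sits closer to the keywords than the place of death. This is the failure that motivates classifying candidate phrases with a learned model instead of relying on distance alone.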

INDONESIAN MIND MAP GENERATOR

43

[Figure: Indonesian sentence → POS Tagger → Syntactic Parser (parse tree) → Semantic Analyzer + Reference Resolution (First Order Logic) → MindMap Symbol Generator → MindMap Representation]

Parse tree nodes: S, NP, VP, ADVP. Lexical semantics attached to each word:
• Kartini (NNP): λa object(a,Kartini)
• lahir (was born) (VBI): λb λc event(b,lahir) ∧ agent(b,c)
• di (in) (IN): λd λe event(d) ∧ place(d,e)
• Jepara (NNP): λa object(a,Jepara)

44

• Kartini lahir di Jepara (Kartini was born in Jepara)

lahir kartini

jepara
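The route from the lambda terms to the mind map for "Kartini lahir di Jepara" can be sketched by applying the terms to shared variables and then reading the edges off the resulting formula. The variable names (e1, x1, x2), the order of application and the edge format are assumptions made for this example; the slides do not show how the generator composes the terms.

```python
import re

# Lexical semantics mirroring the lambda terms for each word.
kartini = lambda a: f"object({a},Kartini)"
jepara = lambda a: f"object({a},Jepara)"
lahir = lambda b: lambda c: f"event({b},lahir) ∧ agent({b},{c})"
di = lambda d: lambda e: f"event({d}) ∧ place({d},{e})"

def compose():
    # Assumed binding: one event variable shared by the verb and the
    # preposition, one variable per named object.
    event, subject, place = "e1", "x1", "x2"
    parts = [kartini(subject), lahir(event)(subject), di(event)(place), jepara(place)]
    return " ∧ ".join(parts)

def mind_map_edges(formula):
    objects = dict(re.findall(r"object\((\w+),(\w+)\)", formula))
    events = dict(re.findall(r"event\((\w+),(\w+)\)", formula))
    edges = []
    for role, ev, var in re.findall(r"(agent|place)\((\w+),(\w+)\)", formula):
        # Each role becomes a branch from the event node to an object node.
        edges.append((events[ev], role, objects[var]))
    return edges

formula = compose()
edges = mind_map_edges(formula)
print(formula)
print(edges)
```

The edges put "lahir" at the centre with branches to Kartini and Jepara, which matches the three nodes shown in the mind map above.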

45

Indonesian Text Understanding Evaluation System
• Generate questions and answers from the FOL of an Indonesian sentence
• Compare the reader's answer with the correct generated answer

46

THANK YOU Terima Kasih

ขอบคุณ

ありがとう

47