From Dialogue Corpora to Dialogue Systems ...

September 14th 2016. Callejas-Rodríguez, A., Villatoro-Tello E. (et al.) A Teenager Chatbot for Preventing Cyber-Pedophilia. TSD 2016, Brno, Czech Republic.
Introduction

Grooming Attack

Proposed method

Experimental setup

Results

Conclusions and Future work

From Dialogue Corpora to Dialogue Systems: Generating a Chatbot with Teenager Personality for Preventing Cyber-Pedophilia

Ángel Callejas-Rodríguez (1), Esaú Villatoro-Tello (1), Ivan Meza (2) and Gabriela Ramírez-de-la-Rosa (1)

(1) Language and Reasoning (LyR) Research Group, Information Technologies Dept., UAM, México
(2) Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, UNAM, México

TSD, Brno, Czech Republic. September 14th 2016

Outline

1. Introduction
2. Grooming Attack
3. Proposed method
4. Experimental setup
5. Results
6. Conclusions and Future work

Problem statement

Introduction

The Internet in our everyday life:
- Allows users to easily access large amounts of information
- Makes multiple communication services freely available
- Exposes users to a large number of illegal activities

Communication platforms

Popular services on the Internet

- Some of the most popular services among Internet users are instant messaging and social platforms: Twitter, Facebook, Skype, G+, etc.
- According to recent studies, children and teenagers are becoming very active users [3]

These services are very attractive because they offer many advantages:
- They are fast, cheap and virtual by nature
- They allow users to hide their real identity

3. Pew Research Center: http://www.pewinternet.org/2015/04/09/teens-social-mediatechnology-2015/

Vulnerability

What about the security of users ?

- A specific and very frequent type of cyber-crime is the grooming attack
- Sexual predators or pedophiles take advantage of the anonymity provided by these messaging services
- According to some international organizations, child grooming is one of the illegal acts that has become more common in recent years

Grooming Attack/Child Grooming

Definition

Child grooming: A communication process by which a perpetrator applies affinity seeking strategies, while simultaneously engaging in sexual desensitization and information acquisition about targeted victims in order to develop relationships that result in need fulfillment (e.g. physical sexual molestation) [4]

4. C. Harms. 2007. Grooming: An operational definition and coding scheme. Sex Offender Law Report, 8(1):1–6.

Grooming Attack/Child Grooming

Official statistics

According to the National Center for Missing & Exploited Children [5] and the Office of Juvenile Justice and Delinquency Prevention [6]:
- One in five children is contacted by an offender
- In 100% of the cases of child abuse, the victims meet the offender voluntarily
- About 16% of teenagers have considered meeting someone in person after chatting with them just once; of these, 8% have done so
- 75% of children are willing to share personal information with strangers in exchange for some benefit, for instance a payment

In addition, the FBI estimates that there are about 750,000 sexual predators on-line around the world at any given moment

5. http://www.missingkids.com
6. www.ojjdp.gov

Traditional approach

Sexual Predators Identification (SPI)

- Currently, sexual predators are caught by police officers or volunteers who pose as teenagers in chat rooms and provoke sexual predators to approach them
- Since 2004, the Perverted Justice organization (www.perverted-justice.com) has captured more than 600 predators
- The Terre des Hommes initiative (http://www.terredeshommesnl.org/en) managed to identify thousands of predators within months of work

Identified problems

Disadvantages of the traditional approach

- Police officers and volunteers will never be sufficient to patrol all Internet traffic
- Currently, there is no fully automatic system for preventing and detecting sexual predators (either on-line or off-line)
- The current approach is a forensic technique, i.e., a police officer has to review thousands of lines of text in order to accurately identify a sexual offender
- Given its nature, this approach is prone to errors, since spending long periods in front of a computer posing as a victim can be very upsetting

At the identification phase

Analysing and classifying SPI

During the last decade several approaches have been proposed for solving the problem of SPI:
- Identification of predatory chat lines
- Classification of predatory chat conversations
- Identification of the offender and the victim

In general, two main research lines have been followed:
1. Representation (features) of the data: BOW, psycholinguistic features and complex behavioural attributes
2. Learning algorithms: chained classifiers, semi-supervised learning and one-class learning algorithms

Assisting police officers

Towards the construction of chatbots

- Most of the work on SPI has focused on analysing and classifying chat conversations by means of text mining strategies
- Recently, Laorden, C. et al. (2013) proposed a chatter-bot called Negobot
- Negobot is based on game theory
- Negobot was designed for Spanish; however, its knowledge base comes from the PJ (Perverted Justice) website
- A main limitation is the use of common slang

Main goals

Research questions

- Our main goal is to provide automatic tools that can assist officers in the process of SPI
- To this end, we develop a conversational agent that interacts with human users via natural language

Thus, our main research questions are:
1. To what extent is it possible to automatically extract conversational rules from real dialogue corpora?
2. How likely is it that such a chatter bot behaves as a teenager does within a chatroom?

System architecture

General architecture

The corpus

Building the conversational model

The chatbot model

Evaluation

Built corpus

Corpus

Statistics regarding the collected corpora

Number of ...                        | Original data | Filtered data
Users                                | 1300          | 1300
Avg. interventions (per user)        | 782           | 234
Avg. length per intervention (words) | 10            | 7
Vocabulary size (distinct tokens)    | 711854        | 186857

The built corpus is available at: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz

Conversational rules

Automatically inferring the conversational rules

Six different models were evaluated:
1. Least frequent word (1W-LF) [Abu Shawar et al.]
2. Most frequent word (1W-MF)
3. Least frequent word-bigram (2W-LF) [7]
4. Most frequent word-bigram (2W-MF) [7]
5. Most frequent word and least frequent word bigram (1W-MF-2W-LF) [7, 8]
6. Least frequent word and most frequent word bigram (1W-LF-2W-MF) [7, 8]

7. Context aware method
8. Bigram strategy combined with word strategy in a back-up fashion
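The single-word models listed above can be sketched in a few lines. The following is a minimal illustration of the 1W-LF and 1W-MF ideas only, assuming whitespace tokenization and corpus-wide word counts. The helper names (`build_rules`, `respond`), the toy q:a pairs and the tie-breaking behaviour are hypothetical and are not taken from the released chatbot code.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the real preprocessing may differ.
    return text.lower().split()


def build_rules(pairs, pick="least"):
    """Index each answer under one keyword chosen from its question."""
    freq = Counter(tok for question, _ in pairs for tok in tokenize(question))
    choose = min if pick == "least" else max
    rules = {}
    for question, answer in pairs:
        tokens = tokenize(question)
        if not tokens:
            continue
        # Ties are resolved in favour of the earliest token (an arbitrary choice).
        keyword = choose(tokens, key=lambda tok: freq[tok])
        rules.setdefault(keyword, []).append(answer)
    return freq, rules


def respond(message, freq, rules, pick="least", default="..."):
    """Answer with the first stored reply for the selected known keyword."""
    known = [tok for tok in tokenize(message) if tok in rules]
    if not known:
        return default
    choose = min if pick == "least" else max
    keyword = choose(known, key=lambda tok: freq[tok])
    return rules[keyword][0]


# Hypothetical toy corpus of (question, answer) pairs.
pairs = [
    ("hola como estas", "bien"),
    ("hola que edad tienes", "17"),
    ("como te llamas", "ana"),
]
freq, rules = build_rules(pairs, pick="least")
reply = respond("que edad tienes", freq, rules)
```

Passing `pick="most"` to both functions gives the 1W-MF variant. The bigram models would presumably apply the same selection to pairs of consecutive words instead of single words.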

Evaluation

Evaluation metrics

Lexical richness (LR)
- It captures the vocabulary diversity of the answers generated by our chatbot
- LR is defined as the ratio between the number of distinct lexical units and the total number of lexical units used

Syntactical richness (SR)
- It captures the diversity of the syntactic sequences produced by our chatbot
- SR is defined as the ratio between the number of distinct POS units and the total number of POS units used

Both LR and SR were measured at the 1-, 2- and 3-gram levels
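The ratio behind LR and SR can be illustrated with a short sketch. It assumes the answers are already tokenized: word tokens for LR, POS tags for SR (the tagger itself is outside this sketch). The function names and the toy answers are hypothetical.

```python
def ngrams(units, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def richness(sequences, n=1):
    """Distinct n-grams divided by total n-grams over all sequences."""
    grams = [g for seq in sequences for g in ngrams(seq, n)]
    return len(set(grams)) / len(grams) if grams else 0.0


# Hypothetical tokenized chatbot answers; for SR, pass POS-tag sequences instead.
answers = [
    ["hola", "hola", "que", "tal"],
    ["hola", "que", "tal"],
]
lr_1gram = richness(answers, n=1)  # 3 distinct words out of 7
lr_2gram = richness(answers, n=2)  # 3 distinct bigrams out of 5
```

N-grams are collected per answer so that no n-gram spans two different answers; whether the original evaluation did the same is an assumption.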

Evaluation

Evaluation metrics

Perplexity (P)
- It gives us an idea of how predictable the language produced in the answers is
- Since we are trying to capture the informal language used by teenagers, we prefer a high perplexity, i.e., a system that is not very predictable
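As a rough illustration of what perplexity measures, the sketch below scores a token sequence with a unigram model and add-one smoothing. The slides do not say which language model was used, so the model, the smoothing and the function name are all assumptions made for this example.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed unigram model."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    # Simplification: unseen test words are added to the vocabulary up front.
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log((counts[tok] + 1) / total) for tok in test_tokens)
    return math.exp(-log_prob / len(test_tokens))


# Hypothetical training text dominated by a single token.
train = ["jaja"] * 8 + ["hola", "xd"]
predictable = unigram_perplexity(train, ["jaja", "jaja"])
surprising = unigram_perplexity(train, ["hola", "xd"])
```

Here `surprising` is larger than `predictable`: the less expected the produced words are, the higher the perplexity, which is the behaviour preferred for informal teenager language.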

Validation
- For all experiments we used 10-fold cross-validation
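A 10-fold split can be sketched as follows. How the folds were actually built (per q:a pair, per user, with or without shuffling) is not stated in the slides; this example simply deals items into folds round-robin, and the function name is hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, held_out) pairs, holding out each of the k folds once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment, no shuffling
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# Hypothetical stand-in for a list of q:a pairs.
items = list(range(25))
splits = list(k_fold_splits(items, k=10))
```

Each item is held out exactly once, so every metric can be averaged over the ten runs.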

Experimental results

Experimental results

Method      | LR 1-gram | LR 2-grams | LR 3-grams | SR 1-gram | SR 2-grams | SR 3-grams | Perplexity (P)
1W-MF       | 0.078     | 0.135      | 0.256      | −0.097    | −0.022     | 0.000      |
1W-LF       | 0.130     | 0.213      | 0.412      | −0.144    | −0.016     | 0.031      | 917
2W-MF       | 0.019     | 0.076      | 0.151      | −0.043    | −0.011     | 0.001      | 632
2W-LF       | 0.079     | 0.181      | 0.344      | −0.049    | −0.001     | 0.021      | 928
1W-MF-2W-LF | 0.086     | 0.157      | 0.306      | −0.037    | 0.000      | 0.019      | 654
1W-LF-2W-MF | 0.039     | 0.066      | 0.128      | 0.016     | 0.009      | 0.011      | 213
            |           |            |            |           |            |            | 772

For comparison purposes, we also measured human-to-human dialogues: LR 1-gram = 0.273, LR 2-grams = 0.407, LR 3-grams = 0.735, SR 1-gram = 0.382, SR 2-grams = 0.134, SR 3-grams = 0.126 and P = 1618

Experimental results

Lexical behavior

Experimental results

Syntactic behavior

Experimental results

Dialogue example (chatbot model: 1W-LF)

u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: muy bien :) (very good :))
u: que edad tienes? (how old are you?)
c: D: (D:)
u: de donde eres? (where are you from?)
c: aslajshhas? C: (aslajshhas? C:)
u: que te gusta? (what do you like?)
c: D: (D:)
u: en donde estudias? (where do you study?)
c: 7:39 en mi casa jaja (7:39 in my home haha)
u: dime algo (tell me something)
c: Bien supongo... (Ok I guess...)
u: lo que sea (Anything)
c: Un detalle muy especial (A very special detail)
u: me tengo que ir (got to go)
c: ... (...)
u: Adios (Goodbye)
c:

Experimental results

Dialogue example (chatbot model: 1W-LF-2W-MF)

u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: No soy famosa (I’m not famous)
u: que edad tienes? (how old are you?)
c: 17 yirs ol (17 years old)
u: de donde eres? (where are you from?)
c: Por que? (Why?)
u: que te gusta? (what do you like?)
c: Ser rechazado un y otra vez por la misma persona xDDD Aun asi seguire XDDY una golpiza :c (Being rejected over and over again by the same person xDDD still keep up XDDY beating :c)
u: en donde estudias? (where do you study?)
c: En un hospital ah que :33 (In a hospital ah :33)
u: dime algo (tell me something)
c: Jajaja ok ok 7w7 (Hahaha ok ok 7w7)
u: lo que sea (Anything)
c: Que se rumora? (What is rumored?)
u: me tengo que ir (got to go)
c: Clase a estas horas? Pues bueno bueno. Nos seguimos contactando c: (Class at this hour? Well then good. We’ll keep in contact c:)
u: Adios (Goodbye)
c: Bye :3 (Bye :3)

Conclusions and future work

Conclusions

1. We have proposed a novel methodology for developing a conversational agent that learns to behave as a teenager from a real dialogue corpus
2. The proposed method is language independent, can be easily applied to different domains (i.e., it is domain independent), and does not require a hand-coded knowledge base
3. The experiments show that considering contextual information in addition to some key units (i.e., the 1W-LF-2W-MF model) yields a more natural behavior
4. An important contribution of this work is the compilation of a real dialogue corpus among Mexican teenagers, containing more than 300000 q:a pairs

Conclusions and future work

Future work

1. We aim to evaluate our proposed method under a PJ scenario, i.e., its suitability for assisting police officers
2. In addition, we would like to evaluate this type of tool for detecting other types of behaviour, such as cyber-bullying or aggressive text
3. We plan to test the performance of our proposed method with a richer semantic representation, such as Word2Vec

Conclusions and future work

Take away message

- Even if we manage to build accurate systems, there is no 100% guarantee that this type of cyber-crime will be eradicated
- We must teach our children to be careful when surfing the Web
- It is very important to raise awareness among our authorities of the research being done in this field; sometimes they prefer to keep working in an analogue fashion, i.e., without any technological assistance

Questions ?

Thank you !

Questions ?

Contact information :

Dr. Esaú Villatoro, email: [email protected]
URL: http://ccd.cua.uam.mx/~evillatoro/
Twitter: @EsauVT
Twitter: @LyR_UAMC

Released resources
Dataset: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz
Chatbot source code: https://github.com/Angel2113/Chatbot

Ángel Callejas, email: [email protected]

Questions ?

AIML Example

- <aiml> tag signifies the start of the AIML document
- <category> tag defines the knowledge unit
- <pattern> tag defines the pattern the user is going to type
- <template> tag defines the response to the user if the user types Hello Alice
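To make the tag roles concrete, the sketch below builds a small AIML document from (pattern, response) pairs with Python's standard `xml.etree.ElementTree`. The function name and the response text "Hello User" are hypothetical placeholders, and whether the released chatbot generates its AIML this way is an assumption.

```python
import xml.etree.ElementTree as ET


def pairs_to_aiml(pairs):
    """Serialize (pattern, response) pairs as one AIML category each."""
    root = ET.Element("aiml", {"version": "1.0.1"})
    for pattern_text, response_text in pairs:
        category = ET.SubElement(root, "category")
        # AIML patterns are conventionally written in upper case.
        ET.SubElement(category, "pattern").text = pattern_text.upper()
        ET.SubElement(category, "template").text = response_text
    return ET.tostring(root, encoding="unicode")


# "Hello User" is a made-up response used only for illustration.
document = pairs_to_aiml([("Hello Alice", "Hello User")])
```

Each automatically inferred rule would become one category: the selected keyword (with wildcards) in the pattern and the stored answer in the template.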

Questions ?

Examples

1. 1W-LF: Dear sister tell my father that I love him