Introduction

Grooming Attack

Proposed method

Experimental setup

Results

Conclusions and Future work

From Dialogue Corpora to Dialogue Systems: Generating a Chatbot with Teenager Personality for Preventing Cyber-Pedophilia

Ángel Callejas-Rodríguez (1), Esaú Villatoro-Tello (1), Ivan Meza (2) and Gabriela Ramírez-de-la-Rosa (1)
(1) Language and Reasoning (LyR) Research Group, Information Technologies Dept., UAM, México
(2) Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, UNAM, México

TSD, Brno, Czech Republic. September 14th 2016

Callejas-Rodríguez, A., Villatoro-Tello E. (et al.)

A Teenager Chatbot for Preventing Cyber-Pedophilia

TSD 2016, Brno, Czech Republic



Outline

1. Introduction
2. Grooming Attack
3. Proposed method
4. Experimental setup
5. Results
6. Conclusions and Future work







Problem statement

Introduction

The Internet in our everyday life:
- Allows users to easily access large amounts of information
- Multiple communication services are freely available
- Users are exposed to a large number of illegal activities






Communication platforms

Popular services on the Internet

Some of the most popular services among Internet users are those known as instant messaging platforms: Twitter, Facebook, Skype, G+, etc. According to some recent studies, children and teenagers are becoming very active users [3].

These services are very attractive since they provide many advantages:
- They are fast, cheap and virtual by nature
- They allow users to hide their real identity

3. Pew Research Center: http://www.pewinternet.org/2015/04/09/teens-social-media-technology-2015/






Vulnerability

What about the security of users?

- A specific and very frequent type of cyber-crime is the grooming attack
- Sexual predators or pedophiles take advantage of the anonymity provided by these messaging services
- According to some international organizations, child grooming is one of the illegal acts that has become more common in recent years






Grooming Attack/Child Grooming

Definition

Child grooming: A communication process by which a perpetrator applies affinity seeking strategies, while simultaneously engaging in sexual desensitization and information acquisition about targeted victims in order to develop relationships that result in need fulfillment (e.g. physical sexual molestation) [4]

4. C. Harms. 2007. Grooming: An operational definition and coding scheme. Sex Offender Law Report, 8(1):1–6.





Grooming Attack/Child Grooming

Official statistics

According to the National Center for Missing & Exploited Children [5] and the Office of Juvenile Justice and Delinquency Prevention [6]:
- One in five children is contacted by an offender
- In 100% of the cases of child abuse, the victims met the offender voluntarily
- About 16% of teenagers have considered meeting someone in person after chatting just once with him/her; of these, 8% have done it
- 75% of children are willing to share personal information with strangers in exchange for some benefit, for instance a payment

In addition, the FBI estimates that there are about 750,000 sexual predators on-line around the world at any given moment

5. http://www.missingkids.com
6. www.ojjdp.gov





Traditional approach

Sexual Predators Identification (SPI)

- Currently, sexual predators are caught by police officers or volunteers who pose as teenagers in chat rooms and provoke sexual predators to approach them
- Since 2004, the Perverted Justice organization (www.perverted-justice.com) has captured more than 600 predators
- The Terre des Hommes initiative (http://www.terredeshommesnl.org/en) managed to identify thousands of predators within months of work






Identified problems

Disadvantages of the traditional approach

- Police officers and volunteers will never be sufficient to patrol all Internet traffic
- Nowadays, there are no fully automatic systems for preventing and detecting sexual predators (either on-line or off-line)
- The current approach is a forensic technique, i.e., a police officer has to review thousands of lines of text in order to accurately identify a sexual offender
- Given its nature, this approach is prone to errors, since spending too much time in front of a computer posing as a victim can be very upsetting






At the identification phase

Analysing and classifying SPI

During the last decade several approaches have been proposed for solving the problem of SPI:
- Identification of predatory chat lines
- Classification of predatory chat conversations
- Identification of the offender and the victim

In general, two main research lines have been followed:
1. Representation (features) of the data: BOW, psycholinguistic features and complex behavioural attributes
2. Learning algorithms: chained classifiers, semi-supervised learning and one-class learning algorithms






Assisting police officers

Towards the construction of chatbots

- Most of the work on SPI has been oriented to analysing and classifying chat conversations by means of text mining strategies
- Recently, Laorden et al. (2013) proposed a chatter-bot called Negobot
- Negobot is based on game theory
- Negobot was designed for Spanish; however, its knowledge base comes from the Perverted Justice (PJ) website
- A main limitation is the use of common slang






Main goals

Research questions

Our main goal is to provide automatic tools that can assist officers in the process of SPI. To this end, we aim to develop a conversational agent which interacts with human users via natural language.

Thus, our main research questions are:
1. To what extent is it possible to automatically extract conversational rules from real dialogue corpora?
2. How likely is it that such a chatter bot behaves as a teenager does within a chatroom?






System architecture

General architecture






The corpus






Building the conversational model






The chatbot model






Evaluation






Built corpus

Corpus

Statistics regarding the collected corpora

Number of ...                           Original data   Filtered data
Users                                   1300            1300
Avg. interventions (per user)           782             234
Avg. length per intervention (words)    10              7
Vocabulary size (distinct tokens)       711854          186857

The built corpus is available at: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz






Conversational rules

Automatically inferring the conversational rules

Six different models were evaluated:
1. Least frequent word (1W-LF) [Abu Shawar et al.]
2. Most frequent word (1W-MF)
3. Least frequent word-bigram (2W-LF) [7]
4. Most frequent word-bigram (2W-MF) [7]
5. Most frequent word and least frequent word bigram (1W-MF-2W-LF) [7, 8]
6. Least frequent word and most frequent word bigram (1W-LF-2W-MF) [7, 8]

7. Context aware method
8. Bigram strategy combined with word strategy in a back-up fashion
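The key-selection strategies listed on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy question/answer pairs and the reading of "back-up fashion" (try the bigram key first, fall back to the word key when the utterance has no bigram) are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(utterances, n):
    """Count how often each n-gram occurs in a list of tokenized utterances."""
    counts = Counter()
    for tokens in utterances:
        counts.update(ngrams(tokens, n))
    return counts

def select_key(tokens, counts, n, strategy):
    """Pick the least (LF) or most (MF) frequent n-gram of an utterance.

    Ties are resolved in favour of the leftmost n-gram.
    """
    candidates = ngrams(tokens, n)
    if not candidates:
        return None
    pick = min if strategy == "LF" else max
    return pick(candidates, key=lambda gram: counts[gram])

def key_with_backoff(tokens, uni, bi, word_strategy="LF", bigram_strategy="MF"):
    """Assumed back-up scheme: bigram key first, word key as fallback.

    The defaults correspond to the 1W-LF-2W-MF configuration.
    """
    key = select_key(tokens, bi, 2, bigram_strategy)
    if key is None:
        key = select_key(tokens, uni, 1, word_strategy)
    return key

# Hypothetical toy corpus of (question, answer) pairs.
pairs = [
    ("hola como estas", "muy bien :)"),
    ("hola como te llamas", "No soy famosa"),
    ("que edad tienes", "17 yirs ol"),
]
questions = [q.split() for q, _ in pairs]
uni = count_ngrams(questions, 1)
bi = count_ngrams(questions, 2)

# Each rule maps a key (the selected n-gram) to the answers seen with it.
rules = {}
for question, answer in pairs:
    key = key_with_backoff(question.split(), uni, bi)
    rules.setdefault(key, []).append(answer)
```

In this sketch the four single-strategy models correspond to calling `select_key` with `n` set to 1 or 2 and `strategy` set to "LF" or "MF", while the two combined models correspond to `key_with_backoff` with the matching pair of strategies.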





Evaluation

Evaluation metrics

Lexical richness (LR)
- It captures the diversity of the vocabulary used in the answers generated by our chatbot
- LR is defined as the ratio between the number of distinct lexical units and the total number of lexical units used

Syntactical richness (SR)
- It captures the diversity of the syntactic (POS) sequences produced by our chatbot
- SR is defined as the ratio between the number of distinct POS units and the total number of POS units used

Both LR and SR were measured at the 1-, 2- and 3-gram levels
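The two richness measures are the same ratio applied to different sequences: word tokens for LR and POS tags for SR. The following Python sketch shows one way to compute it; pooling all answers before taking the ratio is an assumption (the slides do not say whether the ratio is computed per answer or over the whole output), and the sample answers and POS tags are hypothetical.

```python
def ngrams(tokens, n):
    """Return the n-grams of a sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def richness(sequences, n):
    """Ratio of distinct n-grams to total n-grams over all sequences.

    Pass word tokens to obtain LR and POS tags to obtain SR.
    """
    total = 0
    distinct = set()
    for sequence in sequences:
        grams = ngrams(sequence, n)
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0

# Hypothetical chatbot answers and hand-assigned POS tags.
answers = [["hola"], ["muy", "bien"], ["hola", "hola"]]
pos_tags = [["INTJ"], ["ADV", "ADV"], ["INTJ", "INTJ"]]

lr_1gram = richness(answers, 1)   # lexical richness at the 1-gram level
lr_2gram = richness(answers, 2)   # lexical richness at the 2-gram level
sr_1gram = richness(pos_tags, 1)  # syntactical richness at the 1-gram level
```

A value close to 1 means that almost every n-gram is used only once, while a value close to 0 means the output keeps repeating the same units.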





Evaluation

Evaluation metrics

Perplexity (P)
- It gives us an idea of how predictable the language produced in the answers is
- Since we are trying to capture the informal language used by teenagers, we prefer a high perplexity, i.e., a system that is not very predictable

Validation
- For all the experiments we employed 10-fold cross validation
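As a rough illustration of the perplexity measure and of the 10-fold protocol, the sketch below uses a unigram language model with add-one smoothing. The slides do not state which language model or smoothing was used, so the model choice, the function names and the interleaved fold assignment are all assumptions.

```python
import math
from collections import Counter

def train_unigram(train_tokens):
    """Return a probability function for an add-one smoothed unigram model."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob

def perplexity(prob, test_tokens):
    """Perplexity = exp of the average negative log-probability per token."""
    if not test_tokens:
        raise ValueError("perplexity is undefined for an empty sequence")
    log_sum = sum(math.log(prob(token)) for token in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

def k_folds(items, k=10):
    """Yield (train, test) splits; item j goes to the test set of fold j % k."""
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test

# Toy usage: a model trained on "a a b" finds "a" predictable and "z" surprising.
prob = train_unigram("a a b".split())
pp_seen = perplexity(prob, ["a", "a"])
pp_unseen = perplexity(prob, ["z"])
```

Under this definition, text that the model finds surprising yields a higher perplexity, which is the direction the slide says is preferred for informal teenage language.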






Experimental results


Method        LR 1-gram  LR 2-grams  LR 3-grams  SR 1-gram  SR 2-grams  SR 3-grams  Perplexity (P)
1W-MF         0.078      0.135       0.256       −0.097     −0.022      0.000       917
1W-LF         0.130      0.213       0.412       −0.144     −0.016      0.031       632
2W-MF         0.019      0.076       0.151       −0.043     −0.011      0.001       928
2W-LF         0.079      0.181       0.344       −0.049     −0.001      0.021       654
1W-MF-2W-LF   0.086      0.157       0.306       −0.037     0.000       0.019       213
1W-LF-2W-MF   0.039      0.066       0.128       0.016      0.009       0.011       772

LR = lexical richness, SR = syntactical richness.

For comparison purposes we evaluated the same metrics on human-to-human dialogues: LR 1-gram = 0.273, LR 2-grams = 0.407, LR 3-grams = 0.735, SR 1-gram = 0.382, SR 2-grams = 0.134, SR 3-grams = 0.126 and P = 1618







Lexical behavior







Syntactic behavior







Dialogue example (Chatbot model: 1W-LF)

u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: muy bien :) (very good :))
u: que edad tienes? (how old are you?)
c: D: (D:)
u: de donde eres? (where are you from?)
c: aslajshhas? C: (aslajshhas? C:)
u: que te gusta? (what do you like?)
c: D: (D:)
u: en donde estudias? (where do you study?)
c: 7:39 en mi casa jaja (7:39 in my home haha)
u: dime algo (tell me something)
c: Bien supongo... (Ok I guess...)
u: lo que sea (Anything)
c: Un detalle muy especial (A very special detail)
u: me tengo que ir (got to go)
c: ... (...)
u: Adios (Goodbye)
c:







Dialogue example (Chatbot model: 1W-LF-2W-MF)

u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: No soy famosa (I'm not famous)
u: que edad tienes? (how old are you?)
c: 17 yirs ol (17 years old)
u: de donde eres? (where are you from?)
c: Por que? (Why?)
u: que te gusta? (what do you like?)
c: Ser rechazado un y otra vez por la misma persona xDDD Aun asi seguire XDDY una golpiza :c (Being rejected over and over again by the same person xDDD still keep up XDDY beating :c)
u: en donde estudias? (where do you study?)
c: En un hospital ah que :33 (In a hospital ah :33)
u: dime algo (tell me something)
c: Jajaja ok ok 7w7 (Hahaha ok ok 7w7)
u: lo que sea (Anything)
c: Que se rumora? (What is rumored?)
u: me tengo que ir (got to go)
c: Clase a estas horas? Pues bueno bueno. Nos seguimos contactando c: (Class at this hour? Well then good. We'll keep in contact c:)
u: Adios (Goodbye)
c: Bye :3 (Bye :3)






Conclusions and future work

Conclusions

1. We have proposed a novel methodology for developing a conversational agent that learns to behave as a teenager from a real dialogue corpus
2. The proposed method is language independent, can be easily applied to different domains (i.e., it is domain independent), and does not require a hand-coded knowledge base
3. The experiments show that by considering contextual information in addition to some key units (i.e., the 1W-LF-2W-MF model), it is possible to obtain a more natural behavior
4. An important contribution of this work is the compilation of a real dialogue corpus among Mexican teenagers, containing more than 300000 question-answer (q:a) pairs







Future work

1. We aim to evaluate our proposed method under a Perverted Justice (PJ) scenario, i.e., its suitability for assisting police officers
2. In addition, we would like to evaluate this type of tool for detecting other types of behaviour, such as cyber-bullying or aggressive text
3. We plan to test the performance of our proposed method when incorporating a richer semantic representation, such as Word2Vec







Take away message

- Even if we manage to build accurate systems, there is no guarantee that this type of cyber-crime will be eradicated
- We must teach our children to be careful when surfing the Web
- It is very important to raise awareness among our authorities of the importance of the research being done in this field; sometimes they prefer to keep working manually, i.e., without any technological assistance






Questions?

Thank you!





Contact information:
- Dr. Esaú Villatoro, email: [email protected], URL: http://ccd.cua.uam.mx/~evillatoro/, Twitter: @EsauVT, Twitter: @LyR_UAMC
- Ángel Callejas, email: [email protected]

Released resources:
- Dataset: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz
- Chatbot source code: https://github.com/Angel2113/Chatbot





AIML Example

- The <aiml> tag signifies the start of the AIML document
- The <category> tag defines the knowledge unit
- The <pattern> tag defines the pattern the user is going to type
- The <template> tag defines the response to the user if the user types Hello Alice
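The AIML listing itself did not survive on this slide. As a hedged illustration of the structure described here (an `aiml` root containing `category` units, each with a `pattern` and a `template`), the following Python sketch generates such a document with the standard library. The function name `build_aiml`, the version attribute and the reply "Hello User" are assumptions for illustration and are not taken from the original slide or from the authors' code.

```python
import xml.etree.ElementTree as ET

def build_aiml(rules):
    """Serialize (pattern, template) pairs as a minimal AIML document string."""
    root = ET.Element("aiml", version="1.0.1")  # assumed AIML version
    for pattern, template in rules:
        category = ET.SubElement(root, "category")  # one knowledge unit
        # AIML patterns are conventionally written in upper case.
        ET.SubElement(category, "pattern").text = pattern.upper()
        ET.SubElement(category, "template").text = template
    return ET.tostring(root, encoding="unicode")

# Hypothetical rule: the reply text is invented for this example.
document = build_aiml([("Hello Alice", "Hello User")])
print(document)
```

A conversational rule extracted from a corpus could be written out in the same way, with the selected key as the pattern and the observed answer as the template.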





Examples

1. 1W-LF: Dear sister tell my father that I love him