Introduction
Grooming Attack
Proposed method
Experimental setup
Results
Conclusions and Future work
From Dialogue Corpora to Dialogue Systems: Generating a Chatbot with Teenager Personality for Preventing Cyber-Pedophilia
Ángel Callejas-Rodríguez (1), Esaú Villatoro-Tello (1), Ivan Meza (2) and Gabriela Ramírez-de-la-Rosa (1)
(1) Language and Reasoning (LyR) Research Group, Information Technologies Dept., UAM, México
(2) Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, UNAM, México
TSD, Brno, Czech Republic. September 14th 2016
Callejas-Rodríguez, A., Villatoro-Tello E. (et al.)
A Teenager Chatbot for Preventing Cyber-Pedophilia
TSD 2016, Brno, Czech Republic
Outline
1. Introduction
2. Grooming Attack
3. Proposed method
4. Experimental setup
5. Results
6. Conclusions and Future work
Problem statement
Introduction
The Internet in our everyday life:
- It gives users easy access to vast amounts of information
- Multiple communication services are freely available
- Users are exposed to a large amount of illegal activity
Communication platforms
Popular services on the Internet
Some of the most popular services among Internet users are instant messaging platforms such as Twitter, Facebook, Skype and G+
According to recent studies, children and teenagers are becoming very active users (3)
These services are very attractive because they offer many advantages:
- They are fast, cheap and virtual by nature
- They allow users to hide their real identity
(3) Pew Research Center: http://www.pewinternet.org/2015/04/09/teens-social-mediatechnology-2015/
Vulnerability
What about the security of users ?
- A specific and very frequent type of cyber-crime is the grooming attack
- Sexual predators or pedophiles take advantage of the anonymity provided by these messaging services
- According to some international organizations, child grooming is an illegal act that has become more common in recent years
Grooming Attack/Child Grooming
Definition
Child grooming: a communication process by which a perpetrator applies affinity seeking strategies, while simultaneously engaging in sexual desensitization and information acquisition about targeted victims in order to develop relationships that result in need fulfillment (e.g. physical sexual molestation) (4)
(4) C. Harms. 2007. Grooming: An operational definition and coding scheme. Sex Offender Law Report, 8(1):1–6.
Grooming Attack/Child Grooming
Official statistics
According to the National Center for Missing & Exploited Children (5) and the Office of Juvenile Justice and Delinquency Prevention (6):
- One in five children is contacted by an offender
- In 100% of child abuse cases, the victims meet the offender voluntarily
- About 16% of teenagers have considered meeting someone in person after chatting with them just once; of these, 8% have done so
- 75% of children are willing to share personal information with strangers in exchange for some benefit, for instance a payment
In addition, the FBI estimates that there are about 750,000 sexual predators on-line around the world at any given moment
(5) http://www.missingkids.com
(6) www.ojjdp.gov
Traditional approach
Sexual Predators Identification (SPI)
- Currently, sexual predators are caught by police officers or volunteers who pose as teenagers in chat rooms and provoke sexual predators into approaching them
- Since 2004, the Perverted Justice organization (www.perverted-justice.com) has captured more than 600 predators
- The Terre des Hommes initiative (http://www.terredeshommesnl.org/en) managed to identify thousands of predators within months of work
Identified problems
Disadvantages of the traditional approach
- Police officers and volunteers will never be sufficient to patrol all Internet traffic
- Currently, there is no fully automatic system for preventing and detecting sexual predators (either on-line or off-line)
- The current approach is a forensic technique, i.e., a police officer has to review thousands of lines of text in order to accurately identify a sexual offender
- Given its nature, this approach is prone to errors, since spending long periods in front of a computer posing as a victim can be very upsetting
At the identification phase
Analysing and classifying SPI
During the last decade, several approaches have been proposed for solving the problem of SPI:
- Identification of predatory chat lines
- Classification of predatory chat conversations
- Identification of the offender and the victim
In general, two main research lines have been followed:
1. Representation (features) of the data: BOW, psycholinguistic features and complex behavioural attributes
2. Learning algorithms: chained classifiers, semi-supervised learning and one-class learning algorithms
Assisting police officers
Towards the construction of chatbots
- Most of the work on SPI has focused on analysing and classifying chat conversations by means of text mining strategies
- Recently, Laorden, C. et al. 2013 proposed a chatter-bot called Negobot
- Negobot is based on game theory
- Negobot was designed for Spanish; however, its knowledge base comes from the Perverted Justice (PJ) website
- A main limitation is the use of common slang
Main goals
Research questions
Our main goal is to provide automatic tools that can assist officers in the process of SPI
Specifically, we aim to develop a conversational agent that interacts with human users via natural language
Thus, our main research questions are:
1. To what extent is it possible to automatically extract conversational rules from real dialogue corpora?
2. How likely is it that such a chatter-bot behaves as a teenager does within a chatroom?
System architecture
General architecture
The corpus
Building the conversational model
The chatbot model
Evaluation
Built corpus
Corpus
Statistics regarding the collected corpora
Number of ...                          Original data   Filtered data
Users                                  1300            1300
Avg. interventions (per user)          782             234
Avg. length per intervention (words)   10              7
Vocabulary size (distinct tokens)      711854          186857
The built corpus is available at: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz
Conversational rules
Automatically inferring the conversational rules
Six different models were evaluated:
1. Least frequent word (1W-LF) [Abu Shawar et al.]
2. Most frequent word (1W-MF)
3. Least frequent word-bigram (2W-LF) (7)
4. Most frequent word-bigram (2W-MF) (7)
5. Most frequent word and least frequent word-bigram (1W-MF-2W-LF) (7, 8)
6. Least frequent word and most frequent word-bigram (1W-LF-2W-MF) (7, 8)
(7) Context aware method
(8) Bigram strategy combined with word strategy in a back-up fashion
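The four basic selection strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, frequency counts taken over all questions in the corpus, and ties broken in favour of the earliest unit. The helper names (`build_counts`, `pick`, `select_key`) are hypothetical, and the two combined back-up models are left out because the slides do not state which strategy is tried first.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


def bigrams(tokens):
    # Consecutive word pairs, e.g. ["a", "b", "c"] -> [("a", "b"), ("b", "c")].
    return list(zip(tokens, tokens[1:]))


def build_counts(questions):
    """Count word and word-bigram frequencies over all questions."""
    word_counts, bigram_counts = Counter(), Counter()
    for question in questions:
        tokens = tokenize(question)
        word_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
    return word_counts, bigram_counts


def pick(units, counts, strategy):
    """Return the least ("LF") or most ("MF") frequent unit, or None if empty."""
    if not units:
        return None
    chooser = min if strategy == "LF" else max
    # min/max return the first unit among ties.
    return chooser(units, key=lambda unit: counts[unit])


def select_key(question, model, word_counts, bigram_counts):
    """Select the key unit of a question for a model such as "1W-LF" or "2W-MF"."""
    size, strategy = model.split("-")
    tokens = tokenize(question)
    if size == "1W":
        return pick(tokens, word_counts, strategy)
    return pick(bigrams(tokens), bigram_counts, strategy)


questions = ["como te llamas", "como te va", "que edad tienes"]
word_counts, bigram_counts = build_counts(questions)
for model in ("1W-LF", "1W-MF", "2W-LF", "2W-MF"):
    print(model, select_key("como te llamas", model, word_counts, bigram_counts))
```

In a rule-based chatbot, the selected key unit would typically become the pattern of a rule whose response is the answer paired with that question in the corpus.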
Evaluation
Evaluation metrics
Lexical richness (LR)
- It captures the diversity of the vocabulary used in the answers generated by our chatbot
- LR is defined as the ratio between the number of distinct lexical units and the total number of lexical units used
Syntactical richness (SR)
- It captures the diversity of the syntactic (part-of-speech) sequences produced by our chatbot
- SR is defined as the ratio between the number of distinct POS units and the total number of POS units used
Both LR and SR were measured at the 1-, 2- and 3-gram levels
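Both ratios share the same form, so a single function can illustrate them. The sketch below assumes the answers are already tokenized (for LR) or POS-tagged (for SR) and that the n-grams are pooled over all answers; the slides do not specify the tagger or the pooling, so `richness`, the toy answers and the tag names are illustrative only.

```python
def ngrams(sequence, n):
    # All contiguous n-grams of a sequence, as tuples.
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def richness(sequences, n):
    """Distinct n-grams divided by total n-grams, pooled over all sequences."""
    pooled = [gram for seq in sequences for gram in ngrams(seq, n)]
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


# LR: sequences of words produced by the chatbot (toy data).
answers = [["hola", "hola"], ["hola", "que", "tal"]]
# SR: the same answers as POS tag sequences (hypothetical tags).
answer_tags = [["INTJ", "INTJ"], ["INTJ", "PRON", "ADV"]]

for n in (1, 2, 3):
    print(n, richness(answers, n), richness(answer_tags, n))
```

A value close to 1 means almost every n-gram is new, while a value close to 0 means the output keeps repeating the same units.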
Evaluation
Evaluation metrics
Perplexity (P)
- It gives us an idea of how predictable the language produced in the answers is
- Since we are trying to capture the informal language used by teenagers, we prefer a high perplexity, i.e., a system that is not very predictable
Validation
- For all experiments, we used 10-fold cross-validation as the validation strategy
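As a rough illustration of how perplexity and k-fold validation fit together, the sketch below uses a unigram language model with add-one smoothing. That model choice is an assumption made only for the example, since the slides do not say which language model was used; `train_unigram`, `perplexity` and `k_folds` are hypothetical names.

```python
import math
from collections import Counter


def train_unigram(train_tokens):
    """Return a smoothed unigram probability function (add-one smoothing)."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(prob, test_tokens):
    """exp of the average negative log-probability of the test tokens."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    log_sum = sum(math.log(prob(word)) for word in test_tokens)
    return math.exp(-log_sum / len(test_tokens))


def k_folds(items, k=10):
    """Yield (train, test) splits; fold i holds out every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


prob = train_unigram("hola hola que tal".split())
print(perplexity(prob, ["hola", "hola"]))  # frequent words: lower perplexity
print(perplexity(prob, ["xyz", "abc"]))    # unseen words: higher perplexity
```

Under this reading, a chatbot whose answers are less predictable given the training data scores a higher perplexity, which is the behaviour the slide says is preferred for informal teenage language.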
Experimental results
Experimental results
Method         LR 1-gram   LR 2-grams   LR 3-grams   SR 1-gram   SR 2-grams   SR 3-grams
1W-MF          0.078       0.135        0.256        −0.097      −0.022       0.000
1W-LF          0.130       0.213        0.412        −0.144      −0.016       0.031
2W-MF          0.019       0.076        0.151        −0.043      −0.011       0.001
2W-LF          0.079       0.181        0.344        −0.049      −0.001       0.021
1W-MF-2W-LF    0.086       0.157        0.306        −0.037      0.000        0.019
1W-LF-2W-MF    0.039       0.066        0.128        0.016       0.009        0.011
Perplexity (P) values, listed in the order they appear in the source (their row alignment is not recoverable): 917, 632, 928, 654, 213, 772
For comparison purposes, we evaluated the same metrics on human-to-human dialogues: LR 1-gram = 0.273, LR 2-grams = 0.407, LR 3-grams = 0.735, SR 1-gram = 0.382, SR 2-grams = 0.134, SR 3-grams = 0.126 and P = 1618
Experimental results
Lexical behavior
Experimental results
Syntactic behavior
Experimental results
Dialogue example (Chatbot model: 1W-LF)
u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: muy bien :) (very good :))
u: que edad tienes? (how old are you?)
c: D: (D:)
u: de donde eres? (where are you from?)
c: aslajshhas? C: (aslajshhas? C:)
u: que te gusta? (what do you like?)
c: D: (D:)
u: en donde estudias? (where do you study?)
c: 7:39 en mi casa jaja (7:39 in my home haha)
u: dime algo (tell me something)
c: Bien supongo... (Ok I guess...)
u: lo que sea (Anything)
c: Un detalle muy especial (A very special detail)
u: me tengo que ir (got to go)
c: ... (...)
u: Adios (Goodbye)
c:
Experimental results
Dialogue example (Chatbot model: 1W-LF-2W-MF)
u: hola (hello)
c: hola (hello)
u: como te llamas? (what is your name?)
c: No soy famosa (I'm not famous)
u: que edad tienes? (how old are you?)
c: 17 yirs ol (17 years old)
u: de donde eres? (where are you from?)
c: Por que? (Why?)
u: que te gusta? (what do you like?)
c: Ser rechazado un y otra vez por la misma persona xDDD Aun asi seguire XDDY una golpiza :c (Being rejected over and over again by the same person xDDD still keep up XDDY beating :c)
u: en donde estudias? (where do you study?)
c: En un hospital ah que :33 (In a hospital ah :33)
u: dime algo (tell me something)
c: Jajaja ok ok 7w7 (Hahaha ok ok 7w7)
u: lo que sea (Anything)
c: Que se rumora? (What is rumored?)
u: me tengo que ir (got to go)
c: Clase a estas horas? Pues bueno bueno. Nos seguimos contactando c: (Class at this hour? Well then good. We'll keep in contact c:)
u: Adios (Goodbye)
c: Bye :3 (Bye :3)
Conclusions and future work
Conclusions
1. We have proposed a novel methodology for developing a conversational agent that learns to behave as a teenager from a real dialogue corpus
2. The proposed method is language independent, can be easily applied to different domains (i.e., it is domain independent), and does not require a hand-coded knowledge base
3. The experiments show that combining contextual information with key units, as the 1W-LF+2W-MF model does, yields more natural behavior
4. An important contribution of this work is the compilation of a real dialogue corpus among Mexican teenagers, containing more than 300,000 question-answer (q : a) pairs
Conclusions and future work
Future work
1. We aim to evaluate our proposed method under a PJ scenario, i.e., its suitability for assisting police officers
2. In addition, we would like to evaluate this type of tool for detecting other types of behaviour, such as cyber-bullying or aggressive text
3. We plan to test the performance of our proposed method when it incorporates a richer semantic representation, such as Word2Vec
Conclusions and future work
Take away message
- Even if we manage to build accurate systems, there is no guarantee that this type of cyber-crime will be eradicated
- We must teach our children how to be careful when surfing the Web
- It is very important to raise awareness among our authorities of the importance of the research being done in this field; sometimes they prefer to keep working in an analog way, i.e., without any technological assistance
Questions ?
Thank you !
Questions ?
Contact information :
Dr. Esaú Villatoro, email: [email protected]
URL: http://ccd.cua.uam.mx/~evillatoro/
Twitter: @EsauVT
Twitter: @LyR_UAMC
Ángel Callejas, email: [email protected]
Released resources
Dataset: http://ccd.cua.uam.mx/~evillatoro/Resources/Corpus_Ask_MX.tar.gz
Chatbot source code: https://github.com/Angel2113/Chatbot
Questions ?
AIML Example
- The <aiml> tag signifies the start of the AIML document
- The <category> tag defines the knowledge unit
- The <pattern> tag defines the pattern the user is going to type
- The <template> tag defines the response to the user if the user types Hello Alice
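The structure described above can be illustrated with a small document and a toy exact-match responder. This is only a sketch: the response text "Hello User" is a placeholder not taken from the slides, the function names are hypothetical, and a real AIML interpreter also handles wildcards and other tags that are ignored here.

```python
import xml.etree.ElementTree as ET

# Minimal AIML document: one category pairing a pattern with a template.
# The template text is a placeholder chosen for this example.
AIML_DOC = """<aiml version="1.0.1">
  <category>
    <pattern>HELLO ALICE</pattern>
    <template>Hello User</template>
  </category>
</aiml>"""


def load_rules(aiml_text):
    """Map each normalized pattern to its template text."""
    root = ET.fromstring(aiml_text)
    rules = {}
    for category in root.iter("category"):
        pattern = (category.findtext("pattern") or "").strip().upper()
        template = (category.findtext("template") or "").strip()
        rules[pattern] = template
    return rules


def respond(rules, user_input):
    """Exact-match lookup after uppercasing and collapsing whitespace."""
    key = " ".join(user_input.upper().split())
    return rules.get(key)


rules = load_rules(AIML_DOC)
print(respond(rules, "Hello Alice"))
```

Conversational rules extracted from a corpus could be written out in this same category, pattern and template shape, one category per rule.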
Questions ?
Examples
1. 1W-LF: Dear sister tell my father that I love him