Textual Anomaly Detection for Goal-Oriented Conversational Agents Amir Bakarov Chatme AI LLC, Novosibirsk, Russia Higher School of Economics, Moscow, Russia
[email protected]
Introduction Anomaly detection, along with classification, clustering and regression, is a separate type of problem in machine learning. The task is to find observations that deviate strongly from the other observations in a given dataset; such deviating observations are called anomalies. The main difference from binary classification is that in anomaly detection the positive class is well sampled, while the anomaly class is severely under-sampled. In recent years the need to solve this task in natural language processing has grown, especially for developers of goal-oriented conversational agents (or, simply, chatbots) and question-answering models: computer systems that converse with humans in natural language to help them reach a certain goal. The number of goals (or topics) about which an agent can converse is limited, and one of the agent's tasks is to identify whether the user is talking about a goal that is not supported. We consider that this task can be approximated as anomaly detection, where user queries about "unsupported" goals are treated as anomalies. A subtlety is that such utterances can be strongly heterogeneous, since the number of topics that are not pre-defined is unlimited, so we are not able to create a decent dataset for the anomalous class. In this case we can frame the problem as novelty detection: a type of anomaly detection task where the anomaly class is not just under-sampled but absent from the training set. In this study we therefore explore ways of solving this task, comparing supervised methods of anomaly detection as well as unsupervised methods of novelty detection. To be able to use supervised methods and evaluate the results, we created two multi-class datasets for anomaly detection, in Russian and in English, each with a hand-crafted anomalous class.
Our work is the first on anomaly detection applied to Russian-language textual data, and we consider this our main contribution.
Explored methods One-Class Support Vector Machine (OCSVM). Draws a soft boundary around the dataset by finding a maximum-margin hyperplane (based on the Support Vector Machine (SVM) method) and considers objects outside the boundary to be anomalies. Isolation Forest (IF). Isolates objects of the dataset by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature (based on the Random Forest algorithm); when a forest of such trees collectively produces shorter path lengths for particular samples, these samples are highly likely to be anomalies. Local Outlier Factor (LOF). Computes the local density deviation of every object of the dataset with respect to its neighbors, considering as anomalies samples that have a substantially lower density than their neighbors. Threshold on the standard deviation of classifier predictions (TSD). Trains a multiclass classifier (in this study, logistic regression with an l2 penalty and inverse regularization strength of 1) on the classes of the dataset and then computes the standard deviation of the predicted class probabilities for each new object to be checked for anomalousness; if the standard deviation is lower than a heuristic threshold, the object is labeled as an anomaly.
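The four detectors above can be sketched with scikit-learn's implementations. This is a minimal illustration on synthetic vectors, not the paper's tuned setup: the hyperparameters, the toy data, and the TSD threshold value are all assumptions made for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(200, 5))           # "normal" sentence vectors
y_train = rng.randint(0, 3, size=200)               # topic labels, used only by TSD
X_test = np.vstack([rng.normal(0, 1, size=(5, 5)),  # normal test samples
                    rng.normal(8, 1, size=(5, 5))]) # far-away (anomalous) samples

# OCSVM, IF and LOF are fit on normal data only; predict() returns
# +1 for normal objects and -1 for anomalies.
detectors = {
    "OCSVM": OneClassSVM(nu=0.1, gamma="scale"),
    "IF": IsolationForest(random_state=0),
    "LOF": LocalOutlierFactor(novelty=True),
}
for name, det in detectors.items():
    det.fit(X_train)
    print(name, det.predict(X_test))

# TSD: when the classifier is unsure, the class probabilities are spread
# almost evenly, so their standard deviation is low -> flag as anomaly.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X_train, y_train)
sd = clf.predict_proba(X_test).std(axis=1)
threshold = 0.1  # heuristic; the paper tunes this by grid search
tsd_pred = np.where(sd < threshold, -1, 1)
```

Note that OCSVM, IF and LOF need no topic labels at all, while TSD reuses the labeled in-domain classes, which is what makes it applicable to the novelty-detection setting described above.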
Threshold on the distance to LDA topic keywords (TLDA). Creates a set of references (bags of keywords) from the topics of the training data with LDA and then calculates the distance (we used the cosine metric in this study) between each new object and every reference; if the distance to every reference exceeds the threshold, the object is labeled as an anomaly. Threshold on the reconstruction error of an autoencoder (TA). Trains an autoencoder (Diabolo network) on the training data and computes the reconstruction error between the regression target and the actual value, considering objects with high error to be anomalies.
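The TA detector can be approximated in a few lines: here a small MLPRegressor trained to reproduce its own input stands in for the autoencoder, and the 95th percentile of the training reconstruction error serves as the threshold. Both choices, like the synthetic data, are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(300, 5))           # normal training vectors
X_test = np.vstack([rng.normal(0, 1, size=(5, 5)),  # normal test samples
                    rng.normal(6, 1, size=(5, 5))]) # far-away (anomalous) samples

# An autoencoder approximated by an MLP with a narrow bottleneck,
# trained with the input itself as the regression target.
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)

# Mean squared reconstruction error per object; objects the network
# cannot reconstruct well are flagged as anomalies (-1).
train_err = ((ae.predict(X_train) - X_train) ** 2).mean(axis=1)
threshold = np.quantile(train_err, 0.95)  # heuristic cut-off on normal data
err = ((ae.predict(X_test) - X_test) ** 2).mean(axis=1)
pred = np.where(err > threshold, -1, 1)
```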
Experimental setup For this study we created two datasets, one for English and one for Russian. We mined two sets of posts from sub-forums of the most popular web forums in each language, Dvach (https://2ch.hk) and Reddit (https://www.reddit.com), to model conversational utterances. Each post set has 10 classes (which model different pre-defined topics of a conversational agent) and one class of anomalous utterances that we manually assembled from random posts of other sub-forums, checking annotator agreement on anomalousness. Each class contained 100 posts, so the overall number of posts in each dataset was 1100.
To obtain sentence embeddings, we averaged the embeddings of all the words constituting the sentence; out-of-vocabulary words were dropped. We used two Word2Vec models (one per dataset): one trained on Russian data from Dvach and one trained on an English news corpus. For each method, the best hyperparameters on each dataset were obtained by grid search; code, datasets and links to the models are available on our GitHub (https://github.com/bakarov/conversational-anomaly).
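The averaging step can be sketched as follows; the toy three-dimensional vocabulary is purely illustrative (the paper uses pretrained Word2Vec models), and the zero-vector fallback for all-OOV sentences is an assumption the paper does not specify.

```python
import numpy as np

# Toy word-vector table standing in for a pretrained Word2Vec model.
word_vectors = {
    "book":  np.array([1.0, 0.0, 0.0]),
    "a":     np.array([0.0, 1.0, 0.0]),
    "table": np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of in-vocabulary words; OOV words are dropped."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:                 # every word OOV: fall back to a zero vector
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# "flight" is out of vocabulary, so only "book" and "a" are averaged.
print(sentence_vector("book a flight", word_vectors))
```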
Conclusion We conclude that, on our datasets, thresholding the standard deviation of the predictions of a multiclass classifier (TSD) is the most effective method of anomaly detection. In the future we plan to extend the comparison to more state-of-the-art methods such as GANs, to extend the datasets, and to try more sophisticated methods of obtaining vector representations of sentences (for example, the skip-thought vector model). Table: Accuracy of the considered methods on each dataset.
         OCSVM   IF     LOF    TSD    TLDA   TA
Dvach    0.50    0.47   0.47   0.68   0.50   0.50
Reddit   0.53    0.47   0.47   0.71   0.50   0.50
Figure: Distribution of anomalous (red) and normal (black) data in a two-dimensional space obtained with t-SNE on both datasets. The first column of pictures shows the distribution of the true labels; the others illustrate the predictions of each method.
Acknowledgements
We thank Olga Gureenkova (Chatme AI LLC) for early conversations on some of the ideas presented.