Subword Semantic Hashing for Intent Classification on Small Datasets

Kumar Shridhar (1,2), Amit Sahu (1), Ayushman Dash (1), Pedro Alonso (4), Gustav Pihlgren (4), Vinay Pondeknath (3), Fotini Simistira (4), Marcus Liwicki (1,3,4)

arXiv:1810.07150v1 [cs.CL] 16 Oct 2018

(1) MindGarage, (2) Technical University Kaiserslautern, (3) University of Fribourg, (4) Luleå University of Technology

August 2018

Abstract

In this paper, we introduce the use of Semantic Hashing as an embedding for the task of Intent Classification and outperform previous state-of-the-art methods on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word-embedding based methods [11, 10, 8] are dependent on vocabularies. One of the major drawbacks of such methods is the handling of out-of-vocabulary terms, especially with small training datasets and a wide vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise from the use of internet communication. First, such datasets miss many of the vocabulary terms needed to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, intent classification models are not trained on spelling errors, and it is difficult to anticipate all the ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues; an ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: AskUbuntu, Chatbot, and Web Application [2]. Our benchmarks are available online.1

1 Introduction

State-of-the-art systems in many different classification tasks are based on deep neural networks [5]. This is, among other reasons, because of neural networks' ability to efficiently learn the various features present in the classes. However, this ability also makes neural networks prone to over-fitting on the training data. A wide variety of strategies is used to prevent this, but the most reliable way to prevent over-fitting is to have a large amount of training data. This makes neural networks, deep networks in particular, unsuited for solving problems with small datasets. In such cases, as this paper will show, it is often better to use less complex machine learning models, possibly together with deep representation learning as pre-training for meta-features. One field where datasets are often small is Intent Classification for industry applications. Intent classification is the task of finding, given an input (usually a text), the intent behind that text. For example, the intent behind the sentence "Sugar causes tooth decay" is to make you eat less sugar. Since a text can have countless different intents, labeling them is difficult and highly problem-specific. For example, political intents may have labels such as "leftist" or "right-wing", while questions may have intents such as "where can I find" or "how do I". Because of this, most real-life intent classification datasets are small (below 25 examples per class). In this paper, we introduce an effective way to provide features to an intent classifier for small datasets, and we evaluate it on the AskUbuntu, WebApplication, and Chatbot corpora [2].

1 https://github.com/kumar-shridhar/Know-Your-Intent


A classifier trained on semantic hashing features outperforms the previous state-of-the-art classifiers, as shown in this paper. The three datasets on which the classifier has been evaluated were introduced in Evaluating Natural Language Understanding Services for Conversational Question Answering Systems [2] as a baseline for testing Natural Language Understanding (NLU) services. An NLU service is a toolkit or API which can train a natural language classifier. The idea is that a user without prior knowledge of machine learning can simply provide examples of the input and expected output of a natural language processing system, and the NLU service will train that system for them. The above-mentioned paper introduces three datasets that can be used to train and evaluate different NLU services on intent classification or entity classification. These datasets have often been used as a baseline for testing NLU services and other classification systems [2].

2 Related Work

Embeddings are one of the essential parts of any deep learning based Natural Language Processing system. Word embeddings are the most prominent type of embedding, thanks to Word2Vec [10] and GloVe [12]. Both of these models are trained in an unsupervised fashion and are based on the distributional hypothesis: words that occur nearby have similar contextual meaning. These word embeddings were further improved by FastText [6] with the inclusion of character n-grams, which allows better approximation of out-of-vocabulary words. The state of the art was further improved by the Allen Institute for AI with the introduction of Deep Contextualized Word Representations (ELMo) [13]. Going from word to sentence embeddings often involves averaging the word vectors of a sentence, commonly referred to as the Bag-of-Words approach. This simple approach was improved by the use of Concatenated p-mean Embeddings [14] instead of plain averaging. Skip-thought vectors [7] are another approach to learning unsupervised sentence embeddings; a vocabulary expansion scheme improved the results by handling words unseen during training, but the training time for skip-thought vectors is very high. Quick-thought vectors [9] reduced the training time by replacing the decoder with a classifier, following a discriminative approximation to the generation problem. A supervised learning approach did not seem intuitive for embeddings until InferSent [4] used the Stanford Natural Language Inference (SNLI) Corpus [1] to train a classifier. MILA/MSR's General Purpose Sentence Representations [16] further extended the supervised approach by encoding multiple aspects of the same sentence. Google's Universal Sentence Encoder [3] uses a transformer network trained over a variety of datasets, with the same model then applied to a variety of tasks.

3 Datasets

Two different data corpora were used for our evaluation and benchmarks: the Chatbot Corpus and the StackExchange Corpus. A chatbot on Telegram was built to answer questions related to public transport connections; the questions collected while the bot was in use form the basis of the Chatbot Corpus. The data distribution of this corpus is described in detail in the next section. The Ask Ubuntu and Web Application platforms, in turn, form the basis of the StackExchange Corpus.

Table 1: Data sample distribution for the Chatbot corpus

Intent              Train   Test
Departure Time         43     35
Find Connection        57     71


Table 2: Data sample distribution for the AskUbuntu corpus

Intent                     Train   Test
Make Update                   10     37
Setup Printer                 10     13
Shutdown Computer             13     14
Software Recommendation       17     40
None                           3      5

Table 3: Data sample distribution for the WebApplication corpus

Intent              Train   Test
Change Password         2      6
Delete Account          7     10
Download Video          1      0
Export Data             2      3
Filter Spam             6     14
Find Alternative        7     16
Sync Accounts           3      6
None                    2      4

The StackExchange data corpus was collected from these two platforms. Both corpora are available on GitHub under the Creative Commons CC BY-SA 3.0 license: https://github.com/sebischair/NLU-Evaluation-Corpora.

3.1 The Chatbot Corpus

The Chatbot Corpus consists of two different intents (Departure Time and Find Connection) with a total of 206 questions. The corpus also has five different entity types (StationStart, StationDest, Criterion, Vehicle, Line), which were not used in our benchmarks as we focused only on Intent Classification. The language of the samples is English; however, the train station names are German, which is evident from the use of German umlauts (ä, ö, ü) and ß. The data is split into training and test sets as shown in Table 1.

3.2 The StackExchange Corpus

The StackExchange Corpus consists of two datasets: the AskUbuntu Corpus and the WebApplication Corpus. The AskUbuntu Corpus consists of five intents (Make Update, Setup Printer, Shutdown Computer, Software Recommendation, and None). A total of 190 samples were extracted from the AskUbuntu platform; only the questions with the highest scores and most views were taken. The answers to these questions were also included in the corpus. Amazon Mechanical Turk was used to map the correct intent to each question. Table 2 shows the data distribution of the AskUbuntu Corpus. The WebApplication Corpus followed the same data preparation procedure as the AskUbuntu Corpus. It consists of 100 samples and eight intents (Change Password, Delete Account, Download Video, Export Data, Filter Spam, Find Alternative, Sync Accounts, and None). The data distribution is shown in Table 3.
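For reference, the sketch below shows one way the corpora could be loaded in Python. The file name and the JSON layout (a top-level "sentences" list whose entries carry "text", "intent", and a boolean "training" flag) are assumptions about the sebischair/NLU-Evaluation-Corpora repository and should be verified against it; this is not code from the paper.

```python
import json
import urllib.request

# Assumed corpus layout:
# {"sentences": [{"text": ..., "intent": ..., "training": true/false}, ...]}
# File name and field names are assumptions; check the repository before relying on them.
URL = ("https://raw.githubusercontent.com/sebischair/"
       "NLU-Evaluation-Corpora/master/AskUbuntuCorpus.json")

with urllib.request.urlopen(URL) as response:
    corpus = json.load(response)

train = [(s["text"], s["intent"]) for s in corpus["sentences"] if s["training"]]
test = [(s["text"], s["intent"]) for s in corpus["sentences"] if not s["training"]]
print(len(train), "training samples,", len(test), "test samples")
```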

4 Methodology

4.1 Semantic Hashing

Our method for semantic hashing is inspired by the Deep Semantic Similarity Model [15]. In that work, the authors propose a way to hash tokens in an input sentence so that the model depends on hash values rather than on the tokens themselves. This method also reduces hash collisions. Our method works as follows. Given an input text T, e.g., "I have a flying disk", tokenize it to create a list of tokens t_i. The output of the tokenization should look like ["I", "have", "a", "flying", "disk"]. Pass each token into a pre-hashing function H(t_i) to generate sub-tokens t_i^j, where j is the index of the sub-token, e.g., H(have) = [#ha, hav, ave, ve#]. H(t_i) first appends a # at the beginning and at the end of a token, and then extracts trigrams from it. These trigrams are the sub-tokens t_i^j. H(t_i) is applied to the entire corpus, and the resulting sub-tokens are used to create a Vector Space Model (VSM).


Figure 1: f1 score comparison of various classifiers with Semantic Hashing

This VSM is used to extract features for a given input text; in other words, it acts as a hashing function for an input text sequence.
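As a concrete illustration, the following is a minimal sketch of this sub-token hashing, using scikit-learn's CountVectorizer with a custom analyzer as the VSM. It is an illustrative re-implementation under our own naming (semhash_tokenizer), not the authors' released code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def semhash_tokenizer(text):
    """Tokenize `text` and return the character trigram sub-tokens of each token."""
    sub_tokens = []
    for token in text.split():
        padded = "#" + token + "#"              # H(t_i): pad the token with '#'
        sub_tokens.extend(padded[i:i + 3]       # extract all trigrams
                          for i in range(len(padded) - 2))
    return sub_tokens

print(semhash_tokenizer("have"))  # ['#ha', 'hav', 'ave', 've#']

# The VSM over sub-tokens: fit it on the training corpus, then reuse it to
# hash any input sentence into a sparse feature vector.
vsm = CountVectorizer(analyzer=semhash_tokenizer)
X_train = vsm.fit_transform(["I have a flying disk",
                             "when is the next train to the airport"])
```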

4.2 Data Augmentation

A dictionary-based synonym replacement of nouns and verbs was performed to generate new sentences and increase the data size. This introduced new variations into the training dataset, but it did not take spelling errors into account. We therefore introduce a QWERTY-based mistake probability to add noise to the training data and make the model more robust to spelling mistakes. Our proposed method works in two steps and is based on the assumption that spelling errors are most likely caused by pressing a nearby key on the keyboard instead of the intended one. First, every key on the keyboard is mapped to the keys around it; a list of all keys with their distances to nearby keys is created, using fixed x and y distances between keys. A figure showing the Cartesian coordinates of all keys on the keyboard is given in the Appendix. Second, a mistake probability P is defined that denotes the chance of making a spelling mistake. Any character C is swapped with another character C' that lies within distance d from C and has an error probability greater than P. The values of d and P are predefined. This spelling-error noise is added to the training dataset, and the performance of the model is evaluated as described in the next section.
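The sketch below illustrates this kind of noise injection. The hard-coded neighbour map, the default mistake probability, and the function name are illustrative assumptions; the paper derives key neighbourhoods from Cartesian keyboard coordinates with predefined values of d and P.

```python
import random

# Partial map of QWERTY keys to their physical neighbours (extend for full coverage).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "h": "gyujnb", "n": "bhjm", "o": "iklp", "t": "rfgy",
}

def add_typo_noise(sentence, p=0.05, rng=random):
    """With probability p, replace each character by one of its neighbouring keys."""
    noisy = []
    for ch in sentence:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < p:
            noisy.append(rng.choice(neighbours))  # simulate hitting an adjacent key
        else:
            noisy.append(ch)
    return "".join(noisy)

print(add_typo_noise("when does the next train leave", p=0.1))
```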

4.3 Intent Classification

Using the VSM from the previous section as a feature extractor, a classifier can be trained to classify intents. This classifier can in principle be any known classifier, such as a Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), or Convolutional Neural Network (CNN). The experiments in this paper were carried out with a number of classifiers, namely a Random Forest classifier, Ridge classifier, Multi-Layer Perceptron, Support Vector Machine, K-Nearest Neighbours (KNN), Linear Regression, Logistic Regression, and a Stochastic Gradient Descent (SGD) classifier, all provided by the scikit-learn library. Default parameters were used for the Regression, SVM, and Ridge classifiers. A grid search was used for the MLP, Random Forest, SGD, and KNN classifiers to find the best hyper-parameters. A prior based on the class distribution was used for the Naive Bayes classifier. Figure 1 shows the f1 score comparison of the mentioned classifiers with semantic hashing features. The results achieved were comparable to the state of the art for all three corpora. To further improve the results, data augmentation was used as described in the previous section.
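A minimal sketch of this setup is shown below, combining the semhash_tokenizer from the Section 4.1 sketch with an MLP and a small grid search. The toy sentences, labels, and hyper-parameter grid are placeholders, not the paper's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy training data; the real experiments use the corpora from Section 3.
texts = ["change my password", "reset password please",
         "delete my account", "remove my account"]
labels = ["Change Password", "Change Password",
          "Delete Account", "Delete Account"]

pipeline = Pipeline([
    ("vsm", CountVectorizer(analyzer=semhash_tokenizer)),  # sub-token VSM from the Sec. 4.1 sketch
    ("mlp", MLPClassifier(max_iter=500)),
])

# Small, illustrative hyper-parameter grid searched with cross-validated f1.
grid = GridSearchCV(pipeline,
                    param_grid={"mlp__hidden_layer_sizes": [(50,), (100,)],
                                "mlp__alpha": [1e-4, 1e-3]},
                    cv=2, scoring="f1_weighted")
grid.fit(texts, labels)
print(grid.best_params_)
print(grid.predict(["how do I change my password"]))
```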

5 Results

The performance of our method is evaluated based on the f1 score on the test set of each of the three corpora and on the average of the three scores. This average score is compared against the results of various NLU services on the market: Botfuel, Dialogflow, Luis, Watson, Rasa, Recast, and Snips. We achieved state-of-the-art results on the AskUbuntu and WebApplication corpora, and a result comparable to the state of the art on the Chatbot corpus.

Table 4: f1 score comparison of different NLU services with our approach

Platform        Chatbot   AskUbuntu   WebApplication   Average
Botfuel            0.98        0.90             0.80      0.91
Luis               0.98        0.90             0.81      0.91
Dialogflow         0.93        0.85             0.80      0.87
Watson             0.97        0.92             0.83      0.91
Rasa               0.98        0.86             0.74      0.88
Snips              0.96        0.83             0.78      0.89
Recast             0.99        0.86             0.75      0.89
Our Approach       0.98        0.95             0.83      0.92

We also achieved state-of-the-art results on the overall average score. Table 4 shows the corresponding f1 score comparison; a plot of these scores can be found in the Appendix. Also worth noting are the training and inference times, which are on the order of seconds and milliseconds, respectively. Unfortunately, no training and testing times are available for the other NLU services, so no comparison can be made. However, the whole solution, if integrated into a conversational service, would act in real time.

6 Discussion and Future Work

The f1 scores of the three big companies (Dialogflow, Luis, and Watson) vary considerably across the three corpora, which indicates substantial differences in the approaches these companies use. Interestingly, their results are comparable to those of the start-ups Snips, Recast, and Botfuel, as well as the open-source platform Rasa. This suggests that this field of AI is still rather new for everyone and that there is no clear dominance by the giants in the field. Our method works well for all three use cases and achieves the best results on average, which shows good potential for the method. In future work, it needs to be tested and benchmarked on more datasets to assess its domain independence.

References

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.
[2] Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174-185, 2017.
[3] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, and R. Kurzweil. Universal Sentence Encoder. ArXiv e-prints, March 2018.
[4] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. ArXiv e-prints, May 2017.
[5] Kevin Gurney. An Introduction to Neural Networks. Taylor & Francis, Inc., Bristol, PA, USA, 1997.
[6] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of Tricks for Efficient Text Classification. ArXiv e-prints, July 2016.
[7] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-Thought Vectors. ArXiv e-prints, June 2015.
[8] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196, 2014.
[9] Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.
[10] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[12] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532-1543, 2014.
[13] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.
[14] Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. Concatenated power mean embeddings as universal cross-lingual sentence representations. arXiv, 2018.
[15] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pages 373-374. ACM, 2014.
[16] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. ArXiv e-prints, March 2018.

7 Appendix


Figure 2: f1 score comparison of various NLU services with our approach

Figure 3: Cartesian coordinates of all keys on the keyboard
