Automatic Stop List Generation for Clustering Recognition Results of Call Center Recordings

Svetlana Popova 1,2, Tatiana Krivosheeva 3, and Maxim Korenevsky 3

1 Saint-Petersburg State University, Saint-Petersburg, Russia
2 Scrol, Saint-Petersburg, Russia
[email protected]
3 STC-innovations Ltd., Saint-Petersburg, Russia
{krivosheeva,korenevsky}@speechpro.com

Abstract. The paper deals with the problem of automatic stop list generation for processing recognition results of call center recordings, in particular for the purpose of clustering. We propose and test a supervised domain dependent method of automatic stop list generation. The method is based on finding words whose removal increases the dissimilarity between documents in different clusters, and decreases dissimilarity between documents within the same cluster. This approach is shown to be efficient for clustering recognition results of recordings with different quality, both on datasets that contain the same topics as the training dataset, and on datasets containing other topics.

Keywords: clustering, stop words, stop list generation, ASR.

1 Introduction

This paper deals with the problem of clustering recognition results of call center recordings in Russian. Solving this task is necessary in speech analytics, in order to find groups of specific calls, as well as thematically similar calls. Clustering of recognition results involves classifying recordings into groups of thematically close recordings. The resulting clusters can be used to produce annotations or to discover specific features of the topics of each cluster. Solving these tasks is important for structuring and visualizing speech information. This paper deals with one aspect of improving clustering quality, which is the automatic generation of domain dependent stop lists. Using stop lists makes it possible to remove words that have a negative effect on clustering quality. We propose and test a method for automatic generation of this list. The principle behind it is that removing stop words should result in an increase in average dissimilarity between texts from different clusters and a decrease in dissimilarity between texts from the same cluster.
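The selection principle just described, scoring a word by how its removal changes the average dissimilarity between clusters and within clusters, can be sketched as follows. This is a toy illustration, not the authors' implementation: the document representation, scoring function, and example documents are all assumptions.

```python
# Sketch of the stop-word selection principle: a word is a stop-word
# candidate if removing it increases average dissimilarity between texts
# from different clusters and decreases it between texts from the same
# cluster. Toy data and function names are illustrative assumptions.
import math
from itertools import combinations

def cos_dissim(a, b):
    """1 - cosine similarity over bag-of-words dicts."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def avg_dissims(docs, labels):
    """Average within-cluster and between-cluster dissimilarity."""
    within, between = [], []
    for i, j in combinations(range(len(docs)), 2):
        d = cos_dissim(docs[i], docs[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

def score_word(word, docs, labels):
    """Positive score => removing `word` separates clusters better."""
    w0, b0 = avg_dissims(docs, labels)
    pruned = [{k: v for k, v in d.items() if k != word} for d in docs]
    w1, b1 = avg_dissims(pruned, labels)
    return (b1 - b0) + (w0 - w1)

docs = [
    {"hello": 1, "pension": 2, "payment": 1},
    {"hello": 1, "pension": 1, "delay": 1},
    {"hello": 1, "road": 2, "repair": 1},
    {"hello": 1, "road": 1, "bus": 1},
]
labels = [0, 0, 1, 1]
# "hello" occurs in every call regardless of topic, so removing it helps:
print(score_word("hello", docs, labels) > 0)  # -> True
```

Words with positive scores go on the stop list; topical words such as "road" in the toy data get negative scores, since removing them blurs the cluster structure.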

2 Motivation and State of the Art

A. Ronzhin et al. (Eds.): SPECOM 2014, LNAI 8773, pp. 137–144, 2014.
© Springer International Publishing Switzerland 2014

Stop words are defined as words that do not carry any important information. Stop lists can be constructed based on frequency statistics in large text corpora [1]. The principle is that the more documents contain a word, the less useful it is for distinguishing between documents. This principle does not always hold, since words occurring in many documents can have high contextual significance [2]. Using stop lists improves both the computational efficiency of algorithms and the quality of their results [3]. Importantly, stop lists are domain dependent: they are ill suited for use outside the field for which they were developed. This problem is addressed in [2] for the task of integrating different web resources. In [4] we used a simple method of stop list generation for the task of key phrase extraction. It was based on searching through the dataset vocabulary and choosing words which improved the quality of key phrase extraction when added to the stop list. The stop list was built on the training dataset and was used for extracting key phrases from the test dataset. In [5] we used a stop list constructed by an expert through analysis of the dataset's frequency list and parts of the data. The use of that stop list significantly improved clustering quality. However, an expert is not always on hand to make specialized stop lists. For this reason we need automatic approaches to stop list generation that can replace expert work. Solving this task is the goal of the research described in this paper.
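The frequency-statistics baseline mentioned above amounts to a document-frequency cutoff. A minimal sketch, in which the corpus, threshold, and function name are illustrative assumptions:

```python
# Baseline stop-list construction from frequency statistics: words whose
# document frequency exceeds a threshold are treated as stop words.
# Corpus and the df_ratio threshold below are toy assumptions.
from collections import Counter

def df_stop_list(corpus, df_ratio=0.8):
    """Return words appearing in more than df_ratio of all documents."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.split()))
    cutoff = df_ratio * len(corpus)
    return {w for w, n in df.items() if n > cutoff}

corpus = [
    "hello i am calling about my pension",
    "hello my street lights are broken",
    "hello i want to ask about kindergarten",
]
print(sorted(df_stop_list(corpus)))  # -> ['hello']
```

As Section 4 argues, this baseline is a poor fit for call transcripts, where some topic-defining words are themselves frequent; the supervised method proposed here is meant to replace it.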

3 Experimental Data

The dataset consists of spontaneous speech recordings (8 kHz sample rate) recorded over different analogue and digital telephone channels. The recordings were provided by Speech Technology Center Ltd. and contain customer telephone calls to several large Russian contact centers. All test recordings have manual text transcripts. We used both the manual transcripts of the recordings and recognition results of varying accuracy. The transcripts and recognition results were manually classified by experts into the most frequent call topics. Three datasets were constructed. The first two comprise different documents classified into the same five topics. One of these datasets, which we will refer to as TRAIN, was used for automatically generating a stop list, while the other, which we will refer to as TEST, was used for testing the changes in clustering quality before and after using the expanded stop list. The description of the TRAIN and TEST datasets is presented in Table 1. In order to demonstrate that the generated stop list is useful for clustering recognition results of the given type of calls but does not depend on the call topic, we built a third test dataset comprising clusters on other topics. We will refer to it as TEST NEXT. In contrast to the former two datasets, which contain documents on relatively distinct (dissimilar) topics, the latter contains documents on comparatively similar topics: issues of wages and unemployment in various professions (medicine, education, etc.). Each of the three datasets was available in three versions: manual transcripts of ideal quality (TRAIN 100, TEST 100), recognition results with 45–55% accuracy (TRAIN 55, TEST 55), and recognition results with 65–80% accuracy (TRAIN 80, TEST 80). To obtain results with low recognition accuracy, we deliberately used a non-target

Automatic Stop List Generation for Clustering Recognition Results

139

Table 1. The description of the datasets: topics and |D|, the number of documents

Dataset   | Topic 1, |D|                        | Topic 2, |D|                 | Topic 3, |D|                   | Topic 4, |D|                  | Topic 5, |D|
TRAIN     | Municipal issues, 44                | Military service issues, 24  | Political issues, 28           | Family & maternity issues, 55 | Transport issues, 35
TEST      | Municipal issues, 128               | Military service issues, 39  | Political issues, 61           | Family & maternity issues, 75 | Transport issues, 58
TEST NEXT | Medical worker's salary issues, 49  | Teachers' salary issues, 20  | Jobs & unemployment issues, 53 |                               |

language model (LM). We used the speaker-independent continuous speech recognition system for Russian developed by Speech Technology Center Ltd. [6][7], which is based on a CD-DNN-HMM acoustic model [8]. The recognition results were obtained:
1) using a general (non-target) LM trained on a text database of news articles (3-gram model, 300k word vocabulary, 5 million n-grams); recognition accuracy on parts of the test database with different quality varies from 45 to 55%;
2) using an interpolation of the general news LM with a thematic language model trained on a set of text transcripts of customer calls to the contact center (70 MB of training data; the training and test datasets did not overlap); recognition accuracy on the test dataset under these conditions reached 65–80%.
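The LM interpolation in case 2) is, in its standard linear form, a weighted mixture of the two models' probabilities. A minimal sketch under that assumption; the probabilities and the weight lambda below are illustrative, not the paper's values:

```python
# Linear interpolation of two language models, as commonly used when
# mixing a general LM with a thematic one. The mixture weight `lam` and
# the example probabilities are assumptions for illustration.
def interpolate(p_general, p_thematic, lam=0.5):
    """P(w|h) = lam * P_general(w|h) + (1 - lam) * P_thematic(w|h)."""
    return lam * p_general + (1.0 - lam) * p_thematic

# A domain word rare in news text but common in call transcripts gains
# probability mass from the thematic component:
p = interpolate(p_general=1e-6, p_thematic=1e-3, lam=0.5)
print(p)  # close to 0.0005005
```

In practice the weight is usually tuned to minimize perplexity on held-out in-domain text; the paper does not report the weight used.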

4 Special Requirements for Stop Words in Call Center Dialog Transcripts

It should be noted that all our datasets contain unilateral phone calls, that is, each recording contains a monologue of the caller recounting his or her opinion or problem. Importantly, the actual problem, the core of the message, is described by only a small fraction of the words in the monologue, while a substantial part consists of the callers introducing themselves, saying where they come from, greetings, goodbyes, thanks, and a large number of common conversational expressions. The words that are not part of the topical content of the call are often very similar and are used in different texts independently of their topic, so they can be considered noise. This leads us to the task of improving clustering quality by purging recognition results of noise words characteristic of such calls. We formulate this as the task of automatically expanding the stop list. It should be noted that for this task, as for many others [2], extracting stop words based on maximum frequency is not very effective. The reason is probably that this type of data is characterized by a relatively small set of words that define the call topic and that may occur in call transcripts only in very small numbers. Removing such words from the text has a negative effect. For instance, words important for topic detection, such as ZhKH 'municipality' or detej 'children', turn out to be more frequent than such


noise words as esche 'still' or takie 'such'. We need to find a more sophisticated method of stop list generation.

5 Quality Evaluation

Our criterion for evaluating the quality of the generated stop list was the result of clustering the test dataset with and without the stop list. The greater the improvement in clustering quality provided by using the stop list, the more successful the stop list is considered to be. The clustering algorithm we use is k-means [9]. We used a vector model for representing documents. The feature space was defined by the dataset vocabulary. The weight of the i-th feature in the j-th document was estimated using tf-idf. We did not use the whole dataset vocabulary for document representation; instead, we used only words whose document frequency exceeded the value of the df parameter. The reason for deleting rare words from the vocabulary was to remove incorrectly recognized words and words specific to certain callers but not to the call topic. In order to evaluate the clustering quality, the k-means algorithm was run 100 times. We used the average estimate over all runs, as well as the best and the worst result. The implementation of the k-means algorithm was provided by the Weka library (http://www.cs.waikato.ac.nz/ml/weka/). The seed values for initialization were the numbers from 0 to 99. The number of centroids was set to the number of topics in the gold standard for the dataset (the result of manual topic labeling). Dissimilarity (document similarity for k-means) was estimated as dist(j, k) = 1 - cos(j, k).
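The document representation just described, tf-idf vectors over a vocabulary pruned by a document-frequency threshold, with dissimilarity dist(j, k) = 1 - cos(j, k), can be sketched as follows. The toy corpus, the min_df threshold, and the function names are illustrative assumptions, not the authors' Weka-based setup:

```python
# Sketch of the document representation used for clustering: tf-idf
# vectors over a df-pruned vocabulary, with cosine dissimilarity.
# Toy corpus and parameter values are assumptions for illustration.
import math
from collections import Counter

def build_vectors(corpus, min_df=2):
    """tf-idf vectors over words whose document frequency >= min_df."""
    docs = [doc.split() for doc in corpus]
    df = Counter(w for doc in docs for w in set(doc))
    vocab = sorted(w for w, n in df.items() if n >= min_df)  # drop rare words
    n = len(corpus)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

def dissimilarity(a, b):
    """dist(j, k) = 1 - cos(j, k) over dense tf-idf vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

corpus = [
    "pension payment pension delay",
    "pension office payment",
    "road repair bus road",
    "bus route road",
]
vocab, vecs = build_vectors(corpus, min_df=2)
print(vocab)  # -> ['bus', 'payment', 'pension', 'road']
# Same-topic documents end up closer than cross-topic ones:
print(dissimilarity(vecs[0], vecs[1]) < dissimilarity(vecs[0], vecs[2]))  # -> True
```

In the paper's setup these vectors are fed to Weka's k-means with seeds 0 through 99 and k equal to the number of gold-standard topics; the sketch covers only the representation and the distance.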
