Agent Productivity Measurement in Call Center Using Machine Learning

Abdelrahman Ahmed¹, Sergio Toral¹, and Khaled Shaalan¹,²,³

¹ Electronic Engineering Department, University of Seville, Seville, Spain
[email protected], [email protected]
² School of Informatics, Edinburgh, UK
³ The British University in Dubai, Dubai, UAE
[email protected]

Abstract. We present an application of sentiment analysis using the Natural Language Toolkit (NLTK) for measuring customer service representative (CSR) productivity in real estate call centers. The study describes in detail the decisions made, step by step, in building an Arabic system for evaluation and measurement. The system includes a transcription method, feature extraction, a training process, and analysis. The results are analyzed subjectively against the original test set. The corpus consists of 7 hours of real estate calls collected from three different call centers located in Egypt. We draw the baseline of productivity measurement in the real estate sector.

Keywords: Sentiment analysis · Agent productivity · Natural Language Toolkit · Productivity measurement

1 Introduction

Call centers are the front door of any organization, where crucial interactions with the customer are handled (Reynolds 2010). Effective and efficient operations are a key ingredient in the overall profitability and reputation of both insourced and outsourced call centers. It is very difficult to measure productivity objectively because the agent's output, as a firm worker, is the spoken words delivered to the customer over the phone. The quality assurance team is responsible for listening to and evaluating the customer service representatives (CSRs). However, the evaluation is handled in a subjective way: they listen to the recorded calls and evaluate the performance and productivity of the call center agent according to their previous experience. The quality team is responsible for reviewing the calls and measuring the level of service quality in terms of evaluating the conversation with the customer. At best, the quality team may evaluate only a small fraction of the agents' calls, because of the limited size of the quality team, the huge number of calls, or both. The mission of the quality team is to listen to the agents' recorded calls and to use predefined evaluation forms (evaluation checklists) (Reynolds 2010) as an objective evaluation. Evaluation forms state the main aspects of a productive call, e.g. saying the greeting, mentioning his/her name, or describing the product or service well when answering customer inquiries. The evaluation process has many drawbacks for the following reasons: (1) Although the evaluation is performed through predefined checklists, the quality team marks the list according to the evaluator's perception and previous experience (Judkins et al. 2003). The subjectivity of this evaluation prevents generalizing the evaluation results to the rest of the agents' calls. (2) The limited number of quality members may not be able to cover all the agents consistently over time. Accordingly, some agents are evaluated at rush hour under very high work pressure while others are evaluated in normal operation. This leads to inconsistent evaluation from one agent to another. These reasons and others have a drastic impact on agent performance and lead to high turnover, which negatively affects the call center business. This paper proposes a method of objectively measuring agent productivity through machine learning. The rest of the paper is structured as follows: Sect. 2 discusses the conceptual framework and gives an overview of each framework component. Section 3 details the front-end preparation for feature extraction and Sect. 4 explains the experiment carried out. Obtained results are shown in Sect. 5 and finally, Sect. 6 concludes the paper.

© Springer International Publishing AG 2017. A.E. Hassanien et al. (eds.), Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016, Advances in Intelligent Systems and Computing 533, DOI 10.1007/978-3-319-48308-5_16

2 Proposed Framework

In this section, we review previous literature about productivity measurement as well as general concepts and methods of agent evaluation in the call center environment.

2.1 Productivity Measurement Definition

A productivity measure is commonly understood as a ratio of outputs produced to resources consumed (Steemann Nielsen 1963). However, the observer has many different choices with respect to the scope and nature of both the outputs and the resources considered (Card 2006). For example, outputs can be measured in terms of delivered products or services, while resources can be measured in terms of effort or monetary cost (Card 2006). An effective productivity measurement enables the establishment of a baseline against which performance improvements can be measured (Thomas and Zavrki 1999). This is the crucial part of productivity measurement, because each call center has its own constructed reality, which differs from one domain to another. Therefore, it requires a dynamic approach of grasping the eminent productivity characteristics of each call center, which helps organizations make better decisions about investments in processes, methods, tools, and outsourcing (Card 2006). Productivity can be formulated using the following equation:

Productivity = Agent Output / Input Effort    (1)


Most call centers focus on the agents' working time and schedule analysis as a baseline for performance evaluation. Call center key performance indicators (KPIs) are widely used to measure agent performance (Reynolds 2010). However, they mainly measure agent discipline and the time spent sitting in front of the desk. Although monitoring and reporting systems in call centers have improved considerably for evaluating agent productivity, the agent's output performance is gauged and measured only through call recording systems (Reynolds 2010). The agent output (spoken words) is the object of analysis in this study. The conceptual framework of the study is described in Fig. 1, where the evaluation process is automated from beginning to end. The block diagram includes a speaker diarization process, speech recognition, sentiment modeling, and a classification process. The next section discusses each building block in the framework.

Fig. 1. The study framework

2.2 Speech Recognition and Speaker Diarization

Speech recognition systems started in the 1980s and obtained a significant improvement in the era of neural networks for machine learning (Yu and Deng 2012). By transcribing the calls into text, content analysis has become a powerful tool for feature prediction and interpretation (Carmel 2005; Othman et al. 2004).


Speech recognition for the Arabic language has achieved high accuracy in terms of word error rate (WER) (Ali et al. 2014). The word error rate is the main indicator of speech recognition accuracy (Young et al. 2013): the lower the WER, the higher the performance of the recognizer (Woodland et al. 1994). An inbound or outbound call is divided into the agent's talk, the customer's talk, silences, music on hold, and noise. As the agent part is the target of the analysis, a diarization process is required. Diarization applies acoustic models and sophisticated signal and speech processing to split a single-channel mono recording into different speakers (Tranter and Reynolds 2006). It removes silences and music, yielding a clean single-speaker voice (Tranter and Reynolds 2006).
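As a concrete illustration (not part of the paper's toolchain), WER is the word-level edit distance between the reference transcript and the recognizer output, normalized by the reference length. A minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed with the standard edit-distance dynamic program over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(r)][len(h)] / len(r)

print(wer("the flat has a nice view", "the flat has nice view"))  # 1 deletion / 6 words
```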

2.3 Sentiment Analysis

Once speech is transformed into text, words (features) are the input data, like raw material, for sentiment analysis. Sentiment analysis refers to detecting and classifying the sentiment expressed by an opinion holder, and considers word frequency and the probability of a word following specific words (context) (Chen and Goodman 1996). Sentiment analysis, sometimes also called opinion mining, is a way to classify text based on the opinions and emotions regarding a particular topic (Richert et al. 2013). This technique classifies text in a polarity fashion (on/off, yes/no, good/bad), and it is used for assessing people's opinions of books, movies, etc. It deals with billions of words over the web and classifies positive and negative opinions according to the most informative features (words) extracted from the text. Agent productivity is different from opinion mining or emotion classification, because productivity is an assessment of the output of the agent, as expressed in Eq. (1), regardless of emotional effect. One of the most important quality standards of call centers is the agent's emotional control while speaking to the customer (Wegge et al. 2006), which means that few emotions or opinion expressions are expressed during the call. Hence, the study proposes sentiment analysis to extract hidden features that differ from opinion mining and help classify productivity on a stochastic basis. This can be validated by a high rate of correctly classified instances compared to the training set of input calls. The sentiment feature is the probability of the class type (productive/non-productive) given the class features, p(C|x), where x denotes the features (words). A Naïve Bayes classifier was selected to train the model and classify the results. The Naïve Bayes classifier is built on Bayes' theorem under the assumption that features are independent of each other.
Naïve Bayes satisfies the following equation:

p(c|x) = p(x|c) p(c) / p(x)    (2)

where c is the class type (productive/non-productive), and x_1, x_2, ..., x_n are the text features. The term p(x|c) is the likelihood of the features given the class in the training process. Naïve Bayes is a generative model, so the model is trained


to generate the data x from class c in Eq. (2). The computation of the term p(x|c) amounts to predicting the features seen in input x given the class c (Martin and Jurafsky 2000). We ignore p(x) in the denominator, as it is a constant that never changes. Accordingly, we look for the maximum of the product of the likelihood p(x|c) and the prior probability p(c) to predict the class given the input features. Hence the predicted class is:

ĉ = argmax_c [ p(x|c) p(c) ]    (3)

We calculate the joint probability by multiplying the probability of the words given the class, p(x|c), by the class probability p(c), and pick the class with the highest score (Murphy 2006):

p(C_k | x_1, x_2, ..., x_n) ∝ p(C_k) ∏_{i=1}^{n} p(x_i | C_k)    (4)
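To make the classification step concrete, here is a minimal sketch of Eq. (4) in log space. All probability values below are hand-picked toy numbers for illustration; in the actual system they come from the training procedure described later.

```python
import math

# Hypothetical log-probabilities; real values are estimated during training.
log_prior = {"productive": math.log(0.8), "non-productive": math.log(0.2)}
log_likelihood = {
    "productive":     {"view": math.log(0.03),  "roof": math.log(0.02)},
    "non-productive": {"view": math.log(0.005), "roof": math.log(0.004)},
}

def predict(words):
    """Pick the class maximizing log p(c) + sum_i log p(x_i|c), i.e. Eq. (4) in log space."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in words:
            if w in log_likelihood[c]:  # unseen words are handled by smoothing in practice
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(["view", "roof"]))  # → "productive" with these toy numbers
```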

The next sections illustrate the training of the classifier and how Eq. (4) is satisfied.

2.4 Data Validity

Data classification accuracy is one of the most important parts of this study. Human performance shows that the level of agreement regarding sentiment is about 80%¹. However, we cannot consider this a baseline, for two reasons. First, the accuracy depends on the domain of the collected text and varies from one domain to another; for example, productivity features are perceived differently than in other domains such as spam emails or movie reviews. Second, the machine learning approach is incomparable to human perception, as we could not find similar research in the same domain. Hence, we draw the first baseline for call center agents in real estate in Arabic. For the data validation of the study, the Naïve Bayes classifier should be able to classify the test set accurately as intended. The accuracy calculation is given by Eq. (5):

Accuracy = F_cor^c / F_tot    (5)

where F_cor^c is the number of correctly classified features per class and F_tot is the total number of features extracted.

3 Front-End Preparation

Front-end preparation refers to the diarization of the input data, the transcription of voice into text, and the transcription of the Arabic text into Latin characters to be ready for training. The following subsections detail each step.

¹ http://www.webmetricsguru.com/archives/2010/04/sentiment-analysis-best-done-by-humans.

3.1 Audio Diarization and Speech Recognition

The diarization process was intended to be performed using the LIUM diarization toolkit (Meignier and Merlin 2010), a Java-based open source toolkit specialized in diarization using speech recognition models. It requires a Gaussian Mixture Model (GMM) trained on voice and corresponding labels, using two or more clustering states according to the number of speakers (the number of states equals the number of speakers). It uses mono-phone GMMs to represent the local probability of each speaker (Meignier and Merlin 2010). For speech recognition, we use both GMM and Hidden Markov Model (HMM) methods, but in different configurations. For Arabic, we use 3 HMM states per phone (40 phones are proposed for Arabic), each state represented by 16 Gaussian components. We faced some challenges when applying this process in our study. The first is that the corpus is only around 7 hours, too small to build a mature acoustic model for diarization and speech recognition. The second is that the corpus consists of telephone conversations sampled at 8 kHz, whereas LIUM requires at least 16 kHz to work. Hence, we manually transcribed and split the agent talk.

3.2 Converting the Arabic Text into Latin (Transcription Process)

The Arabic transcription is converted into Latin characters to make processing easier. The character set consists of 36 characters, as shown in Table 1.

Table 1. Sample of Arabic letters, the corresponding transcription characters, and their English equivalents.

The transcription process maps each letter from Arabic to the corresponding Latin character. The next example shows the transcription of a statement:

Example 1. Sample of an Arabic statement transcribed in Buckwalter.

The transcription shown above transforms the statement from right-to-left (the Arabic writing direction) to left-to-right (Latin). Buckwalter² is a powerful open source scheme for transcribing Arabic into Latin characters, and it is used for many Arabic transcription purposes.

² www.qamus.org/transliteration.htm.
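The letter-by-letter mapping can be sketched as a simple character table. The entries below are a small subset of the Buckwalter scheme for illustration only; the full scheme covers all 36 characters of Table 1.

```python
# Partial Buckwalter mapping (illustrative subset, not the full 36-character table).
BUCKWALTER = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u0633": "s",  # seen
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
}

def to_buckwalter(text):
    """Map each Arabic letter to its Latin Buckwalter character; leave others unchanged."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

print(to_buckwalter("\u0633\u0644\u0627\u0645"))  # the word "salam" → "slAm"
```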

3.3 Training the Naïve Bayes Classifier

We have manually transcribed and classified the data (training set) into productive/non-productive. Training the classifier requires the maximum likelihood estimates of p(x|c) and p(c). The prior p(c) is simply calculated as:

p(c) = N_c / N_tot    (6)

where N_c is the number of text files annotated with class c and N_tot is the total number of files. For the maximum likelihood estimate, we count the frequency of a word in a class and divide it by the overall word count of the same class (Martin and Jurafsky 2000):

p(x_i|c) = count(x_i, c) / Σ_x count(x, c)    (7)

As some words may not occur in a class, count(x_i, c) can be zero, which makes the product of the probabilities zero as well. Laplace smoothing (Martin and Jurafsky 2000) is used to avoid this problem by adding one:

p(x_i|c) = (count(x_i, c) + 1) / (Σ_x count(x, c) + 1)    (8)

To avoid underflow and to speed up processing, we use log probabilities in the Naïve Bayes calculations (Yu and Deng 2012).
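The training procedure of Eqs. (6)-(8) can be sketched as follows. This is a minimal illustration, assuming documents arrive as word lists paired with labels; note that the common form of add-one smoothing adds the vocabulary size |V| to the denominator rather than 1, and that is what the sketch uses.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, class_label) pairs.
    Returns log p(c) (Eq. 6) and Laplace-smoothed log p(x|c) (Eq. 8), both in log space."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    # Eq. (6): prior = files per class / total files
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing over the vocabulary (|V| in the denominator)
        log_lik[c] = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
    return log_prior, log_lik
```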

4 The Experiment

The corpus consists of 7 hours of calls from real estate call centers hosted in Egypt, collected and transcribed by the Luminous Technologies center ([email protected]). We listened carefully to the calls (30 calls) and categorized them as productive or non-productive. The criterion used in annotating the text files is the ability of the agent (CSR) to respond to customer inquiries with the right answers and with appropriate words (features). The 30 calls are split into smaller audio chunks, each around 10-20 seconds long. This step is mandatory for the diarization and speech recognition decoding process, because longer files fail to be recognized automatically (Meignier and Merlin 2010). In total there are 500 files, divided into a training set of 400 files and a test set of the remaining 100 files. Since the proposed solution is based on machine learning, we avoid the problem of defining productivity and let the quality team of each call center decide the criteria for categorizing the calls according to their own definition. In the end, a model of the training set is built from features unique to each call center's criteria. The number of productive files was 400 versus only 100 files in the non-productive set. This unbalanced annotation biases the class probability p(c) in Eq. (3), giving more weight (scaling) to one class than the other. It should be, for


Table 2. The most informative features

example, 400 productive versus 400 non-productive. In this case, we have to use a scaling factor α to adjust the per-class probability p(c). The equation becomes:

C^ψ = argmax_c [ log p(x|c) + α log p(c) ]    (9)

where C^ψ ∈ C is the predicted class. We convert the text from Arabic to Latin using the Buckwalter script, and then convert the results back into Arabic. The code is developed in Python using the NLTK Naïve Bayes classifier (Raschka 2015). The code uses a bag-of-words representation, an unordered set of words that disregards position and keeps only frequency (Martin and Jurafsky 2000).
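The α-scaled decision rule of Eq. (9) can be sketched as below. The numbers are toy, hypothetical probabilities chosen to show how damping an imbalanced prior can flip a decision; the unseen-word floor is an ad-hoc stand-in for proper smoothing.

```python
import math

def predict_scaled(words, log_prior, log_lik, alpha=1.0):
    """Eq. (9): argmax_c [ log p(x|c) + alpha * log p(c) ].
    alpha < 1 shrinks the influence of an imbalanced class prior."""
    floor = math.log(1e-6)  # small floor probability for unseen words (illustrative only)
    scores = {c: alpha * log_prior[c] + sum(log_lik[c].get(w, floor) for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy model: the prior heavily favors "productive" (as with 400 vs 100 files),
# but the word "expensive" is more likely under "non-productive".
lp = {"productive": math.log(0.8), "non-productive": math.log(0.2)}
ll = {"productive": {"expensive": math.log(0.01)},
      "non-productive": {"expensive": math.log(0.03)}}

print(predict_scaled(["expensive"], lp, ll, alpha=1.0))   # prior dominates: "productive"
print(predict_scaled(["expensive"], lp, ll, alpha=0.25))  # damped prior: "non-productive"
```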

5 Results

The accuracy achieved in this experiment is 67% of calls correctly classified as either productive or non-productive. The corpus (30 calls, 7 hours) is small; as expected, higher accuracy would require a bigger corpus. The test set has been classified according to the features extracted from the corpus, the so-called most informative features (MIF). An MIF is a feature, or set of features, with very high frequency representing productive/non-productive words, on which the classification decision is built. The features have been converted from Buckwalter to Arabic, and then to English, as follows. The classifier provides the frequency of all the words in the test set for each class (productive/non-productive). As shown in Table 2, we selected


only the 10 most informative features out of the more than 100 features extracted and classified. We subjectively explore the meaning behind the classification to better understand the definition of productivity. Feature number 3, in English "I have no idea", is non-productive, reflecting a lack of awareness of the product or service. In feature number 6, the agent dictates his/her mobile number over the phone, which is considered non-productive as it consumes much time. Feature number 10, about prices ("expensive"), is classified as non-productive because it drags the CSR into a debate and consumes much time with no benefit. Furthermore, it might be an unjustified answer by the agent, indicating low awareness of market updates and price changes; this feature may be categorized under the same heading as feature number 3. Productive agents mentioned features of the apartments or villas, such as "the view", "the roof", and the city ("October"), which can be considered product awareness. The feature in itself works perfectly in evaluating the agent for mentioning selling points during the call. Other features are meaningless, for example "to go", "yesterday", "Upper", and "Sales". We think these may relate to the 33% error rate of the sentiment analysis, which was expected from the beginning. The accuracy can be improved by increasing the corpus size and balancing the training set between productive and non-productive classes.
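A ranking like Table 2 can be produced with NLTK's built-in Naïve Bayes classifier, the tool named in Sect. 4. The sketch below uses toy, hypothetical bag-of-words feature sets standing in for the transcribed call chunks; in the real system these come from the Buckwalter-transcribed corpus.

```python
import nltk

# Toy feature dicts standing in for transcribed call chunks (all hypothetical).
train_set = [
    ({"view": True, "roof": True}, "productive"),
    ({"view": True, "October": True}, "productive"),
    ({"idea": True}, "non-productive"),
    ({"expensive": True}, "non-productive"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify({"view": True}))
# Prints the top features with their per-class likelihood ratios,
# analogous to the MIF listing in Table 2.
classifier.show_most_informative_features(5)
```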

6 Conclusion

We have proposed a machine learning method for evaluating agents' performance and assessing call center CSRs objectively. Sentiment analysis with the NLTK Naïve Bayes classifier is used to classify agents as productive/non-productive. The study also proposes an end-to-end evaluation system using diarization and speech recognition. A remaining research gap is an alternative approach to sentiment analysis using logistic regression: comparing the results of this study (a generative model) with logistic regression (a discriminative model) may yield significant insight into productivity measurement. Statement fragmentation techniques, i.e. fragment significance estimation and statement context analysis, could better evaluate the contents and context of the text. To make agent evaluation more informative, the productivity classification can also be extended to a range of scales rather than the binary productive/non-productive mode. A supervised neural network could also help obtain better classification results with less time consumption.

Acknowledgment. Many thanks to the Luminous technology center ([email protected]) for the corpus and for giving full access to the experiment server. Special thanks to Dr. Kyoko Fukukawa, Bradford University, Bradford, UK, and Dr. Yasser Hifny, Helwan University, Cairo, Egypt, for their outstanding efforts on this paper.


References

Ali, A., Zhang, Y., Cardinal, P., Dahak, N., Vogel, S., Glass, J.: A complete KALDI recipe for building Arabic speech recognition systems. In: 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE (2014)
Card, D.N.: The challenge of productivity measurement. In: Proceedings of the Pacific Northwest Software Quality Conference (2006)
Carmel, D.: Automatic analysis of call-center conversations. In: Ron Hoory, A.R. (ed.) (2005). https://www.researchgate.net/publication/221614459
Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics (1996)
Judkins, J.A., Shelton, M., Peterson, D.: System and method for evaluating agents in call center. Google Patents (2003)
Martin, J.H., Jurafsky, D.: Speech and Language Processing. Prentice Hall PTR, Upper Saddle River (2000). International Edition
Meignier, S., Merlin, T.: LIUM SpkDiarization: an open source toolkit for diarization. In: CMU SPUD Workshop (2010)
Murphy, K.P.: Naive Bayes Classifiers. University of British Columbia, Vancouver (2006)
Othman, E., Shaalan, K., Rafea, A.: Towards resolving ambiguity in understanding Arabic sentence. In: International Conference on Arabic Language Resources and Tools, NEMLAR. Citeseer (2004)
Raschka, S.: Python Machine Learning. Packt Publishing, Birmingham (2015)
Reynolds, P.: Best practices in performance measurement and management to maximize quitline efficiency and quality. North American Quitline Consortium (2010)
Richert, W., Coelho, L.: Building Machine Learning Systems with Python. Packt Publishing, Birmingham (2013)
Steemann Nielsen, E.: Productivity, definition and measurement. In: Hill, M.N. (ed.) The Sea, vol. 2, pp. 129–164. Wiley, New York (1963)
Thomas, H.R., Zavrki, I.: Construction baseline productivity: theory and practice. J. Constr. Eng. Manag. 125(5), 295–303 (1999)
Tranter, S.E., Reynolds, D.A.: An overview of automatic speaker diarization systems. IEEE Trans. Audio Speech Lang. Process. 14(5), 1557–1565 (2006)
Wegge, J., Van Dick, R., Fisher, G.K., Wecking, C., Moltzen, K.: Work motivation, organisational identification, and well-being in call centre work. Work Stress 20(1), 60–83 (2006)
Woodland, P.C., Odell, J.J., Valtchev, V., Young, S.J.: Large vocabulary continuous speech recognition using HTK. In: 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94. IEEE (1994)
Young, S., Evermann, G., Gales, M., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book (2013)
Yu, D., Deng, L.: Automatic Speech Recognition. Springer, London (2012)
