A Framework for Large Scalable Natural Language Call Routing Systems

Cheng Wu, David Lubensky, Juan Huerta, Xiang Li*, and Hong-Kwang Jeff Kuo

Human Language Technology Department, IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights, New York 10549, USA
*Dept. of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

{chengwu, davidlu, huerta, hkuo}@us.ibm.com, [email protected]

ABSTRACT

A framework is proposed for enterprise automated call routing system development and large scalable natural language call routing application deployment, based on IBM's speech recognition and natural language understanding (NLU) application engagement practices in recent years. To make it easy to employ different call classification algorithms, the framework architecture provides a plug & play environment for evaluating promising call routing algorithms and a systematic approach to carrying out a large scalable enterprise application deployment. The paradigm in this paper illustrates the complementary effort of developing an automatic call routing application for enterprise call centers, covering everything from call classification algorithm investigation to the application programming model. Experimental results on a live-data test set collected from an enterprise call center show that the performance of the call classification algorithm implemented in this framework is outstanding.

Keywords: Call routing, Framework, Call classification

1. INTRODUCTION

A call center is the result of the evolution of the traditional front office into the customer service center, and it is therefore an effective and contemporary means of improving customer relations, increasing customer loyalty, and winning business. To carry out this important task, a typical enterprise call center employs several hundred to a few thousand human agents. Normally a call center has many different groups of agents to handle different kinds of issues, such as billing and payment, order status, and technical support requests. In order to transfer a call to its appropriate destination, a human agent, normally called level-one support, needs to answer the call and engage in a dialogue with the caller to identify the reason for the call. Considering the huge number of daily calls, the cost of level-one support, whose only purpose is to transfer a call to the right destination, is very significant.

Since their inception in the late 1960s, Interactive Voice Response (IVR) systems have become a natural part of everyday life. Call centers have been able to deflect some of their level-one support costs through DTMF key based IVR applications, which ask the user to make a numerical choice using the telephone keypad. These manual routing applications assume that a user fully understands the menus (sometimes long and complicated) and is able to make a correct numerical choice. In practice, however, caller behavior is typically far from this expectation; with long and complicated menus in particular, the options heard earlier are easily forgotten. Many callers either randomly press a number in the hope of reaching a real person or succeed only after repeated tries. The costs for the enterprise therefore include routing mistakes and high customer dissatisfaction rates due to the stress and disappointment of dealing with DTMF menus. Call center executives are facing strong pressure to cut costs and improve service quality, and are looking for new cutting-edge technologies. According to a survey of 100 large call center executives conducted by Frost & Sullivan, speech recognition based automation is the number one technology (28%) they are looking for.

Voice automated call routing applications enable customers to express what they want in spoken natural language. An example is AT&T's How May I Help You (HMIHY) natural dialog system, in which users ask questions about their bills, calling plans, service contacts, etc. in natural language [Gorin et al 1997], and calls are automatically routed to appropriate level-two support agents or other self-service applications. BBN has developed an automatic call routing solution called BBN Call Director, which uses a statistical topic identification system to identify the topic of a call and route it to the desired destination [Natarajan et al 2002]. Voice-enabled automated natural language call routing offers great potential to release callers from the frustrating touch-tone modality and to sharply reduce the rate of misrouting. This technology shifts human agent operation to automation, transfers the cognitive load from the caller to the computer, and will dramatically lower call center operating costs and improve the user satisfaction rate.

Many papers have been published in recent years on natural language call routing as such applications attract more and more attention from the enterprise call center industry. The reported approaches include probabilistic models with salient phrases [Gorin et al 1997, Arai et al 1999], a vector space model based information retrieval technique [Li et al 2002, Chu-Carroll et al 1999], a multinomial model for keyword occurrences incorporating Bayesian and Log Odds classifiers [Golden et al 1999], a boosting-based system for text categorization [Schapire et al 2000], and Discriminative Training (DT) of Latent Semantic Indexing (LSI) [Kuo et al 2003].

This paper focuses on the framework and methodology used to speed up the development and deployment of automated natural language call routing systems. The paradigm proposed in this paper illustrates the complementary effort of building a scalable call routing application for enterprise call centers and addresses issues such as the call classification algorithm, the application programming model, and system self-learning. First the architecture of a scalable automated call routing system is presented and discussed in detail. Next the investigation of call classification is performed based on benchmark results for a test set of live

data collected from an enterprise call center, and the comprehensive testing results are discussed.

2. THE ARCHITECTURE OF THE AUTOMATED CALL ROUTING FRAMEWORK

The mission of an automated natural language call routing system is to route the caller to the desired destination based on a naturally spoken response to an open-ended prompt such as "XYZ (company name), how may I help you?". The system therefore includes IVR, CTI (Computer Telephony Integration), speech recognition, text-to-speech, and a call classifier, and these components must work together in a framework that meets the requirements of a large scalable enterprise call center. In addition, in order to evaluate different call classification algorithms and establish that the selected one has the best performance for scalable system deployment, it is desirable to have a framework with a plug & play interface through which call classification algorithms can easily be integrated. The application call flow can be implemented in many different ways depending on the deployment environment, so a framework which supports multiple programming models is an important feature with potentially great market value.

System integration is a major step in the commercial deployment of voice automation applications. Voice automated call routing applications rely on both telephony and internet networks, and in this framework IBM's Service Provider Delivery Environment (SPDE) serves as the integration platform for automated call routing application deployment. SPDE is an advanced multimedia, data, and content service delivery environment that embodies a multitude of products and services intended to best match a telecommunications customer's strategic service delivery strategies. SPDE allows global telecommunication carriers to provide for their customers' customers, creating an environment for application delivery and mission-critical, fully integrated, day-to-day business operation.

2.1. IVR and SPDE

IBM WebSphere Voice Response (WVR) provides a scalable telephony infrastructure integration environment. Embedded digital T1/E1 trunks on a single board support up to 288 telephony channels (T1), and up to three T1/E1 digital trunk boards can be configured on each WebSphere Voice Response system. The call transfer function of WebSphere Voice Response is well matched to the requirements of a call routing system. IBM SPDE provides easy integration of call center automation services with multi-device inputs.

2.2. Speech Recognition Engine

The IBM ViaVoice telephony recognition engine is employed in this framework. The engine derives 12th-order MFCC (Mel-frequency cepstral coefficient) features plus energy every 10 ms, along with their first and second derivatives. Linear Discriminant Analysis (LDA) training is applied, and the LDA transformation is computed every 10 ms (i.e., per frame). A trigram language model is used.
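For illustration, a generic version of such a front end might compute the per-frame features as follows; this is only a sketch under common assumptions, not the proprietary ViaVoice implementation, and the librosa library and the pre-trained lda_matrix are assumptions of the example.

import numpy as np
import librosa

def extract_features(wav_path, lda_matrix):
    # Illustrative MFCC + delta + LDA front end; not the ViaVoice implementation.
    y, sr = librosa.load(wav_path, sr=8000)                  # telephony-band audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=int(0.010 * sr))  # 13 cepstra (c0 ~ energy) every 10 ms
    d1 = librosa.feature.delta(mfcc)                         # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)                # second derivatives
    frames = np.vstack([mfcc, d1, d2]).T                     # (n_frames, 39) feature matrix
    return frames @ lda_matrix                               # per-frame LDA projection (matrix trained offline)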


Figure 1. SPDE-based IBM scalable call routing framework architecture. (The figure shows the multi-device input/delivery paths, the IVR server with state-table and WVAA/VXML applications, function state tables, a custom server, a Java Bean layer, and the voice server with its SR, TTS, and NLP processes, SR/TTS engines, and Tcl/C classifier plug-ins.)

2.3. Text-to-Speech Engine

IBM's new concatenative TTS engine, which produces more natural-sounding synthesized speech, is used in this framework.

2.4. Call Classification Engine

Two plug-in interfaces are available from the Natural Language Processing (NLP) process for call classifiers. One is Tool Command Language (Tcl) based, so that a standard procedure written in Tcl script can easily be plugged into the NLP process; the other is C based, so that a C library in which a call classifier is implemented can easily be loaded by the NLP process. Both interfaces use the same data format, making them transparent to the application programming models.

2.5. Processes on the Voice Server

The Voice Server manages three basic kinds of processes: the Speech Recognition (SR) process, the Text-to-Speech (TTS) process, and the Natural Language Processing (NLP) process, each connecting to its own engine. Each of these processes is considered a channel, an object constructed to provide certain functionality. The channels are always asynchronous, and results are delivered back in two modes, callback and notification, which provides the foundation for large, scalable voice servers.
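As a minimal sketch of this channel abstraction (hypothetical Python pseudocode with invented names; the deployed voice server is a native server process and is not structured exactly this way), a request submitted to a channel can deliver its result through either a callback or a notification queue:

import queue
import threading

class Channel:
    # Illustrative asynchronous channel wrapping an SR, TTS, or NLP engine stub.
    def __init__(self, engine):
        self.engine = engine
        self.notifications = queue.Queue()   # notification-mode delivery

    def submit(self, request, callback=None):
        def work():
            result = self.engine(request)    # blocking engine call on a worker thread
            if callback is not None:
                callback(result)             # callback-mode delivery
            else:
                self.notifications.put(result)
        threading.Thread(target=work, daemon=True).start()

# Example: a recognition channel whose "engine" is a stub returning recognized text.
sr_channel = Channel(engine=lambda audio: {"RECOTEXT": "I have a question about my bill"})
sr_channel.submit(b"<audio frames>", callback=print)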

2.6. Unified API and Programming Models

Three programming models (State Table, VXML, and WVAA) are supported by this framework, and a unified API for accessing the call classifiers is implemented across all programming models, no matter how a call classifier is implemented and loaded. The State Table programming model allows a voice automated call routing application to be written in WVR's state tables, a scripting language based on pre-built functions. As shown in Figure 1, a state-table application accesses the speech recognition, TTS, and call classification functions on the voice server by using pre-built function state tables through a custom server. The VoiceXML (VXML) programming model is an XML-based internet markup language for developing speech applications. Figure 1 illustrates that a VXML application connects to the same pre-built function state tables through Java Bean technology; more specifically, a VXML application employs a VXML object written in JavaScript to call a Java Bean, and this VXML object accesses the same recognition, TTS, and call classification functions on the voice server side. WVAA (WebSphere Voice Application Access) provides a programming model which allows users to implement three kinds of presentation modality together: speech presentation (VXML), visual presentation (HTML), and wireless presentation

(WML). The speech presentation is developed using JSP pages which include both VXML and JavaScript; as in the VXML programming model, WVAA uses a VXML object written in JavaScript to call the same function state tables through a Java Bean layer. The difference is that the speech presentation must be sent from WVAA (the application server) to the IVR server over an HTTP connection.

The API for accessing call classification functions is transparent to the programming and implementation method of a call classifier; the API only defines the input/output between a call routing application and a call classifier. Although the communication between the two components passes through many different layers, the API protocol is not re-interpreted. The unified API has the following basic attribute-value pair format:

From the application: {ACTION DATA}, where ACTION = {RECOTEXT, IVRACT}
To the application: {ACTION DATA, GRAMMAR DATA, PROMPT DATA, ...}
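For illustration only, a call routing application and a call classifier could exchange these messages as simple attribute-value dictionaries. This is a minimal sketch: the ROUTE action string and the helper names are hypothetical, while RECOTEXT, IVRACT, GRAMMAR, and PROMPT are the attributes listed above.

def to_classifier(action, data):
    # Build the request sent from the application, e.g. action="RECOTEXT"
    # with the best (or N-best) recognition text as data.
    return {"ACTION": action, "DATA": data}

def from_classifier(message):
    # Unpack the classifier reply: routing action plus optional grammar and prompt data.
    return message.get("ACTION"), message.get("GRAMMAR"), message.get("PROMPT")

request = to_classifier("RECOTEXT", "I have a question about my bill")
action, grammar, prompt = from_classifier(
    {"ACTION": "ROUTE billing", "PROMPT": "Transferring you to the billing department."})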

2.7. Testing Environment for Large Scalable Deployment

This framework provides an easy integration environment for call classification algorithms. Before the first phase of deployment it is critical to evaluate the selected call classification algorithm's performance, including accuracy, speed, scalability, and stability, and the framework is designed for that purpose. It is very convenient for developers to perform various kinds of testing on this framework, from live call evaluation to load testing. WVR can be configured to make calls from one trunk to another and play back desired dialogs over the connected channels, and users can easily collect all the test data from the log files provided by WVR and analyze the results. The framework can also easily co-operate with commercial Hammer testing programs.

3. CALL CLASSIFICATION ALGORITHM ASSESSMENT

Natural language call routing is still a challenging research problem despite the apparent success of a few commercially deployed automated natural language call routing systems. Although speech recognition technology has made great progress in recent years, it is still far from perfect. Because of mis-recognitions by the speech recognition engine, a robust natural language call classifier is critical for processing a caller's request under the demands of natural language dialogue. For example, one challenge is how to deal with multiple topics within one request, a very common case when a caller engages in spoken natural language dialog. Another important issue in the design of a natural language call routing application is the domain portability of the algorithm: the developed algorithms should be easily applicable to various tasks in different call centers with a limited supply of field data. It is therefore the goal of this framework to find a deployment-oriented algorithm which can deal with all of the above issues.

Currently the most popular and basic call routing method is vector-based natural language call routing. In the vector space model, call routing is treated as an instance of document routing, where a collection of labeled documents is used for training and the task is to judge the relevance of a set of test documents. Each destination in the call center is treated as a collection of documents, and a new caller's request is evaluated in terms of its relevance to each destination. The key problem is that there is no guarantee that the classification error is minimized by the way the routing matrix is constructed.

In addition to the vector-based call routing methods, we also tested the performance of statistical Bayes classification based routing methods. Specifically, we used a multinomial distribution to model the likelihood of words given each routing destination. Under the multinomial assumption, the likelihood of a sequence of words $(W_1, W_2, \ldots, W_n)$, i.e., a sentence, given the routing destination $T_j$ can be written as

$P(W_1, W_2, \ldots, W_n \mid T_j) = \frac{N!}{|W_1|!\,|W_2|! \cdots |W_n|!}\; P_{W_1|T_j}^{|W_1|}\, P_{W_2|T_j}^{|W_2|} \cdots P_{W_n|T_j}^{|W_n|}$    (3.1)

and the final call routing destination is determined as

$T = \arg\max_j \left\{ P(W_1, W_2, \ldots, W_n \mid T_j)\, P(T_j) \right\}$    (3.2)

In the above equations, $|W_i|$ is the total number of occurrences of word $W_i$ within the sentence, $N$ is the total number of word occurrences in the sentence, $N = \sum_i |W_i|$, and $P_{W_i|T_j}$ is the probability of the single word $W_i$ given the routing destination $T_j$, which can be estimated from training data using the maximum likelihood (ML) criterion as

$P_{W_i|T_j} = \frac{|W_i \mid T_j|}{\sum_j |W_j \mid T_j|}$    (3.3)

where $|W_j \mid T_j|$ is the total number of occurrences of word $W_j$ in the training sentences with routing destination $T_j$.
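As a concrete illustration of Equations (3.1)-(3.3) (our sketch, not the deployed classifier), the routing rule can be computed in log space. The add-one smoothing is our addition, used only to keep unseen words from driving a destination's score to minus infinity, and the multinomial coefficient is omitted because it is identical for every destination and cancels in the argmax of Equation (3.2).

import math
from collections import Counter, defaultdict

def train(labeled_sentences):
    # labeled_sentences: iterable of (list_of_words, destination) pairs.
    word_counts = defaultdict(Counter)   # |W_i | T_j| counts per destination
    dest_counts = Counter()              # destination counts for the prior P(T_j)
    vocab = set()
    for words, dest in labeled_sentences:
        dest_counts[dest] += 1
        word_counts[dest].update(words)
        vocab.update(words)
    return word_counts, dest_counts, vocab

def route(words, word_counts, dest_counts, vocab):
    total_calls = sum(dest_counts.values())
    best_dest, best_score = None, float("-inf")
    for dest, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)          # smoothed denominator of Eq. (3.3)
        score = math.log(dest_counts[dest] / total_calls)  # log P(T_j)
        for w, n in Counter(words).items():                # n = |W_i| within the sentence
            p = (counts[w] + 1) / denom                    # smoothed P(W_i | T_j)
            score += n * math.log(p)                       # multinomial term of Eq. (3.1)
        if score > best_score:
            best_dest, best_score = dest, score
    return best_dest                                       # argmax of Eq. (3.2)

The same scoring loop can also be applied to every hypothesis of an N-best recognition list and the scores combined, which is one simple way to use the N-best output discussed later in this section.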

Discriminative training of the routing matrix in the vector-based model gave very impressive results. Since one of the authors of this paper originally proposed discriminative training for natural language call routing [8], it is our privilege to explore this discriminative training further on this framework and test database. Discriminative training is applied to the latent semantic indexing (LSI) matrix. This matrix is called the routing matrix and is trained from the statistical occurrences of words and word sequences in a training corpus. The training process involves the construction of an $(n \times m)$ routing matrix $R$, whose columns represent the $m$ terms (features) and whose rows represent the $n$ destinations (classes). Terms are weighted according to term frequency inverse document frequency (TFIDF) and are also normalized to unit length to reduce the dynamic range. Specifically, let $\vec{x}$ be the observation vector and $\vec{r}_j$ the model document vector for destination $j$. The discriminant function for class $j$ and observation vector $\vec{x}$ is defined to be the dot product of the model vector and the observation vector:

$g_j(\vec{x}, R) = \vec{r}_j \cdot \vec{x} = \sum_{i=1}^{F} r_{ji} x_i$    (3.4)

Given that the correct target destination for $\vec{x}$ is $k$, the misclassification function is defined as

$d_k(\vec{x}, R) = -g_k(\vec{x}, R) + G_k(\vec{x}, R)$    (3.5)

where

$G_k(\vec{x}, R) = \left[ \frac{1}{K-1} \sum_{j \neq k,\, 1 \le j \le K} g_j(\vec{x}, R)^{\eta} \right]^{1/\eta}$    (3.6)

is the anti-discriminant function of the input $\vec{x}$ with respect to class $k$, and $K-1$ is the number of competing classes. Notice that

$d_k(\vec{x}, R) > 0$    (3.7)

implies misclassification, i.e., the discriminant function for the correct class is less than the anti-discriminant function. Equation (3.5) essentially converts a multi-dimensional decision function into a one-dimensional metric. The Generalized Probabilistic Descent (GPD) algorithm can then be applied to iteratively optimize a non-decreasing function of this misclassification metric. The key message here is that, in contrast to conventional maximum likelihood training, discriminative training of the routing matrix uses the minimum classification error criterion, and classification accuracy and robustness are improved by adjusting the models to increase the separation of the correct class from adjacent ambiguous classes. Finally, discriminative training improves portability by making the classifier robust to different feature selections and by decreasing the amount of training data needed.

The ViaVoice recognition engine can return an N-best list of hypotheses. How to use this N-best information to ameliorate the effects of recognition errors is a challenge for the classifier. A simple practical method is to feed the classifier the N-best hypotheses instead of only the best hypothesis. This framework provides a simple and efficient channel to pass all the N-best recognition results to a classifier.
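To make the training step concrete, the following sketch performs one GPD-style minimum classification error update of the routing matrix using Equations (3.4)-(3.7). It is our schematic reconstruction: the sigmoid loss, the hyperparameter values, and the exact gradient bookkeeping are standard MCE/GPD choices rather than details taken from the deployed system.

import numpy as np

def gpd_step(R, x, k, eta=2.0, gamma=1.0, lr=0.1):
    # One minimum classification error update for observation x with true class k.
    # R: (K, F) routing matrix whose rows are destination vectors; x: (F,) term vector.
    g = R @ x                                        # Eq. (3.4): all discriminant values
    K = R.shape[0]
    others = [j for j in range(K) if j != k]
    S = np.mean(g[others] ** eta)
    G_k = (S + 1e-12) ** (1.0 / eta)                 # Eq. (3.6): anti-discriminant function
    d_k = -g[k] + G_k                                # Eq. (3.5): misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d_k))        # smoothed 0/1 loss of the sign test in Eq. (3.7)
    dl_dd = gamma * loss * (1.0 - loss)
    grad = np.zeros_like(R)
    grad[k] = -x                                     # derivative of d_k w.r.t. the correct row
    for j in others:                                 # derivative of d_k w.r.t. competing rows via G_k
        grad[j] = (g[j] ** (eta - 1) * G_k ** (1 - eta) / (K - 1)) * x
    return R - lr * dl_dd * grad                     # gradient descent step on the smoothed loss

In a full implementation, R would first be built from TFIDF-weighted, unit-normalized term-by-destination statistics (optionally after LSI dimensionality reduction), and the update would be iterated over the training utterances with a decreasing step size.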

4. EXPERIMENT RESULTS

To evaluate the performance of the call classification algorithms, training and test databases were created, both consisting of live data collected from an enterprise call center where IBM has deployed conversational applications. Both the training and test data are expected to reflect the real call center environment that a customer experiences. An utterance in either set may have from zero up to six topics/classes, and the test set is large enough to make the results significant. Tables 1 and 2 show the topic distributions of the training and test sets, respectively.

Number of Topics per Sentence    0        1        2        3       4-5-6
Number of Utterances             4601     17842    3437     377     69
Percentages (%)                  17.73    68.76    13.25    1.4     0.22

Table 1. Training-set data distribution; the data were collected from an enterprise call center.

Number of Topics per Sentence    0        1        2        3       4-5-6
Number of Utterances             820      3584     723      82      11
Percentages (%)                  15.7     68.66    13.85    1.57    0.2

Table 2. Test-set data distribution; the data were collected from an enterprise call center.

A wide variety of call classification algorithms have been evaluated for our framework. These algorithms were trained on the training data set described in Table 1 and tested on the test set described in Table 2. The algorithms benchmarked include standard heuristic TFIDF word weighting, a Naïve Bayes classifier, the discriminatively trained LSI model, and the multinomial-distribution-based method both by itself and in combination with a boosting algorithm. Among all the evaluated algorithms, the multinomial model and the discriminatively trained LSI model were the top two in terms of overall performance, and their test results are presented in this paper; Table 3 compares the two algorithms. It should be pointed out that the multinomial model in this evaluation does not use any stop-word filtering techniques. In contrast, in the discriminative-training-based vector space model an optimal set of weights is determined for the features, so that important features are accentuated to achieve minimum classification error while the weights for unimportant features are automatically reduced through normalization.

                        Discriminative Training    Multinomial model with    Multinomial model without
                        based LSI Error Rate       boosting Error Rate       boosting Error Rate
1 topic utterances      4.0%                       7.5%                      9.1%
2 topic utterances      6.2%                       8.5%                      10%
3 topic utterances      7.3%                       9.5%                      10.9%
0-6 topic utterances    10.0%                      -                         -

Table 3. Comparison of the classification error rates of the discriminatively trained LSI model and the multinomial model on the test set.

Overall the discriminative-training-based vector model gave outstanding performance and is about 30-40% better than the multinomial model. However, if the detection of multiple topics in an utterance is considered, the multinomial model appears to be more robust than the discriminatively trained vector space model: from one-topic detection to two-topic detection, the multinomial model's performance degraded by 12%, in contrast to a 35% performance degradation for the discriminatively trained vector space model. In summary, IBM's natural language call routing framework provides an easy platform and environment for developers of various call classification algorithms to speed up the path from development to deployment.

5. CONCLUSION

The IBM natural language call routing framework introduced in this paper offers a complete infrastructure for large scalable natural language call routing system deployment and an easy platform and environment for call classification development and live performance evaluation. The unified API to the classifiers and the transparent implementation in this framework create a plug & play environment in which different call classifiers can easily be employed. The discriminative training of the vector-based model addressed in this paper is shown to be one of the most promising algorithms for call routing applications in enterprise call centers. Experiments on the standard test set in the framework show that the discriminatively trained LSI model achieves substantial performance gains. Future work will focus on discriminative training for multi-topic detection and discriminative training of the multinomial distribution model.

6. REFERENCES

[1] Gorin, A. L., Riccardi, G., and Wright, J. 1997. "How May I Help You?", Speech Communication, Vol. 23, pp. 113-127.
[2] Natarajan, P., Prasad, R., Suhm, B., and McCarthy, D. 2002. "Speech Enabled Natural Language Call Routing: BBN Call Director", Proc. ICSLP 2002, pp. 1161-1164.
[3] Arai, K., Wright, J. H., and Gorin, A. 1999. "Grammar Fragment Acquisition Using Syntactic and Semantic Clustering", Speech Communication, Vol. 27, pp. 43-62.
[4] Li, L. and Chou, W. 2002. "Improving Latent Semantic Indexing Based Classifier with Information Gain", Proc. ICSLP 2002, pp. 1141-1144.
[5] Chu-Carroll, J. and Carpenter, B. 1999. "Vector-Based Natural Language Call Routing", Computational Linguistics, Vol. 25, No. 3, pp. 361-368.
[6] Golden, J., Kimball, O., Siu, M., and Gish, H. 1999. "Automatic Topic Identification for Two-Level Call Routing", Proc. ICASSP, pp. 509-512, April.
[7] Schapire, R. E. and Singer, Y. 2000. "BoosTexter: A Boosting-Based System for Text Categorization", Machine Learning, Vol. 39, No. 2/3, pp. 135-168.
[8] Kuo, H.-K. J. and Lee, C.-H. 2003. "Discriminative Training of Natural Language Call Routers", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 1, January.
