
Tracking Changes in Language

John Grothendieck
AT&T Labs—Research, 180 Park Avenue, Florham Park, NJ 07932, USA
Email: [email protected]  Phone: 973-360-8277

Abstract: One recent field of study is the extraction of useful information from changes in a data stream that includes natural language. Statistical tests upon single word occurrences can reveal many apparent differences. However, automatically ascertaining the causes of changes in the data stream requires methods for finding structure within the entire set of individual changed items. This work presents a methodology for understanding how a language model has altered, based upon utterance clustering and statistical tests on individual features. It further examines clustering of lexical items via profiles of changes in association scores. A machine using an analysis package based on these techniques can isolate novel portions of the data stream. Human inspection of such data then readily determines the nature of the observed change. We investigate several variants of this analysis upon data drawn from an automated call center.

EDICS category: 1-LANG

Index Terms: Clustering, change detection, speech data mining

I. INTRODUCTION

Modern computers can understand natural language at a useful level. For example, AT&T has automated many customer service tasks using the "How May I Help You?" (HMIHY®) system [6]. This service receives, processes, and transcribes a caller's telephone audio signal automatically. Semantically loaded words allow a machine learning algorithm to build a classifier that performs well on a restricted domain. Yet as time passes, customer needs can evolve beyond the situations that arose within the training data. The automated classifier may not perform acceptably on novel requests. Recognizing such situations can maintain user satisfaction and prevent financial losses.

The essential problem is characterizing the differences between two language models M1 and M2. While some research has attempted to adapt existing language models, there has been little attention to the nature of the differences between them. Statistical tests on individual elements of a model can generate a long list of significant differences.


Given a sufficiently large data sample, hundreds of words can demonstrate shifts in their relative frequencies. Yet such a list may be due to a handful of root causes. One might view the detected changes as extracted information, while the causes would be extracted knowledge. The goal of analysis is typically to extract knowledge. Thus tools that connect statistical changes in the language to a human appreciation of semantics are very useful.

Analysis thus divides into two stages. Changes within the data must be discovered; this has some body of literature and practice. The second stage is the extraction of useful intelligence about the changes, to extend data mining beyond merely noticing unexpected patterns in the data. Such analysis on a stream of speech data records appears to be a novel problem.

Recognizable structure can emerge within a list of individual changes to the language. Characteristic words and phrases typically appear in utterances regarding a specific topic. Should a new topic emerge, those words appearing within its signature templates will exhibit stronger mutual associations than had been observed previously. A computer can identify a set of related changes, bring it to the attention of some monitor, and point out a few transactions that seem representative of what has changed; human judgment and a few moments of inspection can provide a label for each group. Rather than attempting to build an explicit probability model for complicated data records, one can cluster utterances based upon some notion of similarity and present any group exhibiting significant changes to a human for evaluation. Statistical comparisons relative to the global time distribution lead to investigation of those clusters which demonstrate major departures. Thus a machine algorithm can output a list of clusters prioritized by unexpected distribution in time. This provides a more parsimonious explanation of how languages differ, since one cluster can account for many changes. Detailed examination of such clusters generates a smaller set of changes on a readily describable subset of the population.

Natural speech contains hesitations, misstatements, and other dysfluencies. Audio data further includes background noise. Systems designed for general conversational use and real-time response have poor accuracy compared with human transcriptions. Perhaps 20–30% of the words chosen by such an ASR system will be wrong [5]. Noisy recognizer output complicates any analysis. A significant change in the frequency of a word need not indicate a change in the true occurrence of that word. Consistent mistakes can still be detected and provide useful knowledge, but consistency is not guaranteed. Novel utterances spark especially poor performance in the ASR, so various errors may disperse the impact of one change among multiple incorrect words. Thus human judgment is still required. Problems of noisy data and false alarms can be addressed by providing the human with sufficient information about each cluster to rapidly decide the import of the associated changes.

This work focuses on 1-best speech recognizer output. Thus the fundamental unit of language considered is the word—reasonable when considering data records in English text, while other applications or other languages might support a different choice of lexical unit.


Investigation begins with the unigram model, performing statistical tests on word counts. More elaborate language models would demand a similar analysis on a different set of events, whether n-grams or parsing steps in a probabilistic context-free grammar. Comparing unigrams provides a natural starting point in analyzing a stream of conversations with meta-data.

Even this simple language model presents challenges to extracting knowledge. When the relative frequency of a particular word shifts significantly in the ASR output, possible explanations include:

1) No actual change occurred in the language, but something else has affected the records—e.g. a new speech recognizer or different audio quality.
2) The usage of some other word or words elsewhere in the data has changed, and the detected change is a secondary effect. Statistical tests of relative frequency for common words can have enough power to detect "significant" changes when the real activity is in some other portion of the data.
3) The word is being used in a known context, but with different frequency. This commonly follows from a more general change in customer intents, so meta-data might provide insight into the cause.
4) The usage of some other word has changed, leading to consistent misrecognition.
5) The word is being used in a new context. Recognizing such situations is important since the system has not been trained on such transactions and is likely to respond in a less-than-intelligent fashion.

These are listed roughly in order of increasing interest, since the concern is with situations that the automated system is not trained to handle or handles poorly. Version changes to a deployed system should be a matter of record. There would be benefit in distinguishing among the other types of changes—secondary, known, and novel. Understanding the reasons underlying changes to the language requires placing them within some wider context.

II. UTTERANCE CLUSTERING

A. Sample

Here is an example using real data from AT&T's HMIHY® application. An independent test on the relative frequency of each word in the ASR output for January against March 2002 gives the list shown in Table I. This presents the changed words ranked by the difference in log probability for the two months. Closer examination of the data reveals a novel set of utterances in mid-January concerning mass mailings that warned customers of pending increases in the costs of various service plans. Presented with a list of unigram changes and access to the full data records, a human finds the cause of the many utterances containing "letter" in January without much difficulty. The different relative frequency of references to specific months is hardly unexpected given that the month has changed. A little knowledge of the domain explains the word "unlimited"—AT&T introduced a new calling plan with that name in February.


Changes to such words as "about" prove harder to explain. The test significance shows that mere coincidence is highly unlikely, but no obvious explanation presents itself. There are many differences between the two time periods considered; many explanations seem plausible. The entire list consists of over 100 detected changes, so people are unlikely to examine every member. Any insights the lower-scoring words might have shed on the causes of the more important changes will be lost. These results would be more useful in a format that presented groups of related changes.

A human being tends to seek classes within the full list. These intuitive groups demonstrate several distinct patterns of change. Some words are strongly associated within both time domains; some phrase has a different relative frequency, e.g. "long distance." Other words such as letter, rate, and change tend to co-occur in only one time domain. The month names (class 2) each possess a distinctive profile in time; rather than appearing within the same phrase, they tend to fulfill the same role in different utterances. Automatically extracting such different classes presents a challenge, but would be a major step towards discovering knowledge.

Associated meta-data fields can provide further insight into observed changes. Comparison of word probabilities for data sampled from consecutive days reveals a periodic pattern within the customer service requests seen in HMIHY. Saturdays and Sundays exhibit characteristic differences, as do Sundays and Mondays. Customer intents on the weekends follow a different distribution. Tests reveal well over 100 significant changes in individual word probability (at p=0.001) when Sundays are compared with the following Monday, yet few changes are typically detected between successive Sundays.

Does the difference in caller intents explain the changes to the language? Figure 1 presents the Kullback-Leibler (KL) divergence of each day from the year-long average unigram distribution as a baseline. After generating a conditional word unigram model for each transaction label, one could compare the observed unigram model to the mixture of label-specific conditional distributions weighted by the number of SLU labels. Figure 1 gives the KL divergence between these probability distributions for the days of 2002. The outlying values consist almost entirely of Sundays. Perhaps on Sundays some part of the system is performing notably worse, or the word-to-label associations differ. Further conditioning the label-specific unigram distributions on the day of week brings the divergences for Sunday into line with the other observations. The unigram word distributions differ on Sundays, even for fixed call classes. Much of the change in the language model can be explained by the different distribution of caller intents—but further meta-data such as day of week has a notable impact.
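As a concrete illustration, the sketch below computes this kind of divergence between one day's observed unigram distribution and a baseline built as a label-weighted mixture of label-conditional models. The function and variable names (kl_divergence, label_mixture, daily_word_freqs, label_models, daily_label_counts) are hypothetical; this is a minimal sketch of the comparison described above, not the deployed system's implementation.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) between two unigram distributions over a shared vocabulary."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def label_mixture(label_models, label_counts):
        """Mixture of label-conditional unigram models, weighted by observed label frequency."""
        total = sum(label_counts.values())
        mix = np.zeros(len(next(iter(label_models.values()))))
        for label, count in label_counts.items():
            mix += (count / total) * np.asarray(label_models[label], dtype=float)
        return mix

    # Divergence for one day, given that day's observed word frequencies and SLU label counts:
    # kl_divergence(daily_word_freqs, label_mixture(label_models, daily_label_counts))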

B. Sublanguages

Suppose one wishes to compare two different language models. Some observed shifts in P(W) may not be of interest. In particular, the relative frequency of one word may alter significantly when the actual cause is unusual behavior in a different portion of the data stream.


Figure 2 illustrates a situation in which many apparent changes are explained by another. Thus detected changes can be secondary effects of a major shift in the overall language distribution. One advantage of examining subpopulations is that such secondary changes are readily recognized—while the overall word frequency has changed, the context in which it is used has not. This section develops an approach to solving this problem via a recursive divisive analysis.

Given meta-data, it is possible to condition the language model on any other field. While noting that a distribution has changed has some value, more useful tools would connect this change with a particular label (or customer segment, or geographic region), particularly if the remainder of the data appears unchanged. Localizing changes within the entire data stream to a particular subpopulation is itself a major step towards recognizing the cause.

Language can be viewed as a continuum of overlapping sublanguages, each more or less familiar to a given user. A human hearing a conversation compares it with existing memories in order to determine which categories it belongs to, then attempts to understand it by those rules. Thus natural language data D is not evaluated by some global language model P(D) so much as it is by scanning known sublanguage scores. The more appropriate (probable) sublanguages are used to extract the meaning of D. This requires multiple previously encountered language models to be stored in memory, and somehow generalized or adapted to suit similar but not identical data.

More formally, given topics T_i with probability distributions over linguistic events P(.|T_i), a document can be generated by first choosing a topic T_i based on some multinomial distribution, then using P(.|T_i) to randomly generate the sequence D. For simplicity, assume the linguistic events are single words and that these occur independently [10]. Then

P(W_1 W_2 \dots W_N) = \sum_{j=1}^{K} \Bigl( \prod_{n=1}^{N} P(W_n \mid T_j) \Bigr) P_T(T_j)
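The sketch below simply evaluates this sum for a single document under hypothetical topic priors and topic-conditional word probabilities; it is an illustration of the formula only, and the floor probability used for unseen words is an arbitrary smoothing choice rather than part of the model above.

    import math

    def document_probability(words, topic_priors, word_probs):
        """P(W_1 ... W_N) = sum_j [prod_n P(W_n | T_j)] * P_T(T_j) for a unigram topic mixture.

        topic_priors: dict topic -> P_T(topic)
        word_probs:   dict topic -> dict word -> P(word | topic)
        """
        total = 0.0
        for topic, prior in topic_priors.items():
            cond = word_probs[topic]
            # log-domain product of per-word probabilities, with a tiny floor for unseen words
            log_term = sum(math.log(cond.get(w, 1e-9)) for w in words)
            total += prior * math.exp(log_term)
        return total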

Since topic shifts occur in natural language, more elaborate models allow the topic of D to change—examples include the aspect model [7] and latent Dirichlet allocation [1]. The known benefits of such models suggest an analysis based upon change detection within multiple, restricted portions of the data.

The problem of comparing linguistic data records from two domains naturally lends itself to recursive divisive analysis. Combining all data records provides a global language model. That model can then be separated into more specialized languages in a variety of ways—explicit conditioning on the values of meta-data such as classifier label or day of week is one possibility, as is hard clustering of the data points via maximum likelihood. This separates the data into a set of more homogeneous languages, each comprised of linguistic data records from two domains—exactly the original problem, writ smaller and more manageable. This approach can track the correlations of other words with one that has apparently changed without explicitly computing N^2 such quantities for a large value of N; words that tend to co-occur enter the same data clusters.


Divisive analysis further addresses the problem of significant yet meaningless changes illustrated in Figure 2, and even aids in the process of extracting knowledge by directing human attention to semantically limited sub-languages. There are drawbacks; replacing global with local tests increases the risk of missing those changes that affect multiple clusters, as well as the risk of false detection due to multiple testing on a random sample. Yet no data is missing; it has merely been re-organized. Evidence can be combined across different clusters—a word that demonstrates a modest change in probability across many different clusters would register as significant, while a word that showed borderline significance in only one of many clusters would not. The division of the data into clusters is essentially the stratification of a contingency table; rather than being tied to some covariate of interest, the stratification here is determined by some clustering algorithm. Thus investigations can be carried out within one utterance cluster, while homogeneity or trend can be tested across clusters. Statistical tests at different levels of granularity can even provide insight into the nature of changes in the global language.
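One standard way to pool evidence from per-cluster 2x2 tables is a Cochran-Mantel-Haenszel style statistic. The text does not commit to a particular pooled test, so the sketch below is only an illustration of combining word-by-time-domain tables across cluster strata, with the helper name chosen for the example.

    import numpy as np
    from scipy.stats import chi2

    def cmh_test(tables):
        """Cochran-Mantel-Haenszel test pooling per-cluster 2x2 tables of word presence by time domain.

        tables: iterable of 2x2 arrays [[a, b], [c, d]] with rows = word present/absent
                and columns = time domain 1/2, one table per cluster (stratum).
        Returns (statistic, p_value); a small p-value suggests a shift that is consistent across clusters.
        """
        a_sum = e_sum = v_sum = 0.0
        for t in tables:
            a, b, c, d = np.asarray(t, dtype=float).ravel()
            n = a + b + c + d
            a_sum += a
            e_sum += (a + b) * (a + c) / n                          # expected count under independence
            v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        stat = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum              # continuity-corrected statistic
        return stat, chi2.sf(stat, df=1)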

C. Divisive Algorithms

The most direct approach to conditioning data records is by the presence or absence of particular words. Changes in word usage might be detected by simply investigating the sub-language consisting of all utterances containing that word for significant differences. A list of significant differences in unigram probabilities makes a starting point for further investigation. The most critical missing ingredient is simply identifying which changes are "important", which seems problematic without outside knowledge. Many score functions seem plausible—some combination of the significance and magnitude of the estimated change in probability should direct attention to the more important changes in the language. With this, it becomes possible to identify those transactions containing the most important change. Tests upon the remainder of the data would be unaffected by the isolated change; any secondary effects disappear. Should the data still exhibit interesting behavior in time, simply repeat the process. These elements provide the Word-Peeling Algorithm sketched in Figure 3.

This algorithm potentially leads to an explosion in the number of domains to consider; in practice, however, most branches created by the presence of a reference word swiftly terminate due to sparse data. The domain splits may be likened to peeling an onion—after subpopulations containing certain words are peeled off, the bulk of the data for the two time domains will hopefully exhibit no significant changes. These splits terminate in a partition of the data set into sub-populations characterized by the presence or absence of various words, in which no sufficiently interesting changes are detected across time domains to justify further division of the data.

Word-Peeling produces usable results, but it can overlook important relationships among the individual words. However change importance is assigned, the scoring function may have drawbacks.
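A minimal sketch of the Word-Peeling loop follows, under the assumption of a caller-supplied score_fn that returns importance scores only for words whose relative frequency differs significantly between the two time domains. The record format, the helper, and the stopping parameters are placeholders for illustration, not the implementation sketched in Figure 3.

    def word_peeling(records, score_fn, min_size=1000):
        """Recursive Word-Peeling sketch: split on the highest-scoring changed word.

        records:  list of (tokens, time_domain) pairs.
        score_fn: placeholder returning {word: importance} for words whose relative
                  frequency differs significantly between the two time domains.
        Returns (split_history, records) leaves.
        """
        leaves = []

        def split(data, history):
            scores = score_fn(data)
            if len(data) < min_size or not scores:
                leaves.append((history, data))             # no further interesting change detected
                return
            word = max(scores, key=scores.get)             # most important changed word
            present = [r for r in data if word in r[0]]
            absent = [r for r in data if word not in r[0]]
            split(present, history + ["+" + word])         # peel off utterances containing the word
            split(absent, history + ["-" + word])

        split(records, [])
        return leaves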


Focusing on absolute magnitude of the change can lead to early splits on such non-specific words as "to" or "about", which lead to languages not much simpler than the original. Focusing on relative difference typically assigns too much weight to uncommon words, ranking as important the changes in language that are most likely to be false alarms. Employing both within the score partially addresses these issues, but this sensitivity to the scoring process highlights a fundamental concern. A hasty focus on one particular variable value as interesting could ignore useful structure within the data. Thus a less direct approach to divisive analysis might yield better results.

One alternative to explicit conditioning on words is to divide the data stream into subpopulations using the criterion of language entropy. The general language model is viewed as a mixture of more specific ones. Even noisy ASR output contains considerable structure that can be used to separate transaction records. Similar data records are grouped together and used to train sub-models. Individual transactions within the data can be assigned to different clusters, including any meta-data fields such as time information, SLU labels, customer segment, etc. This gives a natural mechanism for organizing the data that does not require a complicated search through the space of Boolean conditions.

This methodology has arisen within the problem of creating optimal decision trees using a large, sparse set of covariates [2]. Multiple covariates separate the observations into isolated data points or very small equivalence classes. Rather than attempting to build an explicit map from the covariate values, a divisive algorithm finds the optimal split of some node built from these classes into two new leaves. The covariates then provide a well-defined map into each leaf. For language models, evaluating a split of the data involves checking the improvement in overall entropy that follows from generating a separate language model for each part. Here is an outline of the process, the Similar-Utterance Algorithm:

1) Pool the transaction records from two time domains.
2) Split each node of the tree that fulfills some splitting criterion. Initialize by random assignment of each utterance to one of two subpopulations. Then iteratively:
   a) Generate the language model for each subpopulation.
   b) Reassign each utterance and associated data to the subpopulation with the language model that gives it the higher probability.
   until model convergence has been achieved.
3) If any of the newly-created leaves fulfill the splitting criterion, go to step 2.
4) Use the resulting set of leaves as the initial configuration of a k-means clustering: iteratively reassign each utterance to the leaf that gives it the highest probability and then retrain the language model for each leaf, until convergence.
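Step 2 above is a hard-assignment EM split. The sketch below shows one way it could look with add-alpha smoothed unigram models; the smoothing, the vocabulary handling, and the fixed iteration cap are illustrative assumptions rather than the system's actual choices.

    import math
    import random
    from collections import Counter

    def unigram_model(utterances, vocab, alpha=0.5):
        """Add-alpha smoothed unigram model for a group of token lists (vocab must cover all words)."""
        counts = Counter(w for u in utterances for w in u)
        total = sum(counts.values()) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def log_prob(utterance, model):
        return sum(math.log(model[w]) for w in utterance)

    def split_node(utterances, vocab, max_iters=20, seed=0):
        """Split one node into two sub-languages by hard EM reassignment (step 2 of Similar-Utterance)."""
        rng = random.Random(seed)
        assign = [rng.randint(0, 1) for _ in utterances]          # random initialization
        for _ in range(max_iters):
            groups = [[u for u, a in zip(utterances, assign) if a == k] for k in (0, 1)]
            if not groups[0] or not groups[1]:
                break                                             # degenerate split; stop
            models = [unigram_model(g, vocab) for g in groups]    # retrain each sub-language
            new = [0 if log_prob(u, models[0]) >= log_prob(u, models[1]) else 1 for u in utterances]
            if new == assign:
                break                                             # converged
            assign = new
        return assign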


The key stage of the algorithm creates two perturbed versions of the starting language model, then reassigns each utterance to the model for which it attains the higher probability (lower calculated entropy). The utterance assignments provide the data to retrain the two language models. Further reassignments converge to a local maximum. This is the well-known Expectation-Maximization (EM) algorithm [3], using a hard (0- or 1-valued) utterance membership distribution to split a parent node of records with corresponding language model into two sub-languages. Further iteration continues to split existing nodes according to the best improvement in overall probability of the observed data. Thus transaction records are partitioned into as many subpopulations as desired. The terminal set of leaves represent relatively homogeneous languages, easier to understand and describe than the full language model.

The recursive aspects of divisive clustering simplify the task of adapting or combining different algorithms. Further analysis can be performed on the clusters emerging from Similar-Utterance. In particular, Word-Peeling can be used upon individual leaves to discover internal changes in language. Thus it becomes simply a special case (k=1) of a more elaborate algorithm that uses both techniques to drive different stages of the clustering process (Figure 4).

The Similar-Utterance algorithm divides the data according to the structure of the language model. Since the meta-data includes time information, there is still the useful option of testing the significance of the split in time values. This provides a framework for distinguishing the type of any detected changes in word probability. A significant difference between the time distribution of a leaf and the root node might have various explanations. Changes following from a shift in the relative frequencies of known transaction types will result in data clusters that are skewed in time, but which exhibit few internal changes in language. On the other hand, distributional changes caused by some novel event in one time domain will cluster into leaves that exhibit many internal changes in word use. The hope is that many leaves will test as homogeneous across time, letting users investigate a relatively small portion of the data.

Analysis stratified across cluster membership leads to a very different view of the data from that provided by global tests. Table II presents some examples from the HMIHY data between January and March. Note that different null hypotheses are being tested:

Global H0: The probability of the reference word is the same in both time domains.
Combined H0: The probabilities are the same within each cluster, conditioned upon observed cluster membership.

Such words as "balance" show significant differences in global probability, yet demonstrate little activity within the context of each cluster; their usage has not changed, but these words have helped build utterance clusters that depart significantly from the global time distribution.


Such words as "letter" exhibit significant changes in relative frequency even after the language model has been stratified; callers are using these words differently.

D. Cluster Analysis

Note that neither the Similar-Utterance nor the Word-Peeling Algorithm gives any guidelines for the analysis of the resulting clusters. Such details were purposely left vague since these are part of a different problem—how to present the transactions to a human being in a way that makes changes readily understood. Determining whether a change is coincidental or deserving of action requires a level of understanding difficult for a machine; automation that gave a succinct summary of the change and associated transactions would make this task easier for a human monitor. Similar issues have been addressed in the context of other problems in information retrieval [9]. This section presents several approaches to providing useful information about a cluster.

Some clusters consist of multiple instances of a handful of sentences, while others consist mostly of long, rambling utterances that share a few words but no theme. Entropy gives a simple measure of cluster heterogeneity: lower entropy indicates a cluster that is easier to understand and describe. Similar notions include lexicon size and average utterance length.

The Word-Peeling Algorithm provides a label for each cluster, namely the history of the splits that produced it. This can be valuable, depending on which particular presences and absences characterize the utterances. Knowing that all utterances contain "balance", "much", and "how" makes an adequate label; knowing that all contain "to" and "T" but neither "bill" nor "yes" is rather less informative. A similar diagnostic is the comparison of the cluster marginal distributions with those of the remainder of the full data set. This typically generates an unwieldy list of significant differences in word and meta-data probabilities. When filtered to present the most important changes, such a list provides a helpful description of which covariate values are characteristic of the cluster members. Thus a user can see the words that strongly influenced the creation of the cluster, as well as any meta-data values that are significantly over-represented. These can provide useful insight into the reasons underlying any changes.

Another method of summarizing clusters is to provide a few representative members. Some clustering algorithms require a "center" for each cluster; as the "most typical" member, it makes a reasonable automatic label for the group, particularly if the cluster consists of meaningful units such as a particular class of utterances. Thus a clustering algorithm that provides cluster centers might be used on a set of transactions to provide several characteristic utterances.

Many further statistical tests might be performed; for example, one could use a Breslow-Day test for homogeneity of odds ratios across clusters to see if individual word effects on language remain constant in time. However, such analyses might confront diminishing returns.
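As a small illustration of the cluster summaries described above, the sketch below computes a cluster's unigram entropy and picks a representative member by total pairwise distance. The helper names and the choice of distance function are assumptions made for the example, not the paper's exact procedure.

    import math
    from collections import Counter

    def cluster_entropy(utterances):
        """Unigram entropy (bits per word); lower values indicate a more homogeneous cluster."""
        counts = Counter(w for u in utterances for w in u)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def medoid(transcriptions, distance):
        """Member with the smallest total distance to the others (e.g. a string edit distance)."""
        return min(transcriptions, key=lambda s: sum(distance(s, t) for t in transcriptions))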


While change detection admits some classical measures such as Type I and Type II error, evaluation becomes difficult in situations in which the truth is not known. Even more difficult is deciding how well any detected changes have been explained. An automated system can present a list of candidate clusters to a human being, but the final decision as to which are actually worth attention requires human judgment. Thus the most objective measure of utility might be how much time an end-user needs to recognize and react to changes. By prioritizing detected changes and presenting relevant information, a machine can save much human time and effort. Potential benefits of additional information about each cluster must be balanced against the demands placed upon the user.

E. Results

This section investigates a real data stream of records from AT&T's HMIHY automated customer service application, with 1000 transactions sampled daily. Transaction logs include coupled information on the ASR output and SLU labeling, as well as additional information on the system's internal state and various meta-data fields such as time and customer segment. For the following analyses, only the first-turn utterance from each transaction was used in order to eliminate concerns about combining different sub-dialogs with specialized recognizers within the deployed system.

The statistical tests used are based upon indicator variables X_{ij} for word w_i appearing in utterance D_j:

X_{ij} = \begin{cases} 1, & w_i \in D_j \\ 0, & w_i \notin D_j \end{cases}

C_{i,kl} = |\{ X_{ij} : X_{ij} = k,\; D_j \in T_l \}| \quad \text{for } k \in \{0,1\},\; l \in \{1,2\}
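A minimal sketch of the per-word test built from these counts appears below, using an off-the-shelf chi-squared test on the 2x2 table described in the next paragraph. The helper name, the tokenized input format, and the fixed threshold are assumptions, and exact tests may be substituted for the chi-squared approximation.

    from collections import Counter
    from scipy.stats import chi2_contingency

    def changed_words(utterances1, utterances2, threshold=0.001):
        """Per-word chi-squared test on the 2x2 table of (contains word) x (time domain)."""
        n1, n2 = len(utterances1), len(utterances2)
        c1 = Counter(w for u in utterances1 for w in set(u))   # utterances in domain 1 containing w
        c2 = Counter(w for u in utterances2 for w in set(u))
        flagged = {}
        for w in set(c1) | set(c2):
            table = [[c1[w], n1 - c1[w]],                      # the counts C_{i,kl}
                     [c2[w], n2 - c2[w]]]
            p_value = chi2_contingency(table)[1]
            if p_value < threshold:
                flagged[w] = p_value
        return flagged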

Each word w_i thus generates a 2 x 2 contingency table. Exact or chi-squared approximate tests are made against the null hypothesis of equal ratios C_{i,01}/C_{i,11} = C_{i,02}/C_{i,12}. A significance threshold of p=0.001 provides reasonable protection against false detections without crippling sensitivity. Here the importance score for each observed change in probability is a weighted mixture of its statistical significance, relative magnitude, and absolute magnitude. Cluster division ends when no significant changes are detected. For purposes of providing representative cluster members, the partitioning around medoids (PAM) algorithm [8] with k=5 clusters is used on un-weighted string edit distance between ASR transcriptions. Since running PAM can be slow, a random subsample of 200 utterances is taken from larger clusters.

For the HMIHY data records for all of January and March, the first-turn dialog utterances provide 48143 automatically generated word strings (19552 from January, 28591 from March) for analysis. Analysis using Word-Peeling with threshold p = 0.001 and minimum cluster splitting size of 1000 utterances divides the data into 106 clusters. The results are presented in Figure 6. Many changes, while significant, are not particularly interesting: changes in the distribution of month references are expected, while utterances that consist of a single word such as "yes" or "hello" shed little light on events.


Other changes prove difficult to categorize, or minor enough to seem not worth pursuing, or both. While several interesting events first appear rather late in the analysis, over half of the changes (including all those that affect more than 1% of the data) are presented within the initial 10–20 clusters:

1st cluster: introduction of unlimited plan
3rd cluster: silence becomes more common response
5th cluster: fewer cancel requests
6th cluster: more account balance requests
7th cluster: fewer customer service requests
9th cluster: more bill payments
10th cluster: more questions about unrecognized calls
17th cluster: more credit requests
28th cluster: mass mailing about rate increases
39th cluster: problem completing discount calling plan orders in March
40th cluster: fewer long distance questions
44th cluster: more transactions concerning Easy Reach plan
54th cluster: more questions about customer ID numbers

Figure 5 presents sample output for the 1st and 5th clusters. Such information about each cluster speeds the knowledge discovery.

A combined algorithm that initially groups the data into 50 clusters by Similar-Utterance before extracting local word probability shifts reveals some interesting differences. Figure 6 shows simple Word-Peeling generating more interesting discoveries in the initial few clusters. The combined algorithm extracts some additional pieces of knowledge, but quite far down the cluster priority queue. The score function does not distinguish a relatively small yet meaningful change from a coincidental one. The results already display some ambiguities and redundancies—certain utterance clusters incorporate multiple events, while Similar-Utterance leads to situations in which multiple clusters exhibit changes tied to the same event. A user would wish to distinguish meaningful from coincidental changes, interesting from meaningful ones, and new changes from those that merely reinforce knowledge extracted from earlier clusters. These classifications remain subjective. Since the truth of what events are driving changes in the data is unknown, cluster evaluation requires attention from a human. The actual utility is probably the knowledge extracted by a user from the first ten or twenty clusters.


Each algorithm generated over 100 clusters on the two months; this represents less reduction of the initial list of unigram changes than anticipated. Note however that the processed utterance cluster output is far easier to interpret. It took the author about two hours to check the full cluster output, a three- or four-fold reduction of the time spent investigating each member in a list of 130 changed unigrams. Although Similar-Utterance shows no advantage over Word-Peeling in directing human attention quickly to meaningful changes, the resulting clusters are usually more easily understood. Sometimes data splits in Word-Peeling do not simplify the domain semantics, but by focusing on specific words that have changed it can isolate novel portions of the data quickly. Dividing the data based on language entropy seems better able to recognize small changes and determine their causes. Thus an analysis architecture that allows both approaches seems best.

III. LEXICAL ITEM CLUSTERING

A. Measures of Association

Language can be viewed as a collection of isolated atomic units (typically words when analyzing English text). Atomic units combine in language to convey ideas. Many meaningful combinations appear often—while meaningless ones appear seldom or never. The most natural units to consider might comprise several words. Recognizing such common phrases as "thank you" can have many benefits—for example, translation requires the substitution of phrases and grammatical constructs rather than simple word substitutions. These techniques require some criterion for determining interesting combinations of atoms.

One such measure is correlation. Consider indicator variables X_1 and X_2 on some linguistic data (for example, words w_i in a document D):

X_i = \begin{cases} 1, & w_i \in D \\ 0, & w_i \notin D \end{cases}

Let p_i = P(X_i = 1) and q_{12} = P(X_1 = X_2 = 1). The correlation of X_1 and X_2 is then defined as:

\rho(X_1, X_2) = \frac{E(X_1 X_2) - E(X_1)E(X_2)}{\sqrt{Var(X_1)\,Var(X_2)}} = \frac{q_{12} - p_1 p_2}{\sqrt{p_1 p_2 (1 - p_1)(1 - p_2)}}

This gives a value ranging from -1 (one word is present if and only if the other is not) to 1 (the words always occur together), with 0 indicating no relationship.

Mutual information can also be used to measure the association between words. A head-tail language model links pairs of words according to the strength of their "lexical attraction" [12]:

MI(X_1, X_2) = \log_2 \left( \frac{q_{12}}{p_1 p_2} \right)

This value is simply the X_1 = X_2 = 1 term from the mutual information of X_1 and X_2. It has some convenient invariance properties—assuming q_{12} ≈ α p_1 p_2, then any increase in p_1 or p_2 that leaves α constant will not affect the association score. Note however the less desirable property that for a fixed proportion, MI allocates higher scores to the smaller p_2. Thus rare words with less reliable observed counts are considered more important. Other quantities have proven more robust in practice. The paper [11] uses the statistic ν, defined as:

\nu(X_1, X_2) = \frac{q_{12}}{p_1 + p_2}

Assuming that p_1 ≈ p_2, this gives a measure of association between 0 and 0.5, but independence implies a value proportional to p_1. Thus less probable words are given less weight.

A general problem with calculating associations is that estimates can be highly variable when the number of observed counts is small. The authors of [4] address this issue for the intensity parameter λ, the ratio of observed counts to the number expected given independence. Co-occurrence counts are modeled as Poisson variables. Given maximum likelihood estimation of event probabilities,

\lambda(X_1, X_2) = \frac{N_{12}}{E_{12}} = \frac{N q_{12}}{N p_1 p_2} = \frac{q_{12}}{p_1 p_2}

Note that MI(X_1, X_2) = \log_2 \lambda(X_1, X_2). The paper uses a mixture of gamma distributions as a prior to produce an estimator for the empirical Bayes geometric mean of λ. For this work, the posterior mean of λ suffices to produce a shrinkage estimator for association (e.g. \tilde{\lambda} = \frac{N_{12} + \alpha}{E_{12} + \beta} using a simple gamma prior).
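The sketch below collects these measures in one helper for a single word pair. The function name, argument layout, and the gamma-prior parameters alpha and beta are illustrative assumptions; only the formulas themselves come from the definitions above.

    import math

    def association_scores(q12, p1, p2, n12, e12, alpha=0.5, beta=0.5):
        """Association measures for one word pair; probabilities assumed strictly in (0, 1) with q12 > 0.

        q12 = P(X1 = X2 = 1); p1, p2 = marginal presence probabilities;
        n12, e12 = observed and independence-expected co-occurrence counts;
        alpha, beta = hypothetical gamma-prior parameters for the shrinkage estimate.
        """
        corr = (q12 - p1 * p2) / math.sqrt(p1 * p2 * (1 - p1) * (1 - p2))
        lam = q12 / (p1 * p2)                                   # intensity ratio; MI = log2(lambda)
        return {
            "correlation": corr,
            "MI": math.log2(lam),
            "nu": q12 / (p1 + p2),
            "lambda": lam,
            "shrunk_lambda": (n12 + alpha) / (e12 + beta),      # posterior-mean shrinkage estimate
        }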

All such measures of association seek to recognize word combinations that appear more often than expected by chance alone. Such combinations can be viewed as more complex units within the language and used to build a better language model. Note that associations among the linguistic units can vary from domain to domain. Certain phrases may be common in one context and rare in another; as individual words show changed relative frequency between domains, so do word combinations. Identifying shifts in word association can aid in recognizing which observed changes are somehow related to one another.

B. Clustering Via Changed Measures of Association

For text data from two domains, the word counts across the two domains can demonstrate significant differences in language. Should certain words tend to co-occur in the data dealing with some novel event, association scores among them would increase. Association between words might not be informative since strong relationships can hold constant throughout the data. Interest centers on those words which display major changes in their associations within the language. While it is possible to search for cliques of words with stronger mutual associations, this seems unduly restrictive.


Since near-synonyms tend to be negatively correlated, words fulfilling the same function will never belong to the same clique. Another problem is the high variability of association estimates, particularly for less common words or noisy data. One alternative is to view words by their relationships with many other words, seeking a similar pattern of changed associations. This would both allow words that seldom co-occur to be grouped together, and ameliorate the problem of noise by combining many pieces of evidence to make a final decision. Finding groups of similar words now simply requires a clustering algorithm with distance based on the changes in some between-word association score. A cluster that includes multiple words with significant shifts in their relative frequencies suggests that some underlying semantic class causes these changes.

A list such as Table I is a natural place to search for associations. Even a long and unwieldy set of observed changes will be an order of magnitude or two smaller than the total lexicon. The Zipf distribution of the lexicon comes to our aid. The set of words that exhibited changes is small compared to the total lexicon, as is the set of common words. When combined, the resulting list contains most of the unigram distribution mass and all of the words that appeared interesting. It is practical to automatically check the pair-wise conditional associations of the elements in this list. This provides an algorithm that seeks to cluster related changes within the language. Given utterances for two periods:

1) Create a list of common words.
2) Create a list of individual words with significantly different probabilities in the two time domains.
3) Combine the lists into one with N words.
4) For each time domain, compute the N-by-N matrix of association scores among all the words of the combined list.
5) Compute the differences between the two matrices.
6) Use this N-by-N difference matrix to cluster words based on their pattern of changed associations.
7) Rank the clusters in terms of interest.

This produces a priority queue of clusters containing changed lexical elements (a sketch of these steps appears at the end of this subsection). Further steps might extract transaction records that include multiple members of the same cluster and characterize the resulting set of utterances. However, this direct approach of using altered word associations to drive a clustering algorithm is often not sufficient to understand the nature of the changes. Thus far, experimental results seem unduly sensitive to technical details of the clustering used; no particular method demonstrates superior recall. A word's probability can alter with little change in its associations, and this clustering will not mark such a word as noteworthy. Precision also remains troublesome: even a modest number of clusters can include small yet apparently meaningless word classes.


This technique nonetheless shows promise; words that demonstrate changed usage patterns often separate from the others early in the clustering process. Sometimes the words in a cluster virtually sort themselves into a phrase describing some novel event. Even a singleton cluster warns that its member exhibits unusual changes in word association scores—that is, it is being used differently in the language—and merits further inspection.

Given the combination of advantages and difficulties, word clustering by changed associations should prove most useful as part of a larger analysis. The resulting word classes can be incorporated into the algorithms of Section II-C in various ways. One approach uses the full data set to generate word clusters, which are then filtered by size to create a list of small classes of interesting words. While words are tested and scored as before, interesting class members are investigated before any other detected changes. Priority among classes is determined by summing the member scores, and the data is split using all members of the highest-scoring class.
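The following sketch strings the seven steps above together for a single association measure (indicator correlation) and an agglomerative complete-linkage clustering. The experiments in the next subsection instead use DIANA, a divisive hierarchical method, and other association scores, so the clustering routine, helper names, and parameters here are stand-ins for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    def association_matrix(utterances, words):
        """Correlation of word-presence indicators among `words`, estimated from one time domain."""
        indicators = np.array([[1.0 if w in set(u) else 0.0 for w in words] for u in utterances])
        return np.corrcoef(indicators, rowvar=False)

    def changed_association_clusters(utterances1, utterances2, words, n_clusters=20):
        """Cluster words by the profile of changes in their pairwise associations (steps 4-7 above)."""
        diff = association_matrix(utterances2, words) - association_matrix(utterances1, words)
        diff = np.nan_to_num(diff)                       # constant indicators yield NaN correlations
        dist = pdist(diff, metric="cityblock")           # L1 distance between rows of the difference matrix
        tree = linkage(dist, method="complete")          # agglomerative complete linkage (stand-in for DIANA)
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        return {w: int(c) for w, c in zip(words, labels)}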

C. Results

This section illustrates these ideas upon the HMIHY records for January and March. The members of Table I were used, along with additional common words, to generate a lexicon of size 500. This limits the problem of sparse data. Many combinations of word association, distance, and clustering algorithm were tested; only a few are presented here.

Hierarchical clustering algorithm DIANA [8], run on correlation differences with L1 distance, extracts some useful groups once 80 clusters have emerged:

Introduced unlimited plan: (unlimited) (nineteen) (ninety)
Different month: (January) (February) (March) (April) (July June + 8 others) (August September + 13 others) (October) (December November)
January mass mailing: (letter) (going) (received) (plan) (basic minimum monthly rate rates usage) (change changed different + 44 others)
March failed orders: (complete discount unable)

On this data set other measures of association appear to extract more structure at an earlier stage of clustering. From DIANA on λ differences with L2 distance, at the stage with 20 clusters:

Introduced unlimited plan: (nineteen unlimited) (ninety)
Different month: (April December January November first second third thousand twenty) (February) (March)
January mass mailing: (letter) (basic minimum usage)
March failed orders: (complete discount order unable)
Questions on personal network access fee: (fee monthly personal) (network)
Easy Reach service: (Easy) (Reach) (eight hundred) (reach telling)


Similar experiments suggest that complete linkage distance is best for isolating a compact group of related changes, while single linkage seldom creates meaningful clusters. No algorithm or statistic emerges as a clear winner; most can extract some useful structure from this example. While direct MI gives poor results, the shrunken λ statistic is more robust. The classes extracted depend heavily on the distance and measure of association used. Standard clustering algorithms usually identify some structure automatically, but often fail to group important conceptual classes. Some of the classes above represent a tiny portion of the data set; it is interesting how quickly these can emerge as important using this analysis.

Such word classes are filtered by size (no more than 10 members), and then given over-riding priority in Word-Peeling as suggested at the end of Section III-B. The results using λ values are contrasted with the basic algorithm in Figure 6. The divisive hierarchies differ; the good word classes tend to consolidate utterances dealing with the same event, so fewer splits result. The result shows only a modest improvement in the rate of knowledge discovery (slope of "New Interesting Detections"), but extracts the same knowledge from far fewer clusters overall.

IV. CONCLUSIONS

The algorithms described in this work show promise; experimentally they have extracted useful information from automated transcriptions of real-world data. In particular they have noted some previously unknown changes to the AT&T HMIHY data records, such as the problem with order completions. This methodology provides a practical and general approach to a difficult problem.

Divisive analysis provides a robust and flexible approach to understanding changes in language. It suppresses meaningless changes while isolating and helping to characterize important ones. Of particular interest is the potential for output clusters of vastly different scales. In HMIHY, common call types generated clusters of very similar utterances, sometimes comprising a large fraction (≈10%) of the total data stream, while some novel events made tight groups covering less than 0.1% of the transactions. Divisive analysis can capture both large and small changes.

Separating the language model addresses the problem of secondary changes in the data. For one comparison of January and March in the experiments, 130 word frequency changes led to 106 groups of utterances using divisive clustering. Better integration of word associations within the analysis can further simplify the final output. Preliminary efforts using clustering via changed word associations reduced the number of clusters to 65 with no harm to the recall of events. Refined clustering and scoring techniques might further improve the output presented for analysis.

There is great flexibility in how these analyses can be carried out. The algorithms sketched in earlier sections are quite high-level and modular. The language model features, the statistical tests, and the notion of importance used are all easily modified.


Many details of the implementation might be refined, but the existing tools have already demonstrated their value. Evaluation presents some difficulties since the truth about what has changed is typically not known. Multiple experiments upon the same data can be compared for a relative measure of performance, but the trade-off between such elements as true and false detections is not obvious. The best measure of utility for the end-user might be how many useful items of information were extracted per amount of human time spent analyzing the output. Thus any change to cluster ranking, test power, or output format should be judged on whether it makes the human's task easier, with penalties assessed for missing important intelligence. User experience studies would be needed for objective measures of these notions.

REFERENCES

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Neural Information Processing Systems 14, 2001.
[2] P. Chou. Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):340–354, 1991.
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.
[4] W. DuMouchel and D. Pregibon. Empirical Bayes screening for multi-item associations. In Proc. ACM SIGKDD, pages 67–76, 2001.
[5] J.G. Fiscus, W.M. Fisher, A.F. Martin, M.A. Przybocki, and D.S. Pallett. 2000 NIST evaluation of conversational speech recognition over the telephone: English and Mandarin performance results. In Proc. 2000 Speech Transcription Workshop, http://www.nist.gov/speech/publications/tw00/html/cts10/cts10.htm, 2000.
[6] A.L. Gorin, G. Riccardi, and J.H. Wright. How may I help you? Speech Communication, 23:113–127, 1997.
[7] T. Hofmann. Probabilistic latent semantic indexing. In Proc. SIGIR, 1999.
[8] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley Inc., 1990.
[9] D.J. Lawrie. Language Models for Hierarchical Summarization. PhD thesis, University of Massachusetts, Amherst, 2003.
[10] K. Nigam, A.K. McCallum, S. Thrun, and T.M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134, 2000.
[11] G. Riccardi and S. Bangalore. Automatic acquisition of phrase grammars for stochastic language modeling. In ACL Workshop on Very Large Corpora Proc., pages 188–196, 1998.
[12] D. Yuret. Discovery of Linguistic Relations Using Lexical Attractions. PhD thesis, MIT, 1998.


TABLE I
Word probability change list in HMIHY, January versus March. The change in log probabilities and unadjusted statistical significance are shown, along with selected manually-generated word classes (class 2 marks the month names).

CLASS   WORD        Δ log(P)   P-VALUE
        unlimited   +5.7       8.1e-63
        Reach       +5.5       1.1e-22
        Easy        +5.3       2.1e-18
        Ds          -5.1       1.4e-17
2       June        +4.7       1.0e-24
2       March       +2.8       6.6e-66
2       February    +2.2       2.6e-36
2       December    -2.2       7.8e-34
        letter      -1.5       4.0e-24
        balance     +0.8       1.4e-68
        talk        -0.6       1.5e-32
        distance    -0.5       1.9e-31
        long        -0.5       8.1e-31
        about       -0.4       6.2e-35
        T           -0.4       1.7e-30
        A           -0.4       9.1e-30
...     ...         ...        ...
        wanna       +0.1       3.1e-04
        calling     -0.1       8.3e-04

TABLE II
Relative importance of change significance, global and stratified by cluster. Words used in the same contexts may exhibit no changes within clusters despite significant shifts in global relative frequency.

WORD        ΔP RANK   COMB. SIG.   COMB. RANK
balance     1         2.1e-01      (55)
March       2         4.7e-30      (2)
unlimited   3         4.9e-05      (11)
February    4         0.0e+00      (1)
about       5         4.6e-03      (19)
December    6         7.1e-18      (3)
talk        7         6.7e-02      (33)
distance    8         2.7e-01      (63)
long        9         1.3e-01      (44)
...         ...       ...          ...
letter      16        1.2e-07      (6)


Fig. 1. Departure of the daily observed ASR unigram word distribution from that predicted by the language model. Sundays (solid) exhibit a systematic departure from the yearly average. While conditioning on observed labels leads to significant improvement (p=3.2e-03), Sundays are still modeled poorly. Combined conditioning on label and day of week represents a major improvement (p=8.3e-08) over conditioning on labels alone.


Fig. 2. An increase in the probability of one category within the distribution leads to significant decreases in the relative frequency of others. A conditional test excluding the fourth category reveals a much simplified set of changes.

Fig. 3. Diagram sketching the Word-Peeling Algorithm. The resulting clusters are determined by the presence or absence of various words that demonstrated potentially interesting changes in relative frequency at some stage of the divisive process.


Fig. 4. Sketch of one possible scheme combining multiple techniques for extracting useful information from detected changes in a stream of conversations.


Fig. 5. Sample analysis output for two clusters from January and March: the cluster position in the hierarchy and non-singleton sub-cluster medoids. Attaching changes to a small portion of the data makes them easy to interpret. The first cluster (41 utterances) results from a new class of utterances dealing with an “unlimited” plan. The second cluster (1904 utterances) exhibits no internal changes in language, but departs significantly from the global time distribution; the sub-cluster centers reveal that the difference is associated with a decrease in service cancellation requests.


Fig. 6. Knowledge extracted using global Word-Peeling (106 clusters), Word-Peeling upon 50 groups created via Similar-Utterance (130 clusters), and global Word-Peeling including changed λ-association classes (65 clusters).