Cluster based Chinese Abbreviation Modeling
Yangyang Shi, Yi-Cheng Pan, Mei-Yuh Hwang
IPE, Microsoft
{yangshi,ycpan,mehwang}@microsoft.com
Abstract
Abbreviations are widely observed in Chinese spoken language. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. Abbreviation generation is treated as a character-based tagging problem. Due to limited training data, Chinese abbreviation generation suffers from data sparseness. Two types of strategies are proposed to reduce the impact of data sparseness. First, in addition to using a traditional sequence labelling method, Conditional Random Fields (CRF), we propose to apply a Recurrent Neural Network with Maximum Entropy extension (RNNME) [9], which shows performance similar to CRF in our experiments. Second, we propose to use training data clustering and latent topic modeling in abbreviation generation. Clustering the training data or using topic modeling not only addresses data sparseness, but also exploits the fact that full-names from the same cluster or the same latent topic share similar abbreviation patterns. Our experimental results show that with manual clustering, the accuracy of abbreviation generation improves by 8% relative. With latent topics obtained from Latent Dirichlet Allocation (LDA), the accuracy improves by 10% relative.
Index Terms: Chinese name abbreviation, Conditional Random Field, Recurrent Neural Network
1. Introduction
In modern Chinese spoken language, the usage of abbreviations is ubiquitous. Modeling Chinese abbreviations can improve the generalization capability of many natural language processing (NLP) systems such as dialogue systems, voice search systems and search engines. The Chinese language does not have a clear notion of words: text is written without whitespace between characters, and readers comprehend an article by mentally segmenting it into words based on the semantics. Chinese names may consist of long sequences of characters. In this work and all mentioned prior work, all Chinese documents and sentences are pre-segmented into pseudo word units. Chinese word segmentation algorithms range from a simple longest-first segmentation given a pre-defined lexicon to complicated CRF-based tagging (tagging each character as a word boundary or not). Word segmentation is outside the scope of this paper; readers may refer to [1, 2, 3, 4] for further studies. Because of this pre-segmentation, we use the term “word” to refer to a sequence of characters delimited by whitespace, as in English. In Fig. 1, whitespace appears in all three examples due to this pre-segmentation. The formation of Chinese abbreviations differs from English abbreviations. In English, full-names are usually abbreviated using acronyms or truncations. In Chinese, however, a character from any position in the full-name can be selected as
part of the abbreviation. Some words can even be completely dropped in the abbreviation. As shown in the second example of Fig. 1, the middle two words are ignored in the abbreviation. In general, Chinese abbreviations are generated in the following three ways [5, 6]: (a) eliminating some characters from a long name (first example of Fig. 1), (b) eliminating some characters and re-ordering the remaining characters (second example of Fig. 1), and (c) rephrasing a long name with another short name. The third way is illustrated in the last example of Fig. 1, where the four biggest audit companies (Deloitte, PricewaterhouseCoopers, Ernst & Young and KPMG) are rephrased as “four big”. In this paper, we focus only on the first way of generating abbreviated names. In [5], Chinese abbreviation generation is treated as a character-based tagging problem that is tackled by a Conditional Random Field (CRF) [7]. Explicit features such as the current character, the current word, the position of the current character in the current word and the length of the full-name are used in the CRF models. In addition to these explicit features, in this paper we propose to use topic information in abbreviation modeling, based on the insight that Chinese abbreviation modeling seriously suffers from data sparseness. By taking advantage of topic information, we want the abbreviation model to fall back on topic information when it encounters unseen characters or words in the test data. We discuss two types of topic information: one is determined by manual clustering, and the other is determined automatically by Latent Dirichlet Allocation (LDA) [8]. Furthermore, we also propose to apply a Recurrent Neural Network with Maximum Entropy extension (RNNME) [9] to Chinese abbreviation generation. The rest of the paper is organized as follows. Section 2 discusses related work on Chinese abbreviation generation, topic modeling and RNNME. Section 3 describes feature selection and the use of topic information and RNNME in Chinese abbreviation modeling. Section 4 gives the experimental results. The final section draws conclusions.
Full Name | Abbreviation | Translation
中国 工商 银行 | 工行 / 工商 银行 | China Industrial and Commercial Bank
北京 大学 附属 第六 医院 | 北医六院 | Peking University Associate No. 6 Hospital
德勤 普华永道 安永 毕马威 | 四大 | Deloitte PricewaterhouseCoopers Ernst & Young KPMG
Figure 1: Examples of Chinese Abbreviation.
2. Related Work
In [6], the authors proposed Hidden Markov Model (HMM) based methods to generate the most probable full-name of an abbreviation as well as the most probable abbreviation of a full-name. In order to get the full-name of an abbreviation, all the probable full-names of that abbreviation are required. Their abbreviation model uses abbreviation patterns and length features. However, as pointed out by [5], HMM-based abbreviation modeling does not address the word-to-null mapping that happens in place-name abbreviation. Fu et al. [10] also use generative modeling, based on n-gram models, to deal with Chinese abbreviations. In this paper, we also use n-gram features in the abbreviation models; however, due to data sparseness, such n-gram features do not improve the proposed models.
In [11], the authors applied an SVM method using context information to extract proper pairs of abbreviation and full-name. However, they assume that the full-name and its abbreviations appear in the same document, which requires the whole document to be available for modeling. In some practical applications (e.g., dialogue systems and voice search systems), document-level context information is not available. To address this problem, Yang et al. [5] proposed to treat Chinese abbreviation generation as a character-based tagging problem, addressed by discriminative modeling using CRF. Each character in a word is labelled as “reserve” or “delete”; the “reserve” characters form the abbreviation. In [5], the current character, the current word, the position of the current character in the current word, and the lengths of the abbreviation and the full-name are used in abbreviation generation. In this paper, we propose to use clustering and topic information on top of the method discussed in [5]. In addition to CRF, we propose to use RNNME [9] in our Chinese abbreviation modeling, which has shown improved performance over CRF in many NLP tasks.
Conditional Random Fields (CRFs) [7] are discriminative undirected graphical models, which are widely used in NLP tasks such as word segmentation [12] and part-of-speech tagging [13]. The advantage of CRF models over HMM models is that CRFs relax the independence assumption on the observations required by HMMs. In addition, CRFs avoid the label bias problem, from which HMMs and other generative directed graphical models suffer. The label bias problem is caused by per-state probability normalization: in generative models, when some state has a low entropy, the distribution over the following states ignores the observations.
Recurrent Neural Networks [14] demonstrate state-of-the-art performance in speech recognition [9, 15, 16], semantic analysis [17] and natural language understanding [18, 19]. In this paper, we apply an RNN with the maximum entropy extension to Chinese abbreviation modeling. Compared with CRF, RNNME not only addresses the data sparseness problem [20], but also models long-distance dependencies.
Latent Dirichlet Allocation (LDA) [8] is a hierarchical probabilistic generative model that represents each document by a latent topic-mixture vector drawn from a Dirichlet distribution. LDA has been widely used in various NLP applications [21, 22, 23]. Based on the insight that latent topics have the potential to reduce data sparseness, we also apply LDA to Chinese abbreviation modeling. Instead of working on documents, LDA is applied to Chinese sentences. Based on the probability distribution of latent topics given a word, we characterize each word with a latent topic feature that is used in CRF and RNNME.
3. Chinese Abbreviation Modeling
In this section, we focus on applying CRF and RNNME to Chinese abbreviation modeling, feature selection for CRF and RNNME, and the clustering of the training data.
3.1. CRF in Chinese Abbreviation Modeling
As proposed in [5], Chinese abbreviation generation is addressed by CRF as a character-based tagging problem. Each character of the full-name is associated with a label that indicates whether the character is reserved or ignored in the abbreviation. In the CRF model, each character of the full-name is characterized by an attribute vector, where each element corresponds to one feature that will be discussed in the following subsection. The conditional probability of the abbreviation tags $L = (l_1, l_2, \ldots, l_n)$ given the full-name $C = (c_1, c_2, \ldots, c_n)$ is:
$$P(L \mid C) \propto \exp\Big( \sum_t \Big[ \sum_j \lambda_j f_j(l_{t-1}, l_t, C, t) + \sum_k \mu_k s_k(l_t, C, t) \Big] \Big) \quad (1)$$
where $l_t$ is either “reserve” or “delete”, $f_j(l_{t-1}, l_t, C, t)$ is the $j$th transition feature function over the labels at positions $t-1$ and $t$ and the character sequence $C$, and $s_k(l_t, C, t)$ is the $k$th state feature function over the label at position $t$ and the character sequence $C$.
3.2. RNNME in Chinese Abbreviation Modeling
RNNME was originally applied to language modeling by Mikolov [9]. It was later modified to integrate additional features in language modeling [24] and natural language understanding [18]. In this paper, we use the modified version of RNNME for Chinese abbreviation modeling. The basic structure of the modified RNNME is illustrated in Fig. 2. The RNNME has three layers: the input, hidden and output layers. The input layer consists of the current character, the current features and the hidden-layer activations from the previous character. Both the current character and the current features are encoded as one-hot vectors. The output layer gives the probabilities of the different labels for the current character; in this case, there is only one output unit, where output value 1 represents “reserve” and 0 represents “delete”. The sigmoid function and the softmax function are used as the activation functions in the hidden layer and the output layer, respectively. As shown in Fig. 2, a dashed line directly connects the input layer to the output layer, which can be viewed as a maximum entropy model using continuous features.
Figure 2: RNNME structure for Chinese abbreviation modeling.
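For illustration, the following is a minimal numpy sketch of the forward pass under our reading of Fig. 2: a one-hot character and a one-hot feature vector feed a recurrent sigmoid hidden layer, a single sigmoid output unit scores “reserve” vs. “delete”, and a direct input-to-output connection plays the role of the maximum entropy extension. Dimensions, initialization and names are illustrative, not the configuration used in the experiments.

```python
# Illustrative numpy forward pass for an RNNME-style tagger (a sketch, not the exact model).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, F, H = 5000, 50, 200              # character vocab, feature vocab, hidden units (200 as in Sec. 4.1)
rng = np.random.default_rng(0)
W_ch = rng.normal(0, 0.1, (H, V))    # character -> hidden
W_ft = rng.normal(0, 0.1, (H, F))    # feature   -> hidden
W_hh = rng.normal(0, 0.1, (H, H))    # previous hidden -> hidden (recurrence)
w_out = rng.normal(0, 0.1, H)        # hidden -> output
w_me = rng.normal(0, 0.1, V + F)     # direct input -> output ("maximum entropy" connection)

def tag_sequence(char_ids, feat_ids):
    """Return P(reserve) for each character; a value > 0.5 means the character is kept."""
    h = np.zeros(H)
    probs = []
    for c, f in zip(char_ids, feat_ids):
        x_me = np.zeros(V + F)
        x_me[c] = 1.0
        x_me[V + f] = 1.0
        h = sigmoid(W_ch[:, c] + W_ft[:, f] + W_hh @ h)   # recurrent hidden layer
        y = sigmoid(w_out @ h + w_me @ x_me)              # single output unit with direct connections
        probs.append(y)
    return probs

print(tag_sequence([12, 7, 301], [3, 3, 8]))
```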
3.3. Feature Selection
Feature selection plays an important role in applying CRF and RNNME to Chinese abbreviation modeling. In this paper, the following types of features are extracted.
CW: Character and word features, $c_t$ and $w_t$, where $w_t$ is the word to which character $c_t$ belongs. In Chinese abbreviation, many patterns are correlated with the words and characters themselves. As the first example in Fig. 1 shows, the first character of “bank” is usually dropped in the abbreviation.
PS: Position features, including (a) the positions of the current character in the entire input character sequence and in the current word, and (b) the position of the current word in the input word sequence.
SL: Sequence length features. For each character $c_t$, we compute the length of $w_t$ (in characters) and the length of the full-name (in words); the latter is a constant for all characters of a given name. For long full-names such as the second example in Fig. 1, the first word “China” is completely deleted in the abbreviation, whereas in some short names, e.g., “China Bank”, the first character of “China” is kept in the abbreviation.
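The sketch below illustrates how the CW, PS and SL features can be extracted per character of a pre-segmented full-name and fed to a general-purpose CRF toolkit. It is a minimal sketch: the sklearn-crfsuite package stands in for CRF++, and the feature template names and hyper-parameters are illustrative rather than the exact configuration used in our experiments.

```python
# Sketch: CW/PS/SL feature extraction for character-based reserve/delete tagging.
# sklearn-crfsuite stands in for CRF++; feature names and settings are illustrative.
import sklearn_crfsuite

def char_features(words):
    """words: a pre-segmented full-name, e.g. ["中国", "工商", "银行"]."""
    feats = []
    pos_in_seq = 0
    for wi, w in enumerate(words):
        for ci, c in enumerate(w):
            feats.append({
                "char": c, "word": w,            # CW: character and word identity
                "pos_in_word": str(ci),          # PS: position in current word
                "pos_in_seq": str(pos_in_seq),   # PS: position in character sequence
                "word_pos": str(wi),             # PS: position of word in word sequence
                "word_len": str(len(w)),         # SL: length of current word in characters
                "name_len": str(len(words)),     # SL: full-name length in words
            })
            pos_in_seq += 1
    return feats

# Toy example: 中国 工商 银行 -> 工行 (characters 工 and 行 are reserved)
X_train = [char_features(["中国", "工商", "银行"])]
y_train = [["delete", "delete", "reserve", "delete", "delete", "reserve"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])   # predicted reserve/delete tag per character
```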
3.4. Data Clustering
One challenge in Chinese abbreviation modeling is collecting the training data. The training data usually requires manual collection, which strongly limits its size. For example, [5] used 1945 pairs of organization full-names and their abbreviations, and [6] eventually used 1547 pairs of full-names and abbreviations. In this paper, we use 3612 pairs of full-names and abbreviations for place names. Such a small training data size brings a serious data sparseness issue for Chinese abbreviation modeling. Data sparseness is a common issue in NLP, and it is manifested especially when the training data size is small. In language modeling, class-based methods are used to reduce the impact of data sparseness [25]; in this paper, we adopt a similar strategy for Chinese abbreviation modeling. Two types of clustering methods are used to improve the performance of the Chinese abbreviation models. The first is based on manually selected key words and patterns: in Chinese, similar types of place-names share similar key words and patterns. For example, almost all hospitals and banks have key words such as “hospital” and “bank” in their full names, and most restaurants and hotels have certain special Chinese characters in their names. The second applies Latent Dirichlet Allocation (LDA) to find latent topics for different place names. For LDA, we use the same training data as for abbreviation modeling and treat each full-name as a document. Essentially, we expect the latent information in full-names to help abbreviation modeling.
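The manual, key-word driven clustering can be sketched as follows: a full-name is assigned to a cluster when it contains one of a hand-picked list of key words, with a fallback cluster otherwise. The key-word lists below are illustrative examples, not the actual lists used in our experiments.

```python
# Illustrative key-word based clustering of place-name full-names (key words are examples only).
CLUSTER_KEYWORDS = {
    "hospital": ["医院"],
    "bank":     ["银行"],
    "school":   ["大学", "中学", "小学"],
    "cinema":   ["影城", "电影院"],
}

def assign_cluster(full_name_words):
    """Assign a pre-segmented full-name to the first cluster whose key word it contains."""
    name = "".join(full_name_words)
    for cluster, keywords in CLUSTER_KEYWORDS.items():
        if any(kw in name for kw in keywords):
            return cluster
    return "other"   # fallback when no key word matches

print(assign_cluster(["北京", "大学", "附属", "第六", "医院"]))   # -> "hospital" (first match wins)
```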
4. Experiments
4.1. Experiment Setup
In Chinese abbreviation modeling, the first challenge is that we do not have a well-constructed database. We therefore use the following heuristic method to extract pairs of Chinese full-names and abbreviations. We start from more than 5M Chinese full place-names from a Microsoft internal entity library. Based on these full-names, we generate all possible abbreviation candidates: for a full-name with $n$ characters, there are $2^n - n - 2$ different candidates, excluding single-character abbreviations, the empty string and the original full-name.
Taking advantage of the Bing query database, an abbreviation candidate is selected if it satisfies the following two requirements. First, the candidate has to exist in the query database; in other words, it was entered by real web users in the past as a search query. Second, the full-name of this abbreviation appears in the title of the web pages that the user clicked when he/she entered the abbreviated query. Using this strict selection rule, we obtain more than 6000 pairs of full-names and abbreviations, from which 3612 pairs are eventually retained through manual verification. The full-names in total have 14235 words and 23765 characters, and the abbreviations have 11598 words and 18467 characters. In our experiments, 80% of the full-name and abbreviation pairs are randomly selected for training, 10% for development and the rest for testing. Our CRF results are obtained with CRF++ [26]. In CRF training, the soft margin parameter is determined by grid search from 0.1 to 1.0 on the development data. All the RNNME models described below use 200 hidden units, and the initial learning rate for RNNME is 0.1.
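The candidate enumeration described above can be sketched as follows: every character subsequence of the full-name is a candidate, excluding the empty string, single characters and the full-name itself, which gives $2^n - n - 2$ candidates for an $n$-character name. This sketch covers only the enumeration step; the Bing query filtering is not reproduced here.

```python
# Enumerate abbreviation candidates of a full-name: all character subsequences
# except the empty string, single characters, and the original name (2**n - n - 2 of them).
from itertools import combinations

def abbreviation_candidates(full_name):
    chars = list(full_name)
    n = len(chars)
    cands = []
    for k in range(2, n):                      # skip lengths 0, 1 and n
        for idx in combinations(range(n), k):  # keep the original character order
            cands.append("".join(chars[i] for i in idx))
    return cands

cands = abbreviation_candidates("中国工商银行")
print(len(cands), 2 ** 6 - 6 - 2)   # 56 56
print(cands[:5])
```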
4.2. Results
Three sets of experiments are conducted. The first experiment shows the contribution of different types of features in Chinese abbreviation modeling and also compares the CRF models with the RNNME models. In the second experiment, a manual data clustering method is used in Chinese abbreviation modeling. The third experiment uses LDA to generate latent topics over the whole training data, which are treated as additional features for CRF and RNNME.
4.2.1. Feature Selection
Table 1 shows the accuracy of Chinese abbreviation generation based on different types of features, using CRF and RNNME. The row “CW” shows the results of using only the current character and current word information. As shown in the table, adding character-bigram and word-bigram features does not make much difference, which might be caused by the data sparseness issue.
Table 1: Accuracy of CRF and RNNME abbreviation models using different features. “CW-bigram” means using character bigram and word bigram information; “PS” uses the position of the current character in the current word, of the current character in the character sequence, and of the current word in the word sequence; “PS-bigram” means using PS bigram features.
features                CRF    RNNME
CW                      49.9   50.1
CW +CW-bigram           49.7   50.1
PS                      47.7   47.4
CW +PS                  52.3   52.5
CW +PS +PS-bigram       53.1   52.9
CW +PS +PS-bigram +SL   53.3   53.2
The row “PS” shows the results of using only the position of the current character in the current word, the position of the current character in the character sequence, and the position of the current word in the word sequence. Compared with the CW features, the set of PS feature values is much smaller, so using PS features can partly reduce the impact of data sparseness.
When some characters and words have not been seen in the training data, the model can still fall back on the PS features. As shown in Table 1, using only PS features, both CRF and RNNME achieve more than 47% accuracy. By combining the CW, PS-bigram and SL features, the Chinese abbreviation model achieves the best performance. As shown in Table 1, using RNNME in Chinese abbreviation modeling does not achieve substantial improvement over CRF. By mapping discrete indices into a continuous vector [20, 27, 28], RNNME is able to reduce the impact of data sparseness. However, to take advantage of this, a reasonable amount of training data is needed to learn good embedding vectors. Even though our training data is too small to allow RNNME to fully demonstrate its strengths, we still notice that RNNME achieves a small improvement over CRF when both models use only sparse features such as CW.
4.2.2. Cluster-Based CRF
In this experiment, in order to improve the performance of Chinese abbreviation generation, we cluster the training data and train cluster-dependent CRF models. We notice that similar types of Chinese place names share similar abbreviation patterns; by training cluster-dependent models, we aim to fit each data cluster better. Here we manually label the cluster ID of each piece of training and test data, i.e., the topic of each test full-name is known at run-time. Table 2 compares the global CRF with the cluster-based CRFs for Chinese abbreviation generation. The column “CRF” corresponds to one global CRF model, while the column “c-CRF” corresponds to manually dividing the training data into 7 topics and training 7 CRF models, with the topic of each test full-name given. In five of the seven domains, the cluster-based CRF achieves better accuracy than the global CRF. Overall, the cluster-based CRF models achieve almost 8% relative improvement in accuracy over the global CRF model. We believe that with more training data, the cluster-based CRF models would improve further. In Chinese spoken language, one full-name often has several different abbreviations, as in the first example of Fig. 1. Some of the abbreviations generated by our models do not exactly match the single provided reference, yet people may use these alternative abbreviations in real life. The columns labelled “man-CRF” and “man-c-CRF” therefore give accuracies based on manual judgement.
Table 2: Accuracy computed against a single reference vs. accuracy from manual judgement, using one global CRF model vs. cluster-based CRF models.
cluster     CRF    c-CRF   man-CRF   man-c-CRF
food        50.4   52.3    90.4      94.3
shopping    39.1   38.6    83.7      84.2
park        65.2   51.9    90.8      77.3
school      35.0   38.6    87.1      89.3
bank        53.2   68.8    63.2      82.1
cinema      65.4   71.4    93.0      99.5
hospital    80.5   81.0    99.5      99.5
overall     53.3   57.5    89.1      91.2
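The cluster-dependent models amount to partitioning the training pairs by their manually assigned cluster ID and training one CRF per cluster; at test time the known cluster ID selects the model. A minimal sketch, again assuming sklearn-crfsuite as a stand-in for CRF++ and feature sequences like those produced by the earlier hypothetical char_features() sketch:

```python
# Sketch: train one CRF per manually labelled cluster and dispatch by cluster at test time.
from collections import defaultdict
import sklearn_crfsuite

def train_cluster_crfs(data):
    """data: list of (cluster_id, feature_sequence, label_sequence) triples."""
    by_cluster = defaultdict(lambda: ([], []))
    for cluster, feats, labels in data:
        by_cluster[cluster][0].append(feats)
        by_cluster[cluster][1].append(labels)
    models = {}
    for cluster, (X, y) in by_cluster.items():
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1)
        crf.fit(X, y)                      # one CRF fitted per cluster
        models[cluster] = crf
    return models

def predict(models, cluster, feats):
    return models[cluster].predict([feats])[0]   # the cluster ID is known at run-time
```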
We also experimented with RNNME-based abbreviation modeling; however, we do not observe any improvement over CRF. With clustering, the RNNME models also improve, from 53.2% to 56.8%.
4.2.3. Latent Dirichlet Allocation
Instead of clustering the training data manually, we also investigate using LDA to find latent topics. An assignment of latent topics to words is obtained from LDA via variational inference, and the latent topic vector of each word is used as one binary feature in the CRF and RNNME models. Note that we are not using the latent topic information to cluster the training data. Table 3 gives the accuracy of Chinese abbreviation generation using LDA with different numbers of latent topics. As shown in the table, with CRF, 20 latent topics yield the best performance, whereas with RNNME the best accuracy is achieved with 40 latent topics.
Table 3: Accuracy of CRF vs. RNNME abbreviation models using latent topics from LDA. The topic vector is used as one feature for CRF and RNNME.
latent topics   CRF    RNNME
5               55.3   56.0
10              57.1   56.9
20              58.1   57.8
40              57.6   58.4
60              57.1   57.3
80              56.4   56.8
100             54.2   53.8
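As a sketch of how such a latent-topic feature can be produced: each segmented full-name is treated as one document, an LDA model is trained on the training full-names, and each word is then mapped to its most probable topic, which is added to the per-character feature dictionaries. The sketch assumes the gensim package; the toy data, topic count and inference settings are illustrative and may differ from our experimental configuration.

```python
# Sketch: derive a per-word latent-topic feature from LDA over full-names (gensim assumed).
from gensim import corpora, models

full_names = [["中国", "工商", "银行"], ["北京", "大学", "附属", "第六", "医院"]]  # toy training data
dictionary = corpora.Dictionary(full_names)
corpus = [dictionary.doc2bow(name) for name in full_names]     # each full-name is one document
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, random_state=0)

def word_topic(word):
    """Most probable latent topic of a word, usable as one extra feature per character."""
    term_id = dictionary.token2id.get(word)
    if term_id is None:
        return -1                                              # unseen word: no topic feature
    topics = lda.get_term_topics(term_id, minimum_probability=0.0)
    return max(topics, key=lambda t: t[1])[0] if topics else -1

# e.g. feats["topic"] = str(word_topic(w)) alongside the CW/PS/SL features
print(word_topic("银行"))
```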
5. Conclusions and Future Work
This paper addressed Chinese abbreviation generation. Conditional Random Field models were applied to abbreviation generation by casting it as a character-based tagging problem. Due to the difficulty of abbreviation data collection, only a small amount of training data was available for the Chinese abbreviation models, which brought a serious data sparseness issue. To improve Chinese abbreviation modeling, a Recurrent Neural Network with Maximum Entropy extension (RNNME), training data clustering and Latent Dirichlet Allocation were applied. RNNME represents discrete feature indices with continuous vectors, which has the potential to reduce the impact of data sparseness; however, training such continuous vectors also requires a large training data set, and RNNME achieves performance similar to CRF in our abbreviation modeling. Better abbreviation performance was achieved by applying data clustering and latent topic modeling: with manual topic clustering, the accuracy of the CRF models improved by almost 8% relative, and with LDA using 40 latent topics, the accuracy improved by 10% relative. Due to the small training data, RNNME did not achieve substantial improvement over CRF; using pre-training to prepare embedding vectors for RNNME may help improve its performance. In the future, active learning methods could be applied to Chinese abbreviation modeling to enlarge the training data.
6. References
[1] Z. Wu and G. Tseng, “Chinese text segmentation for text retrieval: Achievements and problems,” JASIS, vol. 44, no. 9, pp. 532–542, 1993.
[2] H. Li and B. Yuan, “Chinese word segmentation,” in Language, Information and Computation, 1998.
[3] J. Gao, M. Li, C.-N. Huang, and A. Wu, “Chinese word segmentation and named entity recognition: A pragmatic approach,” Computational Linguistics, vol. 31, no. 4, pp. 531–574, 2005.
[4] S. Li and C.-R. Huang, “Word boundary decision with CRF for Chinese word segmentation,” in PACLIC, 2009, pp. 726–732.
[5] D. Yang, Y.-C. Pan, and S. Furui, “Automatic Chinese abbreviation generation using conditional random field,” in North American Chapter of the Association for Computational Linguistics, 2009, pp. 273–276.
[6] J.-S. Chang, “A preliminary study on probabilistic models for Chinese abbreviations,” in Proceedings of the Third SIGHAN Workshop on Chinese Language Learning, 2004, pp. 9–16.
[7] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. 18th International Conf. on Machine Learning, 2001, pp. 282–289.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Proceedings of Interspeech, 2010, pp. 1045–1048.
[10] G. Fu, K.-K. Luke, G. Zhou, and R. Xu, “Automatic expansion of abbreviations in Chinese news text,” in Proceedings of the Third Asia Conference on Information Retrieval Technology, 2006, pp. 530–536.
[11] X. Sun, H. Wang, and Y. Zhang, “Chinese abbreviation-definition identification: A SVM approach using context information,” in The Pacific Rim International Conferences on Artificial Intelligence, 2006, pp. 495–504.
[12] F. Peng, F. Feng, and A. McCallum, “Chinese segmentation and new word detection using conditional random fields,” in Proceedings of COLING, 2004, pp. 562–568.
[13] P. V. S. Avinesh and G. Karthik, “Part-of-speech tagging and chunking using conditional random fields and transformation-based learning,” in Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages (SPSAL), 2007, pp. 21–24.
[14] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[15] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 5528–5531.
[16] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, “Strategies for training large scale neural network language models,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2011, pp. 196–201.
[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” ArXiv e-prints, Jan. 2013.
[18] K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi, and D. Yu, “Recurrent neural networks for language understanding,” in Proceedings of Interspeech, 2013, pp. 2524–2528.
[19] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding,” in Proceedings of Interspeech, 2013.
[20] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, Mar. 2003.
[21] I. Bhattacharya and L. Getoor, “A latent Dirichlet model for unsupervised entity resolution,” pp. 47–58, 2006.
[22] Y.-C. Tam and T. Schultz, “Unsupervised language model adaptation using latent semantic marginals,” in Proceedings of Interspeech, 2006.
[23] K. Toutanova and M. Johnson, “A Bayesian LDA-based model for semi-supervised part-of-speech tagging,” in NIPS, 2007.
[24] Y. Shi, P. Wiggers, and C. M. Jonker, “Towards recurrent neural networks language models with linguistic and contextual features,” in Proceedings of Interspeech, 2012, pp. 1664–1667.
[25] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational Linguistics, vol. 18, pp. 467–479, 1992.
[26] T. Kudo, “CRF++: Yet another CRF toolkit,” 2009. [Online]. Available: http://crfpp.googlecode.com/svn/trunk/doc/index.html
[27] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similarities among languages for machine translation,” CoRR, vol. abs/1309.4168, 2013.
[28] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in HLT-NAACL, pp. 746–751.