IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 34, NO. 2, APRIL 2004
Mining Pinyin-to-Character Conversion Rules From Large-Scale Corpus: A Rough Set Approach

Wang Xiaolong, Chen Qingcai, and Daniel S. Yeung, Senior Member, IEEE
Abstract—This paper introduces a rough set technique for solving the problem of mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method that constructs a language information table for each Pinyin from a free-form textual corpus. Data generalization and rule extraction algorithms are then applied to eliminate redundant information and extract consistent PTC conversion rules. The design of the model also addresses a number of important issues, such as the long-distance dependency problem, the storage requirements of the rule base, and the consistency of the extracted rules, while the performance of the extracted rules and the effects of different model parameters are evaluated experimentally. The results show that, with the smoothing method, a high conversion precision rate (0.947) and recall rate (0.84) can be achieved even for rules represented directly by Pinyin rather than words. A comparison with the baseline tri-gram model also shows good complementarity between our method and the tri-gram language model.

Index Terms—Data generalization, data mining, Pinyin-to-character conversion, rough set.
I. INTRODUCTION
A Pinyin is the phonetic symbol of a Chinese character. Pinyin-to-character (PTC) conversion means the automatic conversion of Pinyin to characters. Since there are only 410 nonpitched Pinyin as against more than 6700 Chinese characters, the main problem of PTC conversion is correctly matching characters and Pinyin in a given context. The most common way to address this problem is to construct an appropriate language model. One popular, simple, and efficient model is the statistical n-gram language model (SLM) [1]–[3]. It has been successfully applied to many problems in natural language processing (NLP), but it can capture only the short-distance context dependency within an n-word window, and recent research has reported that the largest practical value for n in natural language processing is 3 [4]. A variety of methods for the long-distance dependency problem can be found in [4]–[13], [25].
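For reference, the n-gram factorization referred to throughout this discussion is the standard one (notation ours, not the paper's):

$$P(w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

so a tri-gram model (n = 3) conditions each word only on its two immediate predecessors, which is exactly what limits the SLM to short-distance dependencies.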
Manuscript received May 22, 2002; revised November 21, 2002. This work was supported by the National Natural Science Foundation of China (60175020) and by a Hong Kong Polytechnic University Senior Research Fellowship. This paper was recommended by Associate Editor D. Cook. W. Xiaolong is with the Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, and also with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong. C. Qingcai is with the Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. D. S. Yeung is with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong. Digital Object Identifier 10.1109/TSMCB.2003.817101
In Zhou and Lua's model, the long-distance dependency problem was addressed by interpolating n-grams with trigger pairs [25] selected according to their mutual information. This method addresses the long-distance dependency problem, but too many hypotheses must be made before an acceptable number of trigger pairs can be found. Siu and Ostendorf propose a variable n-gram model [4] for the representation of long-distance information. By representing n-grams with trees, this model uses node merging and node combination to reduce redundant data, so the maximum n can be much larger than in conventional n-gram language models while keeping storage space and computational requirements on a reasonable scale. In Kuhn's model [5], the topical dependency at the level of the article or discourse was considered by adding a cache component to model long-distance constraints; this component was then interpolated with the SLM. The underlying assumption of this model is that if a word is used in a context, it is likely to be used again. Goodman's research [6] showed that in speech recognition tasks the cache model obtained relatively larger improvements in perplexity than in error rates. A similar model was provided by Ney et al. [7]. This model uses word association to identify long-distance dependencies between words: the dependency between a word and its history is analyzed into pair-wise word associations, and the conditional probabilities of all the word associations are then interpolated. This model can be thought of as a generalization of Kuhn's cache model. Rosenfeld [8] put forward another trigger-like language model, in which different kinds of knowledge sources for the history of a word are integrated according to the concept of a trigger pair. The maximum entropy principle is applied to weight knowledge sources and to select trigger pairs. Since the concept of a "trigger pair" is so general, this model can be regarded as another generalization of trigger or cache-like models. The performance of language models may also be improved by using grammatical as well as lexical information. In Roark's model, probabilistic top-down parsing was integrated into a tri-gram by interpolation [9]. Models based on the same knowledge source include Chelba and Jelinek's structured language model [10], in which the search procedure is carried out strictly from left to right, making it easy to integrate the parsing results with an n-gram language model at the word level. Charniak's model [12] made use of the immediate-head information of the parser to derive a revising procedure when more information was provided. Charniak's model complicates the search procedure and restricts the combination of parsing information with n-grams to the sentence level, yet, compared
with Chelba and Jelinek's model, it does report an improvement in perplexity. Category- or class-based language models, such as those proposed in [11], [12], could also be used to address the long-distance dependency problem. Because the number of different categories or classes is generally much smaller than the number of words in a vocabulary, a larger n is feasible from both the statistical and the storage viewpoints. Another important technique is the integration of linguistic rules. In [13], a rule-based language model making use of a transformation-based method was presented. By combining an error-driven learning process with a manually annotated corpus, linguistic information was learned and represented in a concise and easily understood rule form. These rules were then applied to the part-of-speech tagging problem. Good performance was reported in this area, but two major problems must be addressed before this technique can be used in other applications. One is the uncertainty and contradictory nature of natural language, which may cause inconsistent rules to be extracted. The second is that the stop condition of the transformation process may need to be relaxed. Although these models greatly improve the performance of the classical n-gram language model, most of them take the approach of combining different knowledge sources with n-grams. Such knowledge sources usually take the form of linguistic information such as syntactic knowledge, or corpus-extracted word triggers. Where the knowledge source is such a word trigger, there remains the problem of how, in an NLP problem, to extract useful knowledge and remove redundant data. The integration of word-based triggers or syntactic information is important for improving model performance, but it is not always suitable or efficient in a specific NLP application. Furthermore, because these models are designed to be general and application-independent, they ignore application-dependent features, and the knowledge-pruning methods they apply are too coarse for some applications. This paper seeks to solve the above-mentioned problems by developing a rule-mining method for PTC conversion. The rules mined by our model can be used independently, or they can be integrated with other statistical techniques in any application that involves PTC conversion. In addition to the long-distance dependency problem, this paper also considers the following issues: the consistency of the extracted rules, the acceptability of the storage requirements of the rule base, and the simplicity of the model parameters. To address these issues, this paper introduces a rough set technique [14] and presents a rough-set-based method for mining PTC conversion rules. The paper is organized as follows. Section II briefly reviews rough set theory. In Section III, we describe a language information system for transforming unstructured textual data into structured features; we then present data generalization and rule extraction methods and describe how the mined rules can be used to convert Pinyin to characters. In Section IV, we present an analysis of experimental results, and in Section V we discuss our method and future research problems.
II. ROUGH SET

Rough set theory was proposed by Pawlak [14] as a way to analyze vague or fuzzy data sets. It allows concept approximation and rule mining from large-scale evidence that contains vague, inconsistent, or incomplete data, and has been applied with success to knowledge discovery in databases (KDD) [15]. Apart from the unstructured form of a large-scale corpus, we face the same problems in mining linguistic rules as in KDD, namely the uncertainty and inconsistency of natural language, the huge scale of the data, and incomplete information. The following discussion adopts, for the most part, the notations and definitions of Düntsch [16].

In rough set theory, knowledge is represented via relational tables. An information system can be defined as

$$I = (U, \Omega, V_a, f_a) \qquad (1)$$

where $U$ is a non-empty set of objects and $\Omega$ is a non-empty set of attributes; for each attribute $a \in \Omega$, there is an attribute value set $V_a$ and an information function $f_a: U \to V_a$.

A. Set Approximation

Definition 1: Let $I = (U, \Omega, V_a, f_a)$ be an information system. An equivalence relation $R$ on the set $U$ is called an indiscernibility relation if and only if $R$ satisfies the following:
1) Reflexivity: $xRx$ for all $x \in U$.
2) Transitivity: if $xRy$ and $yRz$, then $xRz$ for all $x, y, z \in U$.
3) Symmetry: if $xRy$, then $yRx$ for all $x, y \in U$.
The bi-tuple $(U, R)$ is called an approximation space. The approximation space is a core concept in rough set data analysis; it determines the state space of the problem and the relationships between objects.

Definition 2: Let $I = (U, \Omega, V_a, f_a)$ be an information system. Each indiscernibility relation $R$ on $U$ defines a partition $U/R$ on $U$ as follows: any $x, y \in U$ satisfy $xRy$ if and only if they belong to the same class of the partition $U/R$. The classes of $U/R$ can be represented as $[x]_R = \{y \in U : xRy\}$.

Definition 3: For an object set $X \subseteq U$, the set

$$\underline{R}X = \{x \in U : [x]_R \subseteq X\}$$

is called the lower approximation (or positive region) of the set $X$, and the set

$$\overline{R}X = \{x \in U : [x]_R \cap X \neq \emptyset\}$$

is called the upper approximation of $X$. The set $X$ is called definable if and only if $\underline{R}X = \overline{R}X$; otherwise $X$ is called rough with respect to $R$.
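To make these definitions concrete, the following self-contained sketch (ours, not the paper's) computes the lower and upper approximations of a concept from a partition of the universe, together with the β-relaxed lower approximation introduced in the next subsection; all data and names are invented for illustration.

```python
# Rough set approximations of a concept X, given the partition U/R.

def lower_approximation(partition, X):
    """Union of equivalence classes fully contained in X."""
    return {x for block in partition if set(block) <= X for x in block}

def upper_approximation(partition, X):
    """Union of equivalence classes that intersect X."""
    return {x for block in partition if set(block) & X for x in block}

def beta_lower_approximation(partition, X, beta=1.0):
    """Union of classes whose overlap ratio with X is at least beta.
    With beta = 1.0 this reduces to the conventional lower approximation."""
    result = set()
    for block in partition:
        block = set(block)
        if len(block & X) / len(block) >= beta:
            result |= block
    return result

if __name__ == "__main__":
    U_partition = [{1, 2}, {3, 4, 5}, {6}]        # classes of U/R
    X = {1, 2, 3, 4, 6}                            # the concept to approximate
    print(lower_approximation(U_partition, X))     # {1, 2, 6}
    print(upper_approximation(U_partition, X))     # {1, 2, 3, 4, 5, 6}
    print(beta_lower_approximation(U_partition, X, beta=0.6))  # all of U
```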
B. β-Approximation
Although lower and upper approximations can be applied to modeling vague concepts, the lower approximation set may sometimes be too strict for a problem while the upper approximation set is too relaxed. For example, we may want to approximate a concept under a statistical framework. In this case,
imposing a limiting parameter on the approximation sets is convenient. Considering the intention of this paper, the concept of β-approximation is adopted and defined as follows [23].

Definition 4: For an object set $X \subseteq U$ and a given value $\beta$, the set

$$\underline{R}_\beta X = \{x \in U : |[x]_R \cap X| / |[x]_R| \geq \beta\}$$

is called the β-lower approximation of $X$, and the set

$$\overline{R}_\beta X = \{x \in U : |[x]_R \cap X| / |[x]_R| > 1 - \beta\}$$

is called the β-upper approximation of $X$. Here β is an external parameter imposed on Pawlak's rough set model; with β = 1 we are back to the conventional definition of set approximation.

III. RULE MINING

A. Problem Description

Rough set is a useful tool for data mining [18]. As mentioned previously, a structured information table must be constructed before the rough set technique can be applied. In general, much of the research in data mining has focused on ways of applying knowledge discovery methods to structured data stored in a single relational database; most often the data are even stored in a single relational table. Much work has been done on text mining [17], [18], but it has mainly been concerned with extracting knowledge or information implied in the content of textual documents, not with the form of the text or with the sentences themselves. In what follows, we would like to show that not only is it important to consider the content of the document, but that the form of the text or of the sentence itself is even more important in the extraction of Chinese Pinyin-to-character conversion rules.

Two questions must be answered by anyone who wishes to build a structure for a textual document. First, what features of the text should be chosen for a specified task? In a Chinese input method or in speech recognition, a string composed of Pinyin is first provided, and then one Chinese character for each Pinyin is chosen from among its homophones based on the contextual information. So the main and obvious features that can be used are the other Pinyin in the string and their positions relative to the Pinyin being converted. Because most systems provide the Pinyin string without any space indicating the Pinyin boundaries, they usually include a preprocess for segmenting the Pinyin. Although in most cases this preprocessing task is easy to complete, in some cases it is too hard to address; for example, the independent Pinyin "xian" and the combined Pinyin "xi'an" are frequently encountered, and on most occasions the user must distinguish between them manually. In this paper, we assume that the given Pinyin string is always segmented into independent Pinyin. Since the Pinyin strings are produced by the character-to-Pinyin conversion procedure, this assumption will not affect the rule extraction and performance evaluation of our model.

The second question that must be answered by anyone who wishes to build a structure for a textual document is: what kinds of rules can be expected? Obviously the answer to this question depends on the textual features that have been chosen for examination. Traditionally, a conversion rule is of the form $v(F') \Rightarrow c$, where $F'$ is a subset of features and $v(F')$ denotes the value of $F'$ for the sentence. The right-hand side means that the Pinyin $p$ should be converted to the character $c$ under the context $v(F')$. But then what feature set should be chosen? It is in general not advisable to select all of the features; Example 1 explains why this is the case.

Example 1: The following four sentences, S1 to S4, are given in both Pinyin and characters:
If all of the Pinyin in a sentence are included in the left-hand side of the Pinyin-to-character rule for the Pinyin "zhi", then, because each sentence in the example is unique, four rules could be constructed:
In fact, for free-form text, one can easily construct hundreds of sentences with different forms but the same meanings as those illustrated in Example 1. If we do that, the rule base will become unnecessarily large, and hence impractical. It is easy to see that sentences S1 to S4 suggest just two conversion rules for the Pinyin "zhi":
This means that sentences S1 to S4 provide just the same information as the two Pinyin "mao" and "hua" do in the Pinyin-to-character conversion of "zhi", while the other features (Pinyin) are redundant. Although rough set techniques can be applied to cut off redundant information and look for functional dependencies (or rules) in a table, two factors make it too difficult to apply rough set techniques directly, namely, the time complexity of the algorithm and the variation in the lengths of sentences. The time complexity of checking whether or not two attributes in a table are functionally dependent is analyzed in [19]. Assuming that the average
length of a sentence is $L$, the time complexity of checking the functional dependencies for each Pinyin grows accordingly with $L$. Moreover, the number of possible sentences for a given language (be it Chinese or English) is extremely large: assuming the number of characters in a language is $N_c$, the total number of possible sentences of length $L$ is $N_c^L$, and even discounting unlikely sentences, the number of possible sentences is still astronomical. Such a time complexity is not acceptable. Furthermore, variations in sentence length would make it impossible to construct a complete information table of the attribute values, which further complicates the problem at hand.

B. Construction of LIT

To address the problems described above, we first construct a language information table (LIT) for each Pinyin. Rather than using all the features as attributes of an object in the LIT, only a subset of the features is used. As indicated in (1), the attribute set $\Omega$ is composed of two parts: a decision attribute $d$, whose value set is the set of homophonous characters of the Pinyin $p$, and a feature set $F = \{F_1, \ldots, F_n\}$ that is used to predict the $d$-value of an object. All the LITs of the Pinyin together form a language information system (LIS); the LIT for a Pinyin $p$ is denoted by $\mathrm{LIT}_p$. In this paper, the cardinality $n$ of the feature set is assumed to be the same for all LITs in a LIS. This is not a requirement, but the conversion algorithm is greatly simplified by this assumption. The value of $n$ depends on the form of the rules to be mined, which also determines the way the features are extracted.

Our rules are divided into two types. One is the uni-feature rule (UFR), whose left-hand side contains a single feature. The other is the multi-feature rule (MFR), whose left-hand side is a set of features, possibly of the same kind. The UFR is useful when the value space of the features is larger than that of the decision attribute; otherwise, the MFR is preferred. The two rules suggested above for sentences S1 to S4 are of UFR type. For this type of rule, the context behind the rules is ignored. The merits of the UFR are the concise form of the rule and the simplicity of the extraction and prediction algorithms, but it is not suitable in many applications. For example, suppose we have the following sentence to convert from Pinyin to characters.

Example 1 (continued): S5: Yi ding mao zi gua zai shu zhi shang.

Then the Pinyin "zhi" would be converted, by the rule extracted from sentences S1 to S4, to the wrong character: the character corresponding to "zhi" in this context is a different homophone, and the whole sentence should be converted accordingly.
Obviously, this error is caused by ignoring the context. In this case, it is necessary to use MFR rules. As mentioned in Section III-A, there are two kinds of contextual information available to the Pinyin-to-character conversion problem: the other Pinyin in a sentence and their relative positions. Both can be used to mine MFR rules. The four rules constructed earlier from all the Pinyin of sentences S1 to S4 are extreme examples of MFR rules: all contextual information is included, yet they are still not capable of determining the right character for the Pinyin "zhi" in sentence S5. In fact, no prediction is made at all, neither correct nor incorrect. So the MFR rules could be considered to have performed marginally better than the UFR rules by adopting an approach of "if it can't be better, do nothing". For the Pinyin-to-character problem, because the state space of Pinyin is smaller than that of characters, the UFR is not considered further, and the parameter $n$ is used to denote the number of features used in the MFR; we call $n$ the order of the rules.

Although the LIS is composed of multiple LITs, it can be constructed by a one-pass scan of the corpus. A large amount of raw corpora contain only Chinese characters, so the first step in constructing an LIS is to convert each sentence in the corpus into a Pinyin string. Then, for a given rule order $n$ and a string pair composed of a Chinese character string and its corresponding Pinyin string, tuples consisting of an order-$n$ feature set and a character are extracted and inserted into the LIT of that character's Pinyin. Because the number of possible order-$n$ feature values would be too large to be acceptable for a large-scale corpus, a simplification of the feature set is necessary. In this paper, the values of an order-$n$ feature set are limited to a Pinyin string of length $n-1$ that appears successively in the sentence, together with an integer that represents the distance between that string and the target Pinyin in the same sentence. For example, in sentence S5 of Example 1, in the case of $n = 2$,
a neighboring Pinyin with its distance to "zhi", the target Pinyin, and its character form a 4-tuple that should be inserted into the LIT of the Pinyin "zhi", whereas tuples built from Pinyin strings that do not appear successively in the sentence are not considered. Algorithm 1 formally presents the method of constructing a LIS for a given $n$ and a textual corpus.

Algorithm 1 (Construction of a LIS for a given value $n$ and a training corpus $D$ of Chinese text)
Notation: $S_p$, a string of Pinyin; $S_c$, a string of Chinese characters; $len(S)$, the length (the number of Pinyin or characters) of a string $S$; $S[i]$, $1 \le i \le len(S)$, the $i$th Pinyin or character of $S$; $obj$, an object in the LIS; $obj.F_k$, the value of the $k$th attribute; $obj.d$, the value of the decision attribute $d$; $Py(c)$, the operator that returns the Pinyin of the character $c$. The following pointers are also used: $ptr$ points to the next accessible string of $D$; $i$ points to the next Pinyin in $S_p$; $j$ points to the next character in $S_c$.
begin
  $ptr$ points to the beginning of $D$; empty all LITs of the LIS
  while not reaching the end of $D$ do
    read a string $S_c$ from $ptr$; convert $S_c$ to the Pinyin string $S_p$
    $i$ points to the beginning of $S_p$
    while $i \le len(S_p)$ do
      $j$ points to the beginning of $S_p$
      while $j \le len(S_p) - n + 2$ do
        if $j \ne i$ then
          $obj.F_1, \ldots, obj.F_{n-1} \leftarrow S_p[j], \ldots, S_p[j+n-2]$;
          $obj.F_n \leftarrow j - i$; $obj.d \leftarrow S_c[i]$;
          insert $obj$ into $\mathrm{LIT}_{Py(S_c[i])}$
        end if
        $j \leftarrow j + 1$
      end while
      $i \leftarrow i + 1$
    end while
  end while
end

In Algorithm 1, both the other Pinyin in a sentence and their positions relative to the target Pinyin are taken as features, and the corresponding character as the decision attribute. The time complexity of the read-file operation is $O(N)$, where $N$ is the number of characters in the corpus $D$. The time complexity of the insertion operations can be calculated in the following way. Let $L$ be the average length of a Chinese sentence. There are at most $L(L - n + 1)$ insert operations for each sentence of length $L$, and there are about $N/L$ sentences in $D$ in total. Usually $L \gg n$, so the time complexity of the insertion operations is $O(NL)$. Since each insertion adds one object to the LIS and each object carries $n + 1$ attribute values, the space requirement of the LIS is $O(nNL)$; because $(L - n + 1)/L$ is usually close to 1, this term of the space requirement cannot be reduced.

One problem remains to be treated in detail, namely the conversion of the string $S_c$ to the string $S_p$ (character-to-Pinyin or CTP conversion). This conversion faces the polyphone problem, which in this paper means that some Chinese characters have more than one "toneless Pinyin". In fact, among frequently used Chinese characters, only about 1% are polyphones, and a given character has at most three toneless Pinyin. The problem is addressed by the following steps.

a) The string $S_c$ is first segmented into words by an unsupervised segmentation method [20]. Nearly 99.9% of multi-character words correspond to a unique Pinyin string, so the Pinyin of multi-character words can be determined by querying a dictionary.
b) The Pinyin of all non-polyphonic mono-character words can also be directly determined by querying a dictionary.
c) The remaining polyphonic characters are dealt with by first constructing a tri-gram-like database from the training corpus. Each element in this database is a tri-tuple of words consisting of the polyphone to be converted, the word before it, and the word after it in the same sentence. Taking as an example a polyphone that relates to the Pinyin "zhang" and "chang", one tri-tuple is extracted from a Chinese sentence meaning "she grows up" and another from a sentence meaning "after negotiating for a long time". Next, the Pinyin of the polyphone in each of the 100 000 most frequently occurring tri-tuples in the database is marked manually, and a database denoted CtP is constructed to store these tri-tuples and the marked Pinyin. To convert a polyphonic character into Pinyin, its tri-tuple is first constructed in the same way as for the tri-tuple database; this tri-tuple is then looked up in CtP, and if it exists there, the Pinyin associated with it in CtP is the Pinyin into which the character should be converted.
d) After steps a) to c), some ambiguous words or characters remain that resist CTP conversion. In such cases, we expand the Pinyin string into a series of Pinyin strings, each referring to one candidate conversion of the ambiguous character or word, and Algorithm 1 without the CTP conversion step is then applied to each Pinyin string. This process introduces some contradictions, but it is not a significant problem considering the small number of polyphonic words or characters and the contradiction-removing capability of the rule extraction method presented later in this paper.

Table I(a) is an example of a LIT with 20 objects for the Pinyin "zhi", constructed with $n = 2$ from the sentences of Example 1.
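The following is a minimal, self-contained sketch (ours, not the paper's code) of the construction step just described, with a toy character-to-Pinyin dictionary standing in for the CTP procedure and order n = 2; the object layout and all names are assumptions made for illustration.

```python
# Sketch of LIS construction: extract (context pinyin, signed distance, character)
# objects for each target Pinyin, following the feature simplification above.
from collections import defaultdict

PY = {"树": "shu", "枝": "zhi", "帽": "mao", "子": "zi"}   # toy CTP dictionary

def build_lis(sentences, n=2):
    """Return {pinyin: list of objects}, each object = (features..., character)."""
    lis = defaultdict(list)
    for chars in sentences:
        pinyins = [PY[c] for c in chars]               # CTP conversion
        for i, (p, c) in enumerate(zip(pinyins, chars)):
            for j in range(len(pinyins) - (n - 1) + 1):
                if j == i:                              # skip the target itself
                    continue
                context = tuple(pinyins[j:j + n - 1])   # Pinyin string, length n-1
                distance = j - i                        # signed relative position
                lis[p].append(context + (distance, c))  # one object of the LIT
    return lis

if __name__ == "__main__":
    lis = build_lis([["帽", "子", "树", "枝"]])          # toy character string
    for obj in lis["zhi"]:
        print(obj)
```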
TABLE I: A LIT FOR PINYIN "zhi" WITH n = 2, PARTS (a) AND (b)
C. Data Generalization

Usually, data generalization is performed attribute by attribute using attribute removal, so that different tuples may be generalized to identical ones and the number of distinct tuples in the generalized relation is reduced [21]. This paper presents a different way to generalize data, based on the indiscernibility relation in a LIT. A natural indiscernibility relation $R_N$ (N denotes natural) is defined such that, for any two objects $x, y$, if all of their attribute values are equal, then $x R_N y$. By this definition, there are six pairs of objects in Table I(a) that are indiscernibly related to each other, so there are only 14 distinct objects in the LIT shown in Table I(a). Although this definition seems natural for an indiscernibility relation, in some situations a different definition is needed. Consider the following example.

Example 2: Given a further Chinese character string converted from Pinyin, we can add another distinct object, Obj21, to Table I(a) under the condition of $n = 2$, as shown in Table I(b). But this object essentially carries the same information as Obj5 or Obj10, even though their F2 values differ from each other. This means that Obj21 should be considered the same as Obj5 and Obj10. To satisfy this, a new indiscernibility relation, denoted $R_S$, is needed rather than $R_N$. We propose to define this new relation as follows: for any two objects $x, y$, if their values of the first $n - 1$ features are equal and their values of the $n$th feature have the same sign, then $x R_S y$. Based on this new definition, a revised data generalization algorithm is given by Algorithm 2 (a rough Python rendering follows below). This algorithm merges all objects that are indiscernible under $R_S$ into one equivalence class and adds a new attribute to record the number of objects in each equivalence class; each equivalence class is represented by an arbitrary object belonging to it.
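Here is that rough Python rendering (ours; the object layout, names, and toy data are assumptions) of the merge-and-count step that Algorithm 2 formalizes: objects whose first n-1 features agree and whose last feature has the same sign are collapsed into one representative carrying a Cnt value.

```python
# Generalization of a LIT under the sign-based indiscernibility relation.
from collections import Counter

def sign(x):
    return 1 if x > 0 else -1

def generalize(lit):
    """Collapse a LIT into a GLIT: one representative per equivalence class
    plus a Cnt value counting the merged objects."""
    counts = Counter()
    for obj in lit:
        *prefix, dist, d = obj                       # (F1..Fn-1, Fn, d)
        counts[((tuple(prefix), sign(dist)), d)] += 1
    return [(feats, d, cnt) for (feats, d), cnt in counts.items()]

if __name__ == "__main__":
    lit = [("mao", -3, "c1"), ("mao", -1, "c1"), ("hua", 2, "c2")]
    for row in generalize(lit):
        print(row)   # ((('mao',), -1), 'c1', 2) and ((('hua',), 1), 'c2', 1)
```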
Algorithm 2 (Generalization Algorithm for LIT)
Input: a LIT with duplicate objects, and an indiscernibility relation $R$ defined for any pair of objects.
Output: a GLIT without duplicate objects, with a count of each object in the original table added as a new attribute named "Cnt".
begin
  while the LIT is not empty do
    remove an object $x$ from the LIT
    if there is an object $y$ in the GLIT satisfying $x R y$
      then increase $y.\mathrm{Cnt}$ by 1
      else insert $x$ into the GLIT and set $x.\mathrm{Cnt}$ to 1
    end if
  end while
end

After the generalization process, the data are generalized to higher-level concepts as defined by $R$, and the number of tuples in the generalized LIT (GLIT) is substantially reduced. Table II is the GLIT obtained from the LIT of Table I under the indiscernibility relation $R_S$ defined in Example 2. Because only the sign of the attribute F2 is considered, the values of F2 in the GLIT are simplified to -1 or 1, depending on whether the value of F2 is negative or positive, respectively. The GLITs for all Pinyin compose a generalized LIS (GLIS).

TABLE II: THE GLIT FOR TABLE I AND $R_S$ DEFINED IN EXAMPLE 2

D. Rule Extraction

A rough information system can be divided into many equivalence classes using the decision attribute; let the $i$th class contain the object subset $X_i$. The intention to extract PTC rules can then be fulfilled by constructing the lower approximation set of each $X_i$, $i = 1, \ldots, m$, where $m$ is the number of equivalence classes of the decision attribute. Considering the statistical properties of natural language, we apply here the β-approximation defined in Definition 4 rather than the conventional approximation. This leads to the following definition of a PTC rule set.

Definition 5: Given an $n$ and a $\beta$, an $n$-order Pinyin-to-character conversion rule set for the Pinyin $p$ is the subset of the GLIT of $p$ defined as

$$RS^{n}_{p}(\beta) = \{x \in \mathrm{GLIT}_p : P(x) \geq \beta\}$$

where $P(x)$ is the estimated conditional probability given in (2) below.
Algorithm 3 formally provides the rule extraction procedure corresponding to Definition 5.

Algorithm 3 (Rule Extraction Algorithm)
Input: $\mathrm{GLIT}_p$, the GLIT of the Pinyin $p$, and the threshold $\beta$.
Output: an $n$-order rule set $RS^{n}_{p}(\beta)$.
begin
  while $\mathrm{GLIT}_p$ is not empty do
    remove one object $x$ from $\mathrm{GLIT}_p$ and calculate the maximum likelihood estimate of the conditional probability of its decision value given its features by

$$P(x) = \frac{\sum_{y \,:\, y \approx_F x,\; y.d = x.d} y.\mathrm{Cnt}}{\sum_{y \,:\, y \approx_F x} y.\mathrm{Cnt}} \qquad (2)$$

    where the sums are taken over all objects $y$ of the original $\mathrm{GLIT}_p$ that satisfy the stated conditions, and $\approx_F$ denotes the partial indiscernibility induced from $R_S$ when all attribute values other than those of the feature subset $F$ are ignored.
    if $P(x) \geq \beta$ then insert $x$ into $RS^{n}_{p}(\beta)$
    end if
  end while
end
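A sketch (ours, under the simplified data layout of the earlier sketches) of this extraction step: a GLIT entry becomes a rule only if its estimated conditional probability (2) reaches the precision threshold β.

```python
# Rule extraction with a beta threshold over a GLIT of (features, char, cnt) rows.
from collections import defaultdict

def extract_rules(glit, beta=0.95):
    total = defaultdict(int)    # Cnt summed over objects with the same features
    joint = defaultdict(int)    # Cnt summed over objects with same features and char
    for features, char, cnt in glit:
        total[features] += cnt
        joint[(features, char)] += cnt
    rules = []
    for features, char, cnt in glit:
        p = joint[(features, char)] / total[features]   # estimate (2)
        if p >= beta:
            rules.append((features, char, p))
    return rules

if __name__ == "__main__":
    glit = [((("mao",), -1), "c1", 19),
            ((("mao",), -1), "c2", 1),
            ((("hua",), 1), "c2", 5)]
    print(extract_rules(glit, beta=0.9))
    # [((('mao',), -1), 'c1', 0.95), ((('hua',), 1), 'c2', 1.0)]
```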
E. Pinyin-to-Character Conversion

In this section, we present the method for using the rough statistical rules extracted from the GLIS as described in the last section. Algorithm 4 below uses the extracted rules to convert Pinyin to characters.
Algorithm 4 (Pinyin-to-Character Conversion Algorithm)
Input: a string of Pinyin $S_p$; the order $n$ of the rule sets; the indiscernibility relation used in Algorithm 2; and the conversion rule sets for all Pinyin extracted by Algorithm 3.
Output: a sentence $S_c$ of characters corresponding to the Pinyin in $S_p$.
begin
  let $p_i$ be the $i$th Pinyin of $S_p$, $i \leftarrow 1$
  while $i \le len(S_p)$ do
    construct a LIT $T_i$ for $p_i$ from $S_p$ by Algorithm 1, with the exception that the values of the decision attribute in $T_i$ are not set
    generalize $T_i$ to a GLIT $T^G_i$ by Algorithm 2, using the same indiscernibility relation
    if the rule set of $p_i$ is empty then the conversion result for $p_i$ is marked as a "leak"
    else
      for each object $x$ in $T^G_i$,
        if there is an object $y$ in the rule set of $p_i$ satisfying $y \approx_F x$, then set $x.d \leftarrow y.d$
        else remove $x$ from $T^G_i$
        end if
    end if
    for each value of the attribute $d$, count the number of objects in $T^G_i$ that have this value, and choose the value of $d$ with the maximum count as the conversion result for $p_i$. If more than one value has the maximum count, the decision cannot be determined and a "rejection" is marked for $p_i$; on the other hand, if the counts of all values are zero, the decision still cannot be determined and a "leak" is marked for $p_i$.
    $i \leftarrow i + 1$
  end while
end

Algorithm 4 is called a voting algorithm because of the way its decision is made. Note that a "leak" is usually caused by data sparsity; in the case of our model, data sparsity means that there are not enough data for the extraction of all decision rules. In fact, it is nearly impossible to obtain a text corpus containing all conversion rules. Therefore, smoothing methods used in the statistical approaches of NLP, such as the back-off and interpolation methods, should be taken into account; this is discussed in the next section. A "rejection" is usually caused by the uncertainty and contradiction that are inherent in a natural language. To address this problem, a finer information grain is needed.
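A simplified voting sketch (ours, reusing the invented data layout of the previous sketches: rules are (features, character, probability) triples indexed by Pinyin) illustrating how leaks, rejections, and decisions arise.

```python
# Voting conversion: match context features of each Pinyin against its rules.
from collections import Counter

def convert(pinyins, rule_sets, n=2):
    """Return one decision per Pinyin: a character, 'rejection', or 'leak'."""
    result = []
    for i, p in enumerate(pinyins):
        votes = Counter()
        for j in range(len(pinyins) - (n - 1) + 1):
            if j == i:
                continue
            features = (tuple(pinyins[j:j + n - 1]), 1 if j > i else -1)
            for feats, char, _prob in rule_sets.get(p, []):
                if feats == features:
                    votes[char] += 1
        if not votes:
            result.append("leak")                      # no applicable rule
        else:
            best = votes.most_common()
            if len(best) > 1 and best[0][1] == best[1][1]:
                result.append("rejection")             # tie: no decision
            else:
                result.append(best[0][0])
    return result

if __name__ == "__main__":
    rules = {"zhi": [((("mao",), -1), "c1", 0.95)]}
    print(convert(["mao", "zhi"], rules))              # ['leak', 'c1']
```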
F. Rule Smoothing

In a manner similar to the way that statistical language models are smoothed in NLP, we can take advantage of the fine granularity of higher order rules and the lesser data sparsity of lower order rules to improve decision-making performance. There are two main approaches to smoothing in NLP [22]: the "back-off" method and the "interpolated" technique. In our case, "back-off" means that the lower order rule sets are not used unless we cannot make a decision based on the higher order rule sets alone, whereas the interpolated method makes use of both higher and lower order rules each time a decision is made. As mentioned earlier, higher order rules contain information of finer granularity, so this information should be given higher priority; accordingly, we have adopted the back-off algorithm as our smoothing method. It is briefly described as follows. First, the higher order rules are applied via the voting algorithm to make a decision for the Pinyin-to-character conversion in a specific context. If no "leak" or "rejection" is encountered, the task is terminated; otherwise, the lower order rules are invoked and the same steps are repeated. Finally, if we fail to make a decision with both the higher and the lower order rules, a "leak" or "rejection" is recorded.

IV. EXPERIMENT AND RESULT ANALYSIS

A corpus consisting of 12 million Chinese characters extracted from news articles of the People's Daily was used to evaluate the performance of our model. The corpus was divided into two parts, a training corpus and a test corpus. The training corpus consists of 10 million characters, which were used to extract the decision rules; the other 2 million characters compose the test corpus and were used to test the performance of our model. All tests were run on the same test set, consisting of 20 articles (totaling 29 898 Chinese characters) randomly chosen from the test corpus.

Experiments were conducted in two stages. In the first stage, various rule sets for different rule orders were constructed from subsets of the training corpus of various scales. These rule sets were applied to the test set described above to convert Pinyin into characters; the purpose of this experiment was to determine the effects of corpus scale and rule order on the performance of the model. In the second stage, several rule sets with respect to different values of the rule precision threshold $\beta$ were constructed and applied to the Pinyin-to-character conversion task to evaluate the relationship between $\beta$ and the quality of the rule sets. After these two stages of experiments, the PTC conversion results were compared with the baseline results provided by a conventional word-based tri-gram language model; the dictionary used for the construction of the tri-gram library consists of 112 922 words.

Two metrics were introduced to quantitatively evaluate the conversion quality of a rule set, the precision rate $P_r$ and the recall rate $R_r$. They are defined as

$$P_r = \frac{n_c}{n_c + n_e}, \qquad R_r = \frac{n_c}{n_c + n_e + n_r + n_l} \qquad (3)$$
where $n_c$ denotes the number of Pinyin correctly converted, $n_e$ the number converted in error, $n_r$ the number of rejections, and $n_l$ the number of leaks. Experimental results for stage 1 and stage 2 are shown in Fig. 1 and Fig. 2, respectively.

Fig. 1. Effects of corpus size and rule order on the (a) precision and (b) recall rates of Chinese Pinyin-to-character conversion, β = 0.95.
Fig. 2. Effects of β on the (a) precision and (b) recall rates of Chinese Pinyin-to-character conversion, for a training corpus of 8.2 million characters.

Fig. 1(a) and (b) show the precision and recall rates of 2-, 3-, and 4-order rules and of their smoothing for Pinyin-to-character conversion under the condition β = 0.95 (the reason for choosing this value is shown in the stage 2 experiments). Comparing Fig. 1(a) and (b), we notice that as the size of the training corpus increases, the trends of the precision rate become more complex than those of the recall rate for the different orders of rules (especially for 4-order rules). In Fig. 1(a), as the corpus size increases, the precision rate of the lowest order rules increases very slowly but remains the highest; the precision of the 3-order rules increases faster than that of the 2-order rules and tends to approach it. Compared with the lower order rules, the precision trend of the 4-order rules is not clear at this scale of corpus. In Fig. 1(b), while the recall rates of the other rule sets increase in proportion to the size of the training corpus, that of the 2-order rules steadily decreases. Based on the results in Fig. 1(a), we can conclude that, because of the limited state space, increasing the size of the training corpus can improve the total performance of higher order rules, but it is not so helpful in improving the total performance of 2-order rules.

Another interesting result that can be seen in Fig. 1 is that the recall rate in the smoothing case is clearly higher than the highest recall of the non-smoothing rule sets, but at the same time its precision rate tends toward the worst of the non-smoothing rule sets. So if we are concerned only about the total performance rather than the precision rate of the model, this smoothing method is still helpful.

The experiments in stage 2 concentrate on the relationship between the precision threshold β and model performance. Though β is defined as a precision threshold, the experimental results in Fig. 2 show that the precision rate does not always increase with β, especially in the zone close to 1; the recall rate, however, decreases quickly in this zone, which means that
a large number of the useful constraint rules contain uncertainty. Removing these rules would lead to the loss of much important information; on the other hand, admitting more uncertainty leads to a lower precision rate. Fig. 2(a) and (b) show that, for the best model performance, a value of about 0.95 for β is suitable unless we wish to emphasize either precision or recall. Fig. 2 also shows that changes in β have a more marked effect on lower order rules than on higher order rules.

Fig. 3. Comparison of the rough rule-based and word tri-gram based PTC conversion approaches, β = 0.95.

To compare with the tri-gram PTC conversion method, the error rate of conversion (ERC) is defined as the count of wrongly converted characters divided by the total number of converted characters (not including "leaks" and "rejections" in the case of rough rule-based PTC conversion, since in those two cases no decision is made). This metric is not meant to compare the overall performance of the tri-gram based and our proposed rule-based PTC conversion methods; our objective is to show the complementary capability of the two methods. Fig. 3 shows that, before the training size reaches 8.5 M characters, the error rate is greater when using rule smoothing than when using the baseline tri-gram model, and only in the latter case is the decrease in error rate apparent. On the other hand, a stable improvement in error rate is shown for the single-order rule sets. Stage 1 thus demonstrates that the greatest benefit of the smoothing method is the improvement in recall, so when the emphasis is on precision, for example when combining rough rules and tri-grams, a single order of rules is preferable.

In [24], a metric called "error correlation" is introduced, defined by the parameter
$$\phi = \frac{2}{K(K-1)} \sum_{i < j} p_{ij} \qquad (4)$$

where $K$ is the number of models being compared (in our case, $K = 2$); $y$ is a Pinyin variable; $c(y)$ represents the correct Chinese character into which the Pinyin $y$ should be converted; $\hat{c}_i(y)$ is the actual conversion result for $y$ provided by the $i$th model; and $p_{ij}$ is the probability of models $i$ and $j$ making the same PTC conversion errors. In our case, $p_{12}$ is computed as the number of Pinyin $y$ in the testing corpus that satisfy $\hat{c}_1(y) = \hat{c}_2(y) \neq c(y)$, divided by the total number of converted characters. It was shown in [24] that, when combining different models, a smaller value of $\phi$ indicates a better performance for the combined model.
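For the two-model case used here, the measure reduces to the fraction of Pinyin on which both models output the same wrong character; the following small sketch (ours, with invented data and names) makes the computation explicit.

```python
# Error correlation for K = 2 models: shared identical errors over all items.

def error_correlation(gold, pred_a, pred_b):
    same_error = sum(1 for g, a, b in zip(gold, pred_a, pred_b)
                     if a == b and a != g)
    return same_error / len(gold)

if __name__ == "__main__":
    gold    = ["c1", "c2", "c3", "c4"]
    rules   = ["c1", "c9", "c3", "c9"]   # rule-based output (toy)
    trigram = ["c1", "c9", "c8", "c4"]   # tri-gram output (toy)
    print(error_correlation(gold, rules, trigram))   # 0.25 (one shared error)
```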
We obtained an error correlation of 1.35% when applying the smoothing method to a training corpus of 8.5 M characters. This result shows that the degree of error correlation between the tri-gram and our rule-based models is very low. Under the same conditions, 73.8% of the characters that are wrongly converted by the rule-based method are converted correctly by the tri-gram method; in contrast, 34.4% of the characters wrongly converted by the tri-gram method are converted correctly by the rule-based method. These results encourage us to believe that gains in performance could be obtained by combining the two methods. As mentioned in Section I, a number of methods are available, including the maximum entropy method [8] and the transformation-based model [13], that could be applied to combine the two models.
V. CONCLUSION

In this paper, we report on our use of a rough set technique to address the problem of automatic Pinyin-to-character conversion and present algorithms for feature extraction, data generalization, and rule extraction. Compared with the classical statistical technique, this method has the following advantages.
1) More precise conversion. As can be seen in Fig. 2, the highest precision rate is 0.978, with a 0.54 recall rate. When the recall rate reaches 0.84, a precision rate of 0.947 is obtained by the smoothing method; the results are shown in Fig. 1.
2) Lower storage requirements. By using data generalization and rule extraction, rule set redundancy can be eliminated, greatly reducing the storage requirements of the rule database.
3) The capability of addressing the long-distance constraint problem. This capability is important not only in Pinyin-to-character conversion, but also in speech recognition, word sense disambiguation, and other areas of natural language processing.
One further problem remains to be addressed with the conversion method presented in this paper: increasing the recall rates of the rule sets. In addition to providing a larger training corpus, combining the rough set based approach with traditional statistical methods may be considered.

REFERENCES
[1] Z. Chen and K.-F. Lee, "A new statistical approach to Chinese Pinyin input," in Proc. ACL-2000, Hong Kong, 2000, pp. 241–247.
[2] G. Jianfeng, J. Goodman, M. Li, and K.-F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Trans. Asian Lang. Inform. Process., vol. 1, no. 1, pp. 3–33, 2002.
[3] X. Wang, D. Yeung, and X. Wang, "Chinese intelligent input method," in Proc. Int. Conf. Artificial Intelligence, Las Vegas, NV, 2000, pp. 1203–1208.
[4] M. Siu and M. Ostendorf, "Variable n-grams and extensions for conversational speech language modeling," IEEE Trans. Speech Audio Processing, vol. 8, pp. 63–75, Jan. 2000.
[5] R. Kuhn and R. de Mori, "A cache based natural language model for speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 570–583, 1999.
[6] J. Goodman, "A bit of progress in language modeling," Comput. Speech Lang., vol. 15, pp. 403–434, 2001.
[7] H. Ney, U. Essen, and R. Kneser, "On structuring probabilistic dependences in stochastic language modeling," Comput. Speech Lang., vol. 8, pp. 1–38, 1994.
[8] R. Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling," Comput. Speech Lang., vol. 10, pp. 187–228, 1996.
[9] B. Roark, "Probabilistic top-down parsing and language modeling," Computat. Linguist., vol. 27, no. 2, pp. 249–276, 2001.
[10] C. Chelba and F. Jelinek, "Structured language modeling," Comput. Speech Lang., vol. 14, no. 4, pp. 283–332, 2001.
[11] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," Computat. Linguist., vol. 8, pp. 467–479, 1992.
[12] T. R. Niesler and P. C. Woodland, "Variable-length category n-gram language models," Comput. Speech Lang., vol. 13, pp. 99–124, 1999.
[13] E. Brill, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging," Computat. Linguist., vol. 21, no. 4, pp. 543–565, 2001.
[14] Z. Pawlak, "Rough sets," Int. J. Comput. Inform. Sci., vol. 11, pp. 341–356, 1982.
[15] Z. Pawlak, "Rough set theory and its applications to data analysis," Cybern. Syst., vol. 29, pp. 661–668, 1998.
[16] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artif. Intell., vol. 106, pp. 109–137, 1998.
[17] Y. Yang, "An evaluation of statistical approaches to text categorization," Inform. Retrieval, vol. 1, no. 1–2, pp. 69–90, 1999.
[18] J. He, A. H. Tan, and C. L. Tan, "A comparative study on Chinese text categorization methods," in Proc. PRICAI 2000 Workshop on Text and Web Mining, Melbourne, Australia, 2000, pp. 24–35.
[19] J. W. Guan, D. Bell, and A. Bell, "Rough computational methods for information systems," Artif. Intell., vol. 105, pp. 77–103, 1998.
[20] X. Wang, Fu Guohong, et al., "Models and algorithms of Chinese word segmentation," in Proc. Int. Conf. Artificial Intelligence (ICAI'2000), Las Vegas, NV, 2000, pp. 1279–1284.
[21] X. Hu and N. Cercone, "Mining knowledge rules from databases: a rough set approach," in Proc. 12th Int. Conf. Data Engineering (ICDE'96), New Orleans, LA, 1996, pp. 96–105.
[22] S. F. Chen and J. T. Goodman, "An empirical study of smoothing techniques for language modeling," Comput. Speech Lang., vol. 13, pp. 359–394, 1999.
[23] Q.-D. Wang, X.-J. Wang, and X.-P. Wang, "Variable precision rough set model based dataset partition and association rule," in Proc. IEEE Int. Conf. Machine Learning and Cybernetics, vol. 4, 2002, pp. 2175–2179.
[24] K. M. Ali and M. J. Pazzani, "Error reduction through learning multiple descriptions," Mach. Learn., vol. 23, no. 3, pp. 173–202, 1996.
[25] Z. GuoDong and L. KimTeng, "Interpolation of n-grams and mutual-information based trigger pair language models for Mandarin speech recognition," Comput. Speech Lang., vol. 13, pp. 125–141, 1998.
Wang Xiaolong received the B.E. degree in computer science from the Harbin Institute of Electrical Technology, China, in 1982, the M.E. degree in computer architecture from Tianjin University, China, in 1984, and the Ph.D. degree in computer science and engineering from the Harbin Institute of Technology in 1989. He joined the Harbin Institute of Technology as an Assistant Lecturer in 1984 and became an Associate Professor in 1990. He was a Senior Research Fellow in the Department of Computing, The Hong Kong Polytechnic University, from 1998 to 2000. Currently, he is a Professor of Computer Science at the Harbin Institute of Technology. His research interests include artificial intelligence, machine learning, computational linguistics, and Chinese information processing.
Chen Qingcai received the B.E. degree in engineering mechanics and the M.E. degree in solid mechanics, both from the Harbin Institute of Technology, China, in 1996 and 1998, respectively. He is currently pursuing the Ph.D. degree in computer science at the Harbin Institute of Technology. His research interests include computational linguistics, text summarization, text mining, and soft computing.
Daniel S. Yeung (M'89–SM'99) received the Ph.D. degree in applied mathematics from Case Western Reserve University, Cleveland, OH, in 1974. He has previously been an Assistant Professor of mathematics and computer science at the Rochester Institute of Technology, a Research Scientist at the General Electric Corporate Research Center, and a System Integration Engineer at TRW. He was the Chairman of the Department of Computing, The Hong Kong Polytechnic University, for seven years, where he is now a Chair Professor. His current research interests include neural-network sensitivity analysis, data mining, Chinese computing, and fuzzy systems. Dr. Yeung was the President of the IEEE Hong Kong Computer Chapter (1991 and 1992) and an Associate Editor for both the IEEE TRANSACTIONS ON NEURAL NETWORKS and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B. He is also the Coordinating Chair (Cybernetics) of the IEEE SMC Society and the General Co-Chair of the 2002 and 2003 International Conferences on Machine Learning and Cybernetics.