Multilingual Hierarchy Generation and Alignment Using Self-Organizing Maps

Hsin-Chang Yang
Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan 811
Email: [email protected]

Chung-Hong Lee
Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
Email: [email protected]

Heng-Tzu Tsai
Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan 811

Abstract—Multilingual information retrieval has attracted much attention in recent years due to the explosive growth of multilingual Web pages. Documents written in a language other than that of the query cannot easily be retrieved unless the relationships among entities of the different languages have been established. In this work, we develop a method based on self-organizing maps to organize documents into hierarchies. Once the hierarchies of the individual languages have been developed, an alignment process is applied to them to discover the associations between entities of different languages. The method was applied to a set of parallel corpora containing Chinese and English documents and obtained promising results.

I. INTRODUCTION

Nowadays many existing documents are stored as Web pages. The number of Web pages has exceeded several billion and is still growing. Although these pages are expressed in standard description languages such as HTML or XML, their contents are written in many different languages, generally the native languages of the pages' authors. Meanwhile, most users tend to pose queries in their native language against Web pages that may be written in various languages. Only those pages written in, or containing, the same language as the query will be retrieved, unless some kind of translation or mapping is performed between the query and the Web pages. Such translation, however, is a difficult task. Users are thus forced to relinquish their familiar languages and use the one that suits the target pages. This is rather inconvenient for users with limited ability to express their needs in a non-native language.

To remedy this inconvenience, cross-lingual or multilingual information retrieval (MLIR) techniques have been developed over the past decades. Traditionally, machine translation schemes were adopted to translate queries and Web pages into the same language before performing MLIR. Unfortunately, there is still no well-recognized scheme that provides precise translation between two languages. A different approach is to match queries and documents directly, without a priori translation. This approach is also difficult, since it requires some measure of the semantic relatedness between queries and documents. Such semantic measures generally cannot be defined explicitly, even with human intervention. We therefore need an automated process to discover the

relationships between different languages. Such a process is often called multilingual text mining (MLTM). In this work, we develop an MLTM technique that creates multilingual hierarchies and aligns them automatically. Our method applies the self-organizing map (SOM) model [1] to cluster a set of parallel texts. We then develop a method to structure the trained maps into hierarchies. An alignment algorithm is also developed to align hierarchies of different languages according to their semantic relatedness. After the alignment, relationships between multilingual text documents as well as between multilingual linguistic terms can be discovered. We performed experiments on a set of Chinese-English parallel texts. The evaluation results suggest that our approach is plausible for constructing and aligning multilingual hierarchies, as well as for tackling MLIR tasks.

II. RELATED WORK

We briefly review articles in two related fields. In Sec. II-A we review work in MLTM; in Sec. II-B we discuss work on hierarchy construction and alignment.

A. Related work in multilingual text mining

Many multilingual text mining techniques, especially machine learning approaches, are based on comparable or parallel corpora, as is the case in this work. Chau and Yeh [2] generated fuzzy membership scores between Chinese and English terms and clustered them to perform MLIR. Lee and Yang [3] also used the SOM to train on a set of Chinese-English parallel corpora and generate two feature maps, from which the relationships between bilingual documents and terms were then discovered. Rauber et al. [4] applied the GHSOM to cluster multilingual corpora containing Russian, English, German, and French documents. However, they translated all documents into one language before training; thus they actually performed monolingual text mining.

B. Related work on hierarchy creation and alignment

In knowledge discovery research, we often wish to organize data, especially textual data, into hierarchies, since hierarchies are well perceived by humans. In an early work, Han and

Fu [5] developed a technique for refining existing concept hierarchies and generating new ones. However, their method can only be applied to numerical attributes. McCallum and Nigam [6] used a bootstrapping process to generate new terms from a set of human-provided keywords. This approach was later applied to portal site construction [7]; human intervention is still required in their work. Probabilistic methods have been widely used to exploit hierarchies. Weigend et al. [8] proposed a two-level architecture for text categorization. The first level of the architecture predicts the probabilities of meta-topic groups, which are groups of topics. This allows the individual models for each topic on the second level to focus on finer discrimination within the group. They used a supervised neural network to learn the hierarchy, where topic classes were provided and already assigned. A different probabilistic approach by Hofmann [9] used an unsupervised learning architecture, the Cluster-Abstraction Model, to organize groups of documents into a hierarchy. Liu and Yang [10] used the link information within a Web page to build a topic hierarchy of a Web site. Their method relies on the precise analysis and specification of link features, which is infeasible for ordinary texts. An approach close to our work was proposed by Chuang and Chien [11], which uses the results of search engines for feature extraction and applies a modified hierarchical agglomerative clustering algorithm to organize short text segments into shallow hierarchies. Their work resembles ours in two respects: first, a hierarchical clustering approach is used; second, it also creates shallow, multi-way hierarchies instead of binary ones. However, our method does not rely on the support of search engines for feature extraction. Moreover, our approach can simultaneously identify themes and organize documents using only the contents of the textual documents.

Alignment of concept hierarchies resembles the task of ontology mapping or alignment, since a concept hierarchy is similar to an ontology in both structure and function. Many works on ontology alignment or mapping have been proposed [12], [13], [14]. According to Noy and Stuckenschmidt's classification, ontology mapping discovery schemes can be based on a common reference ontology, lexical similarity, structural similarity, user input, external resources such as WordNet annotations, or prior matchings. Our method, however, cannot easily be placed in this classification; it may be considered a combined scheme using both lexical and structural similarities. Closely related to our method is the HICAL system [15], which generates alignment rules for concepts using a statistical method in a top-down manner. It first examines the similarity of the root concepts of distinct hierarchies and generates a rule if these concepts are similar. This process is recursively applied to lower-level concepts to obtain more specific rules. Their method is similar to ours with respect to the use of the data instances within the concepts of the hierarchies; however, the alignment algorithm is significantly different.

III. MULTILINGUAL HIERARCHY GENERATION

To obtain the hierarchies, we first perform a clustering process on a set of parallel documents. We then apply a hierarchy generation process to the clustering result to obtain the hierarchies. We start with the preprocessing steps.

A. Document preprocessing

In this work, we first used a common segmentation program to extract possible keywords from the English documents. Processing steps such as stopword elimination and stemming were then applied to reduce the number of keywords. The keywords of all documents were collected to build a vocabulary of English keywords, denoted VE. A document is encoded into a binary vector according to the keywords that occur in it: when a keyword occurs in the document, the corresponding element of the vector has value 1; otherwise, it has value 0. With this scheme, a document Ej is encoded into a binary vector Ej whose size (number of elements) is the size of the vocabulary VE, i.e. |VE|.

The segmentation of Chinese text is more difficult than that of English because Chinese contains no delimiters between words; consecutive characters must be separated into a set of words. There are several segmentation schemes for Chinese. We adopt the segmentation program developed by the CKIP team of Academia Sinica [16]. Unlike English, Chinese words generally require no stemming. Stopword elimination could be applied to Chinese as in English; however, no standard stopword list is available, and since we select only nouns as keywords, we omit this step. As in the English case, the selected keywords are collected to build the Chinese vocabulary VC. Each Chinese document Cj is encoded into a vector Cj in the same manner as in English; the size of Cj is the size of the vocabulary VC, i.e. |VC|.

B. Self-organizing map clustering and labeling

When the documents have been encoded into vectors as described in Sec. III-A, we apply the SOM training algorithm to train maps using these vectors. When the training process is finished, we perform a labeling process on the trained network to establish the association between each document and one of the neurons, as follows. The feature vector of document Di, xi, 1 ≤ i ≤ M, where M is the number of documents, is compared to the synaptic weight vectors of all neurons in the map. We label Di to the jth neuron if that neuron's synaptic weight vector is closest to xi. After the labeling process, each document is labeled to some neuron or, from a different point of view, each neuron is labeled by a set of documents. We record the labeling result and obtain the document cluster map (DCM), in which each neuron is labeled by a list of documents that are considered similar and belong to the same cluster. We also construct the keyword cluster map (KCM) by labeling each neuron in the trained network with certain keywords. This labeling is achieved by examining the neurons' synaptic weight vectors: for the weight vector wj of the jth neuron, if its nth component exceeds a predetermined threshold, the corresponding word, i.e. the nth word in the vocabulary, is labeled to that neuron. By virtue of the SOM algorithm, a neuron may be labeled by several keywords which often co-occur in a set of documents, making the neuron a keyword cluster.
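To make these two steps concrete, the following is a minimal Python/NumPy sketch of the binary encoding (Sec. III-A) and the DCM/KCM labeling (Sec. III-B). The three-word vocabulary, the random weight matrix W standing in for a trained map, and the threshold value are hypothetical placeholders, not values from this work; the SOM training itself is assumed to have been done already.

```python
import numpy as np

def encode(doc_keywords, vocab):
    """Encode a document as a binary vector over the vocabulary (Sec. III-A)."""
    v = np.zeros(len(vocab))
    for w in doc_keywords:
        if w in vocab:
            v[vocab[w]] = 1.0
    return v

# Hypothetical inputs: vocab maps keyword -> index; W is a trained SOM
# weight matrix of shape (n_neurons, |V|); docs is a list of keyword lists.
vocab = {"fiber": 0, "textile": 1, "trade": 2}
W = np.random.rand(100, len(vocab))           # stands in for a trained 10x10 map
docs = [["fiber", "textile"], ["trade"]]

# Document cluster map (DCM): label each document with its closest neuron.
dcm = {j: [] for j in range(W.shape[0])}
for i, d in enumerate(docs):
    x = encode(d, vocab)
    j = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    dcm[j].append(i)

# Keyword cluster map (KCM): label a neuron with every keyword whose
# corresponding weight component exceeds a predetermined threshold.
threshold = 0.9                               # assumed value
inv_vocab = {idx: w for w, idx in vocab.items()}
kcm = {j: [inv_vocab[n] for n in np.flatnonzero(W[j] > threshold)]
       for j in range(W.shape[0])}
```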

C. Hierarchy generation

To generate a hierarchy, we first cluster documents with the SOM using the method described in Sec. III-B to obtain the DCM and the KCM. We may then generate a cluster of similar clusters, or a super cluster, by aggregating neighboring neurons. This essentially creates a two-level hierarchy in which the parent node is the constructed super cluster and the child nodes are the clusters that compose it. The generation process can be further applied to every super cluster to establish the next level of the hierarchy; the overall hierarchy is established iteratively until a stop criterion is satisfied. We thus use a top-down approach to generate hierarchies based on the construction of super clusters.

To form a super cluster, we first define the distance between two clusters in the map:

$$D(i, j) = \|G_i - G_j\|, \qquad (1)$$

where $i$ and $j$ are the neuron indices of the two clusters and $G_i$ is the two-dimensional grid coordinate of neuron $i$ in the map. For a square formation of $J$ neurons, $G_i = (i \bmod \sqrt{J},\ i \operatorname{div} \sqrt{J})$. $\|G_i - G_j\|$ measures the Euclidean distance between the coordinates $G_i$ and $G_j$. We also define the dissimilarity between two clusters:

$$\bar{D}(i, j) = \|\mathbf{w}_i - \mathbf{w}_j\|, \qquad (2)$$

where $\mathbf{w}_i$ is the synaptic weight vector of neuron $i$. We now try to find significant clusters among all clusters to serve as super clusters. Since neighboring clusters in the map represent similar clusters, we can determine the significance of a cluster by examining the overall similarity contributed by its neighboring clusters. This similarity, namely the aggregated cluster similarity, is defined by

$$S_i = \frac{1}{|B_i|} \sum_{j \in B_i} \frac{\mathit{doc}(i)\,\mathit{doc}(j)}{F\bigl(D(i, j),\, \bar{D}(i, j)\bigr)}, \qquad (3)$$

where $\mathit{doc}(i)$ is the number of documents associated with neuron $i$, i.e. cluster $i$, in the DCM, and $B_i$ is the set of neuron indices in the neighborhood of neuron $i$. The function $F : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ is a monotonically increasing function of its arguments $D(i, j)$ and $\bar{D}(i, j)$.

The super clusters are developed from a set of dominated clusters in the map. A dominated cluster is a cluster whose aggregated cluster similarity is locally maximal. We may select all dominated clusters in the map by the following algorithm (see the sketch at the end of this section):

Step 1: Find the cluster with the largest aggregated cluster similarity among all clusters under consideration. This cluster is a dominated cluster.

Step 2: Eliminate its neighboring clusters so that they will not be considered as dominated clusters in subsequent steps.

Step 3: Stop if 1) no cluster is left, 2) the number of dominated clusters ND exceeds a predetermined value, or 3) the depth of the hierarchy generated so far exceeds a predetermined value σ. Otherwise, decrease the neighborhood size used in Step 2 and go to Step 1.

The child clusters of a super cluster are found by the following rule: cluster i belongs to super cluster k if D(i, k) < D(i, l) for every other super cluster l.

The above process creates a two-level hierarchy. In the following we show how to obtain the overall hierarchy. In the first application of the super cluster generation process (denoted STAGE-1), we obtain a set of super clusters. We aggregate these super clusters under one pseudo root node, forming the first and second levels of the hierarchy. To find the children of the super clusters obtained in STAGE-1, we apply the super cluster generation process to each super cluster (STAGE-2); notice that in STAGE-2 we only consider clusters that belong to the same super cluster. A set of child nodes is obtained and used as the third level of the hierarchy. The overall hierarchy is then revealed by recursively applying the same process to each newly found super cluster (STAGE-n). We decrease the neighborhood size used in selecting dominated clusters as the super cluster generation process proceeds.

In the generated hierarchy, a node represents an individual neuron, which is associated with a cluster of documents. We also assign labels to each node to exhibit the themes of the clusters: the labels of a node are the keywords associated with the same neuron in the KCM. In this regard, a node in the hierarchy is labeled by a document cluster as well as a keyword cluster. Nodes at higher levels should have more general coverage than those at lower levels. We will use these labels to align hierarchies, as discussed in the next section.
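The following sketch summarizes the super cluster construction under stated assumptions: the neighborhood sets are passed in precomputed (the shrinking of the neighborhood across iterations and the depth criterion σ are left to the caller), and since the paper specifies F only up to monotonicity, any monotonically increasing function of D and D̄ may be substituted. All helper names are ours.

```python
import numpy as np

def grid_coord(i, J):
    """G_i for a square map of J neurons: (i mod sqrt(J), i div sqrt(J))."""
    side = int(np.sqrt(J))
    return np.array([i % side, i // side], dtype=float)

def aggregated_similarity(i, neighbors, doc, D, D_bar, F):
    """Eq. (3): average doc(i)doc(j)/F(D(i,j), D_bar(i,j)) over B_i."""
    B = neighbors[i]                              # neighbors[i] is a set
    return sum(doc[i] * doc[j] / F(D(i, j), D_bar(i, j)) for j in B) / len(B)

def select_dominated_clusters(S, neighbors, max_clusters):
    """Steps 1-3: greedily pick clusters with locally maximal similarity."""
    candidates = set(range(len(S)))
    dominated = []
    while candidates and len(dominated) < max_clusters:
        i = max(candidates, key=lambda c: S[c])   # Step 1: largest S_i
        dominated.append(i)
        candidates -= neighbors[i] | {i}          # Step 2: drop its neighbors
    return dominated

def assign_children(clusters, dominated, J):
    """A cluster joins the super cluster k minimizing grid distance D(i, k)."""
    dist = lambda i, k: np.linalg.norm(grid_coord(i, J) - grid_coord(k, J))
    return {i: min(dominated, key=lambda k: dist(i, k)) for i in clusters}
```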

IV. MULTILINGUAL HIERARCHY ALIGNMENT

In this section we present a method to align two hierarchies generated by the method of Sec. III-C. Let H1 = {V1, E1} and H2 = {V2, E2} be the hierarchies generated from the documents of two different languages, where V1 and E1 denote the sets of vertices (nodes) and edges, respectively, of H1; likewise, V2 and E2 denote the sets of vertices and edges, respectively, of H2. We first introduce an alignment method that maps nodes between H1 and H2, i.e. a mapping M : V1 → V2 will be derived.

A. Hierarchy alignment

We want to map a node Ck ∈ V1 to some node El ∈ V2. Two nodes are considered related if they have similar themes, and the theme of a node can be determined by the documents labeled to it. We can therefore associate two nodes according to their corresponding document clusters. Since we use parallel corpora to train the SOM, the correspondence between a document and its counterpart in the other language is known a priori. We use these correspondences to associate document clusters of different languages.

We use a voting scheme to calculate the likelihood of associating Ck in H1 with El in H2. For each pair of documents Ci and Cj in Ck, we find the nodes to which their counterparts Ei and Ej are labeled in H2; let these nodes be Ep and Eq, respectively. The shortest path between them in H2 is also found. We then add a score of 1 to both Ep and Eq. Meanwhile, a score of $\frac{1}{\mathit{dist}(C_i, C_j) - 1}$ is added to every other cluster on this path, where $\mathit{dist}(C_i, C_j)$ returns the length of the shortest path between Ei and Ej in H2. The same scheme is applied to all pairs of documents in Ck, and the cumulative scores of all nodes in H2 are calculated. We associate Ck with the node El that has the highest score. When there is a tie, we accumulate the largest scores of the tied nodes' adjacent nodes to break it. If the tie persists, an arbitrary selection can be made; a sensible choice is the node in the same or nearest layer as Ck, since it is likely to have a similar coverage of themes. The above process thus defines the mapping M between H1 and H2; a sketch is given at the end of this section.

B. Association discovery

After establishing the mapping M from H1 to H2, the associations between documents in different languages are defined by this mapping. A document Ci in L1 is associated with a document Ej in L2 if their corresponding nodes in H1 and H2, respectively, are associated. Likewise, keyword associations are created according to the discovered mapping: a keyword labeled to node k in H1 is associated with a keyword labeled to node l in H2 if Ck and El are associated. Associations between documents and keywords, in either language, can also be defined by the mapping M. When Ck is associated with El, all documents and keywords labeled to these two nodes are associated. This includes associations between a document in L1 and a keyword in L1, between a document in L2 and a keyword in L2, between a document in L1 and a keyword in L2, and between a document in L2 and a keyword in L1.
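A compact sketch of the voting scheme, assuming each hierarchy is stored as a child-to-parent dictionary so that shortest paths in the tree can be found via the lowest common ancestor. The names tree_path, label_of, and align_node are ours, and the tie-breaking refinement is omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations

def tree_path(u, v, parent):
    """Shortest path between two nodes of a tree (child -> parent map)."""
    up = [u]
    while up[-1] in parent:                       # climb from u to the root
        up.append(parent[up[-1]])
    anc = set(up)
    vp = [v]
    while vp[-1] not in anc:                      # climb from v to u's chain
        vp.append(parent[vp[-1]])
    lca = vp[-1]                                  # lowest common ancestor
    return up[:up.index(lca)] + vp[::-1]          # u ... lca ... v

def align_node(Ck_docs, label_of, parent2, nodes2):
    """Score every node of H2 for association with cluster C_k (Sec. IV-A).
    Ck_docs: documents labeled to C_k in H1; label_of[d]: node of H2 that
    holds the parallel counterpart of document d."""
    score = defaultdict(float)
    for ci, cj in combinations(Ck_docs, 2):
        path = tree_path(label_of[ci], label_of[cj], parent2)
        score[path[0]] += 1.0                     # endpoints E_p and E_q get 1
        score[path[-1]] += 1.0
        dist = len(path) - 1                      # path length in edges
        for n in path[1:-1]:                      # intermediate nodes get
            score[n] += 1.0 / (dist - 1)          # 1/(dist - 1)
    return max(nodes2, key=lambda n: score[n])    # E_l with the highest score
```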

V. EXPERIMENTAL RESULTS

We constructed the bilingual parallel corpora by collecting parallel documents from the Sinorama corpus, which contains segments of bilingual articles from Sinorama magazine; each document is a segment of an article. The corpus contains 976 parallel documents. This is a relatively small corpus that we use mainly for illustration. Each Chinese document was segmented into a set of keywords through the segmentation program developed by the CKIP team of Academia Sinica. The program is also a part-of-speech tagger; we selected only nouns and discarded stopwords. As a result, we obtained a Chinese vocabulary of size 3436. For the English documents, a common segmentation program and part-of-speech tagger were used to convert them into keywords. Stopwords were also removed, and Porter's stemming algorithm was used to obtain the stems of the English keywords. Finally, we obtained an English vocabulary of size 3711. These vocabularies were then used to convert each document into a vector, as described in Sec. III-A.

To train on the corpus, we constructed a self-organizing map containing 100 neurons in a 10 × 10 grid. Each neuron in the map contains 3436 and 3711 synapses for training the Chinese and the English documents, respectively. After training, we labeled the map with documents and keywords by the methods described in Sec. III-B and obtained the DCMs and the KCMs for both languages.
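For reference, a minimal SOM training loop of the kind used here is sketched below. The paper does not report its training parameters, so the learning rate, neighborhood schedule, and iteration count are generic assumptions; any standard SOM implementation could be used instead.

```python
import numpy as np

def train_som(X, side=10, iters=5000, lr0=0.5, seed=0):
    """Minimal SOM training loop for a side x side map (cf. Kohonen [1])."""
    rng = np.random.default_rng(seed)
    J = side * side
    W = rng.random((J, X.shape[1]))               # one weight vector per neuron
    grid = np.array([[i % side, i // side] for i in range(J)], dtype=float)
    sigma0 = side / 2.0
    for t in range(iters):
        frac = 1.0 - t / iters
        lr, sigma = lr0 * frac, sigma0 * frac + 1e-3   # decaying schedules
        x = X[rng.integers(len(X))]               # random training vector
        b = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching unit
        h = np.exp(-np.sum((grid - grid[b]) ** 2, axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)            # pull neighborhood toward x
    return W

# e.g. W_zh = train_som(X_zh)   # X_zh: a 976 x 3436 binary document matrix
```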

After the clustering process, we applied the hierarchy generation process to the DCMs to obtain the hierarchies. In our experiments we limited the number of dominated clusters to 10 and the depth of the hierarchies to 3. In Figure 1 we show part of the Chinese hierarchy developed from the corpus. Each leaf node in a hierarchy represents a cluster in the corresponding DCM. The parent node of the child nodes in level n of a hierarchy represents a super cluster found in STAGE-(n−1). In this example, the root node is one of the six dominated clusters found in STAGE-1. This node has 4 children, which are the 4 dominated clusters obtained in STAGE-2; these child nodes comprise the third level of the hierarchy. Likewise, the child nodes of the third-level nodes are the dominated clusters found after STAGE-3. The keywords in the KCM are used to label every node in the hierarchies. We merge the keywords belonging to clusters that are not dominated clusters into their nearest dominated clusters, as described in Sec. III-C. It is clear in this example that the generated hierarchy comprises similar clusters with related keywords. We omit the rest of the hierarchies due to space limitations.

A. Hierarchy alignment

After generating the bilingual hierarchies, we applied the hierarchy alignment algorithm described in Sec. IV-A. The quality of the alignment is measured by the number of parallel documents in each pair of associated clusters. For example, let the Chinese cluster Ck, containing documents C1, C3, and C5, be associated with the English cluster El, containing documents E1, E2, E3, and E4, where Ci and Ei denote a pair of parallel documents. The number of parallel documents in this pair of clusters is then 2, namely C1/E1 and C3/E3. The total number of such parallel document pairs over all associated clusters is summed and divided by the total number of parallel document pairs in the training set, as sketched below. We obtained ratios of 0.63 and 0.58 for Corpus-1 and Corpus-2, respectively; that is, about 60% of the parallel documents fall in associated clusters.
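This ratio is straightforward to compute once the cluster association M and both DCMs are available. A sketch, under the assumption that a parallel pair shares the same document id in both DCMs; the function and argument names are ours:

```python
def parallel_pair_ratio(assoc, dcm_c, dcm_e, n_pairs):
    """Fraction of parallel pairs (C_i, E_i) landing in associated clusters.
    assoc: Chinese node -> associated English node (the mapping M);
    dcm_c, dcm_e: node -> set of document ids labeled to that node."""
    hits = sum(len(dcm_c[k] & dcm_e[l]) for k, l in assoc.items())
    return hits / n_pairs

# e.g. parallel_pair_ratio(M, dcm_zh, dcm_en, 976)  # ~0.63 reported above
```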

[Figure 1: a tree diagram of one generated Chinese hierarchy. The Chinese node labels were lost in extraction, leaving only their parenthesized English translations, e.g. (mainland China), (Taiwan), (Taiwanese businessman), (Dongguan), (fiber), (textile), (South Africa), and (Palau).]

Fig. 1. One of the generated Chinese hierarchies. English translations are enclosed in parentheses for reference.

VI. CONCLUSIONS

In this work, we proposed an automatic method to generate and align multilingual hierarchies. Our method applies the self-organizing map model to cluster bilingual documents, creating two feature maps, one for each language. A hierarchy generation process is then applied to these maps to create bilingual hierarchies, which are aligned according to the relatedness of their nodes. The associations between nodes are determined by the documents associated with them. We conducted experiments on two sets of corpora and obtained promising results.

ACKNOWLEDGMENT

This work was supported by the National Science Council under grant NSC 97-2221-E-390-017.

REFERENCES

[1] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 2001.
[2] R. Chau and C. H. Yeh, "A multilingual text mining approach to web cross-lingual text retrieval," Knowledge-Based Systems, vol. 17, no. 5-6, pp. 219–227, 2004.

[3] C. H. Lee and H. C. Yang, "A multilingual text mining approach based on self-organizing maps," Applied Intelligence, vol. 18, no. 3, pp. 295–310, 2003.
[4] A. Rauber, D. Merkl, and M. Dittenbach, "The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331–1341, 2002.
[5] J. Han and Y. Fu, "Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases," in Proceedings of the AAAI'94 Workshop on Knowledge Discovery in Databases. AAAI Press, 1994, pp. 157–168.
[6] A. McCallum and K. Nigam, "Text classification by bootstrapping with keywords, EM and shrinkage," in Proceedings of the ACL'99 Workshop for Unsupervised Learning in Natural Language Processing. Morgan Kaufmann, 1999, pp. 52–58.
[7] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
[8] A. S. Weigend, E. D. Wiener, and J. O. Pedersen, "Exploiting hierarchy in text categorization," Information Retrieval, vol. 1, no. 3, pp. 193–216, 1999.
[9] T. Hofmann, "The cluster-abstraction model: unsupervised learning of topic hierarchies from text data," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1999, pp. 682–687.
[10] N. Liu and C. C. Yang, "A link classification based approach to website topic hierarchy generation," in Proceedings of WWW 2007. Association for Computing Machinery, 2007, pp. 1127–1128.
[11] S. L. Chuang and L. F. Chien, "Taxonomy generation for text segments: A practical web-based approach," ACM Transactions on Information Systems, vol. 23, no. 4, pp. 363–396, 2005.
[12] N. F. Noy and H. Stuckenschmidt, "Ontology alignment: an annotated bibliography," in Semantic Interoperability and Integration, Y. Kalfoglou, W. M. Schorlemmer, A. P. Sheth, S. Staab, and M. Uschold, Eds. Schloss Dagstuhl, Germany: IBFI, 2005.
[13] Y. Kalfoglou and W. M. Schorlemmer, "Ontology mapping: The state of the art," in Semantic Interoperability and Integration, Y. Kalfoglou, W. M. Schorlemmer, A. P. Sheth, S. Staab, and M. Uschold, Eds. Schloss Dagstuhl, Germany: IBFI, 2005.
[14] N. Choi, I. Y. Song, and H. Han, "A survey on ontology mapping," SIGMOD Record, vol. 35, no. 3, pp. 34–41, 2006.
[15] R. Ichise, H. Takeda, and S. Honiden, "Rule induction for concept hierarchy alignment," in Proceedings of the Workshop on Ontology Learning at the 17th International Joint Conference on Artificial Intelligence (IJCAI). CEUR-WS.org, 2001.
[16] K. J. Chen and M. H. Bai, "Unknown word detection for Chinese by a corpus-based learning method," International Journal of Computational Linguistics and Chinese Language Processing, vol. 3, no. 1, pp. 27–44, 1998.