Multilingual Information Retrieval using GHSOM Hsin-Chang Yang National University of Kaohsiung Department of Information Management Kaohsiung, Taiwan, ROC
[email protected]
Chung-Hong Lee National Kaohsiung University of Applied Sciences Department of Electrical Engineering Kaohsiung, Taiwan, ROC
[email protected]
Abstract Web pages nowadays are written in various languages, including English, Chinese, and Spanish. There is an increasing need to search Web pages of different languages using a single query. This task is called multilingual information retrieval (MLIR). However, MLIR is difficult to achieve because some method is needed to find the associations between linguistic elements of different languages. In this work, we provide a method based on the Growing Hierarchical Self-Organizing Map (GHSOM) to discover the associations between different languages and apply it to the MLIR task. Experiments show that our method provides a promising approach to the MLIR task.
1. Introduction
Most Internet users rely on search engines to find documents on the Web. Unfortunately, most search engines provide only a monolingual search interface, i.e., the queries and the target documents must be written in the same language, mostly English. There is an increasing need to search documents written in a language other than that of the query. Users must translate their queries into another language to work around this limitation of traditional search engines. However, such translations are often ambiguous and imprecise, especially for untrained users. It would be convenient for these users to express their queries in a familiar language and search documents in other languages. Cross-lingual or multilingual information retrieval (MLIR) techniques meet this need. The task of multilingual information retrieval aims to break the language barrier between queries and target documents: a user can express her need in one language and search documents of another language. However, this task is not easy since we have to relate one language to the other. One straightforward way is to translate one language into the other through some machine translation scheme. Unfortunately, there is still no well-recognized scheme that provides precise translation between two languages. A different approach is to match queries and documents directly, without a priori translation. This approach is also difficult since we need some measurement of the semantic relatedness between queries and documents. Such semantic measurements generally cannot be defined explicitly, even with human intervention. Thus we need an automated process to discover the relationships between different languages; such a process is often called multilingual text mining (MLTM). In this work, we develop an MLTM technique and apply it to MLIR. Our method applies GHSOM to cluster and structure a set of parallel texts. We adopt GHSOM because it can effectively construct a hierarchical structure over the documents. We then develop algorithms to discover the relationships between different languages based on this hierarchical structure, and use these relationships for the MLIR task. We perform experiments on a set of Chinese-English parallel texts. Evaluation results suggest that our approach is plausible for tackling MLIR tasks.
2. Related Work
The aim of MLIR is to provide users with a way to search documents written in a language different from that of the query. Query translation thus plays a central role in MLIR research. Three strategies for query translation have been used [10], namely dictionary-based [1], thesaurus-based [4], and corpus-based methods. For dictionary-based approaches, Hull and Grefenstette [7] performed experiments using a dictionary-based approach without ambiguity analysis. Their experiments showed precisions of 0.393 and 0.235 for retrieving monolingual and multilingual texts, respectively. These figures, especially for the multilingual case, are inadequate for most retrieval tasks. In another experiment, Davis [6] reported precisions of 0.2895 and 0.1422 for monolingual and multilingual retrieval, respectively; the performance of multilingual retrieval is only about half that of the monolingual case.
For thesaurus-based approaches, Peters and Picchi [11] indicated that MLIR could be achieved through a precise mapping between thesauri of different languages. Salton [14] also suggested that the performance of MLIR systems could resemble that of monolingual ones, provided a correct multilingual thesaurus is established. However, the difficulty of thesaurus-based approaches lies in the construction of the thesaurus, which requires much human effort and time. Moreover, a precise and complete thesaurus is difficult to construct, and the mapping between different thesauri is not obvious and needs thorough investigation.
For corpus-based approaches, many multilingual text mining techniques rely on comparable or parallel corpora. Chau and Yeh [3] generated fuzzy membership scores between Chinese and English terms and clustered them to perform MLIR. Lee and Yang [8] used SOM to train on a set of Chinese-English parallel corpora and generate two feature maps, from which the relationships between bilingual documents and terms are discovered. Rauber et al. [13] applied GHSOM to cluster multilingual corpora containing Russian, English, German, and French documents; however, they translated all documents into one language before training, so they actually performed monolingual text mining.
Rauber et al. proposed GHSOM [13], which can expand its maps and develop a hierarchical structure dynamically. GHSOM grows a SOM when the current map cannot represent the data well; when a cluster contains too much incoherent data, it expands another layer to divide the data into finer clusters. The depth of the hierarchy and the size of each SOM are determined dynamically. These properties make GHSOM a good choice for document clustering and exploration. GHSOM has been applied to various fields such as expertise management [16] and failure detection [9].
To date, the only application of GHSOM to MLIR has been by the GHSOM team itself; however, as noted above, what they did is actually monolingual text mining applied to MLIR.
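As background, the two growth decisions of GHSOM (growing a map horizontally and expanding a unit into a child map) can be sketched as follows. The threshold parameters `tau1` and `tau2` and the mean-quantization-error comparisons follow the usual formulation in [13], but the function names are our own illustrative assumptions:

```python
# Sketch of the two GHSOM growth tests (after Rauber et al. [13]).
# mqe_map    : mean quantization error of the current map
# mqe_parent : quantization error of the unit this map expands
# mqe_unit   : quantization error of a single unit
# mqe_root   : quantization error over the whole data set
# tau1, tau2 : user-chosen growth thresholds

def should_grow_map(mqe_map: float, mqe_parent: float, tau1: float) -> bool:
    """Insert a new row/column while the map still represents its
    data worse than tau1 times the parent unit's error."""
    return mqe_map > tau1 * mqe_parent

def should_expand_unit(mqe_unit: float, mqe_root: float, tau2: float) -> bool:
    """Spawn a child SOM below a unit whose data is still too
    incoherent relative to the overall error."""
    return mqe_unit > tau2 * mqe_root
```

Smaller `tau1` yields larger flat maps; smaller `tau2` yields deeper hierarchies.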
3. Document processing and clustering by GHSOM
A document should be converted into a proper form before GHSOM training. Here the proper form is a vector that captures the essential (semantic) meaning of the document. In this work we adopt bilingual corpora that contain Chinese and English documents. The encoding of English documents into vectors has been well addressed. We first use a common segmentation program to segment possible keywords. A part-of-speech tagger is then applied to these keywords so that only nouns are selected. The selected keywords may still contain stopwords that carry trivial meanings; such stopwords are removed to reduce the number of keywords. A stemming process is also applied to convert each keyword to its stem (root), further reducing the number of keywords. After these processing steps, we obtain a set of keywords that should be representative of the document. The keywords of all documents are collected to build a vocabulary of English keywords, denoted VE. A document is encoded into a vector according to the keywords that occur in it: when a keyword occurs in the document, the corresponding element of the vector has a value greater than 0; otherwise, the element is 0. In practice, the value is the traditional tf · idf (term frequency times inverse document frequency) value, where tf counts the frequency of the keyword in the document and idf discounts keywords that occur in many documents. With this scheme, a document Ej is encoded into a vector Ej whose size (number of elements) is the size of the vocabulary VE, i.e. |VE|.
The segmentation of Chinese words is more difficult than that of English words because consecutive characters must first be separated into a set of words. We adopt the segmentation program developed by the CKIP team of Academia Sinica [5]. We omit stopword elimination and stemming here since we select only nouns as keywords. As in the English case, the selected keywords are collected to build the Chinese vocabulary VC. Each Chinese document Cj is encoded into a vector Cj in the same manner; the size of Cj is |VC|. After encoding the documents into vectors, we use the Ej and the Cj individually to train GHSOM. Two hierarchies are constructed after training; these hierarchies are then used to obtain the associations between the languages, as discussed in the following section.
4. Finding associations of multilingual documents
In this section we develop methods to discover various types of associations, namely document associations, keyword associations, and document-keyword associations, from the constructed hierarchies.
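The tf · idf document encoding described in Sec. 3 can be sketched as follows. Keyword extraction (tagging, stopword removal, stemming, or CKIP segmentation) is assumed to have already produced a keyword list per document, and the unsmoothed idf form used here is one common choice rather than necessarily the exact variant used:

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Collect every keyword of every document into a fixed-order vocabulary."""
    vocab = sorted({kw for doc in docs for kw in doc})
    return {kw: i for i, kw in enumerate(vocab)}

def tfidf_vectors(docs):
    """Encode each document (a list of keywords) as a |V|-dimensional
    tf*idf vector, as in the encoding of Sec. 3."""
    vocab = build_vocabulary(docs)
    n = len(docs)
    # document frequency: in how many documents each keyword occurs
    df = Counter(kw for doc in docs for kw in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [0.0] * len(vocab)
        for kw, freq in tf.items():
            vec[vocab[kw]] = freq * math.log(n / df[kw])
        vectors.append(vec)
    return vocab, vectors
```

A keyword occurring in every document gets idf = log(1) = 0, so it contributes nothing to the vectors, which is the intended discounting behaviour.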
4.1. Document and keyword associations
To find the associations between keywords of different languages, we should map a neuron in one hierarchy to some neuron(s) in the other hierarchy. Since GHSOM labels a set of keywords on each neuron, a neuron forms a keyword cluster as mentioned above. Therefore, what we need is a way to associate a Chinese keyword cluster with an English keyword cluster. A Chinese keyword cluster is considered related to an English one if they represent the same theme. Meanwhile, the theme of a keyword cluster can be determined by the documents labelled to the same neuron. Thus we can associate two keyword clusters through their corresponding document clusters. Since we use parallel corpora to train the GHSOMs, the correspondence between a Chinese document and an English document is known a priori. We use such correspondences to associate document clusters of different languages. To associate a Chinese cluster Ck with some English cluster El, we use a voting scheme to calculate the likelihood of the association. For each pair of Chinese documents Ci and Cj in Ck, we find the clusters to which their English counterparts Ei and Ej are labelled in the English hierarchy. Let these clusters be Ep and Eq, respectively. The shortest path between them in the English hierarchy is also found. A score of 1 is added to both Ep and Eq, and a score of 1/dist(Ci, Cj) is added to every other cluster on this path, where dist(Ci, Cj) is the length of the shortest path between Ep and Eq in the English hierarchy. The same scheme is applied to all pairs of documents in Ck, and the overall scores of all clusters in the English hierarchy are accumulated. We associate Ck with the English cluster El that has the highest score. When there is a tie, we accumulate the largest score of the adjacent clusters to break it. If a tie still remains, an arbitrary selection can be made; a reasonable choice is the cluster in the same or nearest layer as Ck, since it may have a similar coverage of themes.
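A minimal sketch of this voting scheme, assuming a hypothetical `shortest_path` helper that returns the list of English clusters on the shortest path between two clusters (endpoints included); all names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def vote_for_english_cluster(chinese_cluster, counterpart_cluster, shortest_path):
    """Score every English cluster by the voting scheme of Sec. 4.1.

    chinese_cluster     : Chinese document ids labelled to cluster Ck
    counterpart_cluster : maps a Chinese doc id to the English cluster
                          its parallel counterpart is labelled to
    shortest_path       : returns the clusters on the shortest path
                          between two English clusters, inclusive
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_cluster, 2):
        path = shortest_path(counterpart_cluster[ci], counterpart_cluster[cj])
        ep, eq = path[0], path[-1]
        scores[ep] += 1.0           # endpoints Ep and Eq each get 1
        scores[eq] += 1.0
        dist = len(path) - 1        # path length in edges
        for cluster in path[1:-1]:  # clusters strictly between Ep and Eq
            scores[cluster] += 1.0 / dist
    # associate Ck with the highest-scoring English cluster
    return max(scores, key=scores.get)
```

The tie-breaking rules (adjacent-cluster scores, then nearest-layer preference) are omitted here for brevity.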
The associations between documents can then be defined from these cluster associations: a Chinese document Ci is associated with an English document Ej if their corresponding clusters are associated. Likewise, keyword associations are created according to the found cluster associations: a Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated.
4.2. Document-keyword associations
Once the cluster associations have been found as described above, associations between documents and keywords, in either language, can be easily defined. When Ck is associated with El, all documents and keywords labelled to these two neurons are associated. This includes associations between a Chinese document and a Chinese keyword, between an English document and an English keyword, between a Chinese document and an English keyword, and between an English document and a Chinese keyword.
4.3. MLIR application
When a query is submitted, it is first preprocessed into a set of keywords Q as described in Sec. 3. The documents associated with each query keyword q ∈ Q are then retrieved according to the document-keyword associations found in Sec. 4.2. Both Chinese and English documents are retrieved; when necessary, we may also retrieve documents of one language only. The ranking mechanism is as follows. A training vector contains elements for either Chinese or English keywords, so an element of a neuron's synaptic weight vector corresponds to a Chinese or an English keyword. For a Chinese query keyword q, its corresponding element in the synaptic weight vector is obtained first; the value of this element is then used to calculate the ranking scores of documents. The ranking score of a document Dj is composed of two components. The first is the cluster score, SC(q, Dj), which measures the importance of the cluster that Dj belongs to. The other is the keyword score, SK(q, Dj), which measures the importance of q in document Dj. SC(q, Dj) is defined as:

SC(q, Dj) = 1 / (∆(Eq, EDj) + 1),    (1)
where Eq is the English cluster associated with Cq, the Chinese cluster with which q is associated; EDj is the document cluster of Dj in the English hierarchy; and ∆(Eq, EDj) is the length of the shortest path between Eq and EDj. SK(q, Dj) is simply the value of the element corresponding to q in the document vector of Dj. The ranking score of Dj in response to q is thus defined as:

SR(q, Dj) = SC(q, Dj) · SK(q, Dj).    (2)
The ranking score of a Chinese document in response to an English query keyword is calculated in the same way, with the languages of the query and the document exchanged.
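The two scores of Eqs. (1) and (2) can be sketched directly; here `delta` stands for the shortest-path length ∆ between the two clusters and `keyword_weight` for the tf·idf element of q in the document vector of Dj, both assumed to be computed elsewhere:

```python
def cluster_score(delta: int) -> float:
    """S_C(q, Dj) = 1 / (Delta + 1), Eq. (1), where delta is the
    shortest-path length between the query's associated English
    cluster Eq and the document's cluster EDj."""
    return 1.0 / (delta + 1)

def ranking_score(delta: int, keyword_weight: float) -> float:
    """S_R(q, Dj) = S_C(q, Dj) * S_K(q, Dj), Eq. (2), where
    keyword_weight is the tf*idf value of q in Dj's document vector."""
    return cluster_score(delta) * keyword_weight
```

A document in the cluster associated with the query (delta = 0) keeps its full keyword weight; the score decays with cluster distance.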
5. Experimental results
We constructed the parallel corpora by collecting parallel documents from the Sinorama corpus, which contains segments of bilingual articles from Sinorama magazine. Each document is a segment of an article; the corpus contains 10,672 parallel documents. Each Chinese document was segmented into a set of keywords through the segmentation program developed by the CKIP team of Academia Sinica, which also performs part-of-speech tagging. We selected only nouns and discarded stopwords, resulting in a vocabulary of size 12,941. For the English documents, a common segmentation program and part-of-speech tagger were used to extract keywords. Stopwords were also removed, and Porter's stemming algorithm was applied.