
2006 IEEE International Conference on Systems, Man, and Cybernetics October 8-11, 2006, Taipei, Taiwan

Using Associate Constraint Network with Forward Evaluation to Overcome Cross-lingual Semantic Interoperability Challenge for Crime Information Extraction

Christopher C. Yang, Chih-Ping Wei, and Kar Wing Li

Abstract: Information extraction is important for crime analysis. Due to the popularity of the Web, information related to crime and terrorism is available in multiple languages. As a result, cross-lingual semantic interoperability is essential when we extract information across multiple languages. In our previous work, we developed several techniques to generate an automatic cross-lingual thesaurus to support cross-lingual information retrieval based on a parallel corpus collected from the Web. The techniques include the Hopfield network and the associate constraint network with backmarking. Although these techniques obtain satisfactory performance, they have weaknesses in efficiency, consistency, precision, or recall. In this work, we develop a new searching technique, namely forward evaluation, on the basis of our previously developed associate constraint network model. We have conducted an experiment and show that the proposed forward evaluation technique outperforms both the Hopfield network and the associate constraint network with backmarking in terms of precision and recall. In addition, it is more efficient than the Hopfield network but less efficient than the associate constraint network with backmarking.

I. INTRODUCTION

Cross-lingual semantic interoperability is an unavoidable problem when we search across languages to extract information for crime and terrorism analysis, especially as the number of crimes organized by international criminal organizations is increasing at a seemingly accelerating pace. Cross-border crime and terrorism take many forms. Illegal migration is facilitated by organized alien smuggling networks. Children and women are smuggled across borders for sexual exploitation and forced labor. Pollutants and dangerous chemicals such as toxic wastes are illegally exported to countries in Eastern and Central Europe, Asia, and Africa. The advance of the Internet also creates opportunities for international high-tech crimes and intellectual property rights violations. Terrorist organizations smuggle bomb-making material across borders. Information about these international crimes and terrorism is available in many different languages on the Web.

To address the cross-lingual semantic interoperability challenge, we generate a cross-lingual thesaurus automatically using techniques such as the Hopfield network and the associate constraint network. Based on statistical co-occurrence analysis of a parallel corpus, we derive the relevance weights between terms in any language. These relevance weights are taken as inputs to the Hopfield network or the associate constraint network to represent the associations between multilingual terms. In our previous work [16],[17],[19], given an input term in any language, the spreading activation technique for the Hopfield network or the backmarking technique for the associate constraint network is employed to generate the cross-lingual thesaurus.

A. Parallel Corpus

A parallel corpus available on the Web can be organized in one of three structures: parent page structure, sibling page structure, and monolingual sub-tree structure [5],[15],[18],[19]. Documents of different languages in a parallel corpus organized in the parent page structure or sibling page structure can be easily aligned using the hyperlinks. However, most parallel corpora are provided through the monolingual sub-tree structure, which means that hyperlinks are not available between the pairs of documents in different languages. In our previous work [8],[15], we developed an alignment technique that uses longest common subsequence and edit distance to align Web documents of the monolingual sub-tree structure in Language 1 (L1) and Language 2 (L2) as pairs of documents in the parallel corpus, as shown in Figure 1.
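The two string measures named above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text here does not say which strings are compared, so each document is given a hypothetical comparable key (for example a file name), and the scoring rule in `align` (longest common subsequence first, smaller edit distance as tie-break) is made up for the example rather than taken from [8],[15].

```python
from typing import Dict, List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def align(keys_l1: Dict[str, str], keys_l2: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each L1 document with the best-matching L2 document.

    Hypothetical scoring: larger LCS wins, smaller edit distance breaks ties.
    The pairing is greedy and does not enforce a one-to-one mapping.
    """
    pairs = []
    for doc1, key1 in keys_l1.items():
        best_doc2, _ = max(
            keys_l2.items(),
            key=lambda item: (lcs_length(key1, item[1]), -edit_distance(key1, item[1])),
        )
        pairs.append((doc1, best_doc2))
    return pairs


# Made-up document names and keys, for illustration only.
l1_docs = {"en/law_001.html": "law_001", "en/news_17.html": "news_17"}
l2_docs = {"zh/c_news_17.html": "c_news_17", "zh/c_law_001.html": "c_law_001"}
print(align(l1_docs, l2_docs))
```

A real aligner would likely also reject pairs whose score falls below some cutoff, so that documents without a counterpart stay unpaired.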

Figure 1. Document alignment to construct parallel corpus from the Web. (Diagram: Web documents in Language 1 (L1) and Language 2 (L2) are aligned into document pairs of L1 and L2.)

C. C. Yang is with the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong (corresponding author, e-mail: yang@se.cuhk.edu.hk). C. Wei is with the Institute of Technology Management at the National Tsing Hua University, Taiwan. K. W. Li is with the City University of Hong Kong.

1-4244-0100-3/06/$20.00 ©2006 IEEE

B. Co-occurrence Analysis

Given a parallel corpus collected from the Web, we extract terms from the documents in L1 and L2 and then conduct co-occurrence analysis to compute the relevance weights between all pairs of terms in both languages (Figure 2). The relevance weight $W_{ij}$ from term i to term j is computed as follows [16],[17]:

$$W_{ij} = \frac{\sum_{k=1}^{N} d_{kij}}{\sum_{k=1}^{N} d_{ki}}$$

where N is the number of document pairs in the parallel corpus,

$$d_{ki} = tf_{ki} \times \log\left(\frac{N}{df_i} \times l_i\right)$$

where $l_i$ is the length of term i, $tf_{ki}$ is the frequency of term i in document pair k, and $df_i$ is the document frequency of term i, and

$$d_{kij} = tf_{kij} \times \log\left(\frac{N}{df_{ij}}\right)$$

where $tf_{kij}$ is the minimum of $tf_{ki}$ and $tf_{kj}$, and $df_{ij}$ is the document frequency of term i and term j.

Figure 2. Generating cross-lingual thesaurus from parallel corpus. (Stages shown: aligned document pairs of L1 and L2, term extraction, co-occurrence analysis, and automatic cross-lingual thesaurus generation via Hopfield network or associate constraint network.)

C. Automatic Cross-lingual Thesaurus

The Hopfield network is the first technique that we developed to construct an automatic cross-lingual thesaurus [16]. The Hopfield network [9],[10],[13] models the associate network and transforms a noisy pattern into a stable state representation. When a term in L1 is taken as an input, the spreading activation process of the Hopfield network identifies other relevant terms in both L1 and L2 and gradually converges towards a set of highly associated terms [6],[7],[18]. However, the shortcomings of the Hopfield network are low efficiency and possible inconsistency. The convergence process is time consuming and, in some cases, may not converge at all, especially when the parameters are not tuned appropriately. The random processes in the Hopfield network also cause inconsistency, as results generated from different convergence processes may differ.

In another work [17], we investigated the use of the associate constraint network for constructing a cross-lingual thesaurus and developed the backmarking technique for identifying the solution tuple given the constraints defined in the associate constraint network. The performance was satisfactory, but there is still room for improvement. In this work, we develop a new technique, namely forward evaluation. Forward evaluation estimates the fitness of a partial solution tuple and determines whether it should be discarded or extended to search for the complete solution. In the next sections, we introduce the associate constraint network and empirically evaluate the effectiveness of the forward evaluation technique, using the backmarking technique as a performance benchmark.

II. ASSOCIATE CONSTRAINT NETWORKS

A. Associate Constraint Network Model

The associate constraint network simulates associative memory in the human brain and has been employed to construct a cross-lingual concept space [17]. It is an associate network of the terms extracted from a parallel corpus, with constraints imposed on the nodes of the network. Searching techniques are developed to find a feasible solution that satisfies the constraints in order to generate a cross-lingual thesaurus for an input term in any language.

A constraint satisfaction problem (CSP) is a problem composed of a finite set of variables, each of which is associated with a finite domain, and a set of constraints that restricts the values the variables can simultaneously take [14]. The task is to assign a value to each variable such that all the constraints are satisfied. A CSP can therefore be represented as a triple (V, D, C). V is a set of variables, {v1, v2, ..., vn}. D is a function that maps every variable in V to a set of possible values. C is a set of constraints on arbitrary subsets of variables in V, restricting the values that the variables can simultaneously take.

The construction of the cross-lingual thesaurus is modeled as a constraint satisfaction problem [12],[14], and the constraints are depicted by the associate constraint network. The nodes of an associate constraint network (x1, x2, ..., xn) represent the terms extracted from the parallel corpus. A node xj can be a term in L1 or L2. The values of the nodes are binary, xj ∈ {0, 1}: xj = 1 if xj is a term in the cross-lingual thesaurus for a given input term, and xj = 0 if it is not. The arcs of the associate network represent the associations between the extracted terms. For instance, a directed arc from xi to xj corresponds to the association from term i to term j. Wij is the label on the directed arc and corresponds to the relevance weight from term i to term j. The directed arcs between any two nodes are asymmetric. The constraint cj is applied on xj. A term xj is considered relevant to the other terms in the thesaurus if the sum of the associate weights from the other terms is sufficiently large; otherwise, it should not be considered relevant in the thesaurus. cj is given below.

$$x_j = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} W_{ij}\, x_i \geq \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$
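As a rough illustration of the constraint cj described in this section (xj should be 1 exactly when the summed relevance weights from the selected terms reach the threshold), the following Python sketch checks a binary assignment against a weight matrix. All names, the toy weights, and the threshold value are hypothetical. Skipping i = j follows the phrase "the other terms" and is an assumption, as is holding the input term fixed at 1. The sketch only tests constraints; it is not the backmarking or forward evaluation search.

```python
from typing import Collection, List, Sequence

Matrix = Sequence[Sequence[float]]


def support(weights: Matrix, x: Sequence[int], j: int) -> float:
    """Sum of W[i][j] * x[i] over the other terms i != j (assumed: no self-weight)."""
    return sum(weights[i][j] * x[i] for i in range(len(x)) if i != j)


def satisfies_constraint(weights: Matrix, x: Sequence[int], j: int, threshold: float) -> bool:
    """c_j holds when x[j] is 1 exactly if the support reaches the threshold."""
    expected = 1 if support(weights, x, j) >= threshold else 0
    return x[j] == expected


def violated(
    weights: Matrix,
    x: Sequence[int],
    threshold: float,
    fixed: Collection[int] = (),
) -> List[int]:
    """Indices of nodes whose constraint fails; nodes in `fixed` are not checked."""
    return [
        j
        for j in range(len(x))
        if j not in fixed and not satisfies_constraint(weights, x, j, threshold)
    ]


# Toy asymmetric weights W[i][j] from term i to term j (made-up numbers).
W = [
    [0.0, 0.8, 0.1],
    [0.6, 0.0, 0.2],
    [0.1, 0.1, 0.0],
]

# Node 0 plays the role of the input term and is held at 1.
print(violated(W, [1, 1, 0], threshold=0.5, fixed={0}))  # prints []
print(violated(W, [1, 0, 1], threshold=0.5, fixed={0}))  # prints [1, 2]
```

In the first assignment, term 1 receives enough support from term 0 to be included and term 2 does not, so every checked constraint holds. In the second, both checked nodes contradict their own support, which is the kind of tuple a search technique would have to reject or repair.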