2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2010) August 29 – September 1, 2010, Kittilä, Finland
BILINGUAL SENTENCE MATCHING USING KERNEL CCA

Abhishek Tripathi
University of Helsinki, Department of Computer Science, P.O. Box 68, 00014 UH, Finland

Arto Klami, Sami Virpioja
Aalto University School of Science and Technology, Department of Information and Computer Science, P.O. Box 15400, FI-00076 Aalto, Finland
ABSTRACT

The problem of matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm based on the statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider the task of sentence-level alignment of a parallel corpus based on monolingual data. Multilingual text collections with sentence-level alignment are required by statistical machine translation methods. We show how statistical dependencies between the feature representations of partially aligned (e.g., paragraph-level aligned) corpora can be used to learn the sentence-level alignment in a data-driven way. Our novel matching algorithm based on Kernel Canonical Correlation Analysis (KCCA) outperforms an earlier algorithm using linear CCA.

1. INTRODUCTION

We study the abstract problem of matching the samples of two collections of unordered data items, so that each sample in one collection is matched with one in the other. It is assumed that there is a single true matching of the items, but that it is unknown. The task is to learn the matching in a data-driven way, that is, based on measurement vectors attached to each item in both collections. Such matching is useful in various applications involving measurement sources that are not directly commensurable. Practical examples include matching gene activity measurements obtained with different brands of microarrays, finding the correspondence of metabolites between two species, or matching a photo with its textual description in multimodal retrieval.

Following [2, 3], we base our solution to this general problem on a very simple principle: the correct matching results in the highest statistical dependency between the two collections. Assuming mutual information as the measure of dependency, the problem is to find the matching that maximizes it. In practice, measuring the mutual information between two arbitrary vectorial measurements is difficult, and hence approximations are needed.
In [4, 2, 3] the problem was solved by using Canonical Correlation Analysis (CCA) to find maximally correlated linear subspaces of the representations, and by measuring the dependency in that subspace. CCA de-correlates the dimensions, so the total dependency becomes a sum over the components. Using CCA to measure the dependency results in an easy-to-understand optimization problem that can be solved with a two-step iterative algorithm. In the first step, given a random matching, CCA is used to find a subspace where the dependency is maximally expressed. The second step takes the CCA subspace from the first step as input and uses a generic assignment solver to find the matching that maximizes the dependency. These two steps, both optimizing the same cost function, are iterated until convergence. The whole process is illustrated in Figure 1.

While the above algorithm is easy to understand and implement, restricting to linear dependency by using CCA limits the performance in real applications. An alternative solution by [5] is to use a kernel-based measure of statistical dependency and to maximize it directly with respect to the matching. The Hilbert-Schmidt Independence Criterion (HSIC) [6] measures dependency over a range of non-linear functions spanned by kernel functions defined for the two data spaces. However, directly maximizing HSIC results in a complex algorithm that requires approximative solutions.

In this work we combine the advantages of both approaches. We introduce a matching method that retains the simplicity of the solution of [3], but removes the assumption of linear dependency. Similarly to [5], we define kernel functions for the two spaces, but use kernel CCA (KCCA) [7, 8] to learn a subspace that maximizes the dependency. We then show how the distances between the samples, computed in the kernel canonical subspace, can be used as the cost function of the assignment problem, so that given the distances we can again learn the matching with the same iterative algorithm.

We apply the method to the task of finding the correspondence of the same text in two languages at the sentence level, such that the matched sentences are translations of each other. The same text in two languages can be seen as two completely different representations of the same entities, and in the absence of a bilingual lexicon the representations are not comparable. However, they are statistically dependent due to the same semantic content. The same idea, with a KCCA-based correlation analysis, has been applied to cross-language document retrieval in [13].
[Figure 1 appears here. Top box, "Subspace detection": given permutation p, update the projections f and g such that Corr(f(X), g(Y(p))) is maximized. Bottom box, "Solve matching": given projections f and g, update the permutation p such that Σ_i ||x_i W_x^T − y_{p(i)} W_y^T||² is minimized. Dotted lines between the unmatched samples x_1, ..., x_n and y_1, ..., y_m show possible matches; the two-step iterative algorithm produces the final matching.]
Fig. 1. Illustration of the general algorithm for matching the samples of two sources. The algorithm iterates between finding the projections (top box) and finding the alignment in the subspace (bottom box). The crux of the method is that both steps use the same criterion, maximization of the statistical dependency between the sources. The approach is also highly modular, supporting various alternatives for both steps, as demonstrated by the improved subspace detection in this paper.

Large sentence-aligned corpora are needed for learning statistical machine translation models. The sentence alignment is typically based on available "anchor" cues (such as speaker identifiers and paragraph markers) and sentence lengths, but translation lexicons and more complex models of co-occurrence have also been applied [1]. We approach sentence alignment as a matching task rather than an alignment task, because the order of sentences may change during translation and alignment methods typically ignore such cross-correspondences. The matching problem does not assume an ordering and is therefore more suitable in such cases. Matching the sentences directly on vectorial representations has the further advantage of being applicable also when translation lexicons or other traditional alignment cues are unavailable or only partially known.

We represent the sentences as vectors, and find the alignment by matching the sentences based on their vector representations only. The representations can be anything that captures the similarities between the sentences, but the accuracy is naturally improved by choosing a representation that captures the semantic content as well as possible.
In this work, statistical dependency is used to capture the similarity of semantic content between two sentences. We use simple bag-of-words representations and, as a side result, show how the applied global term weightings affect the matching accuracy.

We apply the novel method to match sentences in partially aligned Finnish-English corpora from Europarl [11]. We assume that the rough location of each sentence in the translation is known (e.g., based on a document-level alignment), and find the sentence-level alignment. The matching method based on KCCA outperforms the earlier approach of using linear CCA for matching, and we show how the performance can be further improved in a semi-supervised setting where the sentence-level alignment is known for a subset of the sentences. We also provide experimental evidence on the relative quality of different bag-of-words representations for sentences, and explain how soft constraints on the alignment can be incorporated in the assignment problem.

2. METHODS

In this section, we describe the new method for matching the samples in the two views. We start by recapitulating the algorithm of [3] using linear CCA, and then extend it
to non-linear dependencies using kernel CCA. Finally, we show how the problem can be supervised with known partial alignments.
2.1. CCA-based matching

Let $\mathbf{X} \in \mathbb{R}^{N \times D_x}$ and $\mathbf{Y} \in \mathbb{R}^{N \times D_y}$ be two data matrices representing two text collections. Each row $\mathbf{x}_i$ in $\mathbf{X}$ and $\mathbf{y}_j$ in $\mathbf{Y}$ is a sentence represented by $D_x$ and $D_y$ features, respectively. Assuming a one-to-one matching, we want to infer a permutation $p$ of the samples in $\mathbf{Y}$ such that each sentence $\mathbf{x}_i \in \mathbf{X}$ is matched with its translation $\mathbf{y}_{p(i)} \in \mathbf{Y}$. The matching is based on the assumption that the correct matching is the one that best captures the statistical dependency between the two sets. To measure the dependency, we search for a common subspace for the two sources. Formally, we want to infer a permutation $p$ and lower-dimensional mappings $f(\mathbf{x})$ and $g(\mathbf{y})$ such that the statistical dependency is maximized:

$$\max_{p, f, g} \; \mathrm{Dep}\big(f(\mathbf{X}), g(\mathbf{Y}(p))\big). \qquad (1)$$

Here $\mathrm{Dep}(\cdot,\cdot)$ denotes any measure of dependency between the two arguments, and $\mathbf{Y}(p) \in \mathbb{R}^{N \times D_y}$ is the matrix obtained by picking the rows of $\mathbf{Y}$ as indicated by $p$. For linear mappings $f(\mathbf{X}) = \mathbf{X}\mathbf{w}_x^T$ and $g(\mathbf{Y}) = \mathbf{Y}\mathbf{w}_y^T$, with Pearson correlation as the dependency measure, the optimization problem becomes

$$\max_{p, \mathbf{w}_x, \mathbf{w}_y} \; \mathrm{corr}\big(\mathbf{X}\mathbf{w}_x^T, \mathbf{Y}(p)\mathbf{w}_y^T\big).$$

The problem can be solved by a simple iterative procedure, alternating between learning the matching and learning the projections (Figure 1). Below, the two steps of the general algorithm are explained in more detail. Given fixed projections, the sample estimate of correlation leads to

$$\max_p \; \frac{\mathbf{w}_x \mathbf{X}^T \mathbf{Y}(p) \mathbf{w}_y^T}{\|\mathbf{X}\mathbf{w}_x^T\| \, \|\mathbf{Y}(p)\mathbf{w}_y^T\|}.$$

As shown by [3], this can be reduced to solving an assignment problem where the cost of an assignment is the distance in the projection space,

$$\min_p \sum_{i=1}^{N} \big(\mathbf{x}_i \mathbf{w}_x^T - \mathbf{y}_{p(i)} \mathbf{w}_y^T\big)^2, \qquad (2)$$

and the assignment problem can be solved efficiently, e.g., with the Hungarian algorithm [9]. In the case of multivariate projections $\mathbf{W}_x$ and $\mathbf{W}_y$, the total correlation can be summed over the dimensions, assuming uncorrelated dimensions, and Eq. (2) can be written as

$$\min_p \sum_{i=1}^{N} \big\|\mathbf{x}_i \mathbf{W}_x^T - \mathbf{y}_{p(i)} \mathbf{W}_y^T\big\|^2. \qquad (3)$$

Given a fixed permutation $p$, the projections are learnt by CCA [8], which finds the subspace where the correlations are maximal. Since the CCA projections are scale-invariant, the components can be re-weighted for maximal matching accuracy, and [3] empirically showed that weighting each dimension with the corresponding canonical correlation outperforms uniform weighting. The full algorithm then simply repeats these two steps until convergence, starting from a randomly chosen initial pairing.
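A minimal sketch of this two-step procedure, assuming scikit-learn's CCA and SciPy's Hungarian-algorithm solver; the paper does not specify an implementation, so the function and variable names here are our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cross_decomposition import CCA

def cca_match(X, Y, n_components=10, n_iter=30, seed=0):
    """Alternate between CCA on the current pairing (subspace detection)
    and solving the assignment problem of Eqs. (2)-(3) (matching)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    p = rng.permutation(N)                      # random initial pairing
    for _ in range(n_iter):
        # Step 1: find maximally correlating projections for pairing p.
        cca = CCA(n_components=n_components).fit(X, Y[p])
        U, V = cca.transform(X, Y[p])
        # Re-weight each dimension by its canonical correlation [3].
        rho = np.array([np.corrcoef(U[:, k], V[:, k])[0, 1]
                        for k in range(n_components)])
        U *= rho
        V *= rho
        # Step 2: re-solve the matching in the learned subspace.
        cost = cdist(U, V, 'sqeuclidean')       # cost of pairing x_i with y_p(j)
        _, cols = linear_sum_assignment(cost)   # Hungarian algorithm [9]
        p_new = p[cols]                         # compose with the current p
        if np.array_equal(p_new, p):
            break                               # matching no longer changes
        p = p_new
    return p                                    # x_i is matched with y_p[i]
```

Since both steps decrease the same cost function, terminating when the permutation stops changing is a natural convergence check.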
2.2. Kernel matching based on KCCA

Linear projections are not necessarily sufficient for capturing the relationships between the representations. The matching problem can, however, easily be extended to non-linear mappings $f(\mathbf{x})$ and $g(\mathbf{y})$. In this section, we describe how non-linear dependencies can be captured by introducing non-linear mappings and kernel canonical correlation analysis (KCCA).

Let $\phi_x : \mathbf{x} \to \mathcal{F}_x$ and $\phi_y : \mathbf{y} \to \mathcal{F}_y$ denote feature-space mappings with kernel functions $k_x(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi_x(\mathbf{x}_i), \phi_x(\mathbf{x}_j) \rangle$ and $k_y(\mathbf{y}_i, \mathbf{y}_j) = \langle \phi_y(\mathbf{y}_i), \phi_y(\mathbf{y}_j) \rangle$, respectively. The data sets $\mathbf{X}$ and $\mathbf{Y}$ in the feature space can be represented as $\Phi_X \equiv [\phi_x(\mathbf{x}_1), \ldots, \phi_x(\mathbf{x}_N)]^T$ and $\Phi_Y \equiv [\phi_y(\mathbf{y}_1), \ldots, \phi_y(\mathbf{y}_N)]^T$. The kernel Gram matrices $\mathbf{K}_X = \Phi_X \Phi_X^T \in \mathbb{R}^{N \times N}$ and $\mathbf{K}_Y = \Phi_Y \Phi_Y^T \in \mathbb{R}^{N \times N}$ can be computed element-wise as $\mathbf{K}_X(i,j) = k_x(\mathbf{x}_i, \mathbf{x}_j)$ and $\mathbf{K}_Y(i,j) = k_y(\mathbf{y}_i, \mathbf{y}_j)$. The KCCA of $\mathbf{X}$ and $\mathbf{Y}$ is analogous to applying classical CCA to the corresponding Gram matrices; for details on the KCCA computation see, for instance, [7, 8]. Here we describe the method directly for the matching problem, starting from the assumption of a fixed $p$.

KCCA finds the canonical vectors in terms of expansion coefficients $\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i \in \mathbb{R}^N$ by solving

$$\max_{\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i \in \mathbb{R}^N} \; \boldsymbol{\alpha}_i \mathbf{K}_X \mathbf{K}_{Y(p)} \boldsymbol{\beta}_i^T
\quad \text{subject to} \quad
\boldsymbol{\alpha}_i \mathbf{K}_X \mathbf{K}_X \boldsymbol{\alpha}_i^T = 1, \;\;
\boldsymbol{\beta}_i \mathbf{K}_{Y(p)} \mathbf{K}_{Y(p)} \boldsymbol{\beta}_i^T = 1,$$

where $\mathbf{K}_{Y(p)}$ is the Gram matrix of $\mathbf{Y}(p)$. The expansion coefficients $\mathbf{A} = [\boldsymbol{\alpha}_1^T, \ldots, \boldsymbol{\alpha}_q^T]$ and $\mathbf{B} = [\boldsymbol{\beta}_1^T, \ldots, \boldsymbol{\beta}_q^T]$ are analogous to the projection matrices $\mathbf{W}_x$ and $\mathbf{W}_y$ of classical CCA, with $q = \min(\mathrm{Rank}(\mathbf{K}_X), \mathrm{Rank}(\mathbf{K}_{Y(p)}))$. Given the kernel canonical vectors, the permutation $p$ is again solved as an assignment problem. The costs in (3)
are re-written in the kernel space as

$$\min_p \sum_{i=1}^{N} \big\| \mathbf{K}_X^i \mathbf{A} - \mathbf{K}_{Y(p)}^i \mathbf{B} \big\|^2, \qquad (4)$$

where $\mathbf{K}_X^i$ and $\mathbf{K}_{Y(p)}^i$ denote the $i$th rows of the Gram matrices. Otherwise this step of the algorithm is identical to the linear case. In practice, each KCCA projection dimension is re-scaled with the corresponding kernel canonical correlation, as in the linear case.

While KCCA has the advantage of allowing non-linear dependencies, it also has certain drawbacks. High-dimensional feature mappings overlearn the correlations and hence generalize poorly, which means that the KCCA solution needs to be regularized heavily. On the positive side, since we only need to compute the distances between the samples and weight the dimensions according to the canonical correlations, choosing the complexity of the KCCA subspace is not an issue. We adopt the regularization strategy of [10], computing KCCA with Gram matrices of the form $\mathbf{K}_X + \gamma \mathbf{I}$ instead of $\mathbf{K}_X$. The regularization parameter $\gamma$, as well as potential parameters of the kernel matrices, such as the width of a Gaussian kernel, are learned using a validation set.
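As an illustration of the computation, here is a minimal sketch of one standard regularized KCCA solver. It uses $(\mathbf{K} + \gamma \mathbf{I})^2$ in the constraints of the problem above, in the spirit of the strategy of [10], but it is a simplified stand-in rather than the exact kernlab computation; all names are our own.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def gaussian_gram(X, lam):
    """Gram matrix of the Gaussian kernel k(x_i, x_j) = exp(-lam ||x_i - x_j||^2)."""
    return np.exp(-lam * cdist(X, X, 'sqeuclidean'))

def kcca(Kx, Ky, gamma, n_components):
    """Expansion coefficients A, B and kernel canonical correlations rho.

    Solves max a' Kx Ky b subject to a' (Kx+gI)^2 a = b' (Ky+gI)^2 b = 1
    via a symmetric generalized eigenproblem for the alpha vectors."""
    N = Kx.shape[0]
    Cx = (Kx + gamma * np.eye(N)) @ (Kx + gamma * np.eye(N))
    Cy = (Ky + gamma * np.eye(N)) @ (Ky + gamma * np.eye(N))
    # Eliminating beta gives: Kx Ky Cy^{-1} Ky Kx alpha = rho^2 Cx alpha.
    M = Kx @ Ky @ np.linalg.solve(Cy, Ky @ Kx)
    M = (M + M.T) / 2                       # symmetrize numerical noise
    evals, A = eigh(M, Cx)                  # ascending generalized eigenvalues
    A = A[:, ::-1][:, :n_components]        # keep the top components
    rho = np.sqrt(np.clip(evals[::-1][:n_components], 0.0, 1.0))
    B = np.linalg.solve(Cy, Ky @ Kx @ A) / np.maximum(rho, 1e-12)
    # The projections entering the cost of Eq. (4) are U = Kx @ A and
    # V = Ky @ B, each column re-scaled by the corresponding rho.
    return A, B, rho
```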
2.3. Using partial alignments

The basic algorithm is applicable in fully unsupervised settings, but partial alignments can be used to improve the accuracy. We rule out a priori unlikely matches by modifying the distances in (4) based on a paragraph- or document-level alignment. Earlier work [3] utilized hard constraints, whereas we show how soft constraints chosen with application-domain expertise can also easily be incorporated in the assignment problem. In practice, we penalize alignments that move any sentence more than 5 sentences forward or backward from its known document location.

The task can also be supervised with sets of known sentence-level alignments, resulting in a semi-supervised matching task. The data matrices X and Y are simply complemented with samples for which p is known; the algorithm need not be changed in any other way than by ignoring that part of the sentences when solving for p. Furthermore, the aligned part can be used to learn the kernel parameters and to initialize the (K)CCA projections, which guarantees that a reasonably good subspace is found already in the beginning and fewer iterations are needed until convergence.

3. EXPERIMENTS

3.1. Data

We use a part of the Europarl corpus, which consists of the proceedings of the European Parliament meetings in 11 languages [11]. We take two languages, Finnish and English, and align one month of the data (September 2003) using the sentence alignment tool included in the corpus to obtain the ground-truth alignment. The data has 21 358 sentences divided into 8 days. After preprocessing, the data has 496 044 English and 382 866 Finnish tokens (words, numbers, and punctuation marks).

Vectorial representations are produced separately for the sentences of the two languages. We apply bag-of-words representations using all the words in the corpus, and then reduce the dimensionality of the representations to 200 with truncated Singular Value Decomposition (SVD), obtaining two data matrices of size 21 358 × 200. We compare four different global term weightings for the bag-of-words representations: no weighting, logarithmic inverse document frequency (idf), square root idf, and linear idf. As the local weighting, we apply linear term frequency.
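A minimal sketch of this representation pipeline, assuming scikit-learn; the default word-level tokenization, the helper name, and the exact idf formulas are our assumptions, as the paper does not detail its preprocessing.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def sentence_vectors(sentences, dim=200, weighting='log'):
    """Bag-of-words with linear term frequency, one of the four global
    idf weightings, and truncated SVD down to `dim` dimensions."""
    counts = CountVectorizer().fit_transform(sentences)   # linear tf
    n = counts.shape[0]
    df = np.asarray((counts > 0).sum(axis=0)).ravel()     # document frequencies
    idf = {'none':   np.ones_like(df, dtype=float),
           'log':    np.log(n / df),                      # logarithmic idf
           'sqrt':   np.sqrt(n / df),                     # square root idf
           'linear': n / df}[weighting]                   # linear idf
    return TruncatedSVD(n_components=dim).fit_transform(counts.multiply(idf))

# One matrix per language, each row a sentence, e.g. (hypothetical inputs):
# X = sentence_vectors(english_sentences)   # 21 358 x 200
# Y = sentence_vectors(finnish_sentences)
```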
3.2. Experimental setup

We compare the CCA-based matching method with the new KCCA-based method, using the standard Gaussian kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\lambda \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, in two different experiments: unsupervised and semi-supervised. In both setups, the data of the first day, 1266 sentences, is used as a development set for learning the model parameters, while the rest is used as left-out data for measuring the final performance. In the semi-supervised setup, the development set is also included when learning the alignment for the left-out data.

For the CCA-based matching method, we use the development set to select the best of the four normalization variants for the vectorial representation of the sentences, and to learn the dimensionality of the sentence description (the first d SVD dimensions). For the KCCA-based method, we additionally learn the regularization parameter γ and the inverse kernel width λ, using a basic grid search. We then apply the models with the best parameters to the test data, treating each day as an independent alignment problem. Furthermore, we split the days with the most sentences into smaller parts to obtain more test cases, resulting in sizes ranging between 1085 and 1750 sentences. In the end, we average the results over the 14 matching problems.

As prior information, we assume that the order of the sentences is roughly preserved in translation: for a given sentence, the translation should be reasonably close in the corresponding document. We allow free search within a limited neighborhood and add a penalty beyond that neighborhood to discourage matching with far-off sentences. We add no penalty for translations within ±5 sentences; for sentences further away we add a penalty of 0.01 × d(x_i, y_j) × |j − i|, where d(x_i, y_j) is the distance between the vectorial representations of the two sentences and |j − i| denotes the difference in positions. The penalty is chosen to be multiplicative to avoid having to validate its magnitude against the average scale of the distances, which keeps the approach conceptually simple. Furthermore, sentences more than 20 positions off are excluded from the assignment problem, introducing a hard constraint that makes the assignment problem solver more efficient.
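A minimal sketch of how these soft and hard constraints can be folded into the assignment costs; the parameter names are our own, and the large constant merely emulates excluding pairs from the solver.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def constrained_cost(U, V, free=5, cutoff=20, rate=0.01):
    """Assignment costs between projected sentences U and V, assumed to be
    in rough document order. No penalty within +/-`free` positions, a
    multiplicative penalty of rate * d * |j - i| beyond that, and pairs
    more than `cutoff` positions apart are effectively forbidden."""
    d = cdist(U, V, 'euclidean')                       # base distances
    offset = np.abs(np.arange(len(U))[:, None] - np.arange(len(V))[None, :])
    cost = d + np.where(offset > free, rate * d * offset, 0.0)
    cost[offset > cutoff] = 1e9                        # hard constraint
    return cost

# x_i is paired with y_cols[i]:
# rows, cols = linear_sum_assignment(constrained_cost(U, V))
```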
[Figure 2 appears here: two panels plotting matching accuracy (%) against the dimensionality of the data. Panel (a), "Comparison of vector representations", compares logarithmic idf, no global weighting, square root idf, and linear idf. Panel (b), "Comparison with baseline", compares linear CCA matching against the Baseline-Random subspace and Baseline-Euclidean methods.]
Fig. 2. Linear matching based on CCA for sentence alignment in two languages, as a function of data dimensionality. Results are averaged over 20 random initializations in each case. The curves show the mean matching accuracy, that is, the percentage of correctly aligned sentences, and the bars show the standard deviation. Subfigure (a) compares the different vector representations with each other, showing that logarithmic idf weighting clearly obtains the best accuracy. Subfigure (b) compares the accuracy of this best variant against two baseline methods; see the text for descriptions of the baselines.

3.3. Results

Figure 2 shows the accuracy of sentence alignment using linear CCA-based matching as a function of data dimensionality. Figure 2(a) shows the performance of the matching method for the different global weightings of the vector representation. The matching accuracy is clearly best with the logarithmic idf weighting. This was expected, as logarithmic idf is widely regarded as a good measure of term importance in document retrieval [12]. Our results show that emphasizing the inverse frequency in a logarithmic manner is useful also when matching the representations of sentences in different languages.

Figure 2(b) compares CCA-based matching with two baselines that roughly measure the information provided by the penalty term. Baseline-Random subspace is the result of the CCA-based algorithm after the first iteration, effectively measuring the accuracy in a random subspace, and Baseline-Euclidean solves the assignment problem directly in the original data space. While the baseline methods suffer in the absence of paired features, the dependency-based matching performs considerably well. As Figure 2(a) shows, the choice of representation is, however, crucial for the sentence alignment, and a poor representation (linear idf) results in an alignment accuracy no better than the random baseline.
Based on the above validation, we chose a dimensionality of 30 and logarithmic idf for the remaining experiments with linear CCA. As KCCA has two further user-selected parameters besides the representation, we simplified the setup and chose the dimensionality from only two options: the one optimal for linear CCA (30) and the full dimensionality (200). We then validated the regularization parameter γ over the range [0.1, 50] and the precision parameter of the kernel λ over [0.00005, 0.1]. This resulted in a maximal accuracy of 58.6% with a dimensionality of 30 and parameter values γ = 1 and λ = 0.001.

Given the optimal values learned on the data of the first day, we applied the two algorithms to the 14 test scenarios, in both the unsupervised and the semi-supervised setup. Table 1 shows that KCCA-based matching outperforms linear CCA-based matching, and that the semi-supervised setup clearly improves the accuracy of both algorithms. The differences between all four results are statistically significant (p < 0.001, paired t-test).

Table 1. Matching accuracies (± standard deviation) on the left-out data, comparing linear matching based on CCA and kernel matching based on KCCA. All differences are statistically significant (p < 0.001, paired t-test).

                     Unsupervised     Semi-supervised
  Linear matching    49.4 ± 3.0%      58.4 ± 2.4%
  Kernel matching    53.0 ± 2.7%      61.1 ± 2.2%

4. DISCUSSION

Matching the samples of two unordered data matrices is a general problem with applications in a range of domains. We presented a novel algorithm for the matching problem based on maximizing the dependency between the data sets, and also
extended the matching solution in [3] by utilizing non-linear dependencies through kernel mappings. Using kernelized representations also opens up the possibility of applying the matching method to non-vectorial data. We also explained how the matching can be solved in a semi-supervised setting, and how known soft constraints can be incorporated in the solution.

As an application, we studied the task of finding corresponding sentences in two different languages. We tested several global term weightings for the bag-of-words representations of the sentences, and showed that the choice of weighting has a considerable effect also on this task. Furthermore, we demonstrated how the matching accuracy can be improved by taking both the non-linear dependencies and a partial alignment into account. While the resulting accuracy is still far from perfect, it is worth noting that the matching is based only on vectorial representations created from monolingual data sets. Unlike typical sentence-alignment methods used in practice, we do not use information such as anchor cues, sentence lengths, or translation lexicons. Such sources of information could, however, be exploited, for example as prior knowledge for finding partial alignments, or for obtaining secondary representations that would be directly commensurable, likely improving the accuracy.
5. ACKNOWLEDGMENTS

AT and AK belong to the Helsinki Institute for Information Technology HIIT and are partially supported by the PASCAL2 EU Network of Excellence. AK and SV belong to the Adaptive Informatics Research Centre, an Academy of Finland Center of Excellence. AK is supported by the Academy of Finland, decision number 133818, and SV by the Graduate School of Language Technology in Finland.

6. REFERENCES

[1] D. Melamed, "Bitext maps and alignment via pattern recognition," Computational Linguistics, vol. 25, no. 1, pp. 107–130, 1999.

[2] Abhishek Tripathi, Arto Klami, and Samuel Kaski, "Using dependencies to pair samples for multi-view learning," TKK Reports in Information and Computer Science TKK-ICS-R8, Helsinki University of Technology, 2008.

[3] Abhishek Tripathi, Arto Klami, and Samuel Kaski, "Using dependencies to pair samples for multi-view learning," in Proceedings of ICASSP 09, the International Conference on Acoustics, Speech, and Signal Processing, 2009, pp. 1561–1564.

[4] Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein, "Learning bilingual lexicons from monolingual corpora," in Proceedings of ACL-08: HLT, 2008, pp. 771–779, Association for Computational Linguistics.
[5] N. Quadrianto, L. Song, and A. Smola, "Kernelized sorting," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 1289–1296.

[6] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf, "A Hilbert space embedding for distributions," in Algorithmic Learning Theory, E. Takimoto, Ed., Lecture Notes in Computer Science, Springer, 2007.

[7] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1–48, 2002.

[8] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.

[9] Harold W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.

[10] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis, "kernlab – an S4 package for kernel methods in R," Journal of Statistical Software, vol. 11, no. 9, pp. 1–20, 2004.

[11] Philipp Koehn, "Europarl: A parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit, 2005, pp. 79–86.

[12] Kishore Papineni, "Why inverse document frequency?," in Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), 2001, pp. 1–8, Association for Computational Linguistics.

[13] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini, "Inferring a semantic representation of text via cross-language correlation analysis," in Advances in Neural Information Processing Systems, 2003, pp. 1497–1504.