Multi-view Clustering of Multilingual Documents
Young-Min Kim†, Massih-Reza Amini‡, Cyril Goutte‡, Patrick Gallinari†
†University Pierre and Marie Curie, Computer Science Laboratory of Paris 6, 4, Place Jussieu, 75005 Paris, France
‡National Research Council Canada, Institute for Information Technology, 283, boulevard Alexandre-Taché, Gatineau, J8X 3X7, Canada
Objective
• Handle data in multiple feature sets or views (web pages translated into several languages),
• Enhance document clustering using the relations between the multiple views,
• Propose a two-step multi-view clustering method which uses the clustering results obtained on each view as a voting pattern.
Methods
• Stage I - Single-view clustering
A corpus of N multilingual documents in V languages.
d = (d^1, …, d^V): a document, where d^v, v ∈ {1, …, V}, is its version in the v-th language.
Probabilistic Latent Semantic Analysis (PLSA), run separately over each language:
- Pick a document d^v with probability p(d^v),
- Choose a topic z with probability p(z|d^v),
- Generate a word w with probability p(w|z).
Maximize the log-likelihood function with EM:
$$L = \sum_{d^v} \sum_{w} n(d^v, w) \, \log\Big( \sum_{z} p(d^v) \, p(z \mid d^v) \, p(w \mid z) \Big)$$
topic(d^v) = argmax_z p(z|d^v)
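The single-view step can be illustrated with a minimal NumPy sketch of PLSA fitted by EM on one language. This is not the authors' implementation: the function names, the toy count matrix, the iteration count, and the choice p(d) = n(d)/n (the usual maximum-likelihood estimate) are all assumptions made for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a (documents x words) count matrix n(d, w).

    Returns p(z|d) with shape (D, K) and p(w|z) with shape (K, W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, normalized to valid conditional distributions.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(z|d,w), proportional to p(z|d) p(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post                # (D, K, W)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def log_likelihood(counts, p_z_d, p_w_z):
    # Assumption: p(d) is the empirical document frequency n(d) / n.
    p_d = counts.sum(axis=1) / counts.sum()
    p_dw = p_d[:, None] * (p_z_d @ p_w_z)
    return float((counts * np.log(p_dw + 1e-12)).sum())

# Toy corpus (hypothetical): 4 documents, 4 words, 2 topics.
counts = np.array([[5, 4, 0, 0],
                   [4, 6, 0, 0],
                   [0, 0, 5, 5],
                   [0, 1, 4, 6]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
topics = p_z_d.argmax(axis=1)   # topic(d) = argmax_z p(z|d)
```

In the two-stage method, one such model would be fitted per language, and the resulting `topics` arrays would be stacked document by document to form the voting patterns.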
Experiments
• Voted-PLSA: our proposal
• Conc-PLSA: PLSA over the concatenated feature representation
• Fusion-LM: a late-fusion approach for multilingual data (SIGIR 09)
Data Collection
Reuters RCV1/RCV2 corpus: originally a comparable corpus in 5 languages with 6 classes.
→ Machine translation into all languages, e.g. the originally English corpus → translated into FR, GR, IT, SP (a multilingual document corpus).
Original language   # documents
English             18758
French              26648
German              29953
Italian             24039
Spanish             12342
This gives five multilingual document corpora, one per original language.
Results
Evaluation criteria: micro-averaged precision/recall and normalized mutual information (NMI)
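For reference, NMI between true classes and predicted clusters can be computed as in the sketch below. The poster does not say which normalization was used; dividing the mutual information by the geometric mean of the two entropies is an assumption here, and the function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same documents."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes, t = np.unique(labels_true, return_inverse=True)
    clusters, c = np.unique(labels_pred, return_inverse=True)
    # Joint distribution over (class, cluster) pairs.
    joint = np.zeros((classes.size, clusters.size))
    np.add.at(joint, (t, c), 1.0)
    joint /= labels_true.size
    p_t = joint.sum(axis=1)
    p_c = joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(p_t, p_c)[nz])).sum()
    h_t = -(p_t * np.log(p_t)).sum()
    h_c = -(p_c * np.log(p_c)).sum()
    if h_t == 0.0 or h_c == 0.0:
        return 0.0  # a single class or a single cluster carries no information
    # Assumption: normalize by the geometric mean of the entropies.
    return float(mi / np.sqrt(h_t * h_c))

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # cluster ids permuted
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # clusters unrelated to classes
```

NMI is invariant to a relabeling of the clusters, which matters here because PLSA topic indices carry no intrinsic meaning.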
10 experiments with random initializations → average
• Single-view clustering then voting: pre-assigned clusters
Original language (corpus name)   % of documents in pre-assigned clusters   Micro-averaged precision over pre-assigned clusters
English (L1)                      51.18                                     0.79
French (L2)                       63.85                                     0.78
German (L3)                       67.44                                     0.80
Italian (L4)                      58.03                                     0.60
Spanish (L5)                      73.73                                     0.81
Average                           62.84                                     0.76
• Multi-view clustering: performance comparison of our approach and the others.
For each d, obtain (z_d^1, …, z_d^V): the voting pattern, where z_d^v is the estimated topic index of d in the v-th language.
• Stage II - Voting & Multi-view clustering
Voting: group documents with similar voting patterns. How similar? At least V−1 topics in common.
Cluster signatures:
[Figure: example voting patterns, e.g. (6, 3, 3, 3, 2), (2, 1, 6, 2, 6), (5, 6, 1, 4, 5), (5, 2, 5, 1, 1), (3, 4, 2, 6, 3), (1, 4, 2, 6, 3), (4, 5, 4, 5, 1), …; documents matching a cluster signature are placed in clusters C1, …, C6 (clustered docs), the others are left as remaining docs.]
- Cluster signature: a voting pattern shared by a group of documents.
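The voting step can be sketched as follows. Topic indices are not aligned across languages (a pattern such as (6, 3, 3, 3, 2) mixes different indices), so documents are compared on whole patterns rather than by majority vote. The poster does not state how signatures are selected; taking the most frequent patterns, breaking ties by best agreement, and the names `vote` and `min_agree` are assumptions of this sketch.

```python
from collections import Counter

def vote(patterns, n_clusters, min_agree=None):
    """Pre-assign documents to clusters from their voting patterns.

    patterns: one tuple per document, holding the V per-language topic indices.
    Returns the selected signatures and, per document, a cluster index
    or -1 for documents left for the second clustering pass.
    """
    n_views = len(patterns[0])
    if min_agree is None:
        min_agree = n_views - 1   # "at least V-1 topics in common"
    # Assumption: the most frequent patterns are kept as cluster signatures.
    signatures = [p for p, _ in Counter(patterns).most_common(n_clusters)]
    assignment = []
    for pattern in patterns:
        best, best_agree = -1, min_agree - 1
        for cluster, signature in enumerate(signatures):
            agree = sum(a == b for a, b in zip(pattern, signature))
            if agree > best_agree:   # keep the signature with the most agreement
                best, best_agree = cluster, agree
        assignment.append(best)
    return signatures, assignment

# Toy voting patterns over V = 5 languages (hypothetical documents).
patterns = ([(6, 3, 3, 3, 2)] * 3 + [(6, 3, 3, 1, 2)]
            + [(2, 1, 6, 2, 6)] * 2 + [(5, 6, 1, 4, 5)])
signatures, assignment = vote(patterns, n_clusters=2)
# assignment == [0, 0, 0, 0, 1, 1, -1]
```

The fourth document disagrees with the first signature in one language only, so it is still pre-assigned; the last one matches no signature and stays unassigned.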
Regroup the remaining documents (those whose voting pattern differs from every selected cluster signature) by applying PLSA to the concatenated feature vectors.
Final PLSA model in the concatenated feature space, subject to, for every pre-assigned document d: p(c|d) = 1 if d ∈ {selected cluster signature for c}, and p(c|d) = 0 otherwise.
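One way to impose this constraint, shown here only as a sketch, is to run EM on the concatenated count matrix and reset p(c|d) to a one-hot vector for the pre-assigned documents after every M-step. The function name, the use of -1 to mark remaining documents, and the toy data are assumptions; the authors' actual procedure may differ.

```python
import numpy as np

def constrained_plsa(view_counts, assignment, n_clusters, n_iter=50, seed=0):
    """PLSA on concatenated views with p(c|d) clamped for pre-assigned documents.

    view_counts: list of (D x W_v) count matrices, one per language.
    assignment:  cluster index per document, or -1 for remaining documents.
    """
    counts = np.hstack(view_counts)              # concatenated feature space
    assignment = np.asarray(assignment)
    fixed = assignment >= 0
    one_hot = np.eye(n_clusters)[assignment[fixed]]
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_c_d = rng.random((n_docs, n_clusters))
    p_c_d /= p_c_d.sum(axis=1, keepdims=True)
    p_c_d[fixed] = one_hot                       # impose the constraint
    p_w_c = rng.random((n_clusters, n_words))
    p_w_c /= p_w_c.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(c|d,w), proportional to p(c|d) p(w|c).
        post = p_c_d[:, :, None] * p_w_c[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step, then re-clamp the pre-assigned documents.
        expected = counts[:, None, :] * post
        p_w_c = expected.sum(axis=0)
        p_w_c /= p_w_c.sum(axis=1, keepdims=True) + 1e-12
        p_c_d = expected.sum(axis=2)
        p_c_d /= p_c_d.sum(axis=1, keepdims=True) + 1e-12
        p_c_d[fixed] = one_hot
    return p_c_d.argmax(axis=1)

# Toy data (hypothetical): 5 documents, 2 languages, 3 words per language.
view = np.array([[4, 3, 0],
                 [4, 3, 0],
                 [0, 0, 4],
                 [0, 0, 4],
                 [0, 0, 3]], dtype=float)
labels = constrained_plsa([view, view], [0, 0, 1, 1, -1], n_clusters=2)
```

Here the first four documents keep their pre-assigned clusters by construction, and the last document, which shares its vocabulary with documents 3 and 4, is drawn into their cluster.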
Average performance over five languages
Strategy     Micro-averaged precision   NMI
Voted-PLSA   0.65                       0.44
Conc-PLSA    0.63↓                      0.41↓
Fusion-LM    0.61↓                      0.41↓
Conclusion
• A new multi-view clustering approach for multilingual document clustering.
• The voting mechanism over the different views lets the cluster signatures pre-cluster strongly related documents.
• Starting from the pre-assigned clusters, PLSA learning enhances clustering performance.