Multi-view Clustering of Multilingual Documents

Young-Min Kim†, Massih-Reza Amini‡, Cyril Goutte‡, Patrick Gallinari†

†University Pierre and Marie Curie, Computer Science Laboratory of Paris 6, 4, Place Jussieu, 75005 Paris, France
‡National Research Council Canada, Institute for Information Technology, 283, boulevard Alexandre-Taché, Gatineau, J8X 3X7, Canada

Objective

• Handle data in multiple feature sets or views (web pages translated into several languages),
• Enhance document clustering using the relations between the multiple views,
• A two-step multi-view clustering method which uses clustering results obtained on each view as a voting pattern.

Methods

• Stage I - Single-view clustering

A corpus of N multilingual documents in V languages.
$d = (d_1, \ldots, d_V)$: a document, with one version $d_v$ per language, $v \in \{1, \ldots, V\}$.

Probabilistic Latent Semantic Analysis (PLSA), applied over each language:
- Pick a document $d_v$ with probability $p(d_v)$,
- Choose a topic $z$ with probability $p(z \mid d_v)$,
- Generate a word $w$ with probability $p(w \mid z)$.

Maximize the log-likelihood function with EM:

$$L = \sum_{d_v} \sum_{w} n(d_v, w) \log\Big( \sum_{z} p(d_v)\, p(z \mid d_v)\, p(w \mid z) \Big)$$

$$\mathrm{topic}(d_v) = \arg\max_{z}\, p(z \mid d_v)$$
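The single-view step above can be sketched in a few lines of NumPy. This is a minimal, dense illustration under stated assumptions, not the authors' implementation: the function name `plsa`, the toy count matrix, the number of iterations and the random seed are all hypothetical, and the three-dimensional posterior array only suits small examples.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM to a (documents x words) count matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initial parameters, normalised so each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d = counts.sum(axis=1) / counts.sum()  # maximum-likelihood p(d)
    tiny = 1e-300  # guards against division by zero and log(0)
    log_liks = []
    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(z|d) p(w|z); shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        mix = joint.sum(axis=1)  # sum_z p(z|d) p(w|z)
        post = joint / np.maximum(mix[:, None, :], tiny)
        # Log-likelihood L of the parameters used in this E-step.
        log_liks.append(float((counts * np.log(np.maximum(p_d[:, None] * mix, tiny))).sum()))
        # M-step: re-estimate p(w|z) and p(z|d) from n(d,w) p(z|d,w).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), tiny)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), tiny)
    return p_z_d, p_w_z, log_liks

# Hypothetical toy view: 4 documents over a 4-word vocabulary.
counts = np.array([[5, 4, 0, 0],
                   [4, 6, 0, 0],
                   [0, 0, 5, 5],
                   [0, 1, 4, 6]], dtype=float)
p_z_d, p_w_z, log_liks = plsa(counts, n_topics=2)
topics = p_z_d.argmax(axis=1)  # topic(d_v) = argmax_z p(z | d_v)
print(topics)
```

Running this once per language would give one topic index per document and per view, which is the input the voting stage needs.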

Experiments

Compared methods:
- Voted-PLSA: our proposition,
- Conc-PLSA: PLSA over the concatenated feature representation,
- Fusion-LM: a late fusion approach for multilingual data (SIGIR 09).

Data collection

Reuters RCV1/RCV2 corpus: originally a comparable corpus in 5 languages with 6 classes.
→ Machine translation into all languages, e.g. the originally English corpus is translated into FR, GR, IT and SP, which gives a multilingual document corpus.

Original language   # documents
English             18758
French              26648
German              29953
Italian             24039
Spanish             12342

This yields five multilingual document corpora, one per original language.
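As a small illustration of the concatenated feature representation used by Conc-PLSA, the sketch below stacks the per-language term-count matrices of the same documents side by side. The matrices, their sizes and the variable names are hypothetical toy values, not data from the corpus.

```python
import numpy as np

# Hypothetical term counts for the same N = 3 documents in V = 2 views.
# Each view has its own vocabulary, so the column counts differ.
view_en = np.array([[2, 0, 1],
                    [0, 3, 0],
                    [1, 1, 0]])
view_fr = np.array([[1, 1],
                    [0, 2],
                    [3, 0]])

# Concatenated representation: one row per document, all vocabularies side by side.
views = [view_en, view_fr]
concatenated = np.hstack(views)
print(concatenated.shape)
```

Each row of `concatenated` then describes one multilingual document across all of its views.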

Results

Evaluation criteria: micro-averaged precision/recall and normalized mutual information (NMI)
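The two criteria can be sketched as follows. The poster does not spell out its definitions, so this block rests on assumptions: micro-averaged precision is computed by mapping each cluster to its majority class, and NMI is normalized by the geometric mean of the two entropies (other normalizations exist). The function names and the toy labels are hypothetical.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count matrix with one row per cluster and one column per class."""
    _, class_idx = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, cluster_idx = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((cluster_idx.max() + 1, class_idx.max() + 1))
    np.add.at(table, (cluster_idx, class_idx), 1)
    return table

def micro_precision(labels_true, labels_pred):
    # Assumption: each cluster is labelled with its majority class.
    table = contingency(labels_true, labels_pred)
    return table.max(axis=1).sum() / table.sum()

def nmi(labels_true, labels_pred):
    # Assumption: normalisation by sqrt(H(clusters) * H(classes)).
    joint = contingency(labels_true, labels_pred)
    joint = joint / joint.sum()
    p_cluster = joint.sum(axis=1)
    p_class = joint.sum(axis=0)
    nonzero = joint > 0
    mutual_info = (joint[nonzero]
                   * np.log(joint[nonzero] / np.outer(p_cluster, p_class)[nonzero])).sum()
    h_cluster = -(p_cluster * np.log(p_cluster)).sum()
    h_class = -(p_class * np.log(p_class)).sum()
    denom = np.sqrt(h_cluster * h_class)
    # Convention chosen here: return 0 when one side has a single group.
    return float(mutual_info / denom) if denom > 0 else 0.0

# Hypothetical toy labels for six documents.
true_classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
precision = micro_precision(true_classes, clusters)
score = nmi(true_classes, clusters)
print(precision, score)
```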

10 experiments with random initializations → average

• Single-view clustering then voting: pre-assigned clusters

Original language   % of documents in          Micro-averaged precision
(corpus name)       pre-assigned clusters      over pre-assigned clusters
English (L1)        51.18                      0.79
French (L2)         63.85                      0.78
German (L3)         67.44                      0.80
Italian (L4)        58.03                      0.60
Spanish (L5)        73.73                      0.81
Average             62.84                      0.76

• Multi-view clustering: performance comparison of our approach and the other methods.

For each d, obtain the voting pattern $(z_d^1, \ldots, z_d^V)$, where $z_d^v$ is the estimated topic index of d on the v-th language.

• Stage II - Voting & Multi-view clustering

Voting: grouping documents with similar voting patterns.
How similar? At least V - 1 topics are common.

Cluster signatures



[Figure: example voting patterns (6, 3, 3, 3, 2), (2, 1, 6, 2, 6), (5, 6, 1, 4, 5), (5, 2, 5, 1, 1), (3, 4, 2, 6, 3), (1, 4, 2, 6, 3), (4, 5, 4, 5, 1); documents are split into clustered docs (clusters C1 to C6) and remaining docs.]



- Cluster signature: a voting pattern shared by a group of documents.

Regroup the remaining documents (those whose voting patterns differ from every selected cluster signature) by applying PLSA to the concatenated feature vectors.

Final PLSA model in the concatenated feature space, subject to, for the pre-assigned documents, $p(c \mid d) = 1$ if $d \in$ {selected cluster signature for $c$}, and $p(c \mid d) = 0$ otherwise.
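The voting step of Stage II can be sketched as below. The poster does not say how cluster signatures are selected, so this sketch assumes they are the most frequent exact voting patterns. It also assumes that a document joining a signature on at least V - 1 views goes to the best-matching signature, with ties broken by signature order. The function name `vote` and the toy patterns are hypothetical.

```python
from collections import Counter

def vote(patterns, n_clusters):
    """Split documents into pre-assigned clusters and remaining documents.

    patterns: one tuple of V per-view topic indices per document.
    """
    n_views = len(patterns[0])
    # Assumption: signatures are the n_clusters most frequent exact patterns.
    signatures = [p for p, _ in Counter(patterns).most_common(n_clusters)]
    assigned, remaining = {}, []
    for doc, pattern in enumerate(patterns):
        # Number of views on which the document agrees with each signature.
        agreement = [sum(a == b for a, b in zip(pattern, sig)) for sig in signatures]
        best = max(range(len(signatures)), key=agreement.__getitem__)
        if agreement[best] >= n_views - 1:  # at least V - 1 topics in common
            assigned[doc] = best
        else:
            remaining.append(doc)
    return signatures, assigned, remaining

# Hypothetical voting patterns for 7 documents in V = 5 languages.
patterns = [
    (6, 3, 3, 3, 2), (6, 3, 3, 3, 2), (6, 3, 3, 3, 2),
    (2, 1, 6, 2, 6), (2, 1, 6, 2, 6),
    (6, 3, 3, 1, 2),  # differs from the first signature on one view only
    (5, 6, 1, 4, 5),  # matches no signature
]
signatures, assigned, remaining = vote(patterns, n_clusters=2)
print(signatures, assigned, remaining)
```

In the final model, documents in `assigned` would keep p(c|d) fixed to 1 for their cluster, while the documents in `remaining` are the ones regrouped by PLSA in the concatenated feature space.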

Average performance over five languages:

Strategy     Micro-averaged precision   NMI
Voted-PLSA   0.65                       0.44
Conc-PLSA    0.63↓                      0.41↓
Fusion-LM    0.61↓                      0.41↓

Conclusion

• A new multi-view clustering approach for multilingual document clustering.
• The voting mechanism over the different views lets the cluster signatures pre-cluster the strongly related documents.
• Starting from these pre-assigned clusters, PLSA learning enhances clustering performance.
