Multi-view Clustering of Multilingual Documents


Young-Min Kim†, Massih-Reza Amini‡, Cyril Goutte‡, Patrick Gallinari†

†University Pierre and Marie Curie, Computer Science Laboratory of Paris 6, 4, Place Jussieu, 75005 Paris, France
‡National Research Council Canada, Institute for Information Technology, 283, boulevard Alexandre-Taché, Gatineau, J8X 3X7, Canada

Objective

• Handle data in multiple feature sets or views, such as web pages translated into several languages.
• Enhance document clustering using the relations between the multiple views.
• Propose a two-step multi-view clustering method that uses the clustering results obtained on each view as a voting pattern.

Methods

Stage I - Single-view clustering

A corpus of N multilingual documents in V languages. A document is d = (d^1, …, d^V), where d^v, v ∈ {1, …, V}, is its version in language v.

Probabilistic Latent Semantic Analysis (PLSA) is applied over each language:
- Pick a document d^v with probability p(d^v),
- Choose a topic z with probability p(z|d^v),
- Generate a word w with probability p(w|z).

Maximize the log-likelihood function with EM:

\[
L = \sum_{d^v} \sum_{w} n(d^v, w) \log\Big( \sum_{z} p(d^v)\, p(z \mid d^v)\, p(w \mid z) \Big)
\]

where n(d^v, w) is the number of occurrences of word w in d^v. Each document is then assigned its most probable topic in each language:

topic(d^v) = argmax_z p(z|d^v)
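The single-view step can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `plsa` and `log_likelihood`, the dense (documents x words) count matrix, the iteration count, and the toy data are all hypothetical, and the dense posterior array is only practical for small inputs.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA on one view with EM.

    counts: (D, W) array of word counts n(d, w).
    Returns p(z|d) with shape (D, Z) and p(w|z) with shape (Z, W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation; every row is normalised to a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior p(z|d,w) proportional to p(z|d) p(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, Z, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # Expected counts n(d,w) p(z|d,w).
        weighted = counts[:, None, :] * post                 # (D, Z, W)
        # M-step: re-estimate p(w|z) and p(z|d) from the expected counts.
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def log_likelihood(counts, p_z_d, p_w_z):
    """L = sum_d sum_w n(d,w) log( sum_z p(d) p(z|d) p(w|z) )."""
    p_d = counts.sum(axis=1) / counts.sum()   # maximum-likelihood p(d)
    p_dw = p_d[:, None] * (p_z_d @ p_w_z)
    mask = counts > 0                          # zero counts contribute nothing
    return float((counts[mask] * np.log(p_dw[mask])).sum())

# Hypothetical toy view: 4 documents over a 6-word vocabulary.
toy_counts = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 0, 0],
    [0, 0, 0, 3, 5, 4],
    [0, 0, 0, 2, 4, 5],
])
p_z_d, p_w_z = plsa(toy_counts, n_topics=2)
topics = p_z_d.argmax(axis=1)   # topic(d^v) = argmax_z p(z|d^v)
```

Running `plsa` once per language and collecting the resulting `topics` arrays gives, for each document, one topic index per view.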

Experiments

Strategies compared:
• Voted-PLSA: our proposition.
• Conc-PLSA: PLSA over the concatenated feature representation.
• Fusion-LM: a late fusion approach for multilingual data (SIGIR 09).

Data collection

Reuters RCV1/RCV2 corpus: originally a comparable corpus in 5 languages with 6 classes. Each document is machine-translated into all the other languages. For example, the originally English corpus is translated into French, German, Italian, and Spanish, which gives one multilingual document corpus.

Original language    # documents
English              18758
French               26648
German               29953
Italian              24039
Spanish              12342

This yields five multilingual document corpora, one per original language.
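For the Conc-PLSA baseline, the "concatenated feature representation" can be pictured as placing the per-language term-count matrices side by side. The sketch below assumes each view is a dense (documents x vocabulary) array whose rows refer to the same documents in the same order; the helper name `concatenate_views` and the toy matrices are hypothetical.

```python
import numpy as np

def concatenate_views(view_counts):
    """Stack per-language (documents x vocabulary) count matrices side by side."""
    n_docs = {m.shape[0] for m in view_counts}
    if len(n_docs) != 1:
        raise ValueError("every view must describe the same documents")
    return np.hstack(view_counts)

# Hypothetical toy views: 2 documents, 3 English terms and 2 French terms.
english = np.array([[2, 0, 1],
                    [0, 3, 0]])
french = np.array([[1, 1],
                   [0, 2]])
concatenated = concatenate_views([english, french])   # shape (2, 5)
```

A single PLSA model fitted on `concatenated` then sees the vocabularies of all languages as one feature space.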

Results

Evaluation criteria: micro-averaged precision/recall and normalized mutual information (NMI).
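Both criteria can be computed from the cluster index and the reference class of each document. The poster does not give the exact definitions, so the sketch below rests on two assumptions: each cluster is labelled with its majority class before precision is micro-averaged, and NMI is normalised by the geometric mean of the two entropies (other normalisations exist). The function names are hypothetical.

```python
import math
from collections import Counter

def micro_precision(clusters, classes):
    """Micro-averaged precision, labelling each cluster with its majority class."""
    best = {}
    for (k, _), n in Counter(zip(clusters, classes)).items():
        best[k] = max(best.get(k, 0), n)
    # Correctly placed documents summed over clusters, divided by all documents.
    return sum(best.values()) / len(clusters)

def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(clusters, classes):
    """Mutual information divided by sqrt(H(clusters) * H(classes))."""
    n = len(clusters)
    k_counts, c_counts = Counter(clusters), Counter(classes)
    mi = 0.0
    for (k, c), n_kc in Counter(zip(clusters, classes)).items():
        mi += (n_kc / n) * math.log(n * n_kc / (k_counts[k] * c_counts[c]))
    denom = math.sqrt(entropy(clusters) * entropy(classes))
    return mi / denom if denom > 0 else 0.0

# Hypothetical toy labelling: 6 documents, 2 clusters, 2 reference classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "a"]
precision = micro_precision(toy_clusters, toy_classes)   # 4 of 6 documents match
score = nmi(toy_clusters, toy_classes)
```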

Each result is the average over 10 experiments with random initializations.

• Single-view clustering then voting: pre-assigned clusters

Original language (corpus name)    % of documents in pre-assigned clusters    Micro-averaged precision over pre-assigned clusters
English (L1)                       51.18                                      0.79
French (L2)                        63.85                                      0.78
German (L3)                        67.44                                      0.80
Italian (L4)                       58.03                                      0.60
Spanish (L5)                       73.73                                      0.81
Average                            62.84                                      0.76

• Multi-view clustering: performance comparison of our approach and others

Stage II - Voting & Multi-view clustering

For each document d, obtain the voting pattern (z_d^1, …, z_d^V), where z_d^v is the estimated topic index of d in the v-th language.

Voting: group documents with similar voting patterns. How similar? At least V - 1 topics are common.
- Cluster signature: documents having the same voting pattern.

[Figure: example voting patterns such as (6, 3, 3, 3, 2), (2, 1, 6, 2, 6), (5, 6, 1, 4, 5), (5, 2, 5, 1, 1), (3, 4, 2, 6, 3), (1, 4, 2, 6, 3), (4, 5, 4, 5, 1), matched against cluster signatures. Labels: clustered docs (C1, C2, C3, C4, …, C6) and remaining docs.]

Regroup the remaining documents (those whose voting patterns differ from every selected cluster signature) by applying PLSA to the concatenated feature vectors. The final PLSA model is learned in the concatenated feature space, subject to p(c|d) = 1 if d ∈ {selected cluster signature for c}, = 0 otherwise.
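The voting step and the resulting constraint on p(c|d) can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes that the most frequent voting patterns are the ones selected as cluster signatures, it only uses exact pattern matches (the relaxed "at least V - 1 topics in common" criterion is not implemented), and the names `vote` and `fixed_posteriors` are hypothetical.

```python
from collections import Counter
import numpy as np

def vote(patterns, n_clusters):
    """Pre-assign documents whose voting pattern is a selected cluster signature.

    patterns: one tuple of V topic indices per document.
    Returns (signatures, assignment); assignment[i] is a cluster index,
    or -1 for a document left to the final PLSA stage.
    """
    # Assumption: the n_clusters most frequent patterns become the signatures.
    signatures = [p for p, _ in Counter(patterns).most_common(n_clusters)]
    index = {p: c for c, p in enumerate(signatures)}
    assignment = np.array([index.get(p, -1) for p in patterns])
    return signatures, assignment

def fixed_posteriors(assignment, n_clusters):
    """Rows of p(c|d) for pre-assigned documents: 1 for their cluster, 0 elsewhere."""
    fixed = assignment >= 0
    p_c_d = np.zeros((len(assignment), n_clusters))
    p_c_d[fixed, assignment[fixed]] = 1.0
    return fixed, p_c_d

# Hypothetical toy input: 6 documents, V = 5 views, 2 clusters to pre-assign.
toy_patterns = (
    [(6, 3, 3, 3, 2)] * 3
    + [(2, 1, 6, 2, 6)] * 2
    + [(5, 6, 1, 4, 5)]
)
signatures, assignment = vote(toy_patterns, n_clusters=2)
fixed, p_c_d = fixed_posteriors(assignment, n_clusters=2)
```

In a final PLSA run over the concatenated features, the rows of `p_c_d` flagged by `fixed` would be held constant during EM, while the rows of the remaining documents are re-estimated as usual.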

Average performance over five languages

Strategy      Micro-averaged precision    NMI
Voted-PLSA    0.65                        0.44
Conc-PLSA     0.63↓                       0.41↓
Fusion-LM     0.61↓                       0.41↓

Conclusion

• A new multi-view clustering approach for multilingual document clustering.
• A voting mechanism over the different views lets the cluster signatures pre-cluster strongly related documents.
• Starting from these pre-assigned clusters, PLSA learning enhances clustering performance.