Particle Swarm Optimization for Clustering Short-Text Corpora
Diego A. Ingaramo, Marcelo L. Errecalde, Leticia C. Cagnina
{daingara, merreca, lcagnina}@unsl.edu.ar
LIDIC, Universidad Nacional de San Luis, Argentina
Paolo Rosso
[email protected]
NLE Lab, DSIC, Universidad Politécnica de Valencia, Spain
Introduction
What is document clustering? Finding groups of documents such that the documents in a group are similar (or related) to one another and different from (or unrelated to) the documents in other groups.
What is the problem we are working on?
Main goal: to develop effective algorithms for the problem of clustering short-text corpora.
These algorithms assign documents to unknown categories in an unsupervised way.
Our interest is in clustering short texts in general, and narrow-domain short texts in particular.
Why is it important?
Applicability in different areas of text processing: text mining (question answering, etc.), summarization, and information retrieval.
Growing tendency of people to use "small languages": blogs, text messages, snippets.
Why is this problem difficult?
General problems of text clustering: synonymy and polysemy.
Additional difficulties due to the low frequencies of the document terms and the high degree of overlap among their vocabularies.
These aspects can negatively affect the estimation of how similar the documents are and (in consequence) the whole clustering process.
What questions are we trying to answer in our work?
Are bio-inspired algorithms suitable approaches for clustering short-text corpora? Which are the most effective?
Could we consider clustering as an optimization problem?
Is it possible to use unsupervised measures of cluster validity as functions to be optimized?
Clustering process: an overview
A more detailed look at the document clustering process…
Clustering as Optimization
Document clustering is the assignment of documents to unknown categories.
In real document clustering problems, the results cannot be evaluated with typical external measures such as F-Measure, because the correct categorization is not available.
For that reason, the quality of the resulting groups is evaluated with Internal Clustering Validity Measures, such as Global Silhouette (GS) and the Expected Density Measure (ρ).
GS and ρ were selected as optimization functions because they give a reasonable estimation of the quality of the groups obtained.
Global Silhouette (GS): the average of the cluster silhouettes of all clusters. The cluster silhouette of a cluster C is the average of the silhouette coefficients s(i) of all its objects, which are obtained with:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$
where a(i) is the average dissimilarity of object i to the remaining objects in its cluster, and b(i) is the average dissimilarity of object i to all objects in the nearest cluster.
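The GS definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the full dissimilarity matrix `dist` given as nested lists, and the convention of assigning s(i) = 0 to objects in singleton clusters are not taken from the slides.

```python
from collections import defaultdict


def silhouette_coefficients(dist, labels):
    """Return s(i) for every object, given a full dissimilarity matrix."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    def avg_dist(i, members):
        return sum(dist[i][j] for j in members) / len(members)

    coeffs = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: singleton clusters (or a single cluster) give 0.
            coeffs.append(0.0)
            continue
        a = avg_dist(i, own)  # average dissimilarity inside its own cluster
        b = min(avg_dist(i, m) for l, m in clusters.items() if l != lab)  # nearest cluster
        denom = max(a, b)
        coeffs.append((b - a) / denom if denom > 0 else 0.0)
    return coeffs


def global_silhouette(dist, labels):
    """Average of the cluster silhouettes (each one an average of s(i))."""
    per_cluster = defaultdict(list)
    for s, lab in zip(silhouette_coefficients(dist, labels), labels):
        per_cluster[lab].append(s)
    cluster_sil = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_sil) / len(cluster_sil)
```

Values close to 1 indicate compact, well-separated groups; negative values indicate that many objects sit closer to another cluster than to their own.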
Expected Density Measure (ρ): defined for a clustering C of a weighted graph. It relates the weight of each cluster's subgraph to the weight expected from the density θ of the whole graph.
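As a rough sketch, the expected density can be computed as below. The code assumes the definition commonly given in the literature: ρ(C) = Σ_i (|V_i| / |V|) · w(G_i) / |V_i|^θ, with w(G) = |V| + Σ w(e) and θ = ln w(G) / ln |V|. That this matches the equations on the slides term by term is an assumption, as are the function name and the symmetric similarity matrix `sim` used as input.

```python
import math


def expected_density(sim, labels):
    """Expected density of a clustering over a weighted similarity graph.

    sim    -- symmetric matrix (nested lists) of edge weights
    labels -- labels[i] is the cluster of node i; needs at least 2 nodes
    """
    n = len(labels)

    def graph_weight(nodes):
        # Assumed definition: w(G) = |V| + sum of edge weights (each edge once).
        total = float(len(nodes))
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                total += sim[nodes[a]][nodes[b]]
        return total

    # Density exponent of the whole graph, from w(G) = |V|^theta.
    theta = math.log(graph_weight(list(range(n)))) / math.log(n)

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    return sum(
        (len(members) / n) * graph_weight(members) / (len(members) ** theta)
        for members in clusters.values()
    )
```

Under this definition, placing all documents in a single cluster yields exactly 1, and higher values indicate groups that are denser than the graph as a whole.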
CLUDIPSO
Clustering with a Discrete Particle Swarm Optimizer (CLUDIPSO): representation of solutions (particles); equations to update particles and velocities; process of updating particles; dynamic mutation operator to avoid premature convergence.
Representation of solutions (particles): A particle represents a valid clustering. For a collection of n documents, the particle is an n-dimensional vector. Each dimension represents a document, and its value is the group to which that document is assigned.
Example particle (n documents, k groups):
document: 1 2 3 4 … n
group: 1 3 3 2 … 1
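The representation above can be illustrated with a short Python sketch. The helper names `random_particle` and `particle_to_groups` are hypothetical, and documents and groups are numbered from 1 only to mirror the example.

```python
import random


def random_particle(n_docs, k_groups, rng):
    """Build a particle: one group label in 1..k_groups per document."""
    return [rng.randint(1, k_groups) for _ in range(n_docs)]


def particle_to_groups(particle):
    """Decode a particle into {group: [document numbers]} (documents start at 1)."""
    groups = {}
    for doc, group in enumerate(particle, start=1):
        groups.setdefault(group, []).append(doc)
    return groups


# First four dimensions of the example particle: documents 2 and 3 share group 3.
example = particle_to_groups([1, 3, 3, 2])
```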
Equations to update velocities and positions:
Process of updating particles: In our approach, the process of updating particles is not as direct as in the continuous case. In CLUDIPSO, the updating process is not carried out on all dimensions at each iteration. To determine which dimensions of a particle will be updated, we follow these steps: 1) all dimensions of the velocity vector are normalized to the [0..1] range; 2) a random number r in [0..1] is generated; 3) all the dimensions whose normalized velocity is higher than r are selected in the position vector and updated using the position-update equation.
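The three steps can be sketched as follows. Two points are assumptions rather than statements from the slides: the velocity is normalized with min-max scaling, and a selected dimension is updated by copying the value of the particle's personal best (`pbest`). The function names are hypothetical.

```python
import random


def normalize(velocity):
    """Step 1: scale the velocity to [0, 1] (min-max scaling is an assumption)."""
    lo, hi = min(velocity), max(velocity)
    if hi == lo:
        return [0.0] * len(velocity)
    return [(v - lo) / (hi - lo) for v in velocity]


def update_position(position, velocity, pbest, rng):
    """Steps 2 and 3: draw r and update only dimensions whose velocity exceeds r."""
    norm = normalize(velocity)
    r = rng.random()  # random threshold in [0, 1)
    new_position = list(position)
    for d, v in enumerate(norm):
        if v > r:
            # Assumed update rule: take the group stored in the personal best.
            new_position[d] = pbest[d]
    return new_position


# Usage with a seeded generator so that runs are repeatable.
updated = update_position([1, 1, 2, 3], [0.2, 1.5, 0.1, 0.9], [2, 3, 2, 1], random.Random(42))
```

Because only a subset of the dimensions changes at each iteration, a particle keeps part of its current clustering instead of being overwritten as a whole.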
Dynamic mutation operator to avoid premature convergence: applied to each particle with probability pm.
This operator swaps two random dimensions of the particle.
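A minimal sketch of the mutation operator follows. The swap itself is as described above; the linear schedule that moves pm from pm_max down to pm_min over the run is an assumption used to illustrate what "dynamic" could mean, and the function names are hypothetical.

```python
import random


def mutation_probability(iteration, max_iterations, pm_min=0.4, pm_max=0.9):
    """Assumed schedule: pm decreases linearly from pm_max to pm_min."""
    return pm_max - (pm_max - pm_min) * iteration / max_iterations


def mutate(particle, pm, rng):
    """With probability pm, swap the values of two random dimensions."""
    mutated = list(particle)
    if len(mutated) >= 2 and rng.random() < pm:
        i, j = rng.sample(range(len(mutated)), 2)  # two distinct dimensions
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


# Usage: mutate a particle halfway through a 10,000-iteration run.
rng = random.Random(7)
pm = mutation_probability(5000, 10000)
child = mutate([1, 3, 3, 2, 1], pm, rng)
```

Swapping two dimensions keeps the number of documents in each group unchanged, so the mutation perturbs the clustering without emptying any group.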
CLUCOPSO
Clustering with a Continuous Particle Swarm Optimizer (CLUCOPSO): representation of solutions (particles, representing centroids); equations to update particles and velocities; process of updating particles; dimensionality reduction in document representation.
Representation of solutions (particles, representing centroids): In the continuous version (CLUCOPSO), the particles are (K × T)-dimensional real vectors, in which K centroids (one for each cluster) of T terms each are stored contiguously, one centroid after another.
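The contiguous layout can be sketched as below. Splitting the flat vector into K centroids follows the description; assigning each document to its nearest centroid by squared Euclidean distance is an assumption added only to show how such a particle could be decoded into a clustering. The function names are hypothetical.

```python
def particle_to_centroids(particle, k, t):
    """Split a flat (k * t)-dimensional particle into k centroids of t terms."""
    if len(particle) != k * t:
        raise ValueError("particle length must be k * t")
    return [particle[c * t:(c + 1) * t] for c in range(k)]


def assign_documents(docs, centroids):
    """Assumed decoding: each document goes to its nearest centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(centroids)), key=lambda c: sq_dist(doc, centroids[c]))
        for doc in docs
    ]


# Usage: K = 3 centroids over T = 2 terms, stored one after another.
centroids = particle_to_centroids([0.0, 0.0, 1.0, 1.0, 5.0, 5.0], k=3, t=2)
labels = assign_documents([[0.1, 0.0], [4.0, 6.0]], centroids)
```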
Equations to update velocities and positions: This version is similar to the classical PSO algorithm with respect to the gbest and pbest particles and the updating formulas for position and velocity.
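For reference, one step of the classical PSO update in its usual textbook form looks like this in Python. Whether CLUCOPSO uses exactly this form (for example, the same placement of the inertia weight w) is an assumption; the default values w = 0.7 and γ1 = γ2 = 1.7 are the CLUCOPSO parameters reported in the experiments, and `pso_step` is a hypothetical name.

```python
import random


def pso_step(position, velocity, pbest, gbest,
             w=0.7, gamma1=1.7, gamma2=1.7, rng=random):
    """One classical PSO update for a single particle.

    v <- w * v + gamma1 * r1 * (pbest - x) + gamma2 * r2 * (gbest - x)
    x <- x + v
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        v_new = w * v + gamma1 * r1 * (pb - x) + gamma2 * r2 * (gb - x)
        new_velocity.append(v_new)
        new_position.append(x + v_new)
    return new_position, new_velocity


# Usage: a particle pulled toward its personal best and the global best.
pos, vel = pso_step([0.0, 0.0], [0.1, -0.1], [1.0, 1.0], [2.0, 2.0],
                    rng=random.Random(0))
```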
Dimensionality reduction in document representation: A dimensionality reduction phase is often applied to shrink the document representations to a much smaller size, which usually makes the problem more manageable for the clustering method. PSO-based methods can be seriously affected by high dimensionality in the document representation. The dimensionality of each centroid depends directly on the dimensionality of the vectors used to represent the documents; consequently, a dimensionality reduction phase is usually required to keep the particles at an acceptable size.
Dimensionality reduction often takes the form of feature selection or feature extraction. For CLUCOPSO, we used both forms of dimensionality reduction:
a feature selection method named Transition Point (terms whose frequency is closest to the transition point tp are used as index terms for the VSM);
a feature extraction method known as Random Indexing (accumulation of context vectors based on the occurrence of words in contexts).
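The Transition Point selection can be sketched as follows. The code assumes the formula usually quoted for the transition point, tp = (√(8·I₁ + 1) − 1) / 2, where I₁ is the number of terms that occur exactly once; the slides do not state the formula, so treat it and the helper names as assumptions. Ties are broken alphabetically only to make the output deterministic.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: tp = (sqrt(8 * I1 + 1) - 1) / 2, with I1 = number of hapaxes."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(tokens, n_terms):
    """Keep the n_terms terms whose frequency is closest to the transition point."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    ranked = sorted(freqs, key=lambda term: (abs(freqs[term] - tp), term))
    return ranked[:n_terms]


# Usage: three terms occur once, so tp = 2 and "a" (frequency 2) ranks first.
tokens = "a a b b b c d e f f f f f".split()
index_terms = select_terms(tokens, 2)
```

Keeping only a fixed number of mid-frequency terms bounds T, and with it the K × T size of each CLUCOPSO particle.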
Data Sets
We selected three short-text collections to test our approach:
CICLing-2002: considered in many research works to be a high-complexity corpus, with short documents and high vocabulary overlap.
EasyAbstracts: a collection easier than CICLing-2002 with respect to the degree of overlap of the documents' vocabulary.
Micro4News: a collection easier than EasyAbstracts and CICLing-2002 with respect to document length and vocabulary overlap.
Test & results
Parameters for CLUDIPSO: 50 independent runs for each collection; 10,000 iterations per run; swarm size: 50 particles; pm_min = 0.4; pm_max = 0.9; w = 0.9; γ1 = γ2 = 1.0.
Parameters for CLUCOPSO: 50 independent runs for each collection; 10,000 iterations per run; swarm size: 50 particles; pm_min = 0.4; pm_max = 0.9; w = 0.7; γ1 = γ2 = 1.7.
Conclusions
We introduce new ideas for clustering short-text corpora:
CLUDIPSO, a novel discrete PSO-based algorithm adapted for this kind of problem,
CLUCOPSO, a continuous PSO-based algorithm adapted for this kind of problem,
the use of two interesting Internal Clustering Validity Measures, Global Silhouette and Expected Density, as explicit objective functions to be optimized.
The preliminary results obtained by CLUDIPSO and CLUCOPSO indicate that our approach is a highly competitive alternative to solve problems of clustering short-text corpora.
Thanks / Grazie
Paolo Rosso. Una formalizzazione degli schemi senso-motori di Arbib (A formalization of Arbib's sensorimotor schemas). Laurea thesis, Università di Pisa, 1992. Co-advisors: Andrea Maggiolo-Schettini and Antonina Starita.