Particle Swarm Optimization for Clustering Short-Text Corpora

Diego A. Ingaramo, Marcelo L. Errecalde, Leticia C. Cagnina
{daingara, merreca, lcagnina}@unsl.edu.ar
LIDIC - Universidad Nacional de San Luis - Argentina

Paolo Rosso
NLE Lab - DSIC - Universidad Politécnica de Valencia - Spain

Introduction

What is document clustering?

Finding groups of documents such that the documents in a group are similar (or related) to one another and different from (or unrelated to) the documents in other groups.

What is the problem we are working on?

Main goal: to develop effective algorithms for the problem of clustering short-text corpora.

These algorithms assign documents to unknown categories in an unsupervised way.

Our interest is in clustering short texts in general, and narrow-domain short texts in particular.

Why is it important?

Applicability in different areas of text processing: text mining (question answering, etc.), summarization, and information retrieval.

The growing tendency of people to use “small languages”: blogs, text messages, snippets.

Why is this problem difficult?

General problems of text clustering: synonymy and polysemy.

Additional difficulties due to the low frequencies of the document terms and the high degree of overlap between document vocabularies.

These aspects can negatively affect the estimation of how similar the documents are and (in consequence) the whole clustering process.

What questions are we trying to answer in our work?

Are bio-inspired algorithms suitable approaches for clustering short-text corpora? Which are the most effective?

Could we consider clustering as an optimization problem?

Is it possible to use unsupervised measures of cluster validity as functions to be optimized?

Clustering process: an overview

A more detailed look at the document clustering process…

Clustering as Optimization

Document clustering is the assignment of documents to unknown categories.

In real document clustering problems, the results cannot be evaluated with typical external measures such as F-Measure, because the correct categorization is not available.


For that reason, the quality of the resulting groups is evaluated with Internal Clustering Validity Measures such as Global Silhouette (GS) and the Expected Density Measure (ρ).

GS and ρ were selected as optimization functions because they give a reasonable estimation of the quality of the groups obtained.


Global Silhouette (GS) is the average of the cluster silhouettes of all clusters. The cluster silhouette of a cluster C is the average of the silhouette coefficients s(i) of all its objects, which are obtained with:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where b(i) is the average dissimilarity of object i to all objects in the nearest cluster, and a(i) is the average dissimilarity of object i to the remaining objects in its own cluster.
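As a rough illustration of the GS definition, the sketch below computes it from a precomputed dissimilarity matrix, assuming the usual silhouette coefficient s(i) = (b(i) - a(i)) / max{a(i), b(i)}. The function name, the matrix layout, and the convention of scoring singleton clusters as 0 are assumptions made for this example only.

```python
from collections import defaultdict


def global_silhouette(dist, labels):
    """Average of per-cluster silhouettes, given a symmetric dissimilarity matrix."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    def mean_dist(i, members):
        return sum(dist[i][j] for j in members) / len(members)

    cluster_scores = []
    for label, members in clusters.items():
        scores = []
        for i in members:
            others = [j for j in members if j != i]
            if not others or len(clusters) < 2:
                scores.append(0.0)  # assumed convention: no silhouette for singletons
                continue
            a = mean_dist(i, others)  # own cluster, excluding i
            b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
        cluster_scores.append(sum(scores) / len(scores))
    return sum(cluster_scores) / len(cluster_scores)


# Hypothetical toy data: documents 0-1 and 2-3 form two tight groups.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
print(global_silhouette(toy_dist, [1, 1, 2, 2]))
print(global_silhouette(toy_dist, [1, 2, 1, 2]))
```

A clustering that matches the two tight groups scores close to 1, while one that splits them scores below 0, which is what makes GS usable as a function to maximize.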



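Expected Density Measure (ρ): the expected density of a clustering C = {C_1, …, C_k} of a weighted graph G = ⟨V, E, w⟩ is

$$\rho(C) = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}}$$

where G_i is the subgraph induced by cluster C_i, V_i is its node set, and $w(G) = |V| + \sum_{e \in E} w(e)$. The density θ of the graph is computed from the equation $w(G) = |V|^{\theta}$ as

$$\theta = \frac{\ln w(G)}{\ln |V|}$$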
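The following sketch shows one way the expected density could be computed. It assumes the commonly used definition ρ(C) = Σ_i (|V_i|/|V|) · w(G_i)/|V_i|^θ, with w(G) = |V| + Σ w(e) and θ = ln w(G) / ln |V|, and treats the pairwise document similarities as the edge weights of a complete graph. These choices and the function name are assumptions made for illustration.

```python
import math


def expected_density(sim, labels):
    """Expected density of a clustering over a similarity matrix (needs >= 2 documents)."""
    n = len(sim)

    def graph_weight(nodes):
        # w(G) = |V| + sum of the edge weights inside the node set
        edges = sum(sim[i][j] for pos, i in enumerate(nodes) for j in nodes[pos + 1:])
        return len(nodes) + edges

    # theta solves w(G) = |V| ** theta for the whole graph
    theta = math.log(graph_weight(list(range(n)))) / math.log(n)

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    return sum(
        (len(nodes) / n) * graph_weight(nodes) / len(nodes) ** theta
        for nodes in clusters.values()
    )


# Hypothetical toy similarities: documents 0-1 and 2-3 are the similar pairs.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
print(expected_density(toy_sim, [1, 1, 2, 2]))
print(expected_density(toy_sim, [1, 2, 1, 2]))
```

Under this definition, putting every document in a single cluster gives ρ = 1, so values above 1 indicate clusters denser than the graph as a whole.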

CLUDIPSO

Clustering with a Discrete Particle Swarm Optimizer (CLUDIPSO)

Main components: representation of solutions (particles); equations to update particles and velocities; process of updating particles; dynamic mutation operator to avoid premature convergence.

Representation of solutions (particles): A particle represents a valid clustering. For a collection of n documents, the particle is an n-dimensional vector. Each dimension represents a document, and its value is the group to which that document is assigned.

Example particle (n documents, k groups):

document: 1 2 3 4 … n
group:    1 3 3 2 … 1
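To make the representation concrete, here is a minimal sketch that turns a particle into its groups of documents. The five-document particle and the function name are hypothetical, and documents are numbered from 1 for this example.

```python
def particle_to_groups(particle):
    """Map each group label to the (1-based) documents assigned to it."""
    groups = {}
    for doc, group in enumerate(particle, start=1):
        groups.setdefault(group, []).append(doc)
    return groups


# Hypothetical particle for n = 5 documents and k = 3 groups.
particle = [1, 3, 3, 2, 1]
print(particle_to_groups(particle))  # {1: [1, 5], 3: [2, 3], 2: [4]}
```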

Equations to update velocities and positions:

Process of updating particles: In our approach, the process of updating particles is not as direct as in the continuous case. In CLUDIPSO, the updating process is not carried out on all dimensions at each iteration. To determine which dimensions of a particle will be updated, we follow these steps:

1) all dimensions of the velocity vector are normalized to the [0..1] range;
2) a random number r in [0..1] is generated;
3) all the dimensions whose normalized velocity is higher than r are selected in the position vector and updated using the position-update equation.
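The selection steps above can be sketched as follows. Two details are assumptions for illustration only: the velocity is normalized with min-max scaling, and a selected dimension is updated by copying the value from the particle's personal best (pbest), since the position-update equation itself is not reproduced here. All names are hypothetical.

```python
import random


def select_dimensions(velocity, r):
    """Return the indices whose min-max normalized velocity is higher than r."""
    lo, hi = min(velocity), max(velocity)
    span = hi - lo
    normalized = [(v - lo) / span if span > 0 else 0.0 for v in velocity]
    return [d for d, v in enumerate(normalized) if v > r]


def update_position(position, velocity, pbest, rng=random):
    """Update only the selected dimensions of a discrete position vector."""
    r = rng.random()  # step 2: random threshold in [0, 1)
    new_position = list(position)
    for d in select_dimensions(velocity, r):
        new_position[d] = pbest[d]  # ASSUMED update rule, for illustration only
    return new_position


print(select_dimensions([0.0, 2.0, 4.0, 1.0], 0.4))  # [1, 2]
```

Because r is drawn anew at each call, the number of dimensions that change varies from one iteration to the next.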

Dynamic mutation operator to avoid premature convergence: applied to each particle with probability pm.

This operator swaps two random dimensions of the particle.
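A minimal sketch of such a swap mutation is shown below. How pm changes during a run is not described here, so the sketch simply takes pm as an argument; the function name and the rng parameter are hypothetical.

```python
import random


def mutate(particle, pm, rng=random):
    """With probability pm, swap the values of two randomly chosen dimensions."""
    mutated = list(particle)
    if len(mutated) >= 2 and rng.random() < pm:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(42)  # seeded only to make the example repeatable
print(mutate([1, 3, 3, 2, 1], pm=0.9, rng=rng))
```

A swap keeps the number of documents in each group unchanged, so the mutated particle is still a valid clustering with the same group sizes.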

CLUCOPSO

Clustering with a Continuous Particle Swarm Optimizer (CLUCOPSO)

Main components: representation of solutions (particles, representing centroids); equations to update particles and velocities; process of updating particles; dimensionality reduction in document representation.

Representation of solutions (particles, representing centroids): In the continuous version (CLUCOPSO), the particles are (K × T)-dimensional real vectors, in which K centroids (one for each cluster) of T terms each are stored contiguously:
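The contiguous layout can be illustrated with a short sketch that slices a flat particle into its K centroids. Assigning each document to its nearest centroid by squared Euclidean distance is an assumption added here to show how a particle induces a clustering; the function names are hypothetical.

```python
def decode_centroids(particle, k, t):
    """Split a flat (k * t)-dimensional particle into k centroids of t terms."""
    if len(particle) != k * t:
        raise ValueError("particle length must equal k * t")
    return [particle[c * t:(c + 1) * t] for c in range(k)]


def assign_documents(docs, centroids):
    """Assign each document vector to its nearest centroid (assumed rule)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(centroids)), key=lambda c: sq_dist(doc, centroids[c]))
        for doc in docs
    ]


# Hypothetical particle with K = 2 centroids over T = 2 terms.
centroids = decode_centroids([1.0, 0.0, 0.0, 1.0], k=2, t=2)
print(centroids)                                               # [[1.0, 0.0], [0.0, 1.0]]
print(assign_documents([[0.9, 0.1], [0.2, 0.8]], centroids))   # [0, 1]
```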

Equations to update velocities and positions: This version is similar to the classical PSO algorithm with respect to the gbest and pbest particles and the updating formulas for position and velocity:
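For reference, the sketch below implements one step of the textbook inertia-weight PSO update, v = w·v + c1·r1·(pbest - x) + c2·r2·(gbest - x) followed by x = x + v. It is not necessarily the exact form used in CLUCOPSO. The default values borrow w = 0.7 and γ1 = γ2 = 1.7 from the CLUCOPSO parameters, and mapping γ1, γ2 onto the acceleration coefficients c1, c2 is an assumption.

```python
import random


def pso_step(x, v, pbest, gbest, w=0.7, c1=1.7, c2=1.7, rng=random):
    """One textbook PSO update; returns the new position and velocity."""
    new_v = []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        new_v.append(
            w * v[d]
            + c1 * r1 * (pbest[d] - x[d])   # pull towards the particle's own best
            + c2 * r2 * (gbest[d] - x[d])   # pull towards the swarm's best
        )
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v


x, v = pso_step([0.0, 0.0], [0.1, -0.1], pbest=[1.0, 0.5], gbest=[2.0, 1.0])
print(x, v)
```

When a particle already sits on both pbest and gbest, only the inertia term remains, so its velocity shrinks by the factor w at each step.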

Dimensionality reduction in document representation: A dimensionality reduction phase is often applied to shrink the document representations to a much smaller size, usually to make the problem more manageable for the clustering method. PSO-based methods can be seriously affected by high dimensionality in the document representation. The dimensionality of each centroid depends directly on the dimensionality of the vectors used to represent the documents. Consequently, a dimensionality reduction phase will usually be required to obtain particles of an acceptable size.

Dimensionality reduction in document representation: Dimensionality reduction often takes the form of feature selection or feature extraction. For CLUCOPSO, we used both forms of dimensionality reduction:

a feature selection method named Transition Point (use of terms whose frequency is closest to tp as indexes for VSM)

a feature extraction method known as Random Indexing (accumulation of context vectors based on the occurrence of words in contexts)
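As an illustration of the feature selection step, the sketch below picks the terms whose frequency is closest to the transition point. It assumes the usual formula tp = (sqrt(8·I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once; that formula, the alphabetical tie-breaking, and the function names are assumptions for this example.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Transition point from the number of terms that occur exactly once."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(freqs, size):
    """Keep the `size` terms whose frequency is closest to the transition point."""
    tp = transition_point(freqs)
    ranked = sorted(freqs, key=lambda term: (abs(freqs[term] - tp), term))
    return ranked[:size]


# Hypothetical toy text: three terms occur once, so tp = (sqrt(25) - 1) / 2 = 2.
freqs = Counter("a a a a a b b c c d e f".split())
print(transition_point(freqs))   # 2.0
print(select_terms(freqs, 2))    # ['b', 'c']
```

The selected terms would then serve as the index terms of the vector space model, which bounds the centroid size T used by the particles.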

Data Sets

We selected three short-text collections to test our approach:

CICLing-2002: considered in many research works as a high-complexity corpus, with short documents and high vocabulary overlap.

EasyAbstracts: a collection easier than CICLing-2002 with respect to the degree of overlap between document vocabularies.

Micro4News: a collection easier than EasyAbstracts and CICLing-2002 with respect to document length and vocabulary overlap.

Test & results

Parameters for CLUDIPSO: 50 independent runs for each collection; 10,000 iterations per run; swarm size: 50 particles; pm_min = 0.4; pm_max = 0.9; w = 0.9; γ1 = γ2 = 1.0.

Parameters for CLUCOPSO: 50 independent runs for each collection; 10,000 iterations per run; swarm size: 50 particles; pm_min = 0.4; pm_max = 0.9; w = 0.7; γ1 = γ2 = 1.7.

Conclusions

We introduce new ideas for clustering short-text corpora:

CLUDIPSO, a novel discrete PSO-based algorithm adapted for this kind of problem,

CLUCOPSO, a continuous PSO-based algorithm adapted for this kind of problem,

the use of two interesting Internal Clustering Validity Measures, Global Silhouette and Expected Density, as explicit objective functions to be optimized.

The preliminary results obtained by CLUDIPSO and CLUCOPSO indicate that our approach is a highly competitive alternative for clustering short-text corpora.

Thanks / Grazie

Paolo Rosso. Una formalizzazione degli schemi senso-motori di Arbib [A formalization of Arbib's sensorimotor schemas]. Degree thesis (tesi di laurea), Università di Pisa, 1992. Co-supervisors: Andrea Maggiolo-Schettini and Antonina Starita.
