2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing
Immune Network Based Text Clustering Algorithm
Ma Li
Information Center
Xi’an University of Posts and Telecommunications
Xi’an, China
e-mail : [email protected]

Yang Lin, Bai Lin
Computer School
Xi’an University of Posts and Telecommunications
Xi’an, China
e-mail : [email protected]

Wang Rongxi
Schools of Mechanical Engineering
Xi’an JiaoTong University
Xi’an, China
e-mail : [email protected]
Abstract—The principles of the immune system and the monoclonal algorithm are introduced briefly. Focusing on text represented by the vector space model after semantic computation, an adaptive polyclonal clustering algorithm is proposed. Firstly, the calculation methods are defined for the affinity between antibodies and antigens and the affinity between antibodies, and the genetic operators are designed, including replacement, inverse, clonal, crossover, mutation, death, concatenation and clustering; secondly, the process of the algorithm is given; and lastly, clustering and analysis are carried out on the text sets of a corpus. The experiments verify that the proposed algorithm can obtain a rational number of clusters and has a better correct identification rate and recall rate.
Keywords- Artificial Immune Network; Clonal selection; Text Clustering analysis
978-0-7695-4761-9/12 $26.00 © 2012 IEEE DOI 10.1109/SNPD.2012.111

I. INTRODUCTION
The Artificial Immune System is one of the intelligent information processing methods[1]; it simulates the functions of the immune system of natural biological organisms. By simulating the natural defense mechanisms of biological organisms against external substances, such as clonal selection, genetic valuation, feedback and memory, it forms a model for information processing and problem solving, and it has been demonstrated and validated that it can be used in clustering analysis. De Castro and Von Zuben proposed an artificial immune network model for data clustering, aiNet[3], based on the mathematical model of the immune network given by Jerne[2]. A text clustering algorithm[4] based on the artificial immune network was proposed by Tang Na and V. Rao Vemuri; through data compression it achieves better cluster quality than other algorithms, but it also has some disadvantages, such as higher time complexity and a weak ability to find new clusters. Hang X applied the aiNet model to Web document clustering, which is able to find new clusters but suffers from low clustering accuracy. To address these problems in clustering approaches based on the artificial immune network, this paper introduces semantic computation of words into text feature dimension reduction and uses a polyclonal selection algorithm in the clustering process, in order to obtain a rational number of clusters and a better clustering effect.

II. THE BIONIC THEORY OF IMMUNE NETWORK AND CLONAL SELECTION
As a complex system composed of cells, molecules and organs, the immune system defends the body against infringement by external substances. After external substances (antigens) enter the body, the body generates a protective effect to get rid of them. In other words, the normal cells in the body generate antibodies against these antigens. This protective effect is completed in two stages, the first immune response and the second response. The former is the process of learning new knowledge, in other words, the process of identifying the diversity, and the latter is the memory of the entered antibody. The immune network is the mechanism by which the immune system expresses and stores the learned knowledge; the number of antibodies and the tension strength of the antibodies in this network change dynamically as antigens enter.
Clone is the asexual reproduction of a cell. As one of the most important concepts, it was first proposed by Jerne and expressed completely by Burnet. The concept is as follows: firstly, after the lymphocyte identifies the antigens, the B cell is activated and reproduces itself (clone); secondly, the cloned cell mutates (small changes) and generates the specific antibody. The mutated immune cells are divided into antibody and memory cells, and the classes and the scale of these cells develop in the direction of destroying the antigens. However, in the process of cloning, the information is simply copied from the parent generation to the child generation, and there is no exchange of different information, so cloning alone cannot enhance the evolution. Therefore, the cloned child generation needs further processing.
Based on the theory of clonal selection, an information processing model[8] structured as a group of antibodies has been abstracted. The entrance of external antigens triggers the clonal selection of the antibody colony. The process of clonal selection can be expressed as the following random process:
$$A(k) \xrightarrow{\text{clone}} A'(k) \xrightarrow{\text{immune genetic operation}} A''(k) \xrightarrow{\text{selection}} A(k+1),$$
where A(k) stands for the k-th generation antibody colony, and clone, immune genetic operation and selection stand for the stages of cloning, immune gene operation and compression selection, respectively. A(k+1) stands for the (k+1)-th generation antibody colony. Clonal selection is a random mapping induced by affinity in the artificial immune system. In order to enlarge the search area, prevent premature convergence to a local minimum and accelerate the convergence speed, the clone generates, based on the affinity, a colony of mutated solutions in the neighborhood of the optimal solution of each generation.
In the above process of clonal selection, the immune genetic operation includes crossover and mutation. A clonal algorithm that uses only the mutation operation is called a monoclonal algorithm, and one that uses both operations is called a polyclonal algorithm. Because of the crossover, the latter algorithm inherits more features of the parent generation than the former. Through the crossover, mutation and selection of the polyclonal algorithm, it enhances the local search ability, the ability to escape local minima, and the convergence speed.

III. DOCUMENT PREPROCESSING
The core subject of a document can be represented by the keywords that carry semantic features. The features can be obtained by word segmentation, part-of-speech tagging and dimension reduction. The processing, and the resulting representation of a document as a set of feature keywords, are shown in Figure 1.
[Figure 1: An example document ("Software-defined The computer system is running programs to realize various applications. ...") is segmented and part-of-speech tagged (Software/n, defined/n, computer/n, system/ad, is/v, to/p, running/vn, program/n, realize/v, ...); keyword extraction then yields feature keywords such as Software, system, computer, program, application, entity.]
Figure 1. Document preprocessing schematic
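The preprocessing flow of Figure 1 can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the `NOUNS` and `SYNONYMS` tables and the function `extract_keywords` are hypothetical stand-ins for a real word segmenter, part-of-speech tagger and semantic similarity computation, none of which are specified here.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only); a real system
# would use a word segmenter, a part-of-speech tagger and a semantic
# similarity measure instead.
NOUNS = {"software", "system", "computer", "program", "programs",
         "application", "applications"}
SYNONYMS = {"programs": "program", "applications": "application"}


def extract_keywords(text):
    """Return the ordered list of distinct feature keywords of a document."""
    tokens = re.findall(r"[a-z]+", text.lower())   # crude word segmentation
    nouns = [t for t in tokens if t in NOUNS]      # keep noun-like terms only
    merged = [SYNONYMS.get(t, t) for t in nouns]   # consolidate word variants
    return list(dict.fromkeys(merged))             # de-duplicate, keep order


doc = ("The computer system is running programs to realize "
       "various applications.")
keywords = extract_keywords(doc)
```

Merging variants before de-duplication is what shrinks the keyword set, which is the dimension reduction step in miniature.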
The input data for the Artificial Immune Network must be normalized, so the vector space model is used to represent documents in this paper. Each feature term stands for a keyword of a document. The term weight represents the presence of the corresponding keyword in the document: 1 represents presence and 0 absence. The specific operation process is as follows:
If the document set is $D = \{D_1, D_2, \ldots, D_i, \ldots, D_N\}$, $D_i$ is the i-th document, while N is the total number of documents. After document segmentation and feature dimension reduction, the keyword set $D' = \{D'_1, D'_2, \ldots, D'_i, \ldots, D'_N\}$ can be obtained. Here, $D'_i = \{k_{i,1}, k_{i,2}, \ldots, k_{i,j}, \ldots, k_{i,n(i)}\}$ is the keyword set of the i-th document, $k_{i,j}$ is the j-th keyword of the i-th document, and $n(i)$ is the number of keywords of the i-th document. All the keywords together form a keyword set $K = \{k_1, k_2, \ldots, k_M\}$, where M is the number of keywords in the set. Thus, all documents can be represented in a vector space model. Because synonyms and near-synonyms are widespread in natural language, words can be consolidated by using a semantic similarity computation method. This compresses the keyword set and reduces the dimension of the vector space.

IV. ADAPTIVE POLYCLONAL NETWORK DOCUMENT CLUSTERING ALGORITHM
A. The idea of algorithm
A novel document clustering algorithm based on the adaptive polyclonal clustering algorithm[9] is presented in this paper. The processing idea is as follows: a group of antibodies is selected randomly as the network nodes, and each vector representing a document is created and treated as an antigen. The algorithm, which is directed by the affinity function between antibodies (Ab) and antigens (Ag), contains seven operators: replacement, inverse, crossover, non-consistence mutation, clonal, death and concatenate. Thus the scope of the antibody group and the adaptability between Ab and Ag can be self-regulated, and both the global search and the local search are directed. These make the matching of antigens and antibodies much more effective. The right group, called the clustering prototype, is obtained by iteration through network learning and suppression. The prototype is the description of the data distribution, so the clustering analysis of the antigens (documents) is achieved.

B. The Definition
The task of document clustering is to get the clustering prototype. In this paper, the antibodies after dimension reduction processing are considered as input data. Antibodies are selected randomly at initialization. Set the antigen as $Ag = \{g_1, \ldots, g_i, \ldots, g_m\}$. Set the k-th generation antibody group as $A(k) = \{A_1(k), \ldots, A_i(k), \ldots, A_n(k)\}$, where $A_i(k) = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$ is the i-th antibody of the k-th generation antibody group.
Definition 1 The affinity between any two antibodies: The affinity between $A_i(k) = \{a_{i1}, \ldots, a_{im}\}$ and $A_j(k) = \{a_{j1}, \ldots, a_{jm}\}$ equals the measured distance $d(A_i(k), A_j(k))$ between them. The smaller it is, the less the difference between $A_i(k)$ and $A_j(k)$.
$$d(A_i(k), A_j(k)) = \frac{1}{m} \sum_{t=1}^{m} \varphi(a_{it}, a_{jt}) \tag{1}$$
where $\varphi(\cdot,\cdot)$ is:
$$\varphi(x, y) = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}$$
Definition 2 The affinity function between $A_i(k) = \{a_{i1}, \ldots, a_{im}\}$ and $Ag = \{g_1, \ldots, g_m\}$:
$$f(A_i(k), Ag) = \frac{1}{m} \sum_{t=1}^{m} \varphi(a_{it}, g_t) \tag{2}$$
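Definitions 1 and 2 can be illustrated with a short Python sketch. It assumes that documents are encoded as 0/1 vectors over the keyword set K, as described in Section III, and that the sum in Eqs. (1) and (2) is normalised by the vector length m. The names `to_binary_vector`, `phi` and `affinity`, and the toy data, are hypothetical and introduced only for illustration.

```python
def to_binary_vector(doc_keywords, vocabulary):
    """Encode a document as a 0/1 vector over the global keyword set K."""
    present = set(doc_keywords)
    return [1 if k in present else 0 for k in vocabulary]


def phi(x, y):
    """Indicator used in Eqs. (1) and (2): 1 if the two genes match, else 0."""
    return 1 if x == y else 0


def affinity(a, b):
    """Mean of phi over the m positions (assumed 1/m normalisation).

    The same computation serves for d(A_i, A_j) and for f(A_i, Ag).
    """
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(phi(x, y) for x, y in zip(a, b)) / len(a)


K = ["software", "system", "computer", "program"]        # toy keyword set
antigen = to_binary_vector(["software", "program"], K)    # one document
ab_1 = [1, 0, 0, 1]                                       # toy antibodies
ab_2 = [1, 1, 0, 0]

f_1 = affinity(ab_1, antigen)   # antibody 1 against the antigen
f_2 = affinity(ab_2, antigen)   # antibody 2 against the antigen
d_12 = affinity(ab_1, ab_2)     # antibody 1 against antibody 2
```

With φ as defined, each value is the fraction of positions at which the two vectors agree, so an antibody identical to the antigen scores 1.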
The affinity function which measures the similarity between antigens and antibodies, and that between
antibodies is formed in accordance with the principle that the smaller the optimization function is, the better the clustering result is and the higher the antigen-antibody affinity is.
Definition 3 Replacement Operator: Select the T% of antibodies in $A(k)$ whose affinities are higher to replace the T% whose affinities are lower, so as to generate a novel population:
$$A(k) = \{A_1(k), A_2(k), \ldots, A_j(k), A_1(k), A_2(k), \ldots, A_l(k)\} \tag{3}$$
where $j + l = n$, $l = \mathrm{floor}(T\% \cdot n)$, and floor(x) is the integer function giving the largest integer not greater than x. The population scale is pre-determined as n.
Definition 4 Inverse Operator: Select some antibodies in accordance with the inverse rate pi to invert A(k) so as to generate a novel population. Set p and q , p
crossover point is d, $d \in (0, m)$; after crossover, the following equation is obtained:
$$A_i(k) = \{a_{i1}, a_{i2}, \ldots, a_{i,d-1}, a_{jd}, a_{j,d+1}, \ldots, a_{jm}\}$$
$$A_j(k) = \{a_{j1}, a_{j2}, \ldots, a_{j,d-1}, a_{id}, a_{i,d+1}, \ldots, a_{im}\} \tag{6}$$
Definition 7 Mutation operator: Perform the mutation in accordance with the principle that the higher the affinity of an antibody is, the smaller the mutation rate $p_m$ is. For the antigen Ag of the current generation, $p_m = 1 - f(Ag, Ag(k))$. Set every gene of the individual code as a mutation point. Then for the point, set a certain elder generation individual as $A_i(k) = \{a^{k}_{i1}, \ldots, a^{k}_{it}, \ldots, a^{k}_{im}\}$. By the mutation operator on $A_i(k)$, $A_j(k) = \{a^{k+1}_{i1}, a^{k+1}_{i2}, \ldots, a^{k+1}_{ik}, \ldots, a^{k+1}_{im}\}$ is obtained. Here, the mutation principle of