2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing

Immune Network Based Text Clustering Algorithm

Ma Li
Information Center
Xi’an University of Posts and Telecommunications
Xi’an, China
e-mail: [email protected]

Yang Lin, Bai Lin
Computer School
Xi’an University of Posts and Telecommunications
Xi’an, China
e-mail: [email protected]

Wang Rongxi
Schools of Mechanical Engineering
Xi’an JiaoTong University
Xi’an, China
e-mail: [email protected]

Abstract- The principles of the immune system and of the monoclonal algorithm are introduced briefly. For text represented by a vector space model that has been processed by semantic computation, an adaptive polyclonal clustering algorithm is proposed. First, the method for calculating the affinity between antibodies and antigens and the affinity between antibodies is defined, and the genetic operators are designed: replacement, inverse, clonal, crossover, mutation, death, concatenate and clustering. Second, the process of the algorithm is given. Last, clustering and analysis are carried out on the text sets of a corpus. The experiments verify that the proposed algorithm obtains a rational number of clusters and achieves a better correct identification rate and recall rate.

Keywords- Artificial Immune Network; Clonal selection; Text Clustering analysis

978-0-7695-4761-9/12 $26.00 © 2012 IEEE DOI 10.1109/SNPD.2012.111

I. INTRODUCTION

The Artificial Immune System is one of the intelligent information processing methods [1]; it simulates the functions of the immune system of natural organisms. By simulating the natural defense mechanisms of organisms against external substances, such as clonal selection, genetic valuation, feedback and memory, it forms a model for information processing and problem solving, and it has been demonstrated and validated that it can be used in clustering analysis. De Castro and Von Zuben proposed aiNet [3], an artificial immune network model for data clustering, based on the mathematical model of the immune network given by Jerne [2]. Tang Na and V. Rao Vemuri proposed a text clustering algorithm [4] based on the artificial immune network; through data compression it achieves better cluster quality than other algorithms, but it also has disadvantages such as higher time complexity and a limited ability to find new clusters. Hang X applied the aiNet model to the clustering of Web documents, which is characterized by the finding of new clusters and by low clustering accuracy. To address the problems in clustering approaches based on the artificial immune network, this paper brings the semantic computation of vocabulary into the dimension reduction of text features and uses the polyclonal selection algorithm in the clustering process, in order to obtain a rational number of clusters and a high clustering effect.

II. THE BIONIC THEORY OF IMMUNE NETWORK AND CLONAL SELECTION

As a complex system composed of cells, molecules and organs, the immune system defends the body against infringement by external substances. After external substances (antigens) enter the body, the body generates a protective effect to get rid of them; in other words, the normal cells in the body generate antibodies against these antigens. This protective effect is completed in two stages, the first immune response and the second response. The former is the process of learning new knowledge, in other words the process of identifying diversity, and the latter is the memory of the entered antigens. The immune network is the mechanism by which the immune system expresses and stores the learned knowledge; the number of antibodies and the tension strength between the antibodies of this network change dynamically as antigens enter.

A clone is the asexual reproduction of a cell. As one of the most important concepts, it was first proposed by Jerne and expressed completely by Burnet. The concept is as follows: first, after a lymphocyte identifies the antigens, the B cell is activated and reproduces itself (clone); second, the cloned cell mutates (small changes) and generates the specific antibody. The mutated immune cells are divided into antibody cells and memory cells, and the classes and the scale of these cells develop in the direction of destroying the antigens. In the process of cloning, however, the information is only copied from the parent generation to the child generation; no different information is exchanged, so cloning alone cannot enhance the evolution. The cloned child generation therefore needs further processing.

Based on the theory of clonal selection, an information processing model [8] structured by a group of antibodies has been abstracted. The entrance of external antigens triggers the clonal selection of the antibody colony. The process of clonal selection can be expressed as the following random process:

$$A(k) \xrightarrow{\text{clone}} A'(k) \xrightarrow{\text{immune genetic operation}} A''(k) \xrightarrow{\text{selection}} A(k+1),$$

in which $A(k)$ stands for the k-th generation antibody colony; clone, immune genetic operation and selection stand for the stages of cloning, immune genetic operation and compression selection respectively; and $A(k+1)$ stands for the (k+1)-th generation antibody colony. Clonal selection is a random mapping induced by affinity in the artificial immune system. In order to enlarge the search area, to guard against premature evolution and local minima, and to accelerate the convergence speed, the clone generates, near the optimal solution of each generation, a colony of mutated solutions based on the affinity.

In the above process of clonal selection, the immune genetic operation includes crossover and mutation. A clonal algorithm that uses only the mutation operation is called a monoclonal algorithm, and one that uses both operations is called a polyclonal algorithm. Because of the crossover, the latter algorithm inherits more features of the parent generation than the former. Through crossover, mutation and selection, the polyclonal algorithm enhances the ability of local search, the ability to escape from local minima, and the convergence speed.

III. DOCUMENT PREPROCESSING

The core subject of a document can be represented by the keywords that possess semantic features. The features can be obtained by word segmentation, part-of-speech tagging and dimension reduction. The processing, and the resulting representation of the document as a set of feature keywords, are shown in Figure 1.

[Figure 1: a sample text ("Software-defined The computer system is running programs to realize various applications. ...") passes through word segmentation with part-of-speech tagging (Software/n, defined/n, computer/n, system/ad, is/v, to/p, running/vn, program/n, realize/v, ...) and then keyword extraction, which yields keywords such as Software, system, computer, program, application and entity.]

Figure 1. Document preprocessing schematic
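The keyword extraction step of Figure 1 can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. It assumes the text has already been segmented and tagged in the `word/tag` form shown in the figure, and it assumes that keyword candidates are simply the tokens tagged as nouns (`n`). The function name `extract_keywords` and the parameter `keep_tags` are hypothetical, and the later dimension reduction by semantic similarity is not shown.

```python
def extract_keywords(tagged_tokens, keep_tags=("n",)):
    """Return the ordered, de-duplicated keyword candidates of one document.

    tagged_tokens: iterable of "word/tag" strings, as in Figure 1.
    keep_tags: part-of-speech tags kept as keyword candidates
               (an assumption; the paper does not list them).
    """
    keywords = []
    seen = set()
    for token in tagged_tokens:
        # Split at the last "/" so that the tag is always the final part.
        word, _, tag = token.rpartition("/")
        if not word or tag not in keep_tags:
            continue  # untagged token or a tag we do not keep
        key = word.lower()
        if key not in seen:
            seen.add(key)
            keywords.append(key)
    return keywords


# Tagged tokens taken from the sample in Figure 1.
tagged = ["Software/n", "defined/n", "computer/n", "system/ad",
          "is/v", "running/vn", "program/n", "to/p", "realize/v"]
print(extract_keywords(tagged))
# ['software', 'defined', 'computer', 'program']
```

The candidate list is deliberately larger than the final keyword set in the figure; pruning it is the job of the dimension reduction step.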

The input data for the Artificial Immune Network must be normalized, so the vector space model is used to represent the documents in this paper. A feature term stands for a keyword of a document, and the term weight represents the presence of the corresponding keyword in the document: 1 represents presence and 0 absence. The specific operation process is as follows:

Let the document set be $D = \{D_1, D_2, \ldots, D_i, \ldots, D_N\}$, where $D_i$ is the i-th document and $N$ is the total number of documents. After document segmentation and feature dimension reduction, the keyword sets $D' = \{D'_1, D'_2, \ldots, D'_i, \ldots, D'_N\}$ are obtained. Here $D'_i = \{k_{i,1}, k_{i,2}, \ldots, k_{i,j}, \ldots, k_{i,n(i)}\}$ is the keyword set of the i-th document, $k_{i,j}$ is the j-th keyword of the i-th document, and $n(i)$ is the number of keywords of the i-th document. All the keywords together form a keyword set $K = \{k_1, k_2, \ldots, k_M\}$, where $M$ is the number of keywords in the set. Thus all documents can be represented in a vector space model. Because synonyms and near-synonyms are widespread in natural language, words can be consolidated by a semantic similarity computation. This compresses the scale of the keyword set and reduces the dimension of the vector space.

IV. ADAPTIVE POLYCLONAL NETWORK DOCUMENT CLUSTERING ALGORITHM

A. The idea of algorithm

A novel document clustering algorithm based on the adaptive polyclonal clustering algorithm [9] is presented in this paper. The idea is as follows. A group of antibodies is selected randomly as the network nodes, and each vector representing a document is created and treated as an antigen. The algorithm, directed by the affinity function between antibodies (Ab) and antigens (Ag), contains seven operators: replacement, inverse, crossover, non-consistence mutation, clonal, death and concatenate. The scale of the antibody group and the adaptability between Ab and Ag can therefore be self-regulated, and both the global search and the local search are directed, which makes the matching of antigens and antibodies much more effective. The right group, called the clustering prototype, is obtained iteratively through network learning and suppression. The prototype is a description of the data distribution, so the clustering analysis of the antigens (documents) is achieved.

B. The Definition

The task of document clustering is to obtain the clustering prototype. In this paper, the antibodies after dimension reduction processing are considered as input data. Antibodies are selected randomly at initialization.

Set the antigen as $Ag = \{g_1, \ldots, g_i, \ldots, g_m\}$. Set the k-th generation antibody group as $A(k) = \{A_1(k), \ldots, A_i(k), \ldots, A_n(k)\}$, where $A_i(k) = \{a_{i1}, a_{i2}, \ldots, a_{im}\}$ is the i-th antibody of the k-th generation antibody group.

Definition 1 The affinity between any two antibodies: The affinity between $A_i(k) = \{a_{i1}, \ldots, a_{im}\}$ and $A_j(k) = \{a_{j1}, \ldots, a_{jm}\}$ equals the measured distance $d(A_i(k), A_j(k))$ between them. The smaller it is, the less the difference between $A_i(k)$ and $A_j(k)$.

$$d(A_i(k), A_j(k)) = \frac{1}{\sum_{t=1}^{m} \varphi(a_{it}, a_{jt})} \quad (1)$$

where $\varphi(\cdot,\cdot)$ is:

$$\varphi(x, y) = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}$$

Definition 2 The affinity function between $A_i(k) = \{a_{i1}, \ldots, a_{im}\}$ and $Ag = \{g_1, \ldots, g_m\}$:

$$f(A_i(k), Ag) = \frac{\sum_{t=1}^{m} \varphi(a_{it}, g_t)}{m} \quad (2)$$
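The binary vector space model of Section III and Definitions 1 and 2 can be illustrated with a short sketch. It is only a sketch under stated assumptions: Equation (1) is read as the reciprocal of the number of matching positions and Equation (2) as the fraction of matching positions, which is our reading of the printed formulas. The function names are hypothetical, and returning infinity when no position matches is our own choice, since the paper does not say how that case is handled.

```python
def phi(x, y):
    """Indicator function: 1 if the two genes are equal, else 0."""
    return 1 if x == y else 0


def to_binary_vector(doc_keywords, keyword_set):
    """Binary vector space model: 1 if the keyword occurs in the document."""
    present = set(doc_keywords)
    return [1 if k in present else 0 for k in keyword_set]


def matches(u, v):
    """Number of positions at which two equal-length vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(phi(a, b) for a, b in zip(u, v))


def antibody_distance(ab_i, ab_j):
    """Equation (1), read as 1 / (number of matching positions).

    Returns infinity when no position matches (an assumption).
    """
    count = matches(ab_i, ab_j)
    return float("inf") if count == 0 else 1.0 / count


def affinity(antibody, antigen):
    """Equation (2), read as the fraction of matching positions."""
    return matches(antibody, antigen) / len(antigen)


# Hypothetical keyword set K and one document treated as an antigen.
keyword_set = ["software", "system", "computer", "program", "application"]
antigen = to_binary_vector(["software", "computer", "program"], keyword_set)
antibody = [1, 0, 1, 0, 0]

print(antigen)                               # [1, 0, 1, 1, 0]
print(affinity(antibody, antigen))           # 0.8
print(antibody_distance(antibody, antigen))  # 0.25
```

Under this reading the affinity lies in [0, 1], which is what the mutation rate $p_m = 1 - f$ of Definition 7 requires.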

The affinity function, which measures the similarity between antigens and antibodies and between antibodies, is formed in accordance with the principle that the smaller the optimization function is, the better the clustering result and the higher the antigen-antibody affinity.

Definition 3 Replacement Operator: Select the T% of antibodies in $A(k)$ whose affinities are higher to replace the T% whose affinities are lower, so as to generate a novel population.

$$A(k) = \{A_1(k), A_2(k), \ldots, A_j(k), A_1(k), A_2(k), \ldots, A_l(k)\} \quad (3)$$

where $j + l = n$, $l = \mathrm{floor}(T\% \cdot n)$, and $\mathrm{floor}(x)$ is the integer function giving the largest integer not greater than $x$. The population scale is pre-determined as $n$.

Definition 4 Inverse Operator: Select some antibodies in accordance with the inverse rate $p_i$ to invert $A(k)$ so as to generate a novel population. Set p and q, p [...]

[...] crossover point is $d$, $d \in (0, m)$; after crossover, the following equations are obtained:

$$A_i(k) = \{a_{i1}, a_{i2}, \ldots, a_{i,d-1}, a_{jd}, a_{j,d+1}, \ldots, a_{jm}\}$$

$$A_j(k) = \{a_{j1}, a_{j2}, \ldots, a_{j,d-1}, a_{id}, a_{i,d+1}, \ldots, a_{im}\} \quad (6)$$

Definition 7 Mutation operator: Perform the mutation in accordance with the principle that the higher the affinity of an antibody is, the smaller the mutation rate $p_m$ is. For the antigen $Ag$ of the current generation, $p_m = 1 - f(Ag, Ag(k))$. Set every gene of the individual code as a mutation point. For each point, set a certain individual of the parent generation as $A_i(k) = \{a^{k}_{i1}, \ldots, a^{k}_{it}, \ldots, a^{k}_{im}\}$. By applying the mutation operator to $A_i(k)$, $A_j(k) = \{a^{k+1}_{i1}, a^{k+1}_{i2}, \ldots, a^{k+1}_{it}, \ldots, a^{k+1}_{im}\}$ is obtained. Here, the mutation principle of [...]
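The replacement operator (Definition 3), the single-point crossover of Equation (6) and the mutation rate of Definition 7 can be sketched as below. This is a hedged illustration, not the authors' code. All function names are hypothetical. The crossover reads Equation (6) as keeping genes 1 to d-1 and swapping genes d to m. The `mutate` helper flips binary genes with probability $p_m$, which is our assumption, because the mutation principle itself is cut off in the text above.

```python
import math
import random


def replacement(population, affinities, t_percent):
    """Definition 3: the l lowest-affinity antibodies are replaced by
    copies of the l highest-affinity ones, with l = floor(T% * n)."""
    n = len(population)
    l = math.floor(t_percent * n / 100)
    # Indices sorted from highest to lowest affinity.
    order = sorted(range(n), key=lambda i: affinities[i], reverse=True)
    survivors = [population[i] for i in order[:n - l]]   # j = n - l kept
    copies = [list(population[i]) for i in order[:l]]    # l best duplicated
    return survivors + copies


def crossover(ab_i, ab_j, d):
    """Equation (6): keep genes 1..d-1, swap genes d..m (d is 1-based)."""
    m = len(ab_i)
    if len(ab_j) != m or not 0 < d < m:
        raise ValueError("need equal lengths and 0 < d < m")
    cut = d - 1
    return ab_i[:cut] + ab_j[cut:], ab_j[:cut] + ab_i[cut:]


def mutation_rate(affinity_value):
    """Definition 7: p_m = 1 - f, so higher affinity means less mutation."""
    return 1.0 - affinity_value


def mutate(antibody, p_m, rng):
    """Flip each binary gene with probability p_m (an assumed rule)."""
    return [1 - g if rng.random() < p_m else g for g in antibody]


population = [[1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
affinities = [0.75, 0.25, 1.0, 0.5]

print(replacement(population, affinities, 25))
# [[1, 1, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
print(crossover([1, 1, 1, 1], [0, 0, 0, 0], 3))
# ([1, 1, 0, 0], [0, 0, 1, 1])
print(mutation_rate(0.75))  # 0.25
print(mutate([1, 0, 1], mutation_rate(0.75), random.Random(0)))
```

The population size stays at n after replacement, matching the constraint $j + l = n$ in Equation (3).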
