r.. 0*.0 u0 .0 * 0 - Defense Technical Information Center

2 downloads 0 Views 844KB Size Report
The financial support provided by the Graduate School of. Industrial Administration at Carnegie Mellon University and the International. Center for Management ...
0= 0

c

*.0G

*

0

06

0d

>~

ý0

on

r..

0d

j

00

x0..

0*.0 u0

000* 0 *O 0* *4 0

.0

*0

3

.

NOTICE

SCLI

THIS

DOCUMENT

IS

BEST

QUALITY AVAILABLE. THE COPY FURNISHED TO DTIC CONTAINED A SIGNIFICANT NUMBER OF

DO WHICH PAGES REPRODUCE LEGIBLY.

NOT

-Y



"

-

j

-

.2

I

met

Carnegie-Mellon University r*TSSBUlMJ

Ii

, PENNMYLVANIA 15213

GRADUATE SCHOOL OF INDUSTRIAL ADMINISTRATION WILLIAM LARIMER MS..ON, PIXNDER

S~CLEARINGHOUSE

Reproduced by the for Federal Scientific & Technical Information Springfield Va. 22151

s alaI t ':r: cmd P .r-ubi* LTbix dtxCX.. ý hta bou (

.

DE

'

m

W-

Management Sciences Research Report No.

183

CLUSTER ANALYSIS AND MATHEMATICAL PROGRAMMING

by

M.

R.

Rao*

October 1969

*Graduate School of Industrial Administration Carnegie-Mellon University and International Center for Management Sciences at the Center for Operations Research & Econometrics Universite Catholique de Louvain Now at University of Rochester

This report was prepared Sciences Research Group, NONR 760(24), NR 047-048 Reproduction in whole or U. S. Government.

as part of the activities of the Management Carnegie-Mellon University, under Contract with the U. S. Office of Naval Research. in part is permitted for any purpose of the

Management Sciences Research Group Graduate School of Industrial Administration Carnegie-Mellon University Pittsburgh, Pennsylvania 15213

ACKNOWLEDGEMENT

The author is discussions.

indebted to Professor W. W. Cooper for many helpful

The financial support provided by the Graduate School of

Industrial Administration at Carnegie Mellon University and the International Center for Management Sciences at the Center for Operations Research & Econometrics affiliated with Universite Catholique de Louvain is gratefully acknowledged.

ABSTRACT Cluster analysis involves the problem of optimal partitioning of a given set of entities into a pre-assigned number of mutually exclusive and exhaustive clusters.

Here the problem is

formulated

in two different ways with the distance function (a) of minimizing the within groups sums of squares and (b) minimizing the maximum distance within groups.

These lead to different kinds of linear and non-linear

(0-1) integer programming problems. discussed.

Computational difficulties are

.•i'

S..!

CLUSTER ANALYSIS AND MATHEMATICAL PROGRAMMING -4M.R.RAO

INTRODUCTION

Cluster analysis involves the problem of optimal partitioning of a given set of entities into a number (pre-assigned).of mutually The criterion for optimality exclusive and exhaustive clusters. depends heavily upon the application in It

is

which it

is

to be used.

not the purpose of this paper to discuss the relative merits A discussion of this can be found in

of the various criteria. "Sokal and Sneath [1]. our analysis here,

In

we confine ourselves to distance based

cluster analysis in which a distance measure between the various -enti.ties is available. This type of cluster analysis has received increasing attention recently. given by Majone

A detailed discussion of this is

[2] and Majone and Sandy [3].

Even

though a

distance ".measure between the various entities is known, the cridepends upon the intended terion for optimal partitioning still application.

Again,

it

is

beyond the scope of this paper to dis-

cuss the relative merits of the various possible criteria but references can be found in £lgorithm is

Jensen

[5] where a dynamic programming

given for minimizing the within - groups sums of

squares.

One of the purposes of this paper is

to consider the crite-

rion of minimizing the within groups sums of squares and show how the problem can be formulated as a mathematical programming problem.

In

the general case,

the formulation leads to a fractional

non-linear

See also [4]for an application of distance based cluster analysi-..

2.

0-1 programming problem with constraints and there does not appear to be any computationally problem.

Fortunately,

ly tractable in sed in

efficient procedure for solving such a

the problem does appear to be computational-

some variants and special cases which are discus-

detail. An alternate criterion,viz.,

within groups,

is

minimize the maximum distance

also considered.

leads to a linear integer (0-i)

The formulation in

this case

programming problem which can be

solved by any one of the known techniques

[6,7].

A simple but

efficient algorithm is given to solve this problem when the number of pre-assigned groups is restricted to be only two. FORMULATION OF THE PROBLEM. The following definitions are used in d4. is N is

the formulation

a metric distance between entities i and

.

j.

the number of entities

M id the number of pre-assigne*d groups. We give a formulation of the problem under different criteria. I)

Minimize (maximize) the within (between)-groups sums of squares Let Xik be equal to 1 or 0 depending upon whether the ith entity is

in

group k or not.

Let d.. be a Euclidean metric. o

The problem can now be written as

M N-1 N "Min k=E [( E * i=1" j=i+l

2 iX d i Xik

N kxV i=I

x -k]

(I)

M Subject to

E k=1

xik

1

Xik > 0 and integer valued

for i

=

for i

=

k

÷.

1,2,...N

1,2,...M

(2)

3.

a fractional non-linear 0-1 programming problem which However two special gases -is- very difficult to solve in general. of this appear to be less difficult and they are discussed next. This is

each group is

a) The number of eptitie& in

in

M I

specified

Sometimes the number of entities in each group is specified Let k be the number of entities in. group k such that advance. k= N. n.

k=J function (1) now becomes

The objective

Min

I -M 2(3) d. x.i Z ( E E k=J nk .i=I j=i+l -

kN xjk

and -we have an additional set of constraints, ".i=

N Z xik ='k (4)

k = 1,2,...M

a non-linear 0-1 programming problem with conThis is still However since all nk are specified in adstraint set (2) and (4). vance,

we do not have a fractional objective as in

(1) and there

are at least two possible approaches to attempt to solve this problec approach is to treat this as a constrained non-linear The first boolean programming problem and use the methods outlined by Hammer and Rudeanu [8].

to linearize the obje-tive

Another approach is

function at the expense of increasing the number of constraints and solve the resulting problem by any of the known techniques for linear integer programming.

The 0-1 programming problem to be

solved is as follows N N-1 M I E - ( £E Min E k=1

k 2 d.. y..) 13 1) -nk i=1 j=i+l k

Subject to Xik + xj

-

Yij

c 1

i

=

1,2,...N-1

j

i+1,i+2,...N

k=

1,2,...M

(6)

N i

I:Mx z x ik

k= 1,2k...M

n

.I ik

k

= 1

=1,2,....N

k=1

All

In

i

and

0 and integer valued.

this formulation,

the number of constraints increases rapidly

with N and M and hence this approach is

computationally useful only

for small values of N and M. *

b) The number of pre-assigned groups is

equal to two

A method for cluster analysis suggested by Edwards and CavalliSforza [9] involves dividing the entities into two most-compact clusters and repeating the process sequentially.

This,

in. our

notation implies that at each stage of the process we have M = 2. In vwith

this case,

a better formulation of the problem is

possible

the following notation

SLet x. = I or 0 depending upon whether the ith entity is e

in

group

1or 2.

The problem can be written as

Mini( N-1 E

NE



x

x.)

r x.[

I-1 X E

NE

2 dý(O-x )(1-x'}

i=1 j=i+i N /(N-

i=1

xj)]

(7)

x.1 = 0 or I for all i. This

is

a fractional non-linear 0-1 programming problem with no

one approach to solve this [•].

also a boolean programming problem an

is

ccnstraints.This

additional

.appears to be promising;

It

so far no computational

1) In

the

first

and of these the

is

we rewrite

in

second one

should be pointed out however,

experience

approach,

to use the methods given

problem is

We outline two other methods

that

available.

the objective

function

*)

as follows

-

x.)( E E i=i j=i+1

r

d

referi'ed to as (8w).

i=i1

i=i j=i+1

N (1-xi)(l-x.))]/( E x.)(N 1 i=1i

function

(8)

N =E

(81

x.)

without the denominator be

We note that the denominator in

(8)

is

a

if

maximium

N X i=1

[•]

+

I

d. 13 Let the objective

x. x)

.

NY

N-1

N

N

N-1

N

Min [(N

x.

[I] 2

=

where

the least integer greater than or equal to N!2.

is

function

minimize the objective

We first

in

methods given

£ x. "i=1 the problem is

If

[8].

or

then solved.

[N] 2

by using the

obtain

we'are fortunate'to

[I 2

=

(8')

-

Otherwise,

N Z

let

x.

m < [].

i=1

Now,

we know that the only possible solutions that'may improve

the value of (8)

should necessarily

satisfy

the

following relation

N i= I

x

9

43.

There this

is

no loss

in

generality.in

were not true we

assuming

could use N -

N2N Y

x. i 1

that m < instead of

since E

i=i1

xi.

if

."6.

The next step is

to consider

added constraint

(9).

If

(8')

the current

"for the objective function (8) now, is

it'is

solution has a better

becomes

'either

so high that

Z x. ='[11 2 . i=1 the value

x. to be equal to V] or 2

I

The process.

or the value of the function for the objective function

would be higher than the current best value even if N N i=

value

than the best value found till

retained as the best value found thus far.

repeated until

(8')

and solve the problem with the

(8)

we assume

-2

The efficiency of this approach depends heavily upon the ability

to find the solution of (8')

present,

with the restriction

(9).

At

there does not appear to be efficient methods for doing

this, 2-) An alternate

approach which appears to be promising is

branch and bound methods to solve this problem. that the problem is

to use

First we note

very similar to the problem of "selecting

an optimum subset" described by Beale [10] where a tree-search (branch and bound) in in

procedure is

outlined.

Using the terminology

[10], we briefly describe the procedure first and then indicate some detail how the various steps can be performed for this

problem. The root of the tree corresponds to a partial solution in which all the elements2 are free to be assigned.

We select one

element and we now have two possibilities of assignment. We can assign the element to group I or group 2. These two possibilities correspond to the two branches emanating from the root.

For con-

venience,

we will always write the assignment of an element to group 1 as the branch to the right and always branch to the right 2

II

F For convience

we use

!,.

the word elenent

..

to refer to

-

an entity.

...

".first. Thus,

we have a partial solution

at any node of the tree,

with some elements assigned to group. 1, while others are yet to be assigned. in

which all

some assigned to group 2

A complete solution is

one

the elements have been assigned and this corresponds

to an end of the tree. Before we branch from a particular node, bound for the objective function.

If

we compute a lower

the lower bound is

greater

than or equal to the best complete solution found so far, we do not branch further from this node. In this case,'we back-track along the tree until we reach a node with an unexplored branch to If no such node exists,.the problem is solved and the best the left. complete solution found so far is other case,

the optimal solution.

where the lower bound is

In the

less than the value of the

current best solution, we need to branch further. variable not yet assigned and branch to the right.

We select a We continue

branching until we reach an end of the tree or the lower bound test indicates that we need not branch further. When an end of the tree is reached, it would represent an improvement in the objective function and hence it is retained as the current best solution.

From the end of the tree,

we then back-track along the

tree until we reach a node with an une'xplored branch to the left. SELECTION

OF AN ELEMENT AT A NODE.

At any particular.node,

we have a set (possibly empty) of-

elements in group I and a set (possibly empty) of elements in group 2. If we need to branch further, we select an element not yet assigned and branch to the right.

In

our procedure,

ponds to assigning this selected element to group 1.

this

corres-

So the se-

lection of an element is made such that the sum of the distances-between this element and all other elements already in group I is minimum.

vi 8. COMPUTATION OF LOWER BOUND AT A NODE.

first

-We

give some

notations

for convenience.

At a

given node

let

xi = 1 -for all

i assigned to group I

x . = 0 for all

i assigned to group 2

{i/xi = 1),

Let N,

=

Let.n1

and n 2 be1 21 the number of elements in

Let n 3

= N -

nI

n2

-

to any group and let Let K1

I i,j&N

d..

and N 2

{i/xi (

= 01.

be the number

N

and N2repciey respectively.

of elements

not yet assigned

N3 be a set consisting of these n 3 elements. and K2

1

=

i

Z jEN

d..

13

2

Let D be a N x N symmetric matrix giving the. distance between

Let F be a'n

entities.

between an element in

x n1

3

the

sub-matrix of D giving the distance

N3 and an element in

N1 .

Let ri be the sum

of the elements in row i of matrix Y. Let.1 be a n 2 x n submatrix of D giving the distance between an Let R i be the sum of the element in N22 and 1an element in N. elements in

row i of the matrix R.

We assume that matrices R.and F are arranged in

such a manner

that

j.

for i

E. .1 FY

for

Let C be a ni x n 3

3

3

i

symmetric