
Grouping Genetic Algorithm for the Blockmodel Problem

Tabitha James, Evelyn Brown, and Cliff T. Ragsdale

Abstract—Many areas of research examine the relationships between objects. A subset of these research areas focuses on methods for creating groups whose members are similar based on some specific attribute(s). The blockmodel problem has as its objective to group objects in order to obtain a small number of large groups of similar nodes. In this paper, a grouping genetic algorithm (GGA) is applied to the blockmodel problem. Testing on numerous examples from the literature indicates a GGA is an appropriate tool for solving this type of problem. Specifically, our GGA provides good solutions, even to large-size problems, in reasonable computational time.

Index Terms—Blockmodel, grouping genetic algorithm (GGA), social network analysis.

I. INTRODUCTION

Classical social network analysis combines sociology with graph theory in order to study complex networks of relationships between entities. Social network analysis consists of a set of tools (or methods) for analyzing networks of relations between entities. It is convenient to represent a social network as a graph, where the entities are represented as nodes and the relationships between nodes are represented as arcs. This representation is a traditional graph in graph theory. It should also be mentioned that the arcs may be directed and/or weighted, but throughout this paper an undirected, unweighted network will be assumed. The data view of the graph is an adjacency matrix. While the graph provides the visual representation, the adjacency matrix is the primary data source that is used as input to the methods.

A subset of social network analysis tools includes methods to observe patterns or structures in the graph, typically by manipulating the adjacency matrix to obtain some type of groups. One such grouping technique is referred to as blockmodeling. Blockmodeling originated in the social network analysis arena [2]. Blockmodeling attempts to encourage the discovery of a small number of large groups of densely connected nodes in a graph. In practice, this leads to the detection of groups of entities that are all related (or similar) to one another.


Manuscript received July 2, 2008; revised December 1, 2008; accepted April 12, 2009. Current version published January 29, 2010. T. James and C. T. Ragsdale are with the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061 USA (e-mail: [email protected]; [email protected]). E. Brown is with the Department of Engineering, College of Technology and Computer Science, East Carolina University, Greenville, NC 27858 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TEVC.2009.2023793

While there exist many practical applications of being able to analyze the structure of social networks, blockmodeling has also been utilized recently to analyze multi-attribute measures of performance in several domains [3], [4], which will be described in the next section. It has also been suggested that blockmodeling may provide an attractive alternative approach to data envelopment analysis [5].

In this paper, we develop a grouping genetic algorithm (GGA) for the blockmodeling problem (BM-GGA). The intention of the current study is to illustrate that a GGA is an appropriate tool for application to the blockmodeling problem. In Section II we present some necessary terminology and background on blockmodeling. Specifically, in Section II-A we describe the terminology for groups used in the literature. In Section II-B, we present an overview of previous solution methodologies for the blockmodeling problem as well as some application areas. In Section II-C, we present the mathematical formulation for the blockmodeling problem and discuss the appropriateness of applying a heuristic solution method to the problem. Section III provides a brief introduction to GGAs. In Section IV, we describe in detail the GGA we developed for the blockmodeling problem. To illustrate the effectiveness of the heuristic approach proposed here, in Section V we test the algorithm on a set of problems obtained from the literature. These problems come from various problem domains and vary in size (and therefore in solution complexity). We present the results of our algorithm as well as comparisons to previous solution techniques from the literature. Our conclusions are presented in Section VI.

II. PROBLEM OVERVIEW

A. Groups in Social Networking

Methods for grouping items into some type of defined subsets abound in the literature of many disciplines. There exist numerous alternative terms in the literature for organizing groups of objects into defined subsets. Graph theorists have long analyzed, for example, cliques, sets, clusters, and partitions. Problems in these categories belong to the set of combinatorial optimization problems that are known to be difficult to solve. The study of solution methodologies for such problems has been popular in operations research, computer science, and applied mathematics. Borrowing from graph theory, social network analysis uses these concepts in an applied manner to examine social structures. Cliques, blocks, clusters, and community structures have all been used in the social networking literature to refer to some formed subsets (or patterns) in a graph.


Commonly accepted uses of these terms can be distinguished by looking at the strictness with which the concepts of membership and density are applied. We will use membership to refer to the number of groups to which a node may belong. Density refers to the tightness of the interconnections between the members of a group.

A block typically limits membership of a node to exactly one group [4]. Although a common block definition requires all nodes to be directly connected to one another, it has also been prevalent in the literature to relax this constraint in favor of high density [1], [4]–[6]. A clique, on the other hand, typically allows membership in more than one group [4] but requires all nodes to be directly connected to each other [1]. A "true clique" is a group of nodes that are all directly connected to each other; the existence of a direct connection between every pair of nodes is the defining factor of a clique. As true cliques are not as prevalent in real networks as one would like, this definition has also been relaxed in some cases. It is common to view a cluster as relaxing the constraints on both membership and density. However, the terms cluster and clique are often used interchangeably. In most of the relevant literature, the primary distinctions between the two terms are due to the connections that exist within the group and the separation between groups [1]. Clusters can be characterized as areas of high density in a graph and are typically more distinctly separated. Thus, a cluster is a bit more loosely defined than a block or a clique. The research in the physics community on detecting community structure in networks (see [7]) is a notion related to clustering.

All of the formations described above have become areas of interest in the field of social networking, which itself has become increasingly interdisciplinary. Many different approaches for detecting these various formations in a network (or graph) exist, including the use of heuristics. This is an area of much ongoing research, as determining significant structural features of many types of networks can provide interesting practical insights. In the next section, we review the methods specifically developed for blockmodeling, as this is the focus of the current study.

B. Review of Solution Methodologies for Blockmodeling

Much of the recent work in the area of blockmodeling has come from Jessop [3], [4], [8], [9]. In his analysis of 30 international airlines, Jessop [3] applies multidimensional scaling to cluster airlines of similar performance. His approach utilizes a linear weighted sum of six specific performance ratios in order to determine an overall measure of airline performance. Once probability distributions for the values of the weights are specified, Monte Carlo simulation is employed to determine the mean and variance of all pairwise differences in scores. The values of the standardized differences between airlines are converted into a binary matrix based on a chosen significance level. Rows and columns of the matrix are rearranged to form dense blocks along the diagonal. Airlines in the same block are said to be similar. The blocks are then ranked using their mean score.

Another work by Jessop [8] provides a comparison of his blockmodeling approach to those of Alexander [10] and Elms [11]. The objective of Jessop's blockmodel heuristic is to maximize the sum of squared block sizes, known as the Herfindahl–Hirschman Index or HHI [12], [13]. Jessop's procedure for block formation, which was also used in his analysis of airlines, begins by finding the node with the maximum number of connections. A block is built around this node by augmenting it with the node having the most connections that is also connected to all current block members. The augmenting step is repeated until there is no node that can be added without violating the constraint that it be connected to all current block members. Whereas the approaches of Alexander and Elms build cliques, Jessop's approach builds blocks, meaning a node may belong to only one group. This being the case, all nodes assigned to the initial block are removed from further consideration and the steps are repeated to form the next block. The process iterates until every node has been assigned to a block. Jessop also found improved algorithm performance by omitting a constraint on the minimum block size.

In a subsequent work, Jessop formulates the blockmodel problem as a quadratic program [4]. The approach is applied to two scenarios: establishing groups of MBA students based upon their elective choices, and performance ranking of British universities. The quadratic program is solved using commercial software. Jessop also presents results of applying his blockmodel heuristic [8] to the MBA and British university datasets.

More recent work by Jessop [9] applies the blockmodel approach to the problem of assessing the competitiveness of soccer leagues. In that study, league competitiveness is measured by the number of maximally dense blocks that can be constructed based on the similarities of the teams in the league, and Jessop demonstrates that blockmodels are a feasible means of describing league performance.

An integer linear programming (ILP) formulation of the blockmodel problem is developed by Proll [5]. Proll adopts Jessop's [4] formulation of the blockmodel problem, but linearizes the model. Even with the increased problem size that results from the linearization, Proll is able to apply ILOG CPLEX and find feasible solutions for each of the seven problems tested. For four of the problems, the solution Proll obtains is superior to the solution found by Jessop. Proll also examines the application of ILOG CPLEX to two other models intended to alleviate problems of symmetry (i.e., block labels are arbitrary). Proll reports that his solutions are inferior to those of Jessop for the models that employ "symmetry breakers."

More recent work by Jessop et al. [14] presents two ILP formulations as well as a heuristic approach to the blockmodel problem. The authors refer to the two ILP formulations as the vertex formulation and the clique formulation. To obtain a solution, all cliques of a given graph are enumerated and the resulting set partitioning problem is solved. The clique formulation offers the advantage that it does not require specification of the maximum number of clusters. Jessop et al. [14] point out that an optimal solution for the clique formulation is also an optimal solution to the integer programming formulation presented in [4].


Since the problem of enumerating all cliques in a graph is NP-hard, it is necessary to explore non-ILP approaches when solving larger problems (i.e., greater than 50 vertices). Their paper also examines a heuristic approach that begins by decomposing the given graph into subgraphs with nonoverlapping vertices. For each subgraph, the optimal clique problem is solved. The resulting solutions are combined to provide an initial solution for the given graph. This initial solution is improved by redistributing vertices "from the smallest clique to the largest possible clique until no further changes are possible" [14, p. 18].

Both Jessop and Proll point out that their work is based upon earlier approaches. One such algorithm is CONCOR [15]. CONCOR predates the common definition of a block, presented in Section II-A above, requiring all nodes in a block to be directly connected to one another. For a given problem, CONCOR produces a partition of the nodes into exactly two equivalence classes. Applied repeatedly, it produces smaller classes as each equivalence class is partitioned, thus creating a hierarchical clustering. Another of the early blockmodel approaches is contained in the program named STRUCTURE [16], developed by Ronald Burt [17]. It differs from other approaches in that it uses Euclidean distances as a dissimilarity measure [18]. Based on the strength of relations, a 0/1 matrix is determined using a chosen level of similarity. The matrix serves as input to the blockmodel procedure of STRUCTURE, and the analysis is performed in a manner similar to CONCOR [1]. BLOCKER is a third example of the earlier approaches to the blockmodel problem [20]. In establishing appropriate blocks, BLOCKER makes it possible to identify nodes (called crystallizers) whose placement determines the placement of numerous other nodes and to identify nodes (called floaters) that are allowed multiple assignments [21]. BLOCKER differs from CONCOR in that BLOCKER requires as input a hypothesized initial blockmodel. It then derives assignments of nodes to blocks based on satisfying the hypothesis for the given data matrices.

C. Problem Formulation

We adopt the formulation suggested by [4] and also used in [5]. In this formulation, the objective is to maximize the sum of squared block sizes. This quantity, the HHI, is a popular measure in economics [12], [13] that is used to assess the degree of industrial concentration. Maximizing this function encourages solutions with a small number of large blocks. Formally, the objective function can be given as

max HHI = \sum_{k=1}^{b} \Big( \sum_{i=1}^{n} \lambda_{ik} \Big)^2.        (1)

In (1), b is the maximum number of blocks, n is the number of nodes, and λ_{ik} = 1 if node i belongs to block k, otherwise λ_{ik} = 0. By definition, a block allows membership of a node to only one group. Therefore, we must add a set of constraints to enforce the membership of each node to exactly one group, given in (2), as well as a set of constraints to impose the desired density, given in (3), where x_{ij} = 1 if nodes i and j are similar (adjacent) and x_{ij} = 0 otherwise. Thus, (1) is subject to the following:

\sum_{k=1}^{b} \lambda_{ik} = 1 \quad \forall i        (2)

\sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} \lambda_{ik} \lambda_{jk} - \beta \Big( \sum_{i=1}^{n} \lambda_{ik} \Big)^2 \ge 0 \quad \forall k.        (3)

      a  b  c  d  e  f  g  h
  a   1  1  1  0  1  0  0  1
  b   1  1  1  0  0  1  0  0
  c   1  1  1  0  0  1  0  0
  d   0  0  0  1  1  1  1  0
  e   1  0  0  1  1  1  1  0
  f   0  1  1  1  1  1  1  0
  g   0  0  0  1  1  1  1  0
  h   1  0  0  0  1  0  0  1

Fig. 1. Example blockmodel solution. The reordered adjacency matrix shows the blocks {a, b, c}, {d, e, f, g}, and {h} along the diagonal.
The parameter β is the minimum block density required. If β = 1, then all nodes in a block must be directly connected to (adjacent to) every other node in the block. Typically, a value of β equal or close to 1 is desirable.

To illustrate, consider the reordered adjacency matrix given in Fig. 1, where a 1 denotes that nodes i and j are similar (or adjacent), and 0 otherwise. The blocks outlined along the diagonal of Fig. 1 represent a solution. As can be seen, nodes {a, b, c} compose block 1, nodes {d, e, f, g} compose block 2, and block 3 is made up of only node {h}. The value of the objective function for this solution, using (1), is 3² + 4² + 1² = 26. Constraints (2) are satisfied, as no node is in more than one group. For example, node a is also similar to node h, but since membership in more than one group is prohibited, node a is part of group 1 only. In this example, β = 1, which forces all nodes to be connected to all other nodes in their group (hence, no 0s appear inside the blocks). By decreasing β, it would be possible for the solution to include a block containing a zero (a density of less than 100%). A singleton (a group with only one node) is also shown in the example. It is sometimes the case that singletons are not allowed, though this is context dependent.
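To make the formulation concrete, the following Python sketch (our own illustration, not code from the paper) evaluates the Fig. 1 solution against objective (1) and the density requirement of constraint (3) with β = 1; the function and variable names are illustrative assumptions.

```python
# Illustrative sketch: evaluate the Fig. 1 solution against (1) and (3).
nodes = "abcdefgh"
rows = {  # adjacency matrix x from Fig. 1 (1 = similar/adjacent, diagonal = 1)
    "a": "11101001", "b": "11100100", "c": "11100100", "d": "00011110",
    "e": "10011110", "f": "01111110", "g": "00011110", "h": "10001001",
}
x = {(u, v): int(rows[u][j]) for u in nodes for j, v in enumerate(nodes)}

blocks = [["a", "b", "c"], ["d", "e", "f", "g"], ["h"]]  # candidate solution

def hhi(blocks):
    """Objective (1): sum of squared block sizes."""
    return sum(len(b) ** 2 for b in blocks)

def density(block):
    """Share of pairs (i, j) within the block that are adjacent, as in (3)."""
    pairs = [(i, j) for i in block for j in block]
    return sum(x[p] for p in pairs) / len(pairs)

def feasible(blocks, beta=1.0):
    """Constraint (3): every block must reach the required density beta."""
    return all(density(b) >= beta for b in blocks)

print(hhi(blocks), feasible(blocks))  # 26 True  (3^2 + 4^2 + 1^2 = 26)
```

Reducing β below 1 would let the same check tolerate a proportionate number of 0 entries inside a block, as discussed above.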


Proll [5] notes that the continuous relaxation of Jessop's [4] formulation in (1)–(3) results in a nonconvex feasible region, a difficulty that is compounded by the desired direction of optimization. Proll goes on to show that the model can be linearized by the introduction of binary variables to replace the product term λ_{ik}λ_{jk} in (3). However, this linearization results in a substantial increase in problem size. Proll's [5] computational testing suggests that an ILP approach may be effective for solving full-density blockmodeling problems of up to 50 nodes. However, he notes that solving larger problems remains problematic, especially for the non-full-density blockmodeling problem, where the size of the resulting ILP may make it difficult to find good solutions, even if only a small number of blocks need to be formed. Thus, effective heuristics are needed for blockmodeling problems where ILP approaches are not appropriate. This paper focuses on the use of a GGA as a heuristic approach for solving difficult blockmodeling problems.

III. OVERVIEW OF GROUPING GENETIC ALGORITHMS

A GA is a heuristic search technique that works by manipulating a population of solutions. Using the operators of selection, crossover, and mutation, a GA combines parts of successful solutions in an attempt to form better ones. In this manner, a GA is said to utilize an analogy to natural selection and survival of the fittest [22].

The GGA was introduced by Emanuel Falkenauer [23]. It has a specialized chromosome and specialized operators to overcome the problems that typically arise when a standard GA is applied to a grouping problem. Falkenauer [24] points out three specific shortcomings of applying a standard GA to a grouping problem. First, the standard encoding scheme is highly redundant. Second, the application of the crossover operator often results in an offspring chromosome that has few or no characteristics in common with the parent chromosomes used to produce it. Third, standard GA mutation can be too disruptive when trying to establish successful groups. Falkenauer's GGA proposes a new encoding scheme in an effort to overcome these drawbacks. This encoding is accompanied by revised crossover and mutation operators which have been shown to produce high-quality solutions to a variety of grouping problems [24].

Falkenauer's encoding scheme includes an augmented chromosome. The augmented chromosome consists of the standard GA chromosome appended with a listing of the groups. Crossover and mutation are applied to the group section of the chromosome, resulting in alteration of the main chromosome. Crossover is an operator that attempts to exploit promising areas of the search space. It does this by forming new child chromosomes through the exchange of portions of existing parent chromosomes. Specifics of GGA crossover are detailed in Section IV-C. Mutation is an operator whose objective is to explore new areas of the search space. Generally, mutation works by altering a gene value for a selected chromosome. In order to keep promising chromosomes intact, mutation is generally applied to only a small percentage of the genes in a population of chromosomes [25].

IV. GROUPING GENETIC ALGORITHM FOR THE BLOCKMODELING PROBLEM

In this paper, we develop a GGA for the blockmodeling problem (BM-GGA). Following the framework described by Falkenauer, the GGA includes a revised encoding scheme, a GGA crossover operator, and a specialized repair operator (to maintain feasibility and improve the child solutions). Mutation is excluded, as is the practice in many GGAs. A traditional roulette wheel selection procedure [26], which is common in GAs, is used. Fig. 2 shows the pseudocode for the GGA for the blockmodel problem (BM-GGA).

Step 1: Initialization
  1(a) Randomly Generate a Population of n Solutions
Step 2: Selection Part 1
  2(a) Rank Chromosomes Based on OFV
  2(b) Calculate Fitness for Each Chromosome
  2(c) Save Best Solution Over All Generations
Step 3: Repeat Until n Children are Created
  3(a) Selection Part 2
       - Draw with Replacement Parent 1
       - Draw with Replacement Parent 2
  3(b) Perform Crossover
       - Select Two Cross-points
       - Insert Portion of Parent 1 Designated by the Cross-points into Parent 2
       - Adjust Child to Reflect Groups Inserted from Parent 1
  3(c) Remove Empty Groups
  3(d) Renumber Groups
  3(e) Repair Child
       - Remove Infeasible Nodes from Groups
       - Identify Singletons for Reassignment
       - Reassign Infeasible Nodes and Singletons
Step 4: Goto Step 2 Until Iteration Count Exceeds Maximum Number of Generations

Fig. 2. Pseudocode for BM-GGA.

The following sections detail the implementation of BM-GGA.

A. Encoding

A traditional GGA encoding is employed in which the chromosome contains two portions. The first portion maps each node to a group and the second portion lists the groups. This encoding complements the crossover operator that will be discussed in Section IV-C and is the one traditionally implemented in GGAs. To illustrate, the encoded chromosome for the solution shown in Fig. 1 is given below:

Node Assignments:  x = a b c d e f g h
                   n = 1 1 1 2 2 2 2 3
Groups:            G: 1 2 3

In this example, which is equivalent to Fig. 1, the first group consists of nodes a, b, and c. The second group contains nodes d, e, f, and g. The last group consists of only node h. The array locations denote the node x, and the value assigned to each location n(x) represents that node's assigned group. Since there are three groups in this solution, the "groups" portion G of the chromosome includes the values 1, 2, and 3. The crossover operator manipulates the "groups" portion of the chromosome.
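A minimal Python sketch of this two-part encoding follows (the paper's implementation is in Visual Basic .NET; the data layout and helper name here are our own illustrative assumptions).

```python
# Illustrative sketch of the grouping chromosome for the Fig. 1 solution.
chromosome = {
    # node-assignment portion: node -> group number
    "assign": {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2, "g": 2, "h": 3},
    # "groups" portion: the part manipulated by the crossover operator
    "groups": [1, 2, 3],
}

def members(chrom, g):
    """Nodes currently assigned to group g."""
    return [node for node, grp in chrom["assign"].items() if grp == g]

print([members(chromosome, g) for g in chromosome["groups"]])
# [['a', 'b', 'c'], ['d', 'e', 'f', 'g'], ['h']]
```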

B. Selection

Roulette wheel selection is implemented in BM-GGA. This selection mechanism is very commonly employed in traditional genetic algorithms and is used to determine which chromosomes from the current generation to use as parents to create the next generation. Roulette wheel selection [26] uses the function given in (4) to assign each chromosome in the population a fitness value based, in our case, on the quality of the solution's objective function value:

f = \frac{2r}{N(N + 1)}.        (4)

In (4), f is the fitness value being calculated for each chromosome, r is the rank of the chromosome, and N is the number of chromosomes being ranked. The chromosomes are sorted by the quality of their objective function values and given a rank r. The fitness value f determines the probability of that chromosome being selected as a parent; the fitness values are normalized between 0 and 1. Solutions are then selected from the population with replacement based on this wheel, which means the same solution may be used as a parent more than once. However, we do enforce the criterion that a pair of parents consists of two different solutions.
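A short Python sketch of this rank-based roulette wheel follows; it is not the authors' code, and we assume the best chromosome receives the largest rank r = N so that it obtains the highest selection probability.

```python
import random

def rank_fitness(objective_values):
    """Fitness (4): f = 2r / (N(N+1)) for ranks r = 1 (worst) .. N (best)."""
    N = len(objective_values)
    order = sorted(range(N), key=lambda i: objective_values[i])  # ascending HHI
    fitness = [0.0] * N
    for r, i in enumerate(order, start=1):
        fitness[i] = 2 * r / (N * (N + 1))
    return fitness  # values are normalized: they sum to 1

def pick_parents(population, fitness):
    """Spin the wheel with replacement, but force two different parents."""
    i = random.choices(range(len(population)), weights=fitness, k=1)[0]
    j = i
    while j == i:
        j = random.choices(range(len(population)), weights=fitness, k=1)[0]
    return population[i], population[j]
```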

C. Crossover

Once the parents are selected, crossover is performed to create the child (or children) for the new generation. In the current algorithm, we use generational replacement, which means that at every iteration of the GGA the entire generation is replaced with a new set of solutions. BM-GGA uses the customary GGA crossover and creates one child from each pair of parents using the process described below. The GGA crossover works on the "groups" portion of the chromosome. The following set of steps [24] describes the crossover process used to create a child from a pair of parents.

1) For the "groups" portion of the first parent, select two cross-points. The groups between these two cross-points will be the contributing groups from this parent to the child.
2) Insert the section of the "groups" portion of the chromosome extracted in Step 1 into the second parent.
3) Modify the node assignment portion of the second chromosome to reflect the group assignments from the contributed section of the chromosome from the first parent.
4) If necessary, apply problem-dependent repair/improvement method(s) to the new child. This method is tailored to the objective function and the constraints of the problem under consideration.

To demonstrate the crossover operation, we will use the following two chromosomes:

Node Assignments:  n1 = 1 3 1 4 4 2 2 3
Groups:            G1: 1 |2 3| 4

Node Assignments:  n2 = 1 1 1 2 2 2 2 3
Groups:            G2: 1 2 3

In G1, we create the first cross-point between the 1 and the 2 and the second cross-point between the 3 and the 4. The portion of G1 to be contributed to the child is shown between the vertical bars in the example above. The node assignments in n1 that belong to the contributed groups (those of nodes b, f, g, and h) have to be translated to the child. Thus, the child is created by inserting these values from n1 in place of the corresponding values in n2. The group portion of the child is created by moving the portion of G1 defined by the cross-points into G2. Shown below as an intermediate step, the node assignments of n3 that will come from n1 are marked with an asterisk, and the groups inserted from G1 are shown in brackets:

Node Assignments:  n3 = 1 1* 1 2 2 2* 2* 3*
Groups:            G3: 1 2 3 [2 3]

Renumbering the genes to reflect the new groups and updating the marked node assignments, we obtain the following child:

Node Assignments:  n3 = 1 5 1 2 2 4 4 5
Groups:            G3: 1 2 3 [4 5]

In this example, after the node assignments are updated, group 3 of the original n2 no longer has any nodes assigned to it. Since group 3 does not contain any nodes, we can eliminate that group in the child and renumber. We end up with the final child chromosome:

Node Assignments:  n3 = 1 4 1 2 2 3 3 4
Groups:            G3: 1 2 3 4
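The crossover walk-through above can be reproduced with the following Python sketch; the relabeling scheme and function names are our own assumptions, and in BM-GGA the repair operator of Section IV-D would then be applied to the resulting child.

```python
# Illustrative GGA crossover: inject the donated groups of parent 1 into parent 2,
# drop emptied groups, and renumber the surviving groups consecutively.
def gga_crossover(assign1, groups1, assign2, groups2, cut_lo, cut_hi):
    donated = groups1[cut_lo:cut_hi]              # e.g., groups [2, 3] of parent 1
    offset = max(groups2)                         # temporary labels that cannot clash
    child = dict(assign2)
    for node, g in assign1.items():
        if g in donated:                          # node follows its donated group
            child[node] = offset + donated.index(g) + 1
    surviving = sorted(set(child.values()))       # groups that still have members
    relabel = {g: i + 1 for i, g in enumerate(surviving)}
    child = {node: relabel[g] for node, g in child.items()}
    return child, sorted(set(child.values()))

nodes = "abcdefgh"
n1 = dict(zip(nodes, [1, 3, 1, 4, 4, 2, 2, 3]))   # parent 1, G1 = 1 |2 3| 4
n2 = dict(zip(nodes, [1, 1, 1, 2, 2, 2, 2, 3]))   # parent 2, G2 = 1 2 3
child, groups = gga_crossover(n1, [1, 2, 3, 4], n2, [1, 2, 3], 1, 3)
print([child[v] for v in nodes], groups)          # [1, 4, 1, 2, 2, 3, 3, 4] [1, 2, 3, 4]
```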

TABLE I
BM-GGA RESULTS

                                              Worst (over 10 runs)          Average (over 10 runs)         Best (out of 10 runs)
No.  Problem name      Nodes  Edges  Density  Max. HHI  Blocks  Time (s)    Max. HHI  Blocks  Time (s)     Max. HHI  Blocks  Time (s)
1    Soccer            20     95     50       64        7       0.1248      65.0      7.6     0.1154       72        10      0.1092
2    MBA               30     117    27       136       10      0.2184      136.0     10.0    0.1966       136       10      0.1872
3    Dwellings         33     162    31       133       10      0.2496      136.0     10.0    0.2356       137       10      0.2184
4    Airport1          40     282    36       276       8       0.1872      279.6     8.0     0.1950       280       8       0.1872
5    Airport2          40     408    52       362       7       0.1872      374.8     6.5     0.1763       388       6       0.1716
6    Airport3          47     393    36       353       9       0.2340      359.2     9.1     0.2496       367       9       0.2028
7    Airport4          47     569    54       429       7       0.2184      484.2     6.3     0.2106       519       6       0.2028
8    FTMBA1            100    2332   24       994       16      0.9984      1027.0    15.6    0.9703       1058      16      0.9984
9    FTMBA2            100    3496   35       1576      9       0.5616      1647.0    8.1     0.5429       1718      8       0.5616
10   Indian Village    141    2806   14       491       48      7.2228      505.4     45.8    7.0731       519       45      6.6768
11   Pattern Language  253    3678   6        761       97      121.9764    766.4     96.8    121.2904     773       96      125.0496
12   FTSE350           309    1076   1        771       146     217.0212    778.2     146.0   220.6264     787       145     226.2156
13   WUAR              500    39418  16       17514     37      27.0972     18454.0   38.4    26.4454      19078     37      26.8160

D. Repair Operator

In this paper, we are using the strictest definition of a block, requiring every member of the block to be similar to every other member in that block. We adopt this definition in order to be able to compare with the previous studies of Jessop [3]–[5], [8], [9], [14] and test the effectiveness of the heuristic. The requirement that every member of the block be similar to every other member [or β = 1 in (3)] means that if the adjacency matrix is rearranged to reflect the blocks along the diagonal (as illustrated in Fig. 1), all entries in each of the blocks represented in this matrix must be 1. A solution with a block containing a 0 (in this adjacency matrix representation) is infeasible by this definition. As previously discussed, this constraint may be relaxed in some situations, and BM-GGA could easily be adapted to handle this circumstance.

To repair the infeasibilities that may have been created by the crossover operator, assuming β = 1, the repair operator is applied to each child to identify nodes that belong to a group but are not similar to one or more of the other nodes in that group. In other words, the entries of the adjacency matrix corresponding to the node pairs in the block formation are searched to determine whether a 0 exists between the node under consideration and any other node currently in its group. If this condition exists, the node is marked for reassignment and removed from its current group. At this point in the operator, the reassignment array contains all the nodes that need to be reassigned to obtain a feasible solution, and those nodes have been removed from their original groups. Now that the infeasibilities have been removed, reassignments are considered that may provide improvements to the solution quality, as described next.

The objective of the current problem is to find a small number of large groups. Although a good solution may include a group containing only one node, a singleton does not contribute much to the objective function. As a first step to improve a child solution, we mark for reassignment all the singleton groups that exist in the current child. These singletons may be a result of the crossover operator itself or a result of the removal of the infeasibilities. By later attempting to add these singleton groups to a group with other nodes, simple improvements to the solution quality may be found. It should be mentioned that singletons are not disallowed in the final child solutions. If a node is found not to be compatible with the existing groups, it is left as a singleton. The existing singletons are simply marked for reassignment at this stage to see whether it is possible to reassign them to a fuller group.

Once all infeasible assignments and all singleton nodes have been marked, an attempt is made to reassign these nodes to groups. An array holds all the nodes that have been identified for reassignment as a result of the two checks above. An attempt is then made to sequentially reassign each node from this array to one of the remaining groups in the child. If the node can be added to a group without creating an infeasible solution, then the assignment is made. That is, the node is added to the group and removed from the reassignment array. Otherwise, the reassignment of that node to the next group is considered. If all the groups have been checked and no feasible reassignment exists, the node forms a group by itself. This process then iterates for the next node in the array. This process can create singleton groups, but it allows for the possibility that nodes may be added to those singleton groups as the reassignment array is traversed. Of course, if there are no nodes left in the reassignment array that are similar to the node in the singleton, the singleton will remain.
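A Python sketch of the repair and reassignment logic described above, assuming β = 1, is given below; the scan order, data structures, and tie-breaking are our own assumptions rather than the authors' Visual Basic .NET implementation.

```python
def repair(assign, x):
    """assign: node -> integer group label; x[(i, j)] = 1 if i and j are similar."""
    groups = {}
    for node, g in assign.items():
        groups.setdefault(g, []).append(node)

    def fits(node, members):
        return all(x[(node, m)] == 1 for m in members if m != node)

    # 1) Remove nodes that are not similar to every other member of their group.
    to_reassign = []
    for members in groups.values():
        for node in list(members):
            if not fits(node, members):
                members.remove(node)
                to_reassign.append(node)

    # 2) Mark singletons too, hoping to merge them into fuller groups later.
    for members in groups.values():
        if len(members) == 1:
            to_reassign.append(members.pop())

    # 3) Sequentially reinsert: the first group the node fits joins it; otherwise
    #    the node opens a new group, which later nodes in the array may join.
    for node in to_reassign:
        for members in groups.values():
            if members and fits(node, members):
                members.append(node)
                break
        else:
            groups[max(groups) + 1] = [node]

    return {node: g for g, members in groups.items() if members for node in members}
```

Accommodating a β of less than 1 would amount to replacing the all-pairs similarity test with a density threshold, in line with the modification suggested in the text.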

This operator corrects the infeasibilities and performs a simple improvement by attempting to create larger feasible groups. To accommodate a β of less than 1, the routine could easily be modified to tolerate a percentage of 0s in a block.

V. COMPUTATIONAL RESULTS AND DISCUSSION

For computational testing, the set of problems presented in [5] and [14] was used. These problems represent a number of different domains (see [14] for a complete description of the domain of each problem). Problem 1 was proposed by Jessop [9] in a study examining competitiveness in the English soccer leagues. Problem 2, from [4], looks at similarity between MBA students. Problem 3 [8] results from a study of dwellings. Problems 4–7 are all related to airport performance assessment [27]. Problems 8 and 9 are performance evaluation problems from MBA program data [14]. Problem 10 comes from a partitioning design problem for an Indian village [14]. Problem 11 concerns relationships among patterns of advice in software development [14]. Problem 12 is a social network analysis problem dealing with relationships among a company's board of directors [14]. Problem 13 is also a problem from MBA program data, though from a different dataset than Problems 8 and 9 [14].

In Tables I, II, and III, the problems are numbered and given a name corresponding to their application as discussed above. The number of nodes, which we use to refer to the size of the problem, is also given in the tables. The test instances range in size from 20 to 500 nodes. Also included in the tables are the number of edges in the graph and the density of the problem. An edge is a connection between two nodes in a graph, or a "1" in the adjacency matrix as described in Section II-C. The density of the problem indicates how many "1"s are in the adjacency matrix (i.e., the number of edges relative to the number of possible edges, which is determined by the number of nodes in the problem).


TABLE II
COMPARISON OF BM-GGA TO JESSOP'S BLOCKMODELING HEURISTIC

                                              BM-GGA (worst of 10)   BM-GGA (average of 10)   BM-GGA (best of 10)   Jessop's heuristic (from [5])
No.  Problem name  Nodes  Edges  Density      Max. HHI  Blocks       Max. HHI  Blocks          Max. HHI  Blocks      Max. HHI  Blocks
1    Soccer        20     95     50           64        7            65.0      7.6             72        10          58        9
2    MBA           30     117    27           136       10           136.0     10.0            136       10          104       12
3    Dwellings     33     162    31           133       10           136.0     10.0            137       10          99        13
4    Airport1      40     282    36           276       8            279.6     8.0             280       8           254       9
5    Airport2      40     408    52           362       7            374.8     6.5             388       6           336       8
6    Airport3      47     393    36           353       9            359.2     9.1             367       9           321       10
7    Airport4      47     569    54           429       7            484.2     6.3             519       6           403       8

The algorithm developed in this paper (BM-GGA) was written in Visual Basic .NET using the Visual Studio 2005 compilers. All testing was done on a laptop computer with a 2.40-GHz Intel Core 2 Duo CPU running Windows Vista. The GGA was allowed to iterate for only 20 generations, and the population size was set to 100. A population size between 50 and 200 is common in the GA literature, as it normally provides a good balance between runtime and solution quality. Similarly, we ran the algorithm for 20 generations because testing showed that good-quality results were obtained in reasonable computational time with this parameter choice. The algorithm could easily be set to use a different stopping condition or to run for more than 20 iterations, and increasing the number of iterations or the population size might provide better solution quality. The parameters used in this paper were chosen based on common practice and limited computational testing.

The algorithm was run 10 times on each test instance. Table I presents the results over these runs. Given are the average HHI value over the 10 runs, the average number of blocks in the solutions, and the average run time in seconds. The best and worst HHI values obtained from the 10 runs are also provided, along with the number of blocks in the best (or worst) solution and the time to obtain that solution.

In order to provide an idea of the quality of the solutions obtained by BM-GGA, we provide comparisons in Tables II and III. In Table II, we compare our solution quality to that of the heuristic proposed by [8]. The results for Jessop's heuristic were obtained from [5]. In Table II, we present our best result against the values reported in [5], as only one solution value was reported in that study. Results were reported for only the first seven problems using Jessop's heuristic, and no computational times or hardware specifications were given. Therefore, this can only be a loose comparison in terms of solution quality, as no comparison of computational effort can be made. The best solution found by BM-GGA is better than the solution reported for Jessop's heuristic for all seven problems. It should also be noted that the average solution quality of BM-GGA for all seven problems is also better than the values reported for Jessop's heuristic. BM-GGA's worst solution from 10 runs is also better than the values reported for Jessop's heuristic.

for Jessop’s heuristic. This result allows us to conclude that our algorithm is effective for this problem. Table III shows our results against the values obtained by the ILP approaches in [5] and [14]. These results provide solutions that can be considered the best known solutions for this test set. Therefore, this comparison provides an idea of how close our algorithm is to results that can be obtained from an exact approach. The worst, average, and best solutions obtained by BM-GGA are compared against the ILP solutions. We provide the results for Proll’s original ILP approach [5] as well as for the improved approach [14]. The original ILP approach was only applied to the smaller problems 1–7. The improved approach was run on all 13 problems but they were unable to obtain a solution for the last problem due to its size. The second to last column of Table III gives the percent deviation of the average solution quality of BM-GGA to the best ILP solution. The last column of Table III gives the percent deviation of our best solution from the best ILP solution. The negative values represent an improved solution found by BM-GGA. The ILP approach provides better quality solutions for eight of the problems although BM-GGA obtains solutions that are relatively close. BM-GGA provides better solutions to two instances. However, the solution to problem 3 of 125 is listed a best known solution in [5] but as an optimal in [14]. Therefore, there is a discrepancy and it may be possible that the solution for problem 3 in [14] was misprinted. The solution obtained by BM-GGA for problem 3 was checked by hand and is a valid solution to the problem instance as presented in [8]. The approaches tie on one other when considering the average solution quality for BM-GGA and the best ILP solution and tie on two others when considering the best solutions for both BM-GGA and the ILP methods. No comparison is possible for the largest instance. Overall, the results illustrate that our algorithm provides quite reasonable results quickly. Table III also provides a view of the difficulty of the problem. For the largest problem instance, the ILP approach was not able to obtain a solution, whereas the heuristic approach provided a solution relatively quickly. The hardware used for the different algorithms was different so a direct comparison of time is not possible. It can be seen from Table I that BMGGA runs relatively quickly. All solutions were obtained in under 4 min which indicates the number of iterations could

TABLE III
COMPARISON OF BM-GGA SOLUTIONS TO ILP SOLUTIONS

                                              ILP [5]    ILP [14]   BM-GGA      BM-GGA      BM-GGA     % dev. of avg.   % dev. of best
No.  Problem name      Nodes  Edges  Density  Max. HHI   Max. HHI   worst HHI   avg. HHI    best HHI   BM-GGA from      BM-GGA from
                                                                                                        best ILP         best ILP
1    Soccer            20     95     50       78         78         64          65.0        72         16.667           7.692
2    MBA               30     117    27       136        136        136         136.0       136        0.000            0.000
3    Dwellings         33     162    31       125        125        133         136.0       137        -8.800           -9.600
4    Airport1          40     282    36       280        280        276         279.6       280        0.143            0.000
5    Airport2          40     408    52       390        398        362         374.8       388        5.829            2.513
6    Airport3          47     393    36       363        369        353         359.2       367        2.656            0.542
7    Airport4          47     569    54       519        527        429         484.2       519        8.121            1.518
8    FTMBA1            100    2332   24       -          1084       994         1027.0      1058       5.258            2.399
9    FTMBA2            100    3496   35       -          1444       1576        1647.0      1718       -14.058          -18.975
10   Indian Village    141    2806   14       -          593        491         505.4       519        14.772           12.479
11   Pattern Language  253    3678   6        -          797        761         766.4       773        3.839            3.011
12   FTSE350           309    1076   1        -          825        771         778.2       787        5.673            4.606
13   WUAR              500    39418  16       -          -          17514       18454.0     19078      -                -

Block counts and run times for the BM-GGA solutions in this comparison are given in Table I.

VI. CONCLUSION

This paper presented a grouping genetic algorithm for the blockmodel problem (BM-GGA). The blockmodel problem attempts to find a small number of large blocks of highly connected nodes. The practical applications of using this technique to analyze structure in networks are wide-ranging. Our algorithm was demonstrated to be effective on this problem. It produced high-quality solutions compared to the previous heuristic approach from the literature. In comparison to the exact approaches, the algorithm was also shown to provide high-quality results. The runtimes of the algorithm were very reasonable. Future research could include applying the algorithm to obtain insight into an application in practice, such as exploring its use in information retrieval, social networking, threat assessment, etc. Improvements to the algorithm could also be explored, including incorporating an ILP solution method into the improvement routine for the heuristic and exploring the use and design of other GGA operators.

REFERENCES

[1] J. Scott, Social Network Analysis. London, U.K.: Sage, 2000.
[2] H. C. White, S. A. Boorman, and R. L. Breiger, "Social structure from multiple networks, I. Blockmodels of roles and positions," Amer. J. Sociol., vol. 81, pp. 730–737, 1976.
[3] A. Jessop, "Multiple attribute probabilistic assessment of the performance of some airlines," in Multiple Criteria Decision Making in the New Millennium. New York: Springer, 2001, pp. 417–426.
[4] A. Jessop, "Blockmodels with maximum concentration," Eur. J. Oper. Res., vol. 148, no. 1, pp. 56–64, 2003.
[5] L. Proll, "ILP approach to the blockmodel problem," Eur. J. Oper. Res., vol. 177, no. 2, pp. 840–850, 2007.
[6] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[7] M. E. J. Newman, "Detecting community structure in networks," Eur. Phys. J. B, vol. 38, pp. 321–330, 2004.
[8] A. Jessop, "Exploring structure: A blockmodel approach," Civ. Eng. Environ. Syst., vol. 19, no. 4, pp. 263–284, 2002.
[9] A. Jessop, "A measure of competitiveness in leagues: A network approach," J. Oper. Res. Soc., vol. 57, no. 12, pp. 1425–1434, 2006.
[10] C. Alexander, Notes on the Synthesis of Form. Cambridge, MA: Harvard Univ. Press, 1964.
[11] D. G. Elms, "From structure to a tree," Civ. Eng. Environ. Syst., vol. 1, pp. 95–106, 1983.
[12] O. C. Herfindahl, "Concentration in the U.S. steel industry," Ph.D. dissertation, Columbia Univ., New York, 1950.
[13] A. O. Hirschman, "The paternity of an index," Amer. Econ. Rev., vol. 54, pp. 761–762, 1964.
[14] A. Jessop, L. Proll, and B. M. Smith. (2007). Optimal cliques: Applications and solutions. Univ. Leeds, Leeds, U.K. [Online]. Available: http://www.comp.leeds.ac.uk/research/pubs/reports/2007/2007_03.pdf
[15] R. L. Breiger, S. A. Boorman, and P. Arabie, "An algorithm for clustering relational data with applications to social network analysis and comparison to multidimensional scaling," J. Math. Psychol., vol. 12, pp. 328–383, 1975.
[16] Project in Structural Analysis, "STRUCTURE: A computer program providing basic data for the network analysis of empirical positions in a system of actors," in Computer Program 1. Berkeley, CA: Univ. California, Survey Res. Center, 1981.
[17] R. S. Burt, "Positions in networks," Soc. Forc., vol. 55, pp. 93–122, 1976.
[18] P. Doreian, V. Batagelj, and A. Ferligoj, Generalized Blockmodeling. Cambridge, U.K.: Cambridge Univ. Press, 1994.


[19] R. S. Burt, "Models of network structure," Ann. Rev. Soc., vol. 6, pp. 79–141, 1980.
[20] G. H. Heil and H. C. White, "An algorithm for constructing homomorphisms of multiple graphs," Dept. Sociology, Harvard Univ., Cambridge, MA, 1974.
[21] G. H. Heil and H. C. White, "An algorithm for finding simultaneous homomorphic correspondences between graphs and their image graphs," Behav. Sci., vol. 21, pp. 26–35, 1976.
[22] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications in Biology, Control, and Artificial Intelligence. Ann Arbor, MI: Univ. Michigan Press, 1975.
[23] E. Falkenauer, "The grouping genetic algorithms: Widening the scope of the GAs," JORBEL Belgian J. Oper. Res., Stat. Comput. Sci., vol. 33, no. 1–2, pp. 79–102, 1992.
[24] E. Falkenauer, Genetic Algorithms for Grouping Problems. New York: Wiley, 1998.
[25] K. DeJong, "An analysis of the behavior of a class of genetic adaptive systems," Ph.D. dissertation, Univ. Michigan, Ann Arbor, MI, 1975.
[26] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[27] A. Jessop, "A multiattribute assessment of airport performance," presented at the 25th Eur. Working Group on Financial Modelling, 1999.

Tabitha James received the BBA and Ph.D. degrees in business administration with a major in management information systems and a minor in production and operations management from the University of Mississippi. She is currently an Associate Professor in the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg. Her research interests are in the areas of combinatorial optimization, heuristics, and parallel computing.

Evelyn Brown received the B.S. degree in mathematics from Furman University, Greenville, SC, the M.S. degree in operations research from North Carolina State University, Raleigh, and the Ph.D. degree in systems engineering from the University of Virginia, Charlottesville. She currently works as an Associate Professor in the Department of Engineering, College of Technology and Computer Science, East Carolina University, Greenville. Her research is mainly in applications of genetic algorithms, and her work has been published in journals such as the International Journal of Production Research, Computers and Industrial Engineering, OMEGA–The International Journal of Management Science, and Engineering Applications of Artificial Intelligence. Dr. Brown is a Member of the American Society for Engineering Education, the Institute of Industrial Engineers, the International Council on Systems Engineering, and the Society of Women Engineers.

Cliff T. Ragsdale received the B.A. degree in psychology and the MBA degree from the University of Central Florida, Orlando, and the Ph.D. degree in management science and information technology from the University of Georgia, Atlanta. He is currently the Bank of America Professor in the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg. He has published more than 40 research articles and is the author of the textbook Spreadsheet Modeling and Decision Analysis (South-Western College Publishing, 2007). His research interests center on the use of artificial intelligence and quantitative modeling techniques to solve complex business problems.
