IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 14, NO. 1, FEBRUARY 2010
Grouping Genetic Algorithm for the Blockmodel Problem Tabitha James, Evelyn Brown, and Cliff T. Ragsdale
Abstract— Many areas of research examine the relationships between objects. A subset of these research areas focuses on methods for creating groups whose members are similar based on some specific attribute(s). The blockmodel problem has as its objective to group objects in order to obtain a small number of large groups of similar nodes. In this paper, a grouping genetic algorithm (GGA) is applied to the blockmodel problem. Testing on numerous examples from the literature indicates a GGA is an appropriate tool for solving this type of problem. Specifically, our GGA provides good solutions, even to large-size problems, in reasonable computational time. Index Terms— Blockmodel, grouping genetic algorithm (GGA), social network analysis.
I. INTRODUCTION
CLASSICAL social network analysis combines sociology with graph theory in order to study complex networks of relationships between entities. Social network analysis consists of a set of tools (or methods) for analyzing networks of relations between entities. It is convenient to represent a social network as a graph, where the entities are represented as nodes and the relationships between nodes are represented as arcs. This representation is a traditional graph in graph theory. It should also be mentioned that the arcs may be directed and/or weighted, but throughout this paper an undirected nonweighted network will be assumed. The data view of the graph is an adjacency matrix. While the graph provides the visual representation, the adjacency matrix is the primary data source that is used as input to the methods. A subset of social network analysis tools includes methods to observe patterns or structures in the graph, typically by manipulating the adjacency matrix to obtain some type of groups. One such grouping technique is referred to as blockmodeling. Blockmodeling originated from the social network analysis arena [2]. Blockmodeling attempts to encourage the discovery of a small number of large groups of densely connected nodes in a graph. In practice, this leads to the detection of groups of entities that are all related (or similar) to one another. While there exist many practical applications from being able to analyze the structure of social
Manuscript received July 2, 2008; revised December 1, 2008; accepted April 12, 2009. Current version published January 29, 2010. T. James and C. T. Ragsdale are with the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061 USA (e-mail: [email protected]; [email protected]). E. Brown is with the Department of Engineering, College of Technology and Computer Science, East Carolina University, Greenville, NC 27858 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TEVC.2009.2023793
networks, blockmodeling has also been utilized recently to analyze multi-attribute measures of performance in several domains [3], [4], which will be described in the next section. It has also been suggested that blockmodeling may provide an attractive alternative approach to data envelopment analysis [5].
In this paper, we develop a grouping genetic algorithm (GGA) for the blockmodeling problem (BM-GGA). The intention of the current study is to illustrate that a GGA is an appropriate tool for application to the blockmodeling problem. In Section II we present some necessary terminology and background on blockmodeling. Specifically, in Section II-A we describe the terminology for groups used in the literature. In Section II-B, we present an overview of previous solution methodologies for the blockmodeling problem as well as some application areas. In Section II-C, we present the mathematical formulation for the blockmodeling problem and discuss the appropriateness of applying a heuristic solution method to the problem. Section III provides a brief introduction to GGAs. In Section IV, we describe in detail the GGA we developed for the blockmodeling problem. To illustrate the effectiveness of the heuristic approach proposed here, in Section V we test the algorithm on a set of problems obtained from the literature. These problems come from various problem domains and vary in size (and therefore solution complexity). We present results of our algorithm as well as provide comparisons to previous solution techniques from the literature. Our conclusions are presented in Section VI.
II. PROBLEM OVERVIEW
A. Groups in Social Networking
Methods for grouping items into some type of defined subsets abound in the literature of many disciplines. There exist numerous alternative terms in the literature for organizing groups of objects into defined subsets. Graph theorists have long analyzed, for example, cliques, sets, clusters, and partitions.
Problems in these categories belong to the set of combinatorial optimization problems that are known to be difficult to solve. The study of solution methodologies for such problems has been popular in operations research, computer science, and applied mathematics. Borrowing from graph theory, social network analysis uses these concepts in an applied manner to examine social structures. Cliques, blocks, clusters, and community structures have all been used in the social networking literature to refer to some formed subsets (or pattern) in a graph. Commonly
accepted uses of these terms can be distinguished by looking at the strictness with which the concepts of membership and density are applied. We will use membership to define the number of groups to which a node may belong. Density refers to the tightness of the interconnections between the members of the group. A block typically limits membership of a node to exactly one group [4]. Although a common block definition requires all nodes to be directly connected to one another, it has also been prevalent in the literature to relax this constraint in favor of high density [1], [4]–[6]. A clique, on the other hand, typically allows membership to more than one group [4] but requires all nodes to be directly connected to each other [1]. A “true clique” is a group of nodes that are all directly connected to each other. This existence of a direct connection between every pair of nodes is the defining factor of a clique. As true cliques are not as prevalent in real networks as one would like, this definition has also been relaxed a bit in some cases. It is common to view a cluster as relaxing the constraints on both membership and density. However, the terms cluster and clique are often used interchangeably. In most of the relevant literature, the primary distinctions between the two terms are due to the connections that exist in the group and the separation [1]. Clusters can be characterized as areas of high density in a graph and are typically more distinctly separated. Thus, a cluster is a bit more loosely defined than a block or a clique. The research in the physics community on detecting community structure in networks (see [7]) is a related notion to clustering. All of the formations described above have become areas of interest in the field of social networking, which itself has become increasingly more interdisciplinary. Many different approaches for detecting these various formations in a network (or graph) exist, including the use of heuristics. 
This is an area of much ongoing research, as determining significant structural features of many types of networks can provide interesting practical insights. In the next section, we review the methods specifically written for blockmodeling, as this is the focus of the current study.
B. Review of Solution Methodologies for Blockmodeling
Much of the recent work in the area of blockmodeling has come from Jessop [3], [4], [8], [9]. In his analysis of 30 international airlines, Jessop [3] applies multidimensional scaling to cluster airlines of similar performance. His approach utilizes a linear weighted sum of six specific performance ratios in order to determine an overall measure of airline performance. Once probability distributions for the values of the weights are specified, Monte Carlo simulation is employed to determine the mean and variance of all pairwise differences in scores. The values of the standardized differences between airports are converted into a binary matrix based on a chosen significance level. Rows and columns of the matrix are rearranged to form dense blocks along the diagonal. Airports in the same block are said to be similar. The blocks are then ranked using their mean score.
Another work by Jessop [8] provides a comparison of his blockmodeling approach to those of Alexander [10] and Elms [11]. The objective for Jessop’s blockmodel heuristic is to maximize the sum of squared block sizes, known as the Herfindahl–Hirschman Index or HHI [12], [13]. Jessop’s procedure for block formation, which was also used in his analysis of airlines, begins by finding the node with the maximum number of connections. A block is built around this node by augmenting it with the node having the most connections that is also connected to all current block members. The augmenting step is repeated until there is no node that can be added without violating the constraint that it is connected to all current block members. Whereas the approaches of Alexander and Elms build cliques, Jessop’s approach builds blocks, meaning a node may belong to only one group. This being the case, all nodes assigned to the initial block are removed from further consideration and the steps are repeated to form the next block. The process iterates. Jessop also found improved algorithm performance by omitting a constraint on the minimum block size. In a subsequent work, Jessop formulates the blockmodel problem as a quadratic program [4]. The approach is applied to two scenarios: establishing groups of MBA students based upon their elective choices, and performance ranking of British universities. The quadratic program is solved using commercial software. Jessop also presents results of applying his blockmodel heuristic [8] to the MBA and British university datasets. More recent work by Jessop [9] applies the blockmodel approach to the problem of assessing the competitiveness of soccer leagues. In that study, league competitiveness is measured by the number of maximally dense blocks that can be constructed based on the similarities of the teams in the league, and Jessop demonstrates that blockmodels are a feasible means of describing league performance.
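Jessop's block-building procedure, as summarized above, can be sketched in a few lines of Python. This is an illustrative reading of the prose description, not Jessop's implementation: the function names, the small test graph, the alphabetical tie-breaking, and the choice to count connections only among nodes not yet assigned are all assumptions made here.

```python
def greedy_blocks(adj):
    """Greedily partition nodes into fully connected blocks.

    adj maps each node to the set of nodes it is connected to
    (assumed symmetric). Ties are broken alphabetically (assumption).
    """
    remaining = set(adj)
    blocks = []
    while remaining:
        # Connections are counted among unassigned nodes only (assumption).
        def degree(v):
            return len(adj[v] & remaining)

        seed = max(sorted(remaining), key=degree)
        block = [seed]
        while True:
            # A candidate must be connected to every current block member.
            candidates = [v for v in sorted(remaining)
                          if v not in block
                          and all(v in adj[u] for u in block)]
            if not candidates:
                break
            block.append(max(candidates, key=degree))
        blocks.append(block)
        remaining -= set(block)  # a node may belong to only one block
    return blocks


def hhi(blocks):
    """Sum of squared block sizes."""
    return sum(len(b) ** 2 for b in blocks)


# Hypothetical five-node graph: a triangle a-b-c plus a path c-d-e.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
found = greedy_blocks(example)
print(found, hhi(found))  # [['c', 'a', 'b'], ['d', 'e']] 13
```

On this toy graph the first block grows around node c, the node with the most connections, and the remaining nodes d and e form the second block.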
An integer-linear programming (ILP) formulation of the blockmodel problem is developed by Proll [5]. Proll incorporates Jessop’s [3] formulation of the blockmodel problem, but linearizes the model. Even with the increased problem size that results from the linearization, Proll is able to apply ILOG CPLEX and find feasible solutions for each of the seven problems tested. For four of the problems, the solution Proll obtains is superior to the solution found by Jessop. Proll also examines the application of ILOG CPLEX to two other models intended to alleviate problems of symmetry (i.e., block labels are arbitrary). Proll reports that his solutions are inferior to those of Jessop for the models that employ “symmetry breakers.” More recent work by Jessop et al. [14] presents two ILP formulations as well as a heuristic approach to the blockmodel problem. The authors refer to the two ILP formulations as the vertex formulation and the clique formulation. To obtain a solution, all cliques of a given graph are enumerated and the resulting set partitioning problem is solved. The clique formulation offers the advantage that it does not require specification of the maximum number of clusters. Jessop et al. [14] point out that an optimal solution for the clique formulation is also an optimal solution to the
JAMES et al.: GROUPING GENETIC ALGORITHM FOR THE BLOCKMODEL PROBLEM
integer programming formulation presented in [4]. Since the problem of enumerating all cliques in a graph is NP-hard, it is necessary to explore non-ILP approaches when solving larger problems (i.e., greater than 50 vertices). This paper examines a heuristic approach that begins by decomposing the given graph into subgraphs with nonoverlapping vertices. For each subgraph, the optimal clique problem is solved. The resulting solutions are combined to provide an initial solution for the given graph. This initial solution is improved by redistributing vertices “from the smallest clique to the largest possible clique until no further changes are possible” [14, p. 18]. Both Jessop and Proll point out that their work is based upon earlier approaches. One such algorithm is CONCOR [15]. CONCOR predates the common definition of a block, presented in Section II-A above, requiring all nodes in a block to be directly connected to one another. For a given problem, CONCOR produces a partition of the nodes into exactly two equivalence classes. Applied repeatedly, it produces smaller classes as each equivalence class is partitioned, thus creating a hierarchical clustering. Another of the early blockmodel approaches is contained in the program named STRUCTURE [16], developed by Ronald Burt [17]. It differs from other approaches in that it uses Euclidean distances as a dissimilarity measure [18]. Based on strength of relations, a 0/1 matrix is determined based upon a chosen level of similarity. The matrix serves as input to the block model procedure of STRUCTURE, and analysis is performed in a manner similar to CONCOR [1]. BLOCKER is yet a third example of one of the earlier approaches to the blockmodel problem [20]. In establishing appropriate blocks, BLOCKER makes it possible to identify nodes (called crystallizers) whose placement determines the placement of numerous other nodes and to identify nodes (called floaters) allowed multiple assignments [21]. 
BLOCKER differs from CONCOR in that BLOCKER requires as input a hypothesized initial blockmodel. It then derives assignments of nodes to blocks based on satisfying the hypothesis for the given data matrices.
C. Problem Formulation
We adopt the formulation suggested by [4] and also used in [5]. In this formulation, the objective is to maximize the sum of squared block sizes. This term HHI is a popular measure in economics [12], [13] that is used to look at the degree of industrial concentration. Maximizing this function encourages solutions with a small number of large blocks. Formally, the objective function can be given as

$$\max \; \mathrm{HHI} = \sum_{k=1}^{b} \left( \sum_{i=1}^{n} \lambda_{ik} \right)^{2}. \qquad (1)$$

In (1), b is the maximum number of blocks, n is the number of nodes, and $\lambda_{ik} = 1$ if node i belongs to block k, otherwise $\lambda_{ik} = 0$. By definition, a block allows membership of a node to only one group. Therefore, we must add a set of constraints to enforce the membership of a node to one group (2) as well as a set of constraints to impose the desired density (3). Therefore, (1) is subject to the following:

$$\sum_{k=1}^{b} \lambda_{ik} = 1 \quad \forall\, i \qquad (2)$$

$$\sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} \lambda_{ik} \lambda_{jk} - \beta \left( \sum_{i=1}^{n} \lambda_{ik} \right)^{2} \ge 0 \quad \forall\, k. \qquad (3)$$

     a  b  c  d  e  f  g  h
a    1  1  1  0  1  0  0  1
b    1  1  1  0  0  1  0  0
c    1  1  1  0  0  1  0  0
d    0  0  0  1  1  1  1  0
e    1  0  0  1  1  1  1  0
f    0  1  1  1  1  1  1  0
g    0  0  0  1  1  1  1  0
h    1  0  0  0  1  0  0  1
Fig. 1. Example blockmodel solution.
The parameter β is the minimum block density required. If β = 1, then all nodes in a block must be directly connected (adjacent) to every other node in the block. Typically, a value of β either equal to or close to 1 is desirable. To illustrate, consider the reordered adjacency matrix given in Fig. 1, where a 1 in row i and column j denotes that nodes i and j are similar (or adjacent), and a 0 denotes that they are not. The boxes around the 1s in Fig. 1 represent a solution. As can be seen, nodes {a, b, c} compose block 1, nodes {d, e, f, g} compose block 2, and block 3 is made up of only node {h}. The value of the objective function for this solution, using (1), becomes $3^2 + 4^2 + 1^2 = 26$. Constraints (2) are satisfied, as no node is in more than one group. For example, node a is also like node h, but since membership to more than one group is restricted, node a is only a part of group 1. In this example, β = 1, which forces all nodes to be connected to all other nodes in their group (hence, no 0’s appear in the groups). By decreasing β, it would be possible for the solution to have a block containing a zero (a density of less than 100%). A singleton (group with only one node) is also shown in the example. Whether singletons are allowed is context dependent, and in some settings they are not. Proll [5] notes that the continuous relaxation of Jessop’s [4] formulation in (1)–(3) results in a nonconvex feasible region: a difficulty that is compounded by the desired direction of optimization. Proll goes on to show that the model can be linearized by the introduction of binary variables to replace the product term $\lambda_{ik} \lambda_{jk}$ in (3). However, this linearization results in a substantial increase in problem size. Proll’s [5] computational testing suggests an ILP approach may be an effective approach for solving full-density block modeling problems of up to 50 nodes. However, he notes that solving larger problems remains problematic, especially for the nonfull
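As a quick check of the worked example, the objective (1) and the density constraint (3) can be evaluated directly on the Fig. 1 matrix. The sketch below is illustrative only: the function names and the list-based representation (one group label per node) are assumptions, and constraint (2) holds by construction because each node carries exactly one label.

```python
# Adjacency matrix transcribed from Fig. 1 (rows and columns in order a..h).
X = [
    [1, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 1],
]


def blocks_of(assignment):
    """Map each group label to the list of node indices assigned to it."""
    groups = {}
    for node, label in enumerate(assignment):
        groups.setdefault(label, []).append(node)
    return groups


def hhi(assignment):
    """Objective (1): sum of squared block sizes."""
    return sum(len(m) ** 2 for m in blocks_of(assignment).values())


def satisfies_density(x, assignment, beta=1.0):
    """Constraint (3): within-block ones minus beta * size^2 is nonnegative."""
    for members in blocks_of(assignment).values():
        ones = sum(x[i][j] for i in members for j in members)
        if ones - beta * len(members) ** 2 < 0:
            return False
    return True


fig1 = [1, 1, 1, 2, 2, 2, 2, 3]      # blocks {a,b,c}, {d,e,f,g}, {h}
print(hhi(fig1))                      # 26
print(satisfies_density(X, fig1))     # True

h_in_block_1 = [1, 1, 1, 2, 2, 2, 2, 1]   # h is not adjacent to b or c
print(satisfies_density(X, h_in_block_1))  # False
```

Moving node h into block 1 would raise the objective but violates (3) at β = 1, which is why the example keeps h as a singleton.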
density block modeling problem where the size of the resulting ILP may make it difficult to find good solutions, even if only a small number of blocks need to be formed. Thus, effective heuristics are needed for block modeling problems where ILP approaches are not appropriate. This paper focuses on the use of a GGA as a heuristic approach for solving difficult block modeling problems.
III. OVERVIEW OF GROUPING GENETIC ALGORITHMS
A genetic algorithm (GA) is a heuristic search technique that works by manipulating a population of solutions. Using the operators of selection, crossover, and mutation, a GA combines parts of successful solutions in an attempt to form better ones. In this manner, a GA is said to utilize an analogy to natural selection and survival of the fittest [22]. The GGA was introduced by Emanuel Falkenauer [23]. It has a specialized chromosome and specialized operators to overcome the problems that typically arise when a standard GA is applied to a grouping problem. Falkenauer [24] points out three specific shortcomings of applying a standard GA to a grouping problem. First, the standard encoding scheme is highly redundant. Second, the application of the crossover operator often results in an offspring chromosome which has few or no characteristics in common with the parent chromosomes used to produce it. Third, standard GA mutation can be too disruptive when trying to establish successful groups. Falkenauer’s GGA proposes a new encoding scheme in an effort to overcome these drawbacks. This encoding is accompanied by revised crossover and mutation operators which have been shown to produce high-quality solutions to a variety of grouping problems [24]. Falkenauer’s encoding scheme includes an augmented chromosome. The augmented chromosome consists of the standard GA chromosome appended with a listing of the groups. Crossover and mutation are applied to the group section of the chromosome, resulting in alteration of the main chromosome.
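The redundancy of the standard encoding mentioned above can be illustrated with a short sketch. Group labels are arbitrary, so several different label vectors describe the same grouping. The helper name `canonical` and the example vectors are assumptions made for illustration.

```python
def canonical(assignment):
    """Relabel groups in order of first appearance.

    Two label vectors that describe the same grouping map to the
    same canonical vector.
    """
    relabel = {}
    return [relabel.setdefault(g, len(relabel) + 1) for g in assignment]


# Two different chromosomes under the standard (one label per item) encoding.
first = [1, 1, 1, 2, 2, 2, 2, 3]
second = [3, 3, 3, 1, 1, 1, 1, 2]

print(first == second)                        # False: the chromosomes differ
print(canonical(first) == canonical(second))  # True: the grouping is the same
```

With three groups and three labels there are 3! = 6 such equivalent vectors for a single grouping, and a standard GA would treat each of them as a distinct solution.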
Crossover is an operator that attempts to exploit promising areas of the search space. It does this by forming new child chromosomes through the exchange of portions of existing parent chromosomes. Specifics of GGA crossover are detailed in Section IV-C. Mutation is an operator whose objective is to explore new areas of the search space. Generally, mutation works by altering a gene value for a selected chromosome. In order to keep promising chromosomes intact, mutation is generally applied to only a small percentage of the genes in a population of chromosomes [25].
IV. GROUPING GENETIC ALGORITHM FOR THE BLOCKMODELING PROBLEM
In this paper, we develop a GGA for the blockmodeling problem (BM-GGA). Following the framework described by Falkenauer, the GGA includes a revised encoding scheme, GGA crossover operator, and a specialized repair operator (to maintain feasibility and improve the child solutions). Mutation is excluded, as is the practice in many GGAs. A traditional roulette wheel selection procedure [26], which is common in GAs, is used. Fig. 2 shows the pseudocode for the GGA for
Step 1: Initialization
  1(a) Randomly Generate a Population of n Solutions
Step 2: Selection Part 1
  2(a) Rank Chromosomes Based on OFV
  2(b) Calculate Fitness for Each Chromosome
  2(c) Save Best Solution Over All Generations
Step 3: Repeat Until n Children are Created
  3(a) Selection Part 2
       - Draw with Replacement Parent 1
       - Draw with Replacement Parent 2
  3(b) Perform Crossover
       - Select Two Cross-points
       - Insert Portion of Parent 1 Designated by the Cross-points into Parent 2
       - Adjust Child to Reflect Groups Inserted from Parent 1
  3(c) Remove Empty Groups
  3(d) Renumber Groups
  3(e) Repair Child
       - Remove Infeasible Nodes from Groups
       - Identify Singletons for Reassignment
       - Reassign Infeasible Nodes and Singletons
Step 4: Goto Step 2 Until Iteration Count Exceeds Maximum Number of Generations
Fig. 2. Pseudocode for BM-GGA.
the blockmodel problem (BM-GGA). The following sections detail the implementation of BM-GGA.
A. Encoding
A traditional GGA encoding is employed where the chromosome contains two portions. The first portion maps each node to a group and the second portion lists the groups. This encoding complements the crossover operator that will be discussed in Section IV-C and is traditionally implemented in GGAs. To illustrate, the encoded chromosome for the solution shown in Fig. 1 is given below

Node Assignments: x = a b c d e f g h
                  n = 1 1 1 2 2 2 2 3
Groups: G: 1 2 3.
In this example, which is equivalent to Fig. 1, the first group consists of nodes a, b, and c. The second group contains nodes d, e, f, and g. The last group consists of only node h. The array locations denote the node x, and the value assigned to each location n(x) represents that node’s assigned group. Since in this solution there are three groups, the “groups” portion G of the chromosome includes the values 1, 2, and 3. The crossover operator will manipulate the “groups” portion of the chromosome.
B. Selection
Roulette wheel selection is implemented in BM-GGA. This selection mechanism is very commonly employed in traditional genetic algorithms and is used to determine which chromosomes from the current generation to use as parents to create the next generation. Roulette wheel selection [26] uses the function given in (4) to assign each chromosome in the population a fitness value based, in our case, on the quality of the solution’s objective function value

$$f = \frac{2r}{N(N+1)}. \qquad (4)$$
In (4), f is the fitness value being calculated for each chromosome, r is the rank of the chromosome, and N is the number of chromosomes being ranked. The chromosomes are sorted by the quality of their objective function and given a rank r. The fitness value f determines the probability of that chromosome being selected as a parent. The fitness value is normalized between 0 and 1. Solutions are then selected from the population with replacement, which means the same solution may be used as a parent more than once, based on this wheel. However, we do enforce the criterion that a pair of parents consists of two different solutions.
C. Crossover
Once the parents are selected, crossover is performed to create the child (or children) for the new generation. In the current algorithm, we use generational replacement which means that at every iteration of the GGA, the entire generation is replaced with a new set of solutions. BM-GGA uses the customary GGA crossover and creates one child from each pair of parents using the process described below. The GGA crossover works on the “groups” portion of the chromosome. The following set of steps [24] describes the crossover process to create a child from a pair of parents.
1) For the “groups” portion of the first parent, select two cross-points. The groups between these two cross-points will be the contributing groups from this parent to the child.
2) Insert the section of the “groups” portion of the chromosome extracted in Step 1 into the second parent.
3) Modify the node assignment portion of the second chromosome to reflect the group assignments from the contributed section of the chromosome from the first parent.
4) If necessary, apply a problem dependent repair/improvement method(s) to the new child. This method will be tailored based upon the objective function and the constraints of the problem under consideration.
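The rank-based fitness in (4) and crossover steps 1) to 3) can be sketched together as follows. This is a minimal illustration, not the authors' Visual Basic .NET code. Several details are assumptions made here: the best chromosome receives rank N, parents are required to differ by position in the population, cross-points are given as slice indices into the groups list, and contributed groups receive fresh labels following the second parent's largest label. The repair of step 4) is not included.

```python
import random


def rank_fitness(objective_values):
    """Fitness from (4): f = 2r / (N(N+1)), with rank N for the best (assumed)."""
    n = len(objective_values)
    order = sorted(range(n), key=lambda i: objective_values[i])  # worst first
    fitness = [0.0] * n
    for rank, i in enumerate(order, start=1):
        fitness[i] = 2 * rank / (n * (n + 1))
    return fitness


def pick_parents(population, fitness, rng=random):
    """Draw two parents with replacement, redrawing until they differ."""
    indices = range(len(population))
    first = rng.choices(indices, weights=fitness)[0]
    second = first
    while second == first:
        second = rng.choices(indices, weights=fitness)[0]
    return population[first], population[second]


def crossover(n1, g1, n2, g2, cut_a, cut_b):
    """Insert groups g1[cut_a:cut_b] of parent 1 into parent 2.

    Returns the child's node assignments and groups after empty groups
    are removed and the remaining groups are renumbered.
    """
    contributed = g1[cut_a:cut_b]
    offset = max(g2)  # fresh labels for the contributed groups
    new_label = {g: offset + k for k, g in enumerate(contributed, start=1)}
    # Nodes in a contributed group take the new label; others keep parent 2's.
    child = [new_label.get(a, b) for a, b in zip(n1, n2)]
    used = sorted(set(child))  # groups that still contain nodes
    renumber = {old: new for new, old in enumerate(used, start=1)}
    return [renumber[g] for g in child], list(range(1, len(used) + 1))


# Parents taken from the crossover demonstration in this section.
n1, g1 = [1, 3, 1, 4, 4, 2, 2, 3], [1, 2, 3, 4]
n2, g2 = [1, 1, 1, 2, 2, 2, 2, 3], [1, 2, 3]
child_nodes, child_groups = crossover(n1, g1, n2, g2, 1, 3)
print(child_nodes, child_groups)  # [1, 4, 1, 2, 2, 3, 3, 4] [1, 2, 3, 4]

weights = rank_fitness([10, 30, 20])  # the value 30 is best, so it gets rank 3
print(weights[1])                     # 0.5
```

Because the ranks 1 through N sum to N(N+1)/2, the fitness values from (4) sum to 1 and can be used directly as selection probabilities.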
To demonstrate the crossover operation, we will use the following two chromosomes:

Node Assignments: n1 = 1 3 1 4 4 2 2 3
Groups: G1: 1 |2 3| 4
Node Assignments: n2 = 1 1 1 2 2 2 2 3
Groups: G2: 1 2 3

In G1, we create the first cross-point between 1 and 2 and the second cross-point between the 3 and 4. The portion of G1 to be contributed to the child is shown between the lines in the example above. The values in n1 that assign nodes to groups 2 and 3 (the second, sixth, seventh, and eighth positions) have to be translated to the child. Thus, the child is created by inserting these values from n1 in place of the corresponding values in n2. The group portion of the child is created by moving the portion of G1 defined by the cross-points to G2. Shown below as an intermediate step is child n3, whose second, sixth, seventh, and eighth node assignments will come from n1

Node Assignments: n3 = 1 1 1 2 2 2 2 3
Groups: G3: 1 2 3 [2 3].

Renumbering the genes to reflect the new groups and the modified node assignments we obtain the following child:

Node Assignments: n3 = 1 5 1 2 2 4 4 5
Groups: G3: 1 2 3 [4 5].

In this example, after the node assignments are updated, group 3 in the original n2 no longer has any nodes assigned to it. Since group 3 does not contain any nodes, we can eliminate that group in the child and renumber. We end up with the final child chromosome

Node Assignments: n3 = 1 4 1 2 2 3 3 4
Groups: G3: 1 2 3 4.

D. Repair Operator
In this paper, we are using the strictest definition of a block and requiring every member of the block to be similar to every other member in that block. We adopt this definition in order to be able to compare to the previous studies of Jessop [3]–[5], [8], [9], [14] and test the effectiveness of the heuristic. The requirement that every member of the block be similar to every other member [or β = 1 in (3)] means that if the adjacency matrix is rearranged to reflect the blocks along the diagonal (as illustrated in Fig. 1), all entries in each of the blocks represented in this matrix must be 1. A solution with a block containing a 0 (in this adjacency matrix representation) is infeasible by this definition. As previously discussed, this constraint may be relaxed in some situations and BM-GGA could easily be adapted to handle this circumstance.
To repair the infeasibilities that may have been created by the crossover operator, assuming β = 1, the repair operator is applied to each child to identify nodes that belong to a group that are not like one or more of the other nodes in that group. In other words, the entries of the adjacency matrix corresponding to the node pairs in the block formation are searched to determine if a 0 exists between the node under consideration and any other node currently in its group. If this condition exists, the node is marked for reassignment and removed from its current group. At this point in the operator, the reassignment array contains all the nodes that need to be reassigned to obtain a feasible solution and those nodes have been removed from their original groups. Now that the infeasibilities have been removed, reassignments are considered that may provide improvements to the solution quality, as described next.
TABLE I
BM-GGA RESULTS

                                                 Worst solution (over 10 runs)      Average solution (over 10 runs)    Best solution (out of 10 runs)
No. Problem name      Nodes  Edges  Density (%)  Max. HHI  No. of blocks  Time (s)  Max. HHI  No. of blocks  Time (s)  Max. HHI  No. of blocks  Time (s)
1   Soccer               20     95           50        64              7    0.1248      65.0            7.6    0.1154        72             10    0.1092
2   MBA                  30    117           27       136             10    0.2184     136.0           10.0    0.1966       136             10    0.1872
3   Dwellings            33    162           31       133             10    0.2496     136.0           10.0    0.2356       137             10    0.2184
4   Airport1             40    282           36       276              8    0.1872     279.6            8.0    0.1950       280              8    0.1872
5   Airport2             40    408           52       362              7    0.1872     374.8            6.5    0.1763       388              6    0.1716
6   Airport3             47    393           36       353              9    0.2340     359.2            9.1    0.2496       367              9    0.2028
7   Airport4             47    569           54       429              7    0.2184     484.2            6.3    0.2106       519              6    0.2028
8   FTMBA1              100   2332           24       994             16    0.9984    1027.0           15.6    0.9703      1058             16    0.9984
9   FTMBA2              100   3496           35      1576              9    0.5616    1647.0            8.1    0.5429      1718              8    0.5616
10  Indian Village      141   2806           14       491             48    7.2228     505.4           45.8    7.0731       519             45    6.6768
11  Pattern Language    253   3678            6       761             97  121.9764     766.4           96.8  121.2904       773             96  125.0496
12  FTSE350             309   1076            1       771            146  217.0212     778.2          146.0  220.6264       787            145  226.2156
13  WUAR                500  39418           16     17514             37   27.0972   18454.0           38.4   26.4454     19078             37   26.8160
The objective of the current problem is to try to find a small number of large groups. Although a good solution may include a group containing only one node, a singleton does not contribute much to the objective function. As a first step to improve a child solution, we mark for reassignment all the singleton groups that exist in the current child. These singletons may be a result of the crossover operator itself or a result of the removal of the infeasibilities. By later attempting to add these singleton groups to a group with other nodes, simple improvements to the solution quality may be found. It should be mentioned that singletons are not disallowed in the final child solutions. If a node is found not to be compatible with the existing groups, it will be left a singleton. The existing singletons are simply marked for reassignment at this stage to see whether it is possible to reassign them to a fuller group. Once all infeasible assignments and all singleton nodes have been marked, an attempt is made to reassign these nodes to groups. An array holds all the nodes that have been identified for reassignment as a result of the two checks above. An attempt is then made to sequentially reassign each node from this array to one of the remaining groups in the child. If the node can be added to a group without creating an infeasible solution, then the assignment is made. That is, the node is added to the group and removed from the reassignment array. Otherwise, the reassignment of that node to the next group is considered. If all the groups have been checked and no feasible reassignment exists, the node forms a group by itself. This process then iterates for the next node in the array. This process creates singleton groups, but allows for the possibility that nodes may be added to those singleton groups as the reassignment array is traversed. 
Of course, if there are not any nodes left in the reassignment array that are similar to the node in the singleton, the singleton will remain.
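The repair operator described above can be sketched as three passes over a child chromosome. This is an illustrative reading for β = 1 only, not the authors' implementation. The function name, the processing order of groups and nodes, and the first-fit choice of target group are assumptions, and the check tests both directions of the adjacency matrix because the matrix as printed in Fig. 1 is not perfectly symmetric.

```python
# Adjacency matrix transcribed from Fig. 1 (rows and columns in order a..h).
X = [
    [1, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 1],
]


def repair(x, assignment):
    """Return a feasible (beta = 1) relabeling of a child chromosome."""
    groups = {}
    for node, label in enumerate(assignment):
        groups.setdefault(label, []).append(node)

    def fits(node, members):
        # The node must be adjacent to every member, in both directions.
        return all(x[node][m] == 1 and x[m][node] == 1 for m in members)

    pending = []  # the reassignment array
    # Pass 1: remove nodes that conflict with a current member of their group.
    for members in groups.values():
        for node in list(members):
            others = [m for m in members if m != node]
            if not fits(node, others):
                members.remove(node)
                pending.append(node)
    # Pass 2: mark singleton groups for reassignment (drops empty groups too).
    for label in list(groups):
        if len(groups[label]) <= 1:
            pending.extend(groups.pop(label))
    # Pass 3: first-fit reassignment; a node that fits nowhere starts a group.
    blocks = list(groups.values())
    for node in pending:
        for members in blocks:
            if fits(node, members):
                members.append(node)
                break
        else:
            blocks.append([node])
    repaired = [0] * len(assignment)
    for label, members in enumerate(blocks, start=1):
        for node in members:
            repaired[node] = label
    return repaired


child = [1, 4, 1, 2, 2, 3, 3, 4]   # final child from the crossover example
print(repair(X, child))            # [1, 1, 1, 2, 2, 3, 3, 4]
```

In this run node b is removed from its group because it is not adjacent to h, h is then marked as a singleton, b joins the group {a, c}, and h remains a singleton because no existing group is compatible with it.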
This operator corrects the infeasibilities and performs a simple improvement by attempting to create larger feasible groups. To accommodate a β of less than 1, the routine could be easily modified to tolerate a percentage of 0’s in the block.
V. COMPUTATIONAL RESULTS AND DISCUSSION
For computational testing, a set of problems presented in [5] and [14] was used. These problems represent a number of different domains (see [14] for a complete description of the domain of each problem). Problem 1 was proposed by Jessop [9] in a study examining competitiveness in the English soccer leagues. Problem 2, from [4], looks at similarity between MBA students. Problem 3 [8] results from a study of dwellings. Problems 4–7 are all related to airport performance assessment [27]. Problems 8 and 9 are performance evaluation problems from MBA program data [14]. Problem 10 comes from a partitioning design problem for an Indian village [14]. Problem 11 comes from the relationship between patterns of advice in software development [14]. Problem 12 is a social network analysis problem dealing with relationships among a company’s board of directors [14]. Problem 13 is also a problem from MBA program data, though from a different dataset than Problems 8 and 9 [14]. In Tables I, II, and III, the problems are numbered and given a name corresponding to their application as discussed above. The number of nodes, which we will use to refer to the size of the problem, is also given in the tables. The test instances range in size from 20 to 500 nodes. Also included in the tables are the number of edges in the graph and the density of the problem. An edge is a connection between two nodes in a graph or a “1” in the adjacency matrix as described in Section II-C. The density of the problem is an indication of how many “1”s are in the adjacency matrix (or the number of
edges relative to the number of possible edges, which is determined by the number of nodes in the problem).

The algorithm developed in this paper (BM-GGA) was written in Visual Basic .NET using the Visual Studio 2005 compilers. All testing was done on a laptop computer with a 2.40-GHz Intel Core 2 Duo CPU running Windows Vista. The GGA was allowed to iterate for only 20 generations, and the population size was set to 100. A population size between 50 and 200 is common in the GA literature, as it normally provides a good balance between runtime and solution quality. Similarly, we ran the algorithm for 20 generations because testing showed that good-quality results were obtained in reasonable computational time with this parameter choice. The algorithm could easily be set to use a different stopping condition or to run for more than 20 iterations. Increasing the number of iterations or the population size may provide better solution quality. The parameters used in this paper were chosen based on common practice and limited computational testing.

The algorithm was run 10 times on each test instance. Table I presents the average results: the average HHI value over the 10 runs, the average number of blocks in the solution, and the average run time in seconds. The best and worst HHI values obtained from the 10 runs are also provided, along with the number of blocks in that best (or worst) solution and the time to obtain that solution.

In order to provide an idea of the quality of the solutions obtained by BM-GGA, we provide comparisons in Tables II and III. In Table II, we compare our solution quality to that of the heuristic proposed in [8]. The results for Jessop’s heuristic were obtained from [5]. In Table II, we present our best result against the values reported in [5], as only one solution value was reported in that study.

TABLE II
COMPARISON OF BM-GGA TO JESSOP’S BLOCKMODELING HEURISTIC
(BM-GGA columns give the worst, average, and best results out of 10 runs; the results for Jessop’s heuristic are from [5].)

                                            BM-GGA (worst)     BM-GGA (avg.)      BM-GGA (best)      Jessop [5]
No.  Problem    Nodes  Edges  Density (%)   Max. HHI  Blocks   Max. HHI  Blocks   Max. HHI  Blocks   Max. HHI  Blocks
1    Soccer        20     95           50         64       7       65.0     7.6         72      10         58       9
2    MBA           30    117           27        136      10      136.0    10.0        136      10        104      12
3    Dwellings     33    162           31        133      10      136.0    10.0        137      10         99      13
4    Airport1      40    282           36        276       8      279.6     8.0        280       8        254       9
5    Airport2      40    408           52        362       7      374.8     6.5        388       6        336       8
6    Airport3      47    393           36        353       9      359.2     9.1        367       9        321      10
7    Airport4      47    569           54        429       7      484.2     6.3        519       6        403       8
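As a small illustration of the density figure reported in the tables, the sketch below computes density as the percentage of possible edges that are present. It assumes an undirected, nonweighted network, so that the number of possible edges is n(n - 1)/2; both this formula and the function name `density_percent` are assumptions made for the example and are not taken from the paper's implementation. Under this assumption, the Soccer instance (20 nodes, 95 edges) gives the reported density of 50.

```python
def density_percent(num_nodes: int, num_edges: int) -> float:
    """Percentage of possible undirected edges that are present.

    Assumes a simple undirected graph, so the number of possible
    edges is n * (n - 1) / 2 (an assumption for this sketch).
    """
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return 100.0 * num_edges / possible_edges


# Soccer instance: 20 nodes and 95 edges
print(density_percent(20, 95))  # 50.0
```

Rounding the result to the nearest integer gives the form in which density appears in the tables.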
Results were reported for only the first seven problems using Jessop’s heuristic, and no computational times or hardware specifications were given. Therefore, this can only be a loose comparison in terms of solution quality, as no comparison of computational effort can be made. The best solution found by BM-GGA is better than the solution reported for Jessop’s heuristic for all seven problems. The average solution quality for BM-GGA is also better than the values reported for Jessop’s heuristic for all seven problems, and even BM-GGA’s worst solution from 10 runs is better than the values reported for Jessop’s heuristic. This result allows us to conclude that our algorithm is effective for this problem.

Table III shows our results against the values obtained by the ILP approaches in [5] and [14]. These results provide solutions that can be considered the best known solutions for this test set. Therefore, this comparison provides an idea of how close our algorithm is to results that can be obtained from an exact approach. The worst, average, and best solutions obtained by BM-GGA are compared against the ILP solutions. We provide the results for Proll’s original ILP approach [5] as well as for the improved approach [14]. The original ILP approach was only applied to the smaller problems 1–7. The improved approach was run on all 13 problems, but the authors were unable to obtain a solution for the last problem due to its size.

The second to last column of Table III gives the percent deviation of the average solution quality of BM-GGA from the best ILP solution. The last column of Table III gives the percent deviation of our best solution from the best ILP solution. Negative values represent an improved solution found by BM-GGA. The ILP approach provides better quality solutions for eight of the problems, although BM-GGA obtains solutions that are relatively close. BM-GGA provides better solutions to two instances. However, the ILP solution of 125 for problem 3 is listed as a best known solution in [5] but as optimal in [14]. Therefore, there is a discrepancy, and it is possible that the solution for problem 3 in [14] was misprinted. The solution obtained by BM-GGA for problem 3 was checked by hand and is a valid solution to the problem instance as presented in [8]. The approaches tie on one other problem when considering the average solution quality for BM-GGA and the best ILP solution, and tie on two others when considering the best solutions for both BM-GGA and the ILP methods. No comparison is possible for the largest instance.
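The percent deviation columns of Table III can be illustrated with a short sketch. The formula below is inferred from the sign convention described above (negative means BM-GGA is better) and from the tabulated values; it is not stated explicitly in the text, and the function name `percent_deviation` is hypothetical.

```python
def percent_deviation(ilp_best: float, gga_value: float) -> float:
    """Percent deviation of a BM-GGA HHI value from the best ILP HHI.

    HHI is maximized, so a negative result means BM-GGA found the
    better (larger) value. The formula is an assumption inferred
    from the reported table values.
    """
    return 100.0 * (ilp_best - gga_value) / ilp_best


# Soccer: best ILP HHI 78, BM-GGA average 65.0 and best 72
print(round(percent_deviation(78, 65.0), 3))  # 16.667
print(round(percent_deviation(78, 72), 3))    # 7.692
# Dwellings: ILP 125, BM-GGA best 137 (BM-GGA is better)
print(round(percent_deviation(125, 137), 3))  # -9.6
```

A value of zero corresponds to a tie between BM-GGA and the best ILP solution.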
TABLE III
COMPARISON OF BM-GGA SOLUTIONS TO ILP SOLUTION
(ILP solutions are from [5] and [14]. BM-GGA columns give the worst, average, and best results out of 10 runs. The deviation columns give the percent deviation of the average and of the best BM-GGA solution from the best ILP solution.)

                                                   ILP [5]              ILP [14]
No.  Problem           Nodes  Edges  Density (%)   Max. HHI  Time (s)   Max. HHI  Blocks  Time (s)
1    Soccer               20     95           50         78      7.16         78                 0
2    MBA                  30    117           27        136      6.79        136                 0
3    Dwellings            33    162           31        125       900        125                 0
4    Airport1             40    282           36        280     35.75        280                 1
5    Airport2             40    408           52        390    152.87        398                22
6    Airport3             47    393           36        363    396.69        369                 3
7    Airport4             47    569           54        519    135.34        527               231
8    FTMBA1              100   2332           24                            1084      15       317
9    FTMBA2              100   3496           35                            1444      11       175
10   Indian Village      141   2806           14                             593      40         0
11   Pattern Language    253   3678            6                             797      93         0
12   FTSE350             309   1076            1                             825     136         5
13   WUAR                500  39418           16                               –       –         –

      BM-GGA (worst)               BM-GGA (avg.)                BM-GGA (best)                Dev. from best ILP (%)
No.   Max. HHI  Blocks  Time (s)   Max. HHI  Blocks  Time (s)   Max. HHI  Blocks  Time (s)        Avg.       Best
1           64       7    0.1248       65.0     7.6    0.1154         72      10    0.1092      16.667      7.692
2          136      10    0.2184      136.0    10.0    0.1966        136      10    0.1872       0.000      0.000
3          133      10    0.2496      136.0    10.0    0.2356        137      10    0.2184      −8.800     −9.600
4          276       8    0.1872      279.6     8.0    0.1950        280       8    0.1872       0.143      0.000
5          362       7    0.1872      374.8     6.5    0.1763        388       6    0.1716       5.829      2.513
6          353       9    0.2340      359.2     9.1    0.2496        367       9    0.2028       2.656      0.542
7          429       7    0.2184      484.2     6.3    0.2106        519       6    0.2028       8.121      1.518
8          994      16    0.9984     1027.0    15.6    0.9703       1058      16    0.9984       5.258      2.399
9         1576       9    0.5616     1647.0     8.1    0.5429       1718       8    0.5616     −14.058    −18.975
10         491      48    7.2228      505.4    45.8    7.0731        519      45    6.6768      14.772     12.479
11         761      97  121.9764      766.4    96.8  121.2904        773      96  125.0496       3.839      3.011
12         771     146  217.0212      778.2   146.0  220.6264        787     145  226.2156       5.673      4.606
13       17514      37   27.0972    18454.0    38.4   26.4454      19078      37   26.8160           –          –

Overall, the results illustrate that our algorithm provides quite reasonable results quickly. Table III also provides a view of the difficulty of the problem. For the largest problem instance, the ILP approach was not able to obtain a solution, whereas the heuristic approach provided a solution relatively quickly. The hardware used for the different algorithms was different, so a direct comparison of time is not possible. It can be seen from Table I that BM-GGA runs relatively quickly. All solutions were obtained in under 4 min, which indicates the number of iterations could
also be increased without incurring unreasonable run times. It is interesting to note that the sparse matrices presented the longest runtimes.

VI. CONCLUSION

This paper presented a grouping genetic algorithm for the blockmodel problem (BM-GGA). The blockmodel problem attempts to find a small number of large blocks of highly connected nodes. The practical applications of this technique for analyzing structure in networks are wide-ranging. Our algorithm was demonstrated to be effective on this problem. It produced high-quality solutions compared to the previous heuristic approach from the literature. In comparison to the exact approaches, the algorithm was also shown to provide high-quality results. The runtimes of the algorithm were very reasonable.

Future research could include applying the algorithm to obtain insight into an application in practice, such as exploring its use in information retrieval, social networking, or threat assessment. Improvements to the algorithm could also be explored, including incorporating an ILP solution method into the improvement routine for the heuristic and exploring the use and design of other GGA operators.

REFERENCES

[1] J. Scott, Social Network Analysis. London, U.K.: Sage, 2000.
[2] H. C. White, S. A. Boorman, and R. L. Breiger, “Social structure from multiple networks, I. Blockmodels of roles and positions,” Amer. J. Sociol., vol. 81, pp. 730–737, 1976.
[3] A. Jessop, “Multiple attribute probabilistic assessment of the performance of some airlines,” in Multiple Criteria Decision Making in the New Millennium, New York: Springer, 2001, pp. 417–426.
[4] A. Jessop, “Blockmodels with maximum concentration,” Eur. J. Oper. Res., vol. 148, no. 1, pp. 56–64, 2003.
[5] L. Proll, “ILP approach to the blockmodel problem,” Eur. J. Oper. Res., vol. 177, no. 2, pp. 840–850, 2007.
[6] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[7] M. E. J. Newman, “Detecting community structure in networks,” Eur. Phys. J. B, vol. 38, pp. 321–330, 2004.
[8] A. Jessop, “Exploring structure: A blockmodel approach,” Civ. Eng. Environ. Syst., vol. 19, no. 4, pp. 263–284, 2002.
[9] A. Jessop, “A measure of competitiveness in leagues: A network approach,” J. Oper. Res. Soc., vol. 57, no. 12, pp. 1425–1434, 2006.
[10] C. Alexander, Notes on the Synthesis of Form. Cambridge, MA: Harvard Univ. Press, 1964.
[11] D. G. Elms, “From structure to a tree,” Civ. Eng. Environ. Syst., vol. 1, pp. 95–106, 1983.
[12] O. C. Herfindahl, “Concentration in the U.S. steel industry,” Ph.D. dissertation, Columbia Univ., New York, 1950.
[13] A. O. Hirschman, “The paternity of an index,” Amer. Econ. Rev., vol. 54, pp. 761–762, 1964.
[14] A. Jessop, L. Proll, and B. M. Smith. (2007). Optimal cliques: Applications and solutions. Univ. Leeds, Leeds, U.K. [Online]. Available: http://www.comp.leeds.ac.uk/research/pubs/reports/2007/2007_03.pdf
[15] R. L. Breiger, S. A. Boorman, and P. Arabie, “An algorithm for clustering relational data with applications to social network analysis and comparison to multidimensional scaling,” J. Math. Psychol., vol. 12, pp. 328–383, 1975.
[16] Project in Structural Analysis, “STRUCTURE: A computer program providing basic data for the network analysis of empirical positions in a system of actors,” in Computer Program 1, Berkeley, CA: Univ. California, Survey Res. Center, 1981.
[17] R. S. Burt, “Positions in networks,” Soc. Forc., vol. 55, pp. 93–122, 1976.
[18] P. Doreian, V. Batagelj, and A. Ferligoj, Generalized Blockmodeling. Cambridge, U.K.: Cambridge Univ. Press, 1994.
[19] R. S. Burt, “Models of network structure,” Ann. Rev. Soc., vol. 6, pp. 79–141, 1990.
[20] G. H. Heil and H. C. White, “An algorithm for constructing homomorphisms of multiple graphs,” Dept. Sociology, Harvard Univ., Cambridge, MA, 1974.
[21] G. H. Heil and H. C. White, “An algorithm for finding simultaneous homomorphic correspondences between graphs and their image graphs,” Behav. Sci., vol. 21, pp. 26–35, 1976.
[22] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications in Biology, Control, and Artificial Intelligence. Ann Arbor, MI: Univ. Michigan Press, 1975.
[23] E. Falkenauer, “The grouping genetic algorithms widening the scope of the GAs,” JORBEL Belgian J. Oper. Res., Stat. Comput. Sci., vol. 33, no. 1–2, pp. 79–102, 1992.
[24] E. Falkenauer, Genetic Algorithms for Grouping Problems. New York: Wiley, 1998.
[25] K. DeJong, “An analysis of the behavior of a class of genetic adaptive systems,” Ph.D. dissertation, Univ. Michigan, Ann Arbor, MI, 1975.
[26] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[27] A. Jessop, “A multiattribute assessment of airport performance,” presented at 25th Eur. Working Group Financial Modelling, 1999.
Tabitha James received the BBA and Ph.D. degrees in business administration with a major in management information systems and a minor in productions and operations management from the University of Mississippi. She is currently an Associate Professor in the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg. Her research interests are in the areas of combinatorial optimization, heuristics, and parallel computing.
Evelyn Brown received the B.S. degree in mathematics from Furman University, Greenville, SC, the M.S. degree in operations research from North Carolina State University, Raleigh, and the Ph.D. degree in systems engineering from the University of Virginia, Charlottesville. She currently works as an Associate Professor in the Department of Engineering, College of Technology and Computer Science, East Carolina University, Greenville. Her research is mainly in applications of genetic algorithms, and her work has been published in journals such as the International Journal of Production Research, Computers and Industrial Engineering, OMEGA–The International Journal of Management Science, and Engineering Applications of Artificial Intelligence. Dr. Brown is a Member of the American Society of Engineering Education, the Institute of Industrial Engineers, the International Council on Systems Engineering, and the Society of Women Engineers.
Cliff T. Ragsdale received the B.A. degree in psychology and the MBA degree from the University of Central Florida, Orlando, and the Ph.D. degree in management science and information technology from the University of Georgia, Athens. He is currently the Bank of America Professor in the Department of Business Information Technology, Pamplin College of Business, Virginia Polytechnic Institute and State University, Blacksburg. He has published more than 40 research articles and is the author of the textbook Spreadsheet Modeling and Decision Analysis (South-Western College Publishing, 2007). His research interests center on the use of artificial intelligence and quantitative modeling techniques to solve complex business problems.