Computational & Mathematical Organization Theory 7, 275–285, 2001. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.
A Faster Katz Status Score Algorithm∗

KURT C. FOSTER AND STEPHEN Q. MUTH
Private Consultants, Colorado Springs, CO, USA
email: [email protected]

JOHN J. POTTERAT
El Paso County Department of Health, Colorado Springs, CO, USA
email: [email protected]

RICHARD B. ROTHENBERG
Emory University School of Medicine, Atlanta, GA, USA
email: [email protected]
Abstract

A new graph-theoretical algorithm to calculate Katz status scores reduces time complexity from $O(n^3)$ to $O(n + m)$. Randomly-generated graphs as well as data from a large empiric study are used to test the performance of two commercial network analysis packages (GRADAP and UCINET V), compared to the performance achieved by the authors’ algorithm, implemented in Visual Basic.

Keywords: graph theory, centrality, rank prestige, influence, Katz status
1. Introduction
A central problem in graph theory is quantification of actor importance in a network. Several centrality measures are available to estimate nodal prominence, such as betweenness (Freeman, 1978), information centrality (Stephenson and Zelen, 1989), and status scores based either on eigenvectors (Bonacich, 1972) or degree (Katz, 1953). Together, they offer complementary views of network structure (Friedkin, 1991). Most of these centrality measures require calculation of high-order powers of a matrix, of their inverses, or of shortest paths, all of which present inherent computational complexities.

Regrettably, when calculating such centrality scores for networks involving more than a few thousand actors, two currently available commercial software packages, GRADAP 2.10 (Inter-University Project Group, 1989) and UCINET 5.0.1.0 (Borgatti et al., 1999), are either slow or fail due to sample size, software errors, or hardware limitations.

Although the Katz index is simple in its formulation, its implementation in GRADAP and UCINET is computationally intensive. The former compulsorily finds the largest eigenvalue of the underlying sociomatrix (regardless of whether the statistic is requested in the CENTRALITY procedure [pp. 365–366]), while the latter, in providing the underlying influence matrix, uses inversion. Both approaches consume inordinate computer resources as the network size $N$ increases. In this presentation, we offer a shortcut based on well-known methods to substantially reduce time and space complexity in Katz status score computations.

∗ This work was supported by grant 1 R01 DA09928-01A1 from the National Institute on Drug Abuse, Bethesda, Maryland.

1.1. The Katz Index
The centrality index proposed by Katz was intended to rank actor influence as a function not only of the number of first-order “choosers” (contacts) of an actor, but also of the numbers of second-order “choosers” and so on, to the limit of each connected component in the graph. The Katz index, or status score, is given by formula (1), where $I$ is the $N \times N$ identity matrix, $u$ is an $N \times 1$ column vector with all entries equal to 1, and $s$ is the column vector containing the rank status scores for actors 1 to $N$.

$$ s = (I - bA)^{-1} bAu \tag{1} $$

As usual, $A$ is the binary $N \times N$ adjacency matrix (with all diagonal entries equal to 0), and $b$ is a positive scalar, Katz’s “attenuation factor”, which is assumed to lie in the interval (0, 1).

1.2. The Problem
A traditional approach to solving formula (1) would consist of evaluating the matrix inverse. LU decomposition (Forsythe et al., 1967) and Gaussian reduction are frequently used to this end, while modest increases in efficiency over these methods are sometimes achieved by implementing matrix-inversion algorithms which take advantage of symmetry (Nash, 1990; Kučera, 1990). The overall magnitude of these computing tasks, however, is roughly proportional to the cube of the number of nodes in the network, making such analyses extremely time-consuming or impossible on today’s desktop computers as network size increases beyond a few thousand nodes. Such $O(n^3)$ calculations impede many centrality computations, but the shortcut we present reduces Katz calculations to time complexity $O(n + m)$, where $m$ is the number of network connections.

2. Methods
Empiric and random network data-sets are used to compare the computational speed of GRADAP and UCINET versus that of our algorithm, implemented in Visual Basic 6.0 Professional Edition (Microsoft Corporation, 1995). All calculations are performed on a 200 MHz dual-Pentium personal computer running Windows NT 4.0 workstation (Microsoft Corporation, 1996) with 64 megabytes of random access memory and an 8 gigabyte hard drive. One gigabyte of hard drive space was allocated by Windows NT for virtual memory. The empiric graphs are derived from a study of heterosexuals at risk for acquisition of HIV (Woodhouse and Rothenberg, 1994; Rothenberg and Woodhouse, 1995; Rothenberg and Potterat, 1995, 1998), and the random graphs are generated by a program written in Visual Basic. All graphs are prepared for importation into GRADAP and UCINET using PC-SAS release 6.12 (SAS Institute Inc., 1985).

2.1. Network Construction
Because sparse networks are the rule in most social network analyses, we focus on such networks. The empiric study consists of five interview waves, demonstrating 31,147 undirected (out of over 38 million possible) connections among 8,759 persons. One subset, a single component, contains 3,718 connections among 3,018 persons (341 respondents and their 2,677 socio-sexual partners), while the last subset contains 731 connections among 341 persons (respondents only).

The random graphs are created from $N$ nodes, each with a 0.5 probability of selecting either one or two other nodes, without replacement. This approach guarantees generation of sparse networks, each having approximately $1.5 \times N$ out of $N \times (N - 1)$ possible directed connections. (Because the empiric data which underlie the risk networks lack directional information, all graphs were symmetrized before analysis.)

2.2. The Shortcut
One could use standard linear-equations software to compute the vector $s$ in equation (1). However, inverting an $N \times N$ matrix and multiplying two $N \times N$ matrices are both computationally laborious. By taking advantage of the original formulation of the Katz score as a geometric series (Katz, 1953; Leik, 1975), using iterative methods (Rosenthal, 1987; Acton, 1990), and avoiding a wasteful matrix representation in intermediate steps, we show how the Katz prestige index can be obtained with time complexity $O(n + m)$. The inverse in formula (1) can be expressed as the following geometric series.

$$ (I - bA)^{-1} = I + bA + (bA)^2 + (bA)^3 + \cdots \tag{2} $$
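To see what the series buys us, one can substitute (2) into formula (1). The following worked step is a sketch added for illustration, using only the symbols already defined and assuming the series converges:

$$ s = (I - bA)^{-1} bAu = \left( I + bA + (bA)^2 + \cdots \right) bAu = \sum_{k=1}^{\infty} b^k A^k u . $$

Entry $i$ of $A^k u$ counts the walks of length $k$ that start at actor $i$, so each further term adds contacts one step farther away, discounted by $b^k$.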
Provided that this series converges, we can then quickly compute the vector of Katz scores, $s$, by using a simple iteration formula, producing a sequence of estimates $s_0, s_1, s_2, \ldots$ as follows:

$$ s_0 = bAu, \qquad s_{k+1} = bA s_k + s_0, \qquad \text{where } k \ge 0. $$
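The recursion above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Visual Basic program: the function name `katz_scores`, the zero-based actor numbering, and the stopping tolerance `tol` are assumptions made for the example. Each pass touches every dyad once and every actor a constant number of times, which is where the $O(n + m)$ cost per iteration comes from.

```python
def katz_scores(edges, n, b, tol=1e-12, max_iter=128):
    """Iterate s_{k+1} = b*A*s_k + s_0 using only the list of dyads.

    edges : list of (chooser, chosen) pairs, actors numbered 0..n-1
            (hypothetical input format chosen for this sketch)
    b     : attenuation factor, assumed small enough for convergence
    """
    # s_0 = b*A*u: b times each actor's row sum (number of dyads it starts)
    s0 = [0.0] * n
    for i, _ in edges:
        s0[i] += b

    s = s0[:]
    for _ in range(max_iter):
        # One pass over the m dyads adds b*A*s to s_0 without ever forming A
        nxt = s0[:]
        for i, j in edges:
            nxt[i] += b * s[j]
        # Stop once successive iterates agree to within tol
        if max(abs(x - y) for x, y in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    return s


# Three-actor graph of Appendix II, renumbered from zero:
# 0 chooses 1 and 2, 1 chooses 2, 2 chooses 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
scores = katz_scores(edges, n=3, b=1.0 / 3.0)
print([round(x, 3) for x in scores])  # [1.087, 0.565, 0.696]
```

Storing the graph as a plain list of pairs keeps memory proportional to $n + m$, in contrast to the $n^2$ cells of an adjacency matrix.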
If only the actor status scores are desired (rather than the entire influence matrix), matrix multiplication can be avoided by working directly from the list of adjacencies (dyads), rather than the adjacency matrix itself. Our algorithm (see Appendix I) can be terminated either when successive iterates are close enough to being equal or after a reasonably large number of iterations. Appendix II provides a three-node example to further illustrate both the matrix- and dyad-based methods for obtaining actor Katz scores.

2.3. Convergence of the Series
Determining exactly how large the attenuation factor $b$ can become in (2) is laborious; the answer is $b < 1/\lambda$, where $\lambda$ is the largest absolute value of any eigenvalue of $A$. Fortunately, $A$ has zeros on the diagonal and only non-negative entries, so, as pointed out by Katz (1953), Gershgorin’s Circle Theorem (1931) gives the simple sufficient condition $b < 1/D$, where $D$ is the largest row (or column) sum of $A$, i.e., the largest entry of $Au$. We use the generally-accepted default attenuation factor $1/(D + 1)$ in all calculations.

2.4. Hoede Score Computation
A related status score developed by Hoede (1978; unpublished in the formal literature, but documented and implemented in the GRADAP manual) uses the same formulation as the Katz score, except that a weighted matrix $W$ is substituted for $bA$.

$$ s = (I - W)^{-1} Wu \tag{3} $$
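By analogy with the Katz iteration for formula (1), formula (3) admits the same treatment with $W$ in place of $bA$. The following is only a sketch, assuming the power series for $(I - W)^{-1}$ converges (for instance, when the largest absolute row sum of $W$ is below 1):

$$ (I - W)^{-1} = I + W + W^2 + \cdots, \qquad s_0 = Wu, \qquad s_{k+1} = W s_k + s_0 \quad (k \ge 0). $$

Each step again needs only a matrix-vector product, so a sparse list of weighted dyads would suffice in place of the full matrix.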
The weighted matrix is constructed using pre-defined “forces” $p$ for each actor, where each force is an indicator of each actor’s power base. Since we had no a priori scheme either for determining actor forces or for constructing the weighting method, Hoede status computation remains a challenge for future investigation. It should be noted, however, that an algorithm similar to the one we present can be employed to calculate Hoede status.

3. Results
The size of the empiric data-sets ($N = 341$, $N = 3018$, and $N = 8759$), as expected for $O(n^3)$ calculations, exceeded our computer’s resources using either GRADAP or UCINET (excepting UCINET’s success with the smallest data-set). Using randomly-generated networks, GRADAP reached its limit at an $N$ of 160, while UCINET failed soon after $N$ exceeded 2000. By contrast, implementation of our method permitted rapid calculation of Katz status for networks exceeding 100,000 nodes. Figure 1 details GRADAP’s computation times as a function of network size, while figure 2 records times for UCINET, and figure 3, for our Visual Basic program. Note that, with GRADAP and UCINET, the time curve is clearly nonlinear, rising steeply with network size, while our shortcut solution generates an essentially linear time curve. The small-scale fluctuations in the time curve are due to differences in the number of iterations required for convergence between the test graphs; convergence for all graphs was achieved in 30 iterations or fewer (mean: 24.4, std: 1.7, range: 21–30). Our Visual Basic program provided Katz status indices on the three empiric networks in fractions of seconds: 0.016, 0.078, and 0.500 seconds, from the smallest to the largest graph.

Figure 1. GRADAP’s performance.

Figure 2. UCINET’s performance.

Figure 3. Authors’ method performance.

4. Discussion
We present a computationally simple method for obtaining Katz status scores, thereby enabling such analysis for extremely large networks using currently available personal computers. The increase in computational efficiency offered is achieved by computing only the matrix product $bAv$, where $v$ is either $u$ or one of the $s_k$; i.e., $v$ is only an $N \times 1$ column vector. Moreover, by using a linked-list representation for $A$, we take full advantage of matrix sparseness. No attempt is made to compute the $N \times N$ matrix inverse, or the $N \times N$ matrix product that premultiplies $u$ in formula (1).

At first glance, it may seem that the authors’ method is merely a restatement of prior work in the field. For example, general procedures have been described with which eigenvector calculations can be streamlined (e.g., Richards and Seary, 2000; Acton, 1990). The methods appear similar merely because the same tools are described, i.e., power series iterative approximations. The key insight we communicate in this presentation is that only two of the eigenvector-based centrality measures, the Katz and the (unpublished) Hoede scores, have power series representations that obviate the need for advanced eigenvalue estimation in the first place. The authors discovered no such shortcut in any other centrality measure (e.g., Bonacich eigenvector, Hubbell, information, betweenness), due to their more complex formulation. It is interesting that Katz’s original geometric series provided the least intensive computational method, yet the mathematically elegant matrix inverse formula became the focus of previous programming efforts.¹

Avoiding matrix-based calculations resulted in dramatic improvement of calculation times for Katz status scores. Although an ideal software tool for unattended analysis of large networks, GRADAP suffers from a memory shortage when a bug causes unwanted computation of the largest eigenvalue of the adjacency matrix. The software package is nonetheless an impressive achievement, as it is capable of analyzing networks of up to 6,000 nodes and 60,000 connections under the DOS 640K memory limitation. Similarly, UCINET V offers impressive analytic possibilities for modest-sized networks, although its chief limitation as well as its strength is its exclusive use of matrix representations in all calculations.

4.1. Limitations
While our improved algorithm makes Katz status calculations possible for extremely large networks, it should be pointed out that the measure is of limited utility. The Katz score is a refinement of degree centrality, where distant actors are accounted for by successive iterates in the geometric series. Indeed, the Katz score is most often highly correlated with degree, thus providing a local measure of centrality (depending more on a node’s immediate environment than on its overall network position). Thus, while Katz scores do not provide the global view afforded by shortest-path or eigenvector centrality measures, they still provide a means of distinguishing between actors with the same degree. This can be useful when incomplete network ascertainment distorts the overall picture, as happens more often than not when collecting large amounts of empiric data. In such situations, for example, one could use the Katz score to distinguish between network termini (those nodes with degree 1) based on their proximity to more highly-connected actors.

Lastly, it is important to note that the rank of Katz status scores can depend on one’s choice of attenuation factor $b$. It has been noted that as $b$ approaches $1/\lambda$, the contribution of the higher-order (farther away) contacts increases. For some networks this observation can have unexpected consequences. For example, if actor $i$ has fewer contacts than actor $j$, then $i$’s Katz score will be less than $j$’s if $b$ is very small. But if $i$ has far more second-order contacts than $j$, then $i$’s Katz score can become greater than $j$’s when $b$ increases slightly.

5. Conclusion
We frequently found ourselves at the limits of the available network analysis software when analyzing our large empiric networks. The operational benefits conferred by a more complete understanding of large network structure were frequently outweighed by computational time and space constraints. However, by implementing a simple shortcut, we are now able to estimate the relative importance of actors in our large networks. Efforts by programmers to optimize network analysis programs, recently highlighted by Brandes’ (2000) achievement in shortest-path algorithms, will continue to expand the analytical scope to include ever larger network data-sets.

Appendix I
The Katz status score algorithm implemented by the authors

Given:
    Network of directed relations, as adjacency matrix A, order n, with m nonzero elements.

Variables:
    iter = iteration of calculation; done = flag for algorithm termination
    i = index variable; crrw = current row; sum = accumulator variable

Vectors:
    id(n), actor numbers
    dyad1(m) and dyad2(m), linked list of id(n)
    init(n), initial estimate of status score
    pre(n) and post(n), before and after intermediate calculations of the status score

Convert:
    unique actor identifiers to sequence of numbers in id(n), from 1 to n;
    matrix A to two linked vectors dyad1(m) and dyad2(m) representing the
    choosing and chosen actors in id(n), respectively, in row-major order.

Precompute:
    nodal outdegree → init(n)
    default attenuation factor b as 1/(maximum in or outdegree + 1)
    init(n) ← b × init(n)
    post(n) ← init(n)

Kernel:
    iter ← 0; done ← “false”
    Loop until (done or (iter > 127))
        For each pre(n) do pre(n) ← post(n)
        sum ← 0; crrw ← 1
        Loop i from 1 to m
            If dyad1(i) = crrw do
                sum ← sum + pre(dyad2(i))
            else do
                sum ← sum × b
                post(crrw) ← sum                    reference point A
                crrw ← dyad1(i)                     (see Appendix II)
                sum ← pre(dyad2(i))
            end
        end
        sum ← sum × b                               reference point B
        post(crrw) ← sum
        iter ← iter + 1
        done ← “True”
        Loop i from 1 to n
            post(i) ← post(i) + init(i)
            done ← done and (post(i) = pre(i))
        end
    end                                             reference point C
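For readers who prefer running code, the kernel above can be transcribed almost line for line into Python. This is a sketch under stated assumptions, not the authors' Visual Basic source: the function name `katz_kernel` and the `max_iter` parameter are hypothetical, and exact `Fraction` arithmetic is used (instead of floating point) so that intermediate values can be compared with the fractions tabulated in Appendix II.

```python
from fractions import Fraction


def katz_kernel(dyad1, dyad2, n, max_iter=128):
    """Transcription of the Appendix I kernel (sketch).

    dyad1, dyad2 : choosing and chosen actors, numbered 1..n, with the
                   dyads sorted by chooser (row-major order)
    max_iter     : hypothetical cap; 128 mirrors the test (iter > 127)
    Returns (scores for actors 1..n, number of iterations performed).
    """
    m = len(dyad1)
    outdeg = [0] * (n + 1)  # index 0 is unused so actors stay 1-based
    indeg = [0] * (n + 1)
    for chooser, chosen in zip(dyad1, dyad2):
        outdeg[chooser] += 1
        indeg[chosen] += 1

    # Default attenuation factor: 1 / (maximum in- or outdegree + 1)
    b = Fraction(1, max(max(outdeg), max(indeg)) + 1)
    init = [b * d for d in outdeg]
    post = init[:]

    it, done = 0, False
    while not (done or it >= max_iter):
        pre = post[:]
        total, crrw = Fraction(0), 1
        for i in range(m):
            if dyad1[i] == crrw:
                total += pre[dyad2[i]]
            else:
                post[crrw] = total * b  # reference point A
                crrw = dyad1[i]
                total = pre[dyad2[i]]
        post[crrw] = total * b          # reference point B
        it += 1
        done = True
        for i in range(1, n + 1):
            post[i] += init[i]
            done = done and post[i] == pre[i]
        # reference point C
    return post[1:], it


# Three-actor example of Appendix II, stopped after three iterations
scores, iters = katz_kernel([1, 1, 2, 3], [2, 3, 3, 1], n=3, max_iter=3)
print(scores)  # 85/81, 44/81 and 2/3
```

With exact fractions the equality test `post[i] == pre[i]` is not met on this example, so the loop runs until the iteration cap; in floating point the test succeeds once the iterates stop changing at machine precision.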
Appendix II
Examples of matrix-based (slower) and dyad-based (faster) Katz score computations

Matrix-Based Calculation

Graph, actors, and adjacency matrix A (with row sums):

    Actor        1    2    3    Row sum
    Frank: 1     0    1    1    2
    Jerry: 2     0    0    1    1
    Jim: 3       1    0    0    1

    Column sums (in degrees):
                 1    1    2

Largest row/column sum: 2
Default attenuation factor b: 1/3

$$ I - bA = \begin{pmatrix} 1 & -1/3 & -1/3 \\ 0 & 1 & -1/3 \\ -1/3 & 0 & 1 \end{pmatrix} $$

Invert matrix (rate-limiting step):

$$ (I - bA)^{-1} = \begin{pmatrix} 27/23 & 9/23 & 12/23 \\ 3/23 & 24/23 & 9/23 \\ 9/23 & 3/23 & 27/23 \end{pmatrix} $$

Multiply by rowsum vector $Au$ (outdegrees):

$$ (I - bA)^{-1} Au = \begin{pmatrix} 27/23 & 9/23 & 12/23 \\ 3/23 & 24/23 & 9/23 \\ 9/23 & 3/23 & 27/23 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 75/23 \\ 39/23 \\ 48/23 \end{pmatrix} $$

Multiply by attenuation factor to obtain $(I - bA)^{-1} bAu$:

    75/23 × 1/3 = 25/23 = 1.087 (actor 1, Frank)
    39/23 × 1/3 = 13/23 = 0.565 (actor 2, Jerry)
    48/23 × 1/3 = 16/23 = 0.696 (actor 3, Jim)
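The matrix-based route can be checked mechanically. The sketch below solves $(I - bA)s = bAu$ by Gauss-Jordan elimination in exact rational arithmetic instead of forming the inverse explicitly; the two are algebraically equivalent, and elimination has the same cubic cost. The helper name `solve` is an assumption made for this illustration.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with exact fractions.

    Assumes M is square and nonsingular (otherwise next() raises).
    """
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]  # augmented matrix
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Clear this column in every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


A = [[0, 1, 1],   # Frank chooses Jerry and Jim
     [0, 0, 1],   # Jerry chooses Jim
     [1, 0, 0]]   # Jim chooses Frank
n = len(A)

# D = largest row or column sum; default attenuation factor b = 1/(D + 1)
D = max(max(sum(row) for row in A), max(sum(col) for col in zip(*A)))
b = Fraction(1, D + 1)

I_minus_bA = [[Fraction(int(i == j)) - b * A[i][j] for j in range(n)]
              for i in range(n)]
bAu = [b * sum(row) for row in A]

s = solve(I_minus_bA, bAu)
print(s)  # exact values 25/23, 13/23, 16/23
```

The elimination loop performs on the order of $n^3$ operations, which is the cost the dyad-based method avoids.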
Dyad-Based Calculation

Actors are numbered as above, and dyads are represented as two vectors. A frequency distribution of Dyad1 yields actor outdegrees, which are used as the initial values of array init (init0). The largest in or outdegree is 2, yielding 1/(2+1) as the default attenuation factor b. Vector init0 is multiplied by b, the post vector is set equal to init, and the main loop (controlled by “iter” and “done”) is entered. The pre vector is set equal to the post vector, and the sum and crrw variables are initialized. The variable and array values are now set to the values depicted below.

    m    Dyad1(m)    Dyad2(m)
    1    1           2
    2    1           3
    3    2           3
    4    3           1

    actor        init0    init1    pre    post
    Frank (1)    2        2/3      2/3    2/3
    Jerry (2)    1        1/3      1/3    1/3
    Jim (3)      1        1/3      1/3    1/3

    iter = 0; done = false; sum = 0; crrw = 1
Convergence of the “pre” and “post” vectors is illustrated by the following table, which lists the values revealed at points A, B and C of the authors’ algorithm (Appendix I, kernel).

    Step  Iter  crrw  Sum        pre1   pre2   pre3   post1  post2  post3
    A1    0     1     2/9        2/3    1/3    1/3    2/9    1/3    1/3
    A2    0     2     1/9        2/3    1/3    1/3    2/9    1/9    1/3
    B     1     3     2/9        2/3    1/3    1/3    2/9    1/9    2/9
    C     1     3     2/9        2/3    1/3    1/3    8/9    4/9    5/9
    A1    1     1     1/3        8/9    4/9    5/9    1/3    4/9    5/9
    A2    1     2     5/27       8/9    4/9    5/9    1/3    5/27   5/9
    B     2     3     8/27       8/9    4/9    5/9    1/3    5/27   8/27
    C     2     3     8/27       8/9    4/9    5/9    1      14/27  17/27
    A1    2     1     31/81      1      14/27  17/27  31/81  14/27  17/27
    A2    2     2     17/81      1      14/27  17/27  31/81  17/81  17/27
    B     3     3     1/3        1      14/27  17/27  31/81  17/81  1/3
    C     3     3     1/3        1      14/27  17/27  85/81  44/81  2/3
    ...   ...   ...   ...        ...    ...    ...    ...    ...    ...
    C     8     3     0.3618351  1.086  0.564  0.695  1.086  0.565  0.695
    C     9     3     0.3621061  1.086  0.565  0.695  1.087  0.565  0.695
    C     10    3     0.3622246  1.087  0.565  0.695  1.087  0.565  0.696
    C     11    3     0.3622773  1.087  0.565  0.696  1.087  0.565  0.696
    C     12    3     0.3623005  1.087  0.565  0.696  1.087  0.565  0.696
    ...   ...   ...   ...        ...    ...    ...    ...    ...    ...
    C     21    3     0.3623188  1.087  0.565  0.696  1.087  0.565  0.696
The algorithm terminated (at the limit of single-precision arithmetic) on the 21st iteration, while the correct answer to the third decimal place was achieved in 11 iterations.
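A rough way to see why a few dozen iterations suffice is the following estimate, added here as a sketch and not part of the authors' analysis. Writing $s$ for the exact solution of formula (1) and $s_k$ for the iterates of section 2.2, the series (2) gives

$$ s - s_k = \sum_{j \ge k+2} (bA)^j u = (bA)^{k+1} s, \qquad \| s - s_k \|_\infty \le (bD)^{k+1} \| s \|_\infty = \left( \frac{D}{D+1} \right)^{k+1} \| s \|_\infty \quad \text{for } b = \frac{1}{D+1}. $$

For the three-node example $D = 2$, so the error shrinks by at least a factor of 2/3 per pass. The bound is conservative: the actual rate is governed by the largest eigenvalue of $bA$, which can be noticeably smaller than $bD$.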
Acknowledgments

The authors wish to thank Prof. Dr. Frans N. Stokman, University of Groningen Department of Sociology; Prof. Dr. Cornelius Hoede, Twente University of Technology Department of Applied Mathematics; Prof. Ulrik Brandes, Universität Konstanz; and William D. Richards & Andrew J. Seary, Simon Fraser University, for their generous assistance. They are responsible for none of our presentation’s shortcomings.
Note

1. The computer program MultiNet 3.0 (Richards and Seary, The Vancouver Network Analysis Team, http://www.sfu.ca/~richards/Multinet/Pages/multinet.htm) is currently being updated to employ the authors’ method for Katz score calculations (personal communication, July 2001).
References

Acton, F.S. (1990), Numerical Methods that Work. The Mathematical Association of America, Washington D.C., Vol. 8, pp. 211–217.
Bonacich, P. (1972), “Factoring and Weighting Approaches to Status Scores and Clique Identification,” Journal of Mathematical Sociology, 2, 113–120.
Borgatti, S.P., M.G. Everett and L.C. Freeman (1999), Ucinet 5.0 Version 1.00 for Windows: Software for Social Network Analysis. Natick: Analytic Technologies.
Brandes, U. (2000), “Faster Evaluation of Shortest-Path Based Centrality Indices,” Konstanzer Schriften in Mathematik und Informatik, 120.
Forsythe, G.E., M.A. Malcolm and C.B. Moler (1967), Computer Solution of Linear Algebraic Systems. Prentice-Hall, Inc., Englewood Cliffs, NJ, ch. 17.
Freeman, L.C. (1978), “Centrality in Social Networks: Conceptual Clarification,” Social Networks, 1, 215–239.
Friedkin, N.E. (1991), “Theoretical Foundations for Centrality Measures,” American Journal of Sociology, 96(6), 1478–1504.
Gershgorin, S.A. (1931), Bull Acad Sciences de l’URSS, Classe mathém., 7-e série, Leningrad, p. 749. (Cited in Kreyszig, E. (1972), Advanced Engineering Mathematics, 3rd ed., Section 18.12, Theorem 1. Wiley, New York.)
Hoede, C. (1978), “A New Status Score for Actors in a Social Network,” Twente University Department of Applied Mathematics (Memorandum no. 243).
Inter-University Project Group of the Universities of Amsterdam, Groningen, Nijmegen, and Twente (1989), GRADAP, Graph Definition and Analysis Package. ProGamma, Groningen, The Netherlands.
Katz, L. (1953), “A New Status Index Derived from Sociometric Analysis,” Psychometrika, 18, 39–43.
Kučera, L. (1990), Combinatorial Algorithms. IOP, Bristol, UK, 91–128 (in English).
Leik, R.K. and B.F. Meeker (1975), Mathematical Sociology. Prentice-Hall, Englewood Cliffs, NJ, chapter 5.
Microsoft Corporation (1995), Visual Basic 6.0 Professional Edition, http://www.microsoft.com/vbasic/.
Microsoft Corporation (1996), Microsoft Windows NT Workstation Version 4.0, document number 69396-0696.
Nash, J.C. (1990), Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, 2nd ed. IOP, Bristol, UK, 97–101 and 119–141.
Richards, W.D. and A.J. Seary (2000), “Eigen Analysis of Networks,” Journal of Social Structure, 1(2), (http://www.heinz.cmu.edu/project/INSNA/joss/ean.html).
Rosenthal, N., R. Karant, M. Ethier and M. Fingrutd (1987), “Centrality Analysis for Historians,” Historical Methods, 20(2), 53–62.
Rothenberg, R.B., D.E. Woodhouse, J.J. Potterat, S.Q. Muth, W.W. Darrow, A.S. Klovdahl (1995), “Social Networks in Disease Transmission: The Colorado Springs Study,” in R.H. Needle, S.G. Genser and R.T. Trotter II (eds.), Social Networks, Drug Abuse and HIV Transmission, National Institute on Drug Abuse Research Monograph no. 151 (NIH Publication no. 95-3889), pp. 3–19.
Rothenberg, R.B., J.J. Potterat, D.E. Woodhouse, W.W. Darrow, S.Q. Muth, A.S. Klovdahl (1995), “Choosing a Centrality Measure: Epidemiologic Correlates in the Colorado Springs Study of Social Networks,” Social Networks, 17, 273–297.
Rothenberg, R.B., J.J. Potterat, D.E. Woodhouse, S.Q. Muth, W.W. Darrow, A.S. Klovdahl (1998), “Social Network Dynamics and HIV Transmission,” AIDS, 12, 1529–1536.
SAS Institute Inc. (1985), SAS/IML User’s Guide for Personal Computers, Version 6 ed., Cary, NC.
Stephenson, K. and M. Zelen (1989), “Rethinking Centrality: Methods and Examples,” Social Networks, 11, 1–37.
Woodhouse, D.E., R.B. Rothenberg, J.J. Potterat, W.W. Darrow, S.Q. Muth, A.S. Klovdahl, H.P. Zimmerman, H.L. Rogers, T.S. Maldonado, J.B. Muth, J.U. Reynolds (1994), “Mapping a Social Network of Heterosexuals at High Risk of Human Immunodeficiency Virus Infection,” AIDS, 8, 1331–1336.
Kurt Foster, Ph.D., is currently an independent computer consultant in Colorado Springs. He obtained his Ph.D. in Mathematics (algebraic number theory) at the University of Illinois at Urbana-Champaign in 1987. Current interests include participation in online forums devoted to solving a wide variety of mathematical problems including those in linear algebra and graph theory.

Stephen Muth, BA, was formerly employed at the El Paso County Department of Health & Environment as the Information Systems Manager for the Sexually Transmitted Diseases/AIDS Programs under John Potterat. For thirteen years, he maintained data systems that monitored disease levels, performing social networks analyses integral to disease investigation. He is now an independent computer consultant.

John Potterat, BA, is retired from nearly thirty years of service as Director of the Sexually Transmitted Diseases/AIDS Programs at the El Paso County Department of Health & Environment in Colorado Springs. His vast contributions to the literature have advanced the science of prevention and control of sexually transmitted diseases, including HIV, internationally. He is currently President of the Colorado Springs Osteopathic Foundation.

Dr. Richard Rothenberg, MD, MPH, FACP, is a Professor at the Emory University School of Medicine, and Professor of Epidemiology at the Rollins School of Public Health at Emory University. His work focuses on the dynamics of transmission of infectious diseases, primarily HIV, STDs, and the blood borne illnesses, with particular emphasis on the effects of social and sexual networks on transmission. He is currently Editor-in-Chief of the Annals of Epidemiology.