
Dynamic Models of Affiliation and the Network Structure of Problem Solving in an Open Source Software Project

Organizational Research Methods 15(3) 385-412 © The Author(s) 2012 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/1094428111430541 http://orm.sagepub.com

Guido Conaldi1, Alessandro Lomi2, and Marco Tonellato2

Abstract
Two-mode networks are used to describe dual patterns of association between distinct social entities through their joint involvement in categories, activities, issues, and events. In empirical organizational research, the analysis of two-mode networks is typically accomplished either by (a) decomposition of the dual structure into its two unimodal components defined in terms of indirect relations between entities of the same kind or (b) direct statistical analysis of individual two-mode dyads. Both strategies are useful, but neither is fully satisfactory. In this article, the authors introduce newly developed stochastic actor-based models for two-mode networks that may be adopted to redress the limitations of current analytical strategies. The authors specify and estimate the model in the context of data they have collected on the dual association between software developers and software problems observed during a complete release cycle of an open source software project. The authors discuss the general methodological implications of the models for organizational research based on the empirical analysis of two-mode networks.

Keywords
two-mode networks, stochastic actor-based models, SIENA models, dynamic models for social networks, free/open source software (F/OSS) projects

The idea of duality is central to the study of organizations where the identity of units at one level is frequently defined in terms of patterns of association at a different (higher or lower) level (Breiger, 1974). The associations between individuals and tasks and between individuals and knowledge, for example, are typically considered the fundamental building blocks of organizations (Carley,

1 University of Greenwich, London, UK
2 University of Lugano, Switzerland

Corresponding Author: Guido Conaldi, Centre for Business Network Analysis, University of Greenwich (UK), Park Row, London SE10 9LS Email: [email protected]


Figure 1. Epiphany’s developer-bug associations observed during the release cycle March-September 2006. Note: Dark gray squares represent developers, light gray circles represent bugs. Network drawn using the Fruchterman-Reingold graph layout algorithm.

1991). These associations are ‘‘dual’’ in the sense that they connect social entities standing in a mutual constitutive relation (Breiger & Mohr, 2004). In empirical research, the mutual constitution of social entities across levels of organizational analysis frequently takes the form of a two-mode network (2MN). A 2MN records the association between two sets of objects where each set contains a different kind of social entity and no direct association is admitted between objects belonging to the same set (Wang, Sharpe, Robins, & Pattison, 2009). Common examples of social entities linked by 2MN include individuals and issues (Breiger, 2000), team members and knowledge (Carley, 1991), directors and corporate boards (Robins & Alexander, 2004), organizations and items in their political agendas (Mische & Pattison, 2000), organizational members and issues (Doreian, Batagelj, & Ferligoj, 2004), and individuals and groups (Breiger, 1974).

In the empirical part of this article, we analyze the 2MN linking problem solvers (represented as dark gray squares in Figure 1) and problems (represented as light gray circles in Figure 1) during one release cycle of a popular open source software product. In particular, we ask: How did the observed network configuration come about? How does it change over time? The objective of this article is to illustrate how some of the main methodological challenges posed by these questions may be addressed by a new class of stochastic dynamic actor-based models originally developed for the analysis of social networks (Snijders, van de Bunt, & Steglich, 2010).

Organizational research based on 2MN has followed two main analytical strategies. The first focuses on the single one-mode networks (so-called unimodal projections) logically entailed in the corresponding 2MN (Newman, Strogatz, & Watts, 2001).
In a unimodal network derived from a 2MN, two entities of the same class are connected if they share a common affiliation with at least one entity of the other class. For example, in a group-by-group unimodal network derived from a two-mode person-by-group network, groups are connected if they share at least one member. Dually, in the unimodal person-by-person network, individuals are connected if they share membership in at least one group (Breiger, 1974). This approach has the advantage that the unimodal networks may be


analyzed as (integer-valued) social networks. The main disadvantage is the loss of information that this transformation necessarily implies about the dual association between the different entities in the original 2MN structure (Borgatti & Everett, 1997; Newman, 2001; Wasserman & Faust, 1994). When affiliation data are cross-sectional, lattices and other algebraic models offer particularly useful solutions to the problem of representing different social entities jointly in a common space (Pattison & Breiger, 2002).

The second strategy is based on direct statistical analysis of (two-mode) dyads, which represent the smallest components of any kind of network structure. For example, the relation between individual directors and boards of directors may be examined by considering director-board dyads as units of analysis (Robins & Alexander, 2004). The advantage of this approach is that it is very general and allows direct examination of the mutual constitution of ‘‘cases’’ and ‘‘variables’’ across a variety of empirical settings (Breiger, 2009). The main disadvantage is that available statistical models typically used for analyzing dyadic data derived from 2MN are incapable of accounting for the complex system of local dependencies out of which networks (social or otherwise) emerge and evolve (Koskinen & Edling, in press). Recently developed exponential random graph models (ERGM) for bipartite graphs offer promising solutions to these problems in cross-sectional samples (Wang et al., 2009).

The main objective of this article is to illustrate one way in which these limitations may be alleviated in the analysis of longitudinal network data. The methodological proposal that we want to develop is based on the specification of recently derived dynamic stochastic actor-based models for 2MN (Koskinen & Edling, in press; Snijders, Lomi, & Torlò, 2011). The model is dynamic because it is based on specific assumptions about how networks change.
It is stochastic because connections between elements of the row set (‘‘individuals’’ in the application we present) and elements of the column set (‘‘problems’’ in the application we present) define states that are subject to change at random times. Finally, the model is actor-based because it assumes that observed network configurations are the consequence of decisions made by actors that are interdependent but capable of individual action (Macy & Willer, 2002). The aim of the model is to represent the evolution through time of the random set of edges linking individuals and problems through a continuous-time Markov chain that allows parametrization of transition probabilities under the assumption of exponentially distributed waiting times (Snijders, van de Bunt, et al., 2010). In the specific case of 2MN, this setup defines a dynamic model of affiliation. Because of the novelty of the model, empirical experience with stochastic actor-based models for 2MN is rather limited, and meaningful organizational applications are yet to be developed. We demonstrate the empirical value of our proposal in the context of data that we have collected on the network structure of organizational problem solving with a free/open source software (F/OSS) project followed throughout an entire release cycle. We use computational techniques to assess the empirical validity of our models and their ability to reproduce salient structural features of 2MN. The article is organized as follows. In the next section we outline the motivation for developing explicit stochastic models for 2MN. We sketch the main limitations of existing approaches that the model we propose may help to overcome. In the third section we introduce stochastic actor-based models, define their main analytical components, and state their main underlying assumptions in the context of 2MN. In the fourth section we describe the empirical setting that we have selected to provide a meaningful illustration. 
In the fifth section we describe the data and link them to the specific models that we estimate and evaluate in the subsequent section. We conclude the article with a short discussion on the general usefulness, applicability, and limitations of the modeling process that we have presented in organizational research.
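The unimodal projection strategy described above, and the information loss it entails, can be illustrated with a small numerical sketch. The two affiliation matrices below are hypothetical and are not taken from the Epiphany data; the function name `project` is likewise an assumption made for illustration. The sketch shows two structurally different 2MN that yield the same binary person-by-person network.

```python
import numpy as np

def project(Y):
    """Return the valued row-by-row and column-by-column projections of a 2MN."""
    rows = Y @ Y.T   # entry (i, h): number of column elements shared by rows i and h
    cols = Y.T @ Y   # entry (j, k): number of row elements shared by columns j and k
    return rows, cols

def binarize(P):
    """Dichotomize a projection and drop self-ties on the diagonal."""
    B = (P > 0).astype(int)
    np.fill_diagonal(B, 0)
    return B

# Hypothetical 2MN A: three persons, each pair sharing a different group.
Y_a = np.array([[1, 1, 0],
                [1, 0, 1],
                [0, 1, 1]])
# Hypothetical 2MN B: the same three persons all belonging to one group.
Y_b = np.array([[1],
                [1],
                [1]])

persons_a, groups_a = project(Y_a)
persons_b, groups_b = project(Y_b)

# Both two-mode structures collapse onto the same complete person-by-person network.
same_projection = np.array_equal(binarize(persons_a), binarize(persons_b))
```

In this toy case the person-by-person network cannot distinguish three pairwise affiliations from one shared affiliation, which is the kind of information that models defined directly on the 2MN retain.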



Background and Motivation

Two-mode networks are used to represent the association between two (or more) different classes of entities such that relations are possible only between entities that are of different classes, but not between entities within the same class (Arabie, Carroll, & DeSarbo, 1987; Wasserman & Faust, 1994). A 2MN records information about measurement on a binary affiliation variable (Wasserman & Faust, 1994), and hence it may be interpreted as a special case of a more general qualitative ‘‘case-by-variable’’ design (Breiger, 2009). Organizational research frequently examines how social units defined at one level (e.g., ‘‘individuals’’ or ‘‘organizations’’) are linked to (or ‘‘construct’’) other social units defined at another level (e.g., ‘‘groups’’ or ‘‘markets’’) (Breiger & Mohr, 2004). According to Lomi and Larsen (2001, p. 6): ‘‘What makes organizations meaningful and relevant units of analysis’’ is the ‘‘inter-dependence of actors and action across levels.’’

The generality of research interests that depend on the specification of models for 2MN is considerable (Faust, 2005). Studies of organizational populations, fields, and communities frequently represent the relation between organizations and their environments in terms of a two-mode association between organizations and resource positions, or ‘‘niches’’ (Burt & Talmud, 2003; Podolny, Stuart, & Hannan, 1996). Studies of intercorporate relations are based on the dual association between board members and corporate boards (Robins & Alexander, 2004). Finally, studies of organizational identities from an institutional perspective are frequently interested in the dual association between organizational practices and identity categories used to classify potential beneficiaries of organizational services (Breiger, 2000; Mohr, 1998; Mohr & Duquenne, 1997).

Two dominant approaches to the empirical analysis of 2MN may be identified. They are both useful but less than fully satisfactory.
The first is based on the identification of the two unimodal components (so-called one-mode projections) embedded in 2MN and subsequent analysis of the derived one-mode networks (Breiger, 1974). For example, the two-mode association between companies issuing securities and the investment banks managing the issues (Podolny, 1994) may be decomposed into two one-mode associations. The first is between companies affiliated indirectly through the banks managing their issues. The second association is between banks indirectly affiliated through companies relying on their services. Breschi and Lissoni (2005) adopt the same approach in a different context to derive relations between inventors from a large 2MN recording the affiliation of inventors to patent applications submitted to the European Patent Office. Singh (2005) follows a similar strategy to derive the ‘‘social distance’’ between patent teams in terms of the numbers of inventors they share. The basic problem with this approach is that the analysis of either of the two modes cannot make efficient use of information available on the other (Zhou, Ren, Medo, & Zhang, 2007). One alternative is to use valued (instead of binary) unimodal projection matrices (Wasserman & Faust, 1994). In this case, information loss would be limited to paths of length three or more and to actors with single affiliation (Wang et al., 2009). Perhaps more important from a theoretical point of view is that analysis of the dyadic relations reconstructed indirectly on the basis of joint affiliation limits the possibility of a more complete theoretical understanding of the role played by social settings—or the ‘‘loci in which ties originate’’—on organizational outcomes (Sorenson & Stuart, 2008, p. 267). For these various reasons it is clearly preferable to build models that can make full use of information contained in the original network. 
Galois lattices represent one such class of models that may be usefully applied when information on the dual association is cross-sectional (Freeman & White, 1993; Pattison & Breiger, 2002). Correspondence analysis provides useful descriptive representations of 2MN (Faust, 2005).

The second main approach to the analysis of 2MN is based on direct statistical analysis of two-mode dyads. This strategy makes full use of the information available on the dual association


between different kinds of entities or on the association between actors and events. For example, in a study of 92 Internet security ventures receiving investments from venture capitalists between 2000 and 2005, Hallen (2008) used (rare events) logistic regression to estimate the likelihood that a given potential investor forms an investment tie with a given venture. A similar strategy is also used by Singh (2005) who relies on patent citation data to estimate the probability of knowledge flow between inventors of any two patents. Although in these specific examples such strategy may be justified by the extreme sparseness of the original 2MN, the general problem with this approach is that it assumes the dyads to be independent. Any dependence that there may be in the data is reduced further by choice-based schemes used to randomly sample the null dyads (Manski & McFadden, 1981). In their analysis of the formation of syndicate relations between U.S. venture capital (VC) firms, Sorenson and Stuart (2008) identify the problem with characteristic clarity: We have modeled the choice of a given syndicate partner as being conditional independent of the other partners that have already joined the syndicate. In reality, however, the decision to join a syndicate may depend on the identities of other VC firms simultaneously joining in the same financing round or even on the (unobserved) firms who were invited to join the syndicate but declined the invitation [emphasis added]. Unfortunately we do not know of a feasible methodological approach for addressing these interdependencies. (p. 291) Recently derived ERGM for 2MN can be brought to bear directly on this problem when the dual association of interest is examined in cross-sectional samples (Wang et al., 2009). 
New specifications of ERGM for 2MN afford empirical testing of a variety of dependence assumptions that may be formulated to address the problem identified by Sorenson and Stuart (2008) in the context of interorganizational networks. In longitudinal samples, however, the problem of network dependencies has no obvious solution. In the next section we outline a newly derived class of stochastic actor-based models for the analysis of 2MN that addresses these problems directly. We then illustrate how these models can be specified, estimated, and tested using actual data that we have collected on the dual association between software developers and software problems encountered during a release cycle of a F/OSS project.

Model Development

Suppose that $Y(t)$ is a rectangular binary array of size $n \times m$ defining a 2MN in which the elements record the existence of a relation ($R$) between an element of the first set (the ‘‘row set’’) and an element of the second set (the ‘‘column set’’). Suppose, furthermore, that repeated observations on $Y(t)$, namely $y(t_1), y(t_2), \ldots, y(t_m), \ldots, y(t_T)$, are available at time points $t_1, t_2, \ldots, t_m, \ldots, t_T$ (with $T \geq 2$). For example, in the illustration that we discuss in the following, the row set includes individuals ($i$), and the column set includes problems ($j$). The existence of a relation is recorded whenever an individual $i$ decides to engage (or ‘‘work on’’) a problem $j$. We write $(iRj)_{t_m}$ to summarize the statement: ‘‘Individual $i$ engages problem $j$ at time $t_m$.’’ Given this general setup adapted from Snijders, Koskinen, and Schweinberger (2010), stochastic actor-based models assume that the network structure observed at any one moment develops as a result of interdependent individual decisions. Actors may change only the ties under their direct control, that is, their own ‘‘row’’ of associations with the various contexts represented by the columns, and no single actor has control over the entire network structure. Statistically, this assumption leads to a representation of the network structure observed at any moment as a realization of a



continuous-time stochastic process. In social networks research, Holland and Leinhardt (1977) and Wasserman (1980) pioneered this approach, suggesting that a continuous-time Markov chain defined over the space of all possible networks with a given node set could be used for modeling the dynamics of social networks observed at discrete time points. Stochastic actor-based models for network evolution are defined in terms of two components: the opportunity for change and the decision to change. We discuss them in turn, building on the complete treatments of the model contained in Snijders et al. (2011); Snijders, Koskinen, et al. (2010); and Snijders, van de Bunt, et al. (2010).

The starting point of actor-based models for 2MN is represented by the assumption that at random moments each row element (a ‘‘software developer’’ in the application that we present in the following) has the opportunity to change its association with the column elements (a ‘‘software problem’’ in our example). Because a Markov process is assumed, waiting times between opportunities for change are exponentially distributed. A rate function $\lambda_i(\alpha, y)$ controls how quickly opportunities for change arise. At any time point $t$, $Y(t) = y$ and the opportunity for change is defined by the rate function:

$$\lambda(\alpha, y) = \sum_i \lambda_i(\alpha, y), \qquad (1)$$

where $\alpha$ is a parameter that may be estimated directly from data. The rate function in Equation 1 may be constant between observation moments, or it may change depending on covariates or positional characteristics of the actor ($r_{ik}(y)$). Parameterization of the actor-level rate function may be accomplished via an exponential link so that:

$$\lambda_i(\alpha, y) = \exp\left(\sum_k \alpha_k r_{ik}(y)\right). \qquad (2)$$
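A minimal sketch of Equations 1 and 2 may help fix ideas. It assumes, purely for illustration, that the rate covariates $r_{ik}(y)$ are a constant and the actor's outdegree; the function names, the toy network, and the parameter values are hypothetical and are not the specification estimated in this article.

```python
import numpy as np

def actor_rates(Y, alpha):
    """Equation 2 with two assumed covariates: a constant and the outdegree."""
    outdegree = Y.sum(axis=1)
    R = np.column_stack([np.ones(Y.shape[0]), outdegree])  # r_ik(y), one row per actor
    return np.exp(R @ alpha)                               # lambda_i(alpha, y)

def next_opportunity(Y, alpha, rng):
    """Draw the waiting time to the next opportunity and the actor who receives it."""
    lam = actor_rates(Y, alpha)
    total = lam.sum()                             # Equation 1: overall rate
    waiting_time = rng.exponential(1.0 / total)   # exponentially distributed waiting time
    actor = rng.choice(len(lam), p=lam / total)   # actor chosen proportionally to lambda_i
    return waiting_time, int(actor)

# Hypothetical network: developer 0 works on two bugs, developer 1 on none.
Y = np.array([[1, 1, 0],
              [0, 0, 0]])
alpha = np.array([0.0, 0.5])       # hypothetical rate parameters
rng = np.random.default_rng(0)

lam = actor_rates(Y, alpha)
waiting_time, actor = next_opportunity(Y, alpha, rng)
```

With a positive outdegree parameter, the more active developer in this toy network receives opportunities for change at a higher rate than the inactive one.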

In the illustration we present next, we make the opportunity for change that actors face dependent on their total activity as represented by the marginal value of their row (or ‘‘outdegree’’). In this way we admit the possibility that more active actors enjoy greater opportunities to change their portfolio of affiliations to problems over time. Obviously, theoretical or empirical considerations may suggest the inclusion of other effects in the rate function.

The second component of an actor-based model for 2MN is the individual decision of actors (rows) to change their affiliation to issues (columns). This decision is controlled by an evaluation function $f_i(\beta, y^{(0)}, y^{(1)})$ that is used to represent the relative attractiveness of moving from $y^{(0)}$ to $y^{(1)}$, defined as a successive network configuration differing in terms of only one tie. Actors also have the option not to change anything. In this case, network configurations in successive time periods will be identical. The conditional probability that the network observed in the next time period is $y^{(1)}$ given that it is $y^{(0)}$ in the current period and actor $i$ gets an opportunity to make a change is:

$$p_i(\beta, y^{(0)}, y^{(1)}) = \frac{\exp\left(f_i(\beta, y^{(0)}, y^{(1)})\right)}{\sum_{\tilde{y} \in A_i(y^{(0)})} \exp\left(f_i(\beta, y^{(0)}, \tilde{y})\right)}, \quad \text{if } y^{(1)} \in A_i(y^{(0)}), \qquad (3)$$

and $p_i(\beta, y^{(0)}, y^{(1)}) = 0$ if $y^{(1)} \notin A_i(y^{(0)})$, where $A_i(y^{(0)})$ is the permitted set of values that the network

may assume. Empirical estimates of parameters ($\beta$) in Equation 3 are consistent with revealed preferences interpretations as described, for example, in Maddala (1983): If actors change their affiliation profile producing a change in the network from configuration $y^{(0)}$ to $y^{(1)}$, then they act as if they prefer $y^{(1)}$ to $y^{(0)}$. In this framework, the estimates provide information about the relative attractiveness for actor $i$ of moving from one network configuration ($y^{(0)}$) to another ($y^{(1)}$) by changing one of the network ties under his or her direct control.
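The choice probabilities in Equation 3 can be sketched as a multinomial logit over the networks that actor $i$ is permitted to reach. The sketch below assumes that the permitted set contains the current network plus every network obtained by toggling a single tie in row $i$; the evaluation function used in the example (a negative weight on outdegree) and all names are hypothetical.

```python
import numpy as np

def candidate_networks(Y, i):
    """Assumed permitted set A_i(y0): no change, or one toggled tie in row i."""
    candidates = [Y.copy()]                # option of not changing anything
    for j in range(Y.shape[1]):
        Y_new = Y.copy()
        Y_new[i, j] = 1 - Y_new[i, j]      # create or dissolve the tie (i, j)
        candidates.append(Y_new)
    return candidates

def choice_probabilities(Y, i, evaluate):
    """Equation 3: probabilities proportional to exp of the evaluation function."""
    candidates = candidate_networks(Y, i)
    f = np.array([evaluate(i, Y, Y_new) for Y_new in candidates], dtype=float)
    f -= f.max()                           # stabilizes the exponentials, leaves ratios unchanged
    weights = np.exp(f)
    return candidates, weights / weights.sum()

# Hypothetical evaluation function: each tie held by actor i costs one unit.
def toy_evaluation(i, Y0, Y1):
    return -1.0 * Y1[i].sum()

Y = np.array([[1, 0, 0],
              [0, 1, 1]])
candidates, probs = choice_probabilities(Y, 0, toy_evaluation)
most_likely = candidates[int(np.argmax(probs))]
```

Under this toy evaluation function, the most probable move for developer 0 is to drop the single existing tie, since every tie lowers the value of the resulting configuration.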


In empirical work, the evaluation function assumes the typical linear form:

$$f_i(\beta, y^{(0)}, y^{(1)}) = \sum_{k=1}^{L} \beta_k s_{ik}(y^{(0)}, y^{(1)}), \qquad (4)$$

where the term $s_{ik}$ may be specified to represent aspects of network structure that are contextually salient or that are considered important on theoretical grounds. The $s_{ik}$ may be (a) specified in terms of purely structural (or ‘‘endogenous’’) network effects, (b) dependent on actor-specific covariates representing ‘‘exogenous’’ characteristics of the individual, and (c) defined by interactions between structural network effects and actor-specific covariates. Within the set of structural network effects it is possible to distinguish between degree-based and closure-based effects. The most general degree-based effects include the (a) outdegree effect to model the baseline tendency of the row elements to associate with the column elements (and therefore the network ‘‘density’’), (b) outdegree activity effect to model the variance in activity of the row elements (or the ‘‘spread’’ of the outdegree distribution), (c) indegree popularity effect to model the variance in popularity of the column elements (or the ‘‘spread’’ in the indegree distribution), and (d) degree assortativity effects to model the tendency of particularly active row elements (high outdegree) to affiliate with particularly popular column elements (high indegree). The fundamental closure-based effect in 2MN is the so-called 4-cycle (or ‘‘social circuit’’) effect (Robins & Alexander, 2004; Wang et al., 2009), which captures the tendency for actors sharing participation in an event to share participation in additional events in the future. All these structural network effects will play a role in our empirical model specification. Additional discussion of structural effects in models for 2MN may be found in Koskinen and Edling (in press), who also introduce bipartite two-stars and bipartite two-paths effects. More complex effects are derived in Snijders et al. (2011). Actor-specific covariates may enter the model specification in a variety of ways.
Perhaps the simplest is as a ‘‘sender effect,’’ which captures the tendency of a specific attribute to cause an increase (or as the case may be a decrease) in the outdegrees.1 Finally, actor-specific covariates may be interacted with structural network effects to capture the tendency of nodes possessing specific attributes to become part of specific local network structures. Snijders, van de Bunt, et al. (2010) offer a detailed discussion of the various possibilities in the more general context of stochastic actor-based models for social networks. In the analysis of 2MN, the situation is slightly more complicated by the fact that there are two sets of elements and therefore two distinct sets of nodal properties to take into account. For this reason, the kind and number of potential interaction effects that may be included in the evaluation function (Equation 4) is very large and cannot be tabulated exhaustively but only generically, as we shall do later in the article. Once the rate (Equation 2) and the evaluation (Equation 4) functions have been specified, it is possible to write down the intensity matrix of the Markov chain ($Q$) implied by the model as:

$$Q(y^{(0)}, y^{(1)}) = \lambda_i(\alpha, y^{(0)}) \, p_i(\beta, y^{(0)}, y^{(1)}), \qquad (5)$$

if $y^{(1)} \in A_i(y^{(0)})$ and 0 otherwise.
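The linear evaluation function in Equation 4 can be sketched by computing a few structural statistics for one actor. The operationalizations below (outdegree, the summed indegree of the actor's column elements as a popularity statistic, and the number of 4-cycles passing through the actor) are simplifying assumptions chosen for illustration, evaluated on the candidate network only; they are not necessarily the exact effect definitions used in SIENA software, and all names and parameter values are hypothetical.

```python
import numpy as np

def statistics(i, Y):
    """Assumed statistics s_ik for actor i: outdegree, popularity, 4-cycles."""
    row = Y[i]
    outdegree = row.sum()
    popularity = (row * Y.sum(axis=0)).sum()      # summed indegrees of i's column elements
    shared = Y @ row                              # columns shared with every other actor
    shared[i] = 0                                 # exclude the actor's overlap with itself
    four_cycles = (shared * (shared - 1) // 2).sum()  # pairs of shared columns per co-actor
    return np.array([outdegree, popularity, four_cycles], dtype=float)

def evaluation(beta, i, Y0, Y1):
    """Equation 4 with statistics evaluated on the candidate network Y1 (an assumption)."""
    return float(beta @ statistics(i, Y1))

# Hypothetical network: developers 0 and 1 both work on bugs 0 and 1.
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
beta = np.array([-1.0, 0.0, 0.5])   # hypothetical parameter values

s = statistics(0, Y)
f = evaluation(beta, 0, Y, Y)
```

In this toy network developer 0 has outdegree 2 and closes one 4-cycle with developer 1; multiplying the resulting choice probability by the actor's rate would give the corresponding entry of the intensity matrix in Equation 5.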

Setting

In the section that follows, the stochastic actor-based model for 2MN that we have outlined is specified and estimated using data that we have collected on the dual association of software developers (problem solvers) and software bugs (problems) within a large F/OSS development project. F/OSS projects are virtual communities of software developers who participate in the production of new pieces of software that are freely redistributable and modifiable (O’Mahony, 2003). Successful F/OSS projects like GNU/Linux-based operating systems, the Apache web server, or the Mozilla Firefox browser are frequently presented as ‘‘successful counterexample(s) to conventional wisdom regarding the organization of innovation’’ (von Krogh & von Hippel, 2006, p. 976). In F/OSS projects, centralized control is absent, formal coordination is weak, and participants have direct access to problems (O’Mahony & Ferraro, 2007; Raymond, 1998; Scacchi, 2002). These contextual characteristics make F/OSS projects ideal organizational settings for studying the dynamics of change in the dual association between organizational problem solvers and organizational problems. The decentralized nature of problem-solving and coordination activities makes F/OSS projects a particularly useful context for specifying dynamic models of dual association: In F/OSS projects the identity of individual developers is defined in terms of the problems they engage and possibly solve. The identity of problems is defined, dually, in terms of the developers simultaneously working on their resolution.

F/OSS projects are frequently considered unlikely organizations whose survival depends delicately on the commitment of voluntary participants and their willingness to engage in decentralized problem-solving activities (Crowston & Scozzi, 2008; Mockus, Fielding, & Herbsleb, 2002). Relatively little is known, however, about the processes of self-selection that regulate all the basic components of F/OSS organizational structure (Qureshi & Fang, 2011). For example, von Krogh, Spaeth, and Lakhani (2003) show that software developers self-assign to problems according to their level of ability and knowledge of the project’s codebase. Examples of problems include the implementation of new software functionalities and defects or misbehaviors that emerge while these new functionalities are implemented (i.e., software ‘‘bugs’’ in the jargon of software developers).
In F/OSS projects, bugs are reported, discussed, and confronted publicly. Corrective actions taken by any developer2 are immediately visible to all others. Actual and prospective project participants are actively encouraged to familiarize themselves with the history of contributions in order to find their way into the development process and select with care the problems that will be receiving their efforts.

The specific F/OSS project we analyze is Epiphany, the default web browser of the GNOME graphical desktop environment. Started in 1996 and having seen the contribution of more than 900 developers, GNOME is a now mature F/OSS meta-project whose aim is to produce an easy-to-use desktop as well as a framework for building desktop applications (Lungu, Malnati, & Lanza, 2009). A foundation decides the future of GNOME through a board of directors democratically elected among the developers with limits to the representation of for-profit companies. The foundation promotes decentralized development (German, 2004; Noll, 2007). Decentralization in GNOME is reflected by its modular structure. Each module, Epiphany among them, has its own maintainers, developers, and objectives. Epiphany has a dedicated section in GNOME’s online bug repository that has been used by over 1,000 individuals to track and engage bugs affecting the software and its documentation during development, installation, or usage.

Engaging in bug-fixing activities is explicitly encouraged in GNOME’s official wiki as the first avenue into the project. In a section specifically designed to guide prospective contributors, aptly named ‘‘GNOME Love,’’ newcomers are first told that tasks are not usually assigned by existing members because

usually, people will not tell you what tasks you need to do to contribute to GNOME. Instead you have to find things to work on yourself. Learning how to find things to work on is a big part of learning how to be a contributor. (GNOME, n.d.)

Secondly, newcomers are encouraged to familiarize themselves with the bug reports associated with the module to which they would like to contribute. Reporting new bugs, improving existing bug reports, and submitting software code to fix a bug (so-called patches) are all mentioned as potential


tasks, and when assigning themselves to one of them, newcomers are also explicitly encouraged to relay and collaborate with existing developers. When reporting bugs, individuals describe the problem they encountered by providing information on the circumstances in which the potential bug manifested itself and the version of Epiphany they are using. All newly opened bug reports are visible to the entire community, which is then able to triage and audit them. Following an initial description, a severity and a priority level are assigned to each bug. Developers are notified of newly opened bugs and may self-assign to bugs; however, all subscribers to the bug repository may add themselves, or others deemed knowledgeable, to the carbon copy (Cc) list of a bug report in order to track all subsequent actions taken by others. In particular, developers may attach patches and change the status or other attributes of a bug. For example, a bug may be declared obsolete or solved, although it may be reopened at a later stage if the solution is found not to be viable.

Empirical Illustration

Orienting Questions

‘‘The apparently anarchistic process of open source production, in which no one tells anybody else what to do’’ noted by Lerner and Tirole (2001, p. 821) is precisely what makes organizational analysis of F/OSS projects challenging. How are programming resources allocated within F/OSS projects? How are priorities assigned? What patterns of specialization and division of labor emerge to sustain production over time? Indeed, is specialization an organizational characteristic of F/OSS projects? If this is the case, what are the mechanisms underlying integration and coordination? Much of the earlier literature on F/OSS projects tried to address these questions by focusing on motivational aspects of participation in F/OSS communities. Our strategy is to address these questions by examining the concrete properties of the problem-solving structure emerging from the observed association between problem solvers (developers) and problems (software bugs).

Data

We collected data on the complete set of problem-solving activities recorded in Epiphany’s bug repository during one release cycle (from March 2006 to September 2006). Throughout the cycle, 135 developers were active in bug fixing and 719 bugs were engaged. These characteristics of bug fixing in Epiphany appear consistent with the mature and successful status reached by GNOME since 2003 (Krishnamurthy, 2002; Lungu et al., 2009). All bugs engaged by developers during the release cycle are included and all actions taken while working on them are considered. The raw data were collected by parsing the web pages of all relevant bug reports with the software Bicho (Robles, González-Barahona, Izquierdo-Cortazar, & Herraiz, 2009). We reconstructed the evolving two-mode association between developers and bugs throughout a release cycle by coding every action undertaken by developer $i$ on bug $j$ as a tie linking $i$ to $j$. Actions were aggregated in four panels ($t_1, \ldots, t_4$), each subsuming a development period of approximately one and a half months.3 Two (or more) developers collaborate if they act on the same bug at the same time. For example, if developer $i$ engages bug $j$ at $t_1$ but not at $t_2$, the tie linking $i$ to $j$ will be present in the network at $t_1$ but will be absent from the network at $t_2$, thus witnessing a shift in the focus of $i$’s engagement. As a partial exception, a tie linking $i$ to $j$ as the result of $i$ putting himself or herself in Cc of $j$ is carried over to subsequent time points unless $i$ takes action to remove himself or herself from Cc. This is because Cc’ing oneself, as described previously, testifies to a continuous engagement on $j$ that $i$ expresses by keeping up to date with the work done by other developers on a specific bug. The resulting networks are formed by 135 developers and 719 bugs, namely, by the total number of developers and bugs active throughout the release cycle. However, some developers and bugs were

Figure 2. Weighted and dichotomized degree distributions of developers (degrees ≤ 40 represented) and bugs aggregated over the entire release cycle. Panels: Developers, Bugs. Horizontal axis: Degree. Vertical axis: Frequency. Series: Weighted, Dichotomized.

found to be completely inactive at specific time points and in our analyses were considered to have temporarily left the network. Given that multiple actions were occasionally observed from the same developers on the same bugs at the same time point, the networks were originally weighted (i.e., a value was assigned to the tie linking i to j according to the number of actions taken by i on j). Since we are interested in understanding the process of matching between developers and bugs, not the intensity of developers’ engagement in specific problem-solving activities, we decided to recode the networks such that all tie values greater than zero were set to one. We believe that this dichotomization criterion allows us to represent the matching process with a limited loss of information.4 As shown in Figure 2 for the entire release cycle, the dichotomized degree distributions maintain a level of skewness not qualitatively dissimilar to that which characterizes the original weighted degree distributions. Indeed, these highly skewed distributions are consistent with the findings of the existing literature on participation in F/OSS projects (Bird, Gourley, Devanbu, Gertz, & Swaminathan, 2006; Crowston & Howison, 2006). In Figure 3 the dichotomized networks for the four time points used in the analysis are shown. We also measured two attributes of developers and bugs, respectively (see Table 1). Both attributes were coded as dummy variables. Developer tenure in project records whether or not a developer had been active in a previous release cycle. Bug severity classifies bugs according to the severity level assigned to them before the beginning of the release cycle. In GNOME, bugs are classified in order of increasing severity as: enhancements, trivial, minor, normal, major, critical, blocker. Bugs classified as enhancements are interpreted as requests for new features more than as software failures. 
Most bugs are classified as either ‘‘normal’’ or ‘‘critical,’’ the two categories accounting for 30% and 32% of all recorded bugs, respectively. We define a bug as severe if it is classified above the ‘‘normal’’ level.5 Empirical model specification. In our empirical model, the rate function (Equation 2) indicates the expected number of opportunities that software developers have to change their affiliation with software bugs. The evaluation function controls individual preferences, namely, the subjective utility that developers experience when changing their affiliation profile.


Figure 3. Epiphany’s developer-bug associations observed during the release cycle March-September 2006 aggregated at four time points (a-d). Dark gray squares represent developers, light gray circles represent bugs. Inactive developers and bugs are not depicted. Network drawn using Fruchterman-Reingold graph layout algorithm. Panels: (a) Time Point 1; (b) Time Point 2; (c) Time Point 3; (d) Time Point 4. Panel legends, in the order extracted: Devs. (54), Bugs (186); Devs. (69), Bugs (237); Devs. (107), Bugs (568); Devs. (95), Bugs (316).
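The coding and dichotomization steps described in the Data section can be illustrated with a minimal sketch. It assumes a hypothetical action log of (developer, bug, panel) triples; the names `actions` and `build_panels` are illustrative and not taken from the study, and the carry-over rule for Cc ties is deliberately left out.

```python
import numpy as np

# Hypothetical action log: one (developer, bug, panel) triple per recorded action.
actions = [
    ("dev_a", "bug_1", 1), ("dev_a", "bug_1", 1), ("dev_a", "bug_2", 1),
    ("dev_b", "bug_1", 1), ("dev_b", "bug_3", 2), ("dev_a", "bug_3", 2),
]

def build_panels(actions, developers, bugs, n_panels):
    """Count actions per (panel, developer, bug) cell: the weighted two-mode networks."""
    dev_index = {d: i for i, d in enumerate(developers)}
    bug_index = {b: j for j, b in enumerate(bugs)}
    weighted = np.zeros((n_panels, len(developers), len(bugs)), dtype=int)
    for dev, bug, panel in actions:
        weighted[panel - 1, dev_index[dev], bug_index[bug]] += 1
    return weighted

weighted = build_panels(actions, ["dev_a", "dev_b"],
                        ["bug_1", "bug_2", "bug_3"], n_panels=2)

# Dichotomize: every tie value greater than zero is set to one.
binary = (weighted > 0).astype(int)

# Degrees per panel: bugs engaged by each developer, developers engaging each bug.
developer_degree = binary.sum(axis=2)
bug_degree = binary.sum(axis=1)
```

Comparing `weighted.sum(axis=2)` with `developer_degree` gives the weighted and dichotomized degrees that a figure such as Figure 2 contrasts.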

Table 1. Summary Table for Actor-Specific Attributes (Developers = 135, Bugs = 719)

Attribute | Type | Motivation | Operationalization | Proportion | Expected Effect
Developer tenure in project | Constant | Controls for the effect of learning | One if a developer was active in the prior release cycle, zero otherwise | 0.430 | Positive
Bug severity | Constant | Controls for the effect of centralizing tendencies | One if a bug had a severity level above “normal” in the prior release cycle, zero otherwise | 0.349 | Positive
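The dummy coding of Bug severity summarized in Table 1 can be sketched as follows. The ordering of levels is the one given in the text; the function name `is_severe` and the example list of bugs are hypothetical.

```python
# GNOME severity levels in order of increasing severity, as listed in the text.
SEVERITY_LEVELS = ["enhancement", "trivial", "minor", "normal",
                   "major", "critical", "blocker"]

def is_severe(level):
    """Return 1 if the level is above 'normal', 0 otherwise."""
    return int(SEVERITY_LEVELS.index(level) > SEVERITY_LEVELS.index("normal"))

# Hypothetical bugs, used only to show how a proportion such as the one
# reported in Table 1 would be computed from the dummy variable.
example_levels = ["normal", "critical", "minor", "blocker"]
severity_dummy = [is_severe(level) for level in example_levels]
proportion_severe = sum(severity_dummy) / len(severity_dummy)
```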



We specify four parameters in the rate function (Equation 2): Rate of network change 1, 2, and 3 are the baseline rate parameters that model the average number of opportunities for change in each of the time periods. A fourth parameter, Outdegree effect on rate, captures the tendency for developers to experience progressively more opportunities for changing their engagements as their current level of engagement increases. If positive, the corresponding estimate implies a reinforcing feedback dynamic concerning the rate of change in problem-solving activities. In the evaluation function (Equation 4), we specify exogenous actor-specific covariates and endogenous network effects. We include exogenous actor-specific covariates to control for factors that may affect the propensity of individual developers to become involved in problem-solving activities. Among exogenous attributes we specify the main effects of Developer tenure in project and Bug severity (see Table 1). The Developer tenure in project effect is a constant actor-specific covariate included to control for the well-documented effect of experiential learning (Argote & Epple, 1990). Following prior studies we expect the engagement of developers to increase with experience (Robles & González-Barahona, 2006). The Bug severity effect is a constant actor-specific covariate that controls for the extra attractiveness of severe bugs. If positive, the corresponding effect may be taken as evidence of “centralizing” tendencies in the problem-solving process. Positive (negative) parameters associated with the two actor-specific covariates would imply, respectively, that experienced developers tend to engage more (fewer) bugs and severe bugs tend to attract more (less) attention. Endogenous network effects are included to explore how specific patterns of network self-organization affect problem-solving activities. 
Such effects are based on network statistics computed by counting the number of specific local network configurations surrounding each developer in the observed networks (see Table 2). The decision to engage a bug may not depend uniquely on the individual motivation of a developer or the intrinsic relevance of a bug. In a context in which actions taken by others are visible and all actions are voluntary, individual decisions to engage in problem-solving activities may also be influenced by the decisions already made by other developers in the project. Endogenous network effects therefore play a particularly important role in our modeling as empirical estimates are directly linked to the orienting questions of the analysis. We include the Outdegree (density) effect as a baseline parameter that indicates the overall propensity of developers to engage bugs throughout the release cycle. The Developer activity effect is included to investigate the existence of a feedback dynamic whereby engaging many bugs progressively increases (or decreases in the case of a negative parameter) the likelihood of engaging further bugs. Similarly, the Bug popularity effect tests whether bugs already engaged by many developers are progressively more (less) likely to attract further developers. A statistically significant parameter associated with the Bug popularity effect would indicate that the decisions to engage a bug made by others act as a structural mechanism of problem prioritization by progressively increasing the likelihood of further engagements. A 4-cycle (transitive closure) effect is included to examine whether engaging the same bugs increases the likelihood of developers engaging more bugs together in the future. A positive parameter associated with the 4-cycle (transitive closure) effect would indicate that the local structure of collaboration influences future decisions by producing clusters of developers that collaborate over time on the resolution of the same bugs. 
Finally, we include a Developer-bug assortativity effect to investigate whether more active developers are more likely to engage more popular bugs. A positive parameter associated with this effect would indicate that prior voluntary associations between developers and bugs create endogenous hierarchies of actors. Model estimation and evaluation. Because we are interested in computing the transition probabilities between neighboring states, the state space of the model consists of all possible trajectories that link networks observed in adjacent time periods. Under such conditions, conventional statistical estimation is infeasible and parameter estimates may be obtained via an iterative Robbins-Monro (Robbins & Monro, 1951) stochastic approximation algorithm using the method of moments (MoM)


Table 2. Summary of Endogenous Local Network Effects and Exogenous Actor-Specific Covariates (configuration diagrams not reproduced)

Parameter | Included to Control for
Outdegree (density) | Overall tendency to engage a bug
4-cycles (transitive closure) | Tendency for pairs of developers to engage the same bugs
Developer activity | Tendency for active developers to engage extra bugs
Bug popularity | Tendency for popular bugs to attract extra developers
Developer-bug assortativity | Tendency for active developers to engage popular bugs
Developer tenure in project | Tendency for developers active in the prior release cycle to engage bugs
Bug severity | Tendency for severe bugs to attract developers

Note: Dark gray squares represent developers, light gray circles represent bugs. Dashed white squares (circles) represent developers (bugs) with a specific dyadic attribute. Solid lines represent existing ties. Dotted lines represent ties not yet existing.
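To make the endogenous effects in Table 2 concrete, the sketch below counts the corresponding configurations in a dichotomized developer-by-bug matrix. It is an illustration under stated assumptions: the statistics use raw degrees, whereas the exact functional forms implemented in RSiena may differ (for example, square-root variants), and the function name `network_statistics` is hypothetical.

```python
import numpy as np
from math import comb

def network_statistics(x):
    """Count local configurations in a binary developer-by-bug matrix x."""
    x = np.asarray(x)
    out_degree = x.sum(axis=1)   # number of bugs engaged by each developer
    in_degree = x.sum(axis=0)    # number of developers engaging each bug
    shared = x @ x.T             # shared[i, k] = bugs engaged by both i and k
    n_dev = x.shape[0]
    # A 4-cycle is a pair of developers engaging the same pair of bugs.
    four_cycles = sum(comb(int(shared[i, k]), 2)
                      for i in range(n_dev) for k in range(i + 1, n_dev))
    return {
        "ties": int(x.sum()),                                   # outdegree (density)
        "four_cycles": four_cycles,                             # 4-cycles
        "developer_activity": int((out_degree ** 2).sum()),     # assumed raw form
        "bug_popularity": int((in_degree ** 2).sum()),          # assumed raw form
        # Sum over existing ties of (developer degree) x (bug degree).
        "assortativity": int((x * np.outer(out_degree, in_degree)).sum()),
    }

example = [[1, 1, 0],
           [1, 1, 1],
           [0, 0, 1]]
stats = network_statistics(example)
```

In the example, the first two developers share two bugs, which yields exactly one 4-cycle.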



estimation procedure (Bowman & Shenton, 1985). The MoM identifies parameters as solutions to a set of estimating equations in which simulated network statistics are equated to observed network statistics. The algorithm repeatedly simulates the evolution process of the network via Markov chain Monte Carlo (MCMC) simulations implied by the current model specification (Snijders, 2002). Parameter values are updated after each simulation, reflecting the deviations between generated and observed statistics corresponding to the effects included in the model (Schweinberger & Snijders, 2007). For example, the number of pairs of developers engaging the same two bugs is the statistic calculated in both the observed and simulated networks at each time point for the corresponding 4-cycle (transitive closure) effect. The algorithm searches for parameter values where the deviations between observed and simulated statistics average out to zero. The estimation procedure is successful if the deviation between the simulated and observed statistics becomes sufficiently close to zero. To this end, for all estimated parameters, averages and standard errors of these deviations are computed across all the simulated networks and arranged in convergence t-ratios. Values smaller than 0.1 in absolute value are required for a model to be considered fully convergent (Snijders, van de Bunt, & Steglich, 2010). If and when parameter values stabilize (i.e., a model reaches convergence), they are held constant and their respective standard errors computed by estimating the associated covariance matrix and matrix of derivatives. As suggested by Ripley, Snijders, and Preciado Lopez (2011), we repeated such estimation 3,000 times to obtain reliable estimates of the standard errors. Because of the novelty of the model, reliable rules of thumb about data requirements are not yet available. 
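The logic of the stochastic approximation described above can be illustrated on a deliberately simplified toy problem. The sketch assumes a one-parameter model in which each dyad is tied independently with logistic probability, so the only statistic is the number of ties; it is not the actor-based model, and the function names and the 1/n gain sequence are illustrative choices rather than the settings used by RSiena.

```python
import numpy as np

def simulate_ties(theta, n_dyads, rng):
    """Simulated statistic: number of ties under an independent-dyad toy model."""
    p = 1.0 / (1.0 + np.exp(-theta))
    return rng.binomial(n_dyads, p)

def method_of_moments(observed_ties, n_dyads, n_iter=2000, seed=1):
    """Robbins-Monro updates: move theta against the simulated-minus-observed deviation."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for n in range(1, n_iter + 1):
        p = 1.0 / (1.0 + np.exp(-theta))
        sensitivity = n_dyads * p * (1.0 - p)   # derivative of expected ties w.r.t. theta
        deviation = simulate_ties(theta, n_dyads, rng) - observed_ties
        theta -= (1.0 / n) * deviation / sensitivity
    return theta

def convergence_t_ratio(theta, observed_ties, n_dyads, n_sims=1000, seed=2):
    """Average deviation divided by its standard deviation across simulations."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))
    deviations = rng.binomial(n_dyads, p, size=n_sims) - observed_ties
    return deviations.mean() / deviations.std(ddof=1)

theta_hat = method_of_moments(observed_ties=1100, n_dyads=10000)
t_ratio = convergence_t_ratio(theta_hat, observed_ties=1100, n_dyads=10000)
```

In this toy setting the moment equation has a closed-form solution, the log-odds of the observed density, which makes it easy to check that the iterations settle near the right value.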
Experience, however, supports cautious optimism about the applicability of the model to samples of variable size. In studies based on fairly large samples, convergence is typically unproblematic, but computation time increases considerably as sample size increases. In the analysis of one-mode (also known as “social”) networks, stochastic actor-based models have been estimated successfully on networks of 10,000 nodes. In case of very small samples (e.g., N = 20, T = 2), maximum likelihood estimation (rather than the MoM) tends to be more efficient. Experience shows that obtaining reliable estimates is possible even with only two observation points. As is frequently the case in empirical research, the problem is not the number of time periods or the number of cases, but rather the amount of information contained in the data, which depends on the quality of the research design. Because stochastic actor-based models are oriented toward change, how much change is observable in a network over time is likely to play a crucial role in determining the empirical success of a model. For one-mode networks, for example, Snijders, van de Bunt, et al. (2010) suggest that a Jaccard coefficient computed for successive network panels that is below 0.2 may tend to make reliable estimates difficult to obtain and interpret. This is the case because excessive change between observation points weakens network dependencies and may be taken as evidence of structural change. On the other hand, if the Jaccard coefficient computed for successive network panels is above 0.9, change in the network may be insufficient to sustain reliable estimates. 
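The Jaccard coefficient mentioned above is the number of ties present in both of two successive panels divided by the number of ties present in at least one of them. A minimal sketch follows; the function names are hypothetical, and the bounds 0.2 and 0.9 are the ones quoted in the text.

```python
import numpy as np

def jaccard(x_prev, x_next):
    """Share of ties present in either panel that are present in both."""
    x_prev = np.asarray(x_prev, dtype=bool)
    x_next = np.asarray(x_next, dtype=bool)
    stable = np.logical_and(x_prev, x_next).sum()
    either = np.logical_or(x_prev, x_next).sum()
    return stable / either if either else float("nan")

def in_suggested_range(value, lower=0.2, upper=0.9):
    """True if the amount of change falls between the bounds quoted in the text."""
    return lower <= value <= upper

panel_1 = [[1, 1, 0],
           [1, 0, 0]]
panel_2 = [[1, 0, 1],
           [1, 0, 0]]
example_value = jaccard(panel_1, panel_2)   # 2 stable ties out of 4 observed ties
```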
Before discussing the parameter estimates, it is important to assess the overall goodness of fit of convergent models; namely, it is necessary to assess whether, given the set of effects included in a model, the MoM algorithm simulated an evolution process capable of generating networks similar to those observed at each of the time points included in the analysis. Goodness of fit for stochastic network models is generally based on a comparison of model-based simulations of network data sets with the observed network data on which the model was estimated. The general statistical argument underlying this approach is developed by Hunter, Goodreau, and Handcock (2008) in the context of ERGM. In the empirical part of the study, we follow this analytical strategy to compare the degree distributions of the observed 2MN networks to those generated in the simulations (3,000 in our case) that make use of the final estimated parameters in order to estimate their standard errors. The observed network should fall within one standard deviation from the mean calculated across the


simulated ones for all degree values for the model fit to be considered satisfactory. The rationale for focusing on the degree distribution to assess model fit stems from the fact that degree distributions are one of the main global properties used to characterize social networks, and 2MN among them. Degree distributions, however, are not directly parameterized in the models that we specify. Therefore a good fit would imply that simulated networks, based on empirical estimates of the model, reproduce global characteristics of the observed networks only by using the estimated parameters for local network structures and exogenous actor-specific covariates that the model includes. Some may find it useful to have an aggregate goodness-of-fit measure of the full model that is being estimated. Once a useful null model has been identified, informative summary measures of goodness of fit that have been proposed for models estimated using the MoM include the generalized Neyman-Rao score-type test and the well-known Wald-type test. The null distribution of both test statistics approximates a χ² distribution with degrees of freedom equal to the difference in the number of parameters between the models. Neyman-Rao score-type tests are described in Snijders, Steglich, and Schweinberger (2007). Wald-type tests are widely adopted in applied econometrics (Greene, 2003). Both tests may be implemented directly in RSiena (Ripley et al., 2011), the specialized package for the statistical software R (R Development Core Team, 2011) we used to produce the empirical results discussed next. Appendix A contains an annotated R script that can be used to reproduce the results that we report in the next section.
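The degree-based fit criterion described above (for each degree value, the observed frequency should fall within one standard deviation of the mean across simulated networks) can be sketched as follows. The function name and the small arrays are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

def fit_within_one_sd(observed_freq, simulated_freqs):
    """For each degree value, check whether the observed frequency lies within
    one standard deviation of the mean frequency across simulated networks."""
    simulated = np.asarray(simulated_freqs, dtype=float)
    mean = simulated.mean(axis=0)
    sd = simulated.std(axis=0, ddof=1)   # sample standard deviation (assumption)
    return np.abs(np.asarray(observed_freq) - mean) <= sd

# Rows are simulated networks, columns are degree values 0, 1, 2.
simulated_freqs = [[10, 5, 1],
                   [12, 4, 2],
                   [14, 6, 0]]
observed_freq = [13, 7, 1]
within = fit_within_one_sd(observed_freq, simulated_freqs)
satisfactory = bool(within.all())
```

Under this rule the fit counts as satisfactory only if the check passes for every degree value, which is why a single outlying degree is enough to flag a model.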

Analysis

In Table 3 we report parameter estimates and standard errors for rate effects, endogenous network effects, and actor-specific covariates for the three nested models we estimated. Model 1, our null model, only accounts for the overall propensity to engage a bug. Model 2, our restricted model, only accounts for the effects of exogenous actor-specific covariates. Model 3, our full model, also accounts for the effects of endogenous network structures that represent the most distinctive feature of actor-based models for 2MN. Convergence t-ratios for all reported estimates are smaller than 0.1 in absolute value.6 The values of the Jaccard coefficients for tie changes between our subsequent observations are 0.290, 0.488, and 0.394. We start by presenting results of simulation-based goodness-of-fit diagnostics to assess the ability of our models to reproduce salient characteristics of the data. In Figure 4, box-plots illustrate how the networks simulated during model estimation cover the degree distributions observed in our data. Results are shown for the last panel (t4).7 For each model, solid lines represent observed degree distributions, and box-plots represent frequency distributions for different degree values in the 3,000 networks simulated conditioning on the empirical estimates for the corresponding model. A satisfactory fit is achieved when for all degree values the observed frequency falls within one standard deviation from the mean frequency of the simulated networks. Results show that the first two models perform rather poorly, especially with reference to the degree distributions of developers (see Figures 4a, b). The fit improves substantially in Model 3 (see Figures 4c, f), although it is still not entirely satisfactory in the case of developers. 
In Table 4, diagnostic goodness-of-fit results for Model 3 are summarized across all time points (t1 to t4) by reporting the deviations between the averages across all the time periods of the first three moments of the observed degree distributions and the corresponding network statistics based on the simulation of the network distribution implied by the empirical estimates. The results in Table 4 show a more substantial deviation between observed and simulated networks in the case of the skewness of the degree distributions of developers. We take these results as evidence that accounting for the local network structures improves the ability of the model to capture the dynamic association between software developers and software problems. Wald-type tests for joint significance performed on Models 2 and 3 clearly show that the organizational problem-solving process depends on endogenous network structures (p < .001) (see

Figure 4. Goodness-of-fit diagnostic box-plots for degree distributions of developers (a-c) and bugs (d-f) at fourth time point (t4). Panels: (a) Null Model, (b) Main Effects Model, (c) Interaction Model (horizontal axis: Degree of Developers); (d) Null Model, (e) Main Effects Model, (f) Interaction Model (horizontal axis: Degree of Bugs). Vertical axis: Frequency.

Table 3). Interpreted together, the results we reported provide strong evidence that the network dynamics of bug engagement depend both on the individual characteristics of developers and bugs (Model 2) and on endogenous network dependencies triggered by problem-solving activities (Model 3). Finally, we report the results of the generalized Neyman-Rao score-type test for the MoM estimates (Snijders et al., 2007). For both Models 2 and 3 the test is carried out by estimating models in which all the new parameters included (relative to the exogenous actor-specific covariates for Model 2 and to the local network structures of interest for Model 3) are restricted to zero. The restriction is then tested by comparing the restricted and the full model. The low p values (p < .001) associated with the tests suggest that both Model 2 and Model 3 represent an improvement in fit with respect to their restricted counterparts (see Table 3). We now discuss the parameter estimates by presenting the three nested models in order of increasing complexity. Rates of network change (1-3) indicate the frequency with which developers encounter opportunities to alter the status quo by deciding to engage additional bugs or to cease prior engagements (i.e., changing one of their outgoing ties) in each of the time periods specified in the model. The parameter estimates can be interpreted as the number of changes that on average developers make to their existing engagements on bugs over a time period. For example, in Model 3 developers make on average approximately 2.6 of such changes in the last time period (t3→t4) (see Table 3). If we consider Models 1 and 2, this rate appears to increase substantially toward the end of the release cycle. However, after the inclusion of the Outdegree effect on rate of network change in Model 3, the baseline rates indicate that the number of opportunities is in fact constantly decreasing and unevenly distributed across developers throughout the release cycle. 
The positive and significant parameter estimate for Outdegree effect on rate of network change


Table 3. Estimated Stochastic Actor-Based Models for Two-Mode Network (2MN) (Developers = 135, Bugs = 719)a

Effects | Model 1 (Null) | SE | Model 2 (Restricted) | SE | Model 3 (Full) | SE
Rate function | | | | | |
Rate of network change 1 | 5.915*** | 0.408 | 6.557*** | 0.469 | 6.148*** | 0.530
Rate of network change 2 | 3.302*** | 0.231 | 3.731*** | 0.269 | 3.167*** | 0.235
Rate of network change 3 | 7.273*** | 0.381 | 8.747*** | 0.506 | 2.598*** | 0.450
Outdegree effect on rate of network change | | | | | 0.032*** | 0.002
Evaluation function | | | | | |
Endogenous network effects | | | | | |
Outdegree (density) | -2.092*** | 0.048 | -2.320*** | 0.042 | -4.739*** | 0.445
4-cycles (transitive closure) | | | | | 0.306** | 0.115
Bug popularity | | | | | 0.566*** | 0.137
Developer activity | | | | | 0.642*** | 0.163
Developer-bug assortativity | | | | | -0.017*** | 0.005
Exogenous actor-specific covariates | | | | | |
Developer tenure in project | | | 1.445*** | 0.089 | 0.195 | 0.124
Bug severity | | | -0.196*** | 0.059 | -0.279*** | 0.070
Wald χ² statistics (df) | | | 271.442*** (2) | | 1,251.089*** (5) |
Gen. score χ² statistics (df) | | | 364.882*** (2) | | 17,376.703*** (5) |

a Convergence t-ratios for all effects are smaller than 0.1 in absolute value. *p < .05. **p < .01. ***p < .001.

shows that, controlling for the overall average number of opportunities for changes in engagement, developers engaging more bugs enjoy significantly more opportunities to change their affiliation to bugs of their choice. The negative Outdegree (density) parameter in Model 1 suggests that engaging a bug is a rare event, as developers are relatively more likely to decide not to engage a bug. Parameter estimates in the models can also be interpreted in probabilistic terms. For example, in the case of Outdegree (density) in Model 1 we can affirm that the odds that a developer engages a further bug rather than not engaging it are $e^{-2.092} = 0.123$, and the corresponding binary probability that a developer engages a bug rather than not engaging one is $e^{-2.092}/(1 + e^{-2.092}) = 0.11$, that is, 11%. This can be explained by considering that engaging a bug is a costly action as developers’ resources are scarce and their attention is limited. In Model 2 we add actor-specific covariates. Controlling for the overall Outdegree (density) effect, the positive and significant parameter for Developer tenure in project indicates that more experienced developers tend to engage more bugs. Less obvious, and more interesting, is the interpretation of the negative and significant parameter for Bug severity. The parameter indicates that more severe bugs tend to “scare away” developers. This could be due to the fact that only a restricted number of elite developers have sufficient confidence and skills to engage bugs collectively classified as severe. Model 3 includes endogenous network effects that take into account the influence of local network structures surrounding developers on their decision to engage bugs. The positive parameter for Developer activity shows a reinforcing feedback effect driving engagement: The higher the number of engaged bugs, the higher the likelihood of engaging further bugs. 
We observe the same reinforcing dynamics with the Bug popularity effect: The positive and significant parameter indicates that



Table 4. Average, Standard Deviation and Skewness for Observed and Simulated Degree Distributions of Developers and Bugs

 | Observed | Simulateda | Deviation
Developers | | |
Average | 3.832 | 3.836 | -0.004
Standard deviation | 12.900 | 13.642 | -0.742
Skewness | 6.179 | 7.776 | -1.597
Bugs | | |
Average | 0.720 | 0.720 | -0.000
Standard deviation | 0.808 | 0.824 | -0.016
Skewness | 2.237 | 1.671 | 0.566

a Reported values are averages across 3,000 simulated networks.

already popular bugs are more likely to attract further developers. Moreover, the propensities summarized by the positive parameters associated with Developer activity and Bug popularity are consistent with the skewness characterizing the observed degree distributions of the actors. A small number of developers and bugs account for most of the observed ties. More complex local network structures are accounted for by the 4-cycles (transitive closure) and Developer-bug assortativity effects. The positive and significant parameter for the 4-cycles (transitive closure) effect indicates a tendency toward local clustering. Pairs of developers engaging the same bugs become more likely to engage further bugs together, therefore showing a tendency to form a more stable interaction pattern. Ceteris paribus, the odds that a tie closing a 4-cycle forms between a developer and a bug, rather than not forming, are 1.358, with the corresponding binary probability being 57.6%. Developers thus seem to prefer repeated collaborations by engaging bugs that have attracted other developers with whom they have previously collaborated. The significantly negative estimate of the parameter for the Developer-bug assortativity effect indicates that, ceteris paribus, highly active developers become progressively less likely to engage the most popular problems. The result is decentralization of organizational problem-solving activities. Interestingly, when the effects of local network structures are included in the model, the parameter estimates for the actor-specific covariates change. In particular, Developer tenure becomes statistically nonsignificant. Having been active in previous release cycles does not affect the tendency to engage bugs during the current one, once the effect of local network structures on the association between developers and bugs is taken directly into account. By contrast, the negative effect of Bug severity remains significant and becomes stronger. 
In Model 3, ceteris paribus, the odds for a developer to engage a severe bug against not to engage it are 0.756, with the corresponding binary probability being equal to 43.1%.
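The odds and binary probabilities quoted in this section follow from exponentiating the corresponding estimates in Table 3. The short sketch below reproduces the arithmetic; the function name is hypothetical, and the results agree with the reported figures up to rounding.

```python
import math

def odds_and_probability(beta):
    """Convert a parameter estimate into odds and the matching binary probability."""
    odds = math.exp(beta)
    return odds, odds / (1.0 + odds)

# Estimates taken from Table 3.
density_odds, density_prob = odds_and_probability(-2.092)    # Outdegree (density), Model 1
cycle_odds, cycle_prob = odds_and_probability(0.306)         # 4-cycles, Model 3
severity_odds, severity_prob = odds_and_probability(-0.279)  # Bug severity, Model 3
```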

Discussion and Conclusions As interest in network evolution continues to increase among students of organizations (Ahuja, Soda, & Zaheer, 2011), new data are collected (Zwijze-Koning & de Jong, 2005), new questions are asked (van de Bunt & Groenewegen, 2007), and new answers emerge (Snijders et al., 2011). In this article, we presented a novel organizational application of stochastic actor-based models recently developed for the analysis of 2MN. We motivated the model on the basis of the multiplicity of research interests coalescing around 2MN and of the observation that available models suffer from a number of shortcomings that limit their value for empirical research. Specifically, we observed that existing models usually decompose 2MN into unimodal projections, and in


doing so, they lose information on the dual association composing the original 2MN structure. We also observed that more conventional models relying on choice-based sampling of individual two-mode dyads are incapable of taking into consideration extra-dyadic dependencies that are crucial to examine patterns of network self-organization. In our empirical illustration we showed how such patterns may be directly linked to the coordination of decentralized activities of organizational problem solving. In response to these limitations, we proposed a model in which the structural characteristics of the two-mode association between different classes of organizational entities are treated as substantive problems amenable to direct empirical investigation rather than statistical problems to be alleviated or ‘‘controlled for.’’ The model was specified in terms of a series of endogenous network effects representing specific local network structures induced by the dual association of individuals and problems. The results we reported could not have been obtained via traditional models of dyadic regression based on assumptions of dyadic independence. Our analysis of the two-mode association between software developers and software bugs in a F/OSS project sheds light on the dynamics of the structural mechanisms that ensure coordination in problem solving in F/OSS communities. Controlling for individual characteristics of bugs and developers, the likelihood of a developer to engage a bug is strongly affected by the local network structures in which such action is embedded. In particular, the large heterogeneity observed in the degree distributions is explained by bandwagon effects driving both the activity of the developers and the popularity of the bugs. We documented the emergence of local collaboration clusters, induced by the tendency of the same developers to cluster over time around the same subset of bugs. 
In more general terms, we have shown specific ways in which a 2MN of “problem solvers” by “problems” internalizes the information that is necessary for coordinating important organizational problem-solving activities (“bug fixing” in our case). Although our main interest in this article was to develop a distinct methodological angle on the analysis of 2MN, we think that our empirical findings also support substantive conclusions that are of direct relevance to the study of F/OSS projects. In a recent article, Qureshi and Fang (2011) highlight the need for quantitative longitudinal studies investigating the dynamics of F/OSS development. We heeded this call by presenting a dynamic model of affiliation that may be usefully applied to the study of F/OSS development. A key issue for the survival of F/OSS projects is their ability to maintain a significant base of peripheral developers (David & Rullani, 2008; Von Krogh et al., 2003). As we reported for the project we analyzed, in the F/OSS environment bug-fixing activities are recognized as the most common entry points into a project for new developers (Raymond, 1998). Our results suggest that local endogenous processes regulate the coexistence of very different levels of engagement by generating anti-hierarchical patterns in which developers becoming progressively more engaged increase their focus on less popular bugs. For this reason it is important to include in the analysis all the developers involved in a project regardless of their level of engagement in bug-fixing activities. It is the variability in the level of individual engagement in software development activities that is the key to understanding how F/OSS projects actually work (Roberts, Hann, & Slaughter, 2006). 
On a more methodological note, we recall that extreme skewness in the developers’ degree distribution—a core feature in the organizational structure of F/OSS projects—derives precisely from observed differences in engagement in problem-solving activities. This is a structural feature of F/OSS projects, not an incidental empirical problem. This structural feature can be modeled explicitly only if observations are collected on all developers involved in a project. We have illustrated our models in an empirical context that may appear highly idiosyncratic. Yet, we think that our methodological proposal transcends the limited scope of our empirical application. As we have mentioned earlier in the article, structures taking the shape of 2MN are common in

organizational research, and in social science research more generally. The potential for application of stochastic actor-based models for 2MN is correspondingly broad. For example, Carley (1991) proposes a model of group stability based on a 2MN representation of group members (rows) by facts or “knowledge” (columns). In this model the probability of interaction between group members is shaped by the relative amount of knowledge they share. Stochastic actor-based models similar to those we have discussed could be specified to examine what knowledge item is likely to attract more group members and test hypotheses about the specific characteristics of “shared facts” that are more likely to affect the interaction among group members. In a very different empirical context, Mohr and Guerra-Pearson (2010) provide an example of potential interorganizational applications. In their study of organizational niches, these authors reconstruct the links between 283 poverty relief organizations in New York City and (a) categories of service provided, (b) status distinctions made among the recipients of relief services, and (c) social problems addressed. Mohr and Guerra-Pearson (2010) analyze the resulting data structures with the goal of assigning each organization to a position in a space that they interpret as having both institutional and competitive valence. Their purely positional analysis could be readily extended by examining the patterns of association between “organizations” and “services” to reach a more detailed understanding of the process underlying the position that organizations come to occupy in their institutional-competitive environment. The measures of niche overlap that Mohr and Guerra-Pearson (2010) derive could be used, for example, to test hypotheses about the effect of competitive crowding on change in services provided and problems addressed by organizations in their sample.
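The idea attributed above to Carley (1991), that interaction between group members depends on the relative amount of knowledge they share, can be sketched from a members-by-facts matrix. The normalization used here (facts shared with one member divided by facts shared with all other members) is an assumption made for illustration and should not be read as Carley's exact specification. The matrix and the function name are likewise hypothetical.

```python
def interaction_probabilities(knowledge):
    """Return a matrix P where P[i][j] is the assumed probability that
    member i interacts with member j.

    knowledge is a list of rows of 0/1 entries (members x facts).
    P[i][j] is taken to be proportional to the number of facts that
    i and j share, normalized over all other members (an assumption).
    """
    n = len(knowledge)
    probs = []
    for i in range(n):
        shared = [
            0 if i == j else sum(a * b for a, b in zip(knowledge[i], knowledge[j]))
            for j in range(n)
        ]
        total = sum(shared)
        # A member who shares nothing with anyone gets a row of zeros.
        probs.append([s / total if total else 0.0 for s in shared])
    return probs


# Hypothetical group: three members, four facts.
knowledge = [
    [1, 1, 1, 0],  # member 0
    [1, 1, 0, 0],  # member 1
    [0, 0, 1, 1],  # member 2
]
probs = interaction_probabilities(knowledge)
```

Under this assumption member 0, who shares two facts with member 1 and one with member 2, is twice as likely to interact with member 1 as with member 2.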
Our study suffers from at least four limitations that deserve special attention as they point toward clear directions for future research. First, goodness-of-fit diagnostics suggest that the fit of the model is less than fully satisfactory. More specifically, extensive simulation analysis suggests that the model has problems in covering the lowest range of the degree distribution for developers. This may be related to the absence of actor-specific covariates for marginally engaged developers or to other unobservable characteristics of bugs and/or developers that we have failed to incorporate in our models, an issue that is not specific to our analytical framework. We note that a core structural feature of organizational networks is that their degree distribution is exponential: Few actors attract and produce many ties, many actors attract and produce only few or no ties (Saavedra, Reed-Tsochas, & Uzzi, 2009). For this reason, observations that would be identified as “outliers” in conventional regression-type models (and perhaps excluded) require direct modeling in the analysis of social and organizational networks. To the best of our knowledge, the class of stochastic actor-based models that we have discussed represents the only analytical framework that can accomplish this task in the context of longitudinal 2MN data. Future development will need to identify specifications that improve the ability of the model to cover the degree distribution under conditions of extreme skewness. Second, there is room for improving the accuracy of the empirical model specifications. Two possible directions for future development may be readily identified. The former involves including more actor-specific covariates (and possibly their interactions with endogenous network effects) to control for alternative or additional mechanisms of coordination.
For example, core developers of the project (namely, the developers who have direct access to the codebase of the project) might be identified in the data set as they might also play a specific role in bug-fixing activities. The specification of the rate function (Equation 2) could be extended by allowing the opportunity for change to vary over developers; for example, core developers may enjoy more opportunities to change their affiliation with bugs. The specification of the evaluation function (Equation 4) could also be made more detailed by letting the transitive closure parameter (corresponding to the 4-cycle) interact with individual attributes. For example, it could be that the likelihood of observing collaborative 4-cycles is increased if the developers


involved had both been active in previous development cycles and were both core developers. A positive interaction effect of this kind would suggest the presence of a “collaborative backbone” of more experienced developers sustaining the organization of the project. The latter direction involves allowing model parameters to be heterogeneous over different time periods. This analytical refinement has been recently proposed by Lospinoso, Schweinberger, Snijders, and Ripley (2010), who developed an innovative way to test for time heterogeneity in stochastic actor-oriented models. Controlling for time heterogeneity is obviously important when long time periods are covered in a study (“large T”). It is also important when the panels are obtained by aggregation over multiple years or when there are reasons to believe that structural change is present in the data. In the more specific case of project management, accounting for time heterogeneity would be important because engineering projects are well known to advance at an unsteady pace, with activity accelerating as the deadline established for delivery draws near (Ford & Sterman, 2003). Third, while our models for 2MN successfully address many of the analytical concerns that motivated our article, they still leave unresolved the fundamental theoretical problem of how, exactly, social settings generate social ties (Feld, 1981; Pattison & Robins, 2002). To make progress in this direction, models need to be developed to represent the co-evolutionary dynamics of one-mode networks (representing change in social relations among actors) and two-mode networks (representing change in the pattern of association between actors and activities). These models are currently being developed and are not yet generally available (Snijders, Lomi, & Torlò, 2011). Fourth, and finally, stochastic actor-oriented models for 2MN are currently available only for binary tie variables.
Extensions to integer-valued and real-valued ties are not yet generally available. In the specific case we have presented, this limitation is not particularly stringent as we were interested in analyzing the decision to engage a problem, namely, the existence of a relation between problems and problem solvers, rather than the intensity of the engagement. The need to binarize valued networks, however, may somewhat limit the general applicability and appeal of the model. Despite these various issues, which our analysis helps to identify but leaves open, we believe that the current study clearly demonstrates the advantages of specifying models for 2MN that make full use of the information produced by dynamic processes of mutual constitution between structure and agency across levels of organizational analysis.
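Two of the points discussed above, the need to binarize valued ties and the role of the 4-cycle as the local structure behind collaboration clusters, can be made concrete with a short sketch. The valued matrix, the threshold, and the function names are hypothetical and serve only as an illustration; this is not the RSiena implementation. A 4-cycle is counted here as a pair of developers tied to the same pair of bugs.

```python
from itertools import combinations
from math import comb


def binarize(valued, threshold=1):
    """Turn a valued developers x bugs matrix into a 0/1 matrix.

    A tie is recorded when the value reaches the (assumed) threshold.
    """
    return [[1 if v >= threshold else 0 for v in row] for row in valued]


def count_four_cycles(two_mode):
    """Count 4-cycles in a binary two-mode matrix.

    Each pair of developers sharing k bugs contributes C(k, 2) cycles,
    one for every pair of bugs they both engage.
    """
    total = 0
    for row_i, row_j in combinations(two_mode, 2):
        shared = sum(a * b for a, b in zip(row_i, row_j))
        total += comb(shared, 2)
    return total


# Hypothetical valued ties, e.g., number of contributions by developer to bug.
valued = [
    [3, 1, 0],
    [2, 5, 0],
    [0, 1, 4],
]
binary = binarize(valued)
n_cycles = count_four_cycles(binary)
```

Raising the threshold changes which ties survive binarization and can therefore change the count of local structures, which is one reason the choice of cutoff matters when valued data must be reduced to binary ties.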

Appendix A

This script allows replication of the estimations and tests carried out in the study. The estimation was carried out using RSiena rev169 (http://r-forge.r-project.org/R/?group_id=461). RSiena is a contributed package for the statistical software R (R Development Core Team, 2011), which can be downloaded from http://cran.r-project.org. We assume that readers have an intermediate knowledge of the R language. Additional information can be found in Ripley, Snijders, and Preciado Lopez (2011). The script assumes the data to be saved in a series of plain text files within the current R working directory. Specifically:

four text files (net1.txt, net2.txt, net3.txt, net4.txt) representing the associations of 135 developers (rows) and 719 bugs (columns) as dichotomous matrices;
one text file (devcomp.txt) containing information about the composition change of the developers; the file has one line for each developer;
one text file (bugcom.txt) containing information about the composition change of the bugs;
one one-column text file (devten.txt) containing the dichotomous variable developer tenure;
one one-column text file (bugsev.txt) containing the dichotomous variable bug severity.

Appendix A (continued)

Command Lines    Comments

Wald.RSiena
