Impact of license choice on Open Source Software development activity


Jorge Colazo
Department of Information Systems and Operations Management, Sawyer Business School, Suffolk University, Boston, MA 02108. E-mail: [email protected]

Yulin Fang
Department of Information Systems, College of Business, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR. E-mail: [email protected]

The Open Source Software (OSS) development model has emerged as an important competing paradigm to proprietary alternatives; however, insufficient research exists to understand the influence of some OSS project characteristics on the level of activity of the development process. One such basic characteristic is the selection of the project’s software license. Drawing upon social movement theory, our study examined the relationship between OSS licenses and project activity. Some OSS licenses include a “copyleft” clause, which requires that derivative products, if released, be distributed under the same license as the original product. We hypothesize that copylefted licenses, as opposed to noncopylefted licenses, are associated with higher developer membership and coding activity, faster development speed, and longer developer permanence in the project. To test the hypotheses, we used archival data sources of working OSS projects spanning several years of development time. We discuss practical and theoretical implications of the results as well as future research ideas.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(5):997–1011, 2009
Received November 8, 2008; revised January 2, 2009; accepted January 2, 2009
© 2009 ASIS&T • Published online 13 February 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.21039

Introduction

Open Source Software (OSS) is released under licensing terms that make its source code available and allow modified versions to be redistributed (albeit under different conditions depending on the specific license). OSS development communities have emerged as a viable alternative to the commercial knowledge-based collaboration work (Awazu & Desouza, 2004). For instance, a myriad of OSS projects have grown rapidly in recent years (Câmara & Fonseca, 2008), some in direct competition with proprietary or “closed” software alternatives (Sen, 2007). Among the best known OSS projects are the Linux operating system and the Apache Web server, which answers 70% of all the Web page requests through the Internet (Netcraft, 2007). While Linux and Apache are clearly success cases, it has also been observed that over 80% of the OSS projects fail not because their products had no appeal to users but because their development process could not sustain a healthy level of activity and failed to attract much needed help (Fang & Neufeld, in press; Hermann, 2004). This observed variability in activity levels raises the question of whether systematic or contextual factors may influence OSS development activities. This issue has become increasingly substantive since important organizations of all sizes are dependent on OSS applications (Sen, 2007), and OSS has been recognized as a viable alternative that can create good quality, cheaper software, capable of threatening the market dominance of proprietary alternatives (Paulson, Succi, & Eberlein, 2004; Raymond, 2001). To help identify these factors, this article develops and tests hypotheses in relation to the impact of one such factor, software license, on the level of activity of the OSS development process. We define development process as the participative activities necessary for the completion of an OSS software development project. Research on what fosters an active OSS development process is particularly needed because OSS projects have idiosyncratic traits, which makes it difficult to extrapolate findings from proprietary software development. That is, unlike commercial software development projects, OSS relies on a continuing involvement of volunteer developers who have free will to participate
in or leave the project at any time (Crowston, Annabi, & Howison, 2003). Addressing this call for the explicit consideration of OSS’s particularities, our study focuses on examining an array of important indicators of OSS development process activity: project membership, coding activity level, development speed, and developer permanence (fully discussed in subsequent sections). Previous research has suggested a few characteristics that may be instrumental to various aspects of OSS project development: project tenure, size, intended audience, types of software, programming language (Crowston & Scozzi, 2002), codebase architecture (Baldwin & Clark, 2006), and project governance (Shah, 2006). Our study focuses on a different trait: the project’s license choice. Licenses are examined for the following reasons. First, license choice is important to OSS developers, many of whom highly value the idea of preventing source code from being appropriated by third parties (Hertel, Niedner, & Herrmann, 2003). Second, the license used also may influence how attractive the project is to volunteers (Stewart & Maruping, 2006), implying that licenses might influence the development processes by differentially attracting volunteer developers. Finally, software license choice is one of the first project-configuration factors for project leaders to consider at the beginning of any OSS projects, and thus its strategic implication to project success is of critical concern to project leaders. If the choice of license actually impacts the development process activity, then we would be facing a practically substantive, in addition to academically interesting, inquiry. 
Thus, the main research question of this article is “How do different OSS licenses relate to the level of activity of the OSS software development process?” Drawing on social movement theory (SMT) as the underlying theoretical grounding (Simon, Loewy, & Doe, 1998), we argue that OSS development communities constitute a social movement in which developers voluntarily participate to create a collective software good in a fashion that opposes the mainstream model of proprietary software production. We suggest that different licensing decisions may offer developers different levels of selective incentives and also be aligned with the ideology of OSS to a different degree, thus affecting a project’s attractiveness to volunteer developers as well as developers’ coding activity and permanence in the project. To provide empirical evidence, we used a unique combination of archival data sources (i.e., information from publicly available source code repositories, mailing list archives, and other publicly available electronic sources) and analyzed the development process of 62 OSS projects over an average of 3 years of development time. We also used semistructured interviews with 10 experienced OSS developers to guide our hypotheses and enrich our conclusions.

The remainder of this article is organized as follows. The next section provides a brief discussion of OSS licenses and OSS development activity. We then develop hypotheses by drawing on SMT. After the hypotheses development, we describe the empirical methodology used. Later, we review the results obtained and discuss the findings. Finally, we review the limitations of this work and suggest future research ideas.

Construct Definitions

OSS Licensing

Software licenses regulate the scope of use, redistribution, warranty, and attribution of software products. Briefly, an “open source” license requires that the source code of the program be made publicly available and allows the software to be modified and redistributed (under different terms depending on the particular license) (St. Laurent, 2004). Some OSS licenses include a clause requiring any modification to a program’s original source code to be released, if redistributed, under the same terms under which it was acquired (Stallman & Lessig, 2002). This clause, known as “copyleft,” was created to foster developer cooperation and prevent the source code from being “hijacked” by third parties into proprietary licensed products for private economic gain (O’Mahony & Ferraro, 2004). Practically, copyleft allows members of the developer community to “own” the project’s public knowledge, but only as long as they are willing to share it back (Lee & Cole, 2003). The prime example of copylefted license is Richard Stallman’s General Public License (GPL; Stallman & Lessig, 2002). Under the GPL, all derivative works redistributed must be released under the original license (the GPL). For this reason, the GPL is often described as a “viral” license (Fink, 2003). Stallman later created the GNU Lesser General Public License (LGPL), which requires only the originally covered part of a derivative to be licensed under the same terms; this license also is copylefted, although it is not viral (compare St. Laurent, 2004). Nevertheless, not all OSS licenses are copylefted. For instance, the source code may be put in the public domain and later modified and redistributed under proprietary licenses without copyleft, such as the Berkeley Software Distribution License (BSD) (compare St. Laurent, 2004).
Alternatively, a proprietary-style license may be used, which may allow limited external developer participation without extensively compromising intellectual property. Extant research on the implications of the use of copyleft licenses is scant and has mainly focused on its economic and legal underpinnings. For instance, Mustonen (2003) considered software implementation costs to explain whether copylefted products can coexist with proprietary software. Harhoff, Henkel, and von Hippel (2003) used a game-theoretic model to justify the decision to include a license-bound obligation to reveal the product’s source code. Stewart and Maruping (2006) focused on product popularity and activity vis-à-vis viral licenses. Lerner and Tirole (2005) identified factors influencing decisions on license choices. Very limited recent research has investigated the development outcomes of OSS projects running under license-bound restrictions (e.g., Stewart & Maruping, 2006), but no research has yet been conducted to explore the impacts of copyleft licenses on various types of OSS development activities.

OSS Development Process Activity

We define development activity according to the developers’ life cycle. Three critical success factors relating to talent life cycle have been characterized in contemporary knowledge-based organizations: recruiting knowledge workers (recruitment), motivating them to work productively (production) (Neufeld & Fang, 2005), and retaining them (retention) (Horwitz, Heng, & Quazi, 2003). We argue that these also are true for the success of OSS projects. OSS success depends on a continuing process of volunteers contributing to knowledge-intensive activities such as developing codes, fixing bugs, adding features, and releasing software (Raymond, 2001), without a formal employment contract. Therefore, for an OSS project to remain active, it is critical to recruit voluntary developers, make them contribute to coding activities, and retain them in the development process. Accordingly, in this study, we examine the activity of the OSS development process along these three dimensions: developer recruitment, activity, and retention.

One major indicator of activity in developer recruitment is developer membership. By their predominantly self-organized nature, OSS project teams are free from organizational mandates in terms of adherence to formal constraints (Hardgrave, Davis, & Riemenschneider, 2003). Unlike commercial software development projects, where there is deliberate planning and control of their staffing, OSS software projects move forward by engaging volunteer external developers who freely choose whether and when to provide their services. There is, in fact, supporting anecdotal evidence that OSS projects that fail do not have a sizeable team of active volunteers (Krishnamurthy, 2002) and that attracting developers to an OSS project on an ongoing basis might be plausibly associated with development performance (Crowston et al., 2003).
Thus, for OSS, the number of volunteer developers attracted by the project needs to be considered as part of our development activity metric.

Coding activity has long been identified as a key development process performance indicator (e.g., Ancona & Caldwell, 1992; Clark & Fujimoto, 1991; Iansiti, 1993; Imai, Ikujiro, & Takeuchi, 1985; Jones, 1986). Accordingly, we include the project-level developer coding activity as an indicator for OSS project activity.

Development speed refers to the time elapsed between consecutive versions of the software. It is generally accepted that useful software needs to evolve continuously and actively to cope with changing user requirements and to keep up with the rapid obsolescence of related hardware and ancillary software (Lehman, 1998). Parnas (1994) eloquently summarized the issue: “The only programs that don’t get changed are those that are so bad that nobody wants to use them” (p. 279). Indeed, software development strategies are deliberately multigenerational. This means that the software industry has not one, but multiple, consecutive project completion dates for each successive file release. In OSS development, there is a strong tendency to “release early and release often,” which implies that an active, frequent release cycle
is considered a sign of a healthy development process. Thus, we include project development speed as yet another activity indicator.

High developer dropout rate has been documented in early work on software engineering as a signal of project failure (Jones, 1986), although empirical results measuring actual developer turnover are conspicuously lacking. Experienced developers are an invaluable asset because software projects require a significant amount of project-specific experience that is gained only after a substantial amount of time on the job (Jones, 1986). This is especially true for OSS projects (Fang & Neufeld, in press). Hence, a project with longer developer activity duration (developer permanence) stands a better chance of being successful than does a project in which developers continually desert and where new members have to be hired and retrained. We thus include developer permanence as yet another indicator of project development process activity.

Theoretical Background

We argue that OSS development projects feature major characteristics of a social movement. According to Tajfel (1981), social movement refers to an

. . . effort by large numbers of people, who define themselves and are also often defined by others as a group to solve collectively a problem they feel they have in common, and which is perceived to arise from their relations with other groups. (p. 244)

Germane to all social movements, such as the environmental and community politics movements (Klandermans, 1997; Omoto & Snyder, 1995), are the issues of active mobilization, of commitment to the group’s ethical norms, and of self-identification as a result of a struggle with a more dominant group with conflicting interests (Gamson, 1991). A brief review of the historical roots of OSS, supported by findings in recent OSS research (Ljungberg, 2000) allows us to argue that OSS project development communities exhibit the typical characteristics of a social movement. Early in the evolution of the craft of software development, the unrestricted sharing of source code among developers was an important ethical imperative uniting the programmer community. The existence of this definite ethical body shaped the so-called “culture of hackerdom” (Himanen, 2001; Levy, 1984). The central normative orientations of this culture are freedom of information and knowledge, and universal accessibility to technology (Holtgrewe & Werle, 2001); however, during the early 1970s, businesses began to take notice of the economic potential of software and, not surprisingly, conflicts soon arose between commercial greed and the free code sharing of the hacker culture. For example, in the early 1980s, the telecom giant AT&T increasingly focused on distributing commercial releases of the Unix operating system. AT&T required its former university collaborators to sign nondisclosure agreements and subsequently started charging license fees for the source
code of the Unix version that included substantive contributions made by those university researchers. This software “commoditization” process conflicted with the traditional norms and practices of sharing information (Wayner, 2000). Shortly after the commoditization, volunteer developers took collective action initiating a wave of projects to rebuild a free version of the Unix system that had been appropriated by AT&T (Wayner, 2000). Along the same lines, in 1984, Stallman founded the Free Software Foundation and invented copyleft as a legal device that could be used to protect the free sharing of source code (Stallman & Lessig, 2002). These episodes in the OSS history have suggested that the conflicts between commoditization and the earlier software development norms and practices were influential on the creation and growth of the OSS movement (Holtgrewe & Werle, 2001). It has in fact been argued that OSS emerged in part as a defensive collective reaction to the private appropriation of source code (Stallman & Lessig, 2002). Moreover, the inclusion of copyleft licenses in the design of OSS projects seems to be indicative of an opposition to private appropriation. Along this line of thinking, we consider it reasonable to use theories underlying social movement (Tajfel, 1981) as a legitimate theoretical framework for hypothesizing the relationships between the software license of an OSS project and its development process performance. A significant portion of the literature on social movement theory (SMT) concerns the processes by which people voluntarily engage in social movements. SMT posits that people make decisions on whether they want to be involved in a social movement by weighting costs and benefits as well as by assessing whether their own ideals correspond with the group’s norms and standards (Simon et al., 1998). 
On one hand, the SMT views participation in a collective action as a function of the expected balance between costs and benefits of participation (Klandermans, 1997). Individuals will participate only when they conceive potential benefits as exceeding involved costs; however, note that the goal of the movement, if achieved, is a collective good. All people will benefit from the collective results regardless of the efforts exerted in participation, which may encourage a tendency towards “free riding.” Therefore, additional selective incentives other than the collective good itself are needed to attract participation in social movements (Olson, 1967). On the other hand, the recent SMT literature also has emphasized the role of collective identification in promoting collective action (Simon et al., 1998). A distinct characteristic of all social movements is that of collective identity, that is, the process by which movement participants socially construct artifacts of a “we” that become part of their individual self-categorization (Melucci, 1989). The notion of collective identity is leveraged in the literature to complement the cost–benefit approach of selective incentives that has traditionally explained the basic sociopsychological process influencing participation behavior (Klandermans, 1984; Turner, Hogg, Oakes, Reicher, & Wetherell, 1987). This is especially the case for socially or politically weak groups striving for social change, whose selective incentives are usually too weak to
justify voluntary participation by themselves. Building upon self-categorization theory (Klandermans, 1984; Turner et al., 1987), SMT research has suggested that the members of a social category generally perceive themselves as similar to rather than different from each other relative to other people outside the social category (i.e., categorize themselves in terms of ingroup vs. outgroup). Salience of ingroup–outgroup categorizations can increase the perceived identity between self and the ingroup category, and enact a process of stereotyping oneself more as an interchangeable instance of the ingroup social category rather than as an idiosyncratic individual of “self.” The process of self-stereotyping consequently forms the basis for collective actions such as voluntary participation. In fact, recent research has shown that people tend to participate more frequently in social movements with which they feel better identified (Gamson, 1991; Simon et al., 1998). In summary, SMT suggests that actions in social movement may be a result of one, or both, of the two pathways: (a) calculation of the costs and benefits of participation and (b) identification with the movement.

Hypotheses Development

Drawing upon SMT, we propose that copyleft licenses contribute to the OSS development process because copyleft provides a set of selective incentives to volunteer developers other than access to the source code itself; and because it also strengthens the developers’ collective identity by signaling adherence to the idea that “software should be free.” We argue that OSS projects under a copyleft license can offer stronger selective incentives and therefore are more attractive to volunteer developers than are noncopyleft OSS projects. Von Hippel and Von Krogh (2003) suggested that developers who contribute to OSS do not lose private benefits and that individuals who participate benefit significantly more than do free-riders.
Existing OSS research has indeed identified several examples of the selective benefits that accrue to developers who voluntarily participate, such as software use value (Von Hippel, 2001), peer recognition and reputation (Baldwin & Clark, 2006; Raymond & Trader, 1999), and personal enjoyment and career advancement opportunities (Lerner & Tirole, 2002). While project license may not have an influence on a programmer’s personal enjoyment in coding, we contend that it might affect the strength of the other selective incentives. Copylefted licenses are designed to more strongly prevent the OSS developers’ contribution from being appropriated by commercial organizations than are noncopylefted licenses. Commercial appropriation is undesirable to OSS developers because closure of the software source code would reduce the visibility of their contribution in the software community. Commercial appropriation might thereby jeopardize developers’ reputation benefits and potential career opportunities associated with the OSS model. Furthermore, commercial appropriation might keep OSS developers from freely using a software product which included substantive contributions by these
developers, thereby possibly compromising the potential use value of the software to the developers (Wayner, 2000). Thus, we conclude that overall OSS projects under a copyleft license may be more attractive to OSS developers due to better potential individual benefits. The collective-identification perspective of SMT also can predict that OSS developers should be more attracted to projects using a copyleft license. It has been found that OSS developers share a distinct set of ideological tenets from commercial developers, “software freedom” being one essential part (Hertel et al., 2003; Stewart & Gosain, 2006). Studies have found that [OSS developers] “disliked proprietary software and wanted to defeat them [sic]” and that an important motivator for OSS participation was the belief that “software should be freely available” (Hertel et al., 2003). There is strong evidence that OSS developers highly regard the societal value of preventing source code from being appropriated and enclosed by third parties (Hertel et al., 2003). Since copyleft is well-known as a legal device to protect the free sharing of source code, OSS developers may identify themselves more closely with copylefted OSS projects than with noncopylefted ones. In other words, as predicted by the collective-identification perspective of the SMT, using a copyleft license may enhance the salience of the distinct ideology of an OSS project, which could strengthen the sense of identity of OSS developers with the project. Robert Young, a co-founder of Red Hat Linux, echoed this argument during an interview with one of the authors, noting that volunteer developers tend to shy away from unfamiliar licenses due to distrust in the motives of the license originators, some of which purposely obscure the license’s scope and philosophy in legalese.
He also stated that most OSS developers are identified with GPL/LGPL licenses and their concept of copyleft, making these two licenses the de-facto first choice within the OSS community. Indeed, of the 10 OSS developers we interviewed, 8 absolutely recognized the fact that a GPL/LGPL license prevented private appropriation of their work. One of the interviewed developers stated: “I work in a bunch of [OSS] projects, and all of them are GPL’d.” From these arguments and evidence, we propose that projects equipped with copylefted licenses will attract more developers.

H1: Developer membership is higher in copylefted OSS projects than in noncopylefted OSS projects.

From the selective incentive perspective of SMT, volunteer developers should prefer to productively cooperate in low-appropriability regimes vis-à-vis regimes where their intellectual property can be more easily appropriated by third parties because the former provide a higher probability of making one’s work visible to other developers. The more they produce, the more likely their work is to be visible and attract attention within the OSS community. From the collective-identification perspective, social identity or self-categorization of group processes has long been argued as pivotal in promoting group behavior such as collective actions (Ellemers, Spears, & Doosje, 1997;
Tajfel & Turner, 1986). SMT research has shown that people who more strongly identify themselves with a group are more likely to stereotype themselves as interchangeable instances of that group, and hence participate more frequently in the group’s collective activities (Gamson, 1991; Simon et al., 1998). Likewise, OSS developers may identify themselves more closely with copylefted projects, and hence display a higher level of coding activity in these projects. Our interview with a lead member in the OpenOffice.org project also hinted that copylefted projects might be more productive. Between two concurrent porting teams porting OpenOffice.org to Mac OSX (one under the GPL and the other under a more commercial license mandated by Sun Microsystems), the GPL team (NeoOffice.org) was, in this developer’s opinion, much more aggressive in advancing the product such that “[it] is actually much farther along and more effective than Sun’s port.” Thus, we can propose:

H2: Developer coding activity is higher in copylefted OSS projects than in noncopylefted OSS projects.

Since a higher level of coding activity implies that at the same level of resources (project members) there is a greater total effort spent, controlling for project size, the more active these developers are, the faster the project will progress, leading to more frequent releases. From the SMT perspective, since a copylefted license provides developers with higher selective incentives (e.g., visibility, reputation, and use value) and stronger identification with the open source ideology to participate more actively in the project, it is more likely that the project will advance more quickly, resulting in more frequent releases. Thus, given the previous discussion, we can hypothesize:

H3: Project speed is faster in copylefted OSS projects than in noncopylefted OSS projects.

Similarly, if copylefted projects offer a developer higher selective incentives in terms of visibility and reputation of his or her contributions, and the use value of the software product to the developer, we can expect that developers will be more likely to remain on these projects longer than on noncopylefted projects. Furthermore, SMT suggests that when a collective identity becomes a critical part of one’s personal identity, the participant will develop emotional attachments to the group (Tajfel, 1978): The group’s solidarity becomes indistinguishable from personal honor, and thus participants are more committed to the movement (Gamson, 1991). Recent research has indeed found that OSS developers’ identity construction contributes to sustained participation in the community (Fang & Neufeld, in press). Thus, to the extent that OSS developers in copylefted projects more strongly collectively identify than do those in noncopylefted projects, the former will exhibit higher commitment (i.e., longer permanence). This discussion then leads to our fourth hypothesis:

H4: Developer permanence is higher in copylefted OSS projects than in noncopylefted OSS projects.

Potential control variables from previous literature include programming language, size of software, and temporal status of the project (Scacchi, 1995). Object-oriented languages should be more efficient than procedural languages. Additionally, as software size increases, it becomes more difficult to understand and maintain, decreasing observed developer coding activity (Boehm, 1981). We also expected that over time, developer coding activity would decrease and speed would increase because developers may lose their initial momentum (Jones, 1986). For OSS, whether the project has corporate sponsorship also may affect development outcomes (Stewart & Gosain, 2006).

Methods

Sampling

We selected the project as our unit of analysis and drew data sources from Source Forge (SF) (http://sourceforge.net). SF is the largest Web-based hosting service for OSS projects, and already has been used as a major data source for empirical OSS studies (Koch & Schneider, 2002; Mockus, Fielding, & Herbsleb, 2002; Newby, Greenberg, & Jones, 2002). The scale of SF as an OSS repository is deceptive though. Following our own exhaustive crawling of all projects declared in SF, only approximately 20% of them contained any code at all, with the other 80% just “empty names” without any source code. Even smaller was the number of projects that added any code after the first posting in the repository. Given this situation, we needed to devise a way to select only “healthy” OSS projects that have tractable activity data. Thus, our sample is purposely biased (i.e., we are only interested in active projects). To select such projects, we focused on projects hosted in SF that met three criteria: They had to (a) be collaboratively developed, (b) be ported, and (c) have activity data publicly available. First, the collective identification aspect within SMT requires a focus on projects featuring collective actions.
An OSS project must consist of multiple developers to necessarily qualify as a collective action. Hence, we considered only those projects with more than one developer. Second, the selective incentives perspective within SMT considers that developers participate for better visibility and reputation, and we therefore focused on projects with salient user interest. To identify projects that attracted user interest, we took a two-pronged approach. On one hand, since OSS historically has been closely associated with the Linux operating system, it made sense to inspect the popular Linux distributions to identify programs that were being used by a sizeable community. We selected OSS projects that were included in at least one of the three most popular Linux distributions: Fedora Core Linux, Mandrake Linux, and SuSe (Novell) Linux. These were selected according to a popularity ranking published by distrowatch, a Web site specializing in tracking and rating Linux distributions (see http://distrowatch.com). On the other hand, a project that is relevant to a large user community is usually adapted from
its original development platform to some other computer architecture, or “ported” (Crowston et al., 2003). The Fink project (see http://fink.sourceforge.net) maintains a database of OSS project packages that have been ported to the Mac OS architecture. We cross-referenced the lists of software packages provided by the three Linux distributions and the Fink project, resulting in 1,513 unique ported projects, of which 244 collaboratively developed projects were hosted in SF. These 244 projects became our sampling frame. Third, our sample was eventually determined by the availability of activity data; 62 projects remained in the sample (25% of the sampling frame). These 62 projects had, on average, more than 2 years of coding activity. This selected sample was compared to a random sample of another set of 62 projects taken from the rest of the sampling frame in terms of application type, size, and the use of programming languages, and no obvious differences between them were identified. We also compared the distributions of copylefted versus noncopylefted licenses between the two samples using χ2 tests; no significant differences were observed. For each project, each quarter was taken as one time period during which constructs were evaluated.1 This approach eventually yielded approximately 700 instances for which the evolution of projects could be registered. The exact case count for each particular analysis departs slightly from that number due to missing data for specific variables. These cases with missing data were excluded listwise from the analyses. Data Collection and Measurement There are several potential archival-data resources available for this kind of study. Previous studies (Von Hippel & Von Krogh, 2003) suggested that the files recording changes to the source code, such as the concurrent versioning system (CVS) log files, are adequate data sources. 
Another source of archival data is the electronic archives of e-mail communications between developers (Ahuja & Carley, 1999). Project home pages containing basic data such as project name, registration date, and so forth also have been a data source for extant empirical work (Krishnamurthy, 2002). The sheer amount of information that has to be parsed and extracted for multisource, multiproject studies such as this one imposes a big overhead on collection efforts, even with intensive use of automated procedures. This is possibly why extant research has relied heavily on a single data source, or tapped into multiple sources for only one or a small number of projects (Mockus et al., 2002). Our study, in contrast, makes simultaneous use of all the aforementioned data sources. Our data-collection strategy substantively improves on existing approaches in two ways. First, we consistently used multiple data sources for each project, such as source code files, CVS log files, e-mail archives, and summary data from project home pages. Second, our sample included repeated measurements spanning a large development period. The data-collection strategy is depicted in Figure 1.

1 The Durbin–Watson statistic (Savin & White, 1977) for the regression did not allow rejecting the null hypothesis of nonautocorrelation. However, after applying the Durbin–Watson correction in the regression model (Dillon & Goldstein, 1984), we obtained statistical results of the same order and direction for the coefficients. We thus concluded that nonindependence of time-based data did not invalidate the results in our model.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2009 DOI: 10.1002/asi

FIG. 1. Data collection.

TABLE 1. Licenses.

License type    License name                                    n
Copylefted      GNU General Public License (GPL)               28
                GNU Lesser General Public License (LGPL)       16
                Subtotal copylefted                            44
Noncopylefted   Berkeley Software Distribution License (BSD)   10
                Mozilla Public License (MPL)                    2
                Proprietary-Style License                       2
                Public Domain                                   2
                Artistic License                                1
                MIT License                                     1
                Subtotal noncopylefted                         18
Total                                                          62

Copyleft. A program written in Practical Extraction and Report Language (PERL) was developed to automatically extract data from project Web pages in sourceforge.net. These data included registration date, programming language, application description, and license type. Table 1 provides a breakdown of the 62 projects in terms of their licenses. The presence of a copyleft clause in OSS licenses has been discussed in several resources (St. Laurent, 2004; Stallman & Lessig, 2002). By consulting these resources, we were able to categorize all of the licenses as either copylefted or noncopylefted.

Developer membership. We measured developer membership by the total number of developers in a project at a given time (i.e., adding up the core and noncore developers for each period). In accordance with previous studies, we define core developers as those with access to writing code in the CVS of the project, and noncore developers as those who contributed code, but who do not have direct access to the CVS (Fielding, 1999). The number of core developers was calculated from the CVS log files. We approximated the number of noncore developers by counting the number of developers who were not included in the core group but who posted messages to the developers’ mailing list. We manually screened the retrieved results to identify and deal with inconsistencies or delete obviously invalid data such as spam messages.
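The developer membership count described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the study does not spell out: CVS commits and mailing-list posts have already been reduced to (period, developer) pairs, the same person carries the same identifier in both sources, and the core group is determined per period from that period's committers. The function name developer_membership and the sample records are hypothetical.

```python
from collections import defaultdict


def developer_membership(cvs_commits, mailing_list_posts):
    """Return {period: (core, noncore, total)} developer counts.

    cvs_commits: iterable of (period, committer) pairs taken from CVS logs.
    mailing_list_posts: iterable of (period, sender) pairs from the
        developers' mailing list (assumed already screened for spam).
    """
    core = defaultdict(set)      # committers seen in each period
    posters = defaultdict(set)   # mailing-list senders seen in each period

    for period, committer in cvs_commits:
        core[period].add(committer)
    for period, sender in mailing_list_posts:
        posters[period].add(sender)

    membership = {}
    for period in sorted(set(core) | set(posters)):
        # Noncore developers: list posters who are not in the core group.
        noncore = posters[period] - core[period]
        n_core = len(core[period])
        membership[period] = (n_core, len(noncore), n_core + len(noncore))
    return membership


# Hypothetical records for two quarterly periods.
commits = [("2003Q1", "alice"), ("2003Q1", "bob"), ("2003Q2", "alice")]
posts = [("2003Q1", "alice"), ("2003Q1", "carol"),
         ("2003Q2", "bob"), ("2003Q2", "dave")]

for period, counts in developer_membership(commits, posts).items():
    print(period, counts)
```

In practice, the hardest step is the one this sketch assumes away: reconciling CVS user names with e-mail addresses, which is presumably part of what the manual screening addressed.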

Coding activity. A developer’s coding activities are reflected in the project’s “credit files” or CVS records in several different ways. The most straightforward is adding lines of code (LOC); however, sometimes LOC need to be deleted, for instance, to make code run faster or to remove faulty instructions. Another possible measure of core developer coding activity is the number of CVS modification events (i.e., commits; the number of times the source code was modified). Each commit can include any number of LOC added or deleted. Consequently, the number of commits, the number of LOC added, and the number of LOC deleted in a given period of time are three different quantities that capture different facets of developers’ coding activity. We used a CVS client program to download the source code files for each project, obtaining snapshots for dates corresponding to quarterly periods from the project’s registration date. The CVS client program provided a history of the modifications committed to files, yielding the name of the committer, date of commit, name of file modified, total LOC added, and total LOC deleted for each commit. We computed ratios of LOC added, LOC deleted, and number of commits to the number of core developers for each time period. Next, we used factor analysis to extract the common variance among the three measures of coding activity, using logarithmic (base 10) transformations to improve linearity. The factor analysis passed a robustness check in which we examined and addressed the key assumptions of number of cases, linearity of factor items, multivariate normality, and absence of outliers (Fox, 1991; Guadagnoli & Velicer, 1988; Tabachnik & Fidell, 2000) (see Footnote 2). We anticipated that the three items discussed would load on only one significant factor that would reflect the core developer activity level in the project. The results of the factor analysis are shown in Table 2. The criterion of eigenvalues higher than 1.0 (Kaiser, 1974) yielded only one significant factor, as expected. The loadings of the three indicators (Table 3) also are well above the accepted threshold value of 0.7 (Sethi & King, 1991). Additionally, the first factor explains a substantial 80% of the total variance of the scale. We thus deemed the calculated coding activity factor score as acceptable.

2 Please refer to the Appendix for details of the robustness check.

TABLE 2. Factor analysis; total variance explained.

Component   Eigenvalue   % of variance   Cumulative %
1           2.40         80.12            80.12
2           0.34         11.47            91.59
3           0.25          8.41           100.00

TABLE 3. Factor analysis; loadings.

Item                             Loading
Commits per core developer       0.914
LOC added per core developer     0.890
LOC deleted per core developer   0.882

Project development speed. Project development speed was operationalized as the average number of days between consecutive project file releases (the shorter the time, the faster the project). This was obtained by parsing the date stamps attached to file release numbers in project source files. All interrelease times were computed: between major releases, minor releases, and builds. The conclusions were consistent using all types of interrelease times.

Developer permanence. Developer coding activity time (developer permanence) was measured by the time in days between a developer’s first appearance in the CVS log files and the last modification made by that developer. To analyze these data, we needed to determine a suitable cutoff point after which we would consider that a developer had deserted the project. To reduce the subjectivity of this determination, we explored the statistical distribution of the intercommit times to arrive at an educated estimate of the cutoff point. In other words, if the time since a developer’s last commit was longer than the time corresponding to a high percentile of the intercommit time distribution, we assumed that the developer had left the project and did not intend to reappear. We selected 135 days as the cutoff point, the 99th percentile of that distribution.

Control variables. To control for programming language, we divided the projects’ languages, using a dummy variable, into two generally accepted categories: procedural languages such as C and object-oriented languages such as C++ (Lewis, Henry, Kafura, & Schulman, 1991). Software size was measured by the total LOC for each project at each of the defined points in time, and was collected using another PERL script. The temporal status of the project was controlled for by including the elapsed time since the first LOC was generated. We also included the number of noncore developers as a covariate in the developer permanence analysis since it can be considered a proxy for the chances of participation, and this has been found to be positively related to permanence (Griffeth, Hom, & Gaertner, 2000).
Finally, a thorough inspection of the origins and history of all the final sample’s projects did not reveal any company-sponsored projects; hence, we conclude that sponsorship did not bias the results we obtained.

Results

We used ordinary least squares (OLS) and Cox regression to identify significant associations between dependent and independent variables. We used logarithmic (base 10) transformations to improve linearity for the number of developers, size of the programs, interrelease time, and coding activity factor score. The regression models passed a robustness check in which we examined the threshold number of cases, relationship linearity, and normality and homoscedasticity of residuals, and addressed the concern about nonindependence of error terms due to repeated measures of panel data by applying the Durbin–Watson correction (Dillon & Goldstein, 1984) (see Footnote 3).

3 In the Appendix, we detail how we do transformations, examine key assumptions of the regression models, and address concerns about possible nonindependence of error terms due to the use of repeated measures (panel data).
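For readers unfamiliar with the statistic behind the nonindependence check mentioned above, the following Python sketch computes the Durbin–Watson statistic from a sequence of regression residuals ordered in time. It is illustrative only: the study does not describe how residuals were ordered or pooled across projects, it does not implement the correction the study applied, and the function name and sample residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for time-ordered regression residuals.

    Computed as the sum of squared successive differences divided by
    the sum of squared residuals. The value lies between 0 and 4.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Hypothetical residual series (not data from the study).
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every period
constant = [1.0, 1.0, 1.0, 1.0]        # identical residuals every period

print(durbin_watson(alternating))
print(durbin_watson(constant))
```

Values near 2 are consistent with no first-order autocorrelation; values toward 0 point to positive autocorrelation and values toward 4 to negative autocorrelation. With repeated quarterly measures per project, residuals would presumably need to be ordered within each project before the statistic is meaningful.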


TABLE 4. Correlations.

Variable                M      SD     1            2            3            4           5           6            7        8
1. Coding activity      0.0    1.0    1
2. Copyleft             -      -      0.117***     1
3. Object-oriented      -      -      −0.083*      −0.168****   1
4. Size                 4.7    0.6    0.366****    −0.154       0.013        1
5. Membership           6.2a   3.2a   0.181****    −0.102**     0.002        0.474       1
6. Noncore developers   0.5    0.4    0.328****    0.136****    0.064        0.347****   0.304****   1
7. Speed                158a   175    −0.157****   −0.115****   −0.107****   0.079*      0.025       −0.181****   1
8. Time                 8.4a   6.2    −0.153****   0.025        0.007        0.216****   0.058       0.116**      −0.003   1

Note. Speed reported is between consecutive minor releases. Ms and SDs of dummy variables are not reported.
a To ease interpretation, Ms and SDs reported are of nontransformed variables.
*p < .05. **p < .01. ***p < .001. ****p < .0001.

TABLE 5. Ordinary least squares, developer membership, coding activity, and development speed.

                  Developer membership     Coding activity          Development speed
                  Coefficient    t         Coefficient    t         Coefficient    t
Intercept         −0.702****     −4.94     −3.148****     −11.33    1.876****      14.47
Time              0.014***       3.64      −0.056****     −7.62     0.010***       2.85
Copyleft          0.207****      5.51      0.334****      4.56      −0.137****     −3.90
Object-oriented                            −0.179**       −2.37     −0.020         −0.53
Size              0.347****      11.81     0.642****      10.48     0.099****      3.55
Developers                                 0.352****      4.90      −0.215****     −6.38
F                 65****                   49****                   16****
n                 693                      674                      632
R2                0.22                     0.27                     0.11

****p < .0001. ***p < .001. **p < .01. *p < .05.
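As a rough consistency check on Table 5, the overall F statistic of an OLS model can be recovered from its R2, the number of cases n, and the number of predictors k as F = (R2 / k) / ((1 − R2) / (n − k − 1)). The Python sketch below applies this relation to the reported values. The predictor counts (three for developer membership; five for coding activity and development speed) are read off the rows of the table and are an assumption rather than figures stated in the text. Because R2 is reported to only two decimals, the implied F can be expected to approximate, not match, the reported one.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic implied by R^2 for an OLS model.

    r2: coefficient of determination, n: number of cases,
    k: number of predictors (excluding the intercept).
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# (R2, n, assumed number of predictors k, reported F), read from Table 5.
MODELS = {
    "Developer membership": (0.22, 693, 3, 65),
    "Coding activity": (0.27, 674, 5, 49),
    "Development speed": (0.11, 632, 5, 16),
}

for name, (r2, n, k, reported_f) in MODELS.items():
    implied = f_from_r2(r2, n, k)
    print(f"{name}: implied F = {implied:.1f}, reported F = {reported_f}")
```

Under these assumptions, the implied values fall within about one unit of the reported F statistics, which is consistent with the membership model containing three predictors (time, copyleft, and size).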

In the first OLS regression analysis, the dependent variable was the number of developers. In the second analysis, the dependent variable was the project’s coding activity factor score. In the third analysis, the dependent variable was the project’s speed (interrelease times). Core developer permanence was analyzed using Cox regression, a method that accommodates censored data and does not require strict normality of covariates’ distributions and can accommodate categorical and ratio-type independent variables. Missing data were deleted listwise. Zero-order correlations are shown in Table 4. Results of the OLS analyses are shown in Table 5. In the case of developer membership, the signs of coefficients were as expected, and the model was significant (p < .0001). Hypothesis 1 was supported. The number of developers was significantly (p < .0001) associated with copyleft, and the positive sign of the coefficient indicates that the copylefted projects attracted more developers. The time and project size controls also were significant predictors of the number of developers. Results for developer coding activity showed that Hypothesis 2 was supported. Developer coding activity was significantly (p < .0001) higher in copylefted projects. As expected, coding activity was negatively associated with the time control variable. Object-oriented programming languages

were associated with lower coding activity, as expected per our LOC-based metric. Size was significantly associated with coding activity, although not in the expected direction, with bigger projects showing a higher level of coding activity.

The model for project speed shows that Hypothesis 3 was supported. Copylefted projects had a shorter interrelease time than did noncopylefted projects, as indicated by the negative, significant (p < .001) coefficient. For the control variables, as time passes, interrelease time expands; the same happens as the project grows larger. The use of object-oriented languages did not significantly affect speed, although the coefficient had the expected sign.

Results of the Cox regression model for developer permanence are shown in Table 6. Hypothesis 4 was not supported. The results show that developer permanence was significantly (p < .05) associated with license type and core team size; however, and contrary to our expectations, noncopylefted projects had longer developer permanence than did copylefted projects (almost 4 months longer on average). The corresponding coefficient shows that there is a 27.2% increase in the odds of a developer leaving the project in a copylefted project compared to a noncopylefted project. The coefficient for core team size was small, but significant, suggesting that team size has little effect on developer permanence. To facilitate the interpretation of coefficients of logarithmically transformed variables, effect sizes are reported in Table 7.

TABLE 6. Cox regression, developer permanence.

Variable             B        SE      Wald      Exp(B)
Core developers      0.007    0.004   4.174**   1.007
Noncore developers   0.002    0.002   1.539     1.002
Copyleft             −0.240   0.109   4.882**   0.786

License type    M duration (days)   n     Censored
Noncopylefted   481                 226   45%
Copylefted      369                 629   40%

−2 Log-likelihood   χ2        df
6117.37             11.87**   3

**p < .05.

TABLE 7. Difference, copylefted/noncopylefted.

Developer membership    +22%
Coding activity         +30%
Speed                   −13%
Developer permanence    −23%

Discussion

Our study demonstrates that overall, the social movement account of developers’ voluntary participation in OSS projects does seem to predict the relationships between license choice and OSS development process activity. Through the SMT lens, our study provides an encompassing view of the underlying motives of OSS developers’ voluntary participation in OSS projects by accounting for both selective benefits and ideological identification motives that have been separately discussed in the OSS literature (Hertel et al., 2003; Stewart & Gosain, 2006). Based on the SMT account, copyleft licensing might be seen as a preferred legal artifact to leverage in an OSS movement in that it simultaneously satisfies most of the individualistic and the collective aspects of OSS developers’ needs for voluntary participation in a project. The composition of our sample also supports this finding. Although SF does not formally advocate the GPL/LGPL (in contrast, e.g., to the Free Software Foundation’s Savannah hosting service), more than two thirds of the projects in our sample are copylefted (44 of 62; see Table 1).

Specifically, we found that copylefted projects are associated with higher developer membership than are noncopylefted projects. This result lends support to the argument that OSS projects under a copylefted license are more attractive to volunteer OSS developers. This is a key result because attracting external developers to participate is the prime motive for opening the code of a project. We also found that copylefted licenses are associated with higher core developer coding activity, lending support to the argument that core developers are more active in

copylefted OSS projects than in noncopylefted ones. From a social movement perspective, heightened coding activity may be the result of developers being more strongly motivated by private benefits and/or more strongly identified with the ethical tenets signaled through copylefted licenses. Furthermore, the finding about copyleft’s negative association with project speed (consistent whether using major, minor, or build interrelease times) also makes copylefted licenses attractive because it suggests that copylefted licenses help a project evolve more rapidly, with potential for better adapting to fluctuating requirement sets and competitive pressures. Contrary to our expectations, the results show that copylefted projects are associated with lower developer permanence. It implies that although OSS developers may expect to benefit more from and collectively identify more strongly with a copylefted project, they will not necessarily stay with the project longer. A possible explanation for this is that copylefted projects attract the most productive, skilled developers, who benefit from the visibility that copylefted OSS projects give to advance careers and hence have a shorter life within the OSS projects. Rationally, if a private company is headhunting for skilled developers from the OSS community, the most productive developers will be the first to be tempted (perhaps such as the ones in copylefted projects). Another alternative explanation implied in Shah’s (2006) study is that those who participate in noncopylefted projects may be intrigued with the unique use value of the software product, but not have had an alternative copylefted choice, resulting in a stronger continuance commitment (e.g., tenure in an organization due to limited other choices). Future research should explore whether these alternatives can be confirmed empirically. Our control variables also yielded some interesting findings. 
The positive coefficient for the size control variable in the coding activity regression also was unexpected, but it might suggest that larger projects attract more able developers—or less skilled developers, who may be less productive, try their hand at smaller, simpler projects. Alternatively, larger projects have evolved structures that enable easier contribution while smaller ones do not. For instance, it is relatively easy to add certain kinds of code to Linux


(e.g., device drivers) since there are well-defined interfaces and architecture in place. In addition, results relating developer coding activity to total developer membership (where noncore developers predominate) may indicate the importance of noncore developers as generators of input for the development process, which will have to be confirmed in future research. Furthermore, this study’s results on the positive relation between noncore developers and coding activity are a direct empirical confirmation of what Lee and Cole (2003) suggested in their study of the Linux kernel. They argued that only additions to the core developer group would result in decreasing contributions; additions to the periphery group (our noncore or peripheral developers) might produce positive rather than negative returns. This would suggest that a firm wishing to set up an OSS project should prioritize securing a sizeable number of peripheral developers from the developer community and also set mechanisms in place to sustain the number of developers over the life of the project.

Theoretical Implications

The findings of the present study make important contributions to our theoretical understanding of the relationships between OSS project characteristics and development process activity in two respects. First, our study contributes to the OSS literature by enhancing our understanding of the implications of OSS license choice for development process success in terms of developer voluntary participation. Although prior work has implied the influence of software license choices on OSS development success, related empirical research is insufficient. A notable exception (Stewart & Maruping, 2006) narrowed the conceptualization of project activity to a single indicator (i.e., number of project releases), self-admittedly representing “a narrow view of the construct [which] may be capturing coding activity that is not directly related to building the software” (p. 142).
Our study extends prior research by establishing the significance of a copylefted license to a wider range of OSS development activity indicators, thus not only making the linkages to OSS development activities more complete but also increasing the generalizability of the study. Second, our study offers a new theoretical lens—SMT—to understand the underlying mechanisms through which voluntary developers may be attracted to, participate in, and remain on OSS projects. In addition to encompassing the private-benefit account of OSS developer participation in the existing literature, the SMT provides an identification-categorization account of social participation, thus extending our theoretical understanding of why developers may more actively perform in copylefted projects. Methodologically, we have extended the previous related archival-based empirical studies by consolidating several different sources of archival data, and we expect that this will be relevant to and useful for future studies on OSS software development. Additionally, we have devised and tested new simple, but more generalizable and robust, measures related to key indicators of software development process activity. Practically, the findings of our study inform OSS

project leaders that license choice should not be taken as an easy, convenient decision. Rather, it has strategic implications for the success of the OSS development process. Thus, our study provides important practical contributions to practitioners in terms of their managerial decision on OSS licensing.

Practical Implications

Our study has implications for practitioners wishing to initiate an OSS project. For them, the choice of licensing terms is a complex decision involving issues such as legal liability, intellectual property dissemination, competitive responses, and project visibility (Lerner & Tirole, 2005). The effect of license type on development process success is one more dimension that can inform that decision. Our study suggests that to increase the likelihood of development process success, OSS leaders may consider using copylefted licenses. Moreover, the effect of copyleft has particular implications for for-profit organizations. Although at first glance OSS projects seem antiprofit, the existing literature has suggested that companies can incorporate open source ideas into their business strategy (Holtgrewe & Werle, 2001). However, a key issue in such endeavors is arriving at the “optimum licensing policy” (Lerner & Tirole, 2002). Our findings suggest that for the sake of development success, proprietary software or hardware makers do not need to rule out copylefted licenses from their strategies. Perhaps there lies an opportunity to design a new license with the spirit of copyleft that is more palatable to private business than the GPL/LGPL, or to develop a stronger case for a dual model of licensing wherein commercial customers are offered one type of proprietary license and the developer community another, less restrictive license (Kalina & Czyzycki, 2005). Other indirect avenues through which private software or hardware makers can benefit include creating a copylefted OSS product that is synergistic with a proprietary application (Varner, 2000) or drawing demand for hardware by bundling it with a copylefted application (Mustonen, 2003).

Limitations and Future Research

As a whole, the results paint a clear picture of the effects of license choices on multiple key indicators of OSS development activity; however, the findings of the present study need to be taken in conjunction with an awareness of several limitations. First, our sample is nonprobabilistic and did not allow us to assess self-selection bias. Second, the analysis was limited to data obtained through SF, leaving its generalizability to other potential data sources (e.g., Freshmeat) unexamined. Further work may try to cross-confirm this study’s findings by exploring other sampling-frame definitions. Third, this study assumes that the ethical/ideological profile of OSS developers is relatively homogeneous, which, although supported by extant research (e.g., Lakhani & Wolf, 2003), constitutes in itself an opportunity for future research. Our sample has projects both small and large, and a wide


range of applications; hence, we do not think we introduced substantive bias in terms of product size or developer background. Future research can be taken in a couple of other directions. For instance, OSS project activity is an important prerequisite to the final project success; thus, further research should investigate the implications of different licenses on OSS product performance. In addition, our results on developer permanence were theoretically unexpected and merit further inquiry. Such future work should investigate what factors influence developers’ attrition in an OSS project. OSS is being adopted at an increasing pace in all industries and sectors, from business to government. In that light, arriving at a complete understanding of how to effectively design an OSS project to maximize the odds of its success is an important academic and practical endeavor. Our research is one step towards the goal.

Acknowledgments Earlier versions of the article were presented at the 2005 Americas Conference on Information Systems (with a best paper award), at Richard Ivey School of Business, and at the research seminar series of City University of Hong Kong. The authors thank Derrick Neufeld, Arjun Bhardwaj, Choon Ling Sia, and conference/seminar participants for their helpful comments. This research was partially supported by CERG and SRG research grants at City University of Hong Kong. References Ahuja, M.K., & Carley, K.M. (1999). Network structure in virtual organizations. Organization Science, 10(6), 741–757. Ancona, D.G., & Caldwell, D.F. (1992). Demography and design: Predictors of new product team performance. Organization Science, 3(3), 321–341. Awazu,Y., & Desouza, K.C. (2004). Open knowledge management: Lessons from the open source revolution. Journal of the American Society for Information Science and Technology, 55(11), 1016–1019. Baldwin, C.Y., & Clark, K.B. (2006). The architecture of participation: Does code architecture mitigate free riding in the open source development model? Management Science, 52(7), 1116–1127. Boehm, B.W. (1981). Software engineering economics. Englewood Cliffs, NJ: Prentice Hall. Brooks, F.P. (1975). The mythical man-month: Essays on software engineering. Reading, MA: Addison-Wesley. Câmara, G., & Fonseca, F. (2008). Information policies and open source software in developing countries. Journal of the American Society for Information Science and Technology, 58(1), 121–131. Clark, K.B., & Fujimoto, T. (1991). Product development performance. Boston: Harvard Business School Press. Comrey, A.L., & Lee, H.B. (1992). A first course in factor analysis. Hillsdale, NJ: Erlbaum. Crowston, K., Annabi, H., & Howison, J. (2003). Defining open source software success. Proceedings of the 24th International Conference on Information Systems (pp. 14–17), Seattle, WA. Crowston, K.B., & Scozzi, B. (2002). 
Open source software projects as virtual organizations. IEE Proceedings–Software, 149(1), 3–17. Dillon, W.R., & Goldstein, M. (1984). Multivariate analysis: Methods and applications. New York: Wiley. Ellemers, N., Spears, R., & Doosje, B. (1997). Sticking together or falling apart: In-group identification as a psychological determinant of group


commitment versus individual mobility. Journal of Personality and Social Psychology, 72(3), 617–626. Fang, Y., & Neufeld, D.J. (in press). Understanding sustained participation in open source software projects. Journal of Management Information Systems. Fielding, R.T. (1999). Shared leadership in the Apache Project. Communications of the ACM, 42(4), 42–43. Fink, M. (2003). The business and economics of Linux and open source. Upper Saddle River, NJ: Prentice Hall PTR. Fox, J. (1991). Regression diagnostics. Newbury Park, CA: Sage. Gamson, W.A. (1991). Commitment and agency in social movements. Sociological Forum, 6, 27–50. Green, S.B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26(33), 449–510. Griffeth, R.W., Hom, P.W., & Gaertner, S. (2000). A meta-analysis of antecedents and correlates of employee turnover: Update, moderator tests and research implications for the next millennium. Journal of Management, 26(3), 463–488. Guadagnoli, E., & Velicer, W.F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103(2), 265–275. Hardgrave, B.C., Davis, F.D., & Riemenschneider, C.K. (2003). Investigating determinants of software developers’ intentions to follow methodologies. Journal of Management Information Systems, 20(1), 123–151. Harhoff, D., Henkel, J., & von Hippel, E. (2003). Profiting from voluntary information spillovers: How users benefit from freely revealing their innovations. Research Policy, 32(10), 1753–1769. Hermann, U. (2004). Unmaintained free software. Retrieved January 29, 2009, from http://www.unmaintained-free-software.org/links.php Hertel, G., Niedner, S., & Herrmann, S. (2003). Motivation of software developers in open source projects: An Internet-based survey of contributors to the Linux kernel. Research Policy, 32(7), 1159–1177. Himanen, P. (2001). The hacker ethic and the spirit of the Information Age. London: Secker & Warburg. 
Holtgrewe, U., & Werle, R. (2001). De-commodifying software? Open source software between business strategy and social movement. Science Studies, 14(2), 43–65. Horwitz, F.M., Heng, C.T., & Quazi, H.A. (2003). Finders keepers? Attracting, motivating and retaining knowledge workers. Human Resource Management Journal, 13(4), 23–44. Huber, P.J. (1967). The behavior of maximum-likelihood estimates under non-standard conditions. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 221–233), Berkeley, CA. Iansiti, M. (1993). Real world R&D: Jumping the product generation gap. Harvard Business Review, 71(3), 131–147. Imai, K., Ikujiro, N., & Takeuchi, H. (1985). Managing the new product development process: How Japanese companies learn and unlearn. In R.H. Hayes, K. Clark, & C. Lorenz (Eds.), The uneasy alliance: Managing the productivity-technology dilemma (pp. 337–375). Boston: Harvard Business School Press. Jones, C. (1986). Programming productivity. New York: McGraw-Hill. Kaiser, H.F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31–36. Kalina, I., & Czyzycki, A. (2005). The ins and outs of open source. Consulting to Management, 16(3), 41–46. Kennedy, P. (2000). A guide to econometrics. Boston: MIT Press. Klandermans, B. (1984). Mobilization and participation: Social psychological expansions of resource mobilization theory. American Sociological Review, 49(5), 583–600. Klandermans, B. (1997). The social psychology of protest. Oxford, United Kingdom: Basil Blackwell. Koch, S., & Schneider, G. (2002). Effort, cooperation and coordination in an open source software project: GNOME. Information Systems Journal, 12(1), 27–42. Krishnamurthy, S. (2002). Cave or community? An empirical examination of 100 mature open source projects. First Monday, 7. Lakhani, K., & Wolf, R.G. (2003). Why hackers do what they do: Understanding motivation and effort in free/open source software projects.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2009 DOI: 10.1002/asi

In J. Feller, B. Fitzgerald, S. Hissam, & K.R. Lakhani (Eds.), Perspectives on free and open source software (pp. 3–22). Cambridge, MA: MIT Press. Lee, G.K., & Cole, R.E. (2003). From a firm-based to a community-based model of knowledge creation: The case of the Linux kernel development. Organization Science, 14(6), 633–649. Lehman, M.M. (1998). Software’s future: Managing evolution. IEEE Software, 15(1), 40–44. Lerner, J., & Tirole, J. (2002). Some simple economics of open source. Journal of Industrial Economics, 50(2), 197–234. Lerner, J., & Tirole, J. (2005). The scope of open source licensing. Journal of Law, Economics, and Organization, 21(1), 20–56. Levy, S. (1984). Hackers. Heroes of the computer revolution. Garden City, NY: Anchor Press/Doubleday. Lewis, J.A., Henry, S.M., Kafura, D.G., & Schulman, R.S. (1991). An empirical study of the object-oriented paradigm and software reuse. Proceedings of the 6th Conference on Object-Oriented Programming Systems, Languages and Applications (pp. 184–196), Phoenix, AZ. Ljungberg, J. (2000). Open source movements as a model for organizing. European Journal of Information Systems, 9(4), 208–216. Mahalanobis, P. (1936). On the generalized distance in statistics. Conference of the National Institute of Science of India (pp. 49–55), Calcutta. Melucci, A. (1989). Nomads of the present: Social movements and individual needs in contemporary society. Philadelphia: Temple University Press. Mockus, A., Fielding, R.T., & Herbsleb, J. (2002). Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3), 309–346. Mustonen, M. (2003). Copyleft—The economics of Linux and other open source software. Information Economics and Policy, 15, 99–121. Netcraft. (2007). February 2007 Web Server Survey. Retrieved January 29, 2009, from http://news.netcraft.com/archives/web_server_survey. html Neufeld, D., & Fang, Y. (2005). 
Individual, social and situational determinants of telecommuter productivity. Information & Management, 42(7), 1037–1049. Newby, G.B., Greenberg, J., & Jones, P. (2002). Open source software development and Lotka’s Law: Bibliometric patterns in programming. Journal of the American Society for Information Science and Technology, 54(2), 169–178. O’Mahony, S., & Ferraro, F. (2004). Hacking alone? The effects of online and offline participation on open source community leadership. Cambridge, MA: Harvard Business School Press. Olson, M. (1967). The logic of collective action. Cambridge, MA: Harvard University Press. Omoto, A.M., & Snyder, M. (1995). Sustained helping without obligation: Motivation, longevity of services and perceived attitude change among AIDS volunteers. Journal of Personality and Social Psychology, 68, 671–686. Parnas, D.L. (1994). Software aging. Proceedings of the 16th International Conference on Software Engineering (pp. 279–287), Los Alamitos, CA. Paulson, J.W., Succi, G., & Eberlein, A. (2004). An empirical study of open-source and closed-source software products. IEEE Transactions on Software Engineering, 30(4), 246–256. Raymond, E.S. (2001). The cathedral & the bazaar: Musings on Linux and open source by an accidental revolutionary (2nd ed.). Sebastapol, CA: O’Reilly. Raymond, E.S., & Trader, W.C. (1999). Linux and open-source success. IEEE Software, 16(1), 85–89. Savin, N.E., & White, K.J. (1977). The Durbin–Watson Test for serial correlation with extreme sample sizes or many regressors. Econometrica, 45(8), 1989–1996. Scacchi, W. (1995). Understanding software productivity. In D. Hurley (Ed.), Software engineering and knowledge engineering: Trends for the next decade. Singapore: World Scientific Press. Sen, R. (2007). A strategic analysis of competition between open source and proprietary software. Journal of Management Information Systems, 24(1), 233–257.

Sethi, V., & King, W.R. (1991). Construct measurement in information systems research: An illustration in strategic systems. Decision Sciences, 22(4), 455–472. Shah, S. (2006). Motivation, governance and the viability of hybrid forms in open source software development. Management Science, 52(7), 1000– 1014. Simon, B., Loewy, M., & Doe, J. (1998). Collective identification and social movement participation. Journal of Personality and Social Psychology, 74, 646–658. St. Laurent, A.M. (2004). Understanding open source and free software licensing. Sebastopol, CA: O’Reilly. Stallman, R.M., & Lessig, L. (2002). Free software, free society: Selected essays of Richard M. Stallman. Boston: GNU Press. Stewart, K.J., & Gosain, S. (2006). The impact of ideology on effectiveness in open source software development teams. MIS Quarterly, 30(2), 291–314. Stewart, K.J., & Maruping, L.M. (2006). Impacts of license choice and organizational sponsorship on user interest and development activity in open source software projects. Information Systems Research, 17(2), 126–144. Tabachnik, B.G., & Fidell, L.S. (2000). Using multivariate statistics. Needham Heights, MA: Allyn & Bacon. Tajfel, H. (1978). Social categorization, social identity, and social comparison. In H. Tajfel (Ed.), Differentiation between social groups. London: Academic Press. Tajfel, H. (1981). Human groups and social categories: Studies in social psychology. Cambridge, United Kingdom: Oxford University Press. Tajfel, H., & Turner, J.C. (1986). The social identity theory of intergroup behavior. In S. Worchel & W.G. Austin (Eds.), Psychology of Intergroup Relations (pp. 7–24). Chicago: Nelson-Hall. Turner, J.C., Hogg, M.A., Oakes, P.J., Reicher, S.D., & Wetherell, M.S. (Eds.). (1987). Rediscovering the social group: A self-categorization theory. Oxford, United Kingdom: Basil Blackwell. Varner, P.E. (2000). The economics of open source. 
Retrieved January 29, 2009, from http://www.digitalparlor.org/osddp/files/issues/The%20 Economics%20of%20Open%20Source%20Software.doc Von Hippel, E. (2001). Innovation by user communities: Learning from open source software. MIT Sloan Management Review, 42, 82–86. Von Hippel, E., & Von Krogh, G. (2003). Open source software and the private-collective innovation model: Issues for organization science. Organization Science, 14, 209–223. Wayner, P. (2000). How Linux and the free software movement undercut the high-tech titans. New York: Harper Business. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–26.

Appendix: Robustness of Results

OLS and its related analytical techniques, such as multiple regression and factor analysis, rest on the fulfillment of basic assumptions. Our evaluation of the assumptions for this article’s findings allowed us to rule out the results as artifacts of the methods used.

Robustness of Factor Analysis for Productivity Factor Score

The first evaluation is the fulfillment of the number of data points necessary to obtain reliable correlations, which are at the heart of the factor analytical approach. Guidelines from Comrey and Lee (1992) specified a number of 500 cases per factor as “very good.” Alternatively, Guadagnoli and Velicer (1988) posited that when the observed loadings are in excess of 0.80 (such as in this case), only 150 cases per factor are sufficient. In our case, the number of valid data points for the factor analysis exceeds the mentioned guidelines.


The second concern is the linearity of the factor items. A correlational technique such as factor analysis ignores any nonlinear relation between factor items; hence, if nonlinearity is strong, the amount of variance captured by the factor is seriously decreased, and subsequent regression analyses may lose a good part of their power. All bivariate relations between pairs of items in the factor are therefore assumed to be linear. The assessment of item linearity is done through the inspection of bivariate scatter plots (Fox, 1991). A matrix scatter plot between all possible pairs of items (not shown) depicted a predominant lack of linearity between pairs of variables. To alleviate that, logarithmic (base 10) transformations were applied to all items, and the scatter plots were repeated (not shown). The transformed items showed greatly improved linearity and hence were deemed better suited for the factor analysis.

A concern about using transformed data is interpretability. In this case, we think a logarithmic transformation of productivity items is theoretically congruent with empirical observations (Brooks, 1975) that accounted for variations of decimal orders of magnitude in developers’ productivity measured on an LOC count basis. This fact lets us think that arithmetic increases of underlying factors of developer effort may produce variations of orders of magnitude in productivity and that logarithmically transformed variables can be safely interpreted.

The third assumption is related to multivariate normality of the factor items. Normality is necessary only if the number of factors is to be statistically inferred (Tabachnik & Fidell, 2000). In our case, we posited a priori only one factor of interest. Nevertheless, to assess the extent to which our items may depart from normality, we ran a nonparametric Kolmogorov–Smirnov test that checks the distribution of the item data against the null hypothesis of normal distribution (not shown). Acceptable (nonsignificant; p > .05) results were obtained in all cases except in the case of LOC deleted per developer, where the statistic was marginally significant (p = .03). Given that the results lean towards the judgment of multivariate normality and that these tests are very sensitive in cases of relatively large samples, as confirmation we also plotted the items’ histograms (not shown). A symmetric, bell-shaped histogram in all cases supported the adequacy of the item data distribution.

The next assumption is the absence of outliers. Outliers might have a disproportionate influence on the correlations and therefore on the factor solution. Transformed data were screened for multivariate outliers by regressing one of the items on the remaining two and calculating the Mahalanobis distance for each case (Fox, 1991; Mahalanobis, 1936). Mahalanobis distance, when higher than the critical chi-square statistic for a number of degrees of freedom equal to the number of variables, signals possible multivariate outliers. The critical value from a chi-squared table with p < .01 and 3 df is 13.816. Only three cases had a distance greater than the critical value, with the greatest value being 15.521. Given the relatively few outlying cases, the small deviation from the critical value, and the absence of theoretical clues to guide the decision of removal of offending cases, the influence of multivariate outliers in the factor analytical solution was discounted, and all cases were kept in the sample.

Robustness of the Regression Models

We will refer here to the regression model for productivity. The same tests were run (and similar conclusions obtained) for the other regression models as well. Results are summarized in Table A1.

The number of cases is the first check we need to consider for a regression model. Green (1991) recommended using a minimum number of cases given by N, where N = 104 + n, and n is the number of predictors to test for individual coefficients. In our case, with six predictors (one intercept and five independent variables), we largely exceeded his recommendation.

The second assumption is linearity of the dependent variable with the other continuous regressors. We plotted the productivity score against size and total number of developers, obtaining a nonlinear scatter plot (not shown). When we repeated the plot with log- (base 10) transformed size and number of developers, a better shape resembling a linear association was obtained, but the shape of the scatter plots still was not definitively clear. Instead of trying another transformation, which might be theoretically questionable, we decided to test whether the regression model retained its features when we split the continuous independent variables around the median into two categories (low and high). The sign and general size of coefficients are conserved with respect to the original model, signaling that slight departures from linearity do not impact the significance of the coefficients and their direction.

The third and fourth assumptions we need to confirm are the normality and homoscedasticity of residuals. This is assessed by inspecting the residuals plot (not shown). The general shape and distribution of residuals did not raise concerns.
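The sample-size rule and the median split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the project sizes are invented numbers rather than data from the study, and values equal to the median are assigned to the "low" category, a choice the text does not specify.

```python
from statistics import median


def green_minimum_cases(num_predictors):
    """Green's (1991) rule of thumb for testing individual coefficients: N = 104 + n."""
    return 104 + num_predictors


def median_split(values):
    """Recode a continuous variable as 0 (low) or 1 (high) around its median.

    Assumption: a value equal to the median is coded as low (0).
    """
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


# Hypothetical project sizes in LOC (illustrative only, not study data).
sizes = [1200, 540, 98000, 15000, 7300, 410]

print(green_minimum_cases(6))  # 110 cases for six predictors
print(median_split(sizes))     # [0, 0, 1, 1, 1, 0]
```

Refitting the regression with the recoded variables and comparing coefficient signs with the original model is the step the text describes; it is omitted here because it depends on the study data.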
The fifth assumption, and an important one considering that we are using an OLS model with repeated values along time for the same project, is related to the independence of error terms. The existence of a strong autocorrelation of residual terms implies that the coefficient estimates will not be unbiased (Kennedy, 2000). One way to identify problems of autocorrelation is to observe the scatter plot of residuals against time, but we could not observe any apparent trend. Another way to assess autocorrelation is the inspection of the Durbin–Watson statistic in the original regression; its value was 1.790. The tables developed by Savin and White (1977) allow determining if the presence of autocorrelation is significant. In our case, from tables for p < .01 and models with an intercept term, five regressors, and a number of cases equal to 200 (the maximum tabulated and a conservative estimation for bigger sample sizes), the upper bound for d for a nonsignificant effect of autocorrelation is 1.725. Since the observed d value is higher, we could not reject the null hypothesis of nonautocorrelation.

Another test for autocorrelation consists of adding to the original regression model a lagged value of the dependent variable as regressor. If the lagged value is significant, then the existence of autocorrelation can be presumed. We performed that test (not shown), and the significant value (p < .01) of the coefficient for the lagged dependent variable indicates, on this occasion, that a significant first-order positive autocorrelation of residuals may exist. We adopted the remedy indicated by Dillon and Goldstein (1984), who recommended replacing the original model with one that uses the Durbin–Watson correction. This correction assumes a positive first-order error autocorrelation and consists of running the regression on the differences between the original and one-period lagged independent and dependent variables. In the new regression model, the significance of coefficients can be assessed without the deleterious influence of autocorrelation. We performed the correction, shown as Model AR(1) in Table A1, and again obtained the same order and direction of coefficients. Additionally, inspection of this model’s residuals did not show any noticeable trend that may suggest higher order autocorrelation. We thus obtained support for the validity of the conclusions from the original model.

We also wanted to assess whether we obtained similar results independently of using a factor-analytic derived productivity factor. We performed a regression using the highest loading item as dependent variable. The results were consistent with the original model (Model “LOC added only” in Table A1), thus ruling out the findings as artifacts of the factor analytical productivity score used.

Finally, we used in our models the Huber–White robust standard error estimators (Huber, 1967; White, 1982) to account for possible bias arising from a potential clustering. The results also are shown in Table A1 and are similar to those of the original OLS model.

Table A1.
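As a rough illustration of the autocorrelation checks discussed above, the sketch below computes the Durbin–Watson statistic from a list of residuals and builds the one-period differences used by the correction. It assumes a single series already ordered by time; the function names and the residual values are hypothetical and are not taken from the study.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares.

    Values near 2 suggest no first-order autocorrelation; values well below 2
    suggest positive autocorrelation.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def first_difference(series):
    """Difference between each value and its one-period lag (first period is dropped)."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


# Hypothetical residuals that alternate in sign (negative autocorrelation).
residuals = [0.5, -0.5, 0.5, -0.5]
print(durbin_watson(residuals))        # 3.0

# Differencing a hypothetical time-ordered variable before re-running the regression.
print(first_difference([1, 3, 6, 10]))  # [2, 3, 4]
```

In a panel with several projects, the differences would presumably be taken within each project rather than across project boundaries; that grouping step is left out of this sketch.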

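Robustness, productivity.

| Variable | Original model (OLS) Coeff. | Original model (OLS) t | No transformations Coeff. | No transformations t | LOC added only Coeff. | LOC added only t | AR(1) Coeff. | AR(1) t | Robust regression Coeff. | Robust regression t |
|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | −3.148**** | −11.33 | −0.146 | −1.59 | −0.108 | −0.40 | −3.216**** | −11.07 | −3.192**** | −11.30 |
| Time | −0.056**** | −7.62 | −0.056**** | −7.64 | −0.043**** | −5.96 | −0.042**** | −5.48 | −0.055**** | −7.46 |
| Copyleft | 0.334**** | 4.56 | 0.233** | 3.13 | 0.233** | 3.25 | 0.324**** | 4.40 | 0.367**** | 4.93 |
| Object-oriented | −0.179** | −2.37 | −0.245** | −3.12 | −0.146* | −1.96 | −0.217** | −2.79 | −0.218** | −2.83 |
| Size | 0.642**** | 10.48 | 0.652**** | 9.22 | 0.603**** | 10.13 | 0.709**** | 10.07 | 0.657**** | 10.55 |
| Developers | 0.352**** | 4.90 | 0.311**** | 4.47 | 0.240*** | 3.41 | 0.362**** | 4.51 | 0.329**** | 4.51 |

| Statistic | Original model (OLS) | No transformations | LOC added only | AR(1) | Robust regression |
|---|---|---|---|---|---|
| F | 49.01**** | 31.11**** | 36.93**** | 43.67**** | 48.73**** |
| n | 674 | 757 | 677 | 573 | 674 |
| R2 | 0.268 | 0.172 | 0.216 | 0.278 | |
| Durbin-Watson d | 1.790 | | | 1.520 | |

Robust regression uses Huber-White error estimators. **** p < 0.0001. *** p < 0.001. ** p < 0.01. * p < 0.05.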
