Moneyball for nanoHUB: Theory-Driven and Data-Driven Approaches to Understand the Formation and Success of Software Development Teams

Noshir Contractor
Jane S. & William J. White Professor of Behavioral Sciences
Northwestern University, Evanston IL 60201
“Your goal shouldn’t be to buy players. Your goal should be to buy wins. In order to buy wins, you need to buy runs.” (Bakshi, M., & Miller, B. (2011). Moneyball, Motion picture. USA: Columbia Pictures.)
Keynote Abstract

The same principle that transformed baseball may hold the key to building more innovative scientific teams. In 2002, Billy Beane changed baseball when he fielded a $41 million team for the Oakland Athletics that successfully competed with the $125 million New York Yankees. We increasingly turn to teams to solve wicked scientific problems, from sequencing the human genome to curing cancer. Building scientific dream teams that produce breakthrough innovations at minimal cost is not unlike choosing the players who will go on to win the World Series. Like pre-Beane baseball, much of the selection of scientific dream teams currently rests on an assessment of the caliber of the individual scientists, with far less attention paid to the relationships that gel the team together and to the factors that determine how those pivotal relationships come about.

Given the increasing importance of teams in producing high-impact innovations, understanding how to assemble innovation-ready teams matters in every domain in which teams are critical. While there is considerable research on how to make teams more effective once they are formed, there is growing evidence that the assembly of the team itself influences the range of possible outcomes. Most prior work on teams is based on the premise that the team has already been “formed”; it does not investigate the mechanisms that influence the assembly of teams or their impact on team processes and outcomes. This paper seeks to understand and enable the assembly of innovative scientific teams. We use theory-driven approaches (social science theories) as well as data-driven approaches (data/text mining and machine learning algorithms) to discern factors that explain and predict the assembly of innovative scientific teams.

F. Daniel, J. Wang, and B. Weber (Eds.): BPM 2013, LNCS 8094, pp. 1–3, 2013. © Springer-Verlag Berlin Heidelberg 2013
We define team assembly as the set of principles that jointly determine how a team is formed. Team assembly is a multilevel construct: its theoretical mechanisms can be categorized at four levels of emergence, namely compositional, relational, task-based, and ecosystem. All four levels are well captured using network approaches, with the aim of understanding the impact of these four sets of factors on the likelihood that a team-assembly edge (or, in network parlance, a hyperedge) will form. The first level, compositional emergence, considers each team as an aggregation of people and uses the composition of individual attributes and team attributes to explain individuals’ motivations to join teams. The second level, relational emergence, also considers prior relations (such as prior collaboration or friendship) among team members to explain why members assemble into a team. The third level, task-based emergence, adds attributes of the task (such as the development of open source versus proprietary software) to attributes of individuals and the relations among them to explain why people join teams. Individuals joining a certain project are represented as a bipartite graph with linkages between individuals and their projects. The fourth level, ecosystem emergence, captures how the larger intellectual ecosystem might explain the emergence of successful scientific teams. For instance, the ecosystem surrounding a software development team would include prior or current collaborators of those who are on the team, collaborators of their collaborators, and so on. The ecosystem approach is a novel theoretical advance in research on teams: it focuses on the explanatory power of, rather than discounting as a “bug,” the fact that individuals belong simultaneously to multiple teams that have overlapping members.
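The network representation described above can be illustrated with a minimal Python sketch. It is not the implementation used in this research: the project names, member names, and the helper functions memberships, collaboration_graph, and ecosystem are all hypothetical, and the two-hop cutoff is an assumption made only to keep the example small. Each team is stored as one hyperedge (a set of members), the person-to-project mapping gives the bipartite graph, and the ecosystem of a team is approximated as the collaborators, and collaborators of collaborators, who are not on the team itself.

```python
from itertools import combinations

# Hypothetical toy data, not nanoHUB records. Each project maps to the set of
# its members, so each entry is one hyperedge joining all members of a team.
teams = {
    "tool_A": {"ana", "bo", "chen"},
    "tool_B": {"chen", "dee"},
    "tool_C": {"dee", "eli"},
}


def memberships(teams):
    """Invert the bipartite person-project graph: person -> set of projects."""
    by_person = {}
    for project, members in teams.items():
        for person in members:
            by_person.setdefault(person, set()).add(project)
    return by_person


def collaboration_graph(teams):
    """Project the bipartite graph onto people: person -> set of co-members."""
    graph = {}
    for members in teams.values():
        for person in members:
            graph.setdefault(person, set())
        for a, b in combinations(members, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def ecosystem(team, teams, hops=2):
    """People outside `team` reachable within `hops` collaboration steps."""
    graph = collaboration_graph(teams)
    members = teams[team]
    seen = set(members)
    frontier = set(members)
    for _ in range(hops):
        # Move one collaboration step outward, skipping people already visited.
        frontier = {n for person in frontier for n in graph[person]} - seen
        seen |= frontier
    return seen - members
```

Calling ecosystem("tool_A", teams) on this toy data walks from the members of tool_A through their shared memberships in other projects. Overlapping membership (here, chen and dee each belonging to two teams) is exactly what connects a team to its wider ecosystem.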
We conducted this research in the context of nanoHUB (http://nanohub.org), a cyberinfrastructure developed as part of the NSF-funded Network for Computational Nanotechnology. nanoHUB offers a platform where teams assemble to develop software, documents, presentations and tutorials for education and research. These materials are published on nanoHUB and then rated, tagged, downloaded and utilized in ways that provide objective metrics of team outcomes. Over the past 10 years of operation, nanoHUB has served a community of users that has grown to more than 250,000 annually from 172 countries worldwide. These visitors come from all of the Top 50 engineering graduate schools and from 21% of all available educational (.edu) domains, and they access more than 3,000 seminars, tools, tutorials, courses, and teaching materials posted on the site. During the past 12 months, more than 12,500 registered users have accessed over 269 simulation tools through nanoHUB’s unique, web-based simulation infrastructure, and they have launched some 430,357 simulation runs. Hence nanoHUB is uniquely suited for observing teams engaged in the creation of scientific products, both basic and applied. In nanoHUB, scientists can self-assemble, and we can observe the choices that they make in doing so. We developed a theory-driven approach by elucidating factors that influence team assembly at the compositional level (attributes of individuals on the team), relational level (prior collaboration, co-authorship and citations between
individuals), task level (attributes of the task, such as development of open vs closed software), and ecosystem level (their prior and current membership in the landscape of all teams). We also developed a data-driven approach by using machine learning techniques to identify which of a set of features were the best predictors of team assembly and success. We offer substantive interpretations of the results of the data-driven models by eliciting specific decision trees that predicted high probabilities of team formation and success. Interpretations of decision trees from data-driven approaches offer new insights that can in turn be used to guide the development of new social science theories about team assembly. As such, this paper argues for a new iterative computational social science methodology that combines theory-driven and data-driven approaches. Results of the research described here will help (i) individual researchers assemble their own dream teams, (ii) university administrators organize interdisciplinary initiatives for research and education, (iii) leaders of cyberinfrastructure, such as the NSF-funded nanoHUB, use a dashboard and recommender system to monitor and enable high-performing virtual collaboration within the nanoHUB community, (iv) program officers at funding agencies make decisions about the likely payoff of scientific teams, and (v) science policy makers design and fund research programs that incentivize the assembly of dream teams.
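To make the data-driven step concrete, the following Python sketch ranks candidate features by information gain, one common criterion that decision tree learners use to choose splits. It is an illustration under stated assumptions rather than the models used in this research: the abstract does not specify the learning algorithm or the feature set, so the feature names prior_collab and same_institution, the helper functions, and every record in the toy table are invented for this example and are not nanoHUB data.

```python
from math import log2

# Made-up illustrative records, NOT nanoHUB data. Each row is a candidate team
# with two hypothetical binary features and whether the team actually formed.
candidates = [
    {"prior_collab": 1, "same_institution": 1, "formed": 1},
    {"prior_collab": 1, "same_institution": 0, "formed": 1},
    {"prior_collab": 1, "same_institution": 1, "formed": 1},
    {"prior_collab": 0, "same_institution": 1, "formed": 0},
    {"prior_collab": 0, "same_institution": 0, "formed": 0},
    {"prior_collab": 0, "same_institution": 1, "formed": 1},
]


def entropy(rows, label="formed"):
    """Shannon entropy (in bits) of a binary label over the given rows."""
    if not rows:
        return 0.0
    p = sum(r[label] for r in rows) / len(rows)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(rows, feature, label="formed"):
    """Reduction in label entropy obtained by splitting on a binary feature."""
    yes = [r for r in rows if r[feature] == 1]
    no = [r for r in rows if r[feature] == 0]
    weighted = (len(yes) * entropy(yes, label)
                + len(no) * entropy(no, label)) / len(rows)
    return entropy(rows, label) - weighted


def rank_features(rows, features, label="formed"):
    """Order features from most to least informative about the label."""
    return sorted(features,
                  key=lambda f: information_gain(rows, f, label),
                  reverse=True)
```

The feature that ranks first under rank_features(candidates, ["same_institution", "prior_collab"]) is the one a tree would split on at its root. Reading such splits as substantive claims (for example, that prior collaboration matters more than co-location in this toy table) is the kind of interpretation that can feed back into theory development.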