Education
Advanced Technologies and Data Management Practices in Environmental Science: Lessons from Academia

Rebecca R. Hernandez, Matthew S. Mayernik, Michelle L. Murphy-Mariscal, and Michael F. Allen

Environmental scientists are increasing their capitalization on advancements in technology, computation, and data management. However, the extent of that capitalization is unknown. We analyzed the survey responses of 434 graduate students to evaluate the understanding and use of such advances in the environmental sciences. Two-thirds of the students had not taken courses related to information science and the analysis of complex data. Seventy-four percent of the students reported no skill in programming languages or computational applications. Of the students who had completed research projects, 26% had created metadata for research data sets, and 29% had archived their data so that it was available online. One-third of these students used an environmental sensor. The results differed according to the students’ research status, degree type, and university type. Changes may be necessary in the curricula of university programs that seek to prepare environmental scientists for this technologically advanced and data-intensive age.

Keywords: data life cycle, data repository, education, environmental sensors, eScience
With the advent of recent technological and computational advances, scientists are using increasing numbers of in situ environmental sensors, model simulations, crowdsourcing tasks, and embedded networked systems that enable environmental studies to incorporate various spatiotemporal scales and to produce unprecedented amounts of data (Porter et al. 2005, Benson et al. 2010). Such technologies and an increasing interest in synthesis studies of environmental phenomena have made data valuable beyond their immediate use (Peters et al. 2008). The flood of data that digital technologies produce (Hey and Trefethen 2003) underscores the urgency of a rapid adoption of pertinent skills and best practices by environmental scientists in the proper management of data sets. Studies in which such preparedness in the environmental sciences is evaluated are absent; however, academic institutions may play a role in imparting the relevant knowledge and skills to the next generation of scientists.

As electronic devices become smaller and cheaper and as complementary computer power grows and applications increase in efficiency, scientists at all career stages are finding technology useful for addressing topics from global epidemics to climate change. Such integration has transformed both the experimental techniques and the solitary working platforms known by predecessors in the field in the not-so-distant past (Nature 2003). But the use of technology and interdisciplinary collaborations often necessitates analytical tools for the integration and analysis of large and heterogeneous data sets. In a survey of a distributed seminar course for ecology graduate students incorporating 11 American universities, Andelman and colleagues (2004) found that over 90% of the students did not have skills in the scripted programming languages that they considered essential for large data set integration and analysis. The degree to which academic institutions have modified their curricula or programs in anticipation of increasing demand for scientists with technological and computational competency is unknown.

Another trend yet to be quantified is an increase in the number of environmental scientists who follow proper data management practices to improve their research. Exemplifying this trend, the National Science Foundation (NSF) now requires that all grant applications include data management plans (NSF 2010). Regardless of the size of a project or its associated data products, creating and following through with such plans requires fulfilling metadata requirements and completing the
BioScience 62: 1067–1076. ISSN 0006-3568, electronic ISSN 1525-3244. © 2012 by American Institute of Biological Sciences. All rights reserved. Request permission to photocopy or reproduce article content at the University of California Press’s Rights and Permissions Web site at www.ucpressjournals.com/ reprintinfo.asp. doi:10.1525/bio.2012.62.12.8
www.biosciencemag.org
December 2012 / Vol. 62 No. 12 • BioScience 1067
data life cycle (e.g., collection, management, interpretation, long-term archiving; Wallis et al. 2010). Metadata are the documentation and annotations used to manage, share, and preserve data resources. Many believe that metadata standards are critical for overcoming widespread problems of linguistic uncertainty that can render environmental data unshareable (Regan et al. 2002). The degree to which programs and advisers in the environmental and ecological sciences are instructing graduate students to correctly capture and record metadata or to use metadata standards, such as the Ecological Metadata Language (EML), is unknown. In addition, it is unknown whether programs and advisers are supporting and conveying the responsibility of proper data archiving in online data repositories (e.g., Dryad; www.datadryad.org) and thereby completing the data life cycle. When graduate students are not trained in data archival methods or do not take independent action to archive their graduate research data sets, they may be less likely to archive data sets in future research endeavors. As an example, the Networked Digital Library of Theses and Dissertations already contains over one million graduate products whose original data may be available only by contacting the author, or even worse, the data may have been misplaced. The continuance of this practice would be a huge loss of opportunity to the academic community, however large or small each individual student’s data set may be, especially if the number of graduate degrees awarded continues to grow (see supplemental figure S1, available online at http://dx.doi.org/10.1525/bio.2012.62.12.8).

In this study, our first goal was to evaluate the technological and computational experience of environmental scientists and their data management practices in the formative stages of their career.
Specifically, we were interested in the breadth of coursework completed by environmental graduate students that was germane to computational and information science and to the analysis of large and complex data sets. We also sought to determine the proficiency levels of graduate students with analytical tools, including programming languages and computational applications that are frequently employed in environmental studies. Finally, we evaluated the students’ data management practices, environmental sensor use, and interdisciplinary collaborations, comparing between those who had completed and those who had not completed their master’s research project or dissertation. A secondary goal was to compare master’s students with doctoral students and also to determine whether differences exist among different institution types in California. Specifically, we surveyed private California universities, the University of California (UC), and California State University (CSU). Private universities differ in their major funding sources, whereas the latter two differ in their function (i.e., institutions with exclusive jurisdiction in PhD and professional instruction or undergraduate-focused institutions with primarily master’s degree graduate programs, respectively; Douglass 2007). Using survey responses of current and former graduate students, we highlight the degree to which academia is
facilitating the integration of technology, computation, and data management in the environmental sciences and discuss its implications for the contribution of research data products to the greater body of scientific knowledge. Finally, we draw on these results to elucidate methods by which environmental scientists at all career stages may excel in this technological and data-intensive era.

Graduate students’ responses and the data collection process

During the months of June, July, and August 2011, we conducted an online survey (using www.surveymonkey.com; see supplemental form 1). We solicited responses from master’s and doctoral students in academic departments related to environmental or ecological sciences from 27 California universities, including 4 private schools, 9 public universities in the UC system, and 14 public universities in the CSU system. CSU institutions offer research-based master’s degrees and, in general, do not support doctoral programs. All private universities and UC institutions surveyed support both master’s and doctoral programs; however, all of the survey respondents for these university types were planning to complete or had completed a doctoral degree. We excluded universities that did not respond to requests for participation and those from whose students we received fewer than three responses. Private universities were those classified as research institutions by the Association of Independent California Colleges and Universities (n = 7), that offer an environmental- or ecology-related graduate program (n = 4), and that were receptive to participation (n ≥ 3). In total, 23 universities, including 18 academic programs from 11 California State Universities, 16 academic programs from 9 Universities of California, and 4 academic programs from 2 private universities, were represented. The survey responses were solicited through e-mail.
When it was possible, we sent e-mail solicitations to graduate student electronic mailing lists within each surveyed department. If such mailing lists were not available, we collected student e-mail addresses from online department directory pages and e-mailed the students directly. For a few surveyed universities, we also e-mailed faculty members within the relevant departments and asked them to forward our solicitation e-mail to students. If our first solicitation to a particular department did not result in responses, we sent a second solicitation e-mail. Students who had completed their graduate degree more than two years prior or answered no to the question “[Do] your education and research foci fall within the ecological or environmental sciences?” were excluded from our analyses. The response rates were difficult to calculate, because the survey was, in most cases, sent to departmental mailing lists, the sizes of which were unknown. Instead, we counted the number of students listed on departmental Web pages. Using this proxy measure, we calculated approximate response rates of 23% for the UC sample and 25% for the private sample. We did not calculate a response rate for the CSU sample, because department lists were not provided.

We processed and statistically analyzed all of the survey data using scripts in R (www.r-project.org). For all of the survey questions, means were derived using the number of responses for each university as a weight, and the associated 95% confidence intervals (CI) were reported. We determined the differences in responses among the three university types and between the master’s and doctoral students by using chi-squared analyses based on counts derived at the response level. We used Student’s t-test scores to determine significant differences between the responses of those students with thesis or dissertation research in progress and those who had completed their research on the basis of weighted means at the individual university level. It was possible that the students would respond that their research project was both completed and in progress; this scenario occurred, for example, when a student had progressed from a research-based master’s to a doctoral program.
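The weighting and contingency-table steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R scripts, which are not reproduced in the article. The function names, the example numbers, and the use of a normal-approximation interval with an effective sample size are all assumptions made for illustration; the article does not state how its confidence intervals were computed.

```python
import math


def weighted_mean_ci(percentages, weights, z=1.96):
    """Return (weighted mean, 95% CI half-width) for per-university percentages.

    ASSUMPTION: the half-width uses a normal approximation with Kish's
    effective sample size; the article does not describe its exact formula.
    """
    total = sum(weights)
    mean = sum(p * w for p, w in zip(percentages, weights)) / total
    # Weighted variance of the per-university values around the weighted mean.
    variance = sum(w * (p - mean) ** 2 for p, w in zip(percentages, weights)) / total
    # Effective number of universities once unequal weights are accounted for.
    n_eff = total ** 2 / sum(w * w for w in weights)
    return mean, z * math.sqrt(variance / n_eff)


def chi_squared_statistic(table):
    """Return (Pearson chi-squared statistic, degrees of freedom) for a table of counts.

    Rows are groups (for example, university types) and columns are responses.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and response.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


if __name__ == "__main__":
    # HYPOTHETICAL data: percentage answering "yes" at three universities,
    # weighted by the number of responses received from each.
    pct_yes = [30.0, 45.0, 25.0]
    n_responses = [40, 25, 35]
    mean, half_width = weighted_mean_ci(pct_yes, n_responses)
    print(f"weighted mean = {mean:.1f}%, 95% CI = +/-{half_width:.1f}")

    # HYPOTHETICAL counts of yes/no responses for three university types.
    counts = [[12, 28], [20, 15], [9, 26]]
    statistic, dof = chi_squared_statistic(counts)
    print(f"chi-squared = {statistic:.2f} on {dof} degrees of freedom")
```

Weighting by the number of responses keeps a university with only a few respondents from having the same influence on the mean as one with many. Converting the chi-squared statistic to a p value requires the chi-squared distribution, which a statistics package would supply.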
The students completed the least amount of coursework in networking, metadata, and information technology. The students showed little intention of eventually taking additional courses in this discipline (1.0%, 95% CI = 1.6), but that intention was numerically greatest for bioinformatics and computational biology (2.4%, 95% CI = 3.8). A large number of the students—74.6% (95% CI = 6.0)— stated that they had not completed any coursework related to the management and analysis of complex data (table 2). Approximately one-third (30.5%, 95% CI = 6.4) of the students stated that they had taken at least one course in geographic information systems (GIS), 29.2% (95% CI = 6.3) had taken coursework in modeling, and 19.6% (95% CI = 6.1) had taken courses in spatial analysis. Less than 20% of the students had taken a course in remote sensing (16.1%, 95% CI = 5.8), time series analysis (12.1%, 95% CI = 3.2), meta-analysis (6.9%, 95% CI = 3.4), or data mining (4.9%, 95% CI = 3.0).
Survey results

In total, 498 graduate students responded to the survey, and of those, 434 met the study’s criteria. The number of eligible responses varied according to the student’s thesis or dissertation status (in progress, n = 326; completed, n = 131), according to their education level (master’s student, n = 124; doctoral student, n = 385), and according to university type (California State University, n = 124; University of California, n = 261; private university, n = 49) (supplemental table S1).
Skills. A majority—74.0% (95% CI = 6.6)—of the students stated that they had no skills in the programming languages and computational applications evaluated in this survey. Only 17.2% (95% CI = 4.7) of the students, on average, stated that they had basic skill levels in these areas. The students had the least experience with EML (99.1% stated that they had no experience, 95% CI = 4.7; figure 1), Java (90.5%, 95% CI = 12.1), or IDL (Interactive Data Language; 90.5%, 95% CI = 0.7). The students claimed a basic skill level or higher in GIS (e.g., ArcGIS; 55.5%) and statistical applications, including R (55.9%), and JMP, SPSS, or SAS (53.0%).
Coursework. Over 80% (82.3%, 95% CI = 5.3; table 1) of the students in our survey stated that they had completed none of the eight computer and information science courses evaluated in this study. Over 20% of the students had completed coursework in introductory computing (23.8%, 95% CI = 5.9) and computer programming (22.9%, 95% CI = 4.6).
Advanced technologies. Approximately one-third (36.7%, 95% CI = 8.7) of the students whose program was still in progress planned to use environmental sensors in their research study (figure 2). This number paralleled the percentage of
Table 1. The mean percentage of surveyed graduate students who had taken or intended to take courses in subjects related to computational and information science. Each cell gives the mean, with the 95% CI in parentheses.

Course                                    0 courses     1 course      2 courses     3 or more     Intended
                                          completed     completed     completed     completed     future course(a)
Introductory computing                    69.4 (6.7)    23.8 (5.9)    4.2 (2.9)     1.8 (1.1)     0.7 (0.5)
Computer programming                      63.8 (8.0)    22.9 (4.6)    4.6 (2.5)     6.8 (3.2)     1.8 (3.1)
Data structures or algorithms             81.7 (5.7)    14.2 (4.4)    1.8 (1.2)     1.1 (1.6)     1.1 (1.9)
Networking                                95.1 (2.6)    3.3 (2.3)     0.7 (0.6)     0.5 (0.6)     0.5 (0.4)
Information technology                    90.8 (4.9)    7.4 (4.3)     0.7 (1.0)     0.7 (0.6)     0.5 (1.8)
Database management                       86.1 (4.0)    11.0 (3.5)    1.6 (1.1)     0.5 (0.6)     0.9 (0.7)
Metadata                                  94.2 (4.1)    4.4 (4.0)     0.7 (1.8)     0.2 (0.5)     0.5 (0.4)
Bioinformatics or computational biology   76.9 (6.5)    15.5 (4.7)    3.6 (1.7)     1.6 (1.9)     2.4 (3.8)
All courses                               82.3 (5.3)    12.8 (4.2)    2.2 (1.6)     1.7 (1.3)     1.0 (1.6)

Abbreviation: CI, confidence interval.
(a) The survey stated, “0, but I will take one soon.”
Table 2. The mean percentage of surveyed graduate students who had taken or intended to take courses in subjects related to the management and analysis of large or complex data. Each cell gives the mean, with the 95% CI in parentheses.

Course                                    0 courses     1 course      2 courses     3 or more     Intended
                                          completed     completed     completed     completed     future course(a)
Spatial analysis                          71.7 (7.1)    19.6 (6.1)    3.6 (2.1)     2.8 (1.3)     2.3 (1.1)
Geographic information systems            54.3 (9.4)    30.5 (6.4)    7.8 (3.5)     3.5 (1.8)     3.9 (3.7)
Remote sensing                            77.2 (6.9)    16.1 (5.8)    3.7 (2.0)     2.3 (1.5)     0.7 (0.5)
Modeling                                  54.7 (7.9)    29.2 (6.3)    7.8 (2.8)     5.4 (2.1)     2.8 (2.1)
Time series analysis                      82.1 (4.1)    12.1 (3.2)    3.6 (2.2)     0.7 (0.5)     1.5 (0.9)
Meta-analysis                             91.0 (3.5)    6.9 (3.4)     0.7 (1.8)     0.0 (0.0)     1.4 (0.8)
Data mining                               91.4 (3.3)    4.9 (3.0)     1.1 (1.9)     0.5 (0.6)     2.1 (1.9)
All courses                               74.6 (6.0)    17.1 (4.9)    4.1 (2.3)     2.2 (1.1)     2.1 (1.6)

Abbreviation: CI, confidence interval.
(a) The survey stated, “0, but I will take one soon.”
students who had completed their research and had, in fact, used environmental sensors (33.1%, 95% CI = 10.1). More than 10% (14.9%, 95% CI = 9.8) of the students whose research was in progress did not know what an environmental sensor was or what it meant to use one in environmental research, but this number was halved (7.5%, 95% CI = 0.7) for the students who had finished their research. There was no significant difference between the percentage of students whose research was in progress and who intended to use a sensor in that research and the percentage of students who had completed their research and had actually used a sensor (table 3). The doctoral students whose research was still in progress planned to use environmental sensors significantly more than did the master’s students, and there was a nearly significant difference by education level for the students who had used environmental sensors in their research (p = .0520; table 4a, 4b). The students at the UC institutions planned on using environmental sensors in their research (41.9%) significantly more than did those at private (27.1%) and CSU-system (28.5%) universities (supplemental table S2).

Interdisciplinary collaboration. The percentage of students who had finished their research and had collaborated with someone whose expertise was outside the environmental or ecological sciences (37.6%, 95% CI = 1.4) was significantly lower than the percentage of students whose work was in progress who stated that they planned such collaborations (55.4%, 95% CI = 7.5; table 3). There was no significant difference in interdisciplinary collaboration activities between the master’s and doctoral students (table 4a, 4b). There were significant differences in interdisciplinary collaboration among the students at different university types who had finished their research (table S2). Specifically, the CSU students were less likely to collaborate (28.1%) than were the students at UC institutions (39.8%), who were also less likely to collaborate than the students at private universities (51.7%).

Data management. Approximately 72.3% (95% CI = 6.2) of the students who were still in the process of completing their master’s or doctoral research were planning on completing the data life cycle in their research, and 65.3% (95% CI = 6.7) of these students intended to archive their research data so that it would be available online (table 3). Of those who had already completed their graduate degree, 63.9% (95% CI = 16.2) stated that they had completed the data life cycle, whereas only 29.3% (95% CI = 13.1) had made their data available online, significantly less than the prospective figure from the students still in the midst of their research (table 3). A large portion of the students stated that they did not plan on making their data available online, and this number was greater for the students who had already completed their thesis or dissertation (45.9%, 95% CI = 1.3) than for those whose research was still in progress (28.0%, 95% CI = 6.7). Almost one-third of the students whose research was in progress did not know what it means to create metadata for their data sets (28.0%, 95% CI = 8.8), and a similar number (34.7%, 95% CI = 9.3) did not plan to create metadata for their data sets. Of the students who had finished their research, 25.6% (95% CI = 1.3) had created metadata, 63.2% (95% CI = 1.7) had not, and 12.0% (95% CI = 1.3) planned to do so at some time in the future. The students’ data management practices varied according to degree type (table 4a, 4b). The doctoral students were more likely to complete or to plan to complete the data life cycle. However, the master’s students showed significantly greater intent to create metadata and to archive their data products such that they would be available online than did the doctoral students. There were no significant differences among the different university types regarding data life cycles
and metadata creation (table S2). But students at private universities (69.5%) and UC institutions (67.0%) were more likely to make their research data available online than students at a CSU institution.

Figure 1. The level of proficiency (none, basic, proficient, or expert) of the surveyed graduate students with programming languages or computational applications. The error bars represent 95% confidence intervals. Abbreviations: EML, Ecological Metadata Language; GIS, geographical information systems; IDL, Interactive Data Language. [Figure not reproduced. Axes: percentage (0 to 100) and programming language or computational application (C, C#, C++; EML; ENVI; GIS (e.g., ArcGIS); IDL; Java; JMP, SPSS, SAS; MATLAB; Access; Python; SQL, MySQL; R).]

The extent of graduate student preparation

Environmental studies in which new kinds of technology, computation, data life cycle techniques, and open-source dissemination are employed hold promise for addressing many important societal issues, including the measurement of biodiversity shifts (Kelling et al. 2009) and the assessment of climate change (Graham et al. 2010), but our results suggest that many of the skills and practices that would enable scientists to use these new opportunities are only marginally taught in formal graduate programs in the environmental sciences in California.

Environmental curricula: New courses and skill sets. Students can and do learn new
methods and technologies on their own, but advanced computation, in situ field sensor technologies, and digital data management best practices will only become standard tools and skills if they are integrated into formal curricula. Among the topics that we surveyed, GIS and modeling courses were the most widely studied by the students: About one-third of them had taken a GIS or modeling course. Only two other topics in our survey even reached 20%. This suggests that most environmental scientists in training are not taking the initiative to expand their knowledge in these areas through formal courses. The development of novel courses requires many resources, including expertise, time, and funding. In some cases, it may be worthwhile to integrate new material or skills into existing courses. However, external organizations may provide relevant materials that can be incorporated into an institution’s curriculum. The DataONE organization, for example, develops educational programs related to data management, such as internships, workshops at professional meetings, and educational modules on specific data management topics (see www.dataone.org/education for more information).
Learning to capitalize on technology. In this study, we show that environmental sensors are important methodological instruments for a large proportion of graduate students. A limitation of our study is that we did not assess the levels of complexity in the sensor setup (e.g., an individual device versus a sensor network) or in data streams derived from such devices. More complex scenarios often require that users have knowledge in areas in which few of the students in our survey had taken courses, such as data structures and algorithms, database management, and networking (table 1). Researchers will also need to understand how new technologies can be used, their strengths and limitations, and techniques for analyzing the numerous and complex data that they output. For example, one must
Figure 2. Mean percentage of responses for the surveyed graduate students (a) who had completed their master’s or doctoral research or (b) who had not yet completed their master’s or doctoral research. The error bars represent 95% confidence intervals. The respondents were earning or had earned their master’s or doctoral degree in the ecological or environmental sciences at a California State University, the University of California, or a private California university. [Figure not reproduced. Both panels plot the percentage of students (0 to 100) for five research steps: completing the data life cycle (collection, management, interpretation, archival), creating metadata, using environmental sensors, archiving research data so that it is available online, and collaborating with a researcher outside environmental science. Response options shown in panel a: “Yes,” “No,” and “No, but I plan on doing so.” Response options recoverable for panel b include “Yes, I plan to.”]
Table 3. The mean percentage of surveyed graduate students who responded that they planned to complete (n = 326) or had already completed (n = 131) the relevant research steps.

                                     Research project status
                                     In progress         Completed
Research step                        Mean    95% CI      Mean    95% CI      t(455)    p
Completion of the data life cycle    72.3    6.2         63.9    16.2        3.388     .0008