A Python Library for Teaching Computation to ... - GeoScienceWorld

SRL Early Edition

A Python Library for Teaching Computation to Seismology Students

by John M. Aiken, Chastity Aiken, and Fabrice Cotton

ABSTRACT

Python is at the forefront of scientific computation for seismologists and therefore should be introduced to students interested in becoming seismologists. On its own, Python is open source and well designed with extensive libraries. However, Python code can also be executed, visualized, and communicated to others with “Jupyter Notebooks”. Thus, Jupyter Notebooks are ideal for teaching students Python and scientific computation. In this article, we designed an openly available Python library and collection of Jupyter Notebooks based on defined scientific computation learning goals for seismology students. The Notebooks cover topics from an introduction to Python to organizing data, earthquake catalog statistics, linear regression, and making maps. Our Python library and collection of Jupyter Notebooks are meant to be used as course materials for an upper-division data analysis course in an Earth Science Department, and the materials were tested in a Probabilistic Seismic Hazard course. However, seismologists or anyone else who is interested in Python for data analysis and map making can use these materials.

doi: 10.1785/0220170246

INTRODUCTION

Computers have transformed the way seismologists conduct research. Tasks such as numerical modeling, data analysis, and statistics are all performed by seismologists on computers. It is practically impossible today for seismologists to process earthquake data without a computer because most data are digital. Seismologists use computers to visualize data, analyze waveforms, pick phase arrival times, locate earthquakes, and investigate earthquake catalog statistics, to name just a few examples. Many seismologists today accomplish these tasks using the Python programming language. However, in many cases, our courses do not instruct students to use the same modern tools that seismologists use. Understanding the computational tools and techniques that seismologists use is not only important for training future seismologists but also teaches students new problem-solving skills and a new representation of the problems themselves. Thus, computation and computational thinking are vital skills for future scientists.

Computation and computational thinking afford students the ability to solve and explore methods in an iterative and explorative way (Wing, 2006). Computational thinking is the process used to translate a problem into appropriate procedures for a computer to solve it. Often, this involves writing code. Through writing code to solve physics problems, students are able to generalize models in a way that is inaccessible to analytical methods (Caballero et al., 2014). Complicated problems (e.g., statistical analysis of very large earthquake catalogs, modeling mantle dynamics) can be solved simply by students writing code. Common coding errors can be traced back to conceptual difficulties that students have with the course material, which aids instruction (Caballero et al., 2012).

However, for beginner students it is often important to provide the appropriate scaffolding in order for them to be successful (Vygotsky, 1980). Scaffolded code consists of code building blocks that are provided to students and that they then complete. This is quite similar to the way scientists actually code. In most cases, scientists do not start writing new code from scratch to perform an analysis. Often, they use older code maintained by colleagues or predecessors. Scaffolding code for students simply translates this practice into the classroom.

Why choose Python? Python is an easy language to read that is popular in modern seismology. Python is also free for anyone to install and use on any computer. Several widely used seismology libraries are written in Python (e.g., ObsPy; Beyreuther et al., 2010), and libraries such as Pandas, NumPy, and SciPy provide simple yet powerful statistical and data manipulation tools (McKinney, 2010; van der Walt et al., 2011).
A textbook on computational seismology has also been written that uses Python and is designed for teaching a graduate-level seismology course (Igel, 2017). Python and Jupyter Notebooks have been chosen by the Global Earthquake Model foundation for analysis and dissemination of results (Pagani et al., 2014). Thus, Python is becoming a reference language for seismology and seismic hazard. Additionally, the Anaconda Python distribution allows for easy installation on all systems. Anaconda is a Python distribution built for conducting science. It is an inclusive installation of Python that includes common scientific libraries (NumPy, SciPy, Matplotlib, and Pandas) and a package and virtual environment manager (conda). Moreover, Anaconda is recommended

Seismological Research Letters

Volume XX, Number XX – 2018


by the ObsPy developers for individual users (such as students) installing ObsPy (see Data and Resources). Anaconda also includes Jupyter Notebooks (hereafter, Notebooks). Jupyter is an open-source web application that promotes the creation and sharing of Python code, visualizations, and text within Notebooks. The Notebooks are easy to share and have become widely accepted in the scientific world (Shen, 2014). Journals such as Nature are now accepting Notebooks as supplemental materials and demonstrations of the analysis described in the articles. Taken together, these make Python a powerful and affordable tool for anyone to use.

This article describes a set of Notebooks and a Python library that were constructed for use by seismologists, seismology students, or anyone else who is interested in getting started with Python. The Notebooks were written for a Probabilistic Seismic Hazard course taught at Potsdam University in 2016–2017 but, in general, are designed to be used in any course in which an instructor wants students to learn to input data into Python, analyze those data, and then plot them on maps and figures. The Notebooks provide introductory examples of data input and output using modern scientific libraries in Python. They also provide real-world examples of processing seismic data while learning Python. In the next few sections, we discuss the learning goals of the Notebooks, the toolkits, the Notebook library itself, and how to get started with using the Notebooks.

LEARNING GOALS

When designing the Notebooks, we aimed for them to be modular for the course being taught (within the context of seismology). That is, as long as one of the course’s goals was for students to learn computational skills (in this case, data analysis in the Python programming environment), these materials could be included. Thus, we developed broad learning goals for the students, which dictated the materials we designed. These learning goals are:

1. students should feel comfortable with standard data types in Python (list, string, float, int, dict, etc.);
2. students should be able to import data and export data using Pandas;
3. students should be able to make plots in Python that have horizontal and vertical axes using a variety of data;
4. students should be able to manipulate histograms in Python and create visualizations of histograms;
5. students should be able to make maps in Python that include data such as points and heat maps;
6. students should be able to perform exploratory data analysis and generate a hypothesis from it; and
7. students should be comfortable with using Notebooks.

These learning goals were communicated to the students at the beginning of a Probabilistic Seismic Hazard course. Each goal is designed to address a specific skill that a seismologist needs in day-to-day work (Table 1). Most importantly, these learning goals recognize that students may come with no prior preparation in Python.
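As a small illustration of the standard data types named in the first learning goal, the sketch below stores one earthquake record using each of them. It is not taken from the Notebooks, and all values and key names are invented for the example.

```python
# Illustrative only: the event values and key names below are invented.
event_id = "ev001"        # str: an identifier
magnitude = 4.7           # float: a measured quantity
n_stations = 23           # int: a count
phases = ["P", "S"]       # list: an ordered, mutable sequence

# dict: maps descriptive keys to values of any type
event = {
    "id": event_id,
    "magnitude": magnitude,
    "n_stations": n_stations,
    "phases": phases,
}

# Lists can grow in place, and dict values are looked up by key.
phases.append("Pn")
summary = f"{event['id']}: M{event['magnitude']:.1f} recorded at {event['n_stations']} stations"

print(summary)
print(type(magnitude).__name__, type(n_stations).__name__, len(event["phases"]))
```

Because the dict holds a reference to the same list object, appending to `phases` is also visible through `event["phases"]`, which is the kind of behavior beginners often find surprising.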


PYTHON COURSE MATERIALS FOR SEISMOLOGY STUDENTS PACKAGE

The materials were designed to be repurposed for any course whose instructor wants to teach data analysis and map making in Python to students with little prior experience. Thus, we call the library Python Course Materials for Seismology Students. The Notebooks within this library use open-source data as well as open-source modules and toolkits, e.g., Pandas, Matplotlib, Basemap, and Cartopy, to teach students how to read data, perform simple statistical analysis on it, and visualize the result. We will first describe the data, modules, and toolkits used in the Notebooks and then the Notebooks we designed to achieve the learning goals (Table 1).

Modules and Toolkits

The Pandas module (derived from panel data) is an open-source Python data analysis library (McKinney, 2010). Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python language, and it is useful for manipulating numerical tables and time series in a structured format. It offers tools for reading and writing data, handling missing data, reshaping and indexing data, and some basic plotting. A Pandas user can, for example, group data by values in a column, count the number of values, and then plot the result, all in a single line. Thus, Pandas is a powerful tool that simplifies data input and output so students can focus on data analysis.

The Matplotlib module is a Python plotting library that can be used to make publication-quality figures. It is designed to be familiar to users of many types of script-based plotting software such as R (R Core Team, 2013) and gnuplot (see Data and Resources). The goal of Matplotlib is to make plotting simple. A user can generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Similar plots can be made in open-source tools such as R and gnuplot, and R can be used in Notebooks.
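The single-line group, count, and plot pattern described for Pandas can be sketched as follows. This is a minimal illustration rather than code from the Notebooks: the mini-catalog and its column names (`year`, `magnitude`) are invented, and the non-interactive Matplotlib backend is selected only so that the sketch also runs outside a Notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a Notebook
import pandas as pd

# Invented mini-catalog; the column names are assumptions for this example.
catalog = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016, 2017],
    "magnitude": [3.1, 4.2, 2.8, 5.0, 3.6, 4.4],
})

# Group by a column, count the rows in each group, and plot the result in one line.
ax = catalog.groupby("year")["magnitude"].count().plot(kind="bar")

# A few more Matplotlib calls label the axes of the figure Pandas created.
ax.set_xlabel("Year")
ax.set_ylabel("Number of events")

# The same grouped counts, kept as a Series for inspection.
counts = catalog.groupby("year")["magnitude"].count()
print(counts)
```

Pandas hands the drawing to Matplotlib and returns the Matplotlib axes object, so the figure can be refined afterward with ordinary Matplotlib calls.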
We do not describe R or gnuplot further here because our goal is to focus on using Python to complete these tasks.

Basemap and Cartopy are map-making toolkits used with the Matplotlib library. Neither Basemap nor Cartopy performs the plotting; these toolkits handle the transformation of coordinates, and Matplotlib then draws the transformed coordinates as contours, images, vectors, etc. Basemap and Cartopy are equally usable for making maps in the Python language. However, Basemap came under new management in 2016, with maintenance of the Basemap toolkit guaranteed until 2020. It has been announced that Cartopy will replace Basemap (see Data and Resources). Thus, in our library, we demonstrate map making using both toolkits. However, in the future, Basemap Notebooks will no longer be supported.

Jupyter Notebooks

The Notebooks we present here demonstrate reading data, gridding data, statistics, linear regression, and introductory data handling, plotting, and map making. The Notebooks are freely

Table 1
Learning Goals and Their Associated Jupyter Notebooks

No.  Learning Goal              Notebook Title
1    Python data types          Introduction to Python
                                Introduction to reading data and plotting using Pandas
2    Import and export data     Introduction to reading data and plotting using Pandas
                                Introduction to plotting data as a heat map
3    Basic figure making        Introduction to reading data and plotting using Pandas
                                Introduction to plotting data as a heat map
                                Introduction to scatter plots and histograms
4    Histograms                 Introduction to scatter plots and histograms
5    Map making                 Creating a map in cartopy and plotting data on it
                                Plotting focal mechanisms on a cartopy map
                                Plotting a pretty map using cartopy
                                Plotting heat map data on a map using cartopy
                                Maps for unshaped data using cartopy
6    Exploratory data analysis  Introduction to plotting data as a heat map
                                Introduction to scatter plots and histograms
                                Creating a map in cartopy and plotting data on it
                                Plotting heat map data on a map using cartopy
7    Jupyter Notebooks          All notebooks

Instead of having a linear progression (notebook 1, notebook 2, notebook 3, etc.), we assign Notebooks to learning goals to aid in the modularity of the materials.

available on GitHub (see Data and Resources). The primary purpose of our Notebooks is to teach Python to students with little to no programming experience. The Notebooks are useful for teaching students how to program because the programming concepts can be separated into unique, executable cells, that is, scaffolded (Fig. 1). Once the Notebooks are installed, students can execute them to visualize the process or make changes to the code themselves. Because the concepts and code are divided into cells, students learn to program in chunks that show the content and structure and why code sometimes does not work. This makes learning programming more accessible to students who may otherwise struggle.

Most online tutorials demonstrate the correct way to use a programming language but offer little help when code fails, yet learning often comes from failure. For more complicated errors, there are resources for answering difficult questions about code failure (e.g., Stack Overflow). In some of our Notebooks, however, we model commonly made coding errors to demonstrate how code sometimes fails and how to correct it. Figure 1 is a simple example of scaffolded code illustrating code failure, but there are other more complicated code failure points such as in the “Introduction to reading data and plotting using Pandas” Notebook, which introduces students to errors that can occur when importing data (Fig. 2). The “Introduction to plotting data as a heat map” Notebook illustrates the steps that we went through to make the exercise. The process of making the exercise involved reading data, plotting it, not seeing the expected result, and then repeating the process until the data were visualized correctly. We did it this way instead of having every step work because we hope it shows that the scientific process is one of struggle and continuous improvement.

Other Notebooks have scaffolded code that builds upon itself. For example, the “Creating a map in cartopy and plotting data on it” Notebook first illustrates how to make a basic map and then reuses the same code to add earthquake data on top of it. In this way, the map-making Notebook is a demonstration of increasing code complexity, that is, we want students to see how the map changes with each additional step.

There are several Notebooks designed to encourage exploratory data analysis (Table 1). These Notebooks include teachable moments that stimulate students to think about the data they are engaging with and what the data are telling them. For example, the “Introduction to scatter plots and histograms” Notebook asks students to consider whether the number of stations used to locate events correlates with their magnitudes (Fig. 3). It also teaches students basic Gutenberg–Richter analysis and linear regression, applicable to a course teaching statistical seismology. This latter point is not shown in Figure 3 but can be found in the Notebook itself.

After students become accustomed to data analysis and handling, Notebooks in the basic figure making and map-making learning goals can be used to demonstrate and visualize data (Table 1). Basic figure making Notebooks illustrate how to make simple figures using real earthquake data with the Pandas module without map projection. Once students have mastered basic figure making, they can be introduced to map-making Notebooks where projection is utilized. Map-making Notebooks illustrate adding features to maps, adding earthquake data to maps (including focal mechanisms), importing satellite imagery and shaded relief, annotating maps, and plotting gridded and nongridded data as heat maps with and without interpolation. There are detailed discussions about the effects of interpolation when gridding data.

For the map-making Notebooks, there are currently two versions: one that uses the Python library Basemap and one that uses the Python library Cartopy. The Basemap and Cartopy map-making Notebooks use similar data sets and provide similar results, so there are no differences in content. We made two map-making versions of the Notebooks for two reasons: (1) to provide instructors an option based on their familiarity with map making in Python, and (2) because Basemap will be sunset in 2020 and will no longer exist in a usable form. Future map-making options could include the Generic Mapping Tools (GMT) Python bindings when they are released and stable (see Data and Resources).

Figure 1. Example of using scaffolded code as a visual tool for teaching coding to students in a structured way. Coding concepts can be separated into unique executable cells. We use this style of teaching throughout many of the Notebooks. This example is taken from the “Introduction to Python” Notebook. Note that Figures 1–3 show screenshots of Notebooks and do not show the entire Notebook, as that would not be feasible to visualize. In all cases, there is more content available in the Notebook provided in the repository than is represented here. The color version of this figure is available only in the electronic edition.

Utilities Package

Along with the Notebooks provided as examples, an additional package called “utilities” is provided. This library focuses more closely on analyzing earthquake catalogs, providing many usable


functions for tasks such as converting timestamps to Python time objects, calculating different types of distances, calculating Gutenberg–Richter b-values, bootstrapping, making volumetric selections of data, and calculating parameter sweeps of statistics. For example, the function get_node_data returns the data within a given radius of a given center longitude and latitude. The package is offered as is and is not intended to replace tools such as ZMAP (Wiemer, 2001) but rather to serve as a jumping-off point for seismologists using Python to explore earthquake catalogs. This package also exists outside of the learning goals structure that the example Notebooks rely on and can be thought of as advanced topics.

Course Use

The Notebooks were used in a one-semester course at the University of Potsdam, Germany. Nine Master’s level students, most of whom had little to no prior programming experience, took an elective course in probabilistic seismic hazard analysis (PSHA). These Notebooks served as the basis for a semester-long project in which students calculated the hazard curve for a megacity region they selected. Students were taught 3 hrs per week for 15 weeks. During the first 8 weeks, the first 90 min of class were spent introducing students to concepts of PSHA and statistics. The second 90 min block was used for students to apply the concepts. A typical second half of class would consist of a brief lecture on the computational concepts of that day’s Notebooks (e.g., how and why you should grid data). Then, students were given time to explore the Notebooks. Typically, a live coding session would happen as well (Rubin, 2013). In these sessions, the instructor would project their own blank Notebook on the screen and give the students a task (e.g., graph a sine wave). The students then told the instructor what each line of code should be to complete the given task. All of the Notebooks were designed to teach students general skills in Python programming and data analysis. The students spent the last four weeks working full time on their PSHA megacities projects. Class time was dedicated to students working on projects, asking questions about projects, giving short presentations about their projects, or getting help with code. The ultimate results of the course were student conference-style talks and posters about their assessment of hazard in megacities.

Getting Started with These Materials

For all examples, we recommend using Anaconda and virtual environments. Anaconda allows all prerequisite packages to be installed on any operating system with little fuss. Anaconda can be downloaded from the software’s webpage (see Data and Resources). The cartopy_environment.yml file, which can be found in the repository, can be used to create a virtual environment where Notebooks can be executed using the Cartopy module. To create the virtual environment from the cartopy_environment.yml file, students can follow this guide for creating a virtual environment (see Data and Resources). The Basemap and Cartopy modules do not play well together. If you plan to use both, you will need to create separate virtual environments. For this reason, we also provide a basemap_environment.yml file for creating a Basemap-friendly virtual environment separate from Cartopy. To test if an environment works, run the associated Notebook using the respective mapping environment (i.e., change the kernel) and execute each cell (e.g., Python Course Materials for Seismology Students; see Data and Resources). If it works, you are ready to start!

Figure 2. Scaffolded code illustrating data importing that goes wrong with a real-world seismology data set. In this case, an earthquake catalog from the Southern California Earthquake Data Center is used. Detailed discussions of what goes wrong and how to fix it are conversational “teachable moments” in our Notebooks. The color version of this figure is available only in the electronic edition.

SUMMARY

In this article, we present learning goals and a Notebook library for a computational seismology course in Python using real-world, freely accessible seismology data sets. Our Notebook library provides a centralized location of scaffolded code that was developed for first- and second-year graduate students to learn data handling, data analysis, and map making using Python. Although these Notebooks were written for students, anyone who is interested in learning Python can use them. As a test of usability, the Notebooks were introduced in a Master’s level Probabilistic Seismic Hazard Analysis course, where most students had little to no previous programming experience. These Notebooks are designed to sit between advanced course studies (such as Igel, 2017) and introductory courses, offering students with no background in programming exposure to map making and statistics using Python.

Figure 3. Scaffolded code showing exploratory Notebooks for students. Jupyter Notebooks have inline visualizations using Matplotlib as the default library. The color version of this figure is available only in the electronic edition.

For all intents and purposes, this collection of code and Jupyter Notebooks is complete. However, we do want to develop more Notebooks for specific courses or allow more flexibility to teachers given their familiarity with different tools. Future updates will include map making via a GMT-based Python library when the toolkit becomes more stable (see Data and Resources). Also, we would like to develop a module based on seismic-waveform data analysis via ObsPy that would be useful for an upper-division undergraduate or graduate-level course in observational seismology. Finally, we would like to offer more examples of using Python to fetch seismic data from sources such as the Incorporated Research Institutions for Seismology and the U.S. Geological Survey. Contributions would certainly be welcome, as well as feedback from both students and instructors on integrating the materials into a course.
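To make the catalog tasks attributed to the utilities package more concrete (selecting events within a radius of a center point, estimating a Gutenberg–Richter b-value, and bootstrapping), the following is a minimal standard-library sketch. It is not the package's code: the function names, argument order, and dictionary keys are hypothetical, the actual signature of get_node_data is not given in this article, and the b-value uses the common maximum-likelihood estimator with a half-bin correction, which may differ from the method used in the repository.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def select_within_radius(events, center_lon, center_lat, radius_km):
    """Hypothetical helper: keep events whose epicenter is within radius_km of the center."""
    return [e for e in events
            if great_circle_km(e["lon"], e["lat"], center_lon, center_lat) <= radius_km]


def b_value(magnitudes, completeness_mag, bin_width=0.1):
    """Maximum-likelihood b-value with a half-bin correction for binned magnitudes."""
    mags = [m for m in magnitudes if m >= completeness_mag]
    mean_mag = sum(mags) / len(mags)
    return math.log10(math.e) / (mean_mag - (completeness_mag - bin_width / 2))


def bootstrap_b_value_std(magnitudes, completeness_mag, n_resamples=200, seed=0):
    """Standard deviation of b-values over resamples drawn with replacement.

    Assumes every magnitude is at or above completeness_mag so no resample is empty.
    """
    rng = random.Random(seed)
    estimates = [b_value(rng.choices(magnitudes, k=len(magnitudes)), completeness_mag)
                 for _ in range(n_resamples)]
    mean_b = sum(estimates) / len(estimates)
    return math.sqrt(sum((b - mean_b) ** 2 for b in estimates) / (len(estimates) - 1))


# Invented example catalog: longitude and latitude in degrees, plus magnitude.
events = [
    {"lon": 0.0, "lat": 0.5, "mag": 2.0},
    {"lon": 0.0, "lat": 0.2, "mag": 2.5},
    {"lon": 0.1, "lat": 0.1, "mag": 3.0},
    {"lon": 0.0, "lat": 2.0, "mag": 4.1},
]

nearby = select_within_radius(events, center_lon=0.0, center_lat=0.0, radius_km=100.0)
nearby_mags = [e["mag"] for e in nearby]

print(len(nearby), "events within 100 km")
print("b-value:", b_value(nearby_mags, completeness_mag=2.0))
print("bootstrap standard deviation:", bootstrap_b_value_std(nearby_mags, completeness_mag=2.0))
```

With only a handful of events the estimate is not meaningful; the example is sized to show the flow from spatial selection to a statistic and its resampled uncertainty, which is the kind of exploration the package is described as supporting.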

DATA AND RESOURCES

The Jupyter Notebooks presented in this article use openly available data from a variety of sources. These data sources are provided in the Notebooks, but we state them again here. Earthquake catalogs were obtained from the Advanced National Seismic System (ANSS) via http://www.ncedc.org/anss/catalog-search.html (last accessed October 2017) and the Southern California Earthquake Data Center (SCEDC) via http://service.scedc.caltech.edu/eq-catalogs/date_mag_loc.php (last accessed October 2017). Focal mechanisms were obtained from the Global Centroid Moment Tensor (CMT) catalog via http://www.globalcmt.org/CMTsearch.html (last accessed October 2017). Peak ground acceleration data are available from the National Aeronautics and Space Administration (NASA) with registration via http://sedac.ciesin.columbia.edu/data/set/ndh-earthquake-distribution-peak-ground-acceleration/data-download#close (last accessed October 2017). We also illustrate using the ArcGIS server services for high-resolution satellite images in a map format. For more information, see http://resources.arcgis.com/en/help/main/10.1/#/Approaches_for_publishing_services_with_ArcGIS/024100000002000000/ (last accessed October 2017).

The other resources are from the following websites: https://github.com/obspy/obspy/wiki#installation (last accessed January 2018), http://matplotlib.org/basemap/users/intro.html (last accessed October 2017), https://github.com/mnky9800n/Python-course-materials-for-seismology-students/ (last accessed January 2018), https://github.com/GenericMappingTools/gmt-python (last accessed October 2017), https://www.anaconda.com/download/ (last accessed October 2017), https://conda.io/docs/user-guide/tasks/manage-environments.html (last accessed October 2017), and https://github.com/mnky9800n/Python-course-materials-for-seismology-students/blob/master/example_notebooks/Testing%20Python%20(Cartopy%20version).ipynb (last accessed January 2018). The information about gnuplot can be found at T. Williams and C. Kelley, “Gnuplot 4.6: An interactive plotting program” available at http://gnuplot.sourceforge.net/ (last accessed April 2013).

ACKNOWLEDGMENTS

The authors would like to thank Emily Wolin for her constructive feedback. Part of this work was supported by the Seismology and Earthquake Engineering Research Infrastructure Alliance for Europe (SERA) project funded by the EU Horizon 2020 programme under Grant Agreement Number 730900.

REFERENCES

Beyreuther, M., R. Barsch, L. Krischer, T. Megies, Y. Behr, and J. Wassermann (2010). ObsPy: A Python toolbox for seismology, Seismol. Res. Lett. 81, no. 3, 530–533.
Caballero, M. D., J. B. Burk, J. M. Aiken, B. D. Thoms, S. S. Douglas, E. M. Scanlon, and M. F. Schatz (2014). Integrating numerical computation into the modeling instruction curriculum, Phys. Teach. 52, no. 1, 38–42.
Caballero, M. D., M. A. Kohlmyer, and M. F. Schatz (2012). Fostering computational thinking in introductory mechanics, AIP Conf. Proceedings, Vol. 1413, 15–18.
Igel, H. (2017). Computational Seismology: A Practical Introduction, Oxford University Press, London, United Kingdom.
McKinney, W. (2010). Data structures for statistical computing in Python, Proc. of the 9th Python in Science Conference, 51–56.
Pagani, M., D. Monelli, G. Weatherill, L. Danciu, H. Crowley, V. Silva, P. Henshaw, L. Butler, M. Nastasi, L. Panzeri, and M. Simionato (2014). OpenQuake engine: An open hazard (and risk) software for the global earthquake model, Seismol. Res. Lett. 85, no. 3, 692–702.
R Core Team (2013). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
Rubin, M. J. (2013). The effectiveness of live-coding to teach introductory programming, Proc. of the 44th ACM Technical Symposium on Computer Science Education, ACM, Denver, Colorado, 651–656.
Shen, H. (2014). Interactive notebooks: Sharing the code, Nature 515, no. 7525, 151.
van der Walt, S., S. Chris Colbert, and G. Varoquaux (2011). The NumPy array: A structure for efficient numerical computation, Comput. Sci. Eng. 13, 22–30, doi: 10.1109/MCSE.2011.37.
Vygotsky, L. S. (1980). Mind in Society: The Development of Higher Psychological Processes, Harvard University Press, Cambridge, Massachusetts.
Wiemer, S. (2001). A software package to analyze seismicity: ZMAP, Seismol. Res. Lett. 72, no. 3, 373–382.
Wing, J. M. (2006). Computational thinking, Comm. ACM 49, no. 3, 33–35.

John M. Aiken1
Centre for Computing in Science Education
Department of Physics
University of Oslo
Sem Sælands vei 24
Blindern, Oslo 0316
Norway
[email protected]

Chastity Aiken2
Department of Marine Geosciences - LAD
Ifremer
ZI de la pointe du Diable
CS 10070
29280 Plouzané, France
[email protected]

Fabrice Cotton3
GFZ German Research Institute for Geosciences
Telegrafenberg, 14473 Potsdam, Germany
fcotton@gfz-potsdam.de

Published Online 14 March 2018

1 Also at GFZ German Research Centre for Geosciences, Telegrafenberg, 14473 Potsdam, Germany.
2 Also at The University of Texas at Austin, Institute for Geophysics, 10601 Exploration Way, Austin, Texas 78758 U.S.A.
3 Institute for Earth and Environmental Sciences, University of Potsdam, Karl-Liebknecht-Straße 24/25, 14476 Potsdam, Germany.
