Towards Reproducible Scientific Benchmarking in Geophysics: the CREDO Toolkit

Patrick Sunter, Wendy Sharples, Steve Quenette, Wendy Mason, Jerico Revote
9 Nov 2010, eResearch Australasia Workshop
VPAC, Monash University

Computational Software Development Conversations

- Geophysicist: "How does the new Patch Recovery algorithm compare to the results in the Van Keken paper?"
- Or: "I'm sure the multigrid version of the code was running faster with a checkout from a couple of months back than it does with the latest one?"
- These conversations matter to both researchers and software engineers, and to the computational science we do as a group.


Why focus on benchmarks?

- What are benchmarks? Target problems with some sort of assessment metric (numerical, scientific, computational).
- In domains such as geophysics, with complex physics and arguments about modelling, benchmarking is a key part of the field.
- Benchmarks as 'boundary objects' (Blackstock et al., 2009).


Benchmarking, Testing & Reproducibility   Reproducibility motivations

(Fomel, Hennenfent, 2007):

  scientific integrity   Robust (& productive) software

development   Technology transfer

  Specifically re benchmarks of

a HPC App like Underworld:

  Database of what code could do,

Within the world of science, computation is now rightly seen as a third vertex of a triangle complementing experiment and theory. However, as it is now often practiced, one can make a good case that computing is the last refuge of the scientific scoundrel. – Randall LeVeque

when, where, how fast, with different algorithms   Records of what model ‘suites’ I ran and what pre and postprocessing necessary to get a certain result 4


Start from where you are: StGermain to Underworld

- 'Declarative' XML models combined with C code and plugins, inc. gLucifer visualisation
- Custom scripts to configure sets of runs and postprocessing (Gnuplot, VTK)
- Unit tests, plus some system tests in Perl scripts, hooked up to the Bitten C.I. system
- (See Quenette et al., 2007, 'Explaining StGermain', and Moresi et al., 2007, 'Computational approaches to studying non-linear dynamics of the crust and mantle')


(Example model files: BaseApps/RayleighTaylor.xml, BaseApps/SlabSubduction.xml)

Inspirations & early design decisions   Madagascar “an open-source software package for multidimensional data analysis and reproducible computational experiments.”   Pyre: “an extensible, object-oriented framework for specifying and staging complex, multi-physics simulations.”   Kepler: “designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines.”

  … re-use vs developing custom apps is always a trade-off   In this case light-weight integration of benchmarking, testing

and analysis was the goal …


… evolving CREDO workflow/architecture


System components

- Components:
  - Python package CREDO (a mix of OO and scripted code), with Matplotlib usage
  - CREDO scripts distributed in the code itself, connected to the SCons configuration tool, and thus to the Bitten C.I. system (report plugins)
  - Fairly 'agile' methodology of small targets, frequent releases, user feedback
  - Documentation, inc. A.P.I. & examples, generated using Sphinx
- Why Python? (vs. the 'swiss army chainsaw' of Perl):
  - Relates well to the goals of CREDO: explorative scripting, with (relatively) easy refactoring into more reproducible form
  - Scientific software ecosystem (MayaVI, VTK, SciPy, NumPy, Mystic)


What it looks like: system tests
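A minimal sketch of such a system-test script, assuming CREDO-style names (SysTestSuite, addStdTest, SysTestRunner, ReferenceTest) rather than quoting the exact API:

    # Sketch only: class and method names are assumed CREDO-style names,
    # not verified against the actual credo.systest API.
    from credo.systest import SysTestSuite, SysTestRunner, ReferenceTest

    suite = SysTestSuite("Underworld", "RegressionTests")

    # Each standard test wraps an Underworld XML model plus a pass/fail metric.
    suite.addStdTest(ReferenceTest, "RayleighTaylor.xml", runSteps=20)
    suite.addStdTest(ReferenceTest, "SlabSubduction.xml", runSteps=20)

    # The runner executes the models and records the results as XML.
    runner = SysTestRunner()
    runner.runSuite(suite)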


System tests (2)

- Declarative simple objects (e.g. "SysTestSuite") and lists also support reproducibility over time
- Known test suites can be modified post-hoc to try different solver options
- Basic knowledge of Python O.O. structures assists with this
- E.g. testing the standard Underworld system-test regression suite using the multigrid solver (see the sketch below)

(Diagram: Underworld TestSuite (30 SysTests), with multigrid mods applied: XMLs, solver opts, model parameters)
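A sketch of that post-hoc modification, continuing from the suite and runner in the earlier sketch; the attribute names (sysTests, mRun, modelInputFiles, solverOpts, paramOverrides) are assumptions standing in for the real ones:

    # Sketch only: attribute names below are assumptions used to illustrate
    # post-hoc modification of a known suite, not the verified CREDO API.
    mgXML = "MultigridSetup.xml"        # hypothetical extra XML enabling multigrid
    mgSolverOpts = "uzawa-mg.opt"       # hypothetical solver options file

    for sysTest in suite.sysTests:      # 'suite' and 'runner' from the earlier sketch
        sysTest.mRun.modelInputFiles.append(mgXML)    # extra XMLs
        sysTest.mRun.solverOpts = mgSolverOpts        # solver opts
        sysTest.mRun.paramOverrides["mgLevels"] = 3   # model parameters

    runner.runSuite(suite)              # re-run the regression suite with multigrid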

Towards Reproducible Scientific Benchmarking in Geophysics: the CREDO Toolkit

Benchmark: Rayleigh-Taylor convection


Benchmark: thermal convection

(Example input file: ThermalConvBenchmarkDim.xml)

Model results:
- Images
- Data

CREDO records (XML), sketched below:
- Test suite ran
- Models ran
- Test results
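To give a feel for what such an XML record could capture, here is a small sketch that writes one with Python's standard library; the element names and values are illustrative, not CREDO's actual record format:

    # Sketch only: element names and values are illustrative, not CREDO's schema.
    import xml.etree.ElementTree as ET

    record = ET.Element("sysTestSuiteRecord", name="ThermalConvBenchmark")

    model = ET.SubElement(record, "modelRun", inputFile="ThermalConvBenchmarkDim.xml")
    ET.SubElement(model, "param", name="Ra", value="1.0e4")

    result = ET.SubElement(record, "testResult", status="Pass")
    ET.SubElement(result, "metric", name="avgNusselt", value="4.9")  # illustrative value

    # Keeping this record alongside the images and data leaves a reproducible
    # trace of which suite ran, which models ran, and what the results were.
    ET.ElementTree(record).write("credo-record.xml")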


Design & development reflections   Trade-off between multiple package

goals not always easy   Creative tension b/w reproducibility & ‘explorability’   (Procedural script vs declarative objects)

  Fine-line regarding what to

expose & directly link to UW’s XML   Python’s lightweight package system facilitates separating ‘CREDO’ code from user modules   Documentation integrated closely into development 13


CREDO Future & eResearch bigger picture Toolkit improvement

Connections

  Current work

  Data & metadata

  Currently working on better

performance result tracking   Reporting of benchmarks (XLST, ReportLab)   Re-factor to work for computational

apps not based on StGermain   Computational cost, esp. multi-runs:   Grid (Grisu)   Parametrics -> Inversion (talk to

others)   Online file storage (e.g. ARCS) helpful also

  Output, XML records of both

models and system test / reproducibility   (AuScope grid) Python (w/ Numeric etc) should work well for data input conversion   Publishing & collaboration:

  Benchmark & analysis scripts

shareable, and potentially supports publications

  Reassess workflow tools (eg

Kepler) relation development


Thank you

https://www.mcc.monash.edu.au/trac/AuScopeEngineering/wiki/CREDO
https://www.mcc.monash.edu.au/credo-doc/

Questions?
