Research Corner with Damien Irving
BAMOS June 2017
3. A way to systematically report and handle errors in the data
Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community-agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks, and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.
The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used.
In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and, for simple cases, submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that are available in the NCI data library).
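To make the flag, review and fix cycle a little more concrete, here is a minimal Python sketch of how such a process might be organised. Everything in it is an assumption made for illustration: the `ErrorReport` and `FixRegistry` names, the file identifier and the choice to clip negative rainfall to zero are all hypothetical, not an existing ESGF or NCI tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "fix" takes a list of data values and returns a corrected list.
Fix = Callable[[List[float]], List[float]]


@dataclass
class ErrorReport:
    """One community-submitted report about a problematic file (hypothetical schema)."""
    file_id: str            # identifier of the flagged file (made-up format)
    description: str        # what the submitter thinks is wrong
    fix: Fix                # submitted code that corrects the problem
    accepted: bool = False  # becomes True once a maintainer has reviewed the fix


class FixRegistry:
    """Stores error reports and applies only maintainer-accepted fixes."""

    def __init__(self) -> None:
        self._reports: Dict[str, List[ErrorReport]] = {}

    def flag(self, report: ErrorReport) -> None:
        """Record a report so the next researcher can see the file is suspect."""
        self._reports.setdefault(report.file_id, []).append(report)

    def accept(self, file_id: str, index: int) -> None:
        """Mark one report as reviewed and approved by a maintainer."""
        self._reports[file_id][index].accepted = True

    def is_flagged(self, file_id: str) -> bool:
        return file_id in self._reports

    def corrected(self, file_id: str, values: List[float]) -> List[float]:
        """Return the values with every accepted fix applied, in submission order."""
        for report in self._reports.get(file_id, []):
            if report.accepted:
                values = report.fix(values)
        return values


def clip_negative_rainfall(values: List[float]) -> List[float]:
    """Illustrative fix: set physically impossible negative rainfall rates to zero."""
    return [max(v, 0.0) for v in values]


registry = FixRegistry()
registry.flag(ErrorReport(
    file_id="pr_example_model_historical",  # hypothetical file identifier
    description="Small negative rainfall rates",
    fix=clip_negative_rainfall,
))

raw = [0.0, 1.5, -0.2, 3.0]
before_review = registry.corrected("pr_example_model_historical", raw)
registry.accept("pr_example_model_historical", 0)
after_review = registry.corrected("pr_example_model_historical", raw)
```

The point of the design is that flagging and fixing are separate steps: a file shows up as suspect as soon as someone reports it, but the data are only altered once a maintainer has accepted the submitted code.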
4. Community-maintained code for common tasks
Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Niño 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous Research Corner article (August 2015). I still love the idea in principle (and commend the Bureau and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing its personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.
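As an illustration of the kind of common task that gets rewritten over and over, here is a minimal sketch of a Niño 3.4 calculation in plain Python. It is not taken from the CWSLab workflow tool. It assumes a regular latitude/longitude grid with longitudes on a 0 to 360 scale, and it uses a single climatology field as the baseline rather than the month-by-month climatology a real analysis would normally use. The function names are hypothetical.

```python
import math
from typing import List, Sequence

# Nino 3.4 region: 5S-5N, 170W-120W (i.e. 190E-240E on a 0-360 longitude grid).
LAT_MIN, LAT_MAX = -5.0, 5.0
LON_MIN, LON_MAX = 190.0, 240.0

Grid = Sequence[Sequence[float]]  # grid[i][j] is the value at lats[i], lons[j]


def nino34_mean(sst: Grid, lats: Sequence[float], lons: Sequence[float]) -> float:
    """Area-weighted mean SST over the Nino 3.4 box for one time step."""
    weighted_sum = 0.0
    weight_total = 0.0
    for i, lat in enumerate(lats):
        if not LAT_MIN <= lat <= LAT_MAX:
            continue
        # Cells shrink towards the poles, so weight each one by cos(latitude).
        weight = math.cos(math.radians(lat))
        for j, lon in enumerate(lons):
            if LON_MIN <= lon <= LON_MAX:
                weighted_sum += weight * sst[i][j]
                weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no grid cells fall inside the Nino 3.4 box")
    return weighted_sum / weight_total


def nino34_index(sst_series: Sequence[Grid], climatology: Grid,
                 lats: Sequence[float], lons: Sequence[float]) -> List[float]:
    """Regional-mean SST minus regional-mean climatology, for each time step."""
    baseline = nino34_mean(climatology, lats, lons)
    return [nino34_mean(sst, lats, lons) - baseline for sst in sst_series]


# Tiny made-up grid: only the equator row and the 200E and 220E columns are in the box.
lats = [-10.0, 0.0, 10.0]
lons = [180.0, 200.0, 220.0, 250.0]
climatology = [[27.0] * 4 for _ in lats]
warm_month = [
    [27.0, 27.0, 27.0, 27.0],
    [27.0, 28.0, 29.0, 27.0],
    [27.0, 27.0, 27.0, 27.0],
]
index = nino34_index([climatology, warm_month], climatology, lats, lons)
```

Even for a calculation this small there are choices (box edges, area weighting, baseline period) that each researcher tends to make slightly differently, which is part of the appeal of a shared, reviewed implementation.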
In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:
1. The NCI virtual desktops are ready to go and fit for purpose.
2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files.
4. A community-maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time.