Research Corner with Damien Irving
BAMOS June 2017
3. A way to systematically report and handle errors in the data
Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community-agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks, and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.
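As a rough illustration of the kind of value check described above, the following Python sketch flags negative rainfall rates in a list of values. The function name and the sample numbers are hypothetical, and a real check would read the values from a CMIP data file rather than a hand-typed list.

```python
import math


def find_negative_rainfall(rates):
    """Return the indices of rainfall rates that are negative.

    Missing values (NaN) are skipped rather than flagged.
    """
    return [
        i for i, rate in enumerate(rates)
        if not math.isnan(rate) and rate < 0
    ]


# Hypothetical precipitation values, for illustration only
sample = [0.0, 2.5e-5, -1.0e-6, float("nan"), 4.0e-5]
print(find_negative_rainfall(sample))  # [2]
```

A check like this only finds the problem; deciding how to correct the flagged values still needs scientific judgement.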
The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it have been archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the furthest we got was a wiki that is not widely used.
In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and, for simple cases, submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that are available in the NCI data library).
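To make the flag, review and fix cycle a little more concrete, here is a minimal Python sketch of how submitted fixes might be recorded and applied. Everything in it is an assumption made for illustration (the `ErrorReport` structure, the file identifier and the choice of clipping negative values to zero); it is not an existing ESGF or NCI system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ErrorReport:
    """A user-submitted report about one problematic file (hypothetical)."""
    file_id: str                                  # made-up file identifier
    description: str                              # what is wrong with the data
    fix: Callable[[List[float]], List[float]]     # submitted code that fixes it
    reviewed: bool = False                        # set to True by a maintainer


def clip_negative(values):
    """Example fix: replace negative values with zero."""
    return [max(v, 0.0) for v in values]


def apply_reviewed_fixes(file_id, values, reports):
    """Apply, in order, every reviewed fix registered for this file."""
    for report in reports:
        if report.file_id == file_id and report.reviewed:
            values = report.fix(values)
    return values


reports = [
    ErrorReport("example_pr_file", "negative rainfall rates", clip_negative,
                reviewed=True),
]
print(apply_reviewed_fixes("example_pr_file", [1.0, -0.5, 2.0], reports))
# [1.0, 0.0, 2.0]
```

Keeping the `reviewed` flag separate from the submission mirrors the idea that maintainers, not submitters, decide which fixes are run on the shared data.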
4. Community maintained code for common tasks
Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Niño 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous Research Corner article (August 2015). I still love the idea in principle (and commend the Bureau and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing its personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.
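For readers who haven’t written one of these common tasks themselves, here is a minimal Python sketch of the Niño 3.4 calculation mentioned above. It assumes the sea surface temperature anomalies (i.e. with the climatology already removed) are supplied on a regular latitude/longitude grid as nested lists, and it uses the usual Niño 3.4 box of 5°S to 5°N, 170°W to 120°W. The function name and the tiny sample grid are hypothetical, and the temporal smoothing often applied to the index is left out.

```python
import math


def nino34_mean(sst_anom, lats, lons):
    """Area-weighted mean of sst_anom[lat][lon] over the Niño 3.4 box.

    The box is 5°S-5°N, 170°W-120°W (190°E-240°E). Grid cells are
    weighted by the cosine of their latitude.
    """
    total = 0.0
    weight_sum = 0.0
    for i, lat in enumerate(lats):
        if not -5.0 <= lat <= 5.0:
            continue
        weight = math.cos(math.radians(lat))
        for j, lon in enumerate(lons):
            # Map longitudes onto 0-360 so -150 and 210 are treated alike
            if 190.0 <= lon % 360.0 <= 240.0:
                total += weight * sst_anom[i][j]
                weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("no grid points fall inside the Niño 3.4 region")
    return total / weight_sum


# Hypothetical anomaly field (rows are latitudes, columns are longitudes)
lats = [-10.0, 0.0, 10.0]
lons = [180.0, 200.0, 220.0, 250.0]
sst_anom = [
    [9.0, 9.0, 9.0, 9.0],
    [9.0, 1.0, 2.0, 9.0],
    [9.0, 9.0, 9.0, 9.0],
]
print(nino34_mean(sst_anom, lats, lons))  # 1.5
```

Even a function this short involves choices (region bounds, weighting, longitude convention) that each researcher currently makes independently, which is where shared code would save effort.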
In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:
1. The NCI virtual desktops are ready to go and fit for purpose.
2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files.
4. A community maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time.