BAMOS Vol 30 No.1 2017 | Page 25

Research Corner with Damien Irving

BAMOS March 2017
While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means that relatively common or simple tasks can often require a fair bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible (they can’t do everything the core stack can), but they can do common tasks with far less effort …
Generic additions
Let’s first consider the generic additional libraries, that is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with numpy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with scipy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
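To make the labelled array idea concrete, here is a minimal sketch that looks up the same value by position with numpy and by label with pandas. The medical dataset, the names and the numbers are all made up for illustration; only the pandas calls themselves (`DataFrame`, `.loc`, `.mean`) are real.

```python
import numpy as np
import pandas as pd

# Hypothetical medical dataset (values invented for illustration):
# each row is a person, each column is a characteristic.
values = np.array([[181.0, 79.5],
                   [165.0, 61.2],
                   [172.5, 70.1]])

# numpy: elements are addressed by position, so we have to remember
# that Fred is row 0 and height is column 0.
fred_height_numpy = values[0, 0]

# pandas: attach the row and column headings to the same values.
data = pd.DataFrame(values,
                    index=['Fred', 'Mary', 'Ravi'],
                    columns=['height', 'weight'])

# .loc is the label-based accessor: row label first, column label second.
fred_height_pandas = data.loc['Fred', 'height']

# Common statistics come built in and are also addressed by label.
mean_height = data['height'].mean()
```

Both lookups return the same number, but the pandas version keeps working if rows or columns are later reordered, which is where most of the saving in code (and in bugs) comes from.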
One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the numpy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
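The same idea in four dimensions might look something like the sketch below. The grid, the variable name `temperature` and the data values are placeholders invented for illustration (a real workflow would usually read them from a netCDF file); the point is simply that `.sel` lets you ask for 20 South, 50 East, 500 hPa on 2015-04-27 directly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates for a tiny 4-D grid (illustrative only).
times = pd.date_range('2015-04-26', periods=3)   # 26, 27 and 28 April 2015
levels = [850, 500]                              # pressure levels in hPa
lats = [-30.0, -20.0, -10.0]                     # negative = degrees South
lons = [40.0, 50.0, 60.0]                        # degrees East

# Placeholder data: a simple counter, so every element is easy to identify.
shape = (len(times), len(levels), len(lats), len(lons))
values = np.arange(np.prod(shape), dtype=float).reshape(shape)

# Attach the coordinate labels to the raw numpy array.
temperature = xr.DataArray(
    values,
    coords={'time': times, 'level': levels, 'lat': lats, 'lon': lons},
    dims=['time', 'level', 'lat', 'lon'],
    name='temperature',
)

# numpy: four positional indices that have to be looked up or remembered.
point_numpy = values[1, 1, 1, 1]

# xarray: select by the actual coordinate values instead.
point = temperature.sel(time='2015-04-27', level=500, lat=-20.0, lon=50.0)
point_value = point.item()
```

Because the selection is made by coordinate value rather than position, the xarray line does not need to change if the file covers a different region or period, as long as the requested point is present.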
Discipline-specific additions
While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the Met Office have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.
In terms of choosing between xarray and iris, some people like the slightly more weather/climate-centric experience offered by iris, while others don’t like the restrictions that this places on their work (e.g. to use iris your netCDF data files have to be CF compliant or close to it) and prefer the generic xarray experience. Either way, they are both a vast improvement on the netCDF4/numpy/matplotlib experience.
Simplifying data exploration
While the plotting functionality associated with xarray and iris speeds up the process of visually exploring data (as compared to matplotlib), making minor tweaks to a plot or iterating over multiple time steps is still rather cumbersome. In an attempt to overcome this issue, a library called holoviews was recently released. By using matplotlib and bokeh under the hood, it allows for the generation of static or interactive plots where tweaking and iterating are super easy (especially in the Jupyter Notebook, which is where more and more people are doing their data exploration these days). Since holoviews doesn’t have support for geographic plots, geoviews has been created on top of it (it incorporates cartopy and can handle iris or xarray data arrays).
Sub-discipline-specific libraries
So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation,