Research Corner with Damien Irving
BAMOS March 2017
While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common or simple tasks can often require a fair bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren't as flexible (they can't do everything the core stack can), but they can do common tasks with far less effort…
Generic additions
Let's first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with numpy), the actual row and column headings can be used. That means Fred's height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with scipy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
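To make the labelled array idea concrete, here is a minimal sketch. The medical dataset is entirely made up for illustration (the names, heights and weights are assumptions, not data from any real source); only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical medical dataset: every name and value here is invented.
data = pd.DataFrame(
    {"height": [180.0, 165.0, 172.0], "weight": [82.0, 60.0, 70.0]},
    index=["Fred", "Mary", "Ahmed"],
)

# Label-based lookup: row label first, then column label.
fred_height = data.loc["Fred", "height"]

# The numpy-style equivalent means remembering that Fred is row 0
# and height is column 0.
fred_height_by_position = data.to_numpy()[0, 0]

# Common statistics are one-liners on a labelled column.
mean_height = data["height"].mean()

print(fred_height, fred_height_by_position, mean_height)
```

Both lookups return the same value, but only the first one still reads sensibly if the rows or columns are later reordered.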
One of the limitations of pandas is that it's only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the numpy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
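As a rough sketch of what that looks like in practice, the example below builds a small synthetic array and selects from it by coordinate value. The variable name, grid and values are assumptions made up for illustration, and the height dimension is left out to keep things short.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic coordinates (assumed for illustration): southern latitudes are
# negative, longitudes are degrees east.
lats = [-30.0, -20.0, -10.0]
lons = [40.0, 50.0, 60.0]
times = pd.date_range("2015-04-26", periods=3)  # 26, 27 and 28 April 2015

# Placeholder values 0..26 arranged as (time, lat, lon).
values = np.arange(27, dtype=float).reshape(3, 3, 3)

temperature = xr.DataArray(
    values,
    coords={"time": times, "lat": lats, "lon": lons},
    dims=("time", "lat", "lon"),
    name="temperature",
)

# Select by actual coordinate values rather than by position.
point = temperature.sel(time="2015-04-27", lat=-20.0, lon=50.0)
point_value = point.values.item()

# The numpy equivalent means knowing each coordinate sits at index 1.
point_by_position = values[1, 1, 1]

# Dimensions can be referred to by name too, e.g. averaging over time.
time_mean = temperature.mean(dim="time")

print(point_value, point_by_position, time_mean.shape)
```

With real data the array would more typically come from a netCDF file (e.g. via `xr.open_dataset`) than be built by hand.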
Discipline-specific additions
While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the Met Office have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.
In terms of choosing between xarray and iris, some people like the slightly more weather/climate-centric experience offered by iris, while others don't like the restrictions that places on their work (e.g. to use iris your netCDF data files have to be CF compliant or close to it) and prefer the generic xarray experience. Either way, they are both a vast improvement on the netCDF4/numpy/matplotlib experience.
Simplifying data exploration
While the plotting functionality associated with xarray and iris speeds up the process of visually exploring data (as compared to matplotlib), making minor tweaks to a plot or iterating over multiple time steps is still rather cumbersome. In an attempt to overcome this issue, a library called holoviews was recently released. By using matplotlib and bokeh under the hood, it allows for the generation of static or interactive plots where tweaking and iterating are super easy (especially in the Jupyter Notebook, which is where more and more people are doing their data exploration these days). Since holoviews doesn't have support for geographic plots, geoviews has been created on top of it; it incorporates cartopy and can handle iris or xarray data arrays.
Sub-discipline-specific libraries
So far we've considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation,