Intelligent Data Centres Issue 43 | Page 49

WE WERE TRYING TO LOOK AT WAYS OF REDUCING THE IMPACT OF CONGESTION AND THE ROCKPORT NETWORK LOOKED LIKE IT WAS A GOOD WAY OF ACHIEVING THAT.
END USER INSIGHT
interconnect topologies that provide a connectivity mesh in which every network endpoint can efficiently forward traffic to every other endpoint. The Rockport Switchless Network is a distributed, highly reliable, high-performance interconnect providing pre-wired supercomputer topologies through a standard plug-and-play Ethernet interface.
Alastair Basden, DiRAC / Durham University, Technical Manager of COSMA HPC Cluster, discusses the project in further detail and outlines the benefits Rockport has provided.
What does it mean to be the technical manager of COSMA HPC Cluster at Durham University – what does this require?
I manage an HPC system here. It’s part of the DiRAC Tier I national facility, operated by one of the UK research councils. DiRAC has four different HPC deployments around the country and the one that we have here in Durham is called COSMA. What it means is that I basically keep the system running on a day-to-day basis with the help of my team, and we carry out any repairs that might be required. We answer user queries, but we also plan for future extensions and consider where we need to take the system to meet the future needs of our researchers.
Can you tell us about some of the challenges the university was looking to overcome ahead of its work with Rockport?
An HPC system comprises three main elements. There’s the compute side of things, which is just lots of processors. There’s the storage side of things, which is where the data is saved: we do large simulations here – cosmology simulations – and we have to save all our data to storage. And then there’s the network fabric, which links the nodes together and couples the storage as well. One of the problems we find is that the network isn’t as fast as we’d like. Ideally, we would have an infinitely fast network, but we’re never going to get that. In particular, if there’s congestion on the network, things can slow down. So, if there are other jobs or simulations running that use lots of network bandwidth, it can affect our jobs. So we were trying to look at ways of reducing the impact of congestion, and the Rockport network looked like it was a good way of achieving that.
How do data centres play a part in the network operations of Durham University?
The university itself has several data centres – COSMA is hosted in one of those. COSMA is a self-contained unit used by the university researchers but also researchers from all over the world.
What would you prioritise in the design and build of a data centre – how does this contribute to uptime?
When we’re planning and building a new data centre, the two key requirements are the input power – there has to be enough power coming in from the grid – and the cooling facilities needed to get rid of the excess heat. Those are the two key things you have to think about when designing a data centre, along with physical floor space.
Once you’ve got your facility, your building and the infrastructure in place, you have to think about how you design the kit inside it – so, the type of compute, type of network and type of storage.

We are in a semi-fortunate position in that we don’t need 100% uptime (we’re not a mission-critical service). Our researchers know that sometimes the system will go down and they can plan their research around the regular maintenance periods, three times a year. This means we can get away with less redundancy than the systems cloud providers might run, which require 100% uptime. This in turn means we have cost savings we can invest in more compute nodes.
How do you maintain and operate the data centre once up and running – where does networking come into this?
There’s a team of us that keeps the system running. We do preventative maintenance and active maintenance – if something has failed, we go in and replace parts. We are also always on the lookout for security issues, so we do a lot of software patching. Networking is generally one of those things that we can leave be until something goes wrong. We regularly look at the status of the network and, if there are problems such as links that have died, we can go in and replace components, whether that be network cards or cables.
Do you have colocated data centres throughout the campus and if so, what benefits does this bring?
Around the university there are a small number of data centres and each has its own remit, its own job to do. There are two main data centres for doing high performance computing – we have bits of kit in both of those, but we’re fairly well isolated within a single data centre.
How do you expect your work with Rockport to evolve moving forward?
We’ve just installed the new Rockport fabric, which we finalised recently. We’re now starting to run codes on it and look at performance, how well it scales, etc. Moving forward, we’ll be looking at the cost competitiveness of the Rockport fabric – it is certainly in the running for being installed in these new systems.
www.intelligentdatacentres.com