This article is an excerpt from GovLoop’s recent guide, “5 Cloud Trends to Watch in Government.”
At the end of 2017, the National Institutes of Health launched the NIH Data Commons Pilot Phase. It involves a consortium of data scientists, computer scientists, IT engineers, cloud service providers and biomedical researchers that is looking at ways to store and share biomedical data in the cloud. The goal is to accelerate biomedical discoveries while adhering to data privacy and security requirements. We spoke with Vivien Bonazzi, Project Leader of the NIH Data Commons Pilot Phase effort, to find out more. Bonazzi’s comments were lightly edited for length and clarity.
GOVLOOP: How did the pilot come to be?
BONAZZI: Biomedical data – there’s huge amounts of it. We’re talking terabytes and petabytes of data, and it’s pretty hard to use that data in the current storage that we have either at local researchers’ sites or even at NIH, so people have started using the cloud. The cloud does two things: You bring the tools to the data, so that you have all the data there and multiple people can come in and use that data. The second thing is shareability. You can have anybody potentially around the world in different geographical locations working together. If the data is all in one place, then you can actually work on it.
GOVLOOP: What does the pilot entail?
BONAZZI: One part is: How do we store data on the cloud and also pay for the compute? Another one is: What is a collection of services that we can operate over the data in the cloud so that we can make maximal utility for the folks who don’t necessarily have strong computational backgrounds? Over the next six months, we’re going to be testing those ideas and saying, “OK, if we have this data in the cloud, how do we make sure that we do have the right authentication [and] authorization system that does all the things that NIH needs to make sure that the data’s protected?” That’s one of the elements. Another one is this term FAIR, which is findable, accessible, interoperable and reusable or reproducible. The idea behind that is if you just put data in the cloud or anywhere and you can’t find it, you can’t use it and it’s not reproducible, then essentially, you’ve just got junk.
GOVLOOP: Would any of this be possible without cloud technology?
BONAZZI: We’ve been doing it without cloud technology for a while, but the killer for us has been the volume of data. You could argue that you could do this on local servers, but the problem is you’ll be looking at very large network servers. If you have five groups working on this and they all have their own five individual servers with replication of the data on each one of those, I would argue that’s not very cost-effective. It stops collaboration between researchers because they can’t work on each other’s systems. The cloud allows the unification point, where you can have potentially one copy of data and multiple researchers with approved authority to use that data. And they don’t have to maintain the IT systems associated with it in the local services.