Genomic researcher tames container data problem to speed new therapies

DNA sequence

If you think you’ve got big data challenges, consider the plight of the Translational Genomics Research Institute. The Phoenix-based nonprofit manages and shares data on thousands of patients across a network of academic, government, clinical and corporate partners. With each genome sequence consuming an average of 6 terabytes of data, speed and reliability are critical.

TGen analyzes this data to develop diagnostics, prognostics and therapies for cancer, neurological disorders, diabetes and other complex diseases. Its high-performance computing environment has more than five petabytes of data on site, and it expects that volume to grow dramatically as the cost of sequencing continues to plummet.

Not surprisingly, the institute is constantly seeking to simplify complex processes. That’s why it latched onto containers two years ago. Containers, which are essentially portable, simplified software wrappers for applications, promised an easier way to exchange applications and workflows with partners, without fretting about hardware platforms and operating systems.

“I saw that it would be possible to use technologies like containers to make workflows static and easily reproducible,” said Chief Information Officer James Lowey (pictured).

The persistence problem

James Lowey, Vice President of Technology, TGen

Lowey isn’t afraid to be a first mover. He has already built two supercomputers that are ranked in the top 500 in the world, and he helped TGen fashion its own technology to transfer multi-terabyte files. But he didn’t have an immediate answer to a structural problem of containers: lack of data persistence.

That’s because containers were initially intended to be used for application development, not for production workloads. There was no reason to bind data stores closely to application logic in that scenario, so when a container shuts down, the data it’s using disappears. Workarounds exist, but they’re labor-intensive and don’t scale well. Many vendors are working to solve the persistence problem, which one survey reported last year is now the No. 1 barrier to container adoption. But it’s still up to users to pick through the options.

Containers considerably simplified the process of sharing workflows with partners, but data still had to be handled separately. “I figured it would be great if we could package workflows into a single package, drop it into a high-speed data transfer server and run it with one command,” Lowey said. “But if a collaborator has a big file system, we have to send the container and data separately. If they don’t match up, you have to troubleshoot.” With multi-terabyte file sizes, that’s like searching for a DNA strand in a haystack.

TGen found its solution in Portworx Inc., a maker of data persistence software for containers. Portworx pools existing servers and storage resources, both on-premises and in the cloud, into an integrated storage cluster that provides data services directly to containers.

Making storage management easy

TGen found the Portworx solution integrated easily with its GlusterFS network file system running on Dell EMC Isilon network-attached storage. It also adapted well to frequent change. “We bring clusters online and offline constantly. We were emphatic with the Portworx people that things had to be easy,” Lowey said.

So far, they have been. Connecting data to containers is as simple as “spinning up Portworx when you start a container,” Lowey said. “You batch it up into a tarball [compressed file] and send it.” TGen has deployed Portworx across multiple hardware platforms, including blade servers, micro-blade servers and standard “pizza box” servers.

Lifting the data portability burden should pay off as TGen’s scope expands. The institute is moving from chemistry-based testing into silicon sequencing in an effort to make tests easier to analyze and meet compliance requirements. Data volumes will continue to grow; lab tests must be kept for a minimum of seven years and up to 25 years in some cases.

By storing containers with their associated data, the institute can more easily tap into archived information for analytics. Because containers are portable, TGen doesn’t have to worry about obsolescence. “Chemistry doesn’t change but IT does,” Lowey said.

The benefit of binding data to containers lies in “knowing that the data will be secure and we can reproduce results on demand,” Lowey said. “We’re looking at managing fully containerized workflows instead of keeping raw data and doing everything from scratch.”

DNA sequence image courtesy of TGen