Issue Date: September 14, 2015
Dealing With The Life Sciences Data Stampede
Information technology is a field that thrives on creating its own language. It’s a lingo of clichés and broad generalities in which purveyors of new IT for science and business routinely speak of their products, whether hardware, software, or services, as “solutions” for a particular “space.” Despite the pressure to come up with new iterations of computing products, little happens to refresh the terms, such as “paradigm shift,” that surround them.
Now and then, however, something new or unusual comes along. Witness “elephant flow,” which describes around-the-clock transmission of huge quantities—terabytes—of data. Elephant flow is endemic to the financial and social media sectors. It also has become a force in health care, particularly in drug discovery, where digital laboratory instruments automatically generate enormous volumes of data. Much, but not all, of these data are crucial to the discovery and development of new drugs.
The stampede of information did not show up overnight, certainly, and large research labs have been wrestling for years with the technical and work process challenges of storing, analyzing, and sharing data. Hardware and software vendors also have been hard at work designing products that can handle an indefinite volume of data while providing analysis and storage maintenance.
But as the volume of life sciences data reaches elephantine proportions, change is accelerating. Labs are investigating, and investing in, technology that may fundamentally alter both their protocols for handling data and their core research practices.
At trade shows and in interviews, data managers across the life sciences sector say they are itching to convert basic, if quite large, data storage to data networks that combine storage and analytical intelligence. Some seek to decentralize data monitoring by distributing analytical intelligence on a network capable of indefinite expansion. Others are contemplating a shift from file-based storage to a more Web-surfable technique called object storage that categorizes data according to tagged attributes. Meanwhile, decisions are being made on achieving an optimum balance between data stored on-site and in the cloud.
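The object storage idea mentioned above can be illustrated with a minimal Python sketch: each object carries tagged attributes and is retrieved by querying those tags rather than by following a directory path. The class name, tag keys, and sample values here are hypothetical, chosen only for illustration; real object stores expose their own interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy in-memory object store: objects are found by tags, not paths."""
    _objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes, **tags: str) -> None:
        # Store the payload together with its descriptive tags.
        self._objects[key] = (data, dict(tags))

    def find(self, **wanted: str) -> list:
        # Return the sorted keys of objects whose tags match every requested value.
        return sorted(
            key for key, (_, tags) in self._objects.items()
            if all(tags.get(k) == v for k, v in wanted.items())
        )


# Hypothetical sample objects and tags.
store = ObjectStore()
store.put("run-001", b"...", instrument="sequencer", project="oncology")
store.put("run-002", b"...", instrument="microscope", project="oncology")
store.put("run-003", b"...", instrument="sequencer", project="neuro")

sequencer_runs = store.find(instrument="sequencer")
oncology_sequencer_runs = store.find(instrument="sequencer", project="oncology")
```

The point of the sketch is that a lookup names what the data are, not where they sit, which is what makes such stores easier to search as they grow.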
“The real driver for doing all this nasty slog work is that the cost of storage is such that we can no longer provide scientists with infinite pools of resources,” says Chris Dagdigian, cofounder of BioTeam, a Middleton, Mass.-based IT consultancy. “It used to be that rather than understanding what’s going on, it was far cheaper and operationally easier just to make the storage bigger and not worry about management.” This is no longer working, despite the steady drop in the price of storage, because of the volume of data involved.
George Vacek, business director for life sciences at DataDirect Networks (DDN), a data storage technology supplier in Santa Clara, Calif., agrees, adding that his firm has had “dozens” of discussions over the summer with pharmaceutical companies looking for greater capacity and control in data storage. The technology to make these changes has been available for some time, he says, and resistance to deploying it is now breaking down.
“The sector is hitting the transition point where companies feel like deer caught in the headlights as they see this data volume coming down on them,” Vacek says.
And those data are not coming only from genomics research, which is recognized as a key culprit. Vacek notes that high-resolution microscopy and imaging systems, some allowing three-dimensional monitoring over long periods of time, are generating data that take up huge amounts of storage space.
Other system and service vendors also see the life sciences sector at a tipping point. For instance, the scientific publisher Elsevier has developed a business in consulting and data management. Timothy Hoctor, the firm’s vice president of professional services for life sciences research, says the focus in the sector is shifting from accommodation to analysis.
Data storage is simple, he says. “You can take a database and dump things into it.” Accessing and analyzing data from huge stores, however, is becoming a serious challenge.
“What is changing now is the vastness of the data and the differentiated data types that are coming in,” Hoctor says. Genomics data, for example, reside in conjunction with data on patients involved in clinical trials. “Not only are more and more data generated, but there are more and more potential uses for those data,” he says.
Vendors are beginning to place greater emphasis on data analysis in designing large-volume storage. Seattle-based Qumulo has introduced a system that distributes analysis software across a network of file storage modules.
The founders of the three-year-old company were credited with innovating a technique called scale-out network attached storage (NAS) at a previous company, Isilon, which is now owned by the data storage firm EMC. Scale-out NAS is composed of distributed data storage clusters that share an analysis core. Scale-out NAS is currently the state of the art in file-based data storage, a practice that traditionally employed a stack of data storage units to which new capacity couldn’t easily be added.
“They decided to get the band back together,” says Brett Goodwin, vice president of marketing at Qumulo. The company developed what it calls the first “data aware” scale-out NAS product, the Qumulo Scalable File System. Goodwin says the product is available as software that can be installed at each data storage cluster, networking both storage capacity and local data analysis. The net result, he claims, is improved data management and lower-cost storage.
Often with standard scale-out NAS, “the researcher calls the storage manager and says that storage is running slow and asks why,” Goodwin says. “Traditionally, the storage administrator had no ability to answer that question. It could take days to resolve.” Qumulo’s system will allow data managers to identify the process that is slowing the network and determine whether it needs to be running.
Traditional scale-out NAS has advanced as well with products such as InsightIQ, a central data intelligence system from EMC. The product provides a central window into all of the data storage in a scaled-out cluster.
Goodwin says Qumulo is working on its next move—a cloud-compatible version of its product—noting that most large research enterprises will have data networks that coordinate in-house and external cloud storage.
BioTeam’s Dagdigian says the data-aware approach to file storage developed by Qumulo will give researchers “a lot more metrics about where files are, how they are accessed, and who is accessing them, as well as which files have not been touched in a long time. There are a lot of people in the industry, myself included, who simply on the basis of the pedigree of the founders of Qumulo are paying very close attention.”
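One of the metrics Dagdigian mentions, which files have not been touched in a long time, can be approximated on an ordinary file system. The Python sketch below is not Qumulo's product or API; it is a generic illustration that walks a directory tree and reports files whose last-access time is older than a chosen cutoff. The function name and the cutoff are assumptions made for the example.

```python
import os
import time
from typing import List, Optional, Tuple


def find_cold_files(root: str, max_idle_days: float,
                    now: Optional[float] = None) -> List[Tuple[str, float]]:
    """Return (path, idle_days) for files not accessed within max_idle_days."""
    now = time.time() if now is None else now
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_access = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - last_access) / 86400  # seconds per day
            if idle_days > max_idle_days:
                cold.append((path, idle_days))
    # Longest-idle files first.
    return sorted(cold, key=lambda item: item[1], reverse=True)
```

A call such as `find_cold_files("/data/projects", 365)` (a hypothetical path) would list files idle for more than a year. Access times can be unreliable on file systems mounted with access-time updates disabled, and a full tree walk is slow on very large stores, which is part of the appeal of having such metrics built into the storage layer.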
The University of Utah Scientific Computing & Imaging (SCI) Institute, a greenhouse for research imaging software development, is preparing to convert its ample data storage to the Qumulo system. “The system we are using now has been great,” says Nick Rathke, assistant director of IT at the SCI Institute. “We have had it for four or five years with virtually no downtime. The problem is that we don’t have real-time analytics. Also, you have to buy storage in such large chunks that it’s a real financial bottleneck for us.”
Adding smaller increments of storage using the Qumulo system will reduce the cost of storage capacity from about $700 per terabyte to about $400. And the SCI Institute will certainly be adding storage capacity.
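As a rough illustration of what that price difference means, the short Python sketch below applies the two per-terabyte figures quoted above to a purchase of 40 terabytes. The 40-terabyte purchase size is a hypothetical chosen for the example, not a figure from the SCI Institute's budget.

```python
def storage_cost(terabytes: float, dollars_per_tb: float) -> float:
    """Cost of buying the given capacity at a flat per-terabyte price."""
    return terabytes * dollars_per_tb


OLD_RATE = 700.0    # approximate dollars per terabyte, current system
NEW_RATE = 400.0    # approximate dollars per terabyte, smaller increments
capacity_tb = 40.0  # hypothetical purchase size

old_cost = storage_cost(capacity_tb, OLD_RATE)
new_cost = storage_cost(capacity_tb, NEW_RATE)
savings = old_cost - new_cost
savings_fraction = savings / old_cost  # share of the old cost that is saved
```

Under these assumptions the same capacity costs $16,000 instead of $28,000, a reduction of roughly 43%.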
The center, a federally funded software development institute, focuses on medical imaging and currently provides 32 different software packages. It is staffed by a handful of professional developers and hundreds of grad students at various departments at the university, who, Rathke explains, get their Ph.D.s and leave their code behind for the SCI Institute to support and maintain.
The institute’s data are mounting in tandem with the advances in digital research imaging instrumentation. “Part of our storage problem now is that our faculty members are doing bigger and bigger projects,” Rathke says. “Data sets that a few years ago were a few hundred gigabytes are now a terabyte or tons of terabytes of data.”
One source of the blowup is higher-resolution scanning. “We had one researcher here who did her Ph.D. three or four years ago tracing neurons across a rabbit retina. The rabbit retina set was in the 14–20 terabyte range,” Rathke says. Under new researchers, “that same data set has grown to well over 40 terabytes because they keep adding in the slices, and the tissue samples keep getting thinner and thinner.”
As a federally funded institute, the SCI Institute faces budget as well as storage space limitations, Rathke notes. And cloud storage is not a cost-effective option given the high level of engagement with the data in the labs of the SCI Institute. A more disciplined approach to managing data is thus required.
“Five or 10 years ago, storage was a big black box,” Rathke says. “We’d say project X, you get X amount of storage, and have at it. Now you have to manage that storage because the projects are getting bigger and bigger and bigger.”
The traditional routine of removing five-year-old data no longer makes room for the data payload of new research. The SCI Institute is looking to better monitor its data usage with a strategy for adding new storage capacity. “Our faculty says this is where Qumulo is important to us and is likely to be more and more important to us in the future,” Rathke says.
At the Broad Institute, advances in genomics sequencing have likewise created strains on data storage and management. “Seven or eight years ago, DNA sequencing experienced two simultaneous changes that each increased the volume of data we were seeing by three orders of magnitude,” says Christopher Dwan, acting IT director at the institute. “It became 1,000-fold faster to generate DNA data. It also became 1,000-fold cheaper by the base pair.”
The technology available to store and manage the data has also advanced. “The blocking and tackling of making a file system that can store a petabyte is kind of solved,” Dwan says. “But we have to innovate a lot in what we keep and how we keep it.”
What keeps Dwan up at night is data organization. “I have a standing joke for when people come to me and ask how to spend less money storing data. I say store less data. We all laugh, but then I ask, ‘Do you know what you’re storing? Do you know what all is in there, and do you really need it?’ ”
Anastasia Christianson, the head of IT for translational R&D at Bristol-Myers Squibb, says her main priority is to ensure that scientists are able to do what they need in the easiest and fastest way possible. Mohammad Shaikh, director of scientific computing services at BMS, adds that he is especially focused on speed. Like their counterparts at other major drug companies, they both face a huge ramp-up in data as they coordinate efforts to deliver a storage and analysis infrastructure for BMS’s research labs.
And like other large pharma companies, BMS is at a “turning point,” Shaikh says, with data generated internally and externally at an increasing rate.
“The velocity of those data is such that it is not possible to host it internally,” he says. “We are looking at many options, and cloud storage is the most promising.” The company, which was one of the first users of the Isilon scale-out NAS, has investigated the Qumulo system, Shaikh says. But much of the work at BMS’s data center focuses on how to use the technology rather than which technology to use.
Determining what can be stored in the cloud, for example, is a key consideration. Christianson says a primary determinant is how actively researchers will be accessing files. Data generated from lab instrumentation that is likely to be investigated right away are best stored internally, she says. External data and files shared in collaborations are good candidates for cloud storage.
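The placement logic Christianson describes can be sketched as a simple rule keyed to how actively a data set is accessed. The Python sketch below is only an illustration of that idea, not BMS's policy: the `Dataset` fields, the reads-per-week measure, and the threshold of five are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    expected_reads_per_week: int  # hypothetical measure of how active the data are


def choose_tier(ds: Dataset, active_threshold: int = 5) -> str:
    """Place actively accessed data on-site and everything else in the cloud."""
    if ds.expected_reads_per_week >= active_threshold:
        return "on-site"
    return "cloud"


# Hypothetical examples: fresh instrument output versus shared collaboration files.
fresh_instrument_run = Dataset("instrument-run", expected_reads_per_week=20)
shared_collab_files = Dataset("collaboration-share", expected_reads_per_week=1)

placements = {
    ds.name: choose_tier(ds)
    for ds in (fresh_instrument_run, shared_collab_files)
}
```

A real policy would weigh more than one factor, such as data transfer charges and regulatory constraints, but access frequency is the primary determinant named here.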
Christianson emphasizes that pharmaceutical IT departments, though taxed, are not in crisis mode over data storage. “I think our systems are continually evolving and have been,” she says. The relationship between data system management and IT oversight at the lab level is also evolving.
“Our scientists are a lot more technologically savvy these days than they were five or 10 years ago,” she says, “and our IT professionals, the folks well versed in the technical aspects of data storage and high-performance computing, are much more scientifically aware and business aware.”
Others agree that the relationship between data and research IT management is key. “You definitely need to understand the domain to implement and manage a data management policy,” BioTeam’s Dagdigian says.
“It doesn’t matter if the data manager comes from an IT or science background because the data management can be learned,” Dagdigian says. “What is essential is that the authority for data management stays with the scientists. In my world it is completely inappropriate for an IT person to determine where to store a piece of scientific information, how to store it, or when to delete it.”
As the life sciences research world takes on the data elephant in the room, Vacek of DDN says the relationship between lab and data managers is critical to success. “I think it’s true that the best research is done by organizations that have people who are strong on both the science and the IT,” he says. “But if you can extend your capabilities in IT infrastructure, you can solve problems you weren’t able to solve before, and that puts you in the lead on the research side.”
- Chemical & Engineering News
- ISSN 0009-2347
- Copyright © American Chemical Society