From the moment that it was envisioned, the $1,000 decoded genome drew a sardonic rejoinder: Nice, but what about the $1 million analysis? Nonetheless, the decade following the initial decoding of the genome, which cost $3 billion, was characterized by a focus on streamlining the sequencing technology without much concern for interpreting all the data.
But now, with Illumina claiming to have hit the $1,000 mark with its HiSeq X Ten sequencer earlier this year, attention in genomics has turned to the several stages of analysis that must follow the generation of the sequencing data known as base calls.
As with sequencing technology up to now, the focus in data analytics is on cutting time and bringing down costs for researchers. But data analysis providers also have their eye on moving the analysis out of the research arena into the clinic and, eventually, into handheld devices that will give genomic analyses the look and feel of a modern-day blood test.
Currently, most of the action is going on in supercomputer banks that can store and process reams of data. But new tools—hardware, software, and Web-enabled services—have emerged to expedite the process. Researchers and equipment vendors are also working to develop means of integrating phenotypic data with genomics data in clinical analysis.
Tim Hunkapiller, president of the consulting firm Discovery Biosciences, sees genomics data analysis starting on a path to commoditization, and he sees new users on the path ahead—people with a practical application for the data.
“As things become less expensive, they become approachable in the real world to people who aren’t scientists,” Hunkapiller says. “In the research world, the concept was that data are always a good thing, and we can figure it all out as we go along.” In the clinic, however, doctors have little time for writing new algorithms as needed in an open-ended inquiry. “The data are viewed as useful, not as entertaining,” he observes.
Tools for data analysis are incorporating technologies from fields such as mobile communications. Early efforts focusing on computer hardware have evolved toward software that is easily updated and expanded with new functions as techniques advance. Now, with open access software available online, vendors and researchers are moving toward the cloud.
Jeffrey Reid, head of genome informatics at Regeneron Pharmaceuticals’ Regeneron Genetics Center (RGC), a genomics data research venture launched at the drug company in January, is working with DNAnexus, a software developer that offers a Web-based data analysis service.
“I think people have rightly derided the idea of the $1,000 genome, given the million-dollar interpretation,” says Reid, who recently moved to Regeneron after working as a research professor at Baylor College of Medicine’s Human Genome Sequencing Center. “It was true in my previous life at Baylor.” There, Reid’s group worked on clinical exome sequencing, which focuses on all the protein-coding genes in a genome. “We saw that one of the big costs and time bottlenecks was in the interpretation.”
But just as sequencing starts to become a commodity technology, Reid says, production analysis—the generation of variant calls against a reference genome for a particular individual—is “becoming something that ... well, let’s say it isn’t easy, but it’s straightforward.”
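To make the idea of a variant call concrete, here is a minimal Python sketch that compares one individual's sequence with a reference and reports the positions where they differ. It is a toy under stated assumptions, not the method used by Reid's group or any vendor: the two sequences are assumed to be already aligned and of equal length, only single-base substitutions are considered, and the names `Variant` and `call_variants` are hypothetical. Production pipelines also have to handle read alignment, base quality, insertions and deletions, and heterozygous sites.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str       # base in the reference genome
    alt: str       # base observed in the individual


def call_variants(reference: str, sample: str) -> List[Variant]:
    """Report single-base differences between two pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        Variant(i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Made-up ten-base sequences, for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
variants = call_variants(reference, sample)
for v in variants:
    print(f"pos {v.position}: {v.ref}>{v.alt}")
```

The output is a short list of differences rather than the full sequence, which is why variant calls are far more compact than the raw data they come from.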
RGC’s primary goal, according to Reid, is to better understand the genome through exome sequence analysis, find new genotype/phenotype correlations, and bring that understanding into Regeneron’s drug development pipeline. “So the thing that is most interesting to me scientifically,” Reid says, “is figuring out how to enable that downstream analysis and tie those data to electronic medical records.”
RGC and Geisinger Health System, a physician-led health care provider in Pennsylvania, have started a five-year program to sequence and genotype 100,000 patients and study long-term health outcomes. RGC is also working with DNAnexus to develop a data analysis infrastructure that can manage the genotypic and phenotypic analysis load.
DNAnexus, originally a software firm that offered its genomics product as an online service, is now marketing what is called a “platform as a service”—an Internet-based scaffold for configuring a mix of in-house, commercial, and open access software of the user’s choice.
Meanwhile, Reid’s former employer is also deploying the DNAnexus cloud platform. According to Narayanan Veeraraghavan, lead programmer scientist at Baylor’s Human Genome Sequencing Center, a cloud-based approach is best to accommodate the volumes of data and the fluctuations in activity common in data analysis.
Baylor, which operates one of the three largest U.S. government-funded sequencing centers, is nearly a year into what Veeraraghavan claims is the largest biomedical computational project ever undertaken on the Web. The Cohorts for Heart & Aging Research in Genomic Epidemiology (CHARGE) project is studying blood pressure through genomewide association studies of about 14,000 individuals.
“The cost of computing is not the big question,” Veeraraghavan says. “The big one is the cost of storage.” DNAnexus currently uses Amazon cloud computing, he notes, adding that Google introduced a cloud storage service this month—an indication that the supply of services for genomics data analysis is ramping up.
Omar Serang, who holds the title of chief cloud officer at DNAnexus, says product development at his firm has followed the increased need for what he terms elasticity, the ability of an infrastructure to accommodate changes in volume and software rapidly and at minimum cost. “DNAnexus is not selling applications but a place to run them,” Serang says.
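Elasticity in this sense can be illustrated with a short Python sketch of a scaling rule that sizes a pool of compute workers to the number of analysis jobs waiting. This is not DNAnexus's or Amazon's implementation. The function name `workers_needed` and all of its parameter values are assumptions chosen purely for illustration.

```python
import math


def workers_needed(queued_jobs: int, jobs_per_worker: int = 4,
                   min_workers: int = 0, max_workers: int = 100) -> int:
    """Return how many workers to run for the current queue, within limits."""
    if queued_jobs < 0 or jobs_per_worker <= 0:
        raise ValueError("queued_jobs must be >= 0 and jobs_per_worker > 0")
    # Round up so that a partly filled worker is still provisioned.
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, wanted))


for queued in (0, 10, 1000):
    print(queued, "queued jobs ->", workers_needed(queued), "workers")
```

With an empty queue the pool shrinks to the floor, so an idle lab pays for nothing, while a burst of sequencing runs scales the pool up to the ceiling.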
Among the companies actually developing commercial software for genomic bioinformatics is Bina Technologies, which for a year and a half has sold a product that supports secondary and tertiary data analyses. According to Chief Technology Officer Sharon Barr, Bina began by developing a device: a chip that took on the frontline analysis. The aim was to be faster than the open access software often used to do the job.
“What we realized is that the market doesn’t need that speed yet,” Barr says. “We are fast enough, but what is also important is accuracy and comprehensiveness. You achieve those by getting the latest and greatest tools.” Software can be upgraded, “but with hardware, it’s difficult to stay cutting-edge.”
In 2013, however, a new entrant, Edico Genome, staked out territory in secondary analysis with a computer chip it calls Dragen. Built on a field-programmable gate array (FPGA), the chip derives from signaling technology that Chief Executive Officer Pieter van Rooyen developed for cell phones at a company he started in his native South Africa in 2010. Finding that the chip worked in a device used in rural areas for diagnosing tuberculosis, van Rooyen soon saw an application in genomics data and set up shop in San Diego.
According to van Rooyen, Edico’s chip locks in on one task: the immediate vetting of huge stores of unprocessed data coming from sequencers. “The software solution is a minivan,” he says. “You can do a lot of general-purpose processing, but to optimize your driving, you need a machine that does one thing like a Formula One racing car.”
As for speed, van Rooyen claims the Dragen chip, in a low-end processor, can analyze data almost as fast as they are generated by gene sequencers.
“We use all the data from the sequencing machine, put it together, and what comes out is a relatively small pile of information,” van Rooyen says. “That file can be uploaded to the cloud for further analysis.”
The University of California, San Diego, School of Medicine, which houses a government-funded supercomputing center, has begun using the Dragen chip to do initial alignment of data from gene sequencing. “We had been using standard tools, available publicly, but it took a long time,” says Gene Yeo, associate professor in UCSD’s department of cellular and molecular medicine.
The Edico chip cuts what had been an 18-hour job of aligning data on a genome to half an hour, he says. There is a slight trade-off in accuracy, but Yeo figures this can be corrected as the lab gains more experience working with the chip.
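Taking Yeo's figures at face value, and treating "half an hour" as exactly 0.5 hours, the reduction works out to roughly a 36-fold speedup for the alignment step:

```latex
\[
\text{speedup} = \frac{t_{\text{before}}}{t_{\text{after}}}
               = \frac{18\ \text{h}}{0.5\ \text{h}} = 36
\]
```

The figure applies only to alignment, not to the downstream analysis that follows it.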
And Yeo’s group is interested in deploying the Dragen chip in a more expansive project—transcriptome sequencing, which aligns all the RNA molecules in a cell or population of cells. “We are interested in working to build up new ways of using the chip to extend beyond the genome,” he says.
Discovery Biosciences’ Hunkapiller says Edico’s claims of speed and accuracy on the front line of data analysis hold up well in practice and will likely establish the use of a chip in conjunction with downstream analysis software plus computing and storage done in the cloud. “What Edico brings to bear, potentially, is the ability to closely tie that computation to the sequencing process itself,” he says. “The data become less and less base calls and more variant calls. And that is what people care about.”
Van Rooyen says he is already looking beyond what current researchers want. He agrees with Hunkapiller that genome analysis tools will be in the hands of nonscientists before long. “Right now, sequencing the genome is where cell phones were in the 1980s,” he says.
In personal devices of the future, van Rooyen says, the chip will work in conjunction with software, as does the dedicated chip in most cell phones. He also notes that the FPGA chip-software combination is basic to big data computing that’s in development at the likes of Intel, Microsoft, and Google.
“We are productizing,” van Rooyen says. “The chip technology is fundamental to making the genome open to people on a daily basis.”