Partnership applies deep learning to very big data
GlaxoSmithKline has for several years engaged in a campaign of creative destruction, acting to replace traditional methods of manufacturing and research with more efficient ways of doing things. The implementation of programs such as its Manufacturing Technology Roadmap, a manifesto against the standard pharma method of manufacturing drugs in batches, for example, has helped earn the company the reputation of risk-taker among major drug firms.
Now, GSK is turning its attention to drug discovery, where it is hoping that a nascent approach of applying supercomputing to huge data sets will allow it to move from target identification to a molecule ready for the clinic in just one year.
John Baldoni, senior vice president of platform technology and science at GSK, refers to the drastic telescoping of a process that has traditionally taken closer to a decade as the company’s “moon shot” project. It will rely heavily, he says, on replacing the iterative chemistry process of identifying and testing molecules with a networking approach that finds and tests huge numbers of molecules simultaneously, using gigantic stores of data. The technique GSK is exploring, deep learning, sits at the cutting edge of artificial intelligence research, in which computers develop data models that evolve, or “learn,” through computational experience (see page 29).
Facets of the new technique have already been employed elsewhere in industrial research, such as in the development of self-driving cars. But it has not, according to Baldoni, been explored by any major drug company in a large-scale effort to manage the explosion of data in drug research that followed the decoding of the genome and the rise of cloud computing.
“Having spent a year looking at statistics around the drug discovery process,” Baldoni says, “we began asking whether it’s time to rethink how we do things.” Early-stage discovery methods developed with the rise of compound libraries and high-throughput screening in the early 2000s are time-consuming, predicated on huge investments, and currently standard in the industry, he says. Meanwhile, the process of drug discovery has been bogging down in data.
“A few people in my department had been talking to folks at the Department of Energy about how we might be able to take advantage of high-performance computing to replace some of the empirical work that we do in the drug discovery process,” Baldoni says. Those discussions led to a partnership between GSK, DOE labs headed by Lawrence Livermore National Laboratory (LLNL), and the National Cancer Institute (NCI).
The partnership, called Accelerating Therapies for Opportunities in Medicine (ATOM), was launched earlier this month. It aims to develop computing models that will guide researchers, drawing on a computer’s ability to quickly vet millions of molecules for efficacy and structural relationships. The models will adapt as they are applied to new data. ATOM will begin by putting DOE’s supercomputing powerhouse to work on data from GSK and NCI, using the drug company’s expertise in chemistry and biology as a framework for pioneering applications of deep learning in drug research.
The partners, having collaborated on defining ATOM’s mission over the past year, are currently deciding where to locate their central laboratory. The group is also acting to expand its membership. Baldoni says it seeks to recruit other major drug companies willing to contribute significant quantities of research data to its computer modeling program.
“What was really driving the conversation was changes in supercomputing,” says Jason Paragas, director of innovation at LLNL, who was involved in early meetings with GSK. The advent of cloud computing, in which companies such as Amazon and Google were hosting huge volumes of data from many sources, and advances in supercomputing set the table for researchers to apply computational learning systems to the highly complex field of drug discovery, drawing from heretofore unapproachable banks of data.
Paragas says new supercomputers that will be used by ATOM have been purchased by a consortium of national research labs, including Oak Ridge and Argonne, both of which will provide systems and research backup to the partnership.
Jim Brase, associate director of computation at LLNL, says ATOM will be taking a deep dive into big data. “We have large amounts of experimental data—genomic data, transcriptome data, assay data—on how biological systems respond to chemicals and their structures. We are seeking to understand from large data sets what particular combinations of things and standards in those data sets are important to building predictive models.”
Brase says the variety of deep learning ATOM might be most interested in is unstructured, or unsupervised, feature learning, in which the focus is on early-stage identification of data sets that go together and of significant patterns, without predetermined parameters or expectations. The work is comparable to what Google has accomplished with face recognition, he says, but the data sets are much larger and far more complex.
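To make the idea of finding groupings without predetermined labels concrete, here is a minimal sketch in Python. It is not ATOM's method and involves no deep learning. It uses plain k-means clustering as a simple stand-in for unsupervised learning, and everything in it is an assumption made for illustration: the data are synthetic, the two "assay readouts" are hypothetical, and the function names are invented here.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means: group points using no labels at all."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: dist2(p, centers[j])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

# Synthetic data (hypothetical): 50 compounds near one response profile
# and 50 near another, each described by two made-up assay readouts.
rng = random.Random(0)
group_a = [(rng.gauss(0.2, 0.05), rng.gauss(0.8, 0.05)) for _ in range(50)]
group_b = [(rng.gauss(0.9, 0.05), rng.gauss(0.1, 0.05)) for _ in range(50)]
points = group_a + group_b

# The algorithm is never told which compound came from which group.
labels, centers = kmeans(points, k=2)
print("cluster sizes:", [labels.count(j) for j in range(2)])
```

The algorithm sees only the numbers, never the group each compound was drawn from, yet it recovers the two groups from the structure of the data. The data sets Brase describes are vastly larger and higher-dimensional, which is where deep learning and supercomputing come in.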
Baldoni says GSK has agreed to contribute information on 500 failed compounds, including complete toxicology and clinical testing data, in addition to 600,000 advanced compounds in screening at the company. In all, GSK will give ATOM access to more than a million compounds screened over the past 15 years, all of which have biological data associated with them.
But that isn’t enough. The partnership, Baldoni says, will need to recruit other large pharmaceutical companies willing to pony up comparable stores of data. He adds that information on failed compounds will be key to achieving breakthroughs in drug discovery.
The problem is that big drug companies are not forthcoming with the data or any information on failed discovery projects. “I am involved in an initiative to get companies to share their data, and I have to tell you it’s extremely frustrating,” he says.
“There are hundreds of thousands of failed molecules or molecules no longer of interest with information on structures, analogs, toxicity, and structure-function relationships,” Baldoni says. “Why would you not want to put them into the greater good? We feel there is an obligation that we have to patients in trials to share these failed compounds so we can develop better drugs faster.”
But the partnership is hopeful that industry will come around and contribute data. Baldoni says the group is in discussion with various research entities and hopes to announce the addition of another large drug company in the coming weeks.
There has been little pushback on the changes to chemistry in research at GSK as the result of bringing in heavy computational firepower in early discovery. On the contrary, says Stacie Calad-Thomson, a GSK chemist who is coordinating the activities of ATOM laboratories and serving as key liaison with LLNL and NCI, there is a great deal of enthusiasm in the lab.
“I think it will have incredible impact on chemistry, allowing us to do research in a more rapid and agile fashion,” she says. “Everyone is very excited about that.” Calad-Thomson also notes that the company has already seen the benefit accrued in tearing up traditional procedures in manufacturing. “Now we’re doing it in the discovery space at the forefront of innovation.”
Baldoni agrees, adding that the company needs to prepare for what’s coming. “There is a recognition that the state-of-the-art computers in the national laboratories will soon be at pharma companies. We might as well start building the tools to use them. We’ll get ahead by a few years, and that is critical.”