Enter The Data Scientist

Chemistry and biology have merged with high-powered statistical analysis in drug discovery labs

by Rick Mullin
November 12, 2012 | A version of this story appeared in Volume 90, Issue 46

Credit: Len Rubenstein/Broad Institute
Traditional science guides the researcher as information technology fills the laboratory, like this one at the Broad Institute.

“Big Data,” the latest business world catchphrase, hit the cover of Harvard Business Review in October to herald a special section on the benefits to virtually any commercial or research venture of analyzing the flood of data pouring out of today’s computers. Organizations that choose to commit to the math are promised a huge opportunity to gain advantage in competitive markets.

The big wash of data did not roll in overnight, however. Data have been critical in drug discovery for a decade. The age of genomics and personalized medicine has taken drug research to the point where scientists, operating in highly computerized laboratories, have no choice but to work with reams of data in order to model diseases and design therapies. Although basic chemistry and biology remain fundamental to drug discovery, the research has veered toward statistics, requiring scientists to take intense crash courses in statistical analysis.

At the same time, mathematicians are entering the lab and learning science. Managers of discovery laboratories in the drug industry point to the rise of the “data scientist,” a new class of mathematician trained specifically to bear down on big data. Considered research rock stars in industries such as finance, marketing, and manufacturing, data scientists are math whizzes willing to learn on the job whatever they need to know about a particular industry. And drug discovery offers them a big career opportunity.

Although effective management of data in the laboratory can lead to breakthroughs, the trick is matching the science to the math. And this has created a kind of cultural standoff. For example, “some say it’s easier to train the mathematician in biology,” says Dawn Van Dam, general manager of the consulting firm Cambridge Healthtech Associates. “Others say it’s easier to train the biologist about math. They both say, ‘Start with who I am, and train the others on my stuff.’ ”

But collaboration between the disciplines is evolving out of necessity, according to Van Dam. “What you have in the industry now is a tectonic shift,” she says. “The earth has moved and opened up, and half of all the research and development people have fallen in.” Drug companies have made major cuts in their research departments in an effort to boost productivity, she says. Many failed, however, to develop an alternative research strategy for their smaller staffs.

Meanwhile, science has become increasingly complicated and data-driven, Van Dam says, and it requires researchers to have high-level science as well as data analysis skills. “I see the job of the chemist being much more complicated and much more knowledge-intensive than it ever has been before.”

Credit: Novartis
Novartis embeds data scientists in its therapeutic groups.

Universities and colleges are only beginning to develop postgraduate curricula in fields such as bioinformatics and quantitative biology to prepare graduates for jobs in drug discovery and related research fields. Where programs have emerged, they are often the result of collaborations between math and science departments.

Gilberto Schleiniger, a mathematician at the University of Delaware, worked on developing an undergraduate curriculum in quantitative biology at the university. The mathematics and biology departments collaborate on the program, which combines elements of both disciplines with chemical engineering, he says.

The university began developing its curriculum, Schleiniger says, after the publication of a 2002 National Academy of Sciences report that highlighted the need for mathematical modeling and data analysis in drug discovery. The school consulted with drug companies as it developed the program.

Schleiniger, who previously worked as a math analyst in finance and fluid dynamics, says he found the drug industry lagging others in math. “From the research point of view, math models were a hurdle to overcome,” he says. “Most of the people who did research in drug discovery were set in their own ways and training.” But the industry has become more focused on establishing a regimented approach to data management as a means of improving efficiency in drug discovery.

Like Van Dam, Schleiniger sees a culture clash between mathematical modeling and pharmaceutical science. The term “quantitative biology,” he says, illustrates a resistance to recognizing the role of math in biology. “You wouldn’t call a program ‘quantitative physics,’ because all physics is quantitative,” he says. He emphasizes, however, that the university’s program is focused on science. “Our goal is not to create mathematicians. Our goal is to create great biology researchers and life sciences researchers,” Schleiniger says. “We have other programs to train mathematicians.”

John Quackenbush, a computational biologist at Dana-Farber Cancer Institute and Harvard School of Public Health, is developing a Ph.D. program in bioinformatics. For him, a major challenge is the complexity of the data, which overpowers the capacity of commercially available information technology (IT). Genomics data have blown apart the status quo in bioinformatics and continue to reshape discovery research. “It’s very difficult to teach this in a course because it is such a dynamic situation, with technology evolving before our eyes,” Quackenbush says.

He recommends that students be taught the fundamentals of data analysis by having them design experiments as if they were software engineers. “In software engineering there is something called ‘use-case development’ in which you start at the end result and work backward. You think about how you are actually going to analyze the data and work back from there,” he says. “That gets you thinking about how well powered your design is. If you know you need to make a specific comparison, you can judge whether or not you have enough measurement to make such a comparison.”

Students entering the program with quantitative analysis and statistical training will need to catch up on a lot of biology, Quackenbush warns. “You would think coming into the biostatistics department, students would have a reasonable grasp of biology,” he says. “It’s surprising how little biology many actually know. They often see data as simply data.”

A lot of cross-discipline training is also taking place these days at big pharmaceutical companies. John Reynders, vice president of R&D information at AstraZeneca, says the ability of scientists to handle complicated math is crucial, given the volume and diversity of data entering the drug discovery lab. Questions that arise in drug discovery involve biomarkers, patient populations, and drug efficacies. “There are real-world evidence studies that need to be done on marketed products,” Reynders says. “All of this involves some levels of informatics.” But before the data, he emphasizes, comes basic science that is guided by researchers’ intuition.

Credit: AstraZeneca
Reynders sits at a desk smiling. A book is open in front of him, and a coffee cup with no handle and a notepad flank the book.

Reynders, a mathematician by training, learned science on the job. He graduated from Princeton University with a Ph.D. in applied and computational mathematics and went to work at Los Alamos National Laboratory on high-performance computing. After a short stint at Sun Microsystems during the dot-com era, he took a job at Celera in 2001. Celera, a genomics pioneer, was in the market for researchers with the skill to handle complex mathematical algorithms and a willingness to learn science fast.

“It was definitely learning from the end of a fire hose,” Reynders recalls. “What I have found is that bridging between technology and science enables insights that would be very difficult to arrive at otherwise.” Reynders now views fluency in math as an asset for anyone entering drug discovery research. “There has to be a commitment to learning very aggressively in the domain that is not the first you learned.”

Indeed, given the rise of in silico research and the use of data to design, perform, and evaluate experiments, science is no longer tied to wet-lab chemistry and biology. Some or all of the traditional science can be outsourced by a major drug company with a strong research informatics staff. AstraZeneca has done this to varying degrees in each of its therapeutic research groups, Reynders says. In neuroscience, where there is a high level of modeling of physical and psychological mechanisms and a lot of complex data, AstraZeneca outsources traditional chemistry entirely, and in-house scientists work with the numbers.

Anastasia Christianson, head of biomedical informatics for AstraZeneca’s neuroscience group, says in-house researchers design experiments in concert with the contract research organizations or academic research partners that perform them. “We don’t have any labs at all,” she says. “Scientists here are not doing the experiments, but they are collaborating on designing the experiments and certainly collaborating on interpreting the results.”

Christianson came to AstraZeneca with a Ph.D. in biochemistry from the University of Pennsylvania. She did postdoc work in cell biology at Harvard University. “I am a scientist by training,” she says, “but I moved into information science because I saw the need to better utilize our data and information. Drug discovery requires technical savvy and a computational mind-set.” Christianson developed these skills on the job.

But mathematical modeling and analysis have not risen at the expense of science, according to Christianson, who consulted with Schleiniger in developing the quantitative biology curriculum at the University of Delaware. “I think the math is helping us to derive better intuition as scientists, because we are better able to utilize all the data available. We are now using data to help the analysis. It does not replace the scientific training. It augments it and adds to it and enables more informed decisions.”

Scott Sheehan, senior director of discovery chemistry research and technology at Eli Lilly & Co., agrees. “From my perspective, being a chemist, the influx of data has not changed the fundamental tenets of the scientific method, but it has most definitely enhanced, on many levels, our ability to apply the scientific method,” he says. “It has improved the quality of the questions we are trying to ask and the efficiency of how we provide answers.”

For example, Sheehan says, computational methods of designing and evaluating molecules prior to committing to wet chemistry have increased efficiency in research. “It is about the distillation of information into hypotheses and the kind of nonlinear breakthroughs that result from a uniquely human endeavor,” he says.

Even with the influx of data, Sheehan, like Reynders at AstraZeneca, gives primacy to the traditional science. As he points out, Lilly has separate research IT and chemistry groups. “In chemistry, we look for people with science backgrounds,” he says, adding that there are practical means of developing quantitative analytical skills on the job.

In contrast, at the Novartis Institutes for BioMedical Research in Cambridge, Mass., each therapeutic area in drug discovery operates with a team of scientists and data analysts. Mark Borowsky, executive director of scientific data analysis, who came to Novartis this year, is in the process of establishing an enterprise-wide data analysis team to support the embedded quantitative experts in the individual research units. He says computational biologists and chemists have been coming into the workforce with a good understanding of both science and math.

A number of consultants have arisen to offer data analysis to drug researchers, and Novartis has done some work with such companies, Borowsky says. “But building internal expertise in data management is integral to our business,” he says.

At Pfizer, a company that has drastically downsized and reorganized R&D over the past two years, data management has been central to establishing new research hubs called Centers for Therapeutic Innovation. Jose-Carlos Gutiérrez-Ramos, senior vice president of biotherapeutic R&D at the Cambridge CTI, says Pfizer embarked on a strategy of increasing research partnerships partly as a way to effectively manage data.

“Moving our research units and building this site in Cambridge has allowed us to take advantage of an environment that includes the Broad Institute and other computational institutes,” he says. “It will allow for crowdsourcing and dealing with data in much more contemporary ways than if we had taken our traditional research IT departments and tried to change them.” Restarting with an emphasis on partnerships has also given Pfizer a chance to rethink how research data can be used beyond the discovery lab and in drug development and clinical testing, Gutiérrez-Ramos says.

Developers of IT systems have their eyes on the increased need for mathematical modeling and statistical analysis in drug discovery. System architectures have already changed fundamentally as the approach to using data has evolved, according to Matt Hahn, chief technology officer at Accelrys, a leading supplier of laboratory IT systems. “If you go back 10 years,” he says, “the focus was really on statistical analysis in molecular modeling and simulation. This was a significant portion of our business, but it now represents a small part because our customers are beginning to focus downstream” on developmental and clinical science and early manufacturing.


And a new generation of scientists is entering research, Hahn says. “Labs are not necessarily training old-guard molecular modelers,” he says. “Now you have people with backgrounds in statistics who pick up on the science, the patients, the genomics, and begin to apply the statistics they are familiar with to a huge bed of problems.”

Software companies now try to provide what Hahn calls scientific life-cycle management systems that support data analysis from discovery through development. The systems need to support networks of scientists—both traditional and data scientists—from R&D to the clinic. Accelrys has built on its laboratory IT products in recent years to provide such support. For example, it merged with Symyx Technologies, a supplier of electronic laboratory notebooks, in 2010 and acquired the collaborative research software business of Heos, a contract research firm, this year.

Scientists warn, however, that data science is evolving faster than commercial systems to support it, which requires researchers to innovate in developing analytical IT. They also agree that no one company’s network is likely to provide as much IT firepower as will be needed. As partnerships expand, drug companies will seek to link with data science experts that have a track record in pharmaceutical science.

Credit: Maria Nemchuk/Broad Institute
Photo of Martin Leach

“Depending on the size and maturity of their cyber organizations, some companies are further ahead of the curve than others on big data,” says Martin Leach, chief information officer at the Broad Institute. “There is a lot of data complexity. Volume is an important factor but so is accessibility. More disparate types of data are becoming available and offered to preclinical scientists. In some cases, pharma and biopharma companies need to leverage support from outside boutique companies.”

The net benefit of big data, Leach says, is that it will bring discovery researchers closer to what they need to understand about physiology, “which is a massive network of sets and systems of human function and dysfunction. This is going to lead them into other types of computation and more into network analysis.”

The Broad Institute has brought a network analysis focus to a drug discovery partnership with AstraZeneca. “Chemistry has evolved, as it always does,” says Jeremy R. Duvall, manager of diversity-oriented synthesis at the institute. “We are now trying to capture the next generation of chemistry.”

The data now available will give researchers the ability to build highly complex, three-dimensional compounds, Duvall says.

Next-generation chemistry will also emerge from partnerships between traditional-science-based companies and data science specialists. “The great thing is that we are both working on the data,” Duvall says of the collaboration with AstraZeneca. “We are generating compounds and screening together and utilizing data as a team.”

There is no question that drug discovery will only become more data-intensive, requiring researchers to take a multifaceted approach to science that balances chemistry and biology with emerging methods of statistical analysis. Drug discovery scientists say the changes ahead will be rapid, fostering innovation in research and, with some luck, big breakthroughs in developing medicines. But traditional science will still underpin the process.

“As we start to tear into data,” Harvard’s Quackenbush says, “we’ll find increasingly that the only way to proceed is to basically go back to biological science.”
