


How pharmaceutical research is navigating the data lake

A trend in large-scale data storage rigs cloud computing with advanced analytic software

by Rick Mullin
October 6, 2018 | A version of this story appeared in Volume 96, Issue 40


An illustration of a data lake.
Credit: C&EN/Shutterstock

With surfaces reflecting the world above while obscuring mysteries below, lakes have long beguiled scientists and poets alike. Alexander von Humboldt went to Lake Atitlán in Guatemala; Charles Darwin, to Tagua-Tagua in Chile. Henry David Thoreau lived on Walden Pond (a lake) in Massachusetts, Lord Byron entertained on Lake Geneva in Switzerland, and William Wordsworth roamed England’s Lake District. The urge to take temperatures, measure depths, catalog plants and animals, and commune with the eternal is easy to understand.

Lately, scientists in pharmaceutical research and development have fallen for a new kind of lake, one that is strictly metaphorical: the data lake.

Big data in drug research has reached the point where information hived in data warehouses serving individual departments—manufacturing, process design, quality management, drug discovery and development, clinical trials—can now be inexpensively stored in one large repository accessible to all, either through an in-house system or, more typically, via the cloud. Just as the cloud provides a conceptual device for understanding computing available from who knows where on the internet, the lake paints a picture of centralized, high-volume data storage and management.

The metaphor can be extended further. A lake’s complex ecosystem, composed of plant, animal, and abiotic elements, is analogous to the mix of data—alphanumerics, images, videos, handwritten notes—that resides in a data lake. And the common demise of lakes—degeneration into swamps—is an apt metaphor for the failure to establish rapid streams whereby data can be entered, accessed, routed, updated, reentered, and eliminated as needed.

“Data lakes are hard to do right,” says Philip Ross, director of translational bioinformatics data science at Bristol-Myers Squibb, one of several large pharmaceutical companies converting disparate data storage systems to large repositories. “And it’s evolving. We want to make sure we are at the forefront of the evolution. It’s explosive right now.”

BMS is currently establishing several data lakes connected by software that facilitates data mining. The company hopes to increase speed and efficiency of research through a system of collecting, processing, and analyzing large stores of data to support, for example, detection of biomarker signals that could help determine whether a drug is safe and effective in patients.

Speed is a matter of multitasking made possible with central data storage, Ross says, allowing access to a broad range of data to accelerate research.


Krista McKee, director of data analytics at Takeda Pharmaceutical, is also experimenting with data lakes. “Right now we are working on efficient presentation of data to eliminate unnecessary time spent searching through it all,” she explains. “We are pursuing it further to see where we can get into machine learning,” she says, referring to a variety of artificial intelligence whereby data processing becomes self-directing over time.

McKee spearheaded development of a tool called Platypus that combs through the data in Takeda’s lake to identify, extract, and route pertinent information to researchers or medical reviewers. McKee likens the process to electroreception, the means by which the tool’s namesake monotreme finds food in the murky depths of streams and rivers.

Meanwhile, Amgen is coordinating two data lakes, one covering process development, quality control, and manufacturing; the other, clinical trials. Researchers can access data from both lakes, according to Janet Cheetham, executive director of process development, who is leading the digital and data transformation program at the company.

A woman at a conference room table sits before an open laptop computer.
Credit: Courtesy of Krista McKee
Takeda's Krista McKee has developed a data-foraging tool, Platypus, that mimics the monotreme's means of finding food in muddy stream beds.
Photo of a man leaning on a walkway banister in a corporate lobby.
Credit: Bristol-Myers Squibb
Bristol-Myers Squibb's Philip Ross sees the data lake rising as both research and information technology evolve.

When considering data lake architecture, “there are two things to think about,” Cheetham says. “There is what sits underneath the lake—the storage system. And there is what’s on top of the lake—the presentation layer, the analytics.” Amgen’s lakes communicate via the analytics system.

Storage is the easy part, Cheetham explains. Cloud storage, she says, provides infinite capacity at a very low cost, so information from drug development to commercialization can be gathered in one place rather than reside in silos separated by department. “It completely revolutionizes how you think about these data,” Cheetham says.

The analytics is the hard part. “The cloud does not address the challenge of maintaining data integrity and driving up the value of data in terms of its reproducibility and use in high-fidelity modeling and simulation,” Cheetham says. And analytical challenges can’t simply be solved with software. They require communication standards to enter and retrieve data and allow for third-party software to work cooperatively with the lake.

Several major pharmaceutical companies and data management technology vendors have been working to establish these standards through the Allotrope Foundation, a group formed in 2012. Cheetham, who chaired the foundation from 2014 until last year and now heads its strategy committee, says the group began developing data standards in 2013 and last year issued a standard for formatting data exchanges and another for ontology.

The latter establishes a consistent vocabulary. “We recognized that we were calling the same item by 35 different terms because we could,” Cheetham says. “That really limits your ability to do higher-level data analytics.”

This year, the group plans to debut a standard that defines data structures and provides templates for applying the formatting and ontology standards to specific processes or experiments. Members, including Amgen, BMS, GlaxoSmithKline, Merck & Co., and Pfizer, have contributed more than $10 million to the effort through membership fees and the value of staff time dedicated to the project, Cheetham says. While the Allotrope standards are tailored to drug research, she says industries including agriculture, food processing, and oil and gas can also use them.

Cheetham points to another recent advance in data lakes for life sciences research—a data lake platform announced last month by a partnership among Merck, Amazon, and the consulting firm Accenture.

Although the pharmaceutical industry has lagged others in adopting information technology—coming late to the use of electronic notebooks, for example—it is proving to be an early adopter of large-volume data storage and analytics, Cheetham says. The heavily regulated industry’s risk aversion is overpowered by the necessity of putting huge stores of data to work.

“We really can’t afford not to take the risk and embrace digital technology at a point where regulators are also wanting to move away from their traditional paper-based methodologies for regulatory filings and inspections,” she says.

Software vendors have noticed the new behavior. “I am surprised at how not behind the pharma industry is on this,” says Eva Nahari, data warehouse product manager at Cloudera, a data engineering software firm. Nahari notes that researchers in genomics and personalized medicine are jumping on the opportunity to access diverse, multiple-sourced data “without the political battles you used to have to go through.”

The pharmaceutical industry is moving faster than other segments of the health care industry, such as hospitals, says Nahari.

Software vendors have also moved quickly to develop tools for data lakes. “Six years back, we saw how this would be the future of data management,” Nahari says. Product development has focused on analytical and other software tools operating on the storage foundation. “What we build on top is machine learning, security, governance, management tools, and the ability to install on-site or in the cloud,” she says.

Before the emergence of automated analytics, large-volume data holds were creating a problem, according to Ben Szekely, a product manager at software vendor Cambridge Semantics. “You have these data swamps that were starting to proliferate,” he says. “People thought, ‘Let’s just land a whole bunch of data in a lake, and that will solve all our problems.’ But that just kicked off a whole other set of manual paths and one-off projects required to get data out of the lakes.” Adding a smart layer on top of a lake, especially one using open standards like those being developed by Allotrope, saves a lot of time in the lab, he says. Cambridge Semantics’ Anzo software, for example, converts complex data into familiar business or research terms for easy access.

John Haddad, senior director of marketing at systems vendor Informatica, agrees, adding that automated data processing, done before analytics, supports the data streams that keep data lakes alive. Systems like Informatica’s Intelligent Data Platform are needed to process what is kept in storage, he says.

“Data is only useful if you can integrate it, cleanse it, and get it into a shape and format that can be fed into the analytical algorithms that data scientists and analysts use daily,” Haddad says. Powered by machine learning, data processing tools assist in identifying data pertinent to specific projects, routing data, and reentering data as required. “That is 80% of the work,” he says. “The analytics part is straightforward after that.”

Deloitte, a consulting firm, also offers software for data lakes. Its Research Trust structures information from a lake, employing standard ontologies to allow sharing among departments in a large organization. Shared access to data enables those departments to harmonize their efforts, a key business objective in establishing a data lake.


“Where data lakes fail is where they are a technology exercise with a weak business driver,” Raveen Sharma, a project manager in Deloitte’s life science division, says. Success hinges as much on operational efficiency as it does on advances in research, he says.

Takeda is working with Deloitte on configuring its data system using Research Trust, along with visualization software from Tableau and Takeda’s Platypus program. The drug company’s business case for building data lakes is to reduce R&D costs, McKee says.

“We spent a lot of time on business requirements in R&D,” says McKee, who was previously director of business operations in an oncology unit. The project, which launched in January, now includes data from 12 clinical studies in oncology, gastroenterology, neuroscience, and vaccines, she says. “We are scaling it up rapidly to get it across our portfolio.”

BMS is likewise looking to cut time in research with its digital health initiative. But establishing data lakes is also a matter of keeping up with an evolution in science and research. “We are learning to do analysis better,” Ross says, adding that a new breed of scientist is needed to do the job.

“We are hiring bioinformaticians and machine-learning experts to help us bring the cutting-edge technology to its latest level of capability,” Ross says. “I would say it is part of the usual evolution at a pharmaceutical company that you have to respond to emerging technology, emerging understanding of disease, and emerging capabilities to detect new treatments and make them available to patients.”

