


Bringing The Logic Of The Lab Bench To Big Data Information Technology

Software firms and drug companies are learning to use their deluge of data more effectively

by Rick Mullin
June 9, 2014 | A version of this story appeared in Volume 92, Issue 23

Example of a cluster graph
Credit: YarcData
YarcData’s Urika generates cluster graphs as a means of visualizing data sets.

In the wake of the terrorist attacks of Sept. 11, 2001, the U.S. government found itself in a new world of data-intensive terror threat analysis, one in which its already formidable data management and information technology firepower would need a significant boost. It signed on Cray, one of its supercomputer suppliers, to develop new technology to study complex relationships in vast amounts of data.

“The agency had lots of technology to solve lots of kinds of problems,” says Misti Lusher, a marketing director at Cray. The agency, which she says she is not at liberty to identify, “had every technology you could name in their basement. But the one problem they couldn’t solve was to basically connect the dots between people, new media, and photos to find things that they didn’t know.”

Cray went to work on a chip and a computing system from which a researcher can construct a visualization of interconnected data from various sources in order to discern patterns—or the lack thereof. It was the beginning of the era of what is now known as big data.

Such a system clearly had applications beyond the government, so Cray moved forward on developing not only a product but also a separate company to supply graph analysis systems for big data. The company, YarcData, debuted in 2012. And one of its main markets is drug research, according to David Anstey, the head of YarcData’s life sciences business.

Indeed, nearly every supplier of computer hardware and software has mobilized to equip bench scientists with the information technology (IT) necessary to manage the great volumes of diverse data that are emerging in drug research. Their customers, meanwhile, are sorting through the third-party options and trying a few of their own. Much of the work mirrors Cray’s response to its government client. Vendors and users alike are developing analytical engines to supplement generic storage and computation systems that bog down under big data.

Tom Arneman, president of Ceiba Solutions, a data analysis software firm, cites one popular estimate of how quickly data is expanding. “If you take all the data from the beginning of time to 2012,” he says, “there will be 13 times more data stored in the next two years, and 44 times more data created.”

Not all of it will have to do with drug research, where the volume of data stored and the velocity with which it accumulates are high but lower than in other endeavors such as meteorology and finance. However, the variety of data—the third defining V in big data—is a principal challenge in pharmaceutical labs, as is its wide dispersion, both within private research organizations and in public repositories.

Arneman says researchers are overwhelmed even when Hadoop, the software that underpins most big data IT systems, is in place. The complexity of data requires the use of supplementary software that caters to the appropriate work process—the problem-solving logic of a scientist, for example. Ceiba’s Helium software, he says, operates in conjunction with electronic laboratory notebooks (ELNs), laboratory information management systems, and data storage systems to guide scientists to information relevant to specific areas of research.

Helium, which was developed by GlaxoSmithKline and acquired by Ceiba in 2012, addresses a fundamental change in drug research, says Robert Cooper, Ceiba’s director of business development. “Scientists are now not only required to be scientists,” he says. “They are literally required to be data curators, information gatherers, data aggregators, and data storers.” And they need to be able to understand the relationship of a variety of data—genomic, phenotypic, clinical—to research at hand.

Automated support for researcher decision making is only starting to hit the market, according to Arneman. “Three or four years ago, the best vision of how to analyze data was to take all the data across a company and normalize it to a common data model, taxonomy, and ontology, and dump it into a data warehouse,” he says. “That was state-of-the-art, and it never worked. It was a nightmare, and it cost a lot of money.”

Adding the necessary support for drug discovery, a tool with visual rendering of data connections, requires software development by people familiar with the lab. “There is a human-IT relationship that a lot of technologies miss,” Cooper says. “The nice thing about Helium is that it allows the scientist to say, ‘This is what I’m interested in. Tell me the things I can do against this particular component of science, and source to me the relationships that exist so I can use my brain and travel down pathways of discovery.’ ”

Credit: IBM
IBM is supplementing the computational capacity of this Blue Gene/Q with new file management and warehousing software.
Photo of the IBM Blue Gene Q

The computing giant IBM also has seen the need to adjust conventional data management hardware and software to accommodate pharma research. John Piccone, life sciences business manager for IBM Global Business Services, describes big pharma’s first pass at big data analytics as a dud. Several years ago major drug companies geared up to enter large data sets, including electronic medical records and health care claims. “They entered into large contracts with data providers and worked for a year or two, and the results were disappointing,” he says.

What the drug firms lacked, according to Piccone, was a means of vetting data that accommodates the scientific process. “My problem with the term ‘big data,’ ” he says, “is that it isn’t about data. It’s about the questions you ask, the methodology you use, and the analytics you apply to data.”

Computing-system vendors agree that data analysis IT needs to achieve a balance between storage size and computational firepower. Piccone says the right balance depends on the research task at hand.

Molecular dynamics simulation, for example, requires support for modeling amino acid sequences, including chemical bond angles and the forces between molecules, so that they can be observed for both chemical and biological properties. This task requires a higher level of computation than, say, genomics research, where the task is oriented toward trawling diverse databases for pertinent information.

Established IBM products, such as Blue Gene/Q, a high-powered computational IT platform for science research, are at the center of what IBM offers to support data analytics in drug discovery. But Piccone points to some newly launched software, such as IBM’s Elastic Storage, a data file management product. IBM PureData, a high-performance data warehousing and analytics tool acquired from Netezza in 2012, also has health care and pharmaceutical data management applications.

At YarcData, the focus is on delivering the data in a graphic format to the user. Urika, the product that emerged from the government project, generates complex data graphs that work well in drug research, according to Anstey.

“Think about the existing data environment in pharma research,” he says. “You have historical clinical trial data. You have data from ongoing trials. You have genomic information.”

Add this to the contents of data warehouses, other laboratory systems such as ELNs, and information in the public domain. The high complexity of the data and the researchers’ queries make the graph representation more advantageous than relational databases or other methods of modeling relationships, Anstey says. “In a graph, you can bring all that data together in one place and rapidly start testing hypotheses.”

In addition to IT itself, drug companies are struggling with science culture as they try to help researchers access, analyze, and share information from large data stores. Many are developing their own IT tools.

Alison Harkins, an associate scientist in strategic operations at Janssen Research & Development, the drug R&D arm of Johnson & Johnson, says technology for collecting and storing data is not a problem. “We have so many tools and systems available to us to capture and report everything you can imagine.” The challenge, she says, is coming up with a semantic road map and protocol for sharing data. Janssen is building an engine for recording and transmitting data among researchers based on the logic of process design and experimentation.

Janssen researchers have developed an IT tool named tRex to provide researchers with pop-up screens on their ELNs on which scientists can collect and share data in a standardized form defined by steps in an experimental procedure.

Adam Fermier, who is charged with developing a knowledge management regimen at Janssen R&D, says building recipe-based semantics into data analytics has provided an entrée for overwhelmed scientists. “We have a lot of freaking data, all over the place in every format,” he says. “Giving structure to the information is really our main mantra.” It is the researchers themselves who need to put the finishing touch, if not the first touch, on an IT system to support data analytics, he says.

The IT department at GSK found this to be true in developing the tool that became Helium, according to Richard Bolton, discovery science IT manager at the company. GSK, he says, put scientists to work developing the analytics tool from the start.

“We had been bitten before by, ‘IT has a fabulous idea. If we build it, they will come,’ ” Bolton says. “So when we first designed Helium, we started off with something very simple, showed it to a few users, and said, ‘Try this, and we’ll be back tomorrow.’ ” The IT department made adaptations based on feedback and then asked a different group of researchers for their input. “We did that until the users said, ‘Don’t take this away!’ ” he recalls.

Software developers agree that the “pull rather than push” approach to IT development is likely to ensure that a data analysis system will actually be used effectively by scientists. Once in place, large stores of data will become useful in a collaborative research effort. “Data analytics opens the door to what other people are doing rather than just knowing what you know,” Ceiba’s Arneman says.

And such systems may also put a stop to the nebulous and somewhat rankling term “big data,” Bolton observes. “Big data means data that is too big to manage,” he says. “Once you can manage it, it’s not big data anymore.”  

