First, there was the law of large numbers, a theorem in probability theory stating that the average of the results obtained from a large number of trials should approach the expected value, minimizing the bias created by a single random event in one trial. It is a principle taken seriously by gamblers and casinos in working out the odds of winning. It’s also useful to researchers designing experiments.
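The effect is easy to see in a small simulation. The sketch below is only an illustration, not anything drawn from the companies discussed here: it assumes a fair six-sided die, whose expected value is 3.5, and the function name and fixed seed are chosen for the example.

```python
import random

EXPECTED_VALUE = 3.5  # expected value of one roll of a fair six-sided die


def mean_of_die_rolls(n_rolls, seed=0):
    """Return the average of n_rolls simulated rolls of a fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(rng.randint(1, 6) for _ in range(n_rolls))
    return total / n_rolls


# Compare the average against 3.5 as the number of trials grows.
for n in (10, 1_000, 100_000):
    mean = mean_of_die_rolls(n)
    gap = abs(mean - EXPECTED_VALUE)
    print(f"{n:>7} rolls: mean = {mean:.3f}, gap from 3.5 = {gap:.3f}")
```

With only a handful of rolls the average can sit well away from 3.5; with many rolls it is very likely to land close to it, which is the behavior the law describes.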
Now, there is “big data.” A cross between consultants’ jargon and the self-evident fact that a lot more data are being spewed out everywhere, big data is often defined in terms of both quantity and contextual quality.
Although a battery of statistical models has been developed to reap the benefits of large numbers without running an indefinite number of experiments, there are a lot more data in laboratories. Information technology (IT) and software tools that manage data from an array of laboratory repositories have proliferated, but this has only added to the overall chaos of “too much.”
And the result is that there are still too many experiments going on. A white paper published recently by IT analysts at the consulting firm International Data Corp. (IDC) estimates that 40% of all R&D experiments are repeat runs necessitated by inefficient experimental design or inadequate IT.
Companies that supply lab management software are trying to help their customers reduce that 40% number by adapting their information management products to handle increasing amounts of data. Science-driven companies welcome the effort but aren’t yet ready to say their data problems have been solved.
The IDC paper was sponsored by Accelrys, a leading supplier of lab software. The firm responded to the data deluge this year with two adjuncts to its suite of laboratory products: Accelrys Experiment Knowledge Base (EKB), a data search and experimentation design application, and Accelrys Insight, a collaborative cheminformatics tool. Accelrys claims both products elevate the capacity of its laboratory IT products to access, visualize, and analyze data.
According to Ted Pawela, senior director of Accelrys’s materials science and engineering business, the firm’s basic IT platform enables generic data management functions such as extraction, transformation, and loading. EKB adds specificity to these processes by allowing the user to automate and customize the selection and treatment of data throughout a lab.
More important, Pawela says, it helps convert data into knowledge. “Customers want to start drawing conclusions from data after drawing from 100 tests or 1,000 tests,” he says. “EKB lets you collect information and aggregate it from various sources, providing various charting and plotting components to help you visualize and understand what it means.”
Accelrys Insight adds an automated workflow feature, according to Rob Brown, senior director of life sciences marketing at the company. A researcher can enhance experimental data gathered in the laboratory with information from further analyses that can be shared by other users. The product comes in a Web-based version and one that works with Microsoft Excel, a program that many researchers are loath to abandon, Brown acknowledges.
Other lab IT vendors have moved to integrate data collection and analytics through acquisition. IDBS, for example, has enhanced its E-WorkBook, an electronic-laboratory-notebook-based informatics product, with the acquisitions of Quantrix, a modeling and analytics software developer, and InforSense, a developer of business intelligence software that helps analyze data from disparate sources.
E-WorkBook, according to Glyn Williams, vice president of product delivery, has evolved into a single window on a variety of data sources and systems that used to have to be navigated separately. “People don’t want to be pinging out from one system to another just to do an experiment,” he says. Combing the laboratory through software links, E-WorkBook provides simultaneous access to “metadata, raw data, and to deeper data if you need it,” Williams says.
At Thermo Fisher Scientific, an IT infrastructure for big data has been part of product development since long before the term was coined, claims Seamus Mac Conaonaigh, the firm’s director of informatics. “Big data is not necessarily a new problem,” Mac Conaonaigh says. The term emerged with the rise of Apache Hadoop, an open-source framework for distributed storage and processing that is modeled on techniques published by Google and is used to manage large amounts of structured and complex data that had not traditionally been dealt with on a single IT platform.
Thermo Fisher approaches laboratory informatics from the standpoint that all labs have a mix of IT from several vendors, Mac Conaonaigh says. He points to the company’s Connects for the Paperless Lab program, which combines technology with consulting services to translate disparate instrument languages and convert raw data to a vendor-independent storage format.
“We have been talking about being part of the bigger picture within the lab for a long time,” Mac Conaonaigh says. “This conversation is very exciting to us.”
The increasing involvement of statisticians in the research laboratory is also exciting, says Trish Meek, director of product strategy at Thermo Fisher. “We end up as scientists so specialized in a particular field and a particular discipline,” she says. “But what we are really talking about is looking at things at a much higher level and stepping away from the science to ask where the data mathematically correlate and taking it back to scientists to ask: ‘Is this relevant?’ ”
JMP, a division of software giant SAS, has seen increased interest in its software for design of experiments (DOE), a statistical regime for conducting efficient experiments (C&EN, April 1, page 25). The company has adapted its DOE software to aid collaborative analysis by incorporating HTML5, the fifth major version of the hypertext markup language used to structure webpages and other documents on the Internet, which adds built-in multimedia support. The company’s JMP 11 software also employs a screening design that increases the number of factors that can be evaluated per experimental run.
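The general idea behind screening designs can be illustrated with a textbook two-level half-fraction factorial, which tests four factors in eight runs rather than the 16 a full factorial would need. This is a minimal sketch of that classic technique, not JMP’s screening design, whose details are not described here; the function names are hypothetical.

```python
from itertools import product


def full_factorial(n_factors):
    """Every combination of low (-1) and high (+1) settings: 2**n_factors runs."""
    return [list(levels) for levels in product((-1, 1), repeat=n_factors)]


def half_fraction(n_factors):
    """Half-fraction design: the last factor is set to the product of the others."""
    runs = []
    for base in full_factorial(n_factors - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + [last])
    return runs


full = full_factorial(4)
half = half_fraction(4)
print(len(full), "runs in the full four-factor design")
print(len(half), "runs in the half fraction")
for run in half:
    print(run)
```

The saving comes at a price: in the half fraction, the fourth factor’s effect cannot be separated from the three-way interaction of the other factors, which is why such designs are used for screening rather than for final modeling.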
Anne Milley, senior director of analytic strategy at JMP, acknowledges that traditional science will continue to direct the work of statistical tools in the lab, but she adds that statistical modeling will provide crucial guidance in drawing inferences from large sets of data.
Chemical and pharmaceutical firms are beginning to investigate big data strategies, some by employing software catering to advanced data analysis. Johnson Matthey is one of the first users of Accelrys’s EKB system at its research facility in Billingham, England. Sam French, technical development manager, says the firm is rolling it out in syngas catalysis research.
Volume of data may determine where big data IT fits best, according to French. “Our more high-throughput technologies require better ways to store, manage, and interpret data,” he says. “But the jury is still out on low-throughput experimentation.” Although Johnson Matthey is concerned with maintaining a better “corporate legacy” than the traditional shelf loads of hard-copy lab notebooks, “we are not convinced yet there is a value to storing every bit of data coming through.”
Similarly, Lonza is experimenting with advanced data analytics at its biologics manufacturing facility in Slough, England. Using IDBS’s Bioprocess Execution System, an electronic notebook based on E-WorkBook, Lonza is creating a paperless research environment, according to Marc Smith, knowledge management project team leader.
So far, the experiment is succeeding at Slough. “We are now able to trend on more aspects of the science than we have ever been able to before,” Smith says. As for whether Lonza is developing a big data strategy, he’s not ready to say. “We are absorbing as much information as possible and trying to find linkages with it,” Smith says. “We are at a point in this big data curve now where we can start looking at it in a more logical way. I tend to want to call it clever data.”
Although vendors claim biology labs are generally behind chemistry labs in implementing advanced statistical analysis, genomics has emerged as a key area of development for big data computing. Agilent Technologies, a leading supplier of data tools for gene expression and next-generation sequencing, has added a range of data management capabilities to its GeneSpring software in partnership with Strand Life Sciences, an Indian genomics IT and diagnostic services firm.
Strand’s Avadis data-mining and visualization platform for bioinformatics has been incorporated into GeneSpring, adding mass spectrometry proteomics and next-generation sequencing capabilities to what had been a gene expression program, according to Strand Chief Executive Officer Vijay Chandru.
Don Healey, chief scientific officer at Opexa Therapeutics, a developer of personalized multiple sclerosis treatments, argues that big data is real but that the IT infrastructure for dealing with it is still evolving. “Big data is not dictated by its size but by its complexity,” he says. “In the past, scientists have either been restricted to gathering high-resolution multivariate data on a small number of samples or faced a reduction in analytical power so as to satisfy the required level of sample throughput.” It was always a compromise, Healey says.
Today, scientists have the opportunity to configure “high-content, high-throughput” data systems, he says. “The problem is not analyzing one or two variables on 100,000 samples, but 10,000 variables on 100,000 samples.” Work at this scale requires advanced informatics, which he characterizes as a developing field still dominated by proprietary software that requires considerable configuration by the user.
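A back-of-the-envelope calculation shows why the second case is so much harder. The sketch below assumes each measurement is stored as one 8-byte number, an assumption made for illustration rather than a figure from Healey, and it also counts the variable pairs an analyst could compare; the function names are hypothetical.

```python
BYTES_PER_VALUE = 8  # assumption: one 64-bit floating-point number per measurement


def matrix_size(n_samples, n_variables, bytes_per_value=BYTES_PER_VALUE):
    """Return (number of values, storage in bytes) for a samples-by-variables table."""
    values = n_samples * n_variables
    return values, values * bytes_per_value


def pairwise_comparisons(n_variables):
    """Number of distinct variable pairs that could be checked for correlation."""
    return n_variables * (n_variables - 1) // 2


for n_variables in (2, 10_000):
    values, size_bytes = matrix_size(100_000, n_variables)
    pairs = pairwise_comparisons(n_variables)
    print(f"{n_variables:>6} variables: {values:,} values, "
          f"{size_bytes / 1e9:.4f} GB, {pairs:,} variable pairs")
```

Under that assumption, storage grows in proportion to the number of variables, but the number of possible pairwise relationships grows roughly with its square, which fits Healey’s point that complexity, not sheer size, is the challenge.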
There is, however, a grand tradition of user configuration in laboratory IT, which is not necessarily a bad thing, in that users bring scientific know-how to the analysis of data. Software vendors agree that the tradition will not swap science for statistics even as data analysis holds greater sway.
“Statistics should not be approached as a black box where you push a button and get an answer,” JMP’s Milley says. “It is a service discipline, and it needs other disciplines because with those others comes domain expertise.”