Biomedical researchers are beginning to tap into a gold mine of nontraditional sources of data to unravel some of the complexities of human diseases, especially common diseases for which there is significant interplay between genetic variants and environmental stressors. But so many data are being generated by so many different disciplines that integrating them all is a challenge.
Efforts are now under way at the National Institutes of Health to tackle this big data problem. The agency recently announced plans to spend approximately $100 million per year over the next seven years to develop a common data-sharing framework and to start training graduate students in quantitative skills. The idea is to promote data integration across disciplines, merging everything from electronic health records to environmental data.
“The last decade or so has seen some fairly impressive technological developments in areas like genomics, imaging, environmental monitoring, and many areas of phenotypic characterization or physiological measurements,” says Eric D. Green, director of NIH’s National Human Genome Research Institute (NHGRI). “Massive amounts of data are being generated in multiple different scientific domains. The biomedical research enterprise is absolutely overwhelmed with data, and we cannot analyze it as fast as we can generate it.”
NIH wants to capitalize on this exponential growth of biomedical research data in hopes of accelerating discoveries that could lead to new ways of preventing and treating human diseases. As a first step, the agency has created a new leadership position focused on data science. Green will serve in that position in an acting capacity until a permanent associate director for data science is recruited, NIH announced last month. During that time, Green will continue to serve as NHGRI director.
In his new role, Green will oversee an NIH-wide effort called the Big Data to Knowledge (BD2K) initiative. The program, which is slated to run until 2020, will pool money from various NIH institutes and centers, as well as the NIH Common Fund, which funds programs that cut across various NIH centers and institutes. NIH expects to begin giving grants under the program in fiscal 2014, which begins on Oct. 1. Within two to three years, the agency plans to spend about $100 million per year on the program, Green tells C&EN.
The BD2K initiative will have four major components. The first involves developing better data catalogs so researchers can find out where the data are and how to access them. This component will also involve policy changes, such as creating better ways of sharing data and better ways of crediting people for their data, Green notes. In addition, it will focus on developing data and metadata standards to facilitate the broader use of biomedical data.
The second major element of BD2K will be developing and disseminating software tools that are needed by the community, Green says. “Increasingly, biomedical research is transdisciplinary in nature,” he notes, but each discipline has its own software. Experts in a given domain know how to use their software, but they may not know how to use the software of other disciplines. “We need to make software more generally interoperable across different scientific areas,” he says.
Training both the next generation and the current generation of biomedical researchers in quantitative data sciences will be the third component of BD2K. “The tsunami of data and the need to analyze it arrived on the scene long after many scientists were students. We need to develop curriculum to get them up to speed,” Green stresses.
NIH plans to create enhanced curricula to help graduate students become more proficient in data sciences. It is unclear, however, whether NIH will require every graduate student supported by NIH funds to have a minimum competency in data sciences or whether it will just make the curricula available to them. Green predicts that young trainees will gravitate toward such curricula without being required to do so. “They are thirsty for good curricula in the data sciences,” he says. “They know that if they are going to be successful in the future, they need such skills.”
Fourth, the BD2K program will establish centers of excellence in data science at various universities. Each center will tackle some of the problems associated with big data by gathering a critical mass of people in one place to focus on a particular barrier. The centers could focus on areas such as cataloging and citation mechanisms, data quality, and privacy issues related to using electronic health records. “We want to identify all of these different bottlenecks,” Green says. “It will be catalytic. If we can break down a bottleneck, it could affect a whole community of researchers.”
NIH envisions having up to 15 of these data science centers. It plans to issue requests for applications over the next few years and hold workshops this year to identify the most pressing needs.
Getting researchers to share their data more freely across disciplines will ultimately require a culture change, Green notes. “I think the Human Genome Project deserves credit for breaking down the barrier to data sharing,” he says. “We made all the data available almost immediately, long before publication. Now that is the norm in genomics.” NIH would like to see that trend of data sharing go further.
One area in particular that could benefit from the BD2K effort is environmental health sciences, which is inherently interdisciplinary. The environmental health sciences community recognizes the importance of sharing data across disciplines, but it is struggling with data integration because of the heterogeneity of environmental health data.
To understand complex diseases that have an environmental component, “we need lots of information about tens of thousands or hundreds of thousands of people,” Green says. Such information needs to go beyond genome sequence data, he notes.
“We need access to medical information through electronic health records. We would love to integrate that information with exposure data or information about an individual’s diet or where they have lived,” Green says. If imaging data are available as part of the medical record, that increases the power, he adds. Such data integration is easy to do with 10 or 100 people, but it becomes a problem when you want to start doing it across hundreds of thousands of people, he says.
“With respect to environmental health science, we are a little late getting to the table in terms of dealing with data science,” says Allen Dearry, director of the Office of Scientific Information Management at NIH’s National Institute of Environmental Health Sciences. “There is much we can learn from our colleagues who addressed this more thoroughly over the past decade or so, especially in terms of genetics research.”
Dearry was part of a National Academies workshop held last month on integrating environmental health data. The goal of the workshop was to examine lessons that can be learned from data integration efforts in other scientific disciplines and to develop a plan to improve coordination and access to environmental health data.
Challenges raised at the meeting included the lack of incentives for researchers to share data, the need to train biomedical researchers in data sciences, poor access to research data that are controlled by publishers, the need to weed out poor-quality data, and privacy concerns.
Another problem is the distributed nature of environmental health data. “Instead of people submitting data sets to a public repository, they are making them publicly available in their own private way,” says Carolyn Mattingly, a biology professor at North Carolina State University. They are creating their own websites and repositories, which are difficult for researchers to find.
One increasingly important source of data for environmental health sciences is clinical data stored in electronic health records. The American Recovery & Reinvestment Act of 2009 “committed substantial funds to increasing the use of electronic health records and claims data for research and practice,” says Brian Schwartz, an epidemiologist at the Johns Hopkins Bloomberg School of Public Health. “We can use these data for relatively low cost to get large sample sizes and to get longitudinal data.” Longitudinal data track individuals over long periods of time.
Schwartz also participated in the National Academies workshop. At that gathering, he provided a few examples of how electronic health records can be used to link diseases with environmental contaminants. In one example, clinical data from electronic health records were integrated with nutrient management plans showing where manure is applied to agricultural fields. In that study, researchers found that the risk of community-associated methicillin-resistant Staphylococcus aureus infections rises with proximity to large-scale swine operations. Such operations typically feed animals large doses of antibiotics for nontherapeutic purposes, and the antibiotics end up in manure.
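A proximity-based exposure measure of the kind described above can be illustrated with a minimal sketch. It is not the method used in the study Schwartz described; it only assumes that patient residences and agricultural operations have already been geocoded to latitude and longitude. The `Site` type, the function names, and every coordinate below are hypothetical.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Site:
    """A geocoded point. Labels and coordinates are hypothetical."""
    label: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_km(a: Site, b: Site) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_source_km(residence: Site, sources: list[Site]) -> float:
    """Distance from a residence to the closest source (an exposure proxy)."""
    return min(haversine_km(residence, s) for s in sources)


if __name__ == "__main__":
    # Made-up residences and operations, for illustration only.
    operations = [Site("operation-1", 40.90, -76.80),
                  Site("operation-2", 41.10, -76.40)]
    residences = [Site("patient-A", 40.95, -76.85),
                  Site("patient-B", 41.30, -75.90)]
    for r in residences:
        print(r.label, round(nearest_source_km(r, operations), 1), "km")
```

In an actual analysis, a distance like this would be one covariate among many, and any use of patient addresses would fall under the privacy safeguards that govern electronic health records.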
Electronic health records are not perfect, Schwartz acknowledges. For example, there are no direct measures of environmental exposures in such records. Rather, exposures are inferred from an individual’s residential address combined with environmental data. Using data from electronic health records also raises concerns about patient privacy, and all studies that use such data must be approved by an institutional review board. In addition, the learning curve for using such data is steep, and the data require extensive processing to get them into a usable form, Schwartz says.
Nonetheless, electronic health records provide an inexpensive way to get 10 years of data on tens of thousands of patients, Schwartz notes. “It’s a new world in environmental epidemiology with these kinds of data.”