Adapting To The Data Explosion | August 10, 2009 Issue - Vol. 87 Issue 32 | Chemical & Engineering News
Volume 87 Issue 32 | pp. 25-29
Issue Date: August 10, 2009

Adapting To The Data Explosion

Securing the integrity, accessibility, and stewardship of data in the digital age
Department: Government & Policy
Keywords: digital data, data integrity, data management
Digital technologies are responsible for ever-growing volumes of research data.
Credit: Shutterstock

Advances in digital technology have revolutionized the generation, accumulation, and storage of scientific and engineering data. The enormous amount of data generated and the intricate processing that data undergo have made it harder to verify their accuracy and validity.

To understand the implications of the data explosion, a group of scientific organizations, federal agencies, and foundations asked the National Academies to look into the impacts of digital technologies on research and how the stakeholders might come to grips with the rapidly changing situation. The report resulting from the Academies’ examination calls on research stakeholders to move quickly to ensure that digital technologies are used fully and appropriately.

“The revolution in digital data and the ability to transmit and communicate digitally the conclusions from that data have really changed how science is done and how scientists interact with each other,” says Phillip A. Sharp, institute professor at the David H. Koch Institute for Integrative Cancer Research at Massachusetts Institute of Technology and cochair of the committee that prepared and wrote the report. “This has been driven by technology that some even equate to as big a change as the invention of the printing press.”

“Ensuring the Integrity, Accessibility & Stewardship of Research Data in the Digital Age” does not attempt to be a complete review of the detailed range of research procedures and data management styles that vary considerably across scientific fields, but it examines the consequences of the changes affecting researchers with respect to the three areas in the title. Specifically, the committee developed what it terms a central principle for each of these areas.

The first central principle is that ensuring the integrity of research data is essential to advancing knowledge and maintaining public trust and that researchers themselves are ultimately responsible for ensuring that integrity. But the report also stresses that integrity is ensured differently across the various fields of science and that this variation should continue.

“One of the things we tried hard to do in this report is project the benefits and challenges of this data revolution from the perspective of the individual scientist,” Sharp tells C&EN. “And we tried to capture in each discipline how scientists make their decisions about data and why.”

The report considers it a universal requirement that all data be true, accurate, and scrupulously recorded. At the same time, it recognizes that the copious amounts of data resulting from high-speed computing and communications have become a big problem, even as they extend researchers’ capability to generate and analyze data.

As more data are generated and digital technologies play a bigger role in the research enterprise, the concern is that digital technologies could reduce data quality and compromise research integrity if used improperly. Digitization can introduce spurious information into a representation, and complex analyses can yield misleading results if researchers do not carefully monitor and understand the analysis process, the report says.

“Because so much of the processing and communication of digital data are done by computers with relatively little human oversight, erroneous data can be rapidly multiplied and widely disseminated,” the report states. “Some projects generate so much data that significant patterns or signals can be lost in the deluge of information.”

Better understanding of digital processing by researchers is one way to solve this problem, Sharp says. “The report emphasizes the importance of digital technology and calls for increased education in this area and for more recognition of people who really know the field—data professionals.”

Data professionals can have a useful impact on research, Sharp says, echoing a major point of the report: These individuals can bring new perspectives to data or new ways of combining data that may yield important advances. They can show researchers new ways to format and transmit large volumes of data to make it easier for other researchers to use. This, in turn, will help maintain the integrity of the data, Sharp notes. They can also help others improve data communication, visualization, and education outreach. “Research sponsors are going to need to recognize the increasingly important role these people play,” he says.

MIT’s Sharp urges education on using digital technologies
Credit: Donna Coveney/MIT

The report strongly supports peer review as a mechanism for ensuring integrity. “Peer review has been an incredibly valuable tool for the scientific community to propel itself forward,” Sharp says. “But it isn’t the only tool we have to judge the integrity of data. Transparency and openness are other critical tools, because if you make the data available, others will scrutinize it and try to build on it or test it.”

Key to the panel’s recommendations to ensure integrity is not only training in responsible conduct of research, but also training in how to manage research data as is appropriate in the researcher’s field. All stakeholders in a research field need to make sure that they understand and adhere to recognized professional standards for data integrity and management, the report says.

Although data integrity could be deeply affected by the rapid growth of digital technology, data accessibility has probably changed the most as a result of this growth. “We have this technology that allows us to build on what we have discovered in very new and powerful ways,” Sharp says, “and we strongly endorse data accessibility for the advancement of science.” His statement underscores the report’s second central principle: Data should be accessible to other scientists so they can be verified and used for future research.

Accessibility is a complex matter, the report states, because data-sharing norms vary to such an extent that different research fields can be said to have different data cultures. And even though the National Academies committee promotes data accessibility, it provides circumstances when data may be withheld. These include research pertaining to national security, nuclear or biological threats, chemical explosives, or information technology infrastructure. The committee also cites copyrights and patents, especially when research is sponsored by private corporations.

The changing federal requirements for making research data available add even more complexity. For example, the 2007 America Competes Act has stipulations that require data sharing. And the National Institutes of Health requires all NIH-funded researchers to deposit their research papers in the National Library of Medicine’s PubMed Central repository no later than 12 months after publication.

The result is a broad universe of expectations, general principles, or standards for making data available. The report recommends that generally data should be made available unless there is a compelling reason not to do so. At the same time, it calls on stakeholders to recognize the complexity of the issue. The committee calls on stakeholders—researchers, journal publishers, and scientific societies—in the different scientific fields to develop clear policies or standards for data sharing within their field.

“Conflicts on data availability should be resolved by the action of the research community,” Sharp says. “We hope we’ve provided a context to stimulate people to think about this problem. It needs to be thought about and a balanced decision reached.”

Finally, the report considers the long-term stewardship of data. Once these large volumes of digital data are determined to have integrity and are accessible to other scientists, how long will they be available? The committee realizes that not all data need to be preserved, but deciding what to save and what to discard becomes increasingly difficult as ever larger quantities of data are generated.

In response to the question, the committee developed its third central principle: Research data should be retained for future use, and when they have long-term value, they should be documented, referenced, and indexed so others can find and use them accurately and appropriately.

A study released in January on digital data by a task force of the White House National Science & Technology Council considers this issue, too. That study, “Harnessing the Power of Digital Data for Science & Society,” says that “preservation of digital scientific data is both a government and private sector responsibility and benefits society as a whole.” That task force, composed of representatives from federal agencies such as NIH and the National Science Foundation, concludes that “communities of practice” are an essential feature of the digital landscape and federal agencies need to promote a data management planning process for projects that generate data that have to be preserved.

Much of data stewardship revolves around the idea of data ownership, the Academies’ report notes. Maintenance of research data requires that someone or some group take the responsibility to ensure that the data are accessible, that they do not degrade over time, and that they are updated as necessary. “If you’re going to keep the data in an accessible form and keep the metadata and specific programs alive [that are] necessary to access it, it’s not going to be without significant expense,” Sharp says.

Other stakeholders, such as research sponsors, foundations, federal agencies, professional societies, and journal publishers, have a big stake in adapting to digital technology. For example, the American Chemical Society, which publishes C&EN and was one of the report’s sponsors, is seriously considering the report’s recommendations.

“ACS is looking forward to determining how we can promote the standards, training, and increased awareness of best practices in these areas,” says Brian Crawford, president of the ACS Publications Division. “ACS has in place quite comprehensive ethical guidelines for publication of chemical research,” he says. “We will determine, in consultation with our editors, whether provisions for data integrity and stewardship should be elaborated in greater detail, building on the recommendations of this study.”

Chemical & Engineering News
ISSN 0009-2347
Copyright © American Chemical Society
