Issue Date: September 24, 2012
Globalizing Data Infrastructure
Momentum to make scientific data free to access and reuse is growing steadily. Numerous workshops, committees, reports, and recommendations have emerged over the past decade to facilitate data sharing, but data infrastructures remain fragmented. As a result, data are not interoperable from one discipline, institution, or country to the next.
To better coordinate data infrastructure activities, a group of researchers and government funding agencies from around the world has taken steps to form a community-based alliance. With initial funding from the U.S. and Australian governments and the European Commission, the nongovernment group aims to promote data sharing across a wide range of disciplines and countries.
As a first step, the group plans to improve the harmonization of standards, policies, and technologies for data sharing through the creation of working groups and action plans. It will not set formal standards or policy, but instead will focus on achieving consensus on voluntary steps that can be taken by the research community.
Although the group has yet to decide on a name, some people are referring to the collaboration as the United Nations of data. In a concept paper released in June, the group was called the Data Web Forum. But last month at a symposium sponsored by the U.S. National Academies Board on Research Data & Information, the community considered calling itself the Research Data Alliance—a name that is likely to stick, according to those involved with the group.
Name aside, the alliance plans to start small. In the U.S., the National Science Foundation and the National Institute of Standards & Technology expect to provide initial funding to academic researchers and nonprofit organizations to help drive the effort. Altogether, the governments of the U.S. and Australia and the European Commission are expected to spend $4 million to $5 million on the effort over the next two to three years, Alan Blatecky, director of NSF’s Office of Cyberinfrastructure, told C&EN after the symposium.
Blatecky is one of the leading forces behind the effort. He cowrote the concept paper with Christopher L. Greer, associate director for program implementation in the Information Technology Laboratory at NIST. The two government officials were motivated by the lack of interoperability of scientific data, which they say “increases the costs of data preservation, discovery, access, reuse, and repurposing by preventing automated solutions and limiting economies of scale.”
Although government agencies are financing the effort to get it off the ground, the idea is to let the research data community drive it. “This is not going to be a new regulatory group,” Blatecky told participants at last month’s symposium.
The ultimate goal of the group is to produce “high-quality documents to improve the way we store, use, and manage data,” Blatecky said. “There is fear that we are being buried by data and not finding any way out of it,” he noted. “Scientists are worried about data being lost. We are generating so much data. We can’t store everything we are getting.”
Indeed, the amount of all digital data—not just scientific data—being created, captured, and replicated worldwide grew 10-fold between 2006 and 2011, from about 180 exabytes to 1,800 exabytes, according to International Data Corp. One exabyte is equivalent to 10¹⁸ bytes. By 2023, the amount of digital data in bytes is expected to exceed Avogadro’s number—that is, 6.02 × 10²³.
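Those figures can be sanity-checked with a little arithmetic. The sketch below takes the 2006 and 2011 totals cited above and, as a simplifying assumption not drawn from IDC's report, extrapolates the same tenfold-per-five-years pace forward to see when the global byte count would pass Avogadro's number:

```python
import math

EXABYTE = 10**18  # one exabyte in bytes

bytes_2006 = 180 * EXABYTE
bytes_2011 = 1_800 * EXABYTE

# The tenfold growth over five years cited in the article
print(bytes_2011 / bytes_2006)  # → 10.0

# Assuming that pace continues (10x every 5 years), how many years
# until total digital data exceeds Avogadro's number of bytes?
avogadro = 6.02e23
years = 5 * math.log10(avogadro / bytes_2011)
print(round(2011 + years, 1))  # → 2023.6
```

The extrapolation lands in late 2023, consistent with the "by 2023" estimate in the article.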
So far, interest in the concept of a global data alliance has been tremendous, Blatecky noted. “Multiple government agencies are asking how they can play. We are being cautious. We can’t grow too fast,” he said, adding that more governments are expected to join the effort next year.
The group plans to have an informal meeting in Arlington, Va., next month, followed by its first official meeting in Europe in March 2013. At next month’s meeting, academic researchers will begin organizing the alliance’s structure and deciding which data challenges to address.
Initially, the group plans to focus on areas where progress can be made in 12 to 18 months, noted Francine Berman, a professor of computer science at Rensselaer Polytechnic Institute, who is leading the U.S. effort to shape the alliance.
Berman told participants at last month’s symposium to think about areas related to data infrastructure that need a broad community to make progress. Such areas could include improving the discoverability of data or developing better data analytics, best practices, or standards, she said.
One of the first challenges the alliance will tackle is how to identify scientific communities. “Let’s start looking at chunking data,” Blatecky said. In other words, find a solution that works for a bit of data in one community, and another solution for a set of data in another community, and start putting the data together, he said.
“Let’s not worry about having the perfect solution. None of us are going to live long enough to see that happen,” Blatecky emphasized. “Let’s stop talking about data sharing, and let’s start actually doing it. We can fix things as we go along.”
In Europe, a separate but similar effort is under way to develop a cross-disciplinary data infrastructure. Over the past year, a group called the Data Access & Interoperability Task Force has been talking to researchers in 20 different disciplines, Peter Wittenburg of the Max Planck Institute for Psycholinguistics, in the Netherlands, noted during the National Academies symposium. The goal of the task force is to find data services that the various disciplines have in common: for example, long-term archiving and offering computing capacity.
“We have a network of big data and computing centers in Europe,” said Wittenburg. The task force is trying to determine whether there are “common services that we can hand over to these big data centers,” he explained.
The European task force plans to meet next month in Arlington in conjunction with the new global alliance. The two efforts are expected to eventually merge.
Meanwhile, NSF and other agencies are thinking seriously about data policies independent of the global research alliance. In fiscal 2011, for example, NSF began requiring grantees to submit data management plans with their grant proposals.
The White House Office of Science & Technology Policy (OSTP) is considering options for promoting data sharing to accelerate science and drive economic development and innovation.
“Access to data that is produced with federal funding, even after publication, is sometimes difficult,” Michael Stebbins, assistant director for biotechnology at OSTP, noted at a symposium on digital curation sponsored by the National Academies Board on Research Data & Information in July.
OSTP is in the process of reviewing input solicited from the public to improve the long-term preservation and accessibility of digital data produced with federal funds. But it is grappling with how to balance the benefits of increased access to scientific data with the burdens such a transition will put on federal agencies, researchers, private-sector publishers, and scientific societies.
“Balancing the desire to move forward to a better state against the burdens on the private sector is essential,” Stebbins said. “We believe we are close and that we will be taking steps soon.”
- Chemical & Engineering News
- ISSN 0009-2347
- Copyright © American Chemical Society