Findable: Data, as well as the metadata describing them, should have globally unique and persistent machine-readable identifiers.
Accessible: Data and their metadata should be retrievable via their identifiers, with a standardized protocol that incorporates an authentication and authorization procedure, as necessary.
Interoperable: Data and their metadata should be formatted in a formal, shared, and broadly applicable language that includes cross-references to other metadata. These cross-references should include the relationships between the data.
Reusable: Data and their metadata should be described thoroughly enough so that they can be replicated and combined in different settings.
Source: Go FAIR (www.go-fair.org/fair-principles/).
Compared with even 50 years ago, today’s chemistry lab is a very different place. More researchers are carrying out more experiments than ever before, using increasingly sophisticated and automated tools and generating a deluge of data.
An analysis of the scientific literature suggests that research output (including articles, books, and data sets) is growing by 8–9% a year (J. Assoc. Inf. Sci. Technol. 2015, DOI: 10.1002/asi.23329). But the way data from experiments are shared and reused hasn’t kept pace, chemists say. Useful findings and raw data can languish in PhD theses stored in libraries or in PDFs on servers. Some researchers and policy makers would like to change that, pushing for the chemistry community to implement what are called the FAIR principles of data management. Those stakeholders’ efforts are being bolstered by funders like the European Research Council (ERC) and the US National Institutes of Health, which are mandating that the science they fund be made open access and have data management plans in place. The ERC has also produced FAIR guidelines for projects funded by Horizon 2020 grants.
Egon Willighagen is a chemist at Maastricht University with a passion for open science and standards. In his wallet, on what at first glance looks like a credit card, are the FAIR principles. Research output should, it says, be findable, accessible, interoperable, and reusable. What that means in practice is that articles and deposited data must be interlinked, and the data stored within them must be machine readable and usable.
Ultimately, Willighagen says, the FAIR principles are about making sure that the work that chemists and other scientists are doing can be found, extracted, and then applied elsewhere. But despite this seemingly sensible goal, he adds, many scientists have not kept up with ensuring that their raw data are preserved and accessible. “It’s really sad,” he says.
Willighagen is a member of several initiatives and groups building tools and standards to help “FAIR-ify” chemistry. One of these, the Go FAIR Chemistry Implementation Network (ChIN), has been working in collaboration with organizations like the International Union of Pure and Applied Chemistry to establish data standards and protocols. Led by Simon Coles at the University of Southampton, ChIN created its manifesto earlier this year and has spent its first few months building a road map. “I think we’re barely even at the ‘grasping at low-hanging fruit’ stage,” Coles says of the effort.
Almost 30 years ago, Coles says, the crystallography community came together to agree on how the data from crystallographic experiments should be packaged so scientists everywhere could pick up the information from databases and use it. The result was crystallographic information files, or CIFs, which are now the standard for reporting crystal structures in a machine-readable way. Similarly, Coles says, other data types could be made FAIR if the chemistry community agreed.
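What “machine readable” means in practice here can be seen in a small sketch. A CIF stores crystal data as tagged key-value items that any program can pick up without human interpretation. The toy parser below, written in plain Python, handles only one-line data items; real CIFs also contain `loop_` tables and multiline values, and the snippet itself is an invented example, not a file from any database.

```python
# Minimal sketch: pull simple key-value data items out of a CIF-style
# text block using only the standard library. Real CIFs also contain
# loop_ tables and multiline values, which this toy parser ignores.

CIF_SNIPPET = """\
data_example
_cell_length_a    5.431
_cell_length_b    5.431
_cell_length_c    5.431
_symmetry_space_group_name_H-M   'F d -3 m'
"""

def parse_simple_cif(text):
    """Return a dict of _tag -> value for one-line CIF data items."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, _, value = line.partition(" ")
            items[tag] = value.strip().strip("'")
    return items

items = parse_simple_cif(CIF_SNIPPET)
print(items["_cell_length_a"])                    # -> 5.431
print(items["_symmetry_space_group_name_H-M"])    # -> F d -3 m
```

Because every tag has an agreed meaning in the CIF dictionary, a structure deposited in Southampton can be re-used by a script written anywhere, which is exactly the property Coles describes.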
Nuclear magnetic resonance data, says Cornell University chemistry librarian Leah McEwen, could be suited to a similar treatment. Raw spectra files could be uploaded at the same time a journal article is submitted or accepted, says McEwen, who helped run a FAIR data workshop earlier this year at the American Chemical Society national meeting in Orlando, Florida. The files could include experimental metadata (essentially, data about the data) to describe how the spectra were obtained. They could also include information such as International Chemical Identifiers (known as InChIs), which are a machine-readable way of describing chemical structures.
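An InChI is machine readable because it packs structure information into slash-separated “layers”: a formula layer, a connectivity (`c`) layer, a hydrogen (`h`) layer, and so on. As a rough illustration, the sketch below splits the standard InChI for ethanol into those layers with only the standard library; the helper function is invented here and is not part of any InChI software.

```python
# A standard InChI encodes a structure as slash-separated "layers":
# version, molecular formula, connectivity (c), hydrogens (h), etc.
# Illustrative sketch using ethanol's standard InChI.

def inchi_layers(inchi):
    """Split an InChI string into its version, formula, and named layers."""
    assert inchi.startswith("InChI=")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]   # e.g. 'c1-2-3' -> layers['c'] = '1-2-3'
    return layers

layers = inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(layers["formula"])   # -> C2H6O
print(layers["c"])         # -> 1-2-3  (atom connectivity)
```

Attaching an identifier like this to a deposited spectrum is what lets software, not just people, link the data back to the compound it describes.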
Christoph Steinbeck, a cheminformatician who works on metabolomic data at Friedrich Schiller University Jena, points to yet another example of data ripe for FAIR-ification. Consider, he says, a synthesis route published in the Journal of Organic Chemistry. In principle, he says, the large experimental section in the paper could be formatted and structured in a machine-readable way so that researchers anywhere could extract the protocols and reproduce them with automated scripts or programs or by other methods. If the section is formatted properly, chemists could use the data or reproduce the protocol even if it was separated from the context of the paper.
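To make Steinbeck’s idea concrete, a structured experimental section could look something like the sketch below: the protocol lives as data (here JSON) rather than prose, so a script can walk the steps without the surrounding paper. The schema, with fields like `action`, `reagent`, and `duration_min`, is invented purely for illustration and is not a published standard.

```python
# One possible encoding of a synthesis protocol as structured data,
# so software can read the steps independent of the paper's prose.
# The schema (fields like "action", "reagent", "duration_min") is
# invented here for illustration only.
import json

protocol_json = """
{
  "steps": [
    {"action": "add",    "reagent": "NaBH4", "amount_mmol": 2.0},
    {"action": "stir",   "temperature_C": 25, "duration_min": 30},
    {"action": "quench", "reagent": "sat. NH4Cl (aq)"}
  ]
}
"""

protocol = json.loads(protocol_json)
for i, step in enumerate(protocol["steps"], start=1):
    print(f"Step {i}: {step['action']}")
```

A protocol stored this way could be validated, searched, or fed to lab-automation software, which is what “separated from the context of the paper” amounts to in practice.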
Steinbeck is leading the chemistry section of the National Research Data Infrastructure (NFDI), a network in Germany tasked with building tools and infrastructures for FAIR data. The chemistry consortium goes by NFDI4Chem, and the German government has made long-term funding available for the NFDI consortia. “We can ask for up to €5 million per year to build this infrastructure,” Steinbeck says. He thinks the money will mostly pay the salaries of the people doing the work. The infrastructure itself, Steinbeck says, will include repositories where researchers will have to deposit data themselves, with a minimum set of metadata standards. Other repositories will be managed by data curators.
One US-based example of a curated data service that chemists might be familiar with is CAS, a division of ACS that provides “content and chemical information” to researchers and organizations. ACS also publishes C&EN. CAS acquires data from sources such as publishers and patent offices, indexes them, standardizes them, and makes them easily searchable for its customers, explains Gilles Georges, vice president of CAS content operations.
“The challenge really is the amount of data coming at us, especially from emerging economies in Asia—especially China,” Georges explains. Trying to make data more interconnected, he says, can help, but he believes computers and algorithms can do only so much. Manual curation by human experts who can make links between data that software might not is still valuable, he says. With regard to the open-science movement, Georges says, “there’s now an expectation that things have to be free.” But FAIR data doesn’t necessarily mean free data, he explains. He sees value added in the services that CAS provides that customers are willing to pay for.
You might need to pay for access or sign agreements to use some FAIR data, says Albert Mons, a partner of Phortos Consultants. Mons is a key part of the FAIR data movement, and at Phortos Consultants, he works with the private sector to help people adopt FAIR principles. Mons gives the example of patient data, for which there are privacy issues. These issues mean that the data cannot be made open but can be accessed through the proper channels. “FAIR is not open and free,” Mons says. FAIR, he emphasizes, just means it’s technically possible for data to “talk to” each other.
Even companies should see the value in FAIR data, Mons argues, adding that as a consultant he works with firms such as banks and pharmaceutical companies, as well as smaller start-ups. For example, he says, a pharmaceutical company with multiple subsidiaries will have a wealth of knowledge. But if the firm doesn’t put a FAIR data infrastructure in place, it “will not be able to have all these data talk to each other” or extract the implicit knowledge the data contain.
About “80% of all the effort regarding data goes into data wrangling and data crunching and preparation,” Mons says. “Only 20% is actually effective research and analytics.” That’s because data aren’t yet FAIR. He argues that if data creators would make their data FAIR at the outset, they would save a lot of time and increase efficiency. But the challenge, all researchers agree, is describing the data in a way that enables them to talk to other, similar data. That is the interoperability challenge, or the I in FAIR.
Chemistry is often described as the central science. And as a central science, chemistry underpins many other disciplines. So its data should be accessible and interoperable across those other disciplines, FAIR data proponents argue. But to make data usable across disciplines requires that they be described in an unambiguous way. So chemists need to apply precise metadata standards to data, the proponents argue. And that, they say, requires broad collaboration to agree on those standards.
“It needs to be an international process,” Steinbeck says. “We see it as our mission to foster and enable these international processes, to create the missing standards in the different domains of chemistry.” Willighagen agrees, saying he thinks professional societies should also help organize those conversations.
Chemists, the proponents say, need to get more involved in these discussions so that the implementation initiatives fully represent the chemistry community. This is why workshops and outreach events have been taking place globally. And more are planned. But the shift will also require a culture change in chemistry, they believe. McEwen, the Cornell librarian, says chemists have traditionally outsourced much of their data management to journals and third-party databases. In the future, she says, they will need a new mind-set. They’ll need to describe and deposit their data as they are being created. But of course, before chemists can do that, FAIR data infrastructure must be put in place.
“My favorite topic, the chicken-and-egg problem!” Steinbeck says. He explains that people will start depositing well-annotated data only if the infrastructure already exists, but without the infrastructure, the negotiation of data standards doesn’t make much sense. The whole process takes a while to work itself out, he says. For example, in Steinbeck’s field of metabolomics, scientists negotiated data standards around 2006, he says. But it took until 2012 or 2013 for the first repositories to appear.
The end goal, Mons points out, isn’t necessarily FAIR data. “The end goal is better analytics and advanced innovation.” But FAIR data are needed to get there.
“We don’t have all the answers,” McEwen says. “But this is something that we’re going to need to figure out.” The more that chemists can annotate their data and make them searchable and available, she says, “that’s a win for everyone.”