Big Data

Editorial: Show us the data

Scientific advances are built on shared data, but walled gardens are everywhere

by C&EN editorial staff
October 25, 2024 | A version of this story appeared in Volume 102, Issue 34

 

Chemical formulas and data processing overlaid onto a photo of some servers.
Credit: Shutterstock

Earlier this month, the Royal Swedish Academy of Sciences awarded the 2024 Nobel Prize in Chemistry for work on computational protein structure prediction and de novo protein design. The award may have seemed sudden, as algorithms from Google DeepMind and David Baker’s lab, in particular, have mostly come out in the past decade.

But the field had been growing and developing for much longer. Algorithms like AlphaFold and RoseTTAFold exist only because of foundational work over the preceding years. That is not unique to artificial intelligence protein design or protein structure prediction. All science is built on what came before. But in the case of this year’s Nobel-winning work, the algorithms’ success owes much to the huge, well-maintained databases of scientific data that were used to train them.

An email sent to C&EN on the day of the award underlines that debt: the prize was “wonderful,” says Sameer Velankar, team leader of the Protein Data Bank in Europe and AlphaFold database at the European Molecular Biology Laboratory’s European Bioinformatics Institute. But, he adds, “tools such as AlphaFold would not have been possible without the vast volumes of experimental data that researchers have been generating for decades, and sharing openly in databases.”

From the beginning, open access was at the heart of these data collections. Even before the internet allowed files to be transferred, databases were sent through the postal service on magnetic tape.

These databases are so useful to researchers because of how well curated and structured their information is: scientists submit entries with specific metadata attached, and teams of curators annotate and maintain the submissions.

But the field of molecular biology is an outlier. If this year’s Nobel Prize in Chemistry shows what well-indexed, publicly available data can enable, it also highlights the fact that equivalent resources in fields such as materials science and pharmaceuticals just do not exist in the same way.

Without those resources, chemistry will fail to fully capitalize on the promise of artificial intelligence and machine learning.

Take the field of drug discovery, for example. Building and testing new drug molecules is an expensive business, so it is in companies’ best interest to protect their intellectual property. Companies keep their databases of millions of molecular structures and related information proprietary because those resources are what companies sell or use to raise money.

Internally, teams of scientists at these firms are developing algorithms to identify potential new drugs or predict the properties of a new molecule. But they will be able to train those models only on the data they have access to. Researchers in academic institutions wanting to build and train algorithms for similar applications will have access to even less.

Imagine the possibilities if these data could somehow be shared. To do that, companies will need to set up protocols that protect their competitive advantage and annotate their data consistently, so that information from source A and source B can be meaningfully compared.

That’s no small task. But industry insiders acknowledge that there is a need to grapple with this problem. Terray Therapeutics cofounder Eli Berlin recently told C&EN that he is starting to see executives in start-ups and Big Pharma acknowledge the need to share data to improve their training sets.

“I’m encouraged that there could be a path towards shared efficiency in a way that protects everybody’s proprietary IP,” Berlin says, but he admits that he has yet to see an answer.

C&EN doesn’t have the answers either. But we urge the chemistry enterprise not to give up. Unless companies can find a way forward, there is a danger that the 2024 Nobel Prize will not be the harbinger of new and exciting AI-powered innovation but a lone example of what could have been.

This editorial is the result of collective deliberation in C&EN. For this week’s editorial, the lead contributor is Laura Howes.

Views expressed on this page are not necessarily those of ACS.
