Computational Chemistry

Chemists debate machine learning’s future in synthesis planning and ask for open data

A controversial paper ignites discussion around how to use data-based algorithms and report results

by Fernando Gomollón-Bel, special to C&EN
May 17, 2022 | A version of this story appeared in Volume 100, Issue 18


Two scientists are surrounded by floating molecules that they are poking and making glow. One scientist stands in red surroundings and the other stands in blue.
Credit: Shutterstock/C&EN

In March, a paper in the Journal of the American Chemical Society sparked a heated Twitter debate on the value of machine learning for predicting optimal reaction pathways in synthetic chemistry. The authors argue that certain data-driven algorithms capture only trends that already exist in the chemical literature, thus failing to reach new, creative conclusions (J. Am. Chem. Soc. 2022, DOI: 10.1021/jacs.1c12005).

But other experts disagree and advocate for the use of such algorithms. The key to success, they say, is for researchers to ask computers the right questions, establish good benchmarks for machine learning, and promote open-data initiatives.

Bartosz A. Grzybowski of the Institute for Basic Science in South Korea co-led the study. He and his group have created software, such as Chematica (now Synthia) and Allchemy, that uses the basic rules of chemistry to predict multistep synthetic pathways and complex biosynthetic ones. In contrast, other algorithms use existing data to predict these pathways.

For the JACS study, though, the team wanted to examine the difficulties of data-driven approaches. The researchers trained several popular algorithms on more than 10,000 examples of the Suzuki-Miyaura cross-coupling reaction that used heterocyclic building blocks. Grzybowski says the team chose the Suzuki-Miyaura reaction because its high prevalence in organic syntheses meant a relatively large database of examples to work with. But none of the algorithms offered meaningful, new predictions of optimal reaction conditions, even though the researchers “pretty much exhausted the methods used in chemical [artificial intelligence],” Grzybowski says.

Grzybowski and his coauthors concluded that the algorithms for recommending synthetic pathways are biased toward the reaction conditions reported most often in scientific papers. So an algorithm trained exclusively with data from traditional chemical transformations will never defy norms or offer original suggestions. “Data-trained algorithms don’t think out of the box,” Grzybowski says. The study’s results make him think that “data-driven artificial intelligence is doomed.”

But other researchers don’t agree with the study’s conclusions. Computational chemist Connor W. Coley of the Massachusetts Institute of Technology argues that data-driven algorithms have utility and don’t deserve to be written off. “It can be really hard to measure success in machine learning,” he says.

In contrast to the authors’ conclusions, Coley says some of the data-based algorithms designed reaction conditions that outperformed those cited most often in the databases the algorithms trained on. Disregarding the improvements because they’re unsatisfactory is entirely subjective, Coley said on Twitter. He argues that chemists need to establish the right benchmarks and evaluation metrics to properly assess whether the conclusions of data-driven algorithms represent a real improvement over what is already known.
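To make the benchmarking question concrete, one simple check is to compare a model against a trivial baseline that always recommends the most frequently reported conditions. The Python sketch below is only an illustration: the condition labels, the toy data, and the function names are hypothetical and are not taken from the JACS study or from Coley's work. It also scores agreement with the reported conditions, which is just one possible metric and says nothing about whether a different suggestion would have given a better yield.

```python
from collections import Counter


def most_popular_baseline(train_labels):
    """Return the single most frequent condition label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def top1_accuracy(predictions, truths):
    """Fraction of reactions where the prediction matches the reported conditions."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


# Hypothetical toy data: each label stands for one catalyst/base combination.
train = ["Pd(PPh3)4/K2CO3"] * 6 + ["Pd(dppf)Cl2/Cs2CO3"] * 3 + ["Pd(OAc)2/K3PO4"]
test_truth = [
    "Pd(PPh3)4/K2CO3",
    "Pd(dppf)Cl2/Cs2CO3",
    "Pd(PPh3)4/K2CO3",
    "Pd(OAc)2/K3PO4",
]
# Hypothetical output of some trained model on the same four test reactions.
model_preds = [
    "Pd(PPh3)4/K2CO3",
    "Pd(dppf)Cl2/Cs2CO3",
    "Pd(PPh3)4/K2CO3",
    "Pd(PPh3)4/K2CO3",
]

baseline_label = most_popular_baseline(train)
baseline_acc = top1_accuracy([baseline_label] * len(test_truth), test_truth)
model_acc = top1_accuracy(model_preds, test_truth)

print(f"popularity baseline: {baseline_acc:.2f}, model: {model_acc:.2f}")
```

A model is only interesting on such a benchmark if it clearly beats the popularity baseline, and deciding how large that margin must be is exactly the kind of evaluation choice Coley says the field needs to agree on.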

Coley also points to established benchmarks in retrosynthesis, the practice of backtracking from a target molecule to plan a reaction pathway. Many pathways could be possible, and chemists choose which one is appropriate for their needs. For example, they may sacrifice a better yield to avoid hazardous solvents. So to achieve good results with data-driven algorithms, chemists need to first apply the right constraints and preferences. “If we need out-of-the-box solutions, we could perturb the models to reach beyond the [traditional] limits, rewarding uncertainty and optimistic scenarios,” Coley tells C&EN.
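Coley does not name a specific technique, but one common way to reward uncertainty and optimistic scenarios is an upper-confidence-bound score borrowed from Bayesian optimization: each candidate is ranked by its predicted outcome plus a bonus proportional to the model's uncertainty. The Python sketch below assumes a hypothetical model that returns a predicted yield and a standard deviation for each set of conditions; the candidate names, the numbers, and the `kappa` parameter are illustrative.

```python
def ucb_score(mean, std, kappa=1.0):
    """Optimistic score: predicted yield plus a bonus for model uncertainty."""
    return mean + kappa * std


def pick_condition(candidates, kappa=1.0):
    """Return the name of the candidate with the highest optimistic score."""
    best = max(candidates, key=lambda c: ucb_score(c["mean"], c["std"], kappa))
    return best["name"]


# Hypothetical predictions (yield as a fraction) for two sets of conditions.
candidates = [
    {"name": "frequently reported conditions", "mean": 0.70, "std": 0.02},
    {"name": "rarely reported conditions", "mean": 0.60, "std": 0.15},
]

# kappa = 0 ignores uncertainty and favors the well-known conditions.
conservative_choice = pick_condition(candidates, kappa=0.0)
# kappa = 1 adds the uncertainty bonus and favors the less-explored conditions.
exploratory_choice = pick_condition(candidates, kappa=1.0)

print(conservative_choice)
print(exploratory_choice)
```

Raising `kappa` pushes the recommendation toward conditions the model knows little about, which is one way a data-driven tool could be steered beyond the most frequently reported choices.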

On top of better benchmarks, chemists need more sophisticated inputs for any machine learning approach. “We need better data,” says Gabe Gomes of Carnegie Mellon University, an expert in machine learning and automated synthesis. He advocates for data accessibility and stewardship in the sciences.

“A common data source is the US Patent and Trademark Office, with a great bias towards successful, commercial transformations,” Gomes says. But these patents and published data typically omit information on reactions that don’t work, and therefore algorithms never learn to avoid reactions that are not possible. Also, “most of the data in the pharmaceutical industry is never shared, mostly because of issues with intellectual property,” he says.

Beyond publishing relevant data, another challenge is ensuring those data are accessible and in formats that machine learning algorithms can use. Núria López, an expert in computational chemistry at the Institute of Chemical Research of Catalonia, considers it crucial to follow the principles for scientific data management known as FAIR: findability, accessibility, interoperability, and reusability of digital assets.

“FAIR data ensures transparency, as well as important features such as traceability,” López says. Better descriptors beyond the text-based, computer-readable SMILES (simplified molecular-input line-entry system), which is currently used for describing molecules, could also help anonymize proprietary information so that it can be deposited into public databases and contribute to training sets for algorithms.

Scientific publishers should also endorse open formats for raw data and supporting information, Gomes says. For example, the Open Reaction Database is an initiative promoted by Coley and other machine learning experts to support efforts in reaction prediction, chemical synthesis planning, and experimental design (J. Am. Chem. Soc. 2021, DOI: 10.1021/jacs.1c09820). The platform provides a structured format for chemical reaction data, always under FAIR principles. “As the name implies, the Open Reaction Database is designed for open access and community contributions,” Coley says.

Another example of an open format is ioChem-BD, a platform for computational chemistry data sets that López helped develop. It is one of the FAIR platforms that the American Chemical Society and Nature’s journal Scientific Data recommend submitting authors use (ACS publishes C&EN). “Advancing towards FAIR and open data will surely push our prediction powers,” she says. She also recommends that scientists not depend entirely on data-driven machine learning platforms. Instead, she encourages the use of “hybrid models with experimental and traditional computational modeling.”

Machine learning experts must remain humble, Gomes says. Of course, some studies will make mistakes. “Us chemists, much like algorithms, still have lots to learn.”

Fernando Gomollón-Bel is a freelance writer based in Cambridge, England.

