Computational Chemistry

Mapping reaction space with machine learning

Algorithms made to understand language can accurately classify chemical reactions

by Sam Lemonick
February 4, 2021 | A version of this story appeared in Volume 99, Issue 5


Credit: IBM Research-Zurich
Machine-learning tools called transformers can classify chemical reactions according to type, shown as different colors. This tree diagram includes tens of thousands of reactions, each represented as a point, with similar reactions placed near one another.

A machine-learning technique first developed for understanding language can accurately classify reactions according to type (Nat. Mach. Intell. 2021, DOI: 10.1038/s42256-020-00284-w). The model also tagged reactions with computer-readable codes that allow chemists to search for similar reactions.

Transformers are a type of machine-learning algorithm useful for interpreting sequences of information. They’re widely used in translation software and voice assistants like Amazon’s Alexa, and chemists have recently shown that they are useful in chemistry as well (see page 19). Philippe Schwaller of IBM Research–Zurich and the University of Bern and colleagues show that the transformer approach can classify reactions by type, identifying broad categories like carbon-carbon bond formation or deprotection as well as finer-scale groups like chloro or bromo Suzuki couplings. When categorizing reactions, the machine-learning model matched the classifications assigned by software that used human-coded rules 98% of the time. The researchers tested their model on data sets containing tens of thousands of reactions written in SMILES (simplified molecular-input line-entry system), a line notation that describes chemical structures with a string of characters. Schwaller says the group traced many of the classification disagreements to tautomeric differences in molecules or simple typos in the data.
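To make the idea of treating a reaction as a sequence concrete, here is a minimal Python sketch that splits a reaction written in SMILES into tokens, the kind of input a language-style model reads. The article does not describe how the group's model tokenizes its input, so the regular expression, the function name `tokenize_reaction_smiles`, and the example reaction are all illustrative assumptions.

```python
import re

# Assumed token pattern (not taken from the article): bracketed atoms,
# two-letter halogens, single-letter atoms, bonds, branches, ring-closure
# digits, and the ">" / ">>" separators used in reaction SMILES.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_reaction_smiles(reaction):
    """Split a reaction SMILES string into a list of tokens."""
    tokens = SMILES_TOKEN.findall(reaction)
    # Refuse to silently drop characters the pattern does not cover.
    if "".join(tokens) != reaction:
        raise ValueError(f"Could not tokenize: {reaction!r}")
    return tokens


# Hypothetical example: acetic acid + ethanol >> ethyl acetate + water
example = "CC(=O)O.OCC>>CC(=O)OCC.O"
tokens = tokenize_reaction_smiles(example)
print(" ".join(tokens))
```

Each token plays the role a word plays in a sentence, which is why a model built for language can be applied to reactions written this way.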

The group’s model generates unique codes to identify different reactions, and Schwaller says these codes are useful for more than just classifying reactions: the codes can identify very similar reactions via a database search. Schwaller says this function could help chemists screen a database for alternative reactions or help them optimize a reaction by pointing them to published procedures for similar transformations. And he says these models can be modified for other tasks, like predicting reaction yields.
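A search for similar reactions can be sketched as a nearest-neighbor lookup. The Python example below assumes each reaction code is a fixed-length list of numbers and ranks database entries by cosine similarity to a query. The article does not specify the code format or the search method, so the vectors, the reaction names, and the functions `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, database, k=2):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy three-number codes, invented for illustration only.
database = {
    "suzuki_bromo": [0.9, 0.1, 0.0],
    "suzuki_chloro": [0.8, 0.2, 0.1],
    "boc_deprotection": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

hits = most_similar(query, database, k=2)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

With these toy numbers, the two Suzuki entries rank above the deprotection, mirroring how a chemist might pull up published procedures for closely related transformations.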


