A group of researchers are launching an open-source database of chemical synthesis procedures that they think will benefit artificial intelligence algorithms for reaction prediction, synthesis planning, and other tasks (J. Am. Chem. Soc. 2021, DOI: 10.1021/jacs.1c09820).
Machine-learning algorithms typically become more accurate when they are trained on large amounts of data. Machine-learning tools that can identify faces and translate text have rapidly become ubiquitous and are accurate because their developers have been able to take advantage of open databases of text and images, says Steven M. Kearnes of Relay Therapeutics, a member of the governing committee of the new Open Reaction Database (ORD). Chemists and computer scientists have already demonstrated machine-learning tools that can successfully plot synthetic routes, predict reaction products, and estimate the properties of new compounds. But Kearnes and his colleagues think access to more data is needed to accelerate progress in this area of research. The ORD is designed to solve this problem.
Computational chemists have long complained about the lack of large amounts of good-quality, algorithm-friendly data. Only one large data set of reactions, drawn from the US Patent Office (USPTO) database, is openly accessible; users have to pay to access other large data sets, like the Reaxys and SciFinder databases. (SciFinder is produced by CAS, which is a division of the American Chemical Society, C&EN’s publisher.) Journal articles contain a wealth of potentially useful information about the details of reaction methods and outcomes, but these data are often not in formats that computers can readily read. Published procedures may also lack details that would seem obvious to a chemist but not to an algorithm, and scientists rarely report failed reactions, even though such results can help machine-learning models understand what doesn’t work. Connor Coley of the Massachusetts Institute of Technology, another researcher behind the ORD, says chemists today are limited to just a handful of databases to train algorithms.
The ORD standardizes both the kind of information scientists should share about a reaction, including reagents, temperature and pressure, and types of glassware used, and the format of those data. Researchers can enter data manually or write computer programs to automatically transmit the information from high-throughput experimentation to the ORD. Kearnes and Coley say the latter would help the database capture information about failed reactions. Users can search the database by molecule, substructure, and reaction. The database currently contains nearly 2 million reactions compiled from existing sources like the USPTO database. At press time, about 15,000 reactions in the ORD were new submissions.
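To make the idea of a standardized, machine-readable reaction record concrete, here is a minimal sketch in Python. It is not the ORD's actual schema or software: the class `ReactionRecord`, its field names, the helper functions, and the two example entries are all hypothetical placeholders chosen for illustration. The sketch shows how a record might carry reagents, conditions, and glassware, how a failed reaction could be logged alongside a successful one, and how a simple search by molecule might work.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ReactionRecord:
    """Hypothetical record layout; field names are assumptions, not the ORD schema."""
    reactants: List[str]                    # molecules written as SMILES strings
    products: List[str]
    temperature_celsius: float
    pressure_bar: Optional[float] = None
    vessel: Optional[str] = None            # type of glassware used
    yield_percent: Optional[float] = None   # 0.0 marks a failed reaction


def to_json(record: ReactionRecord) -> str:
    """Serialize one record to a machine-readable JSON string."""
    return json.dumps(asdict(record), sort_keys=True)


def find_by_molecule(records: List[ReactionRecord], smiles: str) -> List[ReactionRecord]:
    """Return records that list the molecule as a reactant or product.

    This is an exact string match on SMILES only; a real substructure
    search would need a cheminformatics toolkit.
    """
    return [r for r in records if smiles in r.reactants or smiles in r.products]


# Invented placeholder entries, not data from the ORD.
records = [
    ReactionRecord(
        reactants=["CC(=O)O", "CCO"],
        products=["CC(=O)OCC"],
        temperature_celsius=80.0,
        vessel="round-bottom flask",
        yield_percent=65.0,
    ),
    ReactionRecord(
        reactants=["Brc1ccccc1", "OB(O)c1ccccc1"],
        products=["c1ccc(-c2ccccc2)cc1"],
        temperature_celsius=100.0,
        yield_percent=0.0,  # a failed run is still recorded
    ),
]

if __name__ == "__main__":
    for hit in find_by_molecule(records, "CCO"):
        print(to_json(hit))
```

Because every record has the same fields, a program attached to a high-throughput experiment could emit entries like these automatically, failed runs included, rather than relying on a chemist to write them up.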
The database fills a real need, says Philippe Schwaller of IBM and the Swiss Federal Institute of Technology Lausanne, who has built machine-learning tools for chemistry. But, he says in an email, “the usefulness of the Open Reaction Database will largely depend on its adoption by the community.”
Spencer D. Dreher, another member of the ORD advisory board and a chemist at Merck & Co., agrees. “If we want to build useful models for all the different types of difficult chemistry reactions, we need participation from a large group of chemists in industry and academia to submit data to ORD.” He acknowledges that industry researchers may not always be able to share data, but he says his group has already contributed and will continue to do so. The ORD advisory board also includes researchers from Pfizer and Google.
If the ORD can amass a large amount of useful, machine-readable data, Kearnes thinks computational chemists will not be the only ones using it. A good database and hard problems like predicting chemistry could also attract computer scientists. “The machine learning community responds well when you take the time to bring them data,” he says.