

Computational Chemistry

CAS opens data vault to MIT scientists

Agreement gives computational chemists data to train retrosynthesis AI

by Sam Lemonick
November 10, 2020


A photo of Connor Coley alongside the CAS logo.
Credit: Connor Coley/CAS
Connor Coley will have first-time access to CAS data to train machine learning algorithms for synthesis prediction.

Scientists will soon get their hands on a vault of chemical data that has for years been kept from independent machine-learning researchers. For now, database organization CAS is allowing one group based at the Massachusetts Institute of Technology to use its data to train machine-learning-based synthesis planners. Agreements between CAS and academic researchers have been rare, but a CAS official says the organization hopes this one will be the first of many.

CAS is a division of the American Chemical Society, which publishes C&EN. CAS maintains databases of chemical information, including molecular structures and properties, as well as reaction procedures and conditions. That’s exactly the kind of data computational chemists need in order to develop machine learning tools that can carry out retrosynthesis, the process of predicting the synthetic steps needed to make a target molecule like a drug. But CAS’s standard terms of use specifically forbid machine learning algorithm training.

Connor W. Coley, the MIT computational chemist who will lead the CAS collaboration, says the agreement will give his group access to a curated dataset of several million reactions. The team intends to publish its results and computer code, although the underlying training data will remain private. MIT is not paying for access to the data.

CAS’s Michael Dennis, who oversees this and other collaborations, says that, assuming the researchers develop useful algorithms, CAS hopes to incorporate them into its existing predictive retrosynthesis software, which is part of its SciFinder-n product.

Coley’s group has previously used the Reaxys database, open-source datasets drawn from patent filings, and other sources to train machine learning retrosynthesis algorithms. While Coley says he hasn’t yet looked closely at the CAS data, he thinks the set’s curation and level of detail may make it superior to other training data.

As an example, Coley says, imagine training an algorithm using a dataset of catalytic reactions where one-fifth of the entries don’t specify the catalyst used. Because of this sort of oversight, which is common, the model will learn the wrong thing. “The model will think there’s a way to run the reaction without a catalyst,” he says. If the CAS data are more consistently annotated, that could help avoid similar problems.

Coley also thinks the CAS data might allow his group to develop algorithms capable of making new kinds of predictions. One area where other datasets fall short is stereochemistry, he says. If the CAS data have better stereochemical information, Coley says, a machine learning model trained on them might do a better job learning “what governs stereochemistry or how to design stereoselective reactions.”

Researchers applauded CAS for beginning to give academic researchers more access to its data. “I think it would just be wonderful if they ‘opened up’ to the community,” computational chemist Anatole von Lilienfeld of the University of Basel and the University of Vienna says in an email.

Some researchers familiar with CAS’s data questioned whether the quality is actually substantially higher than that of other data sources. Others said the results of the collaboration will provide a way to assess the value of CAS’s data for machine learning researchers. Amol Thakkar, a PhD student at the University of Bern who has studied retrosynthesis algorithms, says in an email, “To the best of my knowledge, this is the first access to CAS synthetic data, so it will be interesting to see what comes out regarding data quality.”

Regardless, Thakkar says he isn’t sure whether this new source of data will substantially improve machine learning models. Thakkar’s group has compared training datasets for the types of predictive synthesis algorithms Coley works on and found that the algorithms performed well no matter what data they were trained on (Chem. Sci. 2020, DOI: 10.1039/C9SC04944D). But he notes that curated training sets like the one CAS is providing to MIT can make a difference in training algorithms for specific types of reactions, such as ring formations.


This story was updated on Nov. 16, 2020, to clarify that CAS’s collaboration with MIT does not preclude future collaborations with other researchers, and to update the link to CAS’s SciFinder-n product.

