CAS opens data vault to MIT scientists

Credit: Connor Coley/CAS

Connor Coley will have first-time access to CAS data to train machine learning algorithms for synthesis prediction.

Scientists will soon get their hands on a vault of chemical data that has for years been kept from independent machine-learning researchers. For now, database organization CAS is allowing one group based at the Massachusetts Institute of Technology to use the company’s data to train machine-learning-based synthesis planners. Agreements between CAS and academic researchers have been rare, but a CAS official says the organization hopes this one will be the first of many.

CAS is a division of the American Chemical Society, which publishes C&EN. CAS maintains databases of chemical information, including molecular structures and properties, as well as reaction procedures and conditions. That’s exactly the kind of data computational chemists need in order to develop machine learning tools that can carry out retrosynthesis, the process of predicting the synthetic steps needed to make a target molecule like a drug. But CAS’s standard terms of use specifically forbid machine learning algorithm training.

Connor W. Coley, the MIT computational chemist who will lead the CAS collaboration, says the agreement will give his group access to a curated dataset of several million reactions. The team intends to publish its results and computer code, although the underlying training data will remain private. MIT is not paying for access to the data.

CAS’s Michael Dennis, who oversees this and other collaborations, says that—assuming the researchers develop useful algorithms—CAS hopes to incorporate them into its existing predictive retrosynthesis software, which is part of its SciFinderⁿ product.

Coley’s group has previously used the Reaxys database, open-source datasets drawn from patent filings, and other sources to train machine learning retrosynthesis algorithms. While Coley says he hasn’t yet looked closely at the CAS data, he thinks the set’s curation and level of detail may make it superior to other training data.

As an example, Coley says, imagine training an algorithm using a dataset of catalytic reactions where one-fifth of the entries don’t specify the catalyst used. Because of this sort of oversight—which is common—the model will learn the wrong thing. “The model will think there’s a way to run the reaction without a catalyst,” he says. If CAS data is more consistently annotated, that could help avoid similar problems.

Coley also thinks the CAS data might allow his group to develop algorithms capable of making new kinds of predictions. One area where other datasets fall short is stereochemistry, he says. If the CAS data have better stereochemical information, Coley says, a machine learning model trained on them might do a better job learning “what governs stereochemistry or how to design stereoselective reactions.”

Researchers applauded CAS for beginning to give academic researchers more access to its data. “I think it would just be wonderful if they ‘opened up’ to the community,” computational chemist Anatole von Lilienfeld of the University of Basel and the University of Vienna says in an email.

Some researchers familiar with CAS’s data questioned whether the quality is actually substantially higher than that of other data sources. Others said the results of the collaboration will provide a way to assess the value of CAS’s data for machine learning researchers. Amol Thakkar, a PhD student at the University of Bern who has studied retrosynthesis algorithms, says in an email, “To the best of my knowledge, this is the first access to CAS synthetic data, so it will be interesting to see what comes out regarding data quality.”

Regardless, Thakkar says he isn’t sure whether this new source of data will substantially improve machine learning models. Thakkar’s group has compared training datasets for the types of predictive synthesis algorithms Coley works on and found they performed well no matter what data they were trained on (Chem. Sci. 2020, DOI: 10.1039/C9SC04944D). But he notes that curated training sets like the one CAS is providing to MIT can make a difference in training algorithms for specific types of reactions, for instance ring formations.

Correction

This story was updated on Nov. 16, 2020, to clarify that CAS's collaboration with MIT is not exclusive of future collaborations with other researchers, and to update the link to CAS's SciFinderⁿ product.

Chemical & Engineering News

ISSN 0009-2347


Computational Chemistry

Agreement gives computational chemists data to train retrosynthesis AI

by Sam Lemonick

November 10, 2020
