Scientists will soon get their hands on a vault of chemical data that has for years been kept from independent machine-learning researchers. For now, database organization CAS is allowing one group based at the Massachusetts Institute of Technology to use the company’s data to train machine-learning-based synthesis planners. Agreements between CAS and academic researchers have been rare, but a CAS official says the organization hopes this one will be the first of many.
Connor W. Coley, the MIT computational chemist who will lead the CAS collaboration, says the agreement will give his group access to a curated dataset of several million reactions. The team intends to publish its results and computer code, although the underlying training data will remain private. MIT is not paying for access to the data.
CAS’s Michael Dennis, who oversees this and other collaborations, says that, assuming the researchers develop useful algorithms, CAS hopes to incorporate them into its existing predictive retrosynthesis software, which is part of its SciFinder-n product.
Coley’s group has previously used the Reaxys database, open-source datasets drawn from patent filings, and other sources to train machine learning retrosynthesis algorithms. While Coley says he hasn’t yet looked closely at the CAS data, he thinks the set’s curation and level of detail may make it superior to other training data.
As an example, Coley says, imagine training an algorithm using a dataset of catalytic reactions where one-fifth of the entries don’t specify the catalyst used. Because of this sort of oversight, which is common, the model will learn the wrong thing. “The model will think there’s a way to run the reaction without a catalyst,” he says. If the CAS data are more consistently annotated, that could help avoid similar problems.
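The kind of gap Coley describes can be measured before a model is trained. The following is a minimal Python sketch of such an audit, not a description of how Coley's group or CAS handles data. The record layout, the field names, the function `missing_catalyst_fraction`, and the five toy entries are all hypothetical and are not drawn from CAS, Reaxys, or any real dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReactionRecord:
    """Hypothetical record for one catalytic reaction in a training set."""
    reaction_id: str
    catalyst: Optional[str]  # None or "" means the entry omits the catalyst


def missing_catalyst_fraction(records: List[ReactionRecord]) -> float:
    """Return the share of records that do not name a catalyst."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if not record.catalyst)
    return missing / len(records)


# Toy data (assumed for illustration): five reactions known to be catalytic,
# one of which was recorded without its catalyst.
records = [
    ReactionRecord("rxn-1", "Pd(PPh3)4"),
    ReactionRecord("rxn-2", "Pd/C"),
    ReactionRecord("rxn-3", None),
    ReactionRecord("rxn-4", "CuI"),
    ReactionRecord("rxn-5", "Pd(OAc)2"),
]

fraction = missing_catalyst_fraction(records)
print(f"{fraction:.0%} of catalytic entries lack a catalyst annotation")
```

A model trained on these entries as they stand would see "rxn-3" as evidence that the reaction proceeds with no catalyst. Flagging such records first lets a researcher decide whether to drop them, repair them, or mark the catalyst as unknown rather than absent.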
Coley also thinks the CAS data might allow his group to develop algorithms capable of making new kinds of predictions. One area where other datasets fall short is stereochemistry, he says. If the CAS data have better stereochemical information, Coley says, a machine learning model trained on them might do a better job learning “what governs stereochemistry or how to design stereoselective reactions.”
Researchers applauded CAS for beginning to give academic researchers more access to its data. “I think it would just be wonderful if they ‘opened up’ to the community,” computational chemist Anatole von Lilienfeld of the University of Basel and the University of Vienna says in an email.
Some researchers familiar with CAS’s data questioned whether the quality is actually substantially higher than that of other data sources. Others said the results of the collaboration will provide a way to assess the value of CAS’s data for machine learning researchers. Amol Thakkar, a PhD student at the University of Bern who has studied retrosynthesis algorithms, says in an email, “To the best of my knowledge, this is the first access to CAS synthetic data, so it will be interesting to see what comes out regarding data quality.”
Regardless, Thakkar says he isn’t sure whether this new source of data will substantially improve machine learning models. Thakkar’s group has compared training datasets for the types of predictive synthesis algorithms Coley works on and found that the algorithms performed well no matter what data they were trained on (Chem. Sci. 2020, DOI: 10.1039/C9SC04944D). But he notes that curated training sets like the one CAS is providing to MIT can make a difference in training algorithms for specific types of reactions, such as ring formations.
This story was updated on Nov. 16, 2020, to clarify that CAS’s collaboration with MIT is not exclusive of future collaborations with other researchers, and to update the link to CAS’s SciFinder-n product.