Getting the most out of chemistry data with machine learning
Although chemists are excited by the potential of so-called deep-learning computational tools to make a splash in drug discovery, publishers and others are still looking to squeeze findings out of earlier, less sophisticated versions of these tools. With machine-learning techniques that “teach” themselves with large data sets, they hope to get more out of scientific information, whether in the lab or the classroom.
“It’s about how you make discoveries consumable,” says Conal Thompson, chief technology officer for CAS, a division of the American Chemical Society that’s looking into how to get more out of its chemistry databases. “What’s going to become more valuable is insight from your data or content, rather than just the content itself.” ACS publishes C&EN.
One emerging area of chemistry that is capitalizing on machine learning is computer-aided synthesis design: feeding software a target molecule and getting back possible routes chemists might use to make it.
“Eight years ago, there was a lot of skepticism and resistance of chemists to the whole notion” of artificial intelligence being used to solve chemical problems, says Orr Ravitz, product manager for Wiley’s ChemPlanner, one platform offering help with synthesis design. “I think a lot has changed since then, and I think that’s related to us using so many computational tools in our daily life. People are starting to expect it.” Also, the falling cost of computing power has made chemistry applications faster and less expensive.
But machine learning does not help in every situation. For example, the basic reaction rules underlying ChemPlanner and a similar program developed by a start-up company, Chematica, do not come from computers automatically extracting information from journals or patents. Instead, humans are extensively involved in identifying key reactions and writing the reaction rules on which the programs run. This is in part because artificial intelligence programs learn best when they train on hundreds of examples.
If chemists just want to use very common, well-established reactions, then machine learning likely could extract them from the literature, says Bartosz A. Grzybowski, developer of Chematica and a chemistry professor at Ulsan National Institute of Science & Technology and at the Polish Academy of Sciences. “But for complex synthetic planning, very rare reactions can be very important. A reaction that might appear in the literature only three times may be key to making a natural product,” Grzybowski adds.
Consequently, expert chemists write the rules that allow the software to identify the core of a reaction—the bonds that change during the reaction and their associated atoms. Chemists also write the rules that dictate when and how to incorporate other components of reagent structures that may influence reactivity, such as aromaticity or electron-donating or -withdrawing groups.
The software then uses those rules to identify possible reactions based on whether the chemical structures of those “extended cores” share similar properties. Machine learning comes in for navigating the options among a huge network of synthetic possibilities.
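To make the idea of rule-driven route finding concrete, here is a minimal sketch in Python. It is not how ChemPlanner or Chematica work internally: the molecule labels, the RULES table, and the AVAILABLE set are hypothetical placeholders, and the search is a plain exhaustive expansion rather than the machine-learning-guided navigation described above. It only shows how hand-written rules can be applied recursively to break a target down to available starting materials.

```python
# Hypothetical hand-written rules: product -> alternative sets of precursors.
# Real systems encode reaction cores and structural context, not plain labels.
RULES = {
    "target": [("intermediate_1", "reagent_a"), ("intermediate_2",)],
    "intermediate_1": [("starting_material_x", "starting_material_y")],
    "intermediate_2": [("unavailable_compound",)],
}

# Hypothetical set of compounds assumed to be purchasable.
AVAILABLE = {"reagent_a", "starting_material_x", "starting_material_y"}


def find_routes(molecule, depth=5):
    """Return all routes to `molecule` as lists of (product, precursors) steps."""
    if molecule in AVAILABLE:
        return [[]]  # nothing to make: a single empty route
    if depth == 0:
        return []  # give up on overly long routes
    routes = []
    for precursors in RULES.get(molecule, []):
        # Start with the step that makes `molecule`, then append the
        # sub-routes needed to make each of its precursors.
        partial = [[(molecule, precursors)]]
        for precursor in precursors:
            sub_routes = find_routes(precursor, depth - 1)
            partial = [done + sub for done in partial for sub in sub_routes]
        routes.extend(partial)
    return routes


if __name__ == "__main__":
    for route in find_routes("target"):
        for product, precursors in route:
            print(product, "<=", " + ".join(precursors))
```

In this toy network, the branch through intermediate_2 is discarded because no rule leads from it to an available compound. With thousands of rules the number of branches grows very quickly, which is where a learned model for choosing which branches to explore would come in.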
Machine-learning algorithms also play a role in scoring synthetic pathways to prioritize the order in which they’re shown to the user. Scoring is not a one-size-fits-all process, the software developers have found. Chemists differ in how they prioritize factors such as cost, yield, number of steps, or use of protecting groups. “What we hope to do with machine learning in the future is to basically learn from the user’s interaction with the system and try to tailor prioritization to their taste, similar to what Netflix does based on your viewing history,” Ravitz says.
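One simple way to picture user-specific scoring is a weighted sum over the factors mentioned above. The Python sketch below is an illustration only, not the scoring used by any of the products named here: the Route fields, the example routes, and the two weight profiles are all hypothetical, and the weights are set by hand rather than learned from a user's behavior.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    cost: float            # hypothetical reagent cost, arbitrary units
    overall_yield: float   # fraction between 0 and 1
    steps: int             # number of synthetic steps
    protecting_groups: int # number of protecting-group operations


def score(route, weights):
    """Weighted sum: yield is rewarded, the other factors are penalized."""
    return (
        weights["yield"] * route.overall_yield
        - weights["cost"] * route.cost
        - weights["steps"] * route.steps
        - weights["protecting_groups"] * route.protecting_groups
    )


def rank(routes, weights):
    """Return routes ordered from best to worst for the given weights."""
    return sorted(routes, key=lambda r: score(r, weights), reverse=True)


# Two made-up routes to the same target.
routes = [
    Route("A", cost=10.0, overall_yield=0.60, steps=8, protecting_groups=2),
    Route("B", cost=40.0, overall_yield=0.35, steps=4, protecting_groups=0),
]

# Two made-up user profiles with different priorities.
cost_sensitive = {"yield": 10.0, "cost": 1.0, "steps": 0.1, "protecting_groups": 0.1}
step_sensitive = {"yield": 10.0, "cost": 0.01, "steps": 2.0, "protecting_groups": 2.0}

if __name__ == "__main__":
    print([r.name for r in rank(routes, cost_sensitive)])
    print([r.name for r in rank(routes, step_sensitive)])
```

The same two routes come out in opposite orders for the two profiles: the cost-sensitive chemist sees the cheaper, longer route A first, while the chemist who dislikes long sequences and protecting groups sees route B first. Tailoring results to a user's taste, in this picture, would amount to estimating such weights from which routes the user actually selects.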
Researchers are also actively applying machine learning to materials science. Northwestern University professor Chris Wolverton and colleagues recently published a general framework for using machine-learning approaches to predict properties of inorganic materials (npj Comput. Mater. 2016, DOI: 10.1038/npjcompumats.2016.28). Separately, a team led by Sorelle A. Friedler, Joshua Schrier, and Alexander J. Norquist of Haverford College used machine-learning models to predict conditions for successful crystallization of inorganic-organic hybrid materials (Nature 2016, DOI: 10.1038/nature17439).
Notably, the Haverford group says in its paper that the researchers used information on “dark” reactions—failed or unsuccessful syntheses—collected from their archived laboratory notebooks to help train their machine-learning model, and they have a “Dark Reactions Project” website set up to gather similar information at darkreactions.haverford.edu. Such “dark” data will become increasingly important as people look to develop machine-learning applications, experts say.
“Nobody likes publishing negative results, but a machine and its intelligence would be much more informed by having positives and negatives,” CAS’s Thompson says. “It’s not a mistake anymore, it’s valuable information.” However, making that valuable information accessible is an unsolved problem in a scientific culture that prizes positive findings and largely ignores so-called negative results in its publications.
Other areas that may benefit from machine learning include scientific education, where algorithms can potentially improve student learning outcomes. “One of our divisions creates educational materials for nurses, but many of our students get frustrated with the challenging material, drop out of the course, and never take their certification exam,” Dan Olley, chief technology officer at Elsevier, told CIO magazine last year. “We are using algorithms that learn how students actually use the course material,” he continued. “This way, we can create adaptability and personalization within the course to engage the students and drive better pass rates.”
Where machine learning will take scientific learning and research in the future remains to be seen. But just as computing technology has changed daily life to incorporate activities previously only seen on “Star Trek”—“Alexa, lower the temperature to 68 degrees”—it has the potential to allow the scientific enterprise to do things researchers previously only dreamed about.
CORRECTION: This story was updated on Jan. 23, 2017, to correct the credit on the reaction scheme. It should be credited to ChemPlanner, not Chematica.