

Computational Chemistry

A guide to navigating AI chemistry hype

If you plan to use machine learning for research, consider ChatGPT’s shortcomings and inquire about AI tools’ training data and benchmarking performance

by Andy Extance, special to C&EN
May 20, 2025

 

Credit: Madeline Monroe/C&EN/Shutterstock

The world is awash with news about artificial intelligence tools, including those intended to help chemists. But are the tools useful, a threat, or even worth the attention? It’s hard to know, especially when developers are making bold, exaggerated claims. Some of us have little background knowledge with which to critically assess this new technology. But even Rafael Gómez-Bombarelli, a researcher who applies machine learning to materials science at the Massachusetts Institute of Technology, says that journal articles reporting data generated by AI tools built on large language models (LLMs), such as OpenAI’s ChatGPT, “don’t feel like science.” Though Gómez-Bombarelli didn’t refer to it directly, one relatively well-known and controversial preprint suggests that general LLMs can solve many chemistry problems.

Yet these tools have a reproducibility problem, he warns: ask them to do the same task repeatedly, and they often return different responses each time.

Furthermore, many papers relating to the use of AI in science describe just small advances, says Koji Tsuda of the University of Tokyo. Tsuda, who applies AI to medicine, biology, and materials science, stresses that researchers are “incentivized to create new methods and papers.” “We should just use somebody else’s method when it’s not so different,” Tsuda says. “But because we always need to publish papers, it’s a little tough to stop.”

Yet some tools are clearly being used effectively. Giving an example, Tsuda highlights a preprint published by MIT researchers in 2024 that documents a case in which AI enhanced scientists’ productivity. At an unnamed company, 1,000 researchers who were randomly assigned to use an AI tool discovered 44% more new materials and filed 39% more patent applications between May 2022 and June 2024 than did the ones who stuck to their standard workflow. “The interesting thing is that the satisfaction of scientists reduced,” Tsuda says.

Notice

As this story was going to press, C&EN learned that because of concerns regarding the integrity of the study's research, MIT conducted a review and concluded "the paper should be withdrawn from public discourse."

By including advice from Tsuda, Gómez-Bombarelli, and other experts, this guide to AI in chemistry aims to help researchers steer through the complicated maze of tools available. It covers various types of AI, what to expect from them, and how to judge whether particular tools are worth investigating.

Look at how training data relate to your needs

Today’s AI most often consists of networks of interconnected artificial neurons, which encode information as numerical values determined by inputs from other such neurons. Such systems can learn from data they are trained with, a process referred to as machine learning, which has existed for decades. Modern tools are referred to as deep learning because they stack multiple “deep” layers of neurons, each passing numerical information along to the next layer. A well-established chemistry application for neural networks is planning synthetic routes in organic chemistry. AiZynthFinder, for example, uses such a network to guide searches for the most-promising routes.
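The layered structure described above can be sketched in a few lines of Python. This is a toy illustration of numerical values flowing through stacked layers, not any particular chemistry tool; the weights here are arbitrary placeholders for what training would learn.

```python
def relu(x):
    # A common activation: each neuron outputs a nonnegative numerical value
    return x if x > 0 else 0.0

def layer(inputs, weights):
    # Each neuron's value is a weighted sum of the previous layer's outputs
    return [relu(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def forward(inputs, network):
    # "Deep" learning: the values pass through multiple stacked layers in turn
    for weights in network:
        inputs = layer(inputs, weights)
    return inputs

# Two stacked layers turning a 2-number input into a single output
network = [[[1, 1], [0, 1]],   # first layer: 2 neurons
           [[1, -1]]]          # second layer: 1 neuron
output = forward([1.0, 2.0], network)
```

In a real model, training repeatedly adjusts the weight matrices so the final layer’s output matches known answers.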

The amount of data used for training an AI tool is a strong first indication of how well it might work. So says Bharath Ramsundar, founder and chief executive officer of Palo Alto, California–based Deep Forest Sciences. “A rule of thumb is if you have 1,000 or more data points, probably you can do something,” he says. “It’s logarithmic. 100 is a little tricky, 10,000 better, 100,000 even better.”

A second indication of how well an AI tool might work in a given situation is how similar a query is to the data the tool is trained on, Ramsundar adds. Looking for another structure with properties similar to those of other structures in the training data will be more successful than seeking wholly new properties. “Machine learning tends to do better the closer to its input that you stay,” Ramsundar says.

In chemistry, many AI models use supervised learning, Gómez-Bombarelli explains. This method involves information referred to as a label, such as the physical property of a chemical structure, that “supervises” the way a model learns to predict properties from structures. In this example, developers train the model with the structure and its associated property. The model can then predict properties of new structures based on patterns learned.
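As a minimal sketch of supervised learning, with made-up descriptor vectors standing in for chemical structures and boiling points (in °C) as the supervising label, a nearest-neighbor rule can stand in for a trained model:

```python
def predict(query, training_data):
    """Predict a property for a new structure from labeled examples.

    training_data: list of (descriptor_vector, property_label) pairs.
    A 1-nearest-neighbor rule stands in for a real trained model here.
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # The label of the most similar known structure "supervises" the prediction
    _, label = min(training_data, key=lambda pair: distance(pair[0], query))
    return label

# Hypothetical descriptors (molecular weight, dipole moment) with boiling points
examples = [((46.0, 1.7), 78.4), ((18.0, 1.9), 100.0), ((78.1, 0.0), 80.1)]
predict((44.0, 1.6), examples)  # returns the label of the closest example
```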

Supervised tasks such as property and structure prediction have generally been done with graph neural networks (GNNs), Gómez-Bombarelli says. The word graph refers to mathematical systems where edges connect nodes, analogous to how chemical bonds connect atoms in molecules. The mathematical representation may be somewhat different from the structures chemists are used to, however.
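A rough sketch of the graph idea, using ethanol’s heavy atoms as nodes and its bonds as edges. The single “message-passing” step below, in which each node averages features with its neighbors, is a simplified, unweighted version of the operation a GNN learns to perform:

```python
# Ethanol as a graph: nodes are heavy atoms, edges are bonds
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Arbitrary starting feature value per atom (a real GNN uses learned vectors)
features = {0: 1.0, 1: 2.0, 2: 3.0}

# Build each node's neighbor list from the bond (edge) list
neighbours = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbours[a].append(b)
    neighbours[b].append(a)

# One message-passing step: average a node's feature with its neighbors'
updated = {
    i: (features[i] + sum(features[j] for j in neighbours[i]))
       / (1 + len(neighbours[i]))
    for i in range(len(atoms))
}
```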

There have been few examples where AI use in chemistry has been transformational. AlphaFold is one of them.
Rafael Gómez-Bombarelli, assistant professor in materials processing and Jeffrey Cheah Career Development Professor, Department of Materials Science and Engineering, Massachusetts Institute of Technology

The Nobel Prize–winning AI protein-folding-prediction system, AlphaFold, creates graphs that represent pairings between every amino acid building block in a protein sequence. AlphaFold can use supervised learning based on a dataset of over 170,000 protein structures from the Protein Data Bank (PDB). There have been few examples where AI use in chemistry has been transformational. AlphaFold is one of them, Gómez-Bombarelli says.

GNNs can be successful in such applications when the datasets available for training have thousands of structures. Gómez-Bombarelli adds that an extra training step called fine-tuning can adapt such models to new applications when there are only hundreds or even tens of data points. Such fine-tuned models can outperform humans when predicting the scope of chemical reactions or selecting ligands for organometallic complexes.

Models that link structure to properties have been widely adopted, Gómez-Bombarelli says, especially by pharmaceutical companies, partly because software has long existed for that purpose. As such, they have been somewhat beneficial.

In the field of molecular simulation, machine learning potentials (MLPs), which have been in development since 2010, are “a huge success” in replacing computationally demanding density functional theory (DFT) calculations, Gómez-Bombarelli points out. Not all MLPs employ the kind of neural network–based machine learning that has become popular. The ones that are based on neural networks are trained through supervised learning on data calculated using DFT. With enough data, MLPs that perform similarly to DFT in simulations are “way faster,” Gómez-Bombarelli says.

He adds that researchers at Texas Tech University and the Lawrence Berkeley National Laboratory estimate that conventional DFT simulation packages such as Quantum ESPRESSO and VASP use about 20% of supercomputer time in the US, consuming a great deal of energy as they do so. In contrast to AI’s reputation for consuming large amounts of energy, MLPs are cutting chemistry’s electricity bills.

MLPs are powerful, but they have limitations. Vassiliki-Alexandra Glezakou, section head of chemical transformations at Oak Ridge National Laboratory (ORNL), uses MLPs to accelerate molecular dynamics simulations. She notes that MLPs trained on data for one chemical system “are not necessarily transferable” to other systems. “This is a considerable challenge for solving chemistry problems,” she says.

Consider ChatGPT carefully

ChatGPT revolutionized AI when it was launched in November 2022, yet the full version of its name—generative pretrained transformer—gets little attention. “Generative” means that it creates new information that is like its training data, without explicitly knowing the rules that humans used to create that data. “Transformer” refers to the neural network architecture that allows the model to supervise its own learning from unlabeled text, without needing labels. Famously, OpenAI trains ChatGPT using the wealth of information available on the internet, consuming tremendous amounts of energy in the process.

Generative chemical models include IBM’s RXN for Chemistry, which can plan synthetic routes in organic chemistry and is also based on transformer technology. IBM now calls the technology MoLFormer-XL. This is an LLM that exclusively uses simplified molecular-input line-entry system (SMILES) representations, which translate a chemical structure into a string of symbols. IBM first trained the model primarily on structures extracted from written language in patents and has since refined it using other databases.
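For illustration, here are the SMILES strings for two familiar molecules, along with a toy function that scans a SMILES string for element symbols. The function is a deliberately crude sketch for counting heavy atoms; real work would use a cheminformatics toolkit such as RDKit rather than a regular expression.

```python
import re

# SMILES encodes molecular connectivity as text
smiles_examples = {"ethanol": "CCO", "benzene": "c1ccccc1"}

def heavy_atom_count(smiles):
    """Rough heavy-atom count from a SMILES string (toy parser).

    Counts common one- and two-letter element tokens; ignores hydrogens,
    ring-closure digits, and bond symbols. Not a substitute for RDKit.
    """
    return len(re.findall(r"Cl|Br|[cnosp]|[CNOSPFI]", smiles))

heavy_atom_count(smiles_examples["ethanol"])  # 3 heavy atoms: C, C, O
```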

Other generative chemical models, such as ChemLLM and ChemCrow, are LLMs trained or fine-tuned for chemistry tasks using natural language data, Gómez-Bombarelli says. Such LLMs learn by autocompletion, which enables them to develop an intrinsic understanding of chemical structures by learning to predict missing molecular fragments. “You show them 90% and allow them to write in the missing part, over and over,” Gómez-Bombarelli says. GNNs are better options with large, labeled datasets, whereas unsupervised learning can shine when data is scarce, he adds.
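The “show them 90% and let them write in the missing part” training loop can be sketched as follows. The `mask_fragment` helper and the `[MASK]` token are hypothetical simplifications of how such models are actually trained:

```python
import random

def mask_fragment(smiles, fraction=0.2, seed=0):
    """Hide a contiguous chunk of a SMILES string.

    A generative model would see `visible` and be trained to reproduce
    `target`; repeating this over millions of structures is the
    autocompletion objective described above (a simplified sketch).
    """
    rng = random.Random(seed)
    n = max(1, int(len(smiles) * fraction))          # size of hidden fragment
    start = rng.randrange(len(smiles) - n + 1)       # where to hide it
    visible = smiles[:start] + "[MASK]" + smiles[start + n:]
    target = smiles[start:start + n]
    return visible, target

# Aspirin's SMILES, with one fragment hidden for the model to fill in
visible, target = mask_fragment("CC(=O)Oc1ccccc1C(=O)O")
```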

Tsuda notes that ChatGPT and other conventional LLMs try to represent chemical structures as language. “Usually it sounds unreasonable, but now, because of LLMs, this really makes sense,” he says. “It’s a new type of AI chemistry, and that’s interesting.” Yet he’s concerned about trying to describe physical characteristics in words. “We don’t know what the representation is,” Tsuda says.

Tsuda argues that generative models are perhaps the most useful AI tools in chemistry. “AI is not so interesting if you already know what you want to investigate,” he says. But “in cases where you are searching for new topics, you can use AI to explore.”

General-purpose LLMs such as GPT also can answer chemistry questions that look like natural language, but they may be less successful than specialized models when it comes to structures and equations. Jonathan Allen, a senior informatics researcher at the Lawrence Livermore National Laboratory (LLNL), suggests that such tools are “glorified Google searches” with “more-efficient summarization.”

How can we determine how much to trust GPT and its relatives? One option is to turn to a validation technique available to many AI models: benchmarks.

Compare how well models did on their exams

Wei Wang, a computer scientist at the University of California, Los Angeles, and her team collated university-level questions to test LLMs in math, physics, chemistry, and computer science. Called SciBench, the benchmarking software showed that the GPT-3.5 and GPT-4 models that power ChatGPT incorrectly answered many questions taken from university-level exams and textbooks. In the best case, GPT-4 answered around one-third of textbook questions correctly and scored 80% on an exam.

When new models are developed to do existing tasks, they should be compared with a baseline or reference, Gómez-Bombarelli says. Such benchmarking tools have already been produced for many tasks: for example, Tox21 for comparing toxicity predictions and MatBench for predicting various properties of solid materials. Gómez-Bombarelli notes that machine learning studies typically use such benchmarking tools to create tables comparing performance between new models and established ones. Real-world impact, however, requires more than just benchmarking. If a model claims to improve molecule discovery, it must be tested experimentally.
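At its simplest, benchmarking reduces to scoring each model on the same held-out questions and tabulating the results. The model names and answers below are invented for illustration:

```python
def accuracy(predictions, answers):
    """Fraction of benchmark questions a model answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical benchmark: a new model versus an established baseline,
# scored against the same answer key
answers = ["A", "C", "B", "D", "A"]
scores = {
    "new_model": accuracy(["A", "C", "B", "A", "A"], answers),
    "baseline": accuracy(["A", "B", "B", "A", "A"], answers),
}
```

Tables like the ones Gómez-Bombarelli describes are, in essence, many such scores computed across shared datasets.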

Benchmarking efforts have existed since before ChatGPT. One approach originates from the Accelerating Therapeutics for Opportunities in Medicine (ATOM) consortium, a public-private partnership created in 2016 by GSK, the US Department of Energy, and the US National Cancer Institute. LLNL’s Allen helped establish the ATOM Modeling PipeLine, or AMPL, in 2020. ATOM evaluated various deep learning models that predict properties related to the safety of drugs and their transport and breakdown in people’s bodies. ATOM is no longer functioning, but former consortium members are continuing to support AMPL.

AMPL lets researchers test datasets on various models, allowing scientists to choose the ones that work best for them. “There wasn’t necessarily a single best model,” Allen says. Graph convolutional neural networks, which can filter and simplify inputs using mathematical processes called convolutions, are powerful but require lots of training data. When sufficient data are unavailable, an alternative way to improve performance is to ensure that models include many types of data about substances. Models are most effective when they’re predicting familiar structures, and they have difficulty expanding into unknown spaces.

A healthy dose of skepticism has to be front and center.
Jonathan Allen, senior informatics researcher, Lawrence Livermore National Laboratory

“Property prediction of molecules is challenging,” Allen says. “The space of molecules is very large, and the amount of experimental data that’s been collected relative to the things that can be made is very small.” Generative models can propose seemingly plausible molecules, but many of them are not actually synthesizable, he adds. “A healthy dose of skepticism has to be front and center” in assessing their value, Allen says.

ORNL researchers released a workflow and testing platform in 2021 called MatDeepLearn. It was similar in concept to AMPL, but it evaluated top-performing GNNs on several representative datasets in computational materials chemistry rather than in drug discovery. Based on her experience, Glezakou stresses that only a small amount of chemical data is available for training, which limits AI’s potential in the subject.

Like Tsuda, Glezakou believes that few AI tools are used in earnest. “Although it is useful to have these benchmarks, the models are still not used to solve any real chemistry problems in practice,” she says.

She notes that the field of AI in chemistry is dominated by two groups of scientists. One predicts material properties without providing deeper information on how those materials could be produced or how they might react. The other applies and tests new AI models. “I think what chemistry needs is beyond benchmarks—a workflow and thought process that can actually solve real chemistry problems,” Glezakou says. Yet she still encourages researchers to use AI, especially for difficult challenges. “If a problem is critical, then it is worth the chemist’s time,” she says.

If a chemist needs an AI tool, they might need help in choosing it—and members of the two groups Glezakou mentioned might be some of the best people to ask.

Call on your community

To give AI tools the best chance of being genuinely valuable for scientific progress, they should accord with the FAIR data standard—being findable, accessible, interoperable, and reusable—says Gómez-Bombarelli. Making code and data available on platforms like GitHub supports credibility. GitHub offers star ratings that can serve as a measure of trust, going some way toward open peer review. Allen highlights as useful the compilation of resources for machine learning for drug discovery that is available on GitHub and was put together by Pat Walters, chief data officer at Relay Therapeutics in Cambridge, Massachusetts.

A desire to share knowledge drove Ramsundar to set up DeepChem, an open-source platform for machine learning in chemistry. While working on his PhD, he interned at Google Research, gaining access to deep learning tools for chemical datasets, which he wasn’t able to share when he left the post. Ramsundar remembers thinking, “I just lost access to the coolest result I had in my PhD so far because I was no longer a Google employee.” DeepChem was initially built from existing open-source tools developed by others. It has grown through the contributions of various researchers, including Ramsundar’s Stanford University colleagues and specialists in drug discovery.

Today, DeepChem enables scientists to write, benchmark, and share their own machine learning tools as part of an active community on the online social platform Discord. Deep Forest Sciences builds on the foundation of DeepChem and integrates other open-source AI chemistry tools. These include the company’s Prithvi platform, which Ramsundar calls “a type of AI scientist” that orchestrates various tools to scan databases and answer questions in chemistry, biology, materials science, and device physics.


While some “working chemists” use DeepChem effectively, Ramsundar says, it’s more likely to be used by computational chemists. LLMs and synthesis-planning tools have relatively simple interfaces, but learning to use DeepChem well takes hard work, like learning to use a new laboratory instrument. He says the same is true for most AI tools.

“I hope that in the not-too-distant future, there will be much easier on-ramps to using these methods,” Ramsundar says. For now, open-source tools are mainly the domain of expert power users. “You have to be willing to ask questions and ideally work with someone who already has some expertise.”

Ramsundar doesn’t feel that AI has yet been transformational in chemistry broadly. But his outlook is more positive than Tsuda’s or Glezakou’s on how successful AI has been so far. “AI is a broadly useful expert tool that has contributed in many small ways to many, many discoveries,” he says. “I’ve seen quite useful things, but nothing like that AlphaFold level of this is a Nobel Prize–worthy advance. That could change.”
