Imagine you’re a materials scientist and your job is to discover a new material, a combination of atoms no one has ever made. Maybe you’re looking for a metal-organic framework (MOF). MOFs have a lot of potential applications: carbon capture, drug delivery, and hydrogen fuel storage, just to name a few.
So how do you find a new MOF? You could try machine learning. People are saying good things about machine learning, especially graph neural networks (GNNs), which represent molecules and materials as graphs, with atoms as nodes and bonds as edges, and learn to make predictions from those connections.
And lucky for you, there are already a handful of GNNs designed to predict new materials. Better yet, you can download them right now, for free, from a site like GitHub.
But which GNN should you use? How should you teach it to make accurate predictions? What are the optimal settings? Choose wisely; you’re about to invest time and money and maybe hire a graduate student to go MOF hunting. Which GNN will find the best new MOF in the shortest time?
Those are not easy questions to answer, regardless of whether you’re new to machine learning for materials discovery or an old hand. Some scientists are trying to make those decisions easier by developing methods for comparing the performance of machine-learning algorithms. These researchers say that adopting these benchmarking methods could help speed the discovery of new materials. It could also help developers of machine-learning models improve their algorithms and approaches.
The idea of benchmarking isn’t new. In simple terms, it means comparing the performance of one process against a baseline to quantify how much that process helps you. Chemists have benchmarked computational approaches before—for example, comparing how well approaches to density functional theory (DFT) predict experimentally derived chemical properties. Through that benchmarking, chemists now know when they can trust DFT to make accurate predictions.
Benchmarking hasn’t become widespread in the world of machine learning for materials discovery. Bobby G. Sumpter of Oak Ridge National Laboratory (ORNL) has been experimenting with machine learning for decades. He says there are many machine-learning methods available, many of which are open source, and more are appearing all the time. “People get sort of overwhelmed by what to choose,” Sumpter says.
Sumpter, his ORNL colleague Victor Fung, and others developed a tool called MatDeepLearn for benchmarking GNNs in materials discovery. Fung says a few years ago he thought machine learning was probably overhyped, but advances in GNNs since then have changed his mind. He says recent papers show that these models are capable of chemical accuracy, meaning their predictions match properties measured experimentally. Still, like Sumpter, he says choosing which one to use can “be a roll of the dice.”
In MatDeepLearn, the group programmed a framework with most of the steps of a machine-learning discovery process and then swapped in each model’s convolutional operator, the central component these algorithms use to process data and make predictions. You can think of this benchmarking process as testing and comparing car engines: the researchers built test beds with the same car body, wheels, tires, and driver inputs and then swapped in different engines to measure how fast each one goes in a race.
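The swap-in idea can be sketched in a few lines of Python. This is not MatDeepLearn’s actual API; the two toy regressors below stand in for the GNN convolutional operators, and everything else in the pipeline, the data split and the error metric, stays fixed for every contender:

```python
import random

# Toy stand-ins for the component being swapped. In MatDeepLearn the swapped
# part is each GNN's convolutional operator; these simple regressors are
# hypothetical, for illustration only.
def mean_model(train_x, train_y):
    mean = sum(train_y) / len(train_y)
    return lambda x: mean  # predicts the training mean for any input

def nearest_neighbor_model(train_x, train_y):
    def predict(x):
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[j]  # predicts the label of the closest training point
    return predict

def benchmark(models, xs, ys, split=0.8):
    """Fixed pipeline: identical data split and metric for every model."""
    n = int(len(xs) * split)
    train_x, test_x = xs[:n], xs[n:]
    train_y, test_y = ys[:n], ys[n:]
    scores = {}
    for name, fit in models.items():
        predict = fit(train_x, train_y)  # swap in the "engine"
        scores[name] = sum(abs(predict(x) - y)
                           for x, y in zip(test_x, test_y)) / len(test_x)
    return scores  # mean absolute error per model, lower is better

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]  # noisy linear toy data
scores = benchmark({"mean": mean_model, "1-NN": nearest_neighbor_model}, xs, ys)
```

Because every model sees the same data and is scored the same way, differences in the final numbers can be attributed to the swapped component alone, which is the point of the engine-on-a-test-bed design.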
In a recent preprint, which has not been peer-reviewed, the team tested five GNNs in the framework to see how well the algorithms predicted properties of different classes of materials (ChemRxiv 2021, DOI: 10.26434/chemrxiv.13615421.v2). The top four GNN models all performed about equally. Fung says these results suggest that for scientists simply looking for a model that performs well at the tasks tested, it might not make much difference which model they choose.
But he says for scientists developing new GNNs and machine-learning methods, the results raise some questions. The researchers found that MEGNet, a GNN published in 2019 (Chem. Mater. 2019, DOI: 10.1021/acs.chemmater.9b01294), performed about as well as SchNet, released in 2017 in a preprint (arXiv 2017, arXiv: 1706.08566). If 2 years hasn’t led to an increase in algorithm performance, “Are we making progress?” Fung asks. He says their study points to another way that benchmarks can be useful. They help developers of models identify what they’re doing right or wrong as they try to improve their methods.
Alex Dunn of the University of California, Berkeley, and Lawrence Berkeley National Laboratory says aiding developers was the motivation for a benchmarking method called Matbench that he, Anubhav Jain of Berkeley Lab, and colleagues developed (npj Comput. Mater. 2020, DOI: 10.1038/s41524-020-00406-3). Without a way to compare machine-learning models fairly, Dunn says, “it can be hard for someone who’s interested in advancing the field to know what avenue to go down.” Matbench tests algorithms on 13 machine-learning tasks, such as predicting a material’s bandgap. And the scientists created a reference algorithm for users to benchmark their algorithm against.
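The practice Matbench encourages, standard named tasks plus a published reference score to beat, can be mimicked in miniature. Everything here is invented for illustration: the task names, the toy data, and the reference numbers are not Matbench’s 13 real tasks or scores:

```python
# A miniature of the Matbench idea: a fixed suite of named tasks, each with
# its own train/test data, plus reference scores to compare against.
# All names, data, and numbers below are hypothetical.

def mae(predict, test):
    """Mean absolute error of a predictor over (x, y) test pairs."""
    return sum(abs(predict(x) - y) for x, y in test) / len(test)

# Toy "tasks": (train pairs, test pairs). Real Matbench tasks include
# predicting properties such as a material's bandgap.
tasks = {
    "task_a": ([(1, 2), (2, 4), (3, 6)], [(4, 8), (5, 10)]),
    "task_b": ([(1, 2), (2, 3), (3, 4)], [(4, 5), (5, 6)]),
}
reference_scores = {"task_a": 0.5, "task_b": 0.5}  # invented baselines

def mean_predictor(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m  # always predicts the training-set mean

def report(model):
    """Score a model on every task and flag whether it beats the reference."""
    out = {}
    for name, (train, test) in tasks.items():
        score = mae(model(train), test)
        out[name] = (score, score < reference_scores[name])
    return out

results = report(mean_predictor)
```

Publishing results in this shared format is what lets a new algorithm be compared directly against earlier ones, task by task, as Meredig describes below.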
“Now if you have a new algorithm or method, you can directly compare your results to theirs,” says Bryce Meredig, chief science officer at Citrine Informatics, which develops machine-learning methods for materials science. Meredig says the materials science community has realized in the past few years that it lacks a set of universally acknowledged benchmarks for gauging model performance. Dunn wants it to be common practice for developers to use standard tests and data sets to benchmark algorithms and to publish those results.
Like MatDeepLearn, Matbench produced some counterintuitive results. While the GNNs that Matbench tested outperformed the reference algorithm when trained on data sets with more than 10,000 entries, the simpler reference algorithm did better than the GNNs on most predictions when fewer data were available. The researchers say these results suggest that scientists can predict some properties accurately without computationally expensive algorithms like GNNs.
Another recent attempt at benchmarking machine-learning methods set a more holistic goal. Santosh K. Suram of Toyota Research Institute, John M. Gregoire of the California Institute of Technology, and colleagues evaluated how long it took different machine-learning methods to do three tasks: find one good catalyst in a data set, find all the good catalysts in that data set, and predict the performance of catalysts not in the training data set. “This benchmarking evaluates the impact instead of the predictive power of an algorithm,” Gregoire says. In other words, they wanted to determine not just how well a machine-learning model can predict properties but how well it can address the larger goal of accelerating materials discovery. The researchers used a data set of catalysts whose properties they had already determined experimentally.
They found that the most advanced machine-learning models they tested, called sequential learning models, can discover all the high-performing catalysts in the data set about 20 times as fast as random sampling, a baseline approach that simply picks the next experiment at random (Chem. Sci. 2020, DOI: 10.1039/C9SC05999G). Sequential learning means the computer chooses which experiments to do next to improve a model’s predictive performance. But the group also found that if the model wasn’t set up optimally, sequential learning could take 1,000 times as long as random sampling.
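The difference between the two strategies can be sketched with a toy experiment. This is not the study’s setup: the 100 “catalysts,” their integer performance scores, and the bare-bones greedy picker (rather than a trained surrogate model) are all invented for illustration. The metric, though, mirrors the study’s: how many experiments it takes to find every high performer.

```python
import random

# Hypothetical catalyst "library": performance peaks at candidate 70, and the
# 9 candidates scoring above 95 count as the high performers to be found.
# (Integer scores avoid floating-point edge cases at the threshold.)
performance = [100 - abs(i - 70) for i in range(100)]
targets = {i for i, p in enumerate(performance) if p > 95}

def run(strategy, seed):
    """Count 'experiments' until every high performer has been measured."""
    rng = random.Random(seed)
    measured, found = {}, set()
    while found != targets:
        i = strategy(measured, rng)
        measured[i] = performance[i]  # run the experiment
        if i in targets:
            found.add(i)
    return len(measured)

def random_sampling(measured, rng):
    """Baseline: pick any unmeasured candidate at random."""
    return rng.choice([i for i in range(100) if i not in measured])

def sequential(measured, rng):
    """Greedy sequential picker: after a few random seeds, probe the
    unmeasured candidate closest to the best result so far."""
    if len(measured) < 5:
        return random_sampling(measured, rng)
    best = max(measured, key=measured.get)
    return min((i for i in range(100) if i not in measured),
               key=lambda i: abs(i - best))

seq_cost = sum(run(sequential, s) for s in range(10)) / 10
rand_cost = sum(run(random_sampling, s) for s in range(10)) / 10
```

In this toy, the sequential picker homes in on the peak and finds all the high performers in far fewer experiments than random sampling, which on average must churn through most of the library. It also hints at Gregoire’s warning: a sequential learner guided by a bad signal can do worse than guessing.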
Gregoire says the results are a good lesson about understanding what tasks different machine-learning models are good for in materials discovery. “We need to be very careful about how we do this, because the floor is way deeper than the ceiling is high,” he says.
These researchers are developing benchmarking methods as standalone projects. Heather J. Kulik of the Massachusetts Institute of Technology and her team have been implementing benchmarking as a natural part of their materials discovery process with machine learning. “I have to be able to defend that we learned something in a way we couldn’t” without machine learning, Kulik says. She says her group typically publishes the results of its benchmarks in papers. Benchmarks also help new group members understand the strengths of different models, she says.
Kulik thinks new benchmarking tools like the ones Fung, Dunn, Gregoire, and their colleagues have developed are great if they can get people to use them. But she points out that there’s a human element that may override even the most rigorous and empirical benchmark. “Even if we know what should be the most accurate thing, we don’t always choose it, maybe because of our own biases,” she says. People might choose the machine-learning model that’s most cited or the one their graduate advisers used, she says, even against the evidence.