In machine learning, researchers tend to think that the method known as deep learning makes its best predictions when models are trained on large amounts of data, on the order of hundreds of thousands or millions of data points. For example, these algorithms have made breakthroughs in image recognition after being trained on massive data sets. Now, new research argues that there are ways to make accurate deep learning predictions on molecular systems with data sets of only thousands of example compounds (Nat. Mach. Intell. 2021, DOI: 10.1038/s42256-021-00368-1).
That’s a relief for study author Michael Skinnider, a medical student at the University of British Columbia who earned his PhD under coauthor Leonard J. Foster. The two wanted to use deep learning to predict the molecular structures of illicit designer drugs based on mass spectrometry data, but Skinnider says only about 1,700 designer drug structures are known, well below the supposed limit of deep learning accuracy. Machine-learning algorithms develop the ability to make accurate predictions (about things like molecular structures or chemical properties) after exposure to relevant data, a process known as training.
So the researchers and their colleagues set out to determine just how many data are needed to properly train deep learning algorithms. They also wanted to find out if there were ways to modify the data, the algorithms, or the training procedures to improve accuracy when only limited data are available.
To put deep learning through its paces, the group built 8,500 models, then trained them on data sets of different sizes drawn from a set of 500,000 molecules represented as simplified molecular-input line-entry system (SMILES) strings. Then they tasked the models with predicting a gamut of molecule types, including plant, fungal, and bacterial metabolites. They found improvements in models’ predictive performance leveled off when the training set included about 10,000 to 20,000 SMILES strings. In general, they did find that predicting more complex chemical structures benefited from more training data.
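The sampling step in that experiment can be pictured with a short Python sketch. It is an illustration only, not the researchers' code: the placeholder pool, the function name draw_training_sets, and the three training-set sizes are hypothetical, chosen just to show how training sets of different sizes might be drawn from one 500,000-molecule pool.

```python
import random


def draw_training_sets(pool, sizes, seed=0):
    """Draw one random training set (without replacement) for each requested size."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {n: rng.sample(pool, n) for n in sizes}


# Placeholder pool: stand-ins for the 500,000 SMILES strings (hypothetical names).
pool = [f"molecule_{i}" for i in range(500_000)]

# Hypothetical sizes; the article does not list the exact sizes that were tested.
sizes = [1_000, 10_000, 100_000]

training_sets = draw_training_sets(pool, sizes)
for n, subset in training_sets.items():
    print(n, len(subset), subset[0])
```

A model would then be trained on each subset and scored on the same task, so that any change in performance can be attributed to training-set size.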
The researchers explored ways to improve performance when fewer data, say thousands of SMILES strings, are available. One simple technique was to train the models with data sets that included randomized SMILES strings: alternative notations that represent the same molecules already in the data set but with different sequences of letters and numbers. But changing the deep learning model, its operating parameters, or the training strategy did little to boost performance.
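To see why one molecule can have several SMILES strings, consider the toy Python sketch below. It is not the study's code, and it handles only molecules without rings or multiple bonds; the functions random_smiles and enumerate_smiles are hypothetical names, and in practice a cheminformatics toolkit would do this job. The idea is that a SMILES string records one walk through the molecule's atoms, so starting the walk at a different atom, or taking branches in a different order, yields a different string for the same structure.

```python
import random


def random_smiles(atoms, bonds, rng):
    """Write one SMILES string for an acyclic, single-bonded molecule,
    starting at a random atom and visiting neighbors in random order."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def walk(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        rng.shuffle(children)
        text = atoms[atom]
        # Every child except the last is written as a parenthesized branch.
        for child in children[:-1]:
            text += "(" + walk(child, atom) + ")"
        if children:
            text += walk(children[-1], atom)
        return text

    return walk(rng.randrange(len(atoms)), None)


def enumerate_smiles(atoms, bonds, n_tries=200, seed=0):
    """Collect the distinct strings produced by repeated random walks."""
    rng = random.Random(seed)
    return sorted({random_smiles(atoms, bonds, rng) for _ in range(n_tries)})


# Ethanol: two carbons and an oxygen joined in a chain (hydrogens are implicit).
ETHANOL_ATOMS = ["C", "C", "O"]
ETHANOL_BONDS = [(0, 1), (1, 2)]

variants = enumerate_smiles(ETHANOL_ATOMS, ETHANOL_BONDS)
print(variants)
```

Each string printed for ethanol describes the same three-atom chain, so adding such variants enlarges a training set without adding any new molecules.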
Machine-learning models like deep learning “are evolving very fast and assumptions about how much data is needed, which representation is best, or how to augment it become rapidly outdated,” says Rafael Gomez-Bombarelli, a computational materials scientist at the Massachusetts Institute of Technology who develops such models. He calls the new study “welcome and thorough.”
The British Columbia team’s effort to better understand deep learning predictions bore fruit for their designer drug research. Skinnider says they subsequently built a deep learning tool to predict molecular structures of unknown drugs based on molecular weight (ChemRxiv 2021, DOI: 10.26434/chemrxiv.14644854.v1). Preprints such as this paper are not peer-reviewed before publication. After training on 1,750 SMILES strings of known drugs, the model deduced the right structure just over 50% of the time, which Skinnider considers pretty good.