This week, scientists at Meta, the firm behind Facebook and Instagram, released predicted structures for more than 600 million putative proteins in a database called the ESM Metagenomic Atlas. The proteins are predicted to exist based on genetic data from large-scale metagenomic screens of soil, seawater, and other sources. They have yet to be isolated or identified using proteomic methods.
The team describes the method used to perform this feat in a preprint (bioRxiv 2022, DOI: 10.1101/2022.07.20.500902), which has yet to undergo peer review.
In July, the Alphabet-owned company DeepMind announced that it had filled a database with predicted structures for almost all known proteins. That database holds around 200 million models made using AlphaFold, DeepMind’s algorithm for predicting protein structures. The Meta AI algorithm used to make the new protein models, ESMFold, is not as accurate as AlphaFold, but it is quicker, researchers say. The speed comes from how the tool works: it predicts protein structures using a language model trained on sequence data, meaning the order of amino acids in the linear chain that makes up a protein. The increased speed meant that the researchers could predict the 600 million structures in just 2 weeks, using a cluster of approximately 2,000 graphics processing units.
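To put those figures in perspective, the short Python sketch below works out the implied throughput. It is a back-of-the-envelope estimate only: it uses the rounded numbers quoted above and assumes, purely for illustration, that all the graphics processing units ran continuously for the full 2 weeks. The function and variable names are illustrative and do not come from Meta's code.

```python
# Back-of-the-envelope throughput from the rounded figures in the article.
# Assumption (not from the source): every GPU runs nonstop for the whole period.
STRUCTURES = 600_000_000  # "more than 600 million" predicted structures
DAYS = 14                 # "just 2 weeks"
GPUS = 2_000              # "approximately 2,000" graphics processing units


def throughput(structures, days, gpus):
    """Return (cluster structures per second, GPU-seconds per structure)."""
    seconds = days * 24 * 60 * 60
    cluster_rate = structures / seconds
    gpu_seconds_each = seconds * gpus / structures
    return cluster_rate, gpu_seconds_each


cluster_rate, gpu_seconds_each = throughput(STRUCTURES, DAYS, GPUS)
print(f"{cluster_rate:.0f} structures per second across the cluster")
print(f"{gpu_seconds_each:.1f} GPU-seconds per structure")
```

Under those assumptions, the cluster would have turned out roughly 500 structures every second, or about 4 seconds of processor time per structure. The real per-structure time likely varied with protein length and with how the work was scheduled.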
The Meta AI researchers have also published the code that they used to create the new database. They intend for other scientists to use the tool for their own research.
Pernilla Wittung-Stafshede, a protein folding expert at Chalmers University of Technology, says the new database “gives a really broad view of [the] protein universe on Earth.” But she cautions that structure prediction algorithms are just the beginning. Teasing out each protein’s function will take more work, and that, she says, is the next challenge.