Credit: Ian C. Haydon/UW Institute for Protein Design | Detail of a protein designed using a tool called ProteinMPNN
Protein designers are uniquely positioned to capitalize on the ongoing artificial intelligence revolution. Researchers are rushing to apply technological advances to their work, building new biology with each new development. There has been an explosion in the number of groups working in the field and in the tools available to them. The models these researchers make are moving from simply designing new shapes to creating functional molecules for an ever-expanding set of applications. The inflection point, they say, is now.
In the past year, phrases like large language models, diffusion, and hallucination have gained new meanings in popular culture as artificial intelligence has started taking over mundane tasks. Today, users can log on to an AI-powered chatbot and ask it to draft texts based on simple prompts. They can then use text-to-image services to create illustrations and new images to accompany the dreamed-up words.
But beyond these consumer applications, algorithmic approaches are helping researchers create a whole world of new proteins—proteins that could become vaccines, biologic therapies, materials, or tools for bioremediation.
A few years ago, C&EN chatted with David Baker of the University of Washington about a host of topics, including de novo protein design, which is designing new proteins from scratch rather than adjusting existing ones. Back then, he said he tried not to look too far into the future. Too much could change; too much was uncertain. That has never been truer.
De novo protein design has reached an inflection point, researchers say. AI-powered protein design is becoming very real and very usable, thanks to technological advances in the development of algorithms and the hardware that runs them.
Protein science itself was uniquely positioned to take advantage of these advances because of the enormous amounts of work carried out over the past 50 years to curate and annotate biological data.
“Every time there is a new method in computer vision or natural language processing, we are in a race to try to transfer it to biology,” says protein designer Noelia Ferruz at the Institute of Molecular Biology of Barcelona. “I guess it’s the perfect moment, because we’re seeing an AI revolution in every field.”
Proteins are hugely variable and incredibly specialized. They can form large, complicated complexes that mediate biological functions, or they can exist as small peptides that merely send signals from place to place. Proteins move and interact. They bind to and modify one another. They form part of the complex molecular dance we call life.
But proteins are also just molecules. They are composed of amino acid building blocks stuck together using amide linkages to create polymer chains. These polymers can curl up to build various shapes, chemical environments, forms, and functions, depending on where they sit and the interactions between the side chains in the molecules. For example, a hydrophobic portion might be found curled up inside the protein it’s part of or buried in a fatty membrane.
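One simple way to see where a hydrophobic stretch sits in a chain is to score each residue on a hydropathy scale and average over a sliding window. The sketch below uses the widely cited Kyte-Doolittle values; the function names, the window size of 5, and the example sequence are illustrative assumptions, not something taken from the researchers quoted here.

```python
# Kyte-Doolittle hydropathy values; higher numbers mean more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 5) -> list[float]:
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def most_hydrophobic_window(sequence: str, window: int = 5) -> tuple[int, str]:
    """Return the start index and residues of the most hydrophobic window."""
    profile = hydropathy_profile(sequence, window)
    if not profile:
        raise ValueError("sequence is shorter than the window")
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, sequence[start:start + window]


# Made-up sequence: charged ends flanking a greasy core.
print(most_hydrophobic_window("DKERSLIVLAGKDE"))  # (5, 'LIVLA')
```

A stretch that scores high on such a scale is a candidate for the buried core or a membrane-spanning segment, though real predictions rely on far more than a single scale.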
As researchers started to understand more and more about these processes, they wanted to do more. They dreamed of having more influence over the shape and function of a protein. Ultimately, rather than modify existing proteins, protein designers wanted to build them de novo, from the ground up.
But for years, that required a detailed understanding of the chemical and physical aspects at play. That changed in the past year, with an explosion in AI and machine learning–powered models that can be trained on existing data and then asked to spit out new solutions without having to understand the chemistry and the physics.
The pace of change has accelerated to the point where many researchers privately admit that they struggle to keep up with the flood of papers and tools in the field, many first appearing as preprints—before peer review. The approaches often use one of two types of modeling: large language models and diffusion.
The two algorithmic approaches aren’t unique to protein design. They were first developed for very different applications, Ferruz says. But proteins are well suited to these algorithms. Their primary sequences can be described by language models, and their 3D structures work with computer vision training and image generation.
Ferruz remembers when she realized she could use generative AI algorithms, which create new content from training data, for her protein design work. It was 2019. Then a postdoc at the University of Bayreuth, in Germany, Ferruz was learning German in her spare time when the tech company OpenAI launched its Generative Pre-trained Transformer 2 (GPT-2) for text generation.
“I was finding it very hard to learn a new language,” she recalls. “And then there was this model that was speaking perfect English. I thought, ‘Wow, OK, this model is able to properly speak a language.’ So I started wondering if this model is close to starting to speak other things, like proteins.” That inspired her to create ProtGPT2, a modified version of GPT-2 trained to “read” and understand the amino acid sequence “language” of proteins and then generate new amino acid sequences (Nat. Commun. 2022, DOI: 10.1038/s41467-022-32007-7).
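The idea of treating amino acid sequences as a language can be illustrated with a far simpler model than a transformer. The sketch below is a toy stand-in, not ProtGPT2: it counts which residue tends to follow which in a handful of made-up training strings, then samples new sequences one residue at a time. The boundary tokens, function names, and training strings are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

START, END = "<", ">"  # hypothetical markers for sequence start and end


def train_bigram_model(sequences):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_length=60):
    """Sample residues one at a time, each conditioned on the previous one."""
    residues = []
    prev = START
    while len(residues) < max_length:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        residues.append(nxt)
        prev = nxt
    return "".join(residues)


# Toy training strings, not real proteins.
training = ["MKTAYIAKQR", "MKVLAAGIVG", "MSTAKLVKQR"]
model = train_bigram_model(training)
print(generate(model, random.Random(0), max_length=30))
```

A large language model does the same kind of next-token prediction, but it conditions on the whole preceding sequence rather than a single residue and learns from millions of natural proteins rather than three short strings.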
There are now several large language models for protein design. One example is from the protein design start-up Profluent, which came out of stealth mode with $9 million in seed funding in January. The firm is led by CEO Ali Madani, who has a background in using machine learning for natural language processing and has brought that expertise into the protein design arena.
A paper that accompanied Profluent’s launch describes how its large language model can generate functional protein sequences (Nat. Biotechnol. 2023, DOI: 10.1038/s41587-022-01618-2). Madani sees very real parallels between reading and writing English with an algorithm and reading and writing biology with one. “I had expected that this would just be an analogy,” he says. “But surprisingly, this analogy has held.”
However, language-based models aren’t the only hot area in AI/machine learning right now. Diffusion, typically associated with image generation, is a different generative technique that researchers are also using.
Baker’s team at the University of Washington is one of several that have developed diffusion-based methods to dream up new protein structures and then work from the structures to determine how to make them efficiently.
Within days in December 2022, two preprints—one from the Baker lab (bioRxiv 2022, DOI: 10.1101/2022.12.09.519842) and one from the biotech firm Generate Biomedicines (bioRxiv 2022, DOI: 10.1101/2022.12.01.518682)—were published describing new diffusion-based approaches. The Generate model, called Chroma, starts by modeling a protein’s shape and then asks another algorithm to design an amino acid sequence that might fold into that shape.
A diffusion approach is well suited to protein design because proteins have so many parts to consider, says John Ingraham, Generate’s head of machine learning. It is hard to make a model that can immediately come up with the final answer, he says. But diffusion processes continually add and subtract noise from the system.
“It’s not about how do I make the final thing; it’s about how do I make small changes that make it better?” he says. “I think that’s a really powerful paradigm.”
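The "small changes" idea can be sketched with a toy example on 3D coordinates. The code below is not Chroma or the Baker lab's method. It scrambles an idealized helix with noise, then walks back toward a structure through many small denoising steps. In a real diffusion model, a trained neural network predicts the clean structure at each step; here that role is played by an oracle that already knows the answer, which is a deliberate cheat to keep the sketch short. The step size, noise scale, and function names are assumptions.

```python
import math
import random

Point = tuple[float, float, float]


def add_noise(coords: list[Point], sigma: float, rng: random.Random) -> list[Point]:
    """Forward direction: corrupt every coordinate with Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in coords]


def rmsd(a: list[Point], b: list[Point]) -> float:
    """Root-mean-square deviation between matched points (no superposition)."""
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))


def reverse_diffusion(start, predict_clean, steps, rng,
                      step_size=0.2, noise_scale=0.5):
    """Reverse direction: many small moves toward the predicted clean structure."""
    coords = start
    for t in range(steps, 0, -1):
        estimate = predict_clean(coords)          # a trained network in practice
        sigma = noise_scale * (t - 1) / steps     # injected noise fades to zero
        coords = [
            tuple(c + step_size * (e - c) + rng.gauss(0.0, sigma)
                  for c, e in zip(p, q))
            for p, q in zip(coords, estimate)
        ]
    return coords


# Idealized alpha helix: roughly 2.3 angstrom radius, 100 degrees and
# 1.5 angstrom rise per residue.
target = [(2.3 * math.cos(math.radians(100 * i)),
           2.3 * math.sin(math.radians(100 * i)),
           1.5 * i) for i in range(20)]

rng = random.Random(0)
noisy = add_noise(target, sigma=10.0, rng=rng)
denoised = reverse_diffusion(noisy, lambda coords: target, steps=100, rng=rng)
print(f"RMSD before: {rmsd(noisy, target):.2f}, after: {rmsd(denoised, target):.2f}")
```

No single step produces the final structure; each one only nudges the coordinates a little closer, which is the paradigm Ingraham describes.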
Designers will likely combine different approaches to build these new proteins. Ingraham points out that a strength of Generate’s platform is that it is modular and programmatic, so you can mix and match different models.
At Profluent, Madani says researchers aren’t just using sequence-based models but are also incorporating structural data and adapting the software to the problem. And that’s becoming the norm.
“I personally am quite method agnostic,” says Namrata Anand, a protein designer at Diffuse Bio, a specialist in generative protein design. “I think, to actually solve downstream problems, you’re going to need a whole bunch of complementary approaches.”
All these new tools and algorithms are opening up de novo protein design to new teams of researchers. That’s something that Chris Wells Wood at the University of Edinburgh welcomes.
Wells Wood sees the ability to combine different protein design methods as the future. He says one “excellent example” is the work of researchers, including Sergey Ovchinnikov at Harvard University, to build the ColabFold and ColabDesign platforms of open-source software, which spare users from having to build or install expensive software or equipment themselves.
A recent beneficiary is physicist Hendrik Dietz of the Technical University of Munich. Dietz spent most of his career working on DNA origami to create self-assembling structures. But the physicist says he has dreamed of building proteins ever since he spent time pulling them apart while working on his PhD. In the past, design techniques to make proteins form scaffolds and structures weren’t as advanced as he wanted. That has now changed.
Rather than build algorithms from scratch, Dietz needed someone in the lab to find and adapt existing ones. Christopher Frank had no experience in de novo protein design when he started his PhD project with Dietz. But Frank explored all the tools he could use to create a protein design pipeline, and then he stitched some together.
Dietz and Frank collaborated with Ovchinnikov and used his ColabFold platform to build a new protocol that includes one algorithm to create a structure, another to design a sequence, and then a final folding algorithm to confirm that the designed sequence should fold into the intended structure.
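The three-stage logic of such a protocol (generate a structure, design a sequence for it, then fold the sequence and compare) can be sketched as a generic loop. This is not the code from the preprint: the three stages are passed in as hypothetical callables, the 2.0-angstrom cutoff is an assumed value, and the deviation measure assumes the two structures are already superimposed.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Backbone = Sequence[tuple[float, float, float]]


@dataclass
class Design:
    backbone: Backbone    # structure proposed by the first stage
    sequence: str         # sequence designed for that structure
    predicted: Backbone   # structure predicted back from the sequence
    deviation: float      # disagreement between the two structures


def coordinate_rmsd(a: Backbone, b: Backbone) -> float:
    """RMSD between matched points; assumes the structures are already aligned."""
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))


def run_pipeline(
    generate_backbone: Callable[[], Backbone],
    design_sequence: Callable[[Backbone], str],
    predict_structure: Callable[[str], Backbone],
    n_candidates: int,
    max_deviation: float = 2.0,  # assumed cutoff, in angstroms
) -> list[Design]:
    """Keep only designs whose predicted fold matches the intended backbone."""
    accepted = []
    for _ in range(n_candidates):
        backbone = generate_backbone()
        sequence = design_sequence(backbone)
        predicted = predict_structure(sequence)
        deviation = coordinate_rmsd(backbone, predicted)
        if deviation <= max_deviation:
            accepted.append(Design(backbone, sequence, predicted, deviation))
    return accepted
```

Because each stage is just a function argument, any one of them can be swapped for a different tool without touching the rest, which is the mix-and-match quality that several of the researchers describe. Designs that pass the computational check still need to be made and tested in the lab.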
The new workflow was published at the end of February as a preprint (bioRxiv 2023, DOI: 10.1101/2023.02.24.529906). Before then, Dietz and his lab wouldn’t have come up in any search of academic groups working in the protein design space.
The tools that now exist are democratizing the field and enabling more and more teams to get involved, Frank says. For Dietz, that means he can now shift focus from making structures from DNA to making them from proteins.
Protein design is progressing incredibly rapidly. Yet it would be a mistake to think that the field is entirely plug and play. Fluent speakers can often tell if a chatbot isn’t generating grammatically correct sentences. And people can see when image-generating software doesn’t get the number of fingers on a hand quite right. Protein designs also need that kind of verification.
“The ultimate decider of whether a protein works is what happens when you put it in a biochemically realistic environment,” Generate’s Ingraham says. That means that whether in academia or industry, protein designers sitting at a computer must work closely with their wet-lab-based colleagues to make those proteins and validate the designs.
Such feedback from experimentalists is also vital for training algorithms and helping design the next generation of useful proteins. One area that companies are already betting on is AI-designed antibodies and mini binders for therapeutic use.
The biotech firm Absci, for example, has been busy designing antibodies (bioRxiv 2023, DOI: 10.1101/2023.01.08.523187). And Baker’s lab has designed binders that can glom on to viruses to try to stop infection.
Even in the past year, improvements in the algorithms available have massively sped up the design process, Baker says. The lab’s coronavirus binders, which he hopes will go into clinical trials later this year, took around 4–6 months to design (Sci. Transl. Med. 2022, DOI: 10.1126/scitranslmed.abn1252). Influenza binders, which were created more recently using the team’s new RFdiffusion program, took just weeks (bioRxiv 2022, DOI: 10.1101/2022.12.09.519842).
The team has also designed binders against multiple human receptors as well as intrinsically disordered proteins and peptides (Nature 2023, DOI: 10.1038/s41586-023-05909-9).
Looking to the future, researchers are also working on designing functions—for example, with new enzymes and dynamic processes. In Barcelona, Ferruz dreams of one day asking an algorithm to create a carbon dioxide–capturing enzyme. To have control over that design process, she says, “we’re going to focus on trying to design new-to-nature enzymes.” The idea is that the user would input a chemical reaction, and the model would output an enzyme to catalyze it, she explains.
But several researchers agree that they still don’t understand all the important details needed for catalysis. That means researchers still need older approaches, such as rational design, in which designers draw on established knowledge of proteins and their chemistry, alongside AI. Using the right approach for the right problem is vital for the field to develop. Ideas of where else the field might go in the future are highly speculative.
Of course, algorithms can’t do everything. And Wells Wood in Edinburgh says scientists don’t know how far these programs can venture from the data that trained them. That’s one challenge that several researchers are now focused on.
“Our process of understanding the limitations of our current methods—it’s constantly ongoing,” Anand says. “We’re still in the regime where physics-based models are better for certain types of problems.”
Going further, understanding the limits of AI-powered algorithms could also provide insight into the physical and chemical processes involved in protein production and formation.
Ovchinnikov points out that AI-designed proteins often differ from naturally occurring proteins. For example, AI-designed proteins are much more thermally stable. He wonders if the algorithms have identified the “cheat code” for stable protein folds. Understanding that point could help protein researchers figure out even more about how proteins fold in the ways they do, he says.
Millions of proteins exist in nature. But the potential variety of de novo proteins is so much larger that it is hard to comprehend. The possibilities and the directions that AI-powered protein design might take are equally hard to fathom.
At a briefing for journalists in January, Kimberly Powell, vice president of health care at the AI computing firm Nvidia, said the pace of change is skyrocketing. Just as researchers now have a vast array of algorithms and architectures they can use to design proteins, she said it’s clear there is also considerable potential to use those designed proteins in useful ways. But she also acknowledged unanswered questions about where and how far AI-based protein design might go and what that will mean for the fields it touches.
“It’s always hard to predict the future,” Baker says. “I’ve always said that, and it’s only getting more true.”