Genetic data is more readily available than ever before—but interpreting it can be a challenge. Without months of work in the laboratory, it can be tough to tell what one of the millions of single–amino acid variants observed in humans does to protein function, let alone whether it might have a role in disease.
Researchers in the fast-moving field of protein machine learning are trying to speed up variant interpretation. This week, DeepMind, the Google team that developed the popular AlphaFold2 protein structure prediction engine, published a new artificial intelligence model that predicts whether protein-altering gene variants will have a benign or harmful effect (Science 2023, DOI: 10.1126/science.adg7492).
The model, dubbed AlphaMissense, works by combining structural predictions from AlphaFold2 with a technique called protein language modeling, which involves training an algorithm on an enormous number of amino acid sequences and using it to make statistical inferences about other sequences. It outputs a predicted pathogenicity score between 0 and 1 for each possible amino acid substitution at each point in a protein.
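To make the shape of that output concrete, the short Python sketch below lists every possible single amino acid substitution in a toy sequence and attaches a score between 0 and 1 to each one. It is only an illustration of the bookkeeping, not DeepMind's code: the names `enumerate_substitutions`, `score_table`, and `placeholder_scorer` are hypothetical, and the scorer returns a constant rather than a real prediction.

```python
from typing import Callable, Dict, Iterator, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids

Substitution = Tuple[int, str, str]  # (1-based position, original residue, new residue)


def enumerate_substitutions(sequence: str) -> Iterator[Substitution]:
    """Yield every single-residue swap: 19 alternatives at each position."""
    for position, original in enumerate(sequence, start=1):
        for alternate in AMINO_ACIDS:
            if alternate != original:
                yield position, original, alternate


def score_table(
    sequence: str, scorer: Callable[[str, int, str], float]
) -> Dict[Substitution, float]:
    """Score every substitution and check that each score lies in [0, 1]."""
    table: Dict[Substitution, float] = {}
    for position, original, alternate in enumerate_substitutions(sequence):
        score = scorer(sequence, position, alternate)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside the range 0 to 1")
        table[(position, original, alternate)] = score
    return table


def placeholder_scorer(sequence: str, position: int, alternate: str) -> float:
    """Hypothetical stand-in for a trained model; always returns 0.5."""
    return 0.5


toy_table = score_table("MKC", placeholder_scorer)
```

For a protein of length n made of standard residues, the table holds 19 × n entries, which is why running such a model across a whole proteome produces tens of millions of scored substitutions.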
The team ran the model on the whole human proteome and posted a publicly accessible database of some 71 million single–amino acid substitutions. Proteome-wide, the model predicted that harmful variants would be concentrated in structured regions and at residues like cysteine that play an outsized role in maintaining structure.
What the model does not do, the DeepMind researchers emphasized at a press conference, is predict how mutations change a protein’s structure, stability, or interaction with binding partners, a task recognized as a major challenge in the field. “The model predicts pathogenicity in the abstract,” says senior author Žiga Avsec. “But it doesn’t tell us the biophysical nature of what this mutation does.”
David Taylor, a structural biologist at the University of Texas at Austin, called the work “a game changer” in combining structures with protein language modeling and making the results available to all. The model doesn’t say why mutations might be harmful, but he says that biochemists might use it to help identify regions important to protein function for further study.
According to geneticists working in variant prediction, however, this is just the latest advance in a fast-moving field. Researchers at Illumina and a number of universities published a similar algorithm called PrimateAI-3D in June (Science 2023, DOI: 10.1126/science.abn8197). Like AlphaMissense, the Illumina model combines a protein language model with structure data. “I’m very surprised that AlphaMissense was published in Science after PrimateAI-3D,” says Vasilis Ntranos, senior author of a third recent algorithm, which classifies variants based strictly on a protein language model (Nat. Genet. 2023, DOI: 10.1038/s41588-023-01465-0).
Experts also stress that it is difficult to assess these algorithms’ accuracy. According to Michael Sternberg, a bioinformatics expert at Imperial College London, benchmarking tests suggest that, for now, AlphaMissense and its competitors are accurate enough to help researchers prioritize studies at the bench but not trustworthy enough to conclusively link a gene variant to a disease.
This story was updated on Sept. 28, 2023, to correct the name of a quoted DeepMind researcher. His name is Žiga Avsec, not Ziga Avosec.