Sequencing the human genome was only the beginning. Once biologists finished the project, they needed to figure out what the blueprint was encoding. But there was a lot to sift through. To make that job more tractable, researchers arbitrarily set a lower length boundary for calling something a protein-coding gene. They are now realizing that this choice led them to miss the shortest of proteins that play roles in biology.
Since these microproteins were first identified in the early 1990s, researchers have found thousands of them; thousands more may still be unknown. As scientists continue to search, they are turning their attention to an even harder task: figuring out the functions of these proteins. Once they determine what those proteins do, the next step is harnessing them as drugs or as tools to control biological functions.
Most protein-coding regions are defined by a start and a stop sign, each encoded by a trio of base pairs called a codon. Start and stop codons could occur randomly in the DNA sequence, but the longer the stretch of DNA between them—what’s known as an open reading frame, or ORF—the greater the likelihood that the intervening sequence encodes a protein. An ORF would be considered protein coding if it was 100 codons long or more (50 for microorganisms).
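The length rule is simple enough to sketch in a few lines of Python. What follows is an illustration only, not the annotation pipeline any genome project used. The function name `find_orfs` and the parameter `min_codons` are made up for this example, and the sketch assumes ATG as the only start codon and scans only the three forward-strand reading frames.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three stop codons of the standard genetic code
START_CODON = "ATG"                  # assumption: only the canonical start codon counts

def find_orfs(dna, min_codons=100):
    """Return (start, end, n_codons) for each forward-strand ORF.

    n_codons counts the start codon but not the stop codon.
    Coordinates are 0-based and half-open, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i  # remember the first start codon in this frame
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # resume looking for the next start codon
    return orfs

# A toy sequence: a start codon, 30 alanine codons, and a stop codon.
toy = "ATG" + "GCT" * 30 + "TAA"
missed = find_orfs(toy, min_codons=100)  # the 100-codon cutoff discards it
found = find_orfs(toy, min_codons=10)    # a lower cutoff keeps it
```

With the cutoff at 100 codons, the toy ORF never makes the list. Lowering the cutoff recovers it, but on a real genome it would also let in many more chance pairings of start and stop codons, which is the trade-off the original rule was meant to manage.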
But as often happens with arbitrary rules, scientists are finding many examples that break them. Biologists call the small proteins encoded by ORFs shorter than that cutoff microproteins. Microproteins are a subset of the so-called dark proteome, the portion of the proteome for which researchers haven’t been able to find experimental evidence.
“When you had a long open reading frame, you were pretty confident that that wouldn’t happen by chance. That was a way of identifying a trusted set of proteins, but it wasn’t a way of finding out everything that was translated,” says Jonathan Weissman, a molecular biologist at the Whitehead Institute at the Massachusetts Institute of Technology.
Researchers soon realized that they needed new techniques to understand which ORFs were actually being translated into proteins in the cell. To do that, Weissman and his team developed a ribosome profiling platform called Ribo-seq (Science 2009, DOI: 10.1126/science.1168978).
In Ribo-seq, researchers chemically halt the process of protein translation, in which the ribosome uses messenger RNA (mRNA) as a template for synthesizing proteins. By adding a nucleic acid-cleaving enzyme, the researchers can digest all the mRNA in their sample except for the stretches being translated, which the ribosomes shield from digestion. Each ribosome protects a section of mRNA about 30 nucleotides long. Sequencing the protected fragments identifies the precise position in the RNA being translated. The method is an empirical way of measuring exactly which genes are being translated, according to Weissman.
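The final bookkeeping step, deciding which candidate ORFs the protected fragments land in, can be pictured with a toy tally. This is a sketch under stated assumptions, not the analysis from the Science paper. It assumes the fragments have already been sequenced and aligned to a single transcript, and the function name `footprint_coverage`, the ORF names, and the coordinates are all hypothetical.

```python
def footprint_coverage(footprints, orfs, footprint_len=30):
    """Count ribosome footprints that lie entirely inside each candidate ORF.

    footprints: 0-based start positions of protected fragments on one transcript
    orfs: dict mapping an ORF name to (start, end), 0-based and half-open
    footprint_len: assumed fragment length, about 30 nucleotides
    """
    counts = {name: 0 for name in orfs}
    for pos in footprints:
        for name, (start, end) in orfs.items():
            if start <= pos and pos + footprint_len <= end:
                counts[name] += 1
    return counts

# Hypothetical transcript with one annotated ORF and one short upstream ORF.
candidate_orfs = {"annotated": (100, 500), "small_orf": (20, 80)}
aligned_starts = [25, 40, 60, 120, 300, 480, 700]
counts = footprint_coverage(aligned_starts, candidate_orfs)
```

In this toy case, footprints inside `small_orf` would count as evidence that ribosomes are reading a stretch no annotation predicted. Real analyses involve many more steps that are not shown here.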
Ribo-seq revealed that much more of the genome was being translated than biologists had realized. “It is still an open question how many of those translation events really lead to functional proteins in the cell,” says Sarah Slavoff, a chemical biologist at Yale University who is developing tools to detect and study microproteins and other noncanonical proteins.
These microproteins are similar in size to, or even smaller than, previously known peptides, such as hormones or neuropeptides, but they are a distinct group, says Alan Saghatelian, who studies microproteins at the Salk Institute for Biological Studies.
“The best argument for calling them microproteins is that they’re made ribosomally as small proteins” and are released into the cell as is, Saghatelian says. “The things that we consider bioactive peptides undergo a large amount of posttranslational processing, including proteolysis, to get to their mature form.”
Before techniques like Ribo-seq allowed researchers to identify them, these microproteins had evaded detection for a variety of reasons. For example, researchers using mass spectrometry (MS), the main approach for proteomics, had failed to notice them. Thus, in a vicious cycle, the microproteins weren’t in the databases that researchers use to annotate mass spectra. And microproteins’ small size means that they may lack the cleavage sites for the enzymes, such as trypsin, used to cut proteins into peptides before MS analysis. They may also sometimes have peptides that match those from other proteins. In such cases, the assumption would be that they came from the known protein, making them impossible to uniquely identify. In addition, they tend to be present at low abundance, so their signal can be swamped by more-abundant proteins.
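The cleavage-site problem can be illustrated with an in silico digest. The sketch below applies the commonly used trypsin rule (cut after lysine, K, or arginine, R, unless the next residue is proline, P). The function names, the example sequences, and the 7-to-30-residue window for peptides an instrument can readily identify are assumptions made for illustration, not values from the researchers quoted here.

```python
import re

def tryptic_peptides(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    # Zero-width split points; requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if p]

def readily_identified(peptides, min_len=7, max_len=30):
    """Keep peptides inside an assumed MS-friendly length window."""
    return [p for p in peptides if min_len <= len(p) <= max_len]

# Hypothetical 16-residue microprotein sequence.
peptides = tryptic_peptides("MKTAYIAKQRPLLGSR")
usable = readily_identified(peptides)
```

A large protein typically yields many peptides in the usable window, so losing a few does not matter. A microprotein may yield only one or none, which leaves little room for error.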
As technologies improved, researchers could detect microproteins directly by MS. “You could see which ORFs were actually being translated, which ones were making stable long-lived proteins,” Saghatelian says. “In some cases you might only be interested in translation, and Ribo-seq is great [for that]. But if you want to figure out functions for microproteins, it really helps to have evidence that the protein is detectable by proteomics.” Both methods suggest that there are a lot of microproteins, however. Thousands of them.
Much of the last decade and a half has gone to identifying and cataloging new microproteins, Slavoff says. That work has produced high-quality evidence for about 7,000 of them in humans, and there may be thousands more.
The catalog is far from complete, because many microproteins are specific to certain cell types and conditions. But there is growing consensus on the importance of the ones that have been identified. The focus is now shifting to identifying the function of these proteins and their biological role. “We know almost nothing about what most of them do,” Slavoff says. “You can hypothesize that there are many other functions that we just haven’t screened for yet.”
At the University of Texas Southwestern Medical Center, Eric Olson and coworkers study the development and diseases of muscles, including the heart. While working with what they thought were noncoding RNAs, they discovered that the RNAs encoded microproteins.
Olson found that many of the microproteins in muscle tissue are located in the mitochondria, the organelles in cells that provide energy in the form of adenosine triphosphate (ATP). “That realization blew me away, because mitochondria have been studied for decades,” Olson says. “I would have naively thought that we knew most of the components of the mitochondria. But lo and behold, there’s this whole collection of undescribed micropeptides that reside in different compartments of the mitochondria.” For example, the researchers identified microproteins that activate a specific calcium pump that controls muscle contractility (Sci. Signal. 2016, DOI: 10.1126/scisignal.aaj1460).
Meanwhile, Saghatelian and coworkers have used Ribo-seq to identify microproteins produced by fat cells (Cell Metab. 2023, DOI: 10.1016/j.cmet.2022.12.004). Of the 3,877 potential microprotein-encoding ORFs that the team found, it was able to validate 85 microproteins with MS measurements; 33 of the microproteins were secreted by the fat cells. The researchers don’t yet know the functions of the microproteins they found, but they think some of them could be involved in regulating metabolism.
To uncover more microproteins, Slavoff and her coworkers recently used a chemoproteomic labeling technique, followed by MS, to identify 22 previously unannotated microproteins. Several were synthesized only under stress conditions, such as DNA damage (Nat. Chem. Biol. 2022, DOI: 10.1038/s41589-022-01003-9).
Slavoff thinks that one of the proteins the team detected, MINAS-60, regulates the synthesis of the large ribosomal subunit, called 60S. The gene that encodes MINAS-60 overlaps the one that encodes another RNA-binding protein called RBM10. She thinks MINAS-60 pauses assembly of the 60S to make sure that everything is correct before the subunit gets exported to the cytoplasm. “This is the first time that this regulatory step in ribosome biogenesis has been observed in the cell,” Slavoff says.
Many microproteins that have been found so far are translated from RNA near sections that code for other known proteins or peptides. Other microproteins are instead translated from RNA predicted to be long noncoding RNA.
The fact that some noncoding RNAs do in fact code for microproteins may have been overlooked previously because of the original criteria for coding versus noncoding sequences, Saghatelian says. The criteria used to distinguish between coding and noncoding RNA were very stringent. “Sometimes things just got missed,” he says.
Thomas Martínez, who studies microproteins at the University of California, Irvine, finds microproteins encoded by long noncoding RNAs more straightforward to study, especially ones that are recognized to be involved in cancer or other diseases. “We already know something genetically about them, what they’re regulating,” he says. “Now it’s just a matter of reassessing it through the lens of protein as opposed to RNA. That’s not always easy, though.”
Proving that a microprotein and not the RNA is at work can be a big challenge, Martínez says.
Saghatelian isn’t convinced that the protein always performs the function in a biological system. “It’s also not 100% clear in those cases that just because there’s a peptide there, the peptide is responsible for the biology. It could still be that the RNA has a function and the microprotein also has a function as well,” he says.
John Prensner, a pediatric oncologist at the University of Michigan who studies long noncoding RNAs and microproteins, thinks that many presumed noncoding RNAs are bifunctional molecules. In some contexts, the RNA molecule will have its own function, while in others it will be translated to form a microprotein, he says. “The hard work for our community is to distinguish and dissect those possibilities scientifically,” he says.
So far, researchers have found microproteins that serve as regulatory elements or that modulate other proteins. Prensner thinks that there will also be many with their own functions, but they will be harder to predict and study. “Many of them will be highly context specific,” he says.
Gisela Storz and her team at the National Institutes of Health have studied bacterial regulatory RNAs for many years. “We realized that a subset of these encode small proteins,” she says. The group has been able to figure out functions for some of these microproteins. For example, it has identified a microprotein that affects the specificity of an antibiotic transporter, a protein that bacteria have evolved to pump antibiotics out of their cells and develop drug resistance. Cryo-electron microscopy showed how the microprotein interacts with the transporter to change its structure (Structure 2020, DOI: 10.1016/j.str.2020.03.013).
Ami Bhatt, a microbial genomicist at Stanford University, got interested in microbial microproteins when it became clear that many of the DNA sequences generated by shotgun sequencing couldn’t be classified. “I became obsessed with the idea that our reference databases are incomplete for microbial genomes,” Bhatt says.
She and her coworkers took a computational approach to discover about 4,500 families of small ORFs in a mixture of bacteria (Cell 2019, DOI: 10.1016/j.cell.2019.07.016). They proceeded to experimentally demonstrate that about 1,000 of those ORF families result in proteins. “We are aggressively trying to figure out what these proteins do,” Bhatt says.
Her goal is to figure out how these microbial microproteins might affect human health. For example, Bhatt identified a microprotein that’s involved with a type of bacterial communication known as quorum sensing. “If we could figure out how [bacteria] signal, we can co-opt or interrupt those signals for benefit in human medicine,” she says.
Two companies have already been launched to discover medicines based on microproteins. In some cases, the microprotein itself would be used as a drug. In others, the microprotein would be a target for another molecule to act on.
Velia Therapeutics, which was founded by leaders in the microproteins field, including Olson, Saghatelian, and Weissman, is looking for new ways to treat cancer and autoimmune diseases. The company’s first task is cataloging the entire human microproteome, according to its chief scientific officer, Shelly Meeusen.
In addition to identifying the full set of microproteins—or at least the ones 10 amino acids or longer—Velia is trying to determine which ones have biological activity and could be leveraged as, or tweaked to be, drugs. “The idea is to see which of these are really going to move the needle for human health, which ones are going to have a big impact,” Meeusen says. “There may be others that have meaningful biology but will probably not be good therapeutic targets because the biology around the proteins themselves is subtle.”
Another company, ProFound Therapeutics, was built around a question: What if more of the genome is translated into protein than we realize? The firm is not focused on microproteins but is finding them as part of its protein mining. To discover new proteins, ProFound uses a combination of detection approaches, including methods that identify proteins as they’re being made. “It’s akin to going to the factory and understanding what’s at the factory to get an inventory of everything being made as opposed to looking for products in the wild,” says Avak Kahvejian, a partner at Flagship Pioneering and a cofounder of ProFound. “We’ve already started to build a platform that’s bearing fruit. It’s illuminating the genome in a different way, allowing us to find tens of thousands of novel proteins.” The proteins ProFound has found include cancer targets, immune modulators, and circulating factors.
Many questions remain about which microproteins are going to be most interesting in various biological settings, Saghatelian says. But the only way to start answering them is through experiment. He and his coworkers are using CRISPR screens to figure out which microproteins are functional. “The next step will be, do any of these translate? Are they going to be of any value clinically? That part still remains to be seen.”
Celia Henry Arnaud is a freelance writer based in College Park, Maryland.