The protein community is awash in data. As of October, the UniProt database of protein sequences had more than 44 million entries. UniProt serves as a clearinghouse for sequences. Researchers who identify new sequences can add them and annotate them by including any functional information they’ve determined or inferred. Others can annotate sequences as well.
The functions of a small fraction of the millions of entries have been experimentally determined. But most proteins are annotated only on the basis of their similarity to other proteins or aren’t annotated at all. To make matters worse, many of those tentative annotations are wrong. And the problem only grows as more genomes are sequenced and more proteins of unknown function are identified.
What can be done to ferret out the protein functions buried in this tidal wave of data? That’s where the Enzyme Function Initiative (EFI) comes in. This consortium of approximately 80 researchers at nine universities in the U.S. and Canada is developing new tools for assigning functions to the enzyme component of the proteome. The initiative is funded by a Large-Scale Collaborative Project Award—a “glue grant”—from the National Institute of General Medical Sciences.
Better knowledge of enzyme function is an important scientific goal. Among other applications, it could facilitate the use of enzymes as catalysts for novel reactions and serve as a starting point for designing or redesigning proteins to develop or improve enzymes that carry out activities of interest.
Now in its fourth year, EFI’s efforts are starting to pay off. Using tools it developed, EFI members have been able to computationally predict enzyme functions and verify those functions experimentally. They’ve even discovered previously unknown metabolic pathways in which some enzymes participate. They hope to scale up these tools, make them accessible to the broader community, and make it possible for many more proteins to eventually be annotated.
“If you’re going to make inroads into the large number of enzymes for which we don’t know function, you’ve got to develop bioinformatics tools that allow you to look, large scale, at whole protein families at a time,” says John A. Gerlt, EFI director and an enzymologist and biochemistry professor at the University of Illinois, Urbana-Champaign.
“Every time a new genome is sequenced, probably 30%, 40%, or upward of 50% of all protein-coding regions in that genome are either unannotated or, what’s worse, misannotated in terms of their function,” says Steven C. Almo, an EFI member who is a crystallographer and biochemist at Albert Einstein College of Medicine. Misannotations can occur, for example, when people transfer annotations from one protein to others that seem to be related on the basis of sequence but turn out not to have a correspondingly close functional relationship.
The function of the first protein is rarely the problem, because “that’s almost always based on direct biochemical evidence,” Almo says. “The question is if a protein is 60% identical to the initially characterized protein, does it have the same function?” EFI researchers hope their tools will help correct mistakes that stem from wrongly assuming that proteins with similar sequences also share a function.
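To make the "60% identical" figure concrete, here is a minimal sketch of how percent identity can be computed once two sequences have been aligned. The function name, the gap convention, and the toy sequences are all assumptions made for illustration; this is not a tool the article attributes to EFI, and real pipelines rely on dedicated alignment software.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes the two strings are already aligned (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (one common convention, assumed here)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up sequences, not real proteins.
reference = "MKTAYIAKQR"
candidate = "MKTGYLAKHK"
identity = percent_identity(reference, candidate)
print(f"{identity:.0f}% identical")
```

A number like this says only how similar two sequences are. As Almo notes, whether it also implies the same function is the open question.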
The initiative is organized into scientific core projects and bridging projects. The five cross-cutting cores focus on developing tools to analyze proteins computationally and experimentally in a high-throughput fashion. The methods developed in the cores are tested and refined by using them in four bridging projects, each of which focuses on a different enzyme superfamily—enolases, glutathione transferases, haloacid dehalogenases, and isoprenoid synthases. A protein superfamily is the largest grouping of proteins for which common evolutionary ancestry can be inferred.
The initiative places a heavy emphasis on adapting computational methods that were developed for other applications. For example, EFI member Brian K. Shoichet, a professor of pharmaceutical chemistry at the University of Toronto and the University of California, San Francisco, and coworkers are identifying natural substrates of enzymes by using variations of docking programs they devised to identify potential drug molecules.
“One of the big innovations was not docking the ground-state structure, which is what you always do for ligand discovery, but docking the high-energy intermediates, the transition-state structures,” Shoichet says.
In addition, Shoichet’s team developed a covalent docking method to be able to work with haloacid dehalogenases. These enzymes form covalent intermediates, but the existing docking programs modeled only noncovalent interactions. For that enzyme superfamily, a docking program that sets the covalent bond in place was needed to help identify the enzymes’ substrates. “Brian has now made a program that not just docks but actually forms a bond” between the enzyme’s aspartate nucleophile group and its substrate, says Karen N. Allen, a chemistry professor at Boston University who is one of the leaders of that bridging project.
Building tools is just the beginning for EFI. Those tools are being applied to the bridging projects’ four enzyme superfamilies, each of which contains thousands of enzymes.
For example, earlier this year, EFI members used tools they had developed to predict functions for enzymes in the polyprenyl transferase subgroup of the isoprenoid synthase superfamily (Proc. Natl. Acad. Sci. USA 2013, DOI: 10.1073/pnas.1300632110). These enzymes are responsible for adding five-carbon isoprene units to growing chains.
“Nature has discovered how to make the same chain length in a couple of different subtle ways using the same fold,” says C. Dale Poulter, an enzymologist and chemistry professor at the University of Utah. “It’s impossible to look at a sequence where the homology is not pretty high and know what chain length it’s making.”
Poulter has analyzed product chain lengths of enzymes in the isoprenoid synthase superfamily in collaboration with Matthew P. Jacobson, a computational chemist and EFI member at UC San Francisco. Jacobson developed homology models for the enzymes and then used those models to predict product chain length for 79 members of the enzyme family. Poulter’s team experimentally determined the product chain lengths for each of the enzymes.
“Comparisons of experiment and prediction were within one isoprene unit for more than 90% of the structures we looked at,” Poulter says. “That may not sound too good, but the problem is that nature doesn’t select to within an exact isoprene unit. A lot of these enzymes make chains that differ by one five-carbon isoprene unit equally well, so the predictions did turn out to be pretty good.”
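The "within one isoprene unit" comparison amounts to a tolerance-based agreement score. The sketch below shows one way such a score could be computed. The function name and the chain lengths are hypothetical, chosen only to illustrate the arithmetic. They are not data from the study.

```python
def fraction_within(predicted, observed, tolerance=1):
    """Fraction of enzymes whose predicted chain length (in isoprene units)
    is within `tolerance` units of the experimentally observed length."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


# Hypothetical chain lengths in isoprene units (illustrative only).
predicted_units = [2, 3, 4, 8, 10, 3, 4, 9, 2, 4]
observed_units = [2, 3, 3, 8, 11, 3, 4, 6, 2, 4]

exact = fraction_within(predicted_units, observed_units, tolerance=0)
within_one = fraction_within(predicted_units, observed_units, tolerance=1)
print(f"exact: {exact:.0%}, within one unit: {within_one:.0%}")
```

Comparing the two tolerances shows why the looser criterion is a fair one when, as Poulter says, many enzymes make chains differing by one unit equally well.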
In another recent study, Gerlt, Jacobson, and coworkers predicted and verified a completely new metabolic pathway (Nature 2013, DOI: 10.1038/nature12576). They started with a protein for which the structure, but not the function, was known. The structure had been solved in 2007 as part of the Protein Structure Initiative, another large-scale project within the National Institutes of Health.
They took advantage of a property of bacterial genomes to determine the identities of two other enzymes that form a metabolic pathway with the first enzyme. “Metabolic pathways in bacteria are often encoded by operons or gene clusters,” Gerlt says. “There are very large gene clusters that encode the enzymes involved in the construction of very complicated natural products.” If those cooperating enzymes are close to one another in the genome, that “simplifies the problem because at least you believe you know what all the players are,” Gerlt adds.
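The idea of using genome neighborhood to nominate pathway partners can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Gene` record, the `max_gap` threshold, the same-strand rule, and the gene coordinates are all hypothetical, and the article does not describe the actual procedure the researchers used.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def cluster_neighbors(genes: List[Gene], query: str, max_gap: int = 100) -> List[str]:
    """Names of genes contiguous with `query`: same strand, and each
    separated from the next by at most `max_gap` bases (assumed rule)."""
    ordered = sorted(genes, key=lambda g: g.start)
    positions = [i for i, g in enumerate(ordered) if g.name == query]
    if not positions:
        raise ValueError(f"gene {query!r} not found")
    idx = positions[0]
    strand = ordered[idx].strand

    def linked(left: Gene, right: Gene) -> bool:
        return (left.strand == strand and right.strand == strand
                and right.start - left.end <= max_gap)

    lo = idx
    while lo > 0 and linked(ordered[lo - 1], ordered[lo]):
        lo -= 1  # extend the cluster upstream
    hi = idx
    while hi < len(ordered) - 1 and linked(ordered[hi], ordered[hi + 1]):
        hi += 1  # extend the cluster downstream
    return [g.name for g in ordered[lo:hi + 1] if g.name != query]


# Made-up coordinates for illustration.
toy_genome = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 950, 1800, "+"),
    Gene("query", 1850, 2700, "+"),
    Gene("geneC", 2750, 3500, "+"),
    Gene("geneD", 4500, 5200, "+"),
    Gene("geneE", 5250, 6000, "-"),
]
print(cluster_neighbors(toy_genome, "query"))
```

In this toy layout the large gap before geneD ends the cluster, so only the three tightly packed neighbors are nominated as candidate pathway partners.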
Finding the novel pathway required pulling together several tools that consortium members have developed. Jacobson, Gerlt, and their colleagues used computational methods to dock thousands of potential substrates in the enzyme’s active site. The best-scoring substrates were amino acids, especially proline analogs and N-capped amino acid derivatives. Those results allowed the researchers to predict that the enzyme was an amino acid racemase/epimerase that uses substrates with N-substitution.
The researchers didn’t have structures for other enzymes in the group, but they were able to build homology models for them. They did similar docking studies with those models and were able to figure out the complete pathway.
But this required considerable chemical intuition. “There still is a chemical brain that has to be involved at this point,” Gerlt says. “If you’ve thought about pathways and metabolism enough, intuition allows you to get some idea of the steps.”
The problem with needing chemical intuition is that “cleverness is not scalable,” Jacobson says. “What we need to do next is turn that type of approach into an algorithm that can be applied in a more semiautomated way on a larger scale.”
The approach the team used to discover that pathway is the way of the future, Shoichet says. “The idea is that you would dock against a much larger number of target enzymes, hopefully guided by sequence information,” he says. “In bacteria, sequence can really focus you on proteins that might be in the pathway, but only some of which will be. The idea is that you dock against all of them and then look for similarities or chemical links among the high-scoring ligands.”
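The approach Shoichet describes (dock against every candidate enzyme, then look for links among the high-scoring ligands) can be sketched roughly as follows. Everything here is an assumption made for illustration: the enzyme and ligand names, the scores, the convention that lower scores are better, and the use of "same ligand in several top lists" as a crude stand-in for chemical similarity. A real analysis would compare ligand structures with cheminformatics tools.

```python
from collections import defaultdict


def top_hits(scores, k):
    """The k best-scoring ligands for one enzyme (lower score = better, assumed)."""
    return sorted(scores, key=lambda ligand: scores[ligand])[:k]


def shared_top_ligands(docking_scores, k=2):
    """Ligands that appear among the top-k hits of two or more enzymes."""
    enzymes_by_ligand = defaultdict(list)
    for enzyme, scores in docking_scores.items():
        for ligand in top_hits(scores, k):
            enzymes_by_ligand[ligand].append(enzyme)
    return {
        ligand: sorted(enzymes)
        for ligand, enzymes in enzymes_by_ligand.items()
        if len(enzymes) >= 2
    }


# Hypothetical docking scores for three enzymes from one gene cluster.
toy_scores = {
    "enzyme_1": {"ligand_a": -9.1, "ligand_b": -8.7, "ligand_c": -4.0},
    "enzyme_2": {"ligand_b": -9.5, "ligand_d": -7.2, "ligand_a": -3.1},
    "enzyme_3": {"ligand_d": -8.8, "ligand_e": -8.0, "ligand_b": -2.0},
}
print(shared_top_ligands(toy_scores))
```

In the toy data, enzyme_1 and enzyme_2 share one top ligand and enzyme_2 and enzyme_3 share another, which is the kind of chain that might hint at consecutive steps in a pathway.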
But pathways are still a “stretch objective,” Shoichet says. “It’s worked once. Let’s see if we can make it systematic.”
EFI is not only developing such tools for use by its own participants but is also starting to make them available to a broader audience. The consortium has already held workshops to introduce researchers to the tools. Some of those workshops have been targeted at researchers working with particular enzyme superfamilies. Others have cast a wider net.
Researchers don’t have to attend workshops to gain access, however. Some of the tools are now available on the Internet for anybody who wants to use them. In particular, Metabolite Docker, a tool developed by Shoichet’s team, is now accessible at metabolite.docking.org.
The members of EFI are proud of the work they’ve accomplished so far. “This is not something a single investigator could dream of doing,” Gerlt says. “We’re doing science now that I couldn’t have imagined we ever would have done.”