Analytical Chemistry

Bypassing Protein Databases

Software takes mass spec data straight to the genome to identify novel genes and proteins

December 20, 2004 | A version of this story appeared in Volume 82, Issue 51

In genome fingerprint scanning, mass spectrometry data are used to find the region of the genome that is most likely to have encoded unknown proteins.

Most mass spectrometry-based proteomics studies rely on matching peptide masses from protein digests with those found in protein databases. That works great if the genome and the proteins being studied are already known, but what about genomes that aren't yet annotated?

A new bioinformatics tool called genome fingerprint scanning (GFS) extends proteomics to unannotated genomes. In GFS, protein mass spectral data from unknown proteins can be searched against genome sequences to find the region of the genome most likely to encode a particular combination of peptides. The tool was developed by Morgan C. Giddings, an assistant professor in the departments of microbiology and immunology and of biomedical engineering at the University of North Carolina, Chapel Hill.

"When I was a postdoctoral researcher at the University of Utah, we were trying to study proteins that were encoded by nonstandard mechanisms in cases where the ribosome did unusual things when translating the RNA into a protein, producing these weird proteins that you wouldn't expect just looking at the sequence of the gene," Giddings says. "I saw this problem that these unusual proteins were not represented in the databases of known genes and proteins."

Giddings decided to bypass the protein databases and go straight to the genome instead. In GFS, an entire sequenced genome is translated into its potential proteome without regard to any form of genetic annotation, like open reading frames. That proteome is "digested" according to the cleavage rules for different protein-digesting enzymes such as trypsin. Mass spectral data can be compared with these computationally generated fragments to find the region of the genome most likely to have encoded the proteins [Proc. Natl. Acad. Sci. USA, 100, 20 (2003)].
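The steps described above (translate every reading frame, digest in silico, compare masses) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GFS implementation: the function names, the fixed-size genome windows, the mass tolerance, and the simple "count the matched peaks" score are all hypothetical choices made for this example, and missed cleavages and modifications are ignored. The residue masses are standard monoisotopic values, and the trypsin rule used is the common one (cut after K or R unless followed by P).

```python
import re

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Monoisotopic residue masses (Da); a peptide adds one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    # Unknown codons (for example, ones containing N) become "X".
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_peptides(genome, min_len=5):
    """Yield (forward-strand start, strand, peptide, mass) for all six frames."""
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            residue = 0  # index of the current residue within this frame
            for segment in translate(seq[frame:]).split("*"):
                pos = residue
                # Trypsin-like rule: cut after K or R, but not before P.
                for pep in re.split(r"(?<=[KR])(?!P)", segment):
                    if len(pep) >= min_len and "X" not in pep:
                        start = frame + 3 * pos
                        if strand == "-":
                            start = len(genome) - start - 3 * len(pep)
                        yield start, strand, pep, peptide_mass(pep)
                    pos += len(pep)
                residue += len(segment) + 1  # +1 skips the stop codon


def best_window(genome, observed_masses, window=300, tol=0.02):
    """Return (window start, number of observed peaks matched) or None."""
    matched = {}
    for start, _strand, _pep, mass in theoretical_peptides(genome):
        for peak in observed_masses:
            if abs(mass - peak) <= tol:
                matched.setdefault(start // window, set()).add(peak)
    if not matched:
        return None
    idx = max(matched, key=lambda w: len(matched[w]))
    return idx * window, len(matched[idx])
```

Calling `best_window(genome, peaks)` with a genome string and a list of measured peptide masses returns the start of the window whose predicted peptides account for the most peaks. A real tool would need a statistically grounded score rather than a raw count, which is the kind of improvement the article describes later.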

The Web-based version of Giddings' program, called GFSWeb, is publicly available [J. Proteome Res., 3, 1292 (2004)]. She recently received a grant of more than $1 million from the National Institutes of Health to make the program more widely available.

The software works well for simple genomes, even ones that have not yet been annotated, according to Giddings. For example, one of Giddings' collaborators was doing proteomic studies on the bacterium Francisella tularensis. The genome had been partially sequenced, Giddings says, but it was still in 38 different pieces, and there were few protein database entries. Using GFS, her collaborator was able to figure out where those proteins were encoded.

More complex genomes still present hurdles. The larger size of such genomes is an issue, but the major problem is splicing, especially alternative splicing.

Many genes are stitched together from noncontiguous sequences known as exons. Sometimes the same gene contains multiple exons that can be joined in different combinations, a process known as alternative splicing. "If you look at any particular stretch of the genome, there are so many possible exons that could be used if you just look at the canonical splice sites," Giddings says.
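To give a rough sense of why canonical splice sites alone leave so many possibilities, the sketch below lists every candidate intron in a DNA string. It assumes the usual GT...AG convention (introns typically begin with GT and end with AG), which the article does not spell out; the function name and the minimum intron length are hypothetical choices for illustration only.

```python
import re


def candidate_introns(dna, min_len=20):
    """Return (start, end) pairs where dna[start:end] begins with GT and ends with AG."""
    donors = [m.start() for m in re.finditer(r"(?=GT)", dna)]
    acceptors = [m.start() + 2 for m in re.finditer(r"(?=AG)", dna)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]
```

Because every donor can pair with every downstream acceptor, the number of candidates grows roughly with the square of the sequence length, before even considering how several introns might be combined in one transcript.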

The challenge is in determining which of those possible splice sites are real and including them in the database. She and her coworkers are developing a probabilistic model to predict the most likely arrangement of exons for a given protein, using the mass spec-analyzed proteins as a guide.

Giddings is working to improve the "scoring" of peptide mass fingerprints. "We're bringing some bioinformatics algorithms that have been used in other areas and applying them to this protein identification problem," she says. "We have great hopes that they will increase both the accuracy and precision [of protein identification]."

ONE WEAKNESS of the publicly available website is its inability to use tandem mass spectral data, which are generally accepted as providing more authoritative protein identifications. The program itself can use tandem mass spectrometric data, but patent restrictions prevent Giddings from making that capability publicly available. "We're still working on implementing other patent-free methods for taking advantage of the tandem data," she says.

Through the work of collaborators, GFS has been applied to a number of problems. For example, Chris Upton at the University of Victoria, British Columbia, is using GFS to study viruses, including pox viruses. "It's very easy to sequence the pox viruses, but to correctly annotate them is quite a challenge," Giddings says. Upton and coworkers are "using GFS to bypass that challenge. They just sequence a pox virus, and then they can immediately start doing proteomics and figure out where the genes are, just based on protein data."

Giddings hopes to combine GFS with another program she has developed for identifying posttranslational modifications, called PROCLAME [Anal. Chem., 76, 276 (2004)], for "top-down" proteomics, which uses intact proteins rather than protein digests.

"There's so much extra information about proteins that you can get if you start with the intact protein. It's information that's lost if you take a shotgun approach," she says. "Once all the proteins are digested into bits and pieces, the information about how they fit together and what went where is lost."


