CORRECTION: This story was updated on May 24, 2012. The research team estimated solvation energies for 56,000 sequences coding for tRNA, not 5,600, as originally stated.
As scientists sequence more and more genomes, they need efficient tools to parse the mountains of data. For example, to uncover the genetic causes of disease, scientists often need to determine whether a DNA sequence codes for a protein. Using basic solvation properties of DNA, researchers have now developed a simple and accurate method for predicting whether a stretch of DNA codes for messenger RNA (mRNA), which is translated into a protein, or transfer RNA (tRNA), which helps with protein synthesis (J. Am. Chem. Soc., DOI: 10.1021/ja3020956).
Genomes are a patchwork of biological information with sequences representing genes, regulatory elements, so-called junk DNA, and other exotic nucleic acid species. Scientists typically use existing sequence data to build predictive models that help them find the various genetic elements in a genome. But this bioinformatics approach can lead to mistakes, says B. Jayaram of the Indian Institute of Technology Delhi. “The error rates are pretty high,” he says, with most models producing false positives 50 to 60% of the time.
In a previous study, Jayaram successfully differentiated between coding and non-coding parts of genes based on chemical properties of the DNA sequence (PLoS One, DOI: 10.1371/journal.pone.0012433). As a next step, he wanted to develop a way to tell whether a sequence codes for mRNA or tRNA.
Jayaram and his colleague Garima Khandelwal used data produced by the Ascona B-DNA Consortium. This international group of scientists used a computer model to simulate the behavior of every possible four-nucleotide DNA sequence in a cell-like environment. The simulations estimated each sequence’s solvation energy, which describes its affinity for water; less favorable water-DNA interactions translate into higher solvation energy.
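To give a sense of scale, "every possible four-nucleotide DNA sequence" amounts to 4^4 = 256 strings over the bases A, C, G, and T. The short Python sketch below enumerates them and shows how a longer sequence could be broken into overlapping four-base windows. The function names are illustrative and do not come from the study, and the overlapping-window scheme is an assumption about how such a table might be applied.

```python
from itertools import product

BASES = "ACGT"


def all_tetranucleotides():
    """Return every possible four-base DNA sequence (4**4 = 256 strings)."""
    return ["".join(bases) for bases in product(BASES, repeat=4)]


def tetranucleotide_windows(sequence):
    """Split a DNA sequence into overlapping four-base windows.

    Assumption: windows overlap and advance one base at a time.
    Sequences shorter than four bases yield an empty list.
    """
    seq = sequence.upper()
    return [seq[i:i + 4] for i in range(len(seq) - 3)]
```

For example, `tetranucleotide_windows("ACGTAC")` yields the three windows ACGT, CGTA, and GTAC.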
Using these data, the researchers estimated the solvation energies of more than 2 million DNA sequences coding for mRNA and about 56,000 sequences coding for tRNA. They found that tRNA sequences had greater solvation energies than mRNA ones: Over 99% of the genes with solvation energies greater than 1.2 kcal/mol coded for tRNA, while over 99% of genes below this threshold coded for mRNA.
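A minimal Python sketch of how such a threshold rule might work appears below. It is not the researchers' code. It assumes that a sequence's energy is the average of per-tetranucleotide values over overlapping four-base windows, which the article does not specify, and the lookup table passed in is hypothetical. Only the 1.2 kcal/mol cutoff comes from the reported results.

```python
THRESHOLD_KCAL_PER_MOL = 1.2  # cutoff reported for separating tRNA from mRNA genes


def estimate_solvation_energy(sequence, energy_table):
    """Average the table's energies over overlapping four-base windows.

    Assumption: averaging is an illustrative choice, not the published method.
    energy_table maps four-base strings to energies in kcal/mol.
    """
    seq = sequence.upper()
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    if not windows:
        raise ValueError("sequence must contain at least four bases")
    return sum(energy_table[window] for window in windows) / len(windows)


def classify_gene(sequence, energy_table, threshold=THRESHOLD_KCAL_PER_MOL):
    """Label a sequence 'tRNA' if its estimated energy exceeds the threshold, else 'mRNA'."""
    energy = estimate_solvation_energy(sequence, energy_table)
    return "tRNA" if energy > threshold else "mRNA"
```

A caller would supply a table built from the simulation data. With a made-up table such as `{"ACGT": 2.0, "CGTA": 1.0, "GTAC": 1.5}`, whose values are placeholders rather than measured energies, `classify_gene("ACGTAC", table)` returns `"tRNA"` because the average of 1.5 exceeds the cutoff.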
Jayaram says the trend makes sense because tRNA forms a rigid three-dimensional structure with a core hidden from water, while mRNA is relatively flexible and, as a result, more exposed to water.
David Beveridge of Wesleyan University calls these results “quite interesting” and thinks Jayaram’s method will become another tool for analyzing genomes. Jayaram next wants to identify genes for other types of nucleic acids, such as microRNAs.