

Synthetic Biology

Machine learning correctly picks engineered-DNA lab of origin nearly half the time

Random selection would guess correct lab 0.12% of the time

by Celia Arnaud
August 14, 2018 | A version of this story appeared in Volume 96, Issue 33

Synthetic biology offers the opportunity to program cells to make new products. As the field advances, so does the possibility that someone will engineer organisms for nefarious purposes. In such circumstances, authorities would need a way to figure out where a given piece of engineered DNA came from.


Figure: Frequency with which the algorithm correctly identifies the lab of origin for engineered DNA. Source: Nat. Commun. 2018, DOI: 10.1038/s41467-018-05378-z.

The strings of DNA designed in one lab are difficult to distinguish from those designed in another. But Christopher A. Voigt and Alec A. K. Nielsen of the Massachusetts Institute of Technology show that it can be done.

Voigt and Nielsen used off-the-shelf machine-learning tools to figure out DNA signatures that distinguish the sequences from various labs (Nat. Commun. 2018, DOI: 10.1038/s41467-018-05378-z). They trained the algorithm on DNA sequences from Addgene, a nonprofit plasmid repository. After excluding labs with too few plasmid samples to pick out a distinct fingerprint, the researchers ended up with a data set of 36,764 plasmids from 827 labs. They trained the algorithm with 31,802 of the sequences and reserved the rest for testing the algorithm.
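The size of the held-out test set is not stated directly, but it follows from the figures above. The short Python sketch below only reproduces that arithmetic; the variable names are illustrative, and the per-lab average is a derived number, not one reported in the story.

```python
# Figures quoted in the article.
TOTAL_PLASMIDS = 36_764
TRAINING_PLASMIDS = 31_802
NUM_LABS = 827

# Sequences reserved for testing are whatever was not used for training.
test_plasmids = TOTAL_PLASMIDS - TRAINING_PLASMIDS
test_fraction = test_plasmids / TOTAL_PLASMIDS

# Derived here for context only: average number of plasmids per lab.
mean_per_lab = TOTAL_PLASMIDS / NUM_LABS

print(f"Held-out test sequences: {test_plasmids}")
print(f"Share of data held out:  {test_fraction:.1%}")
print(f"Mean plasmids per lab:   {mean_per_lab:.1f}")
```

By this count, roughly one sequence in seven was kept out of training.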

“The neural network tries to figure out the shortest piece of DNA that will separate a particular lab,” Voigt says. “The differences might be design choices or mutations.”

The algorithm correctly identified the lab of origin in the test sets 48% of the time. And 70% of the time, the correct lab was among the top 10 labs on the algorithm's ranked list. Random selection would lead to the right answer only 0.12% of the time.
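The 0.12% random baseline is simply one chance in 827 labs, and the "top 10" figure is what machine-learning practitioners usually call top-k accuracy. The Python sketch below illustrates both under stated assumptions: the function `top_k_accuracy`, the lab names, and the toy rankings are hypothetical and are not taken from the researchers' code or data.

```python
NUM_LABS = 827  # number of labs in the data set, per the article

# Chance that a uniformly random guess names the correct lab,
# and chance that the correct lab falls in a random list of 10.
random_top1 = 1 / NUM_LABS
random_top10 = 10 / NUM_LABS


def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true lab appears in the first k ranked guesses."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy example with made-up labs: each inner list is a model's ranking
# for one plasmid, from most likely lab to least likely.
toy_rankings = [
    ["labA", "labB", "labC"],
    ["labB", "labC", "labA"],
    ["labC", "labA", "labB"],
    ["labA", "labC", "labB"],
]
toy_truth = ["labA", "labC", "labB", "labB"]

print(f"Random top-1 baseline:  {random_top1:.2%}")
print(f"Random top-10 baseline: {random_top10:.2%}")
print(f"Toy top-1 accuracy: {top_k_accuracy(toy_rankings, toy_truth, 1):.2f}")
print(f"Toy top-2 accuracy: {top_k_accuracy(toy_rankings, toy_truth, 2):.2f}")
```

Against these baselines, the reported 48% top-1 accuracy is roughly 400 times better than chance.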

The main limitation is the size of the training set, Voigt says. He and Nielsen used the Addgene data set because the plasmid origins are well documented. The accuracy of the predictions will improve with larger training sets, Voigt says.

“The ability to determine the origin of a biological design has important security and engineering consequences,” says Douglas M. Densmore, a synthetic biology expert at Boston University. “This work is still in its infancy, but it can be built upon and likely will be increasingly important in the future.” The work will be applicable to the bioweapons concerns raised in a report on synthetic biology released by the National Academies of Sciences, Engineering & Medicine in June, Densmore says.

