How genomic epidemiology is tracking the spread of COVID-19 locally and globally

The novel coronavirus is challenging genome sequencing technology and data processing like never before

by Claire Jarvis, special to C&EN
April 23, 2020 | A version of this story appeared in Volume 98, Issue 17


In early April, a paper published in the journal Proceedings of the National Academy of Sciences of the United States of America (PNAS) raised the eyebrows of a number of genomic epidemiologists. It suggested that there were three distinct variants of the novel coronavirus, SARS-CoV-2, spreading in different regions of the world—one predominantly in Australia and North America, one in China, and one in Europe. It also suggested that the variant circulating in North America and Australia was older than the one that appeared in Wuhan, China, at the end of December (Proc. Natl. Acad. Sci. USA 2020, DOI: 10.1073/pnas.2004999117).

If this were true, the outbreak of COVID-19, the disease caused by the virus, would have started outside Wuhan, which contradicts scientific consensus.

This image shows two tweets from the University of Edinburgh’s Andrew Rambaut explaining why a newly published paper is problematic. The first message says: “There are many things that are terribly wrong about this paper. Both in its content, findings and route to publication.” The second tweet explains that the researchers erroneously used a root sequence from a bat in their analysis.
Credit: Twitter
Molecular evolution expert and University of Edinburgh professor Andrew Rambaut took to Twitter on April 9 to point out problems with a recently published PNAS paper.

Some genomic epidemiologists, who track the genomes of pathogens to understand how diseases spread, thought that although the data the PNAS paper used were sound, the paper’s analytical approach led to a false conclusion. PNAS says it has received letters about the paper and is currently reviewing them.

The incident illustrates the challenges that the COVID-19 pandemic has been presenting to these scientists: technology for sequencing, tracking, and mapping genes has never been stronger nor the results shared so publicly, but the field of genomic epidemiology has also never been deployed on such a large scale before nor against such a fast-moving foe.

Many countries have responded to the pandemic by investing in genome sequencing. In March, the UK allocated £20 million ($25 million) to launch the COVID-19 Genomics UK Consortium, an initiative to increase the country’s capacity to sequence SARS-CoV-2 genomes by creating new high-throughput regional centers. Other countries are using existing capacities: in Iceland, for example, the biopharmaceutical firm deCODE Genetics has sequenced hundreds of SARS-CoV-2 genomes from local patients.

Several methods exist for scientists to sequence an RNA virus like SARS-CoV-2. One common technique begins with collecting viral RNA and breaking it into small fragments. Scientists then transcribe the sequences into DNA and amplify them with polymerase chain reaction. Fluorescent markers in the reaction mixture help reveal the identity of the sequences being generated. Once scientists know the sequences of the fragments, they can computationally stitch the whole genome back together with the help of a reference sequence from a closely related virus.

Some of this work is conducted at large, specialized facilities, but because of advances in technology, genome sequencing can also be carried out in smaller labs. Hospitals and universities around the world are collecting samples from people with COVID-19. Some in-house sequencing devices are the size of a smartphone or suitcase, and many can report a genome sequence within 7–8 h.

Francois Balloux, a computational biologist at University College London who studied the H1N1 influenza virus during the 2009 outbreak, says, “Technology has improved drastically, so now we have sequencing methods that are so much faster and reliable than what we had 10 years ago.” He remembers working nearly a decade ago with only 11 partial H1N1 sequences rather than a whole genome and trying to draw conclusions about the threat of the virus.

Although the technology has improved, the idea behind analyzing viral genomes to follow their spread remains the same. Like most viruses, SARS-CoV-2 accumulates mutations in its genome over time as a result of replication errors. Essentially, the virus’s protein machinery makes mistakes, copying the wrong nucleotide—A, U, C, or G—into its genome every once in a while during replication. Genomic epidemiologists count the number of mutations in samples and use the information to track how a particular virus travels around the globe.
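As a rough illustration of that counting step, the Python sketch below compares a sample with a reference and lists the sites where the two differ. The function name and the two 10-letter sequences are invented for this example, and the sketch assumes the sequences are already aligned to the same length; real analyses work on genomes of roughly 30,000 nucleotides and align them first.

```python
def list_mutations(reference: str, sample: str) -> list:
    """Return (position, reference_base, sample_base) for each differing site.

    Assumes both sequences are already aligned to the same length.
    Positions are 1-based, as is usual when reporting genome sites.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical toy sequences, not real SARS-CoV-2 data.
reference = "AUGGCUACGU"
sample = "AUGGCUUCGA"

mutations = list_mutations(reference, sample)
print(len(mutations), mutations)
```

Here the sample differs from the reference at two sites, so it would sit two mutations away from the reference in a tree.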

SARS-CoV-2 isn’t as sloppy a replicator as seasonal influenza, which mutates much more rapidly, but its genome isn’t static either. Scientists observe that new mutations capable of spreading from person to person appear at a rate of 2 per month. So far, “there are maybe 10 lineage-defining changes we’re seeing as it spreads around the world, which is not many,” says Claire Marie Filone, a virologist at Johns Hopkins University Applied Physics Laboratory.

Phylogenetic trees 101
This explainer box shows a set of five genome sequences at the right, labeled A through E. Each sequence has a few colored dots on it indicating mutations that have accumulated over time. At the left is a branched phylogenetic tree that depicts the sequence data and shows how the genomes are related to one another.
Credit: Adapted from Nextstrain
Genomic epidemiologists use phylogenetic trees and other tools to track the spread of pathogens like coronaviruses. They collect genome sequences (A-E, right) from various samples and characterize mutations (colored circles) the virus has accumulated over time. Then, they create a tree (left) based on their data. The left-most branch in the tree shown here represents a mutation (blue) common to all sequences and, thus, an older strain of a virus. Moving to the right, branches are added to depict new mutations (and strains). A vertical line like the one at the upper right indicates that the sequences along that branch (A and B in this example) are identical. Arrows indicate where the sequences fit in the tree.

To visualize how the virus evolves through those mutational changes over time, many of these scientists use a construct called a phylogenetic tree. Trees start with the earliest-known version, or common ancestor, of a virus, and each branch in the tree represents a mutation away from that ancestor, a so-called new strain. At the start of the COVID-19 outbreak, scientists had sequenced only a few genomes, so they had to guess at a common ancestor.

According to Andrew Rambaut, an evolutionary biologist at the University of Edinburgh, one problem with the April PNAS paper is the researchers’ choice of common ancestor. He took to Twitter on April 9 to describe the problems. The paper’s authors, Rambaut wrote, tried to root their SARS-CoV-2 tree using the coronavirus RaTG13, found in bats. Despite being the most similar nonhuman coronavirus on record, RaTG13 is still quite different from SARS-CoV-2—about 1,100 nucleotides differ between them. “So basically the bat is so far away from the SARS-CoV-2 viruses its branch could fit in almost anywhere,” Rambaut tweeted. Instead, many genomic epidemiologists prefer to make their COVID-19 phylogenetic trees starting with the oldest SARS-CoV-2 genome, posted online in early January and published later in Nature (2020, DOI: 10.1038/s41586-020-2008-3). In an email to C&EN, Peter Forster, a researcher at the University of Cambridge and lead author of the PNAS paper, disagreed that the early SARS-CoV-2 sequences were an appropriate root and defended his paper’s conclusions, stating that “in science, competing analyses are vital in order to confirm or reject scientific conclusions.”

During this pandemic, scientists are sharing and disseminating genetic information in databases originally created to track influenza. The World Health Organization launched the Global Initiative on Sharing All Influenza Data (GISAID) in 2008 in response to the H5N1 bird flu outbreak that occurred 3 years earlier. To date, researchers have uploaded over 11,000 SARS-CoV-2 RNA genomes to the database.

The GISAID data are open access, so anyone who registers can extract and analyze the information. A team of virus evolution scientists runs a project called Nextstrain that uses these data to produce phylogenetic trees and maps showing the spread of SARS-CoV-2. Nextstrain is managed by Trevor Bedford at the Fred Hutchinson Cancer Research Center and Richard Neher at the University of Basel. The programs they use to analyze and visualize their data are open source and available for anyone to use.

The volume of data being deposited into GISAID creates its own challenges for scientists. Pavel Skums, a computational biologist at Georgia State University, notes that the algorithms his team used to analyze SARS-CoV-2 genomes in March don’t work today because of how large the GISAID data set has become. “In March, there were around 500 sequences in the database,” Skums says. Now, there are thousands.

Computational scientists, such as Niema Moshiri at the University of California San Diego, are trying to develop more efficient ways to process the deluge of data so that Nextstrain can keep updating its maps and trees in real time. “If you double the number of sequences, it will typically quadruple the computation time,” Moshiri says.
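One way to see why doubling the data can quadruple the work: a method that compares every sequence with every other one performs a number of comparisons that grows with the square of the data set size. The Python sketch below is only an assumption about where the cost comes from, since the article does not say which algorithms are involved, and the function name is invented for illustration.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of distinct sequence pairs among n_sequences genomes."""
    return n_sequences * (n_sequences - 1) // 2


# Roughly the March data set size Skums mentions, and double that size.
march = pairwise_comparisons(500)
doubled = pairwise_comparisons(1000)

print(march, doubled, round(doubled / march, 2))
```

Going from 500 to 1,000 sequences raises the pair count from 124,750 to 499,500, about four times as many.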

Public health officials have used Nextstrain’s data to make decisions about their communities’ safety. For instance, in late February, Nextstrain compared the first and second SARS-CoV-2 genomes collected and sequenced from patients in Washington State. The samples were collected 6 weeks apart. Because both genomes contained a mutation rarely observed in genomes from China, Nextstrain scientists concluded that local transmission of COVID-19 was occurring in the Pacific Northwest. Within days, the governor of Washington State and the mayor of Seattle declared a state of emergency, citing localized person-to-person spread as evidence of heightened public health risk.

This pair of global maps shows the spread of the novel coronavirus during the date range Dec. 3, 2019, to Feb. 3, 2020, and during the date range Feb. 21, 2020, to April 21, 2020.
Credit: Nextstrain, with rendering in Leaflet, Mapbox, OpenStreetMap
The Nextstrain project is tracking SARS-CoV-2, the virus that causes COVID-19, as it travels across the globe. The project uses genome sequences of the virus derived from patient samples to generate its maps. The circles indicate the size of the outbreak over a given date range in a particular country or region, and the colored lines indicate where the virus traveled from.

Genomic epidemiologists draw conclusions from viral RNA mutations as follows: if one virus sampled in Italy has three specific mutations on its genome and another sampled in the US has the same three mutations plus one additional lineage-defining mutation, the scientists can infer the coronavirus was in this instance transmitted from Italy to the US. This type of analysis comes with caveats, though. Unless the origins of all the mutations are accounted for, conclusions become difficult. For instance, if scientists have two genome sequences, one from Person A and one from Person B, and those samples have six mutations differing between them, it could mean the virus made six replication errors and transmitted directly between these two people. But it could also mean that there were a few mutations between Person A and a third person (Person C), and a few more mutations between Person C and Person B, Moshiri says. “Missing samples potentially change the narrative,” he adds.
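The Italy-to-US reasoning above can be sketched as a simple set comparison in Python. The mutation labels and the function name are placeholders invented for this example, not real data, and the sketch deliberately ignores the caveat Moshiri raises: with samples missing, a nested pattern of mutations is consistent with a direction of spread but does not prove a direct link.

```python
def infer_direction(mutations_a: set, mutations_b: set) -> str:
    """Suggest a direction of spread from two samples' mutation sets.

    If one sample carries all of the other's mutations plus extras,
    the sample with fewer mutations is treated as the likely source.
    """
    if mutations_a < mutations_b:  # a is a proper subset of b
        return "a_to_b"
    if mutations_b < mutations_a:  # b is a proper subset of a
        return "b_to_a"
    return "undetermined"


# Hypothetical placeholder labels, not real mutation names.
italy = {"mut1", "mut2", "mut3"}
us = {"mut1", "mut2", "mut3", "mut4"}

print(infer_direction(italy, us))
```

With these inputs the function returns "a_to_b", matching the inference that the virus moved from Italy to the US in this instance.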

Scientists have built a detailed picture of how SARS-CoV-2 traveled the globe on the basis of the data they have. They’ve determined that most countries experienced simultaneous introductions of the virus from different countries. For example, they’ve found that the New York City outbreak was seeded by COVID-19 cases from Europe, while the SARS-CoV-2 genomes sequenced in Washington State point to a Wuhan origin.

Using the rate at which the virus is mutating, the number of mutations seen in local SARS-CoV-2 strains, and sequences from the country of origin, genomic epidemiologists can estimate how long the virus has been circulating in a community. In the case of the New York City outbreak, they’d study the strains circulating there and compare them with the sequences from Europe. Once epidemiologists identify the probable date of introduction, University College London’s Balloux explains, they can estimate how many people in the area are currently infected with the coronavirus, based on its rate of infection.
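A back-of-the-envelope version of that timing estimate divides the number of extra mutations seen locally by the mutation rate. The Python sketch below uses the rate of about two mutations per month cited earlier in this story; the count of three extra mutations and the function name are invented for illustration, and real estimates come from statistical models that report uncertainty rather than from a single division.

```python
def months_since_introduction(extra_mutations: float,
                              mutations_per_month: float = 2.0) -> float:
    """Estimate months of local circulation from accumulated mutations.

    extra_mutations: mutations in local strains beyond those in the
    source-country sequences (a hypothetical input here).
    mutations_per_month: assumed constant rate, 2 per the article.
    """
    if mutations_per_month <= 0:
        raise ValueError("mutation rate must be positive")
    return extra_mutations / mutations_per_month


# Hypothetical case: local strains carry three more mutations than the source.
print(months_since_introduction(3))
```

Under these assumptions, three extra mutations would suggest roughly a month and a half of local circulation.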

The science community is focused on collecting as many genomic data as it can. Skums argues that the next step might be to find links between the genomic sequences of SARS-CoV-2 and its viral properties. This could help scientists understand whether a particular mutation leads to a more severe infection or set of symptoms.

The genomic data hint that we have a lot to learn about how SARS-CoV-2 behaves and spreads, and epidemiologists may find uses for this extraordinary quantity of information that they hadn’t previously considered. “It’s the first time genomic sequencing is being used on such a large scale,” Skums says. “It means the technology is being tested.”

Claire L. Jarvis is a freelance science and medical writer based in Atlanta.


This story was changed on April 24, 2020, to place virologist Claire Marie Filone at the Johns Hopkins University Applied Physics Laboratory rather than just at Johns Hopkins University.


This story was originally published on April 23, 2020, and was revised on April 29, 2020, to include a comment from Peter Forster, for clarity, and to correct Trevor Bedford's affiliation; he is at the Fred Hutchinson Cancer Research Center, which isn't part of the University of Washington.

