Twenty years after the first drafts of a human genome were published (Science 2001, DOI: 10.1126/science.1058040; Nature 2001, DOI: 10.1038/35057062), researchers have filled in missing sections, assembling a full genome sequence of 3.055 billion base pairs. A consortium of labs known as the Telomere-to-Telomere Consortium, or T2T Consortium, published a preprint, ahead of peer review, describing the work (bioRxiv 2021, DOI: 10.1101/2021.05.26.445798).
Adam Phillippy at the National Institutes of Health and Karen Miga at the University of California, Santa Cruz, cochair the T2T Consortium, which has members across the globe. Until now, they say, around 8% of the human genome has been missing from reference genomes. The consortium has been working to fill in the gaps, which are mostly highly repetitive sections that don’t encode proteins but instead regulate biochemical processes through other mechanisms, including transcription into regulatory RNA sequences or their physical conformation.
One reason that human genomes have remained incomplete is technological constraints. Conventional DNA sequencers read relatively short fragments of a few hundred base pairs at a time, and computer programs then put the fragments in order to construct the full sequence. Because short fragments from a repetitive region can look identical, the software cannot always tell where each one belongs. That means certain sequences, particularly stretches of DNA with repeating patterns, are challenging to read.
The consortium used two competing new sequencing technologies that avoid this problem by allowing researchers to sequence much longer stretches of the genome with high accuracy.
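The effect of read length on repetitive DNA can be illustrated with a toy example. The following is a minimal sketch, not the consortium's method: the sequence, the read lengths of 5 and 50 bases, and the function name `ambiguous_read_fraction` are all assumptions made for illustration. It measures what fraction of reads of a given length also occur somewhere else in the sequence and so could not be placed with confidence.

```python
from collections import Counter


def ambiguous_read_fraction(genome: str, read_length: int) -> float:
    """Return the fraction of reads whose sequence occurs at more than one position."""
    # Take every possible read of the given length, one per start position.
    reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]
    counts = Counter(reads)
    # A read is ambiguous if the same sequence appears at another position.
    ambiguous = sum(1 for read in reads if counts[read] > 1)
    return ambiguous / len(reads)


# Hypothetical toy "genome": unique flanking sequence around a tandem repeat.
left_flank = "ACGTTGCA"
repeat = "GATTACA" * 6  # 42 bases of repeating pattern
right_flank = "TTCGGACT"
genome = left_flank + repeat + right_flank

# Short reads fall entirely inside the repeat and match several positions.
short_fraction = ambiguous_read_fraction(genome, 5)

# Reads longer than the repeat always include some unique flanking sequence.
long_fraction = ambiguous_read_fraction(genome, 50)

print(f"ambiguous short reads: {short_fraction:.2f}")
print(f"ambiguous long reads:  {long_fraction:.2f}")
```

In this toy case, most of the short reads match more than one position, while every long read is unique because it reaches past the repeat into a flank. Real assemblers are far more involved, and real repeats can span many thousands of base pairs, but the same principle applies.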
“The entirety of genomics as a field is a constant cycle between pushing the technological envelope and using these technologies in new and exciting ways,” says Paul Flicek, head of genes, genomes, and variation services at EMBL-EBI, who is not part of this work.
Among the areas that the new work has filled in are the centromeres. These points where arms of chromosomes intersect, making the chromosomes’ characteristic X or Y shape, have been hard to sequence because of their repetitive sequences. Sarah McClelland at Cancer Research UK’s Barts Centre says these areas are essential for correct cell division. Not having data about the sequences in these features was “a major roadblock for further research” into how centromeres are involved in cell division, including when cell division goes wrong in cancer, she says, “so this new, gap-free reference has suddenly opened up this field.” By making the data available on public repositories and preprint servers, McClelland adds, the consortium “has allowed us, and many others, to start making use of the new reference immediately.”
In this work, the cells sequenced come from a special type of cell line that is homozygous, meaning that the 46 chromosomes form 23 identical pairs, which makes assembling the genome easier. It also means that a Y chromosome was not included in the latest work.