Filling in the gaps of the human genome

An international team of scientists called the Telomere-to-Telomere (T2T) Consortium says they have assembled the first truly complete human genome.

You are probably scratching your head saying ‘but wait, wasn’t the human genome already sequenced?’ Well, yes and no. The Human Genome Project (HGP) produced the first nearly complete human genome sequence in 2003, resolving all but about 15 percent of the genome. The unresolved portion consisted mostly of repetitive regions that were inaccessible using the technology at the time.

Extensive work since the completion of the HGP has brought the missing parts down to about eight percent. Now the T2T team claims to have sequenced a complete, gapless human genome, called the T2T-CHM13 reference assembly. Keep reading to learn why there were gaps in the original sequence and how the T2T team was able to fill them.

The Current Reference Genome 

The original sequence of the human genome was created with the most advanced sequencing technology available at the time. While very accurate, that technology struggled to decipher stretches containing repeating sequences. Large regions of repeat-rich DNA are widely scattered throughout our genome, which left gaps in the initial sequence. Over the years, advances in sequencing technology have filled many of the gaps, but about 100 regions were still incomplete.

Rick Myers, PhD, President, Science Director and M. A. Loya Chair in Genomics at the HudsonAlpha Institute for Biotechnology, led the Stanford Human Genome Center team that was involved in sequencing the 320 million base pairs of human chromosomes 5, 16 and 19. When asked to reflect on the progress made since the 2003 genome draft, he had this to say:  

“The end-to-end human genome is a real tour de force. By combining new advanced technology with the information the genetics community has learned since the first human genome, the consortium was able to uncover the sequence of these difficult regions of the genome. While there is still much to learn about these newly sequenced repetitive regions, the application of the technology to the creation of more complete human genomes is invaluable.”

Overcoming technological limitations 

The T2T team claims that two changes in their sequencing protocols enabled them to create a gapless genome. For starters, they used DNA from a rare type of uterine growth called a hydatidiform mole. Unlike most human cells, which contain one set of chromosomes inherited from mom and another set from dad, the hydatidiform mole contains two identical copies of the paternally inherited set of chromosomes. That level of identity makes scientists’ sequencing computations much easier because they don’t have to tease apart the genetic differences between maternal and paternal copies. A small drawback to this approach is the lack of a Y chromosome to analyze. 

A new type of sequencing technology called long-read sequencing also helped the T2T Consortium build the gapless genome. Long-read technology sequences longer stretches of DNA, meaning scientists do not have to cut the DNA into as many pieces. 

For example, Pacific Biosciences’ SMRT platform uses lasers to scan long stretches of DNA isolated from cells. It can accurately read up to 20,000 base pairs at a time. The larger pieces are much easier to piece together because they are more likely to contain sequences that overlap. In addition, if a stretch of DNA contains repetitive sequences that are found in many other places in the genome, it can still be assigned to the correct region because the same long read also contains unique segments.
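The idea that a long read can be placed correctly because it carries unique sequence alongside the repeat can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tiny made-up genome, the read sequences, and the helper name `find_all` are hypothetical, and exact string matching stands in for the far more sophisticated alignment and assembly software that real projects use.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: the same repeat occurs twice,
# and each copy is flanked by different, unique sequence.
REPEAT = "ACGACGACGACG"
genome = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CATTG"

# A short read that falls entirely inside the repeat.
short_read = "ACGACG"
# A long read that spans one repeat copy plus its unique flanks.
long_read = "GCA" + REPEAT + "GGA"

short_hits = find_all(genome, short_read)
long_hits = find_all(genome, long_read)

print(f"short read matches {len(short_hits)} places: {short_hits}")
print(f"long read matches {len(long_hits)} place(s): {long_hits}")
```

In this toy case the short read matches several positions across both repeat copies, so its true origin is ambiguous, while the long read matches only one position because its flanking bases occur nowhere else.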

To learn more about the different types of long-read DNA sequencing, see HudsonAlpha’s Everyday DNA blog, which describes them in detail.

What can we learn from the gapless genome?

Together, the less complex DNA and the long-read sequences made assembly easier and allowed the T2T team to construct the gapless genome. The T2T-CHM13 reference assembly represents the single largest addition to the human genome in the past 20 years, adding more than 3,000 predicted new genes, including about 150 new protein-coding genes.

It also presents five entirely new chromosome arms: the short arms of chromosomes 13, 14, 15, 21 and 22, which were previously difficult to sequence because they are primarily composed of highly repetitive DNA. As scientists begin studying these new portions of the genome, we will learn more about their function, importance, and potential implications in various diseases.

It’s important to note that this study has yet to be peer-reviewed and the technology is still being optimized to analyze genomes that contain both maternal and paternal contributions. It’s also unclear how the newly deciphered regions will be integrated into existing databases. Even so, this work clearly demonstrates the importance and usefulness of long-read sequencing technology. The T2T Consortium has now teamed up with the Human Pangenome Reference Consortium to use long-read methods to sequence hundreds more gapless human genomes. This could benefit human health research by presenting a more complete picture of the diversity of human genomes.

