Telomere-to-telomere assembly of a complete human X chromosome

Karen opened her plenary talk by stating that we are "entering into a new era" in genetics and genomics, one that demands complete, high-quality assemblies. The current human reference genome (GRCh38) is the most accurate and complete vertebrate genome to date. However, it is incomplete - there are still 368 unresolved issues and 102 gaps. Karen said that it "really drives it home when we look at chromosome 21", which has ~30 Mb of assembled sequence but ~20 Mb of missing sequence - unexplored regions that could be linked to disease. These problem regions are associated with segmental duplications, gene families, satellite arrays, centromeres, and rDNAs, as well as uncharacterised sequence variation in the human population. The major challenge is generating complete assemblies across repetitive regions that can span hundreds of kilobases, or even megabases at centromeres. Karen asked: can high-coverage, ultra-long read sequencing be used to resolve these regions and complete assemblies of the human genome? She stated that this question motivated the establishment of the Telomere-to-Telomere (T2T) consortium, an open, community-based effort to generate the first complete assembly of a human genome, of which she is a member. The aim of this consortium is to "shift the standards in genomics" to the highest quality.

Karen and her colleagues have sequenced CHM13hTERT, a karyotypically stable haploid cell line, using long-read nanopore sequencing. From the start of May 2018 to January 2019, 94 MinION/GridION flow cells were used for CHM13 sequencing, obtaining 50X depth of coverage from ultra-long nanopore reads. The maximum mapped read length was 1.04 Mb. These ultra-long nanopore reads were used for contig building, with long-read datasets from other sequencing platforms used for polishing and structural validation. The assembler Canu was used for sequence assembly; the final assembly was 2.94 Gbp with an NG50 contig size of 75 Mbp. This exceeds the continuity of GRCh38, which has an NG50 contig size of 56 Mbp. Moreover, a subset of chromosome assemblies remained broken only at the centromere.
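To make the NG50 figures above concrete, here is a minimal Python sketch of how the statistic is usually defined: contigs are sorted from longest to shortest, and NG50 is the length of the contig at which the running total first reaches half of the estimated genome size. The function name `ng50` and the toy contig lengths are illustrative assumptions, not values or code from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    Contigs are visited from longest to shortest; NG50 is the length of the
    contig at which the running total reaches half of the genome size.
    Returns 0 if the contigs cover less than half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy example with made-up contig lengths (not CHM13 data).
toy_contigs = [40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=120))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so assemblies of different total lengths can be compared against the same yardstick.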

Karen stated that the next step was to use this hybrid de novo assembly to assemble a complete human X chromosome. The X chromosome seemed a "natural place to invest time", as it is associated with many Mendelian diseases. The biggest challenge in assembling this chromosome was the centromere, which required ultra-long nanopore reads spanning 100 kbp repeat-rich regions. However, she stated that an assembly is only a hypothesis, and the manually finished assembly needed to be validated using other methods such as droplet digital PCR, restriction enzyme pulsed-field gels, and structural validation techniques.

Karen demonstrated how difficult it is to assemble centromeric regions, especially the centromere of the X chromosome where, for example, only 37 structural variants are present to guide assembly, and the majority of these SVs are very small. She stated that the next challenge is determining how to polish the assembly and bring it to high accuracy. How can we create new strategies to deal with tandem repeats? Karen described how they created a polishing strategy using unique k-mers. This first involves identifying all unique, single-copy k-mers throughout the genome. These k-mers are then used as a scaffold for anchoring high-confidence long-read alignments; only those long reads that align with unique k-mers are retained. Karen described how the spacing of single-copy k-mers can be irregular in repeat-dense regions, such as centromeres. For example, the longest distance observed between two k-mers on the X chromosome was 53 kbp, which means that reads of ≥53 kbp are required to span this section of the chromosome.
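The idea of anchoring on single-copy k-mers can be illustrated with a toy Python sketch. This is not the consortium's pipeline: the function names are hypothetical, the sequence is invented, reverse complements are ignored, and exact k-mer matching on the read stands in for a real alignment step. It only shows the three ideas described above: finding k-mers that occur exactly once, measuring the largest gap between them, and keeping only reads that contain at least one of them.

```python
from collections import Counter


def single_copy_positions(sequence, k):
    """Return start positions of k-mers that occur exactly once in sequence."""
    n_kmers = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(n_kmers))
    return [i for i in range(n_kmers) if counts[sequence[i:i + k]] == 1]


def max_gap(positions):
    """Return the largest distance between consecutive sorted positions."""
    return max((b - a for a, b in zip(positions, positions[1:])), default=0)


def read_is_anchored(read, unique_kmers, k):
    """Return True if the read contains at least one single-copy k-mer."""
    return any(read[i:i + k] in unique_kmers for i in range(len(read) - k + 1))


# Toy "chromosome": unique flanks around a low-complexity repeat (made up).
k = 3
toy_sequence = "ACG" + "T" * 8 + "GCA"
positions = single_copy_positions(toy_sequence, k)
unique_kmers = {toy_sequence[i:i + k] for i in positions}

print("single-copy k-mer positions:", positions)
print("largest gap between them:", max_gap(positions))

# Keep only reads that touch a single-copy k-mer.
toy_reads = ["TTTTT", "TTTTG"]
retained = [r for r in toy_reads if read_is_anchored(r, unique_kmers, k)]
print("retained reads:", retained)
```

In this toy case the repeat leaves a stretch with no single-copy k-mers, and a read that falls entirely inside it is discarded because nothing ties it to a unique location. The largest gap plays the same role as the 53 kbp figure quoted for the X chromosome: a read must be long enough to reach across it.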

Two rounds of nanopolish were used for k-mer-based polishing with nanopore reads, along with long-read polishing from other sequencing platforms, and HiFi alignments were then used to evaluate the success of polishing. Karen concluded this section by stating that the finished T2T X chromosome had a structurally validated assembly, from telomere to telomere, including a problematic 2.8 Mb tandem repeat at the X centromere. The novel k-mer-based polishing strategy they used improved the assembly quality of large repeat-rich regions. She stated that this demonstration is "really bringing the point home that we are achieving high quality and high continuity".

In the final section of her talk, Karen asked "how do we start to finish the human genome?" Focusing on chromosomes 6 and 8, at the D6Z1 and D8Z2 centromeric sites from satellite array predicted regions, Karen explained how we can see the difference in sequence diversity compared to the X chromosome centromere with its 2.8 Mb tandem repeat. The centromeres of these autosomes show far greater sequence diversity, which makes their assembly significantly easier - there is "a lot more information to guide mapping, polishing and assembly". For example, the maximum spacing between unique k-mers is only 3 kb. Using the k-mer polishing approach greatly improved the assembly.

Karen concluded by stating that the goal of the next two years is to obtain a complete human genome. The remaining challenges include acrocentric regions, large segmental duplications, and classical human satellites, and we need to start thinking about automating repeat assembly. "We keep setting the bar higher and higher" for the genetics community in terms of assembly quality and completeness. Looking to 2020 and beyond, we need to start thinking about human populations, as opposed to a single human genome. This will require increasingly high-throughput long-read sequencing on the PromethION, and they are now starting to "ramp up the process". It will also require cloud-based assembly and processing. Karen announced that the SHASTA cloud-based assembler is about to be released by Santa Cruz; it has achieved assembly of 2.8 Gbp of sequence data in only 5.6 hours.

"So I guess that my take-home message is...keep calm because everything is awesome!"

Please note that all the CHM13 data is openly available at github.com/nanopore-wgs-consortium/chm13.

Authors: Karen Miga