Generating the ‘most complete human methylomes to date’


Until recently, lingering gaps in the human reference genome have limited the ability to perform comprehensive analyses of epigenetic regulation. It was impossible to fully characterise chromatin structure and gene expression, even with the latest GRCh38 build. As these traits are already known to be linked to diseases such as schizophrenia, lung cancer, and muscular dystrophy, understanding them has been a priority for many researchers.

However, even for finished regions of the reference genome, the use of traditional short-read sequencing technology to detect methylation has been problematic. Samples must be prepared specially for bisulfite sequencing and run separately for methylation analysis, rather than generating both DNA sequence and methylation data in a single run. In addition, short reads cannot span long-range epigenetic interactions or differentiate between haplotypes. Finally, accurately mapping short-read methylation data for highly repetitive regions has been a significant hurdle.

The 2022 release of the first complete human genome from the Telomere-to-Telomere (T2T) consortium, including 225 Mb of novel sequence data based on ultra-long nanopore reads to close the 8% of the genome previously missing, has given scientists an entirely new foundation for understanding epigenetic regulation [1]. By combining that resource with additional nanopore sequencing data, researchers have produced the most comprehensive evaluation yet of methylation across the human genome [2].

In this study, Gershman et al. from Johns Hopkins University, USA, and other institutions presented a detailed view of the human epigenome and their novel findings based on long nanopore reads. They chose to use data generated by nanopore sequencing devices from Oxford Nanopore Technologies because traditional short-read data was a poor fit for the highly repetitive sequence added to the new reference genome. In addition, since methylation status and nucleic acid identity can be gleaned from the same data, nanopore sequencing reduces the time and cost needed for an epigenetic analysis workflow.

‘emerging long-read technologies offer sequence lengths capable of spanning infrequent unique markers and provide a direct measurement of the base sequence and epigenetic state on single molecules’ [2]

Nanopore-based methylation profiling of the T2T reference offered a clear improvement over both the GRCh38 reference and short-read epigenetic analysis of the T2T genome, ‘revealing epigenetic patterning of genomic regions that were previously intractable’. The researchers found that ‘methylation profiles [in the T2T genome] using long-read nanopore data demonstrate an increase in the genome coverage (32.8 M compared with 29.17 M in GRCh38, omitting chromosome Y) and surveyed more CpGs (10%, 3.18 M) compared with short-read whole-genome bisulfite sequencing’.

In addition, the long nanopore reads clearly addressed the mappability challenge seen with short-read data. Short reads struggled noticeably with important elements such as segmental duplications and satellite DNA, with 165 Mb of sequence enriched for these elements considered highly unmappable using short reads. In contrast, nanopore reads longer than 50 kb enabled accurate methylation detection in these challenging regions.

To deepen their investigation, the team sequenced both the CHM13 cell line used to create the T2T reference and HG002, a lymphoblastoid cell line. CHM13 captures an early developmental stage and exhibited hypomethylation, while HG002 represents a later developmental stage, with most of its genome methylated. The authors reported that ‘long-read methylomes of distinctive developmental time points surveyed >99% of CpGs, establishing the CHM13 and HG002 methylomes as the most complete human methylomes to date’.

The team then focused on particular elements of interest, including allele-specific methylation in X chromosome inactivation. This approach was enabled by the long nanopore reads, which, when paired with the T2T reference, ‘confer the ability to explore methylation patterns of single molecules’. They identified genetic regions as hypermethylated or hypomethylated, using that information to distinguish between active and inactive X chromosomes. Looking at the CHM13 cell line, they found several genes that were improperly regulated by the X inactivation process and posited that this was likely due to a failure of that process in this haploid cell line.
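To make the idea of single-molecule, allele-specific methylation more concrete, the following is a minimal sketch rather than the authors' pipeline. It assumes that each nanopore read has already been assigned to a haplotype and that every CpG on it carries a methylated or unmethylated call. The toy data, the function names `region_methylation` and `classify`, and the 0.3 and 0.7 thresholds are illustrative assumptions, not values from the study.

```python
from collections import defaultdict

# Hypothetical per-read calls: (read ID, haplotype, CpG position, methylated?).
# Real data would come from a methylation caller plus read phasing.
calls = [
    ("r1", "hap1", 100, True),
    ("r1", "hap1", 120, True),
    ("r2", "hap1", 140, True),
    ("r2", "hap1", 160, False),
    ("r3", "hap2", 100, False),
    ("r3", "hap2", 120, False),
    ("r4", "hap2", 140, False),
    ("r4", "hap2", 160, True),
    ("r4", "hap2", 250, True),  # falls outside the region queried below
]


def region_methylation(calls, start, end):
    """Return {haplotype: fraction of methylated calls} for CpGs in [start, end)."""
    counts = defaultdict(lambda: [0, 0])  # haplotype -> [methylated, total]
    for _read_id, haplotype, position, is_methylated in calls:
        if start <= position < end:
            counts[haplotype][0] += int(is_methylated)
            counts[haplotype][1] += 1
    return {hap: meth / total for hap, (meth, total) in counts.items()}


def classify(fraction, low=0.3, high=0.7):
    """Label a methylation fraction; the thresholds are illustrative assumptions."""
    if fraction <= low:
        return "hypomethylated"
    if fraction >= high:
        return "hypermethylated"
    return "intermediate"


fractions = region_methylation(calls, 100, 200)
labels = {hap: classify(frac) for hap, frac in fractions.items()}
print(fractions, labels)
```

In this toy region one haplotype is labelled hypermethylated and the other hypomethylated, the kind of contrast that distinguishes the two copies of a locus. Which label corresponds to the active or the inactive X depends on the genomic context and is not something this sketch decides.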

Separately, the team also used this approach for a high-resolution analysis of the DXZ4 satellite array, which is associated with X chromosome inactivation and lacks heterozygous polymorphisms. The researchers suggested that ‘because satellite arrays are known to be hypervariable in the human population and linked to several human diseases, these results highlight the importance of long-read single-molecule epigenetic studies for understanding disease pathology’.

In one final investigation, the team explored the methylation profile of centromeres, the last regions of the genome to be finished by the T2T consortium because of their highly repetitive sequence. They identified a hypomethylated centromeric unit across chromosomes, linking it to ‘kinetochore positioning and epigenetic regulation of chromosome segregation’ and analysing it in human samples representing diverse ancestries. They reported that ‘when combined with findings in other organisms, e.g., maize and medaka, this suggests that the [centromeric dip region] is a conserved, functionally important feature of complex centromeres across vertebrate and plant lineages’.
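One simple way to picture a centromeric dip is as the stretch of an otherwise heavily methylated array where average methylation drops sharply. The sketch below illustrates that idea with a sliding window over binned methylation fractions. It is an assumption-laden toy, not the method used in the study: the function name `find_dip`, the window size, and the example profile are all hypothetical.

```python
def find_dip(methylation, window):
    """Return (start index, mean) of the window with the lowest mean methylation.

    `methylation` is a list of per-bin methylation fractions along an array.
    """
    if window <= 0 or window > len(methylation):
        raise ValueError("window must be between 1 and len(methylation)")

    # Running sum so that each step only adds one bin and drops one bin.
    total = sum(methylation[:window])
    best_start, best_total = 0, total
    for start in range(1, len(methylation) - window + 1):
        total += methylation[start + window - 1] - methylation[start - 1]
        if total < best_total:
            best_start, best_total = start, total
    return best_start, best_total / window


# Hypothetical profile: highly methylated flanks around a hypomethylated core.
profile = [0.9, 0.85, 0.9, 0.2, 0.1, 0.15, 0.9, 0.95]
dip_start, dip_mean = find_dip(profile, window=3)
print(dip_start, round(dip_mean, 2))
```

A fixed window is the simplest possible choice here. A real analysis would need to handle dips of unknown width and noisy per-bin estimates.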

Concluding, the researchers noted that ‘this analysis provides a framework with which to investigate the most elusive regions of the human genome, granting insights into epigenetic regulation’.

1. Nurk, S. et al. Science 376(6588), 44-53 (2022). DOI: https://doi.org/10.1126/science.abj6987

2. Gershman, A. et al. Science 376(6588), eabj5089 (2022). DOI: https://doi.org/10.1126/science.abj5089