The long and short of it

Kicking off the second plenary of NCM 2019 was Sissel Juul, Director of Genomic Applications at Oxford Nanopore. Sissel began by introducing the applications team: in Oxford, United Kingdom, the group focus on sample technology, whilst the Genomic Applications team is split across Oxford, New York and San Francisco, and showcases the technology through high-impact biological applications. Last but not least, the Applications Support team travel all over to provide in-depth support for customers.

Sissel highlighted the main vision of Oxford Nanopore: to enable the sequencing of anything, by anyone, anywhere. Focusing on the "anything" part of that goal, she noted that the long-read capacity of nanopore sequencing means that discussion of the technology generally features just that - long reads. However, Sissel said, "really, there's no reason why you'd only sequence long reads with nanopores". Displaying how typical fragment lengths vary by sample type, from very short degraded DNA to very long mammalian chromosome DNA, Sissel explained how many applications involve samples with short read lengths. Whilst the majority of reads sequenced on nanopore platforms are ~1.5 kb-150 kb, in this talk Sissel decided to focus on the other end of that scale. As nanopores can sequence read after read, one after another, she noted, sequencing short fragments does not result in loss of throughput compared with sequencing long reads.

In her first example of "the short of it", Sissel introduced the work of the Applications team using "longish short reads" to call variants in a cystic fibrosis panel. Cystic fibrosis, a recessive genetic condition affecting around one in 3,000 people, can result from many mutations in the gene encoding the cystic fibrosis transmembrane conductance regulator (CFTR) protein on chromosome 7; ~139 mutations in the gene have been verified as clinically relevant. A genetic test is available to detect CFTR gene mutations, enabling carrier screening, confirmation of clinical diagnosis, prenatal diagnosis and determination of optimal treatment options. This is achieved via panels, which detect the 139 mutations and variants in the space of 80 amplicons; however, with ultra-short reads, problems result from the presence of paralogous pseudogenes, which cause issues when mapping reads. The Applications team demonstrated long-read amplicon sequencing of the same mutations: by extending the amplicons to ~1,500 bp each, the same variants could be spanned in 24 amplicons, with the longer reads enabling variant phasing and unambiguous identification of the paralogous genes. Sissel displayed an alignment showing identification of expected known SNPs in a sample, plus one unexpected SNP. The additional SNP fell outside of the range of the short-read panel, but was picked up by the longer amplicons used here; this SNP has been associated with cystic fibrosis in the literature. The team aimed for 150x depth of coverage for each target, and found that this could be achieved on a MinION Flow Cell in five minutes; Sissel suggested that, to make the most of the throughput of a flow cell, samples would ideally be sequenced in multiplex or on a Flongle Flow Cell. The panel was then tested on 23 samples, each with known cystic fibrosis-related variants.
The panel was enriched via the 24-amplicon multiplexed PCR, then each sample was uniquely barcoded, enabling all 23 to be sequenced on a single MinION Flow Cell. Of the 36 known mutations across the samples, 34 were correctly called and no false positives were seen, giving a sensitivity of 94.44% and a specificity of 100%; Sissel noted that for the 2 false negatives, the correct mutations were visible in the data, suggesting a bioinformatics issue that the team is now investigating.
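
The reported figures follow from the standard definitions of sensitivity and specificity. The short Python sketch below reproduces them; the function names are illustrative, and the count of true-negative sites is a placeholder, as it was not given in the talk.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of known variants that were called: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of non-variant sites left uncalled: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


# Figures from the talk: 36 known mutations, 34 called, no false positives.
known_mutations = 36
correctly_called = 34
missed = known_mutations - correctly_called

print(f"sensitivity = {sensitivity(correctly_called, missed):.2%}")

# Placeholder: the number of true-negative sites was not reported.
# With zero false positives, any positive count gives 100%.
print(f"specificity = {specificity(1000, 0):.2%}")
```

Run as written, this prints a sensitivity of 94.44% and a specificity of 100.00%.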

Looking at shorter reads still, Sissel then discussed their work in progress with "ultra-short" cell-free DNA (cfDNA): fragments of DNA present in blood plasma. In cancer patients, tumour DNA is also present; this represents 0.1-10% of cfDNA, with more tumour DNA present at more advanced stages of disease. To complicate things further, these fragments are typically only 100-200 bp in length. Sissel described how, in order to detect circulating tumour DNA as early as possible in the disease, it is important to be able to detect variants in the oncogenes of these DNA fragments with high accuracy at low frequency. To enable this, the team developed a protocol combining target enrichment and unique molecular identifiers (UMIs). Clustering all the reads originating from a single initial molecule via UMIs, Sissel explained, performs two functions. Firstly, it enables any PCR bias to be detected and controlled for. Secondly, clustering can be used to polish reads, generating high-accuracy, single-molecule consensus reads. In this method, UMIs were incorporated into the primers used to amplify all the cfDNA in a sample, then biotinylated probes were used to enrich for regions of interest. The on-target captured DNA was then further amplified and sequenced; in analysis, the UMIs were used to cluster and polish reads. As all reads sharing a UMI derive from a single parent strand, each cluster yields one high-accuracy, single-molecule consensus sequence.
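
The two roles of UMI clustering can be illustrated with a minimal Python sketch. It is not the team's pipeline: exact UMI matching and a per-position majority vote stand in for the dedicated clustering and polishing tools a real workflow would use, the reads are assumed to be pre-aligned and of equal length, and all sequences and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_by_umi(reads):
    """Group read sequences by UMI. `reads` is an iterable of (umi, sequence)."""
    clusters = defaultdict(list)
    for umi, sequence in reads:
        clusters[umi].append(sequence)
    return dict(clusters)


def majority_consensus(sequences):
    """Per-position majority vote over equal-length, pre-aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Toy data: two parent molecules, each amplified into three reads.
reads = [
    ("AACGT", "ACGTACGT"),
    ("AACGT", "ACGTACCT"),  # one read carries a sequencing error
    ("AACGT", "ACGTACGT"),
    ("TTGCA", "ACGAACGT"),
    ("TTGCA", "ACGAACGT"),
    ("TTGCA", "ACGAACGT"),
]

clusters = cluster_by_umi(reads)

# Role 1: cluster sizes expose uneven amplification between molecules.
cluster_sizes = {umi: len(seqs) for umi, seqs in clusters.items()}

# Role 2: one consensus per UMI removes errors that are not shared by the cluster.
consensus = {umi: majority_consensus(seqs) for umi, seqs in clusters.items()}

print(cluster_sizes)
print(consensus)
```

In this toy data the single error in the first cluster is outvoted, so each UMI yields one clean consensus sequence, and counting molecules rather than raw reads prevents over-amplified fragments from being counted more than once.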

The workflow was tested on human gDNA from NA14097, with a known SNP in BRCA1, spiked into the well-studied human genome NA12878 to a final variant frequency of 5%. The Roche Avenio ctDNA kit, a pan-cancer assay, was used to target oncogene-specific fragments from 17 cancer-associated genes. DNA was fragmented to ~160 bp to reproduce the typical length of cfDNA. After the addition of UMIs, enrichment and amplification, samples were sequenced and aligned to the NA12878 reference genome using minimap2. UMIs were then clustered via vsearch and filtered, and high-accuracy consensus reads were obtained using Spoa, Racon and Medaka, all publicly available software. A cluster size of 8 was shown to give a read accuracy of 99%, whilst a cluster size of 20 reached 100% read accuracy. SNPs were called from these high-accuracy consensus reads using varscan2, enabling quantification of the low-frequency variant, which was successfully detected at 5% as expected. In future, the team will focus on lowering the frequency of the variants, widening the selection of loci and testing the workflow on cfDNA samples.
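
Quantifying a low-frequency variant from consensus reads comes down to counting alleles per molecule at the site of interest. The Python sketch below shows only that arithmetic; it does not reproduce how varscan2 calls variants, and the counts are hypothetical values chosen to match the 5% spike-in.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return the fraction of consensus molecules carrying each base at one site."""
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Hypothetical site: 200 consensus molecules, 10 of which carry the variant base.
observed = ["G"] * 190 + ["A"] * 10
freqs = allele_frequencies(observed)

print(f"variant allele frequency = {freqs['A']:.1%}")
```

With these made-up counts the variant allele frequency comes out at 5.0%.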

Sissel then handed over to her colleague Eoghan Harrington, Associate Director of Genomic Applications Bioinformatics, who began by introducing the Pore-C project - a collaboration with Marcin Imielinski's lab at Weill Cornell and NYGC. Pore-C is Oxford Nanopore's method of chromatin conformation capture ("3C"), using long-read nanopore sequencing to assess the 3D organisation of DNA that is close in proximity, but not in sequence. The preprint, Nanopore sequencing of DNA concatemers reveals higher-order features of chromatin structure (author et al.), features comparisons to the chromatin conformation capture methods Hi-C and SPRITE, and discusses the use of Pore-C in assessing structural variation and improving assemblies. The end-to-end Pore-C workflow is now available for members of the Nanopore Community: the protocol is online, whilst tools and pipelines for Pore-C data analysis, plus sample data, can be found on GitHub.

Eoghan then focused in more detail on chromatin conformation capture, explaining how traditional methods work by measuring how close in proximity two points in a genome are - "pairwise contacts". The 3D structure of chromatin can then be visualised from pairwise contact maps. These methods are limited to pairwise contacts, Eoghan explained, because they rely on short reads, in which further spatial information is not visible. Displaying the Pore-C protocol, Eoghan briefly explained how it works: cross-linking of chromatin is followed by restriction digestion and ligation of DNA in close proximity. This forms long concatemers of DNA fragments which are close in proximity. This DNA is subsequently purified, prepared and sequenced. In analysis, Pore-C tools then identify the restriction fragments present in the reads. Long nanopore reads enable multi-fragment concatemers to be produced and sequenced in single reads, providing many pairwise contacts in the space of a read. This enables resolution of both direct pairwise contacts, which are adjacent within a read, and virtual pairwise contacts, which are not adjacent but can be associated through their presence within the space of a single long read. Long-range information is encoded in these virtual pairwise contacts, maximising the span of the associations.
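
The distinction between direct and virtual pairwise contacts can be sketched in a few lines of Python. This is an illustration rather than the Pore-C tools themselves: a read is modelled as an ordered list of fragment labels, the labels are invented, and the fragments within a read are assumed to be distinct.

```python
from itertools import combinations


def direct_contacts(fragments):
    """Pairs of fragments that sit next to each other in the read."""
    return list(zip(fragments, fragments[1:]))


def all_contacts(fragments):
    """Every pair of fragments found in the same read."""
    return list(combinations(fragments, 2))


def virtual_contacts(fragments):
    """Pairs that share a read but are not adjacent in it."""
    direct = set(direct_contacts(fragments))
    return [pair for pair in all_contacts(fragments) if pair not in direct]


# One hypothetical concatemer read containing four restriction fragments.
read = ["chr1:100", "chr1:5000", "chr1:90000", "chr3:200"]

print(direct_contacts(read))
print(virtual_contacts(read))
```

A read with n fragments gives n - 1 direct contacts but n(n - 1)/2 contacts in total, so the share of virtual contacts grows quickly with the number of fragments per read: here, four fragments give three direct and three virtual contacts.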

Having discussed "the short of it", Eoghan then moved on to "the long of it", asking: why analyse only pairwise interactions? Whilst these are compatible with available analysis and visualisation tools, breaking down Pore-C data in this way means that the higher-order information is lost. Pore-C reads can go beyond this: Eoghan demonstrated how complex queries could be set up for the multiple contacts in full-length Pore-C reads, enabling identification of subsets of reads with specific higher-order contact patterns, to gain more spatial information and identify long-range interactions. The preprint demonstrates this, showing how reads that span the A/B compartments in a chromosome can be pulled out.
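
A query of this kind might look like the following Python sketch. It assumes each fragment in a read has already been labelled with its compartment ("A" or "B"); the function name, the read names and the three-fragment threshold are all hypothetical and are not taken from the Pore-C tools.

```python
def spans_compartments(fragment_compartments, required=frozenset({"A", "B"}),
                       min_fragments=3):
    """True if a read has enough fragments and touches every required compartment."""
    return (
        len(fragment_compartments) >= min_fragments
        and required <= set(fragment_compartments)
    )


# Hypothetical reads, each listed as the compartment of its fragments in order.
reads = {
    "read1": ["A", "A", "B"],
    "read2": ["A", "A", "A", "A"],      # single compartment only
    "read3": ["A", "B"],                # spans both, but only a pairwise contact
    "read4": ["B", "A", "B", "A", "A"],
}

selected = [name for name, comps in reads.items() if spans_compartments(comps)]
print(selected)
```

Only read1 and read4 pass: both contain three or more fragments and touch both compartments. The same query cannot be asked of the data once reads have been broken down into independent pairs.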

Eoghan described how chromosomes tend to occupy their own spaces within a nucleus, known as chromosome territories. This is also the case for homologous chromosomes, so in Pore-C, reads generally represent a particular allelic phase. Furthermore, where 3C methods traditionally require PCR, Pore-C is PCR-free, enabling preservation and detection of base modifications in sequencing - Eoghan noted that this means that "epiallelic" information can also be assessed, to investigate imprinted loci and cell-specific methylation.

Closing the plenary, Eoghan introduced one more case study in which many of these types of Pore-C information were brought together. He described how known SNPs can be used to phase data: here, a Genome in a Bottle sample with full phasing information was analysed via Pore-C, with the data split into two buckets to build allele-specific contact maps for entire chromosomes. Comparing the chromatin conformation of the two alleles of this chromosome, Eoghan showed how distinct differences in the structure were visible. He then revealed the reason ("if I had longer, I'd build some suspense around this"): the chromosomes in question were the two X chromosomes in this female human genome sample, and the allelic differences seen are the result of X inactivation. Eoghan pointed out how the inactive (Xi) chromosome data shows the two expected superdomains, whilst the "checkerboard" data for the active (Xa) chromosome results from its expected A/B compartment structure. The other hallmark of X inactivation is differential methylation. In Xi, inactivated genes feature hypermethylated promoters, whilst the smaller proportion of genes that "escape" inactivation feature methylation more similar to that on the Xa chromosome; this was the precise pattern of methylation identified in the native DNA Pore-C reads.
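
The bucketing step can be sketched in Python as below, under several assumptions that do not come from the talk: each fragment is represented as a (position, haplotype) pair on a single chromosome, the haplotype is 1, 2 or None for fragments that overlap no phased SNP, reads with conflicting haplotypes are discarded, and contacts are counted between fixed-size bins.

```python
from collections import Counter
from itertools import combinations


def assign_haplotype(fragment_haplotypes):
    """Return the read's haplotype if all phased fragments agree, else None."""
    tags = {h for h in fragment_haplotypes if h is not None}
    return tags.pop() if len(tags) == 1 else None


def allele_contact_maps(reads, bin_size):
    """Count bin-to-bin contacts separately for haplotype 1 and haplotype 2."""
    maps = {1: Counter(), 2: Counter()}
    for fragments in reads:
        hap = assign_haplotype([h for _, h in fragments])
        if hap is None:
            continue  # unphased or conflicting read: goes in neither bucket
        # Fragments falling in the same bin are collapsed, so this simplified
        # version does not count contacts within a bin.
        bins = sorted({pos // bin_size for pos, _ in fragments})
        for a, b in combinations(bins, 2):
            maps[hap][(a, b)] += 1
    return maps


# Hypothetical reads: lists of (position in bp, haplotype or None).
reads = [
    [(100_000, 1), (2_300_000, None), (5_100_000, 1)],
    [(150_000, 2), (5_200_000, 2)],
    [(100_000, 1), (900_000, 2)],        # conflicting haplotypes: dropped
    [(400_000, None), (700_000, None)],  # no phased SNPs: dropped
]

maps = allele_contact_maps(reads, bin_size=1_000_000)
print(dict(maps[1]))
print(dict(maps[2]))
```

Because a Pore-C read generally represents a single allelic phase, one phased fragment is enough to place every contact in that read, including those from fragments that carry no SNP, into the correct bucket.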

Visit the "Posters" section of the Resource Centre to read more about the studies featured in this plenary: https://nanoporetech.com/resource-centre

Read the Pore-C preprint: https://nanoporetech.com/resource-centre/nanopore-sequencing-dna-concatemers-reveals-higher-order-features-chromatin

Authors: Sissel Juul, Eoghan Harrington