Accurately characterise bacterial genomes with nanopore-only data: a benchmarking study
To characterise microbial genomes and their variants, high-accuracy sequencing data must be generated, and microbial laboratories have traditionally relied on legacy short-read sequencing technologies to do this [1]. Dr Michael Hall (Peter Doherty Institute for Infection and Immunity, Australia) and his colleagues performed a benchmarking study to evaluate the improved accuracy of Oxford Nanopore Technologies sequencing data. Recent advancements include the new R10.4.1 Flow Cell and an updated basecaller.
This study illustrates the accuracy improvements in nanopore sequencing data and proves it can be used to precisely characterise and call variants in bacterial genomes.
How to call variants in bacterial genomes
Variant calling is the process of mapping reads to a reference genome and identifying where the sample differs from that reference [2]. This process has enabled variants to be called in most of the human genome and is used across many applications, including in microbiology laboratories to predict antimicrobial resistance in bacterial infectious diseases [1,2].
Variant callers use different methods to identify variants and can be split into two categories: traditional and deep-learning-based [2]. Traditional variant callers typically rely on multiple algorithmic methods to identify variants, but no single caller can identify all variant types. Deep-learning-based variant callers generate pileup image representations (visual depictions of aligned sequencing reads stacked on top of one another) to call variants, including variants of high complexity, more efficiently and accurately than traditional callers [3,4].
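To make the idea of a pileup concrete, here is a minimal sketch in Python. It is not how Clair3 or DeepVariant encode their inputs: real deep-learning callers build much richer representations (for example, including base quality and strand). The function names `pileup_counts` and `candidate_sites`, the toy reads, and the 50% threshold are all assumptions made for illustration, and the reads are assumed to align without gaps.

```python
from collections import Counter


def pileup_counts(reads, region_start, region_end):
    """Count the bases observed at each reference position in [region_start, region_end).

    `reads` is a list of (alignment_start, sequence) pairs. This toy model
    assumes gapless alignments (no insertions, deletions, or clipping).
    """
    columns = [Counter() for _ in range(region_end - region_start)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if region_start <= pos < region_end:
                columns[pos - region_start][base] += 1
    return columns


def candidate_sites(reference, columns, min_fraction=0.5):
    """Return (position, ref_base, alt_base, alt_count, depth) for each column
    whose most common base differs from the reference and is supported by
    more than `min_fraction` of the reads (an illustrative threshold)."""
    sites = []
    for pos, column in enumerate(columns):
        depth = sum(column.values())
        if depth == 0:
            continue
        base, count = column.most_common(1)[0]
        if base != reference[pos] and count / depth > min_fraction:
            sites.append((pos, reference[pos], base, count, depth))
    return sites


# Hypothetical toy data: three reads stacked over a six-base reference.
reference = "ACGTAC"
reads = [(0, "ACGTAC"), (1, "CGAAC"), (2, "GAAC")]

columns = pileup_counts(reads, 0, len(reference))
candidates = candidate_sites(reference, columns)
print(candidates)
```

In this toy example, two of the three reads covering position 3 carry an A where the reference has a T, so that position is reported as a candidate SNP. A traditional caller would apply statistical rules to such counts, whereas a deep-learning caller passes an encoded version of the stacked reads to a trained network.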
However, modern deep-learning-based variant callers are often trained on human rather than bacterial genomes [1]. Therefore, Michael and the team assessed both types of variant callers to determine how accurately nanopore sequencing can detect single nucleotide polymorphisms (SNPs) and insertion/deletion (indel) variants in bacterial genomes. Additionally, previous bacterial variant-calling benchmarking studies have typically focused on short-read data alone, highlighting the need for benchmarking using other technologies [1,5].
Short-read sequencing technologies have historically been the primary method to call variants in bacterial genomes due to their relatively high basecalling accuracy [1,6]. However, their limitations include difficulty reconstructing genome assemblies in complex and repetitive regions [6]. Oxford Nanopore provides advantages, including reads of unrestricted length, portability, and real-time data streaming, to enable complex bacterial genomes to be accurately sequenced in remote locations whilst rapidly generating results.
Benchmarking accuracy study setup
Nanopore sequencing of 14 Gram-positive and Gram-negative bacterial species was performed on MinION and GridION devices using R10.4.1 Flow Cells, and the team generated both simplex and duplex reads (duplex reads are generated when both DNA strands of a single molecule are sequenced back-to-back). The reads were basecalled using the three nanopore basecalling models: fast, high accuracy (HAC), and super accurate (SUP). Simplex reads were basecalled with all three models, and duplex reads with the HAC and SUP models. Seven variant callers for nanopore data were assessed and, as a performance comparison, each bacterial sample was also sequenced using a short-read sequencing method, with variants called using Snippy. All variant callers were assessed against variant truth sets for each sample.
Oxford Nanopore sequencing reads reach Q32
The SUP model with duplex reads was found to generate the highest median read accuracy of 99.93% (Q32) out of all conditions and produced the highest F1 scores (the harmonic mean of precision and recall, used here to summarise variant-calling accuracy) across all variant callers. Additionally, the deep-learning methods Clair3 [7] and DeepVariant [8] generated the highest F1 scores for SNPs and indels for both SUP and HAC models.
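The link between a read accuracy of 99.93% and Q32 follows from the standard Phred scale, Q = -10 × log10(1 - accuracy). The short Python sketch below shows the conversion in both directions; the function names are illustrative and are not taken from the study or from any nanopore software.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a read accuracy (as a fraction between 0 and 1) to a Phred Q score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q_score):
    """Convert a Phred Q score back to the corresponding accuracy fraction."""
    return 1.0 - 10.0 ** (-q_score / 10.0)


q = accuracy_to_phred(0.9993)
print(f"99.93% accuracy is Q{q:.1f}, which rounds to Q{round(q)}")
print(f"Q30 corresponds to {phred_to_accuracy(30):.2%} accuracy")
```

An accuracy of 99.93% works out at roughly Q31.5, which rounds to the Q32 quoted above. Because the scale is logarithmic, each additional 10 Q points corresponds to a ten-fold reduction in the error rate.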
Overall, the deep-learning methods significantly outperformed the traditional variant callers regardless of the basecalling model used and exceeded the short-read sequencing comparison, with the SUP model generating significantly higher F1 scores. These results illustrate that Oxford Nanopore ‘data can provide better variant calls than [short-read sequencing]’.
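As a rough illustration of how an F1 score is derived when calls are assessed against a truth set, the Python sketch below classifies calls as true positives, false positives, or false negatives and then computes precision, recall, and F1. It is a simplification that assumes variants can be compared by exact match on (contig, position, reference allele, alternate allele); real benchmarking typically normalises variant representation before comparison. The variants shown are made up for the example and are not data from the study.

```python
def classify_calls(truth, called):
    """Return (TP, FP, FN) counts by exact matching of variant tuples."""
    truth, called = set(truth), set(called)
    true_positives = len(truth & called)    # in the truth set and called
    false_positives = len(called - truth)   # called but not in the truth set
    false_negatives = len(truth - called)   # in the truth set but missed
    return true_positives, false_positives, false_negatives


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical variants: (contig, position, reference allele, alternate allele).
truth_set = {
    ("contig1", 10, "A", "G"),
    ("contig1", 25, "C", "T"),
    ("contig1", 40, "G", "GA"),
    ("contig1", 58, "T", "C"),
}
called_set = {
    ("contig1", 10, "A", "G"),
    ("contig1", 25, "C", "T"),
    ("contig1", 40, "G", "GA"),
    ("contig1", 71, "AT", "A"),  # not in the truth set: a false positive
}

tp, fp, fn = classify_calls(truth_set, called_set)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 combines precision and recall, a caller is penalised both for reporting variants that are not real (false positives) and for missing variants that are (false negatives).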
Changing preconceptions about nanopore accuracy
To confirm these results and change people’s preconceived ideas about nanopore sequencing accuracy, Michael and his colleagues investigated the reason for the difference between the technologies: read length. They found that the short-read sequencing method had a lower F1 score because short reads cannot be aligned as accurately in repetitive and variant-dense regions, leading to more false negative calls than with nanopore data (Figure 1).
Figure 1. Comparison of short reads with nanopore reads in variant-dense and repetitive regions. A) The distribution of true positive (TP), false positive (FP), and false negative (FN) variant calls from short-read data illustrates a bimodal distribution of FN calls. B) The distribution of TP, FP, and FN variant calls from nanopore data using the variant caller Clair3 shows no bimodal distribution and fewer FN calls. C) F1 score for nanopore data using Clair3 and short-read data in repetitive regions shows short reads do not align as accurately as nanopore data. Figure from Hall et al. [1] and available under Creative Commons license (https://creativecommons.org/licenses/by/4.0/).
Michael and the team also investigated how well indels could be called from nanopore data, as these variants have previously been difficult to detect accurately. They found that, with continued improvements to the technology, false indel calls from nanopore data with the SUP model and Clair3 matched the error rate of the short-read method, showing a significant improvement in nanopore sequencing accuracy.
How good is good enough?
The improvements in sequencing accuracy from Oxford Nanopore were also discussed at the 2024 Nanopore Community Meeting in Boston, USA, by David Yarmosh (ATCC Sequencing and Bioinformatics Center, USA). David presented the benchmarking study the ATCC performed to determine whether the new nanopore basecaller (Dorado) would actually improve basecalling speed and read quality scores compared with the previous basecaller (Guppy).
David found ‘Dorado has a two- to three-fold enhancement over Guppy’, and that ‘switching to Dorado and the [super] accuracy mode in most cases is going to be your best bet’ for performing nanopore sequencing. As a result, the ATCC will be sequencing samples on a GridION using the SUP model on Dorado.
Nanopore sequencing challenges the gold standard
The ATCC study shows the improvements that have been made to nanopore basecalling, and the benchmarking study performed by Michael and his colleagues demonstrates that improved nanopore sequencing accuracy has ‘enabled the creation of near-perfect reference genomes’. It also indicates that deep-learning techniques, despite being trained on human genomes, were still able to achieve the highest F1 scores for bacterial SNP and indel detection. This study confirms that Oxford Nanopore sequencing can effectively and accurately characterise bacterial genomes, and demonstrates improvements in nanopore read accuracy and a reduction in homopolymer-induced false positive indel calls.
Furthermore, this study ‘demonstrate[s] significant advancements in sequencing accuracy with the latest [Oxford Nanopore] technology’. Finally, it highlights the importance of any-length nanopore reads in overcoming the limitations of short-read methods to accurately align reads in complex and repetitive regions, unlocking novel information that is out of reach to short-read sequencing technologies.
To find out more about basecalling improvements with nanopore sequencing, watch David’s talk below and discover more about microbial applications on the microbial genomics page.
1. Hall, M.B. et al. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. eLife 13:RP98300 (2024). DOI: https://doi.org/10.7554/eLife.98300.3
2. Olson, N.D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 24:464–483 (2023). DOI: https://doi.org/10.1038/s41576-023-00590-0
3. Zverinova, S. and Guryev, V. Variant calling: considerations, practices, and developments. Hum. Mutat. 43(8):976–985 (2021). DOI: https://doi.org/10.1002/humu.24311
4. Wang, S. and Ye, K. Deep-learning based representation and recognition for genome variants—from SNVs to structural variants. Natl. Sci. Rev. 11(11):nwae335 (2024). DOI: https://doi.org/10.1093/nsr/nwae335
5. Bush, S.J. Generalizable characteristics of false-positive bacterial variant calls. Microb. Genom. 7(8):000615 (2021). DOI: https://doi.org/10.1099/mgen.0.000615
6. Sanderson, N.D. et al. Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. Microb. Genom. 9(1):000910 (2023). DOI: https://doi.org/10.1099/mgen.0.000910
7. Zheng, Z. and Li, S. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2(12):797–803 (2022). DOI: https://doi.org/10.1038/s43588-022-00387-x
8. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36(10):983–987 (2018). DOI: https://doi.org/10.1038/nbt.4235