Applications of Long-read Sequencing in Infectious Disease Genomics

Next-generation sequencing technology has revolutionised the study of microbial genomics, but most large-scale studies have relied on short-read sequencing, which has limitations. In the analysis of single bacterial genomes, short reads make it difficult to resolve repeats, multiple replicons, and other complex features. This is particularly true of strains isolated from the clinic, which can have very elaborate genome structures. Studies of the microbiome are also limited by short-read methods. For example, studies of microbial communities originally relied on sequencing full-length 16S rRNA genes by the Sanger method, but this has largely been replaced by short-read sequencing of one to three of the nine variable regions of the 16S rRNA gene, leading to a loss of specificity. Short-read sequencing allows researchers to generate millions of highly accurate reads, but restricting analysis to specific variable regions can lead to misclassification, fail to resolve taxonomic lineage to the genus and species level, and miss strain-specific polymorphisms. Such specificity matters when 16S rRNA gene sequencing is used to identify pathogens in clinical metagenomic samples.

We will show how long-read sequencing data can overcome some of these limitations. We studied the assembly of complex genomes of medically relevant Escherichia coli and Klebsiella pneumoniae strains. The E. coli strains carry large numbers (up to 100) of repeats of a beta-lactamase gene, leading to high-level resistance. The K. pneumoniae strains carry a carbapenemase gene (KPC strains) and an elaborate collection of plasmids. We also used long-read data to perform full-length 16S rRNA sequencing, which provides greater specificity than shorter-read methods. Finally, we will look ahead to the use of long-read technologies in clinical diagnostics and complex mammalian genomics.

Authors: Dr Blake Hanson