InfoSheet
Kit 14 sequencing and duplex basecalling V K14_S1019_v1_revL_29Jun2022
FOR RESEARCH USE ONLY
Contents
Introduction to Kit 14 chemistry and duplex data
Basecalling Kit 14 simplex and duplex data
Appendix
1. Introduction to Kit 14 chemistry and duplex data
Introduction to Kit 14 chemistry
Our latest Kit 14 chemistry is an upgrade across our product range to enable highly accurate and tuneable sequencing.
Upgrades include:
- New E8.2 motor protein in our V14 sequencing kits (e.g. SQK-LSK114)
- New R10.4.1 flow cells
- Basecalling upgrades for simplex or duplex data
Kit 14 builds on our previous chemistry upgrades to combine high output with Q20+ accuracy. Duplex reads can further improve accuracy to Q30+.
Note: For highest sequencing accuracies, ensure the super-accurate basecaller (SUP) is selected for your sequencing parameters during run set-up on MinKNOW.
Figure 1. Example data expected to be achieved using the Ligation Sequencing Kit V14 (SQK-LSK114) on a GridION Flow Cell.
Introduction to duplex data
Our Kit 14 chemistry has been developed to reach sequencing accuracies of ~Q30 with duplex data and SUP basecalling. Duplex basecalling is the reading of both strands for even higher accuracy compared to our previous kits.
We offer two options for basecalling Kit 14 data with sequencing accuracies of 99% and above (Q20+):
- Simplex basecalling: The sequencing of a single strand. The template DNA strand passes through the nanopore and is basecalled. This is performed on MinKNOW.
- Duplex basecalling: The sequencing of both strands. The complement strand is read immediately after the template strand and the consensus basecall for both strands leads to a further increase in accuracy. This is performed on either Dorado or Guppy.
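The Q-scores quoted in this document follow the standard Phred scale, where Q = -10 × log10(error probability). The conversion can be sketched as below; the function names are illustrative and not part of any Oxford Nanopore tool.

```python
import math


def q_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to the expected per-base accuracy (0-1)."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_q(accuracy: float) -> float:
    """Convert a per-base accuracy (0-1, strictly below 1) to a Phred score."""
    return -10 * math.log10(1.0 - accuracy)


for q in (20, 30):
    print(f"Q{q} = {q_to_accuracy(q) * 100:.1f}% accuracy")
```

On this scale, Q20 corresponds to 99% accuracy (one error in 100 bases) and Q30 to 99.9% (one error in 1,000 bases).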
Duplex reads do occur at a lower level naturally. However, our Kit 14 chemistry includes upgrades in the motor protein and the pore in the flow cells, which in combination work to increase second strand capture to improve duplex rates. For more information on basecalling duplex data, please see the "Basecalling Kit 14 duplex data" section.
Below, we illustrate duplex sequencing at the pore level (Fig. 2) and we demonstrate what data can be achieved with duplex sequencing (Fig. 3).
Figure 2. The diagram above demonstrates how duplex reads are sequenced on the flow cell: the first strand is the template strand and the second is the complement strand.
Figure 3. This graph illustrates percentage of data in a duplex pair. Duplex data is presented for both template and complement strands, therefore, for this example, duplex rate is 21.4%.
2. Sequencing high output simplex data
Calculating pore occupancy
Simplex data is generated when a single strand of DNA is sequenced. It happens at a much higher volume than duplex sequencing, meaning higher outputs can be generated.
For experiments that require high data output, we recommend loading your flow cell with a high amount of library (35-50 fmol) to ensure a high pore occupancy of >95% is reached to maximise data output. If pore occupancy is still below this level 1 hour after starting the run, data output may be negatively impacted.
Pore occupancy is the percentage of pores that are sequencing at any time out of those available for sequencing. This is displayed in the pore activity plot in MinKNOW during an experiment. During the first few minutes of sequencing, pore occupancy can increase as the library tethers to the membrane. Therefore, we recommend reviewing pore occupancy ~1 hour into a sequencing run.
To calculate pore occupancy:
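Following the definition above, pore occupancy can be sketched as the share of sequencing pores among the pores that are either sequencing or open and waiting for a strand. The function name, and the assumption that "available" means exactly these two states, are illustrative; the counts would be read from the pore activity plot in MinKNOW.

```python
def pore_occupancy_percent(sequencing: int, pore_available: int) -> float:
    """Percentage of available pores that are currently sequencing.

    sequencing     -- pores with a strand passing through them
    pore_available -- open pores waiting for a strand (assumed state name)
    """
    available = sequencing + pore_available
    if available == 0:
        raise ValueError("no pores available for sequencing")
    return 100 * sequencing / available


# Hypothetical counts, for illustration only.
occupancy = pore_occupancy_percent(sequencing=1900, pore_available=100)
print(f"Pore occupancy: {occupancy:.1f}%")
```

With these made-up counts the result is 95%, which sits right at the >95% threshold recommended above for high-output runs.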
3. Basecalling Kit 14 simplex data
Basecalling Kit 14 simplex data
Simplex basecalling is where the single template DNA strand passes through the nanopore and is basecalled with sequencing accuracies of up to 99% when using our super-accurate basecaller (SUP). We recommend basecalling Kit 14 simplex data on MinKNOW.
For optimal performance on our devices, please see the specifications below:
- 64 bit Linux or Windows 10
- Intel i7, i9, Xeon, or better processor
- At least 16 GB of RAM
- An NVIDIA GPU, at least RTX 2070 or better, with at least 16 GB of GPU memory
- At least 1 TB SSD
Make sure you are using the most recent version of MinKNOW.
Below are our basecaller and output recommendations for setting up a Kit 14 sequencing run on MinKNOW to generate POD5 files to use as input for duplex basecalling on Dorado.
IMPORTANT
Ensure you are using the most recent software version of MinKNOW.
When setting up a sequencing run, select the R10.4.1 flow cell (FLO-MIN114 or FLO-PRO114M) and choose the Kit 14 used (e.g. SQK-LSK114) for your library prep.
For detailed instructions on setting up a run in MinKNOW, please see the "Starting a sequencing run" section in the MinKNOW protocol.
The parameters on the run options tab can be left to their default settings.
On the analysis tab, we recommend using the Fast basecaller.
Other basecallers are compatible but the Fast basecaller will be quicker during the sequencing run. The High-accuracy (HAC) and Super-accurate (SUP) basecallers can be used when rebasecalling your data on Dorado.
On the output tab, ensure POD5 is selected as the output format. Read splitting is on by default.
For more information about read splitting, please see the section Introduction to read splitting in the appendix.
FAST5 files can be used as an output file but POD5 files are required for optimal performance on Dorado.
4. Basecalling Kit 14 duplex data
Optimising your library preparation for duplex data
Duplex basecalling is performed on Dorado after simplex basecalling is completed on MinKNOW. Duplex basecalling is where the complement strand is read immediately after the template strand and the consensus basecall for both strands leads to a further increase in accuracy to ~Q30 with our super-accurate basecaller (SUP).
To generate high duplex output, it is important to follow the library preparation protocol to ensure successful ligation of sequencing adapters onto both ends of the DNA strands. It is also important to follow the flow cell loading recommendation of 10-20 fmol for optimal duplex output, so that the flow cell is neither under- nor overloaded:
- Underloading results in lower capture rate and less overall output.
- Overloading adds competition of other strands around the nanopore and reduces the rate of duplex read capture.
For calculating how much DNA to load onto your flow cell, please use the table below for guidance. For more information on DNA input, please see the '[Input DNA/RNA QC](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/input-dna-rna-qc)' document.
Mass | Molarity if fragment length = 2 kb | Molarity if fragment length = 8 kb | Molarity if fragment length = 50 kb |
---|---|---|---|
1 μg | 770 fmol | 193 fmol | 31 fmol |
500 ng | 395 fmol | 96 fmol | 15 fmol |
400 ng | 308 fmol | 77 fmol | 12 fmol |
200 ng | 154 fmol | 39 fmol | 6.2 fmol |
100 ng | 77 fmol | 19 fmol | 3.1 fmol |
75 ng | 61 fmol | 15 fmol | 2.4 fmol |
50 ng | 40 fmol | 10 fmol | 1.6 fmol |
30 ng | 23 fmol | 5.8 fmol | 0.9 fmol |
20 ng | 16 fmol | 4 fmol | 0.6 fmol |
15 ng | 12 fmol | 3 fmol | 0.4 fmol |
10 ng | 7.7 fmol | 1.9 fmol | 0.3 fmol |
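For fragment lengths not listed in the table, the conversion can be approximated from the average mass of a double-stranded DNA base pair. The sketch below assumes 650 g/mol per base pair, which gives values close to the 1 μg row above; the function names are illustrative, and other table entries may differ slightly because of rounding.

```python
# Assumed average molar mass of one double-stranded DNA base pair (g/mol).
AVG_BP_MASS_G_PER_MOL = 650.0


def mass_to_fmol(mass_ng: float, fragment_length_bp: float) -> float:
    """Convert a dsDNA mass in ng to femtomoles for a given fragment length."""
    # ng -> g is 1e-9 and mol -> fmol is 1e15, hence the combined factor 1e6.
    return mass_ng * 1e6 / (fragment_length_bp * AVG_BP_MASS_G_PER_MOL)


def fmol_to_mass(fmol: float, fragment_length_bp: float) -> float:
    """Convert a molar amount in fmol to the dsDNA mass in ng."""
    return fmol * fragment_length_bp * AVG_BP_MASS_G_PER_MOL / 1e6


# 1 ug of 2 kb fragments, compared with the first row of the table.
print(f"{mass_to_fmol(1000, 2_000):.0f} fmol")

# Mass range for the 10-20 fmol duplex loading recommendation at 8 kb.
low, high = fmol_to_mass(10, 8_000), fmol_to_mass(20, 8_000)
print(f"{low:.0f}-{high:.0f} ng")
```

The same two functions can be used for the 35-50 fmol loading recommended for high-output simplex runs.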
We offer duplex basecalling in Dorado (recommended) and Guppy, where the template and complement strands of a read have their basecall data combined to provide a more accurate sequence. For more information, please see our [Dorado page on GitHub](https://github.com/nanoporetech/dorado).
Please note, to generate data of the highest accuracy, ensure data is basecalled using our super-accurate basecaller (SUP).
IMPORTANT
Ensure you are using the most recent software version of MinKNOW for simplex basecalling and either Dorado or Guppy for duplex basecalling.
Overview of performing duplex basecalling:
1. Set up sequencing parameters in MinKNOW to perform simplex basecalling as described in "Basecalling Kit 14 simplex data".
   a. Basecall using the Fast basecaller.
   b. Output .POD5 files.
2. Using Dorado, re-basecall your simplex data with the following command to output simplex and duplex reads using the super-accurate (SUP) basecalling model. Other basecalling models can be used, as listed on the Dorado GitHub page.
$ dorado duplex dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5s/ > duplex.bam
Dorado is a high-performance basecaller which is used to perform duplex basecalling. For further information about Dorado, please see the Dorado GitHub page.
Note: When running Dorado, we recommend stopping other basecalling to maximise the memory available to Dorado and get the best performance. Basecalling in MinKNOW can be stopped via the MinKNOW GUI and restarted once Dorado has finished.
Guppy may also be used to duplex basecall.
Calculating duplex data
Duplex data can be presented as two values:
- Duplex rate as a percentage of bases or reads in a duplex pair
- Duplex output as Gb of data in a duplex pair
To determine the total percentage of duplex reads, we calculate as follows:
Duplex rate (%) = ((template + complement) / total) * 100
Figure 4. This graph illustrates the duplex reads presented as template strand and complement strand, explaining the need for the calculation to combine the reads to calculate the total percentage. In this example, duplex rate would be 21.4%.
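The calculation above can be sketched as a small helper. The function name and the counts are hypothetical; the inputs may be read counts or base counts, as long as all three use the same unit.

```python
def duplex_rate_percent(template: int, complement: int, total: int) -> float:
    """Percentage of the data that belongs to a duplex pair.

    template   -- template-strand reads (or bases) found in a duplex pair
    complement -- complement-strand reads (or bases) found in a duplex pair
    total      -- all reads (or bases) in the run
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * (template + complement) / total


# Hypothetical counts chosen to reproduce a 21.4% duplex rate.
rate = duplex_rate_percent(template=107, complement=107, total=1000)
print(f"Duplex rate: {rate:.1f}%")
```

Both strands are counted because each duplex pair consumes one template read and one complement read from the simplex data.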
5. Introduction to read splitting
Read splitting
Read splitting is enabled by default in MinKNOW to split informatic chimeras, as explained below. With our Kit 14 chemistry, we have observed increased follow-on rates (the rate of the complement strand entering the pore directly after the template strand has passed through) and, with them, a higher rate of concatemerisation compared to the Ligation Sequencing Kit (SQK-LSK110). We classify these reads as 'informatic chimeras' as they are not physically joined during the library preparation process.
With SQK-LSK110, we typically observe <2% concatemerisation, a rate that typically does not affect downstream applications. With SQK-LSK112, for example, we have observed a rate as high as 10%. Both MinKNOW (v21.11 and higher) and stand-alone Guppy (v5.1 and higher) now offer the option of splitting these reads.
As you can see from the following example, the majority of the informatic chimeras (yellow) are removed after splitting for a human (native) and E. coli (PCR) sample.
It is important to note that the read splitting function is not designed to split reads that are incorrectly ligated together during sample preparation. While these make up a small percentage of reads, users should follow the ligation steps in the protocol carefully to reduce the chance of creating them.
6. Assembly
Recommendations for assembly
Flye is recommended as an assembly tool for Kit 14 genome assembly (https://github.com/fenderglass/Flye).
We have observed that assembling haplotypes separately significantly improves genome contiguity, although each assembly only uses half the data.
There are three Flye parameters that we recommend are tuned for good performance with Kit 14 sequence data:
- Configuring the command line parameter `--min-overlap 10000` should deliver a modest improvement in assembly contiguity when using libraries optimised for read length.
- It is recommended that the `--nano-corr` parameter is set (to specify that the sequences are "corrected"). This provides a significant improvement to assembly NG50 compared to when the `--nano-raw` (uncorrected sequence) setting is used. We have observed NG50 increases from 58 Mb to 67 Mb for collapsed assemblies, when assembling both haplotypes at once.
- We typically adjust the `asm_corrected_reads.cfg` file in the `flye/config/bin_cfg/` folder to increase haplotype-specific assembly NG50s and to remove any major misjoins.
  a. enable homopolymer compressed scoring (`hpc_scoring_on = 1`)
  b. increase the `minimizer_window` to 10
  c. decrease the `repeat_graph_ovlp_divergence` to 0.005, which increases haplotype-specific assembly NG50s to 84 Mb/84 Mb and removes all major misjoins
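The two command line recommendations above can be combined into a single Flye invocation. The sketch below only builds and prints the command; the input file, output folder and thread count are placeholders, and the `asm_corrected_reads.cfg` changes still need to be made separately in the Flye installation.

```python
import shlex


def flye_command(reads: str, out_dir: str, threads: int = 16,
                 min_overlap: int = 10000) -> list[str]:
    """Build a Flye command using the corrected-read mode recommended above."""
    return [
        "flye",
        "--nano-corr", reads,                 # treat reads as "corrected"
        "--out-dir", out_dir,
        "--min-overlap", str(min_overlap),    # recommended for long libraries
        "--threads", str(threads),
    ]


# Placeholder file names, for illustration only.
cmd = flye_command("reads.fastq", "flye_out")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run(cmd, check=True)` without shell quoting issues.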