Effective characterization of T-DNA insertion lines through nanopore sequencing

Boas Pucker (University of Cambridge, UK) began his talk by describing the most important model organism in plant sciences: Arabidopsis thaliana. He noted that the classic method of investigating gene function in this model plant is to produce a knock-out line for the gene of interest, compare the phenotype of this mutant line with that of the wild type plant and, from this, infer the function of the gene. Understanding the functions of all the genes within the genome would, therefore, require knock-out versions of every gene. Today, this can be achieved precisely using CRISPR/Cas9, but this technology was not available when such studies began decades ago. Instead, scientists relied on random mutagenesis: one method utilises the soil bacterium Agrobacterium tumefaciens to inject ‘transfer DNA’ (T-DNA) into plant cells, where it is integrated at random into the plant’s genome. This T-DNA can disrupt gene function, and so can be used to knock out genes. This method was used to generate >700,000 T-DNA insertion lines.

Boas introduced GABI-Kat, a project which has contributed >92,000 such lines. He outlined the process by which T-DNA is inserted into the plant and transformed seedlings are grown. To identify the locations of these insertions, DNA is extracted and the flanking sequence tags (FSTs) – representing the regions flanking either side of the T-DNA insertion – are amplified. These are sequenced, and the sequence data aligned to the reference genome to predict the T-DNA insertion locations; the locations are then confirmed via PCR. To investigate a gene’s function, a transformed line containing a T-DNA insertion in the gene of interest is selected for phenotypic comparison to the wild type. However, Boas explained that ‘the reality can be more messy – so this is where our project starts’.

Boas and his team selected lines for which T-DNA confirmation using this classical method failed, then thoroughly characterised these lines using long nanopore sequencing reads. Their protocol, available on protocols.io, begins with the extraction of high molecular weight DNA from plants containing at least one T-DNA insertion. After QC and quantification, the Circulomics Short Read Eliminator (SRE) Kit is used to remove fragments <10 kb or <20 kb; libraries are then prepared and sequenced on a GridION. Boas showed a read length distribution graph for one such sequencing run, which produced a read length N50 of ~41 kb. He estimated that ~20x depth of coverage should be sufficient to characterise T-DNA insertions in a sample.
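
The two run metrics mentioned above, read length N50 and depth of coverage, can be illustrated with a minimal Python sketch. The read lengths are toy values rather than data from the talk, and the genome size of ~135 Mb for A. thaliana is an assumption that the talk summary does not state.

```python
def read_n50(lengths):
    """Return the read length N50: the length L such that reads of
    length >= L together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def mean_depth(lengths, genome_size):
    """Average depth of coverage = total sequenced bases / genome size."""
    return sum(lengths) / genome_size


# Toy read lengths in bp (illustrative only, not data from the talk).
toy_lengths = [60_000, 45_000, 41_000, 30_000, 12_000, 8_000]

# ASSUMPTION: ~135 Mb genome size for A. thaliana; not given in the talk.
ASSUMED_GENOME_SIZE = 135_000_000

n50 = read_n50(toy_lengths)

# Bases that would be needed to reach the ~20x depth suggested in the talk,
# under the assumed genome size.
bases_for_20x = 20 * ASSUMED_GENOME_SIZE
```

Under the assumed genome size, ~20x depth corresponds to roughly 2.7 Gb of sequence per line; a different genome size estimate would change this figure proportionally.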

Moving on to data analysis, Boas noted that one option was to perform de novo genome assembly; however, this would take ~2-3 days on a compute cluster. To reduce the time and compute needed, the team developed loreta (long read-based T-DNA analysis), available on GitHub. In this pipeline, reads containing T-DNA sequences are identified and assembled locally, and the resulting contigs are annotated and visualised. If the insertion is of interest, reads containing the wild type allele can also be treated in the same way. Boas displayed an example visualisation of a T-DNA insertion allele and corresponding aligned reads. This revealed that confirmation of the insertion had initially failed because the T-DNA was fragmented, preventing primer binding on one side and inhibiting PCR.
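
The first step of such a pipeline, picking out reads that contain T-DNA sequence, can be sketched as follows. This is not loreta's implementation, which is not described in this summary; it is a simplified k-mer screen swapped in for illustration, and all function names, the toy sequences and the parameters `k` and `min_hits` are hypothetical. Real nanopore reads carry sequencing errors, so in practice an alignment-based search would be more appropriate.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_tdna_index(tdna_seq, k):
    """Index k-mers from both strands of the T-DNA sequence."""
    return kmers(tdna_seq, k) | kmers(reverse_complement(tdna_seq), k)


def contains_tdna(read_seq, tdna_index, k, min_hits):
    """True if the read shares at least min_hits k-mer positions with the T-DNA."""
    read = read_seq.upper()
    hits = 0
    for i in range(len(read) - k + 1):
        if read[i:i + k] in tdna_index:
            hits += 1
            if hits >= min_hits:
                return True
    return False


def select_tdna_reads(reads, tdna_seq, k=8, min_hits=3):
    """Return the names of reads that appear to contain T-DNA sequence.

    k and min_hits are hypothetical toy values, not tuned parameters.
    """
    index = build_tdna_index(tdna_seq, k)
    return sorted(name for name, seq in reads.items()
                  if contains_tdna(seq, index, k, min_hits))


# Toy sequences (made up for illustration).
toy_tdna = "GATTACAGGCTTAACCGGTTAGC"
insertion_read = "ACACACACAC" + toy_tdna + "TGTGTGTGTG"
toy_reads = {
    "read_fwd": insertion_read,
    "read_rev": reverse_complement(insertion_read),
    "read_wt": "ACACACACACACACACACTGTGTGTGTGTGTG",
}

selected = select_tdna_reads(toy_reads, toy_tdna)
```

Indexing both strands means a read is selected regardless of the orientation in which it was sequenced; the selected subset is what would then be passed on to local assembly.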

Boas then gave examples in which the long nanopore reads enabled characterisation of more complex cases. In one case, three copies of a 5.8 kb T-DNA were inserted in an array, which could be sequenced end-to-end. Boas stressed that the classical FST-based approach is generally intended to characterise only one T-DNA insertion per line, and that previous publications described ~50% of GABI-Kat (GK) lines as displaying a single insertion. Investigating 40 lines, they were able to detect 27 previously confirmed insertions – and identify 11 previously unknown insertion locations. From this, they consider that the prior estimate of ~1.5 insertions per line is likely an underestimate.

Boas described how, as well as identifying T-DNA insertions, this long-read method makes it possible to characterise insertions associated with structural variants (SVs). In one example, an inversion was identified flanking a T-DNA insertion; in another, an insertion was found to connect different chromosomes, causing a fusion – a process which has been reported previously. In one instance, both a large-scale inversion and a chromosome fusion were observed. Although these observations were not a surprise, the team also identified cases where chromosomes were fused without a T-DNA in between, even though a T-DNA was seen connecting the reciprocal fusion. This was identified for both inter- and intra-chromosomal fusions.

Next, the team decided to look for possible duplications by analysing average coverage across the genomes. They identified one case in which increased coverage was seen across an ~800 kb region. Using their local assembly method, they identified multiple T-DNA insertions next to each other in the sequence. The flanking sequences were pointing towards each other, revealing an inversion; they then identified the other end of the inversion. The data suggested that the region was first duplicated, then inverted. Finally, they identified one more T-DNA which did not appear to be related to the structural variant.
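
A coverage-based duplication screen of this kind can be sketched as below. This is a minimal illustration and not the team's actual analysis: the window size, the 1.5x threshold relative to the median, the function names and the toy depth values are all assumptions.

```python
from statistics import median


def window_means(depth, window):
    """Mean depth in consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def elevated_regions(depth, window, ratio=1.5):
    """Return (start, end) intervals whose windowed mean depth is at least
    ratio times the median windowed depth (ratio is a hypothetical threshold)."""
    means = window_means(depth, window)
    baseline = median(means)
    regions = []
    start = None
    for idx, mean in enumerate(means):
        high = baseline > 0 and mean >= ratio * baseline
        if high and start is None:
            start = idx * window          # open a new elevated region
        elif not high and start is not None:
            regions.append((start, idx * window))  # close the region
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Toy per-position depth: ~20x background with a doubled stretch in the middle.
toy_depth = [20] * 40 + [40] * 20 + [20] * 40

duplicated = elevated_regions(toy_depth, window=10)
```

A duplicated region is expected to show roughly double the background depth, so comparing each window against the genome-wide median keeps the screen independent of the absolute sequencing depth of a given line.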

Boas and his team characterised 40 lines in this way, producing a dataset representing ~700x coverage of A. thaliana. This inspired them to produce a new de novo genome assembly. To do this, they first removed all reads containing T-DNAs and SVs, and excluded reads <100 kb. Canu was used for the assembly, and Racon then Medaka for polishing. Finally, the assembly was checked for SVs. The new assembly was useful in one particular case to resolve a T-DNA insertion in a peri-centromeric region which, in the TAIR reference, contained gaps and misassemblies. They hope the assembly will also be helpful in resolving similar cases in future.
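
The read selection step before assembly can be sketched as a simple filter. This is an illustration under stated assumptions: the read identifiers, the dictionary layout and the function name are hypothetical, and how the team flagged reads containing T-DNAs or SVs is not described here.

```python
# Length cutoff from the talk: reads <100 kb were excluded.
MIN_LENGTH = 100_000


def reads_for_assembly(read_lengths, excluded_ids, min_length=MIN_LENGTH):
    """Keep reads that are long enough and were not flagged as
    containing T-DNA sequence or a structural variant."""
    return [read_id for read_id, length in read_lengths.items()
            if length >= min_length and read_id not in excluded_ids]


# Toy input (hypothetical identifiers and lengths in bp).
toy_read_lengths = {
    "read_a": 150_000,   # long, unflagged: kept
    "read_b": 99_999,    # too short: dropped
    "read_c": 210_000,   # long but flagged: dropped
    "read_d": 100_000,   # exactly at the cutoff: kept
}
flagged = {"read_c"}     # reads flagged as containing T-DNA or SVs

kept = reads_for_assembly(toy_read_lengths, flagged)
```

With ~700x total coverage available, discarding shorter and flagged reads still leaves ample data while keeping mutagenesis-derived rearrangements out of the assembly.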

Concluding his presentation, Boas highlighted how this method is useful not only for characterisation of T-DNA insertions, but potentially for other applications too. From DNA extraction to analysis with loreta, the process takes only a few days, and both T-DNA loci and SVs can be resolved. He noted that by re-using MinION Flow Cells or sequencing in multiplex on higher-throughput PromethION Flow Cells, this can be achieved for ~$200-300 per line. He stressed again that there are >700,000 A. thaliana T-DNA insertion lines available and, whilst most are well-characterised, long-read sequencing can now help in cases where confirmation of insertions has previously not been possible.

Authors: Boas Pucker