Interview: Nanopore sequencing in plants: from greenhouse to genome
Date: Tuesday 18th February
Time: 3pm UK time
Speaker: Maximilian Schmidt, RWTH Aachen University
Maximilian Schmidt is currently completing a PhD at RWTH Aachen University, where he is using nanopore technology to improve de novo genome sequencing in plants. We talked to Maximilian about his research, the challenges facing plant genome assembly and how long-read sequencing is impacting plant genomics.
Maximilian will be sharing his insights and advice on de novo genome sequencing in "Nanopore sequencing in plants: from greenhouse to genome" on Tuesday 18th February 2020 at 3pm GMT/10am EST/4pm CET with Technology Networks.
Tell us about your current research interests.
At the moment my research goes in several quite different directions. Firstly, we are working on two Physalis genomes, P. alkekengi and P. ixocarpa, to study their evolution. As they are native to different regions, Europe and Mesoamerica, we assume their last common ancestor grew before the continents separated, so we hope that comparing their genomes will give us a better understanding of how plants evolve and adapt to their environments. Secondly, we are working on Vinca minor, which is used as a medicinal plant; we are looking to elucidate the pathways that produce pharmaceutically interesting compounds. This plant is also highly heterozygous but only diploid, so if we can make a fully phased assembly, we may also be able to look at how gene expression differs between the two haplotypes. Finally, as we are interested in plant epigenetics, we are using machine learning to train our own methylation-aware basecaller on our plant data, which is a completely new field for me.
What first ignited your interest in plant genomics?
When I was working on my Master's thesis studying plant cell walls, I was surprised that there are so few genes whose function we actually know, even in a simple model plant like Arabidopsis. For important crop plants the situation was even worse: there were not many reference genomes, and those that existed were highly fragmented, with several hundred thousand contigs. So I wanted to know why this was the case and whether there was any way to improve things. As I’ve always liked the duality of working in the lab as well as analyzing data, I decided to pursue a PhD position focused on de novo plant genome projects.
What have been the main challenges in your work, and how have you approached them?
In the beginning it was mostly the fact that we could not generate enough data to span the repeats in plant genomes and get contiguous assemblies. We tried different types of mate-pair sequencing, as well as fosmid sequencing, but we never got enough useful links to fully scaffold genomes to a decent quality. A big improvement came with methods like Hi-C or the Chicago method, but even these suffered from highly fragmented contig assemblies. Things have only really changed since we have been able to do long-read sequencing on nanopore platforms. The good thing is that we can easily sequence in-house, which has enabled very fast turnaround times compared with sending samples away for sequencing. Our first de novo genome project took only about 8 weeks in the lab, which is a huge improvement on the 6-12 months that similar projects took before.
But then our challenges shifted completely - I can now generate good data a lot faster than I can analyze it, which is a good problem to have. We were drowning in data, so we had to get larger storage servers and more computational resources. Then we realized that “PCR-grade” DNA is far away from “Nanopore-grade” DNA, as nanopore sequencing needs DNA to be of very high molecular weight, pure and undamaged. Optimizing DNA extraction can be difficult in plants because of the high diversity between species and ecotypes. Secondary metabolites, which stressed and malnourished plants produce to protect themselves and which can interact with nucleic acids, present a further challenge. We realized that it is very important to monitor and optimize growth conditions and to plan cultivation and extraction carefully to ensure you have enough biomass to extract DNA from. We also recognized that longer reads are better for assembly, but at some point the yield per run starts to go down with longer reads - so we are constantly working on improving our library preparation methods in terms of shearing, size selection and clean-ups to get a balance between yield and read length. And even here we see that the optimum differs for each species.
Can you tell us more about how long-read sequencing is changing your field? How has it benefited your research?
It has definitely improved things a lot - we can produce more contiguous genomes in a shorter time and without many “N” characters for undefined bases. Even if you were getting contiguous assemblies from short reads, they always contained a lot of Ns, since all scaffolding methods just gave distance information about the read pair but not the sequence in between, which you do get from long reads. Long-read sequencing also makes things easier - we needed to prep a lot of sequencing libraries with short-read technologies to scaffold genomes. For example, a typical mate-pair library gave us only about 50,000-100,000 unique fragments of 16 kb insert size per sequencing run, and PCR duplicates were a real issue. A few years ago, we were getting about 500,000 reads with an average read length of 15-16 kb from a single MinION run, but today we are getting millions of reads with average read lengths in the 20 kb range and read N50s in the 40 kb range. This has been a real game-changer, since suddenly the average genome assembly contained only a few hundred to a few thousand contigs instead of hundreds of thousands. This allows us to check for structural rearrangements and their impact on biological function, where before we were often limited to looking for SNPs or small insertions or deletions.
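For readers unfamiliar with the read N50 figure quoted above, here is a minimal sketch of how it is commonly computed: the N50 is the read length at which reads of that length or longer contain at least half of all sequenced bases. The function name `read_n50` and the example read lengths are illustrative assumptions, not taken from the interview.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are sorted from longest to shortest; the N50 is the length of
    the read at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("at least one read length is required")
    total_bases = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if 2 * running >= total_bases:
            return length


# Hypothetical read lengths in kb (made up for illustration only).
example_lengths_kb = [10, 20, 30, 40]
mean_kb = sum(example_lengths_kb) / len(example_lengths_kb)
n50_kb = read_n50(example_lengths_kb)
print(f"mean read length: {mean_kb} kb, read N50: {n50_kb} kb")
```

Because long reads contribute more bases each, the N50 usually sits above the average read length, which fits the pattern of a 20 kb average alongside a 40 kb N50.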
What impact could improved de novo sequencing of plant genomes have?
As well as being able to identify structural rearrangements and the impact they have on biological function, we can also look at different haplotypes and how they interact – this has been difficult, as short reads always come with a higher ambiguity. But for me the biggest impact is that it enables us to look at plant species we could not look at before. Around half the plants we are currently working on are not edible crop plants, but they do contain interesting secondary metabolites, have potential as energy plants, or are wild relatives of crop plants. Sequencing these genomes in the past would have required too much time and too many resources, but present-day technologies allow us to identify breeding targets in wild relatives of crop plants like tomato or eggplant, or to investigate complex secondary metabolite pathways that could lead to important pharmaceutical products. It even allows us to probe into why some plants do better on marginal land than close relatives, because now we can make reference genomes for both instead of just looking at transcriptomic data and mapping it to a distant relative for which a genome assembly happened to be available.
What’s next for your research?
Actually, my main goal for this year is quite simple - I finally want to finish my PhD, which I started in late 2014, before long-read sequencing became available. I’ve always had so many interesting projects to work on that I just did not find the time to complete it. After that, I would like to work on getting to a point where nanopore sequencing alone is enough to generate full chromosome-scale assemblies. Currently we still need things like optical mapping, Hi-C, and short-read data for polishing - but I hope that with methods like Pore-C and much longer reads, in combination with better basecalling algorithms and pores that give an unbiased error profile, we can get to a point where a PromethION and a MinION are enough to get complete assemblies.