Brief Description
Mycobacterium tuberculosis workflow for multiplexed Nanopore sequencing data.
Introduction
wf-tb-amr is a workflow for determining the antibiotic resistance of
Mycobacterium tuberculosis targeted sequencing samples. The workflow handles
multiplexed sequencing runs and provides clear and simple reports summarising
the predicted resistance profile of each sample according to genetic mutations
discovered.
Compute Requirements
Recommended requirements:
- CPUs = 16
- Memory = 32GB
Minimum requirements:
- CPUs = 8
- Memory = 16GB
Approximate run time: 5 minutes per sample
ARM processor support: True
Install And Run
These are instructions to install and run the workflow on the command line. You can also access the workflow via the EPI2ME Desktop application.
The workflow uses Nextflow to manage compute and software resources, so Nextflow must be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software.
Both methods are automated out-of-the-box provided either Docker or Singularity is installed.
This is controlled by the -profile parameter as exemplified below.
It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.
The following command can be used to obtain the workflow. This will pull the repository into the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
nextflow run epi2me-labs/wf-tb-amr --help
To update a workflow to the latest version on the command line use the following command:
nextflow pull epi2me-labs/wf-tb-amr
A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-tb-amr/wf-tb-amr-demo.tar.gz
tar -xzvf wf-tb-amr-demo.tar.gz
The workflow can then be run with the downloaded demo data using:
nextflow run epi2me-labs/wf-tb-amr \
--fastq 'wf-tb-amr-demo/fastq' \
--sample_sheet 'wf-tb-amr-demo/sample_sheet.csv' \
-profile standard
For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
Related Protocols
This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.
Find related protocols in the Nanopore community.
Input Example
This workflow accepts either FASTQ or BAM files as input.
The FASTQ or BAM input parameters for this workflow accept the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. The data is assumed to be multiplexed with the names of the sub-directories as barcodes. A sample sheet must be provided with --sample_sheet.
input_directory
├── barcode01
│   ├── reads0.fastq
│   └── reads1.fastq
├── barcode02
│   ├── reads0.fastq
│   ├── reads1.fastq
│   └── reads2.fastq
└── barcode03
    └── reads0.fastq
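The layout above can be checked before launching a run. The following is a minimal sketch, not part of the workflow: the function name `list_barcode_files` and the accepted file suffixes are assumptions made for illustration (the example tree only shows `.fastq`).

```python
from pathlib import Path

# Assumed set of FASTQ suffixes; only ".fastq" appears in the example tree.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def list_barcode_files(input_dir):
    """Map each first-level sub-directory name to its sorted FASTQ file names.

    Only one level of sub-directories is inspected, matching the layout
    described above. Files sitting directly in input_dir are ignored.
    """
    layout = {}
    for sub in sorted(Path(input_dir).iterdir()):
        if not sub.is_dir():
            continue
        layout[sub.name] = sorted(
            f.name
            for f in sub.iterdir()
            if f.is_file() and f.name.endswith(FASTQ_SUFFIXES)
        )
    return layout
```

A sub-directory that maps to an empty list contains no recognised FASTQ files and is worth checking before the run.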
Input Parameters
Input Options
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
fastq | string | FASTQ files to use in the analysis. | This accepts the path to a directory containing one level of sub-directories which in turn contain FASTQ files. | |
bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts the path to a directory containing one level of sub-directories which in turn contain BAM files. | |
analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False |
Sample Options
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
sample_sheet | string | A CSV file used to map barcodes to sample aliases. | The sample sheet is a CSV file with, minimally, columns named barcode and alias. Extra columns are allowed. A type column is required for this workflow and should have one of the following values: test_sample, positive_control, no_template_control. | |
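As an illustration of the sample sheet requirements, the sketch below checks that the barcode, alias and type columns are present and that every type value is one of the three listed. The function name `check_sample_sheet` and the aliases in the example are hypothetical; the workflow performs its own validation, which may be stricter.

```python
import csv
import io

REQUIRED_COLUMNS = {"barcode", "alias", "type"}
ALLOWED_TYPES = {"test_sample", "positive_control", "no_template_control"}


def check_sample_sheet(handle):
    """Return a list of problems found in the sheet; an empty list means none."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {', '.join(sorted(missing))}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["type"] not in ALLOWED_TYPES:
            problems.append(f"line {line_no}: unexpected type {row['type']!r}")
    return problems


# Hypothetical sheet with one sample and two controls.
example = io.StringIO(
    "barcode,alias,type\n"
    "barcode01,sample_a,test_sample\n"
    "barcode02,pos_ctrl,positive_control\n"
    "barcode03,ntc,no_template_control\n"
)
print(check_sample_sheet(example))
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.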
Output Options
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
out_dir | string | Directory for output of all workflow results. | | output |
Reference Options
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
reference | string | NCBI accession for reference genome. | By default the workflow uses NC_000962.3. WARNING: If you change this parameter but don't alter the variant database, Genbank file, and the amplicon BED (all generated for NC_000962.3) to match, then the behaviour of the workflow is unlikely to be as expected. | |
amplicons_bed | string | The location of the amplicons for the assay. | A BED file describing the location of the amplicons used to generate the data to be processed by this workflow, based on NC_000962.3. | |
primers_bed | string | The location of the primers for the assay. | A BED file describing the location of the primers used to generate the data to be processed by this workflow, based on NC_000962.3. | |
rescue_regions_bed | string | The location of the rescue regions for the assay. | A BED file describing the location of variants to rescue. | |
who_json | string | WHO v2 variant database in JSON format. | A list of WHO variants in JSON format, that this assay will genotype. | |
amr_regions_bed | string | The location of the regions for AMR calling in the assay. | A BED file describing the location of the regions used for AMR calling in this workflow, based on NC_000962.3. | |
snpeff_config | string | SnpEff configuration file. | SnpEff configuration file, used to alter start codon possibilities for MTB. | |
expert_rules | string | Expert rules in JSON format. | Structured data describing expert rules to be applied to variants. | |
genbank | string | Genbank file for organism of interest. | Genbank file used for variant annotation, defaults to NC_000962.3 annotations. | |
crispr_regions | string | FASTA file of CRISPR regions from Galru. | FASTA file used to identify sample reads mapping to CRISPR regions. | |
spoligo_db | string | Spoligotype database from SITVIT2. | The SITVIT2 database is used to call the spoligotype of the sample. | |
Advanced Options
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
maf | number | Minimum mutant allele frequency to consider. | By default the workflow will filter any variant which is present at less than 15% allele frequency (0.15). Change this parameter to alter this filtering behaviour. Minimum is set at 1%. | 0.15 |
low_maf | number | Minimum mutant allele frequency to consider for samples with 10-20 amplicons passing. | By default the workflow will filter any variant which is present at less than 40% allele frequency (0.40) if between 10-20 amplicons have sufficient coverage. Change this parameter to alter this filtering behaviour. Minimum is set at 1%. | 0.4 |
min_coverage | integer | The minimum total coverage for a region to be considered in variant calling. | By default the workflow expects a minimum of 20 total reads in a region. This is to ensure that very low coverage regions do not contribute to potentially false positive variant calls. | 20 |
minimum_read_support | integer | The minimum number of reads to consider for a variant call on each strand. | By default the workflow expects a minimum of 5 reads on each strand supporting a variant. This is to ensure that when using the maf, very low coverage regions do not contribute to potentially false positive variant calls. | 5 |
strand_bias | integer | Set a threshold for strand bias filtering. | Strand bias is represented as a Phred scaled p-value from a Fisher's exact test, with a value close to 0 being preferable. | 1000 |
pos_thresh | integer | Set a threshold for presence in positive filtering. | Presence of variant in positive and sample is represented as a Phred scaled p-value from a Fisher's exact test, with a larger value preferable. | 50 |
report_config | string | Report configuration file. | The report can be configured to help with translation. See report_config.eng.json in the data/general directory. Here you can provide a path to your own report configuration file. | |
control_config | string | Control configuration file. | Provides details required to assess the quality of the run and the samples. | |
antimicrobial_details | string | Antimicrobial agent details. | Provides details about the antimicrobials in the assay. | |
spacers | string | FASTA file of spacer regions. | The presence or absence of the 43 spacer regions in this FASTA file is used to classify the sample strain. | |
kraken_db | string | Custom Kraken2 database of hsp65 alleles. | The workflow assigns reads from the hsp65 locus to a database of pre-defined species, and then collapses these to a single taxonomy. | |
complex_species_to_exclude | string | Tab delimited file describing how some TaxIDs fall into species complexes (e.g. MTBC or MAC). | | |
taxid_map | string | File describing the map between TaxID and scientific name. | Used in species identification module to map TaxID to scientific name. | |
screen | boolean | Option to screen data for non-specific amplified reads and remove from analysis. | Screening samples for non-specific amplification is switched on by default. | True |
store_dir | string | Directory to store Kraken database used for screening. | The Kraken database is downloaded to this location when the workflow is run for the first time, and is used for screening non-specific amplified reads. | store_dir |
filter_white_list | string | File containing taxids for the Mycobacteriaceae family - one taxid per line. | ||
ic_ntc | boolean | Option to require internal control to be present in NTC for run quality control. | This is required for v2.0.0 of the workflow, please set to false for older versions of analysis. | True |
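The coverage, read support and allele frequency thresholds in the table above could combine roughly as follows. This is a sketch under stated assumptions, not the workflow's implementation: the function names are hypothetical, the 10-20 amplicon range is treated as inclusive, and samples with fewer than 10 passing amplicons are not modelled because their handling is not described here. The `phred_scale` helper shows the standard Phred conversion that the strand_bias and pos_thresh descriptions refer to.

```python
import math


def allele_frequency_threshold(amplicons_passing, maf=0.15, low_maf=0.4):
    """Pick the minimum allele frequency for a sample.

    Assumption: low_maf applies when 10 to 20 amplicons (inclusive) pass,
    and maf applies above 20.
    """
    if amplicons_passing < 10:
        raise ValueError("fewer than 10 passing amplicons is not modelled in this sketch")
    return low_maf if amplicons_passing <= 20 else maf


def keep_variant(alt_forward, alt_reverse, total_depth, amplicons_passing,
                 maf=0.15, low_maf=0.4, min_coverage=20, minimum_read_support=5):
    """Apply coverage, per-strand support and allele frequency checks in turn."""
    if total_depth < min_coverage:
        return False
    if min(alt_forward, alt_reverse) < minimum_read_support:
        return False
    frequency = (alt_forward + alt_reverse) / total_depth
    return frequency >= allele_frequency_threshold(amplicons_passing, maf, low_maf)


def phred_scale(p_value):
    """Convert a p-value (0 < p <= 1) to the Phred scale: -10 * log10(p)."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -10 * math.log10(p_value)
```

With these defaults, a variant supported by 12 forward and 10 reverse reads at depth 100 (22% allele frequency) is kept when more than 20 amplicons pass, but dropped when only 15 pass, because the stricter 40% threshold then applies.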
Outputs
Output files may be aggregated, including information for all samples, or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
analytical run report | ./wf-tb-amr-report.html | The report covering all samples in the workflow run. | aggregated |
workflow csv results summary | ./wf-tb-amr-report.csv | The CSV summary of the results of the workflow. | aggregated |
workflow results | ./results.json | Structured workflow results for internal/onward use. | aggregated |
workflow checkpoints | ./checkpoints.json | Structured workflow checkpoints for internal/onward use. | aggregated |
alignment | ./{{ alias }}.bam | Aligned reads for the sample in BAM format. | per-sample |
alignment index | ./{{ alias }}.bam.bai | An index file for the alignment in BAI format. | per-sample |
variants | ./{{ alias }}.final.vcf | Called, annotated variants for the sample. | per-sample |
per-sample report HTML | ./{{ alias }}-report.html | Per sample reports HTML. | per-sample |
per-sample report PDF | ./{{ alias }}-report.pdf | Per sample reports PDF. | per-sample |
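To confirm that a run produced everything listed in the table above, the expected paths can be built from the sample aliases. The sketch below is illustrative only: the function names are hypothetical, and it assumes all files sit directly in the output directory, as the file paths in the table suggest.

```python
from pathlib import Path

# File names taken from the outputs table above.
AGGREGATED = [
    "wf-tb-amr-report.html",
    "wf-tb-amr-report.csv",
    "results.json",
    "checkpoints.json",
]
PER_SAMPLE = [
    "{alias}.bam",
    "{alias}.bam.bai",
    "{alias}.final.vcf",
    "{alias}-report.html",
    "{alias}-report.pdf",
]


def expected_outputs(out_dir, aliases):
    """List the aggregated and per-sample paths expected in out_dir."""
    root = Path(out_dir)
    paths = [root / name for name in AGGREGATED]
    for alias in aliases:
        paths += [root / template.format(alias=alias) for template in PER_SAMPLE]
    return paths


def missing_outputs(out_dir, aliases):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_outputs(out_dir, aliases) if not path.exists()]
```

For example, `missing_outputs("output", ["sample_a"])` returns an empty list when every expected file for the hypothetical alias `sample_a` is present in the default output directory.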
Pipeline Overview
1. Concatenate input files and generate per-read stats.
The fastcat/bamstats tool is used to concatenate multifile samples to be processed by the workflow. It will also output per read stats including average read lengths and qualities.
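To make the per-read statistics concrete, the sketch below computes the length and a mean quality for a single FASTQ record. It assumes Phred+33 quality encoding and averages the per-base error probabilities before converting back to the Phred scale; this is one common convention and may differ from what fastcat or bamstats report. The function name `read_stats` is hypothetical.

```python
import math


def read_stats(sequence, quality, offset=33):
    """Return (length, mean quality) for one FASTQ record.

    Each quality character is decoded to a Phred score, converted to an
    error probability, averaged, and converted back to the Phred scale.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    if not sequence:
        return 0, 0.0
    error_probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    mean_error = sum(error_probs) / len(error_probs)
    return len(sequence), -10 * math.log10(mean_error)


length, mean_q = read_stats("ACGTACGT", "IIIIIIII")
print(length, round(mean_q, 1))
```

Averaging error probabilities rather than raw Phred scores means a few very poor bases pull the mean quality down sharply, which is usually the desired behaviour for read-level filtering.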
2. Align reads to NC_000962.3 reference genome
minimap2 is used to align reads from the samples to the Mycobacterium tuberculosis reference genome FASTA (NC_000962.3). This step also discards unmapped reads and generates statistics from the resulting BAM file.
3. Screen reads
Kraken2 is used to screen reads in the sample to filter those derived from non-specific amplification.
4. Run AMR module
The following steps are carried out during this stage:
- The bcftools mpileup tool is used to determine base composition of pre-defined mutations.
- Clair3 is used to perform indel calling, and the results merged with the previous step.
- The whatshap tool is used to phase mutations, and the results annotated using SnpEff.
- Mutations are then matched to the WHO catalogue, and the WHO expert rules are applied to retrieve those which adhere to these rules, but are not present in the catalogue.
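The final matching step can be pictured as a catalogue lookup followed by a rule-based fallback. The sketch below uses hypothetical, heavily simplified stand-ins for the catalogue and expert rules: the real who_json and expert_rules files may be structured quite differently, and the entries shown are illustrative examples rather than content taken from those files.

```python
# Hypothetical catalogue: (gene, protein change) -> antimicrobial.
CATALOGUE = {
    ("rpoB", "p.Ser450Leu"): "rifampicin",
    ("katG", "p.Ser315Thr"): "isoniazid",
}

# Hypothetical expert rule: any missense variant within a codon range of a gene.
EXPERT_RULES = [
    {
        "gene": "rpoB",
        "codon_start": 426,
        "codon_end": 452,
        "effect": "missense_variant",
        "drug": "rifampicin",
    },
]


def classify(gene, change, codon, effect):
    """Return (drug, source) for a variant, or (None, None) if nothing matches.

    The catalogue is consulted first; expert rules only apply to variants
    that are not present in the catalogue.
    """
    drug = CATALOGUE.get((gene, change))
    if drug is not None:
        return drug, "catalogue"
    for rule in EXPERT_RULES:
        in_range = rule["codon_start"] <= codon <= rule["codon_end"]
        if gene == rule["gene"] and in_range and effect == rule["effect"]:
            return rule["drug"], "expert_rule"
    return None, None
```

Recording the source alongside the drug keeps it clear in downstream reporting whether a call came from a catalogued mutation or from a rule.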
5. Run species identification module
Kraken2 is run using a custom hsp65 database, to identify the Mycobacterium species.
6. Run spoligotyping module
Samples are mapped to CRISPR regions using minimap2, and the results are then processed with blast to detect the presence or absence of 43 spacers. The pattern of spacers is then used to classify strains into the different lineages.
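A presence/absence pattern over 43 spacers is conventionally written as a 15-digit octal code: the first 42 spacers are read as 14 groups of three binary digits, and the last spacer is appended as a single 0 or 1. The sketch below shows that conversion. Whether the workflow uses this representation internally is an assumption, and the function name `spoligotype_octal` is hypothetical.

```python
def spoligotype_octal(present):
    """Convert 43 presence/absence flags into a 15-digit octal code.

    present: sequence of 43 truthy/falsy values, ordered spacer 1 to 43.
    """
    if len(present) != 43:
        raise ValueError("expected exactly 43 spacer flags")
    bits = "".join("1" if flag else "0" for flag in present)
    # Spacers 1-42 form 14 triplets, each read as one octal digit.
    digits = [str(int(bits[i:i + 3], 2)) for i in range(0, 42, 3)]
    # Spacer 43 is appended unchanged as the final digit.
    digits.append(bits[42])
    return "".join(digits)


# Pattern with spacers 1-34 absent and 35-43 present.
pattern = [False] * 34 + [True] * 9
print(spoligotype_octal(pattern))  # prints 000000000003771
```

The octal form is compact enough to use as a lookup key when comparing a sample against a database of known patterns.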
7. Quality control
The workflow generates checkpoint data at various stages which captures which quality control checks have passed or failed.
8. Report results
The workflow outputs an HTML report with overall results for all samples in the run, individual sample HTML reports, and a summary CSV file.
Troubleshooting
- If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
- See how to interpret some common Nextflow exit codes here.
FAQs
If your question is not answered here, please report any issues or suggestions to Technical Support.
Other
See the EPI2ME website for lots of other resources and blog posts.