GermlineStructuralV-nf

Description

GermlineStructuralV-nf is a pipeline for identifying structural variant events in human Illumina short read whole genome sequence data. GermlineStructuralV-nf identifies structural variant and copy number events from BAM files using Manta, Smoove, and TIDDIT. Variants are then merged using SURVIVOR, and annotated by AnnotSV. The pipeline is written in Nextflow and uses Singularity/Docker to run containerised tools.

Structural variant and copy number detection is challenging. Most structural variant detection tools infer these events from read mapping patterns, which can often resemble sequencing and read alignment artefacts. To address this, GermlineStructuralV-nf employs three general purpose structural variant calling tools, each of which supports a combination of detection methods. Manta, Smoove, and TIDDIT use typical detection approaches that consider:

  • Discordant read pair alignments
  • Split reads that span a breakpoint
  • Read depth profiling
  • Local de novo assembly

This is currently considered the best approach for maximising the sensitivity of short read data (Cameron et al. 2019, Mahmoud et al. 2019). By using a combination of tools that employ different methods, we improve our ability to detect different types and sizes of variant events.

Diagram

User guide

To run this pipeline, you will need to prepare your input files, reference data, and clone this repository. Before proceeding, ensure Nextflow is installed on the system you're working on. To install Nextflow, see these instructions.

1. Prepare inputs

To run this pipeline you will need the following inputs:

  • Paired-end BAM files
  • Corresponding BAM index files
  • Input sample sheet

This pipeline processes paired-end BAM files and is capable of processing multiple samples in parallel. BAM files are expected to be coordinate sorted and indexed (see Fastq-to-BAM for an example of a best practice workflow that can generate these files).
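If your BAM files are not yet coordinate sorted and indexed, a typical samtools approach (shown here as a sketch only, assuming samtools is available on your system) is:

samtools sort -o sample1.sorted.bam sample1.bam
samtools index sample1.sorted.bam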

Before running the pipeline, you will need to create a sample sheet with information about the samples you are processing. This file must be tab-separated and contain a header and one row per sample. Columns should correspond to sampleID, BAM file, and BAI file:

sampleID bam bai
SAMPLE1 /data/Bams/sample1.bam /data/Bams/sample1.bam.bai
SAMPLE2 /data/Bams/sample2.bam /data/Bams/sample2.bam.bai

When you run the pipeline, you will use the mandatory --input parameter to specify the location and name of the input file:

--input /path/to/samples.tsv

2. Prepare the reference materials

To run this pipeline you will need an indexed copy of the reference genome you would like to use. Reference FASTA files must be accompanied by a .fai index file. If you are working with a species that has a public reference genome, you can download FASTA files from the Ensembl, UCSC, or NCBI FTP sites. You can use the IndexReferenceFasta-nf pipeline to generate the required samtools and bwa indexes.
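For example, the required FASTA and bwa indexes can usually be generated as follows (a sketch only, assuming samtools and bwa are installed):

samtools faidx reference.fasta
bwa index reference.fasta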

When you run the pipeline, you will use the mandatory --ref parameter to specify the location and name of the reference.fasta file:

--ref /path/to/reference.fasta

Note

  • Tiddit expects the BWA index files to be in the same directory as the reference fasta file.
  • You must specify the full path for the reference fasta, even if it is in your working directory.

Download the AnnotSV database and supporting files (optional)

If you choose to run the pipeline with AnnotSV annotations, you currently need to download and prepare the relevant AnnotSV files manually. The AnnotSV data is very large (>20 GB), so we haven't included it in the AnnotSV container.

First, download the AnnotSV database:

wget https://www.lbgi.fr/~geoffroy/Annotations/Annotations_Human_3.2.1.tar.gz

Then unzip it and save to a directory of your choosing:

tar -xf Annotations_Human_3.2.1.tar.gz -C /path/to/AnnotSV

You will also need to download the Exomiser supporting data files:

wget https://www.lbgi.fr/~geoffroy/Annotations/2202_hg19.tar.gz && wget https://data.monarchinitiative.org/exomiser/data/2202_phenotype.zip

Create a directory to house the Exomiser files:

mkdir -p Annotations_Human/Annotations_Exomiser/2202

Save the downloaded Exomiser files to your AnnotSV directory:

tar -xf 2202_hg19.tar.gz -C /path/to/AnnotSV/Annotations_Human/Annotations_Exomiser/2202/ && unzip 2202_phenotype.zip -d /path/to/AnnotSV/Annotations_Human/Annotations_Exomiser/2202/

And finally (optionally), tidy up:

rm -rf Annotations_Human_3.2.1.tar.gz 2202_phenotype.zip 2202_hg19.tar.gz

3. Clone this repository

Download the code contained in this repository with:

git clone https://github.com/Sydney-Informatics-Hub/Germline-StructuralV-nf

This will create a directory with the following structure:

Germline-StructuralV-nf/
├── LICENSE
├── README.md
├── config/
├── main.nf
├── modules/
└── nextflow.config

The important features are:

  • main.nf contains the main nextflow script that calls all the processes in the workflow.
  • nextflow.config contains default parameters to use in the pipeline.
  • modules contains individual process files for each step in the workflow.
  • config contains infrastructure-specific config files (this is currently under development)

4. Run the pipeline

The most basic run command for this pipeline is:

nextflow run main.nf --input sample.tsv --ref /path/to/ref

This will generate a work directory, a results output directory, and a runInfo run metrics directory. To specify additional optional tool-specific parameters, see what flags are supported by running:

nextflow run main.nf --help

Customising the workflow

By default, the workflow will merge events that are supported by at least one SV caller (Tiddit, Smoove, Manta), are a maximum distance of 1 kb apart, and are at least 30 bp long. By default, callers have to agree on the type and strand to merge events. All of these defaults can be overridden using the following flags:

  • --survivorMaxDist: Maximum distance between events to merge. Default: 1000.
  • --survivorConsensus: Number of callers required to report a call. Default: 1. Change to 2 or 3 to require support from 2 or 3 callers, respectively.
  • --survivorType: SV type consensus. Default: callers must agree (1). Change to 0 to remove requirement.
  • --survivorStrand: SV strand consensus. Default: callers must agree (1). Change to 0 to remove requirement.
  • --survivorSize: Minimum SV size (bp) to report. Default: 30.
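For example, to require support from at least two callers and only report events of at least 50 bp (the values shown are illustrative):

nextflow run main.nf --input sample.tsv --ref /path/to/ref --survivorConsensus 2 --survivorSize 50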

If you need to specify any additional flags supported by Manta, use the --extraMantaFlags flag and add one or more flags inside single quotes. If using multiple flags, they should be separated by a space.

If you need to specify any additional flags supported by Smoove, use the --extraSmooveFlags flag in the same way.

If you need to specify any additional flags supported by Tiddit sv or Tiddit cov, use the --extraTidditSvFlags or --extraTidditCovFlags flag, respectively, in the same way.
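For example, to pass two extra flags to Manta in a single quoted string (the flags shown are Manta options for targeted data and pre-defined call regions; check the Manta documentation before applying them to your data, and note that --callRegions expects a bgzipped, tabix-indexed BED file):

nextflow run main.nf --input sample.tsv --ref /path/to/ref --extraMantaFlags '--exome --callRegions /path/to/regions.bed.gz'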

AnnotSV annotations for human samples

To run the pipeline with the optional AnnotSV annotations, use the following command to direct Nextflow to your previously prepared AnnotSV resource directory:

nextflow run main.nf --input sample.tsv --ref /path/to/ref --annotsvDir /path/to/annotsv

You can override the default annotation mode (both) and instead apply split or full annotations. See the AnnotSV documentation for details. To override this default, use the --annotsvMode flag in your run command:

nextflow run main.nf --input sample.tsv --ref /path/to/ref --annotsvDir /path/to/annotsv --annotsvMode {both|split|full}

If you need to specify any additional flags supported by AnnotSV, use the --extraAnnotsvFlags flag and add one or more flags inside single quotes, separated by spaces:

nextflow run main.nf --input sample.tsv --ref /path/to/ref --annotsvDir /path/to/annotsv --annotsvMode full --extraAnnotsvFlags '-SVminSize 100 -vcf 1'

If for any reason your workflow fails, you are able to resume the workflow from the last successful process with -resume.
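For example, re-running the same command with -resume will reuse cached results for any process that completed successfully:

nextflow run main.nf --input sample.tsv --ref /path/to/ref -resume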

5. Results

Once the pipeline is complete, you will find all outputs for each sample in the results directory. Within each sample directory there is a subdirectory for each tool, containing all intermediate files and results generated by that step. A final merged VCF for each sample will be created: results/$sampleID/survivor/$sampleID_merged.vcf.

The following directories will be created:

  • manta: all intermediate files and results generated by Manta.
  • smoove: all intermediate files and results generated by Smoove.
  • tiddit: all intermediate files and results generated by Tiddit.
  • survivor: summary stats, merged multi-caller VCF (final output), merged multi-caller bedpe file.
  • annotsv: full annotations for all events in the merged multi-caller VCF.
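An illustrative layout for a single sample (directory names as above; the exact intermediate files will vary) is:

results/
└── SAMPLE1/
    ├── manta/
    ├── smoove/
    ├── tiddit/
    ├── survivor/
    │   └── SAMPLE1_merged.vcf
    └── annotsv/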

Infrastructure usage and recommendations

This pipeline has been successfully implemented on NCI Gadi and Pawsey Setonix HPCs using infrastructure-specific configs. These configs can be used to interact with the job scheduler and assign a project code to all task job submissions for billing purposes. You can use the following flags to handle accounting:

  • --whoami your NCI or Pawsey user name
  • --setonix-account the Setonix project account you would like to bill service units to
  • --gadi-account the Gadi project account you would like to bill service units to

NCI Gadi HPC

Before running the pipeline you will need to load Nextflow and Singularity, both of which are globally installed modules on Gadi. You can do this by running the commands below:

module purge
module load nextflow singularity

To execute this workflow on NCI Gadi HPC, you will need to specify the following flags to the default run command:

nextflow run main.nf --input sample.tsv --ref /path/to/ref --gadi-account <account> --whoami <username> -profile gadi

Please be aware that as of October 2023, NCI Gadi HPC queues do not have external network access. This means you will not be able to pull the workflow code base or containers if you submit your nextflow run command as a job on any of the standard job queues. NCI currently recommends you run your Nextflow head job either in a GNU screen or tmux session from the login node or submit it as a job to the copyq.
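As a sketch only, a copyq submission script for the Nextflow head job might look like the following (project code, resource requests, and storage paths are placeholders you will need to adjust for your own allocation):

#!/bin/bash
#PBS -P <project>
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -l walltime=10:00:00
#PBS -l storage=scratch/<project>+gdata/<project>
#PBS -l wd

module purge
module load nextflow singularity

nextflow run main.nf --input sample.tsv --ref /path/to/ref --gadi-account <project> --whoami <username> -profile gadi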

The NCI Gadi config currently runs all tasks apart from the rehead processes on the normal queue. This config uses the --gadi-account flag to assign a project code to all task job submissions for billing purposes. The version of Nextflow installed on Gadi has been modified to make it easier to specify resource options for jobs submitted to the cluster. See NCI's Gadi user guide for more details.

The NCI Gadi config summarises resource usage in a custom trace file that will be saved to your specified results directory. However, for accounting or resource benchmarking purposes you may need to collect per-task service unit (SU) charges. Upon workflow completion, you can run the Sydney Informatics Hub's gadi_nfcore_report.sh script in your workflow execution directory with:

bash scripts/gadi_usage.sh

This script will collect resource usage from the PBS logs printed to each task's .command.log. Resource requests and usage for each process are summarised in the output gadi-gsv-usage-report.tsv file. This is useful for resource benchmarking and SU accounting.

Pawsey Setonix HPC

Before running the pipeline you will need to load Nextflow and Singularity, both of which are globally installed modules on Setonix. You can do this by running the commands below:

module purge
module load nextflow singularity

To execute this workflow on Pawsey Setonix HPC, you will need to specify the following flags to the default run command:

nextflow run main.nf --input sample.tsv --ref /path/to/ref --setonix-account <account> --whoami <username> -profile setonix

This config currently submits all tasks apart from the rehead processes to the work queue. This config uses the --setonix-account flag to assign a project code to all task job submissions for billing purposes.

Benchmarking

These resource request recommendations are based on benchmarking performed using 60 human genomes sequenced to a depth of ~30x coverage. All computation was performed on NCI Gadi HPC, running CentOS 8 and PBS Pro v19, on Intel Xeon Cascade Lake nodes (2 x 24 cores and 192 GB RAM each) on the normal queue.

Nextflow trace, timeline, and workflow reports for this execution are available in the benchmark folder in this repository. Elapsed processing time for 60 samples was 2hrs 36 mins, costing 2,355.86 service units (SU) in total (~39.26 SUs/sample).

Workflow summaries

Metadata

metadata field GermlineStructuralV-nf / v1.0
Version 1.0.0
Maturity First release
Creators Georgie Samaha, Tracy Chew, Marina Kennerson, Sarah Beecroft
Source NA
License GNU General Public License v3.0
Workflow manager Nextflow
Container See component tools
Install method NA
GitHub https://github.com/Sydney-Informatics-Hub/Germline-StructuralV-nf
bio.tools NA
BioContainers NA
bioconda NA

Component tools

To run this pipeline you must have Nextflow and Singularity installed on your machine. All other tools are run using containers.

Tool Version
Nextflow >=20.07.1
Singularity
Manta 1.6.0
Smoove 0.2.7
TIDDIT 3.6.0
BCFtools 1.15.1
HTSlib 1.15.1
SURVIVOR 1.0.7
AnnotSV 3.2.1

Additional notes

Resources

Help/FAQ/Troubleshooting

  • It is essential that the reference genome you're using contains the same chromosomes, contigs, and scaffolds as the BAM files. This is mandated by Manta, which will throw an error if the BAM and FASTA files do not match. To confirm what contigs are included in your indexed BAM file, you can use Samtools idxstats:
samtools idxstats input.bam | cut -f 1
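To cross-check these against your reference, the contig names recorded in the FASTA index can be listed and compared in the same way (assuming the .fai file sits alongside the reference; the grep drops the unmapped '*' row that idxstats reports):

cut -f 1 /path/to/reference.fasta.fai
diff <(samtools idxstats input.bam | cut -f 1 | grep -v '^\*$' | sort) <(cut -f 1 /path/to/reference.fasta.fai | sort)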

Acknowledgements/citations/credits

Contributors

  • Georgie Samaha (Sydney Informatics Hub, University of Sydney)
  • Tracy Chew (Sydney Informatics Hub, University of Sydney)
  • Marina Kennerson (ANZAC Research Institute)
  • Sarah Beecroft (Pawsey Supercomputing Research Centre)
  • Ching-Yu Lu (Sydney Informatics Hub, University of Sydney)

Acknowledgements

  • This pipeline was developed and tested using data provided by the Northcott Neuroscience Laboratory, ANZAC Research Institute and resources provided by the Australian BioCommons 'Bring Your Own Data' platforms project and the Pawsey Supercomputing Research Centre.
  • This pipeline was built using the Nextflow DSL2 template.
  • Documentation was created following the Australian BioCommons documentation guidelines.

Cite us to support us!

Our workflows are registered on WorkflowHub and this workflow can be cited in your publication:

Samaha, G., Chew, T., Kennerson, M., Beecroft, S. (2023). GermlineStructuralV-nf. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.431.1

Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities. We suggest including the following acknowledgement in any publications that follow from this work:

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia.


germline-structuralv-nf's Issues

Merging stringency evaluation

Currently using SURVIVOR merge to combine caller VCFs for each sample with minimal filtering:

  • 1kb distance between pairwise breakpoints
  • Include calls only supported by one caller (higher sensitivity at expense of specificity)
  • Callers do not have to agree on the event type
  • Callers do not have to agree on the strand

Investigate how adjusting these parameters may help reduce the number of false positives.

Pre-download images

The copyq has a maximum walltime of 10 hours. I would like to be able to pull the containers first using the copyq, and then run the head job on the normal queue. Similar to the nf-core download tool, or perhaps a test profile.

Tower integration

Testing at scale

  • Fixes for schema json
  • Report yaml POSTPONED
  • Gadi configurations

nextflow_schema.json

For integration with Nextflow Tower, the schema is required to populate the params page.

Bug in Tiddit when using alt-contigs

Some samples will fail when using alt-contigs (as opposed to just chromosomes) at the Tiddit-sv step. The specific error I found:

Traceback (most recent call last):
    File "/usr/local/bin/tiddit", line 11, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/tiddit/__main__.py", line 151, in main
      tiddit_contig_analysis.main(prefix,sample_id,library,contigs,coverage_data,args)
    File "tiddit/tiddit_contig_analysis.pyx", line 133, in tiddit.tiddit_contig_analysis.main
    File "tiddit/tiddit_contig_analysis.pyx", line 57, in tiddit.tiddit_contig_analysis.read_contigs
  KeyError: 'HLADRB1*04:03:01'

It looks like there might be a parsing error, because this should be formatted as HLA-DRB1*04:03:01 (i.e. it's missing a dash). For testing purposes I've added errorStrategy 'ignore' to the tiddit-sv module file so the pipeline can keep going. The failed sample was Platinum genome ERR194159, using the FASTA reference from Sarek iGenomes (Homo_sapiens_assembly38.fasta).

Explore variant prioritisation features

Must be dataset/experiment specific. Should include a summary report?

Could include:

  • candidate gene extraction (Custom script)
  • mode of inheritance based filtering (Slivar?)
  • common variant extraction (Custom script)
  • HPO term filtering of AnnotSV

Read samplesheet via column position not header label

To ensure flexibility in header field spelling/capitalisation. The sample sheet will be accepted as long as the header exists and the data fields are in the right order; it won't matter whether, for example, 'sampleID' or 'SampleID' is used.

Add additional checks to check_cohort.nf

  • Write a script to check the integrity of the file
  • Confirm it is tab separated
  • Confirm the correct number of columns
  • Confirm BAM and BAI files exist
  • Check the timestamp of the BAI vs the BAM (to avoid the Manta error 'The index file is older than the data file')
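A minimal shell sketch of these checks (a hypothetical helper, assuming the three-column tab-separated format described in the user guide) could look like:

#!/bin/bash
samplesheet="samples.tsv"

# Confirm every row has exactly three tab-separated columns
awk -F'\t' 'NF != 3 { print "Line " NR " does not have 3 tab-separated columns"; exit 1 }' "$samplesheet"

# Skip the header, then confirm each BAM and BAI exists and the BAI is not older than the BAM
tail -n +2 "$samplesheet" | while IFS=$'\t' read -r sample bam bai; do
  [[ -f "$bam" ]] || echo "Missing BAM for $sample: $bam"
  [[ -f "$bai" ]] || echo "Missing BAI for $sample: $bai"
  if [[ -f "$bam" && -f "$bai" && "$bam" -nt "$bai" ]]; then
    echo "Index is older than BAM for $sample (triggers the Manta 'index file is older' error): $bai"
  fi
done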

Pipeline feedback

Hi!
I'm running this pipeline on Setonix and I'm enjoying it so far 😀
Not so much an issue but I had a few things I wanted to mention/request regarding the pipeline/AnnotSV output, mostly based on my experience using manta + AnnotSV -

  1. Gene annotations are set to full - this removes the 'split' lines containing transcript specific information which is needed for prioritising variants, in particular the frameshift column. The drawback is it will add at least one extra line per entry.
  2. Merging all of the cohort samples before annotating would be beneficial, as this would make filtering based on pedigree much easier. I think this is something you are already looking at, but I generally just do this in Excel with the merged cohort AnnotSV outputs. A better way of merging and filtering these could also help to find compound het variants. This would also make it easier to remove false positive SVs and result in a smaller overall output and AnnotSV runtime.
    JasmineSV could be useful here.
  3. Similar to above, I think it could also be beneficial to run manta as a whole cohort rather than individually as it can then genotype each family/cohort member at each variant and provide quality scores for these, which is missing if they are all called individually, and also reduce some false positives. I'm not sure if it's possible to do this with other callers?
    I previously ran manta using --bam (joint diploid analysis) rather than --normalbam which I believe is for tumor analysis.

Thanks for making such a fast, comprehensive, and easy to use pipeline!
Cheers,
Gavin 😊

VEP annotation troubleshooting

Variant annotation with Variant effect predictor

Variant annotation should be a flexible option that allows for different organisms, databases, online and offline use of pipeline. VEP can use multiple annotation sources. This includes:

  • VEP cache: all transcript models, regulatory features, variant data for a species
  • GFF/GTF: transcript models (tabix indexed)
  • Database: MySQL database

Complications/issues

  • Cache dir should correspond to VEP tool version.
  • Using the database or cache options requires external network access, as users need to either download the data (cache) or access the Ensembl server
  • Downloading the cache is time consuming and requires a lot of disk space (hg38 release 108 is ~21 GB)
  • Could run a process to download the data, but this costs $$$, and it is unclear how to stay flexible for different organisms

Info

Test run modes

  • test cache
  • test database
  • test gtf/gff

Flags to test in the VEP run command with cache/database

  • --overlaps for proportion and length of transcript overlapped by SV
  • --check_svs existing SVs overlapping input, requires db access (incompatible with --offline)
  • --af_gnomadg see here
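A hedged example of how these might look in a cache-based VEP run (flag combinations and paths still to be confirmed; --check_svs would need database access and so is omitted here):

vep -i SAMPLE1_merged.vcf -o SAMPLE1_merged_vep.txt --cache --dir_cache /path/to/vep_cache --overlaps --af_gnomadg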

pullContainers.sh not utilising specified cachedir

Job fails with:

Job <jobID> killed due to exceeding jobfs quota

Script needs the following line added:

export SINGULARITY_TMPDIR=<existing-dir>

OR, instructions in README for user to set SINGULARITY_TMPDIR in ~/.bashrc

Optional AnnotSV integration (human/mouse only)

AnnotSV is a great tool for human structural variant annotation, ranking, and analysis. Unlike other SV annotation tools, it integrates public datasets, SV ranking, and phenotypic data to aid variant prioritisation.

Including it in the GermlineStructuralV workflow requires:

Input, output and run requirements:

  • default run: vcf in, tsv out, annotation mode: full, genomeBuild: hg38, mm10
  • vcf2circos?

Publications and resources:

Add CNV module

Following feedback on validation of known SV, will need to add CNV module to workflow. Exploring:

  • CNVkit
  • CNVnator

Inconsistent parameter definition and application for gadi account

Project account has been defined and applied inconsistently between nextflow.config and configs/gadi.config:

In nextflow.config it is defined as:

gadi_account = false

In gadi.config, it is applied as:

storage = "scratch/${params.pbs_account}+gdata/${params.pbs_account}"

The storage variable needs to reference the correct parameter as defined in nextflow.config; it is currently failing to apply the correct storage access permissions.
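A possible fix (assuming gadi_account is the intended parameter name) would be to reference it consistently in gadi.config:

storage = "scratch/${params.gadi_account}+gdata/${params.gadi_account}"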

Portability testing @ Pawsey Setonix

SB ran the pipeline successfully at Setonix on Platinum Genomes samples. Outcomes:

  • Provided trace, reports
  • Setonix run script
  • Setonix.config

TODO

  • GS to compare performance at NCI Gadi
