hla_pipeline's Introduction

DNA and RNA variant calling pipelines with HLA typing and Neoantigen predictions

This toolset can process DNA (tumor and normal) and RNA (tumor) sequencing data and generate a list of somatic variants, HLAs and neoantigens with affinity scores.

The DNA and RNA pipelines make use of the latest GATK4 best-practices. All the tools and pipelines are fully parametrised and optimized for speed.

There are 2 pipelines and 2 tools:

dna_pipeline.py processes DNA data and generates a unified list of filtered and annotated somatic variants. The variant callers are Mutect2, Strelka2, Varscan and SomaticSniper, and both indels and SNPs are reported. Annotation is performed using VEP. The pipeline uses trim-galore to trim, bwa-mem to align and follows GATK4 best practices. The pipeline also performs HLA predictions with OptiType (tumor and normal). QC is performed with FastQC and BamQC.

rna_pipeline.py processes RNA data and generates a unified list of annotated, weakly filtered somatic variants, together with a list of gene count values. The variant callers used are Varscan and HaplotypeCaller. Annotation is performed with VEP. The pipeline uses trim-galore to trim, STAR to align and follows GATK4 best practices. The pipeline also performs HLA predictions with OptiType. The gene count values are computed with featureCounts. QC is performed with FastQC and BamQC.

merge_results.py combines results from one or more runs of the DNA and RNA pipelines into a unified table. Variants are filtered by user-defined criteria, and epitopes are created for each of a variant's somatic effects. The user can set the filter values separately for DNA and RNA variants.

mhc_predict.py takes the file generated by merge_results.py and the HLA files generated by the DNA and/or RNA pipelines, and generates a list of predicted neoantigens with binding affinity scores. Variants are filtered by certain criteria and only the most common alleles for each HLA class I gene are used.
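The allele selection described above can be pictured with a small sketch. It is illustrative only: the function name `most_common_alleles`, the flat list input and the `per_gene` parameter are assumptions made for the example, and how mhc_predict.py actually ranks alleles is not stated here.

```python
from collections import Counter, defaultdict


def most_common_alleles(allele_calls, per_gene=1):
    """Pick the most frequent allele(s) for each HLA class I gene.

    allele_calls: iterable of allele strings such as "A*02:01", pooled from
                  several HLA typing results (hypothetical input format).
    per_gene:     how many alleles to keep for each gene (assumed parameter).
    """
    by_gene = defaultdict(Counter)
    for allele in allele_calls:
        gene = allele.split("*", 1)[0]  # "A*02:01" -> "A"
        by_gene[gene][allele] += 1
    return {
        gene: [allele for allele, _ in counter.most_common(per_gene)]
        for gene, counter in sorted(by_gene.items())
    }


calls = [
    "A*02:01", "A*02:01", "A*01:01",
    "B*07:02", "B*07:02", "B*08:01",
    "C*07:02", "C*07:01", "C*07:02",
]
selected = most_common_alleles(calls)
```

Pooling the calls before counting means that an allele supported by both the tumor and normal typing results outranks one seen only once.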

Each tool and pipeline has a command-line interface; run it with --help to list and describe its parameters.

Requirements

We strongly recommend using Anaconda or Miniconda; otherwise you may need to create aliases for some tools, as expected by the file hlapipeline/tools.py.

See environment.yml for a list of required packages.

Install

See INSTALL.txt for installation instructions.

See REFERENCES.txt for instructions to download the references needed to run the pipelines.

How to run

See RUN.txt for a running example.

It is recommended to use a Linux machine with at least 20 threads, 64GB of RAM and 500GB of disk space.

Output (important files)

dna_pipeline.py

  • annotated.hgXX_multianno.vcf (annotated and combined somatic variants)
  • HLA predictions DNA (Tumor_hla_genotype.tsv and Normal_hla_genotype.tsv)

Other files:

  • combined_calls.vcf
  • tumor_final.bam
  • normal_final.bam
  • fastqc files
  • cutadapt stats
  • vcf stats
  • bamQC_Normal folder
  • bamQC_Tumor folder

rna_pipeline.py

  • annotated.hgXX_multianno.vcf (annotated and combined germline variants)
  • gene.counts (gene counts from featureCounts)
  • HLA predictions (hla_genotype.tsv)

Other files:

  • combined_calls.vcf
  • sample_final.bam
  • fastqc files
  • cutadapt stats
  • STAR log
  • vcf stats
  • bamQC folder
  • bamQCRNA folder

merge_results.py

  • overlap_final.txt (all the DNA and RNA variants collapsed and filtered with useful information and epitopes)
  • overlap_final_unique_rna.txt (all the RNA variants collapsed and filtered with useful information and epitopes)
  • overlap_final_discarded.txt (all the discarded DNA variants collapsed with useful information and epitopes)
  • overlap_final_discarded_rna.txt (all the discarded RNA variants collapsed with useful information and epitopes)
  • gene.counts.final (same file as gene.counts with three additional columns for TPM, RPKM and Percentile)
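For readers unfamiliar with the extra columns in gene.counts.final, the sketch below shows the standard TPM and RPKM formulas applied to raw counts and gene lengths. It is not the code used by merge_results.py: the function name and the dictionary inputs are hypothetical, and the percentile definition (share of genes with TPM at or below the gene's own TPM) is an assumption, since the exact definition is not given here.

```python
from bisect import bisect_right


def add_expression_columns(counts, lengths_bp):
    """Compute TPM, RPKM and a percentile rank for each gene.

    counts:     dict mapping gene -> raw read count
    lengths_bp: dict mapping gene -> gene length in base pairs
    """
    total_reads = sum(counts.values())

    # RPKM: reads per kilobase of gene per million counted reads.
    rpkm = {
        g: (counts[g] * 1e9) / (lengths_bp[g] * total_reads) if total_reads else 0.0
        for g in counts
    }

    # TPM: length-normalised rates rescaled so that they sum to one million.
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    rate_sum = sum(rate.values())
    tpm = {g: rate[g] / rate_sum * 1e6 if rate_sum else 0.0 for g in counts}

    # Percentile (assumed definition): % of genes with TPM <= this gene's TPM.
    sorted_tpm = sorted(tpm.values())
    n_genes = len(counts)
    percentile = {
        g: 100.0 * bisect_right(sorted_tpm, tpm[g]) / n_genes for g in counts
    }

    return {
        g: {"TPM": tpm[g], "RPKM": rpkm[g], "Percentile": percentile[g]}
        for g in counts
    }


example = add_expression_columns(
    counts={"geneA": 100, "geneB": 300},
    lengths_bp={"geneA": 1000, "geneB": 3000},
)
```

In this toy input both genes have the same reads-per-base rate, so they receive identical TPM and RPKM values even though their raw counts differ.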

mhc_predict.py

  • predictions_mut.csv (all the mutated peptides predictions)
  • predictions_wt.csv (all the WT peptides predictions)

Other files:

  • protein_sequences_mu.fasta
  • protein_sequences_wt.fasta
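The file names above suggest that a mutated and a wild-type sequence are kept for each variant. As a rough illustration of that idea, the sketch below builds such a pair of peptide windows for a single amino-acid substitution. It is a hypothetical example, not the in-house epitope function: the name `peptide_windows`, the `flank` parameter and the restriction to missense changes are all assumptions.

```python
def peptide_windows(wt_protein, position, alt_aa, flank=12):
    """Return (wild-type window, mutated window) around one substitution.

    wt_protein: wild-type protein sequence (one-letter amino-acid codes)
    position:   1-based position of the substituted residue
    alt_aa:     replacement amino acid
    flank:      residues kept on each side of the substitution (assumed default)
    """
    if not 1 <= position <= len(wt_protein):
        raise ValueError("position lies outside the protein sequence")
    idx = position - 1
    mutated = wt_protein[:idx] + alt_aa + wt_protein[idx + 1:]
    # Clip the window at the protein ends.
    start = max(0, idx - flank)
    end = min(len(wt_protein), idx + flank + 1)
    return wt_protein[start:end], mutated[start:end]


# Toy sequence: replace the Y at position 5 with H, keeping 2 flanking residues.
wt_window, mut_window = peptide_windows("MKTAYIAKQR", position=5, alt_aa="H", flank=2)
```

Frameshifts, insertions, deletions and stop-gain effects would need separate handling that this sketch does not attempt.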

Authors

Jose Fernandez Navarro [email protected]

Contributors

Jonatan Gonzalez [email protected]

Contact

Contact: Jose Fernandez Navarro [email protected]


hla_pipeline's Issues

Replace the in-house epitope generation function

Replace the epitope generation function with Varcode, or improve the in-house approach so that it obtains
the DNA and protein sequences from PyEnsembl instead of from input files. Use only Ensembl as the database
in both approaches.

Replace GATK3 mergeVariants

Use jacquard merge instead of GATK3 to merge variants (with this, the awk hacks that replace IUPAC codes in REF can be removed).
Another similar tool can be used instead of jacquard merge if it behaves the same way.

Variant calling for HLAs

Add a module and/or tool to perform variant calling in HLA regions by generating
a reference in the HLA typing step and calling variants against it.

Replace Annovar with VEP

Use VEP for annotation instead of Annovar (this may make it possible to remove the hacks that make the VCFs compatible with Annovar).

Add unit tests

Create the testing environment and add tests for individual functions and dry runs.
