clinical-genomics / mip Goto Github PK

View Code? Open in Web Editor NEW

42.0 13.0 10.0 73.39 MB

Mutation Identification Pipeline. Read the latest documentation:

Home Page: https://clinical-genomics.gitbook.io/project-mip/

License: MIT License

Perl 99.47% Shell 0.08% Dockerfile 0.43% Lua 0.02%

clinical variants analysis pipeline

mip's Introduction

MIP - Mutation Identification Pipeline

MIP enables identification of potential disease causing variants from sequencing data.

Citing MIP

Integration of whole genome sequencing into a healthcare setting: high diagnostic rates across multiple clinical entities in 3219 rare disease patients
Stranneheim H, Lagerstedt-Robinson K, Magnusson M, Kvarnung M, Nilsson D, Lesko N, Engvall M, Anderlid BM, Arnell H, Johansson CB, Barbaro M, Björck E, Bruhn H, Eisfeldt J, Freyer C, Grigelioniene G, Gustavsson P, Hammarsjö A, Hellström-Pigg M, Iwarsson E, Jemt A, Laaksonen M, Enoksson SL, Malmgren H, Naess K, Nordenskjöld M, Oscarson M, Pettersson M, Rasi C, Rosenbaum A, Sahlin E, Sardh E, Stödberg T, Tesi B, Tham E, Thonberg H, Töhönen V, von Döbeln U, Vassiliou D, Vonlanthen S, Wikström AC, Wincent J, Winqvist O, Wredenberg A, Ygberg S, Zetterström RH, Marits P, Soller MJ, Nordgren A, Wirta V, Lindstrand A, Wedell A.
Genome Med. 2021 Mar 17;13(1):40. doi: 10.1186/s13073-021-00855-5.
PMID: 33726816; PMCID: PMC7968334.

Rapid pulsed whole genome sequencing for comprehensive acute diagnostics of inborn errors of metabolism
Stranneheim H, Engvall M, Naess K, Lesko N, Larsson P, Dahlberg M, Andeer R, Wredenberg A, Freyer C, Barbaro M, Bruhn H, Emahazion T, Magnusson M, Wibom R, Zetterström RH, Wirta V, von Döbeln U, Wedell A.
BMC Genomics. 2014 Dec 11;15(1):1090. doi: 10.1186/1471-2164-15-1090.
PMID:25495354

Overview

MIP is being rewritten in NextFlow as a part of the nf-core project. This repo will mainly receive bugfixes as we are focusing our resources on the new pipeline. You can follow the progress here 👉 raredisease.

MIP performs whole genome or target region analysis of sequenced single-end and/or paired-end reads from the Illumina platform in fastq(.gz) format to generate annotated ranked potential disease causing variants.

MIP performs QC, alignment, coverage analysis, variant discovery and annotation, sample checks as well as ranking the found variants according to disease potential with a minimum of manual intervention. MIP is compatible with Scout for visualization of identified variants.

MIP rare disease DNA analyses single nucleotide variants (SNVs), insertions and deletions (INDELs) and structural variants (SVs).

MIP rare disease RNA analyses mono allelic expression, fusion transcripts, transcript expression and alternative splicing.

MIP rare disease DNA vcf rerun performs re-runs starting from BCFs or VCFs.

MIP has been in use in the clinical production at the Clinical Genomics facility at Science for Life Laboratory since 2014.

Example Usage

MIP analyse rare disease DNA

$ mip analyse rd_dna [case_id] --config_file [mip_config_dna.yaml] --pedigree_file [case_id_pedigree.yaml]

MIP analyse rare disease DNA VCF rerun

mip analyse rd_dna_vcf_rerun [case_id] --config_file [mip_config_dna_vcf_rerun.yaml] --vcf_rerun_file vcf.bcf  --sv_vcf_rerun_file sv_vcf.bcf --pedigree [case_id_pedigree_vcf_rerun.yaml]

MIP analyse rare disease RNA

$ mip analyse rd_rna [case_id] --config_file [mip_config_rna.yaml] --pedigree_file [case_id_pedigree_rna.yaml]

Features

Installation
- Simple automated install of all programs using conda/docker/singularity via supplied install application
- Downloads and prepares references in the installation process
Autonomous
- Checks that all dependencies are fulfilled before launching
- Builds and prepares references and/or files missing before launching
- Decompose and normalise reference(s) and variant VCF(s)
Automatic
- A minimal amount of hands-on time
- Tracks and executes all recipes without manual intervention
- Creates internal queues at nodes to optimize processing
Flexible:
- Design your own workflow by turning on/off relevant recipes in predefined pipelines
- Restart an analysis from anywhere in your workflow
- Process one, or multiple samples
- Supply parameters on the command line, in a pedigree.yaml file or via config files
- Simulate your analysis before performing it
- Limit a run to a specific set of genomic intervals or chromosomes
- Use multiple variant callers for both SNV, INDELs and SV
- Use multiple annotation programs
- Optionally split data into clinical variants and research variants
Fast
- Analyses an exome trio in approximately 4 h
- Analyses a genome in approximately 21 h
Traceability
- Track the status of each recipe through dynamically updated status logs
- Recreate your analysis from the MIP log or generated config files
- Log sample meta-data and sequence meta-data
- Log version numbers of softwares and databases
- Checks sample integrity (sex, contamination, duplications, ancestry, inbreeding and relationship)
- Test data output file creation and integrity using automated tests
Annotation
- Gene annotation
  - Summarize over all transcript and output on gene level
- Transcript level annotation
  - Separate pathogenic transcripts for correct downstream annotation
- Annotate all alleles for a position
  - Split multi-allelic records into single records to facilitate annotation
  - Left align and trim variants to normalise them prior to annotation
- Extracts QC-metrics and stores them in YAML format
- Annotate coverage across genetic regions via Sambamba and Chanjo
Standardized
- Use standard formats whenever possible
Visualization
- Ranks variants according to pathogenic potential
- Output is directly compatible with Scout

Getting Started

Installation

MIP is written in perl and therefore requires that perl is installed on your OS.

Prerequisites

Perl, version 5.26.0 or above
Cpanm
Miniconda version 4.5.11
[Singularity] version 3.2.1

We recommend miniconda for installing perl and cpanm libraries. However, perlbrew can also be used for installing and managing perl and cpanm libraries together with MIP. Installation instructions and setting up specific cpanm libraries using perlbrew can be found here.

Automated Installation (Linux x86_64)

Below are instructions for installing the Mutation Identification Pipeline (MIP).

1. Clone the official git repository

$ git clone https://github.com/Clinical-Genomics/MIP.git
$ cd MIP

2. Install required perl modules from cpan to a specified conda environment

$ bash mip_install_perl.sh -e [mip] -p [$HOME/miniconda3]

3. Test conda and mip installation files (optional, but recommended)

$ perl t/mip_install.test

A conda environment will be created where MIP with all dependencies will be installed.

4. Install MIP

$ perl mip install --environment_name [mip] --reference_dir [$HOME/mip_references]

This will cache the containers that are used by MIP.

Note:

For a full list of available options and parameters, run: $ perl mip install --help

6. Test your MIP installation (optional, but recommended)

Make sure to activate your MIP conda environment before executing prove.

$ prove t -r
$ perl t/mip_analyse_rd_dna.test

When setting up your analysis config file

A starting point for the config is provided in MIP's template directory. You will have to modify the load_env keys to whatever you named the environment. If you are using the default environment name the load_env part of the config should look like this:

load_env:
  mip:
    mip:
    method: conda

Usage

MIP is called from the command line and takes input from the command line (precedence) or falls back on defaults where applicable.

Lists are supplied as repeated flag entries on the command line or in the config using the yaml format for arrays. Only flags that will actually be used needs to be specified and MIP will check that all required parameters are set before submitting to SLURM.

Recipe parameters can be set to "0" (=off), "1" (=on) and "2" (=dry run mode). Any recipe can be set to dry run mode and MIP will create the sbatch scripts, but not submit them to SLURM. MIP can be restarted from any recipe using the --start_with_recipe flag and after any recipe using the --start_after_recipe flag.

MIP will overwrite data files when reanalyzing, but keeps all "versioned" sbatch scripts for traceability.

You can always supply mip [process] [pipeline] --help to list all available parameters and defaults.

Example usage:

$ mip analyse rd_dna case_3 --sample_ids 3-1-1A --sample_ids 3-2-1U --sample_ids 3-2-2U --start_with_recipe samtools_merge --config 3_config.yaml

This will analyse case 3 using 3 individuals from that case and begin the analysis with recipes after Bwa mem and use all parameter values as specified in the config file except those supplied on the command line, which has precedence.

Running programs in containers

Aside from a conda environment, MIP uses containers to run programs. You can use either singularity or docker as your container manager. Containers that are downloaded using MIP's automated installer will need no extra setup. By default MIP will make the reference-, outdata- and temp directory available to the container. Extra directories can be made available to each recipe by adding the key recipe_bind_path in the config.

In the example below the config has been modified to include the infile directories for the bwa_mem recipe:

recipe_bind_path:
  bwa_mem:
    - <path_to_directory_with_fastq_files>

Input

Fastq file directories can be supplied with --infile_dirs [PATH_TO_FASTQ_DIR=SAMPLE_ID]
All references and template files should be placed directly in the reference directory specified by --reference_dir.

Meta-Data

Configuration file (YAML-format)
Gene panel file
Pedigree file (YAML-format)
Rank model file (Ini-format; SNV/INDEL)
SV rank model file (Ini-format; SV)
Qc regexp file (YAML-format)

Output

Analyses done per individual is found in each sample_id directory and analyses done including all samples can be found in the case directory.

Sbatch Scripts

MIP will create sbatch scripts (.sh) and submit them in proper order with attached dependencies to SLURM. These sbatch script are placed in the output script directory specified by --outscript_dir. The sbatch scripts are versioned and will not be overwritten if you begin a new analysis. Versioned "xargs" scripts will also be created where possible to maximize the use of the cores processing power.

Data

MIP will place any generated data files in the output data directory specified by --outdata_dir. All data files are regenerated for each analysis. STDOUT and STDERR for each recipe is written in the recipe/info directory.

mip's People

Contributors

Stargazers

Watchers

Forkers

robinandeer limbus-medtec jemten hassanfa j35p312 xchromosome219 hmyh1202 eriksjolund bin-guan genostack

mip's Issues

Update VariantRecalibration annotations

Changed 'maxGaussian' [4] to always be enabled for indels
Added '-an SOR'

Add population frequencies and adjust rank model to accomodate for Exac, 1000G

Use vt to split and normalize vt - replacing vcfparser

http://gemini.readthedocs.org/en/latest/content/preprocessing.html

Update GATK version

Update perl 5.22

BEA_MEM BAM to CRAM

Use of display_name where relevant with MIP

Update C-score database

Update SNPEff

Remove VariantAnnotationBlock info file from AnalysisRunStatus when -rio 0

Update 1000G

Update to: v.5a

Update VEP

Replace PicardToolsMarkDuplicates for whole genomes

Often failing due to time limits

Family analysis error

I'm running:

run_mip_family_analysis.py /mnt/hds/proj/cust004/exomes/218fam/218fam_pedigree.txt -tres -100 /mnt/hds/proj/cust004/analysis/exomes/218fam/mosaik/GATK/candidates/ranking/clinical/218fam.selectVariants /mnt/hds/proj/bioinfo/mip/mip_references/hg19_refGene.txt -o /mnt/hds/proj/cust004/analysis/exomes/218fam/mosaik/GATK/candidates/ranking/clinical/218fam_ranked_BOTH.txt

and getting an error:

usage: run_mip_family_analysis.py [-h] [-o OUTFILE] [--version] [-v] [-cmms]
                                  [-s] [-pos] [-tres TRESHOLD]
                                  family_file variant_file
run_mip_family_analysis.py: error: unrecognized arguments: /mnt/hds/proj/bioinfo/mip/mip_references/hg19_refGene.txt

After talking to Måns, this file seems to no longer be needed and shouldn't be included when generating the call to run_mip_family_analysis.py

Make virtualenv optional?

Our use case is to have a single default production environment which is always sourced. This environment also contains the other MIP dependencies which makes it very convenient to deploy.

But if we have to source another env we loose all these binaries - how can we use the already existing env to run MIP?

Remake RemoveRedundantFiles

Update Genmod >2.0.6

Add CLNACC for deep linking in Scout

Add spidex database annotation

Update dbNSFP

Investigate hg19, hg20 changes and cmd line

Update 1000G

Absolute path in generated MIP call

It would be a nice feature to generate the complete perl mip.pl call using absolute paths only. For example the config file. Now if you want to restart a failed analysis you need to worry about where you had cd'd to before running the original MIP command.

-configFile ../bioinfo_rasta_config.yaml

GATKVariantRecalibration vcf to bcf

Migration guide/change log

Hi,

was looking for either a change log describing what's different between v1.x and v2.0 or a migration guide for previous user of the v1.x pipeline. Does such a thing exist or is it planned?

Integrate Cosmid database manager

https://github.com/robinandeer/cosmid

Update genmod

AnalysisDate not set in QC sample info

Hi,

running latest version of MIP (master). After running the pipeline and checking that "AnalysisRunStatus: Finished" I still don't have an "AnalysisDate" in my *_qc_sampleInfo.yaml file. When is the "AnalysisDate" supposed to be set? During "AnalysisRunStatus"?

I believe that I've rerun some step (perhaps MIP in dryRun mode) and this has added the date in there...

Incompatible command option

Hi, I'm trying out the new -pythonVirtualEnvironmentCommand-option. I would like to set the value to a string with two words "source activate". However it seems as if MIP doesn't parse this from the command line as expected.

-pythonVirtualEnvironment mip2.0 -pythonVirtualEnvironmentCommand source activate -pQCCollect 1

In the generated SBATCH files it prints "source" and misses the second term "activate". Is there a workaround for this that I'm not aware of?

Simply adding quotation marks around the option in the YAML file will likely not help I guess.

Suggestions

See in which intron a variant is found. Should you have to go to IGV?

Update HaploTypeCaller annotations

Add '-an SOR' and remove old UnifiedGenoType annotations

Update clinvar

Update to: version 150504

YAML key unexpectadly integer

For some reason when MIP is run with a family ID that happens to be an integer, the keys in the YAML output are marked as integers which is a little confusing when you are trying to read it.

Is it possible to enforce the keys to always be strings?

{14049: {'SIB914A22': {'ExomeTargetBedInfileLists': ... }}}

Update clinvar

Added pcrIndelModel None as default when using 'at genomes'

Add Genetic_disease_model from clinical_list

Adding a sample to pedigree does not update qc_sampleInfo for pedigreeElements for reanalysis runs

Lock MIP to specific software versions

To better position MIP as a clinical grade pipeline it would be nice to be able to infer version numbers of each component from the overall version of MIP.

This means that a bug fix update to Chanjo would be explicitly implemented in MIP by also bumping the MIP version accordingly.

Enable exome & genomes joint analysis

Add flags to disable '-an DP' and 'pcrIndelModel' when performing joint exome and genome analysis

nocmdinput

"nocmdinput" is added to all options. This could therefore be handled downstream by e.g. DefineParameters.

Suggestion: deprecate this argument.

Independent log files for all processes

Now all invocations of programs through xargs or sent to one log file, which makes it really hard to determine which program specifically has produced an error. It creates really messy log files and makes it really time consuming to debug.

e.g. if a bash script like BAMCalibrationAndGTBlock__.0.sh is created, it will send all output to
..../MIP_ANALYSIS/cust00X/170/analysis/exomes/sampleid/bwa/info/BAMCalibrationAndGTBlock__.std[err|out].txt

Which is horrible to read. Errors are not matched up to output and because xargs doesn't halt execution on errr, you mostly get regular output after an error occurred.

Any solution to this would be awesome :)

QC collect error

Hi, can you help me out with an error I haven't seen before?

From STDERR:

Argument "\x{6f}\x{74}..." isn't numeric in numeric eq (==) at /mnt/hds/proj/bioinfo/MIP_ANALYSIS/modules/MIP/qcCollect.pl line 512.

The line referred to is:

if ( ($$chanjoSexCheckGenderRef eq "female") && ($sampleInfoFile{$$familyIDRef}{$$sampleIDRef}{'Sex'} == 2) ) { #Female

And as for the chanjo output:

#X_coverage     Y_coverage      sex
7.01124753059   0.0720640965212 female

Add contig tag to clinical master list, repopulate selected contigs based on tags

Only analyse what actually exists
This will avoid processing of empty contig files
Removing of clear traps
Enable tiny clinical lists to be processed without breaking the analysis

Open sourcing dbCMMS?

I don't know if it matters to you but haven't you open sourced dbCMMS by including it in this repository?

It seems like this is the kind of thing that you (KI/CMM) would like to claim some ownership over, no? I don't think that the current license even requires attribution to use the dbCMMS list of genes for any purpose.

In any case, I find it kind of odd that a pipeline supposed to be used by multiple users includes references to a specific lab.

Perhaps it would be a better idea to provide the list under a stricter license in an Amazon S3 bucket or simply a different GitHub repo.

Non-standard perl installation

Hi,

I think it would really help to use more general hashbang, especially to be able to use the MIP script as an executable. In Python this is the standard way of writing it, in fact.

Consider something like this:

#!/usr/bin/env perl

Then the script will be invoked using whatever perl executable comes first in the $PATH variable.

Add CalculateGenoTypePosterior for trios

Add automatic detection of trios
Enabled automatic processing using CalculateGenoTypePosterior for trios post VariantRecalibration
Added new reference '1000G_phase3_v4_20130502.sites.vcf' as supporting data set using flag 'GATKCalculateGenotypePosteriorsSupportSet'

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.