google / deepsomatic

DeepSomatic is an analysis pipeline that uses a deep neural network to call somatic variants from tumor-normal sequencing data.

License: BSD 3-Clause "New" or "Revised" License

bioinformatics deep-learning dna genome genomics machine science sequencing somatic-mutations somatic-variants

deepsomatic's Introduction

DeepSomatic

DeepSomatic is an extension of the deep learning-based variant caller DeepVariant. It takes aligned reads (in BAM or CRAM format) from tumor and normal data, produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports somatic variants in a standard VCF or gVCF file.

DeepSomatic supports somatic variant-calling from tumor-normal sequencing data.

The following case studies show example runs for supported technologies:

  • Illumina tumor-normal whole genome sequencing case study.

  • PacBio tumor-normal whole genome sequencing case study.

This is the first release of DeepSomatic. Properties such as runtime, accuracy across different sample preparations, and supported technologies will evolve with future releases. Your active feedback will help us prioritize use cases most important for the genomics community.

For details around runtime and accuracy expectations, please see the DeepSomatic metrics page.

NOTE: At this time, DeepSomatic has not been trained or optimized for FFPE-prepared samples, so it will likely not run successfully on FFPE-prepared data.

How to run DeepSomatic

# Flag notes:
#   --model_type    Either WGS or PACBIO.
#   --ref           Path to the reference FASTA file.
#   --reads_normal  Path to the normal BAM file.
#   --reads_tumor   Path to the tumor BAM file.
#   --output_vcf    Path to the output VCF file.
#   --output_gvcf   Path to the output gVCF file.
#   --num_shards    Total number of threads to use.
#   --logging_dir   Log output directory.
#   --regions       Region of the genome; if omitted, the whole genome is processed.
#   --dry_run       Default is false. If true, commands are printed but not executed.
sudo docker run \
  -v "${INPUT_DIR}":"${INPUT_DIR}" \
  -v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
  google/deepsomatic:"${BIN_VERSION}" \
  run_deepsomatic \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/REF.fasta \
  --reads_normal="${INPUT_DIR}"/normal.bam \
  --reads_tumor="${INPUT_DIR}"/tumor.bam \
  --output_vcf="${OUTPUT_DIR}"/OUTPUT.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/OUTPUT.g.vcf.gz \
  --sample_name_tumor="tumor" \
  --sample_name_normal="normal" \
  --num_shards=$(nproc) \
  --logging_dir="${OUTPUT_DIR}"/logs \
  --intermediate_results_dir="${OUTPUT_DIR}"/intermediate_results_dir \
  --regions=chr1 \
  --dry_run=false
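
When the command above is launched from a script, for example once per region, it can help to assemble the argument list programmatically. The following is a minimal sketch and not part of DeepSomatic: the helper name `build_deepsomatic_command` and its parameters are hypothetical, and it uses only flags that appear in the command above. It builds the argument list without executing anything.

```python
from typing import List, Optional


def build_deepsomatic_command(
    bin_version: str,
    input_dir: str,
    output_dir: str,
    model_type: str = "WGS",
    regions: Optional[str] = None,
    num_shards: int = 1,
) -> List[str]:
    """Return a docker argument list for one run (hypothetical helper)."""
    # The command above documents WGS and PACBIO as the accepted values.
    if model_type not in ("WGS", "PACBIO"):
        raise ValueError("model_type must be WGS or PACBIO")
    cmd = [
        "docker", "run",
        "-v", f"{input_dir}:{input_dir}",
        "-v", f"{output_dir}:{output_dir}",
        f"google/deepsomatic:{bin_version}",
        "run_deepsomatic",
        f"--model_type={model_type}",
        f"--ref={input_dir}/REF.fasta",
        f"--reads_normal={input_dir}/normal.bam",
        f"--reads_tumor={input_dir}/tumor.bam",
        f"--output_vcf={output_dir}/OUTPUT.vcf.gz",
        f"--num_shards={num_shards}",
        f"--logging_dir={output_dir}/logs",
    ]
    # Omitting --regions runs on the whole genome, as noted above.
    if regions is not None:
        cmd.append(f"--regions={regions}")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.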

Please follow the Quick Start for more details on the different setups, such as Docker and Singularity, available for DeepSomatic.

Example output

DeepSomatic uses the FILTER field of the VCF format to report identified germline and somatic variants. The filter descriptions can be found in the VCF header:

##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##FILTER=<ID=GERMLINE,Description="Non somatic variants">
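
To read these descriptions programmatically, the header lines can be parsed into a mapping from filter ID to description. This is a minimal sketch using only the Python standard library; the function name `parse_filter_header` is hypothetical, and the regular expression assumes the simple `ID=...,Description="..."` layout shown above rather than every form the VCF specification allows.

```python
import re
from typing import Dict, Iterable

# Matches header lines of the form ##FILTER=<ID=NAME,Description="TEXT">
_FILTER_RE = re.compile(r'^##FILTER=<ID=([^,]+),Description="(.*)">$')


def parse_filter_header(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FILTER ID to its description (hypothetical helper)."""
    filters: Dict[str, str] = {}
    for line in lines:
        match = _FILTER_RE.match(line.strip())
        if match:
            filters[match.group(1)] = match.group(2)
    return filters


header = [
    '##FILTER=<ID=PASS,Description="All filters passed">',
    '##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">',
    '##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">',
    '##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">',
    '##FILTER=<ID=GERMLINE,Description="Non somatic variants">',
]
filters = parse_filter_header(header)
for filter_id, description in filters.items():
    print(f"{filter_id}: {description}")
```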

For example, the variants reported below:

# CHROM POS     ID  REF ALT QUAL    FILTER      INFO    FORMAT              SAMPLE_NAME
chr1    14001   .   A   G   3.7     GERMLINE    .       GT:GQ:DP:AD:VAF:PL  0/0:4:8:4,4:0.5:1,0,34
chr1    14002   .   T   A   0       RefCall     .       GT:GQ:DP:AD:VAF:PL  0/0:51:60:57,2:0.0333333:0,51,58
chr1    14003   .   C   G   43.8    PASS        .       GT:GQ:DP:AD:VAF:PL  1/1:43:74:0,74:1:43,52,0

In this example:

  • The variant with GERMLINE FILTER status is identified as a germline variant.
  • The variant with RefCall FILTER status is homozygous for the reference allele.
  • The variant with PASS FILTER status is a somatic variant.
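
The interpretation above can be sketched in code. The example below rebuilds the three records as tab-delimited lines, labels each one by its FILTER value, and pairs the FORMAT keys with the sample values. The helper `parse_record` and the labels in `FILTER_LABELS` are hypothetical names chosen for this sketch. In these three rows the VAF value equals the alternate allele depth divided by DP; that is an observation about the example, not a documented definition.

```python
from typing import Any, Dict, List, Tuple

# The three example records from above, rebuilt as tab-delimited VCF lines.
EXAMPLE_RECORDS: List[str] = [
    "\t".join(fields)
    for fields in (
        ("chr1", "14001", ".", "A", "G", "3.7", "GERMLINE", ".",
         "GT:GQ:DP:AD:VAF:PL", "0/0:4:8:4,4:0.5:1,0,34"),
        ("chr1", "14002", ".", "T", "A", "0", "RefCall", ".",
         "GT:GQ:DP:AD:VAF:PL", "0/0:51:60:57,2:0.0333333:0,51,58"),
        ("chr1", "14003", ".", "C", "G", "43.8", "PASS", ".",
         "GT:GQ:DP:AD:VAF:PL", "1/1:43:74:0,74:1:43,52,0"),
    )
]

# Hypothetical labels for this sketch, following the list above.
FILTER_LABELS: Dict[str, str] = {
    "PASS": "somatic",
    "GERMLINE": "germline",
    "RefCall": "reference",
}


def parse_record(line: str) -> Dict[str, Any]:
    """Split one single-sample VCF data line into named fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, filt, _info, fmt, sample = fields[:10]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "filter": filt,
        # Pair FORMAT keys with sample values, e.g. {"GT": "0/0", "DP": "8", ...}
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


summaries: List[Tuple[int, str, float]] = []
for line in EXAMPLE_RECORDS:
    record = parse_record(line)
    label = FILTER_LABELS.get(record["filter"], "filtered")
    alt_depth = int(record["sample"]["AD"].split(",")[1])
    depth = int(record["sample"]["DP"])
    summaries.append((record["pos"], label, alt_depth / depth))

for pos, label, alt_fraction in summaries:
    print(f"chr1:{pos}\t{label}\talt_depth/DP={alt_fraction:.4f}")
```

Any FILTER value missing from the mapping, such as LowQual or NoCall, falls back to the generic label `filtered`.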

Prerequisites

  • Unix-like operating system (cannot run on Windows)
  • Python 3.8

Contribution Guidelines

Please open a pull request if you wish to contribute to DeepSomatic. Note, we have not set up the infrastructure to merge pull requests externally. If you agree, we will test and submit the changes internally and mention your contributions in our release notes. We apologize for any inconvenience.

If you have any difficulty using DeepSomatic, feel free to open an issue. If you have general questions not specific to DeepSomatic, we recommend that you post on a community discussion forum such as BioStars.

License

BSD-3-Clause license

Disclaimer

This is not an official Google product.

NOTE: the content of this research code repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

deepsomatic's People

Contributors

kishwarshafin

deepsomatic's Issues

Depth of coverage for WGS

Hi, does the performance depend on the depth of coverage of the WGS data? Is the model trained on a specific depth of coverage, and does deviating from it in test samples affect or bias performance?

Feature request: Output normal allelic depth

@kishwarshafin, likely not a priority at this point, but outputting normal depth (as a separate sample column) could be useful for:

  1. Tools that ingest somatic VCF, e.g. Purple in hmftools suite for purity/ploidy estimation
  2. Manual inspection and QC visualization of variants

Thank you!

Performance regression with a different HCC1395 Illumina library

Hi,

First of all, thank you for open sourcing DeepSomatic. Hope that it is going to be a useful tool for the community moving forward.

I am unable to reproduce the variant calling performance results shown in the Illumina case study page in the docs, with a different library of the same HCC1395 T/N sample pair from the SEQC2 consortium. WGS_NS_T_1 (https://www.ncbi.nlm.nih.gov/sra/SRX4728475) and WGS_NS_N_1 (https://www.ncbi.nlm.nih.gov/sra/SRX4728425) were used for this purpose as T/N sample pairs.

  • You can find the whole genome T/N (85x/70x) bams for WGS_NS_T_1 and WGS_NS_N_1 here - gs://lancet2-test-datasets/SEQC2/single_library
  • We downsampled coverage to the same T/N coverage as in the case study to check if it makes a difference. (It didn't.)
  • Two runs of the case study data were performed with two evaluation tools (RTG vcfeval & hap.py) to show that the tools don't change the performance results.

Results from all the runs that were tested are summarized in this Google Sheet. Red cells highlight the significant loss in precision when using the different library, compared to the dataset provided in the case study example, shown in green.
https://docs.google.com/spreadsheets/d/1ReOMR85lPvC_Y6xZCPiY1SOFRDmQ5NXTSMUL0rOaaNA/edit?usp=sharing

  1. Do you have any thoughts or suggestions on why there is such a big difference in precision when using a different library of the same sample? Is this expected? If there is anything I am missing or doing incorrectly, please do let me know.
  2. It might be useful in general to document which specific library (among the SEQC2 consortium datasets - https://sites.google.com/view/seqc2/home/sequencing) is being used in the case study.

Why are there no heterozygous sites?

Dear,

I have already tested DeepSomatic on data from both the Illumina and PacBio platforms,

but I found a puzzling result: none of the PASS variants is heterozygous; all PASS variants are homozygous.

I wonder why.

PacBio case study

Hi :D

I am super excited by this release. It seems like an actually cool tool to try!

Regarding the PacBio data example, on which sequencer was it produced? I ask because PacBio is well known for long-read sequencing, but they have now also released the Onso short-read sequencer, so it would be nice if the example included details about the sequencing library and the sequencer.

Thank you,

Pedro

Does it call clonal vs. subclonal variants?

Hi,
It's a great tool using the neural network approach. Could you clarify whether this tool classifies somatic clonal and subclonal variants? Or does it classify germline variants vs. somatic variants?

Thanks again!!

DeepSomatic freezes after call_variants when there is 0 example

@kishwarshafin Khi PIn here. Filing a report of an issue I'm seeing. I'm trying to integrate DeepSomatic into a pipeline. The pipeline currently splits the genome into equal intervals and uses the --regions parameter to call variants in just one interval at a time. However, when I tested this with a toy example that has no variants in most intervals, the DeepSomatic step stopped progressing after call_variants.

Attaching a zip file of the log files collected here with the command used (command file). Thanks.

logs.zip

Typo in readme?

Hello!

Thanks for developing this tool, eager to try it out.

I saw what is possibly a typo in the README: the normal and tumor reads arguments both refer to the same file path, normal.bam. I know it's not supposed to be a real file, but it might confuse people if both the normal BAM and the tumor BAM arguments point to the same file in the "quickstart" code.

--reads_normal=${INPUT_DIR}/normal.bam \ **Path to normal bam file.
--reads_tumor=${INPUT_DIR}/normal.bam \ * Path to tumor bam file.

Best,
DMP

Nanopore model availability

I wonder if there are plans for training DeepSomatic for Nanopore samples anytime soon? If not, would you expect the performance to deteriorate greatly if I were to select the PacBio model for our Nanopore RNA-seq samples?
