
shishenyxx / deepmosaic


DeepMosaic is a deep-learning-based mosaic single nucleotide classification tool without the need of matched control information.

Home Page: https://www.nature.com/articles/s41587-022-01559-w

License: Other

Python 97.81% Shell 2.19%
bioinformatics deep-learning human-genetics mosaic-mutation

deepmosaic's Introduction

DeepMosaic

Visualization and control-independent classification tool of noncancer (somatic or germline) mosaic single nucleotide variants (SNVs) with deep convolutional neural networks. Originally written by Virginia (Xin) Xu and Xiaoxu Yang, maintained by Arzoo Patel.


Contents

Overview

Requirements before you start

Installation

Usage

-Step 1. Feature extraction and visualization of the candidate mosaic variants (DeepMosaic Visualization Module)

-Step 2. Prediction for mosaicism (DeepMosaic Classification Module)

Demo

Model Training

Singularity

Performance

Q&A

Cite DeepMosaic

Licence

Maintenance Team

Contact


Overview

  • DeepMosaic Visualization Module: Information of aligned sequences for any SNV represented with an RGB image:


An RGB image was used to represent the pileup results for all the reads aligned to a single genomic position. Reads supporting different alleles were grouped, in the order of the reference allele, the first, second, and third alternative alleles, respectively. Red channel was used to represent the bases, green channel for the base qualities, and blue channel for the strand orientations of the read. Note that the green channel is modified to show better contrast for human eyes.

  • DeepMosaic Classification Module: Workflow from variant to result (10 models were compared, and EfficientNet-b4 was selected as the default because it performed best on a gold-standard benchmark dataset):

Workflow of DeepMosaic with the best-performing deep convolutional neural network model after benchmarking. Variants were first transformed into images based on the alignment information. A deep convolutional neural network then extracted the high-dimensional information from the image, and experimental, genomic, and population-related information was further incorporated into the classifier.
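The image representation described above can be illustrated with a much-simplified sketch. It encodes only the reads covering a single position as one column of (red, green, blue) pixels, grouped with the reference allele first and the alternative alleles after it. The numeric values chosen for each base and strand are hypothetical, and the real DeepMosaic images cover a window of positions rather than one column.

```python
# Hypothetical channel values; the encoding DeepMosaic actually uses is not given here.
BASE_TO_RED = {"A": 50, "C": 100, "G": 150, "T": 200}
STRAND_TO_BLUE = {"+": 100, "-": 200}


def encode_pileup_column(reads, ref, alts):
    """Turn reads at one position into a list of (R, G, B) pixels.

    reads: list of (base, base_quality, strand) tuples.
    ref:   reference base.
    alts:  alternative bases in the order they should be grouped.
    """
    order = [ref] + list(alts)
    rank = {base: i for i, base in enumerate(order)}
    # Keep only reads supporting a listed allele; sorted() is stable,
    # so reads keep their original order within each allele group.
    grouped = sorted((r for r in reads if r[0] in rank), key=lambda r: rank[r[0]])
    return [
        (BASE_TO_RED[base], min(quality, 255), STRAND_TO_BLUE[strand])
        for base, quality, strand in grouped
    ]


pixels = encode_pileup_column(
    [("C", 30, "+"), ("G", 20, "-"), ("G", 35, "+")], ref="G", alts=["C"]
)
```

Here the two reference-supporting reads come first in `pixels`, followed by the single read supporting the alternative allele.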

Return to Contents


Requirements before you start

Some of the versions of packages are provided as an example in this list.

Alternatively, you can use the singularity container. See Singularity.

Return to Contents


Installation

We now provide a Singularity image for running DeepMosaic. If you want to install and run DeepMosaic manually, please read through and follow these steps. They can be performed in a command-line shell environment (Linux, Mac, Windows Subsystem for Linux, etc.) on any machine that has the computational resources and more than 20 GB of storage to run DeepMosaic.

Step 1. Install DeepMosaic

Make sure git-lfs is installed in your environment so that this repository downloads correctly: download git-lfs, unzip the tar.gz, put the git-lfs binary in your bin folder (or anywhere on your $PATH), and run git lfs install to initialize it. You only need to do this once.

> git clone --recursive https://github.com/shishenyxx/DeepMosaic

Make sure you cloned the whole repository; the total folder size should be about 4 GB.

> cd DeepMosaic   

Step 2. Install dependency: BEDTools (via conda)

> conda install -c bioconda bedtools    

Step 3. Install dependency: ANNOVAR

a) Go to the ANNOVAR website and click "here" to register and download the annovar distribution.

b) Once you have successfully downloaded the ANNOVAR package, run

> cd [path to ANNOVAR]

> perl ./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/    

to install the hg19.gnomad_genome file needed for the feature extraction from the bam file.

Return to Contents


Usage

Step 1. Feature extraction and visualization of the candidate mosaic variants (Visualization Module)

This step is used for the extraction of genomic features of the variant from raw bams as well as population information. It can serve as an independent tool for the visualization and evaluation of mosaic candidates.

Usage

> [DeepMosaic Path]/deepmosaic/deepmosaic-draw -i <input.txt> -o <output_dir> -a <path_to_ANNOVAR> -b <genome_build> -db <name_of_annovar_db>

Note:

  1. input.txt file should be in the following format.

Input format

#sample_name bam vcf depth sex
sample_1 sample_1.bam sample_1.vcf 200 M
sample_2 sample_2.bam sample_2.vcf 200 F

Each line of input.txt is a sample with its aligned reads in the bam format (with index in the same directory), and its candidate variants in the vcf (or vcf.gz) format. User should also provide the sequencing depth and the sex (M/F) of the corresponding sample. Sample name (#sample_name column) should be a unique identifier for each sample; duplicated names are not allowed.
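The constraints on input.txt listed above can be checked before starting a long run. The following is a minimal sketch, not part of DeepMosaic itself; the function name `parse_input_table` is hypothetical, and fields are assumed to be separated by whitespace.

```python
def parse_input_table(text):
    """Parse and validate the contents of an input.txt file.

    Returns a list of dicts, one per sample. Raises ValueError on
    duplicated sample names, non-integer depth, or sex other than M/F.
    """
    samples = []
    seen = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        name, bam, vcf, depth, sex = fields
        if name in seen:
            raise ValueError(f"line {line_no}: duplicated sample name {name!r}")
        if not depth.isdigit():
            raise ValueError(f"line {line_no}: depth must be an integer, got {depth!r}")
        if sex not in ("M", "F"):
            raise ValueError(f"line {line_no}: sex must be M or F, got {sex!r}")
        seen.add(name)
        samples.append(
            {"sample_name": name, "bam": bam, "vcf": vcf, "depth": int(depth), "sex": sex}
        )
    return samples


example = """#sample_name bam vcf depth sex
sample_1 sample_1.bam sample_1.vcf 200 M
sample_2 sample_2.bam sample_2.vcf 200 F
"""
samples = parse_input_table(example)
```

The sketch does not check that the bam, its index, or the vcf actually exist on disk; that could be added with `os.path.exists` if desired.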

Note that the sequencing depth is required to increase specificity. If you do not know the average depth, we recommend running a quick pilot depth analysis with SAMtools mpileup on several hundred variants, or a complete depth-of-coverage analysis. The depth value should be an integer.
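As a rough sketch of the pilot depth analysis, the average depth can be estimated from SAMtools mpileup text output, where the fourth tab-separated column of each line is the read depth at that position (single-bam output is assumed). The helper name `mean_depth` is hypothetical, and rounding to the nearest integer is one possible choice for producing the integer value required in input.txt.

```python
def mean_depth(mpileup_lines):
    """Average the depth column (4th field) of samtools mpileup output lines."""
    depths = [int(line.split("\t")[3]) for line in mpileup_lines if line.strip()]
    if not depths:
        raise ValueError("no mpileup records found")
    return round(sum(depths) / len(depths))


# Made-up records: chrom, pos, ref base, depth, read bases, base qualities.
pilot = [
    "1\t17697\tG\t198\t...\t...",
    "1\t19890\tT\t202\t...\t...",
    "1\t20001\tA\t203\t...\t...",
]
estimated_depth = mean_depth(pilot)
```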

  1. DeepMosaic supports no-loss image representation for sequencing depth up to 500x. Reads with deeper sequencing depth will be randomly down-sampled to 500x during image representation.

  2. sample.bam is a bam file generated through alignment, sorting, duplicate marking, indel realignment, and base quality score recalibration. You can follow the BSMN common pipeline for both GRCh37 and GRCh38, or this pipeline for GRCh37 alignment specifically. Note that this used to be the best pipeline for GATK3 and earlier versions. From GATK4 onwards, however, indel realignment is integrated into HaplotypeCaller and MuTect2, so if you want to use any external tools you have to prepare the bam with an earlier GATK version; the tutorials should be here.

  3. sample.vcf is the vcf file of the input variants you are interested in, or a prior file generated by GATK HaplotypeCaller with ploidy 50 as described in previous pipelines, or by MuTect2 single mode. One vcf should be provided for each input bam, in the following format (gzipped vcf files are also recognized):

sample.vcf format

#CHROM POS ID REF ALT ...
1 17697 . G C .
1 19890 . T C .

"#CHROM", "POS", "REF", "ALT" are essential columns that will be parsed and utilized by DeepMosaic.

When using MuTect2, we recommend "PASS" vcfs as input for DeepMosaic. Instructions for running MuTect2 in single mode, generating the panel of normals, and downstream filtering can be found in the official GATK tutorials or in this example snakemake pipeline.
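One way to prepare such an input is to keep only single-nucleotide records whose FILTER column is PASS. The sketch below works on vcf text lines and relies only on the standard VCF column order (FILTER is the seventh column); the function name `pass_snvs` is hypothetical, and multi-allelic records are simply dropped here, which may not be what every workflow wants.

```python
BASES = {"A", "C", "G", "T"}


def pass_snvs(vcf_lines, require_pass=True):
    """Keep header lines plus single-base REF/ALT records (optionally PASS only)."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record
        ref, alt = fields[3], fields[4]
        is_snv = ref in BASES and alt in BASES
        passed = (not require_pass) or (len(fields) > 6 and fields[6] == "PASS")
        if is_snv and passed:
            kept.append(line)
    return kept


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER",
    "1\t17697\t.\tG\tC\t.\tPASS",
    "1\t19890\t.\tT\tC\t.\tgermline",
    "1\t20100\t.\tTA\tT\t.\tPASS",
]
filtered = pass_snvs(records)
```

In this example only the header and the first record survive: the second record is not PASS and the third is a deletion.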

  1. The output files, including the extracted features and encoded images, will be written to [output_dir]. DeepMosaic will create the directory if [output_dir] does not already exist.

  2. path_to_ANNOVAR is the absolute path to the ANNOVAR program directory.

  3. genome_build is the build version of the reference genome; currently hg19 and hg38 are supported. Defaults to hg19.

  4. name_of_annovar_db is the name of the db you want to use from the annovar subdirectory [annovar/humandb]. For example, if you want to use annovar/humandb/hg38_gnomad312_genome.txt, you would use -db gnomad312_genome. This option is passed directly to the annovar command as --dbtype. Defaults to gnomad_genome.

  5. To generate h5 files for other genome builds (not recommended), please follow this link. Note that this package runs in Python 2.7.
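The random down-sampling to 500x mentioned in the notes above can be pictured with a short sketch. It is only an illustration of the idea; the function name `downsample_reads` and the optional `seed` parameter are hypothetical and do not describe how DeepMosaic implements the step internally.

```python
import random

MAX_DEPTH = 500  # depth limit stated in the notes above


def downsample_reads(reads, max_depth=MAX_DEPTH, seed=None):
    """Return at most max_depth reads, chosen uniformly at random without replacement."""
    reads = list(reads)
    if len(reads) <= max_depth:
        return reads  # nothing is lost at or below the limit
    rng = random.Random(seed)
    return rng.sample(reads, max_depth)


deep = downsample_reads(range(800), seed=1)
shallow = downsample_reads(range(300))
```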

Output:

After deepmosaic-draw is successfully executed, the following files/directories would be generated in the [output_dir]

  1. features.txt contains the extracted features and the absolute path to the encoded image (.npy) file for each variant in each row. features.txt will serve as input file to the next step of mosaicism prediction.

features.txt format

#sample_name sex chrom pos ref alt variant maf lower_CI upper_CI variant_type gene_id gnomad all_repeat segdup homopolymer dinucleotide depth_fraction image_filepath npy_filepath
sample_1 M 1 17697 G C 1_17697_G_C 0.18236472945891782 0.15095348571574527 0.21862912439071866 ncRNA_exonic WASH7P 0.1231 1 1 0 0 3.09 /.../images/sample_1-1_17697_G_C.jpg /.../matrices/sample_1-1_17697_G_C.npy
  1. matrices is a directory of the encoded image representations in the .npy format for all the candidate variants from all samples. Names of the file would be in the format of [sample_name]-[chrom]_[pos]_[ref]_[alt].npy.

  2. images is a directory of the encoded image representations in the .jpg format for all the candidate variants from all samples. Names of the file would be in the format of [sample_name]-[chrom]_[pos]_[ref]_[alt].jpg. Image files in this directory could be directly open and inspected visually by users.

  3. repeats_annotation.bed is the intermediate file annotating the repeat and segdup information of each variant.

  4. input.hg19_gnomad_genome_dropped, input.hg19_gnomad_genome_filtered, input.exonic_variant_function, input.variant_function are ANNOVAR outputs annotating the gnomad and variant function information.
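The file naming convention described above can be written down as a small helper, which is handy when looking up the image for a particular variant. This is a sketch under the naming pattern given in the list; the function names are hypothetical.

```python
def variant_id(chrom, pos, ref, alt):
    """Build the [chrom]_[pos]_[ref]_[alt] identifier used in the 'variant' column."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def encoded_paths(output_dir, sample_name, chrom, pos, ref, alt):
    """Return the expected (.jpg, .npy) paths for one variant of one sample."""
    stem = f"{sample_name}-{variant_id(chrom, pos, ref, alt)}"
    return (
        f"{output_dir}/images/{stem}.jpg",
        f"{output_dir}/matrices/{stem}.npy",
    )


jpg_path, npy_path = encoded_paths("out", "sample_1", "1", 17697, "G", "C")
```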

Return to Contents


Step 2. Prediction for mosaicism (Classification Module)

Usage

> [DeepMosaic Path]/deepmosaic/deepmosaic-predict -i <output_dir/features.txt> -o <output.txt> -m [prediction_model (default: efficientnet-b4_epoch_6.pt)] -b [batch_size (default: 10)] -gb <genome_build>

Note:

  1. output_dir/features.txt is the output file from the previous step.

  2. output.txt is the final prediction results.

  3. prediction_model is the pretrained DeepMosaic model. The default one (the best-performing model, efficientnet-b4_epoch_6.pt) was trained on our training set for 6 epochs, starting from the efficientnet-b4 architecture.

  4. batch_size is the number of images (variants) predicted by the DeepMosaic model simultaneously. A larger batch size means more memory and faster prediction. Users can adjust this value to match their available computing resources. The default batch size is 10.

  5. genome_build is the build version of the reference genome, currently hg19 and hg38 are supported.

Output:

Output format

#sample_name sex chrom pos ref alt variant maf lower_CI upper_CI variant_type gene_id gnomad all_repeat segdup homopolymer dinucleotide depth_fraction score1 score2 score3 prediction image_filepath
sample_1 M 1 17697 G C 1_17697_G_C 0.18236472945891782 0.15095348571574527 0.21862912439071866 ncRNA_exonic WASH7P 0.1231 1 1 0 0 3.09 0.9999058880667084 6.519687262508766e-10 9.411128132280348e-05 artifact /.../images/sample_1-1_17697_G_C.jpg
  1. The prediction result is in the column "prediction". The possible results are mosaic, heterozygous, ref_homozygous, alt_homozygous or artifact. Only variants marked as mosaic are predicted mosaic-positive by DeepMosaic. The prediction decision combines the mosaic score generated by the DeepMosaic deep-learning model with the extracted, user-input, and annotated features such as maf, depth_fraction, repeat, segdup, etc. All genomic coordinates and annotations are based on the GRCh37d5 reference genome.

  2. Image representations of the variants are stored in the files indicated by "image_filepath" column. User can directly open the .jpg files and visually inspect the piled reads for sanity check.

  3. Raw extracted, user-input, as well as annotated features are listed in the output file, to allow users to implement further filters:

maf, lower_CI, and upper_CI are the mutant allelic fraction and its 95% exact binomial confidence interval, calculated from the bam file.

variant_type and gene_id are annotated by ANNOVAR.

gnomad is annotated from the combined allele frequency in gnomAD (v2.1.1).

all_repeat and segdup are provided in the "resources" folder.

homopolymer and dinucleotide are calculated from the .h5 files in the "resources" folder.
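For readers who want to recompute an allelic fraction and its interval from read counts, the following sketch computes an exact binomial (Clopper-Pearson) interval with the standard library only, solving for each bound by bisection. It is one standard way to obtain an exact binomial interval and is not taken from the DeepMosaic code; the function names are hypothetical, and the values it returns are not guaranteed to match the lower_CI and upper_CI columns digit for digit.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(alt_reads, depth, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for alt_reads / depth."""

    def solve(increasing_func, target):
        # Bisection on [0, 1] for a function that increases with p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if increasing_func(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    k, n = alt_reads, depth
    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: 1.0 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Made-up counts for illustration: 12 alternate reads out of 200.
maf = 12 / 200
lower_ci, upper_ci = clopper_pearson(12, 200)
```

The interval always contains the observed fraction, and it widens as the depth decreases, which is one reason the depth of each sample matters for downstream filtering.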

We also provide a Snakemake wrapper for DeepMosaic users.

Return to Contents


Demo

We have provided a simple example in the "demo" sub-directory. The directory includes the input files and the expected results from running DeepMosaic. Users can refer to the example for the expected input and output formats.

"Demo" Directory hierarchy

--input.txt
---vcfs
 sample_1.vcf
 sample_2.vcf
 sample_3.vcf
 sample_4.vcf
---bams
 sample_1.bam  sample_1.bam.bai
 sample_2.bam  sample_2.bam.bai
 sample_3.bam  sample_3.bam.bai
 sample_4.bam  sample_4.bam.bai
---results
 features.txt                (intermediate result of running deepmosaic-draw)
 final_predictions.txt       (final result of running deepmosaic-predict)
 -----images (image encodings in .jpg formats)
 -----matrices (image encodings in .npy format to be used in prediction directly)
 repeat.annotation.bed       (intermediate file for repeat annotation)
 input.variant_function, input.exonic_variant_function, input.hg19_gnomad_genome_dropped, input.hg19_gnomad_genome_filtered, input.log (intermediate files after running annovar)

Demo input: input.txt

#sample_name bam vcf depth sex
sample_1 bams/sample_1.bam vcfs/sample_1.vcf 200 M
sample_2 bams/sample_2.bam vcfs/sample_2.vcf 200 M
sample_3 bams/sample_3.bam vcfs/sample_3.vcf 200 M
sample_4 bams/sample_4.bam vcfs/sample_4.vcf 200 M

Expected output: results/final_predictions.txt

#sample_name sex chrom pos ref alt variant maf lower_CI upper_CI variant_type gene_id gnomad all_repeat segdup homopolymer dinucluotide depth_fraction score1 score2 score3 prediction image_filepath
sample_1 M 10 25509499 A G 10_25509499_A_G 0.05737704918032788 0.03448247887605271 0.09399263167327017 intronic GPR158 0.0 0 0 1 0 1.22 0.00010761513038663674 3.852715883900453e-05 0.9998538577107744 mosaic results/images/sample_1-10_25509499_A_G.jpg
sample_2 M 14 37531674 A T 14_37531674_A_T 0.9948186528497408 0.9712392635106 0.9990847787125622 intronic SLC25A21 0.2267 1 0 1 1 0.98 0.19976294102631714 4.0270887857736005e-06 0.800233031884897 alternative_homozygous results/images/sample_2-14_37531674_A_T.jpg
sample_3 M 20 1805075 G T 20_1805075_G_T 0.018072289156626502 0.008308354195089811 0.03886110152464575 intergenic LOC100289473(dist=44683),SIRPA(dist=69738) 0.0 0 0 0 0 1.66 0.003562673370702711 2.9057256040721804e-06 0.9964344209036933 mosaic results/images/sample_3-20_1805075_G_T.jpg
sample_4 M 16 65589896 G C 16_65589896_G_C 0.5306122448979592 0.43252467204457545 0.6263904306010359 ncRNA_intronic LINC00922 0.3142 1 0 1 0 0.49 0.9998079754132149 5.6467567415316954e-08 0.00019196811921752858 heterozygous results/images/sample_4-16_65589896_G_C.jpg

Due to package differences and internal machine differences, the demo result on your machine might be slightly different from the numbers shown here (<0.1% deviations), but the overall prediction should be the same.
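A prediction table like the one above can be post-filtered with a few lines of Python. The sketch below keeps rows predicted as mosaic and, optionally, drops those annotated as homopolymer or dinucleotide repeats. It assumes whitespace-separated columns with a header line starting with "#", and the function names are hypothetical. The demo header spells one column "dinucluotide", so both spellings are accepted.

```python
def read_predictions(text):
    """Parse a prediction table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def mosaic_calls(rows, drop_repeats=True):
    """Keep rows predicted as mosaic; optionally drop homopolymer/dinucleotide sites."""
    kept = []
    for row in rows:
        if row["prediction"] != "mosaic":
            continue
        dinucleotide = row.get("dinucleotide", row.get("dinucluotide", "0"))
        if drop_repeats and (row["homopolymer"] != "0" or dinucleotide != "0"):
            continue
        kept.append(row)
    return kept


# Reduced version of the demo table, keeping only the columns used here.
table = """#sample_name variant homopolymer dinucleotide prediction
sample_1 10_25509499_A_G 1 0 mosaic
sample_2 14_37531674_A_T 1 1 alternative_homozygous
sample_3 20_1805075_G_T 0 0 mosaic
sample_4 16_65589896_G_C 1 0 heterozygous
"""
rows = read_predictions(table)
strict = [row["sample_name"] for row in mosaic_calls(rows)]
lenient = [row["sample_name"] for row in mosaic_calls(rows, drop_repeats=False)]
```

With the repeat filter on, only sample_3 remains; with it off, sample_1 is kept as well, which shows the specificity versus sensitivity trade-off of this kind of filter.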

Return to Contents


Model Training

If you have your own training set, you can train your own DeepMosaic model using trainModel.py.

-i: input file, tab-delimited |path_to_npy_file_generated_by_DeepMosaic_draw|label|

-e: number of training epochs

-o: output directory

--model_type: supported model types, see the model folder

--model_path: path to the base model (pt file)

example command:

python trainModel.py -i test_input_training_10.csv -e 2 --model_type efficientnet-b4 --model_path efficientnet-b4_epoch_6.pt -o ./test_trained_model
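A two-column, tab-delimited training input file of the kind described for -i could be assembled as follows. This is a sketch only: the label strings used here ("mosaic", "artifact") and the file paths are hypothetical, and the label vocabulary that trainModel.py expects should be checked against the repository before use.

```python
import csv
import io


def write_training_table(pairs, handle):
    """Write (npy_path, label) pairs as tab-delimited rows, one variant per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for npy_path, label in pairs:
        writer.writerow([npy_path, label])


# Hypothetical paths and labels, for illustration only.
pairs = [
    ("out/matrices/sample_1-1_17697_G_C.npy", "artifact"),
    ("out/matrices/sample_3-20_1805075_G_T.npy", "mosaic"),
]
buffer = io.StringIO()
write_training_table(pairs, buffer)
table_text = buffer.getvalue()
```

Replacing the `io.StringIO` buffer with an opened file (`open(path, "w", newline="")`) writes the same content to disk.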

Return to Contents


Singularity

Singularity containers can be found on Sylabs.

Note

  1. The singularity container currently only works with hg19/GRCh37 and hg38/GRCh38.
  2. You'll need your own copy of ANNOVAR outside the singularity (please specify the path of ANNOVAR in <options>).

Usage

Basic Usage

  1. singularity exec DeepMosaic.sif deepmosaic-draw <options>
  2. singularity exec DeepMosaic.sif deepmosaic-predict <options>

Training and using your own model

  1. singularity exec DeepMosaic.sif python /DeepMosaic/deepmosaic/trainModel.py <options>
  2. singularity exec DeepMosaic.sif deepmosaic-predict <options> --model-path <path_to_your_model>

See Usage and Model Training for more details.
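When the container is driven from a script, the basic usage line can be assembled as an argument list. The sketch below only builds the command and does not run it; the helper name `draw_command` and the example paths are hypothetical, and it uses only the -i, -o, -a and -b options listed in the Usage section. Depending on your Singularity setup, additional options for making host directories visible inside the container may be needed.

```python
import shlex


def draw_command(sif, input_txt, output_dir, annovar_path, build="hg19"):
    """Build the argument list for running deepmosaic-draw inside the container."""
    return [
        "singularity", "exec", sif, "deepmosaic-draw",
        "-i", input_txt,
        "-o", output_dir,
        "-a", annovar_path,
        "-b", build,
    ]


command = draw_command("DeepMosaic.sif", "input.txt", "out", "/opt/annovar", build="hg38")
printable = shlex.join(command)  # shell-quoted form, e.g. for logging
```

The list form can be handed directly to `subprocess.run(command, check=True)` without going through a shell.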

Return to Contents


Performance

  1. WGS

We estimated > 90% experimental validation rate for WGS data identified as "mosaic" by DeepMosaic (GRCh37).

  1. WES

We estimated ~40% experimental validation rate for WES data identified as "mosaic" by the current DeepMosaic WGS model (GRCh37).

Note that, according to our preliminary estimates, the performance of DeepMosaic on GRCh38 will be different.

Return to Contents


Q&A

Starting from Jan 2023, a new Q&A section will be added to the wiki page. Please also visit the issues or closed issues sections to see whether other users have already encountered the same questions.

  1. Q: How do I run DeepMosaic for multiple samples most efficiently?

    A: If you have a large number of variants in each file, run DeepMosaic in parallel by submitting each file in its own input file. If you have a relatively small number of variants per file but many files (samples), combine everything into one input file. If you have a huge vcf, you can split it into smaller vcfs and run them in parallel (for both visualization and quantification). You only need to split the vcf, not the bam file.

  2. Q: How do I balance/further filter the variants based on DeepMosaic output?

    A: For WGS variants, the exclusion of annotated homopolymer and dinucleotide repeats will remove false positives and increase the validation rate, but decrease the sensitivity.

  3. Q: What do Score 1, Score 2, and Score 3 mean in the output file?

    A: The three scores combine information from the complex features extracted by the neural network. In our experience, Score 1 behaves like a "het and homo probability", while Scores 2 and 3, especially Score 3, behave like a "potential mosaic possibility". In other words, the higher Score 1 is, the more likely the candidate is a germline variant, whereas the higher Score 3 is, the more likely the candidate is a mosaic variant. Both categories contain many potential artifacts, however, which is why the final output uses a more complex classifier.

  4. Q: How to deal with mitochondria and sex chromosomes?

    A: First, choose a reference genome that includes the mitochondrial genome as a separate chromosome. DeepMosaic is not specifically trained on mitochondrial variants, so we can't guarantee the result; we therefore suggest removing the MT variants from the DeepMosaic input. For sex chromosomes, DeepMosaic takes the sex of the input sample into consideration and also treats the pseudoautosomal regions separately.

  5. Q: Can I use DeepMosaic for cancer somatic mutation detection without control?

    A: The current DeepMosaic models do not support cancer samples. According to benchmarks, the specificity is high (0.97) while the sensitivity is low. We are training new models that support accurate single-sample detection of somatic mutations in cancer.

  6. Q: What genome versions does DeepMosaic support?

    A: DeepMosaic is benchmarked on GRCh37 (hg19). We are working on tests for GRCh38 (hg38) and are providing some scripts and annotation resources here. The model is still the same, so the main differences lie in the coordinates. We will make further updates when we finish new models trained on GRCh38 or CHM13. As most of our current benchmark experiments were carried out on GRCh37, we cannot guarantee the performance on GRCh38.

  7. Q: Why do I get errors about pickle_module.load(f, **pickle_load_args)?

    A: Because DeepMosaic was not fully downloaded. The entire model folder should be more than 200 MB. Please refer to the git-lfs section in the tutorial.
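The advice in the first question about splitting a huge vcf can be sketched as follows. Each chunk keeps a copy of the header lines so that it remains a valid vcf on its own, and the same bam is reused for every chunk. The function name `split_vcf` and the chunk size are hypothetical choices.

```python
def split_vcf(lines, records_per_chunk):
    """Split vcf lines into chunks of records, repeating the header in each chunk."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    return [
        header + records[start:start + records_per_chunk]
        for start in range(0, len(records), records_per_chunk)
    ]


vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t17697\t.\tG\tC",
    "1\t19890\t.\tT\tC",
    "2\t30500\t.\tA\tG",
]
chunks = split_vcf(vcf, records_per_chunk=2)
```

Each chunk would then get its own line in a separate input.txt (with a unique sample name and the same bam), and the resulting prediction tables can be concatenated afterwards.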

Return to Contents


Cite DeepMosaic

Yang X*,#, Xu X*, et al. Gleeson JG#. Control-independent mosaic single nucleotide variant detection with DeepMosaic. (Nature Biotechnology)

The Manuscript is also available here.


Licence

Released under GNU-GPL 3.0 licence.


Maintenance Team

Arzoo Patel

Virginia (Xin) Xu

Jiawei Shen

Xiaoxu Yang


Contact

If you have any questions please post a thread at the issues section or contact us at:

📧 Xiaoxu Yang: [email protected], [email protected]

📧 Virginia (Xin) Xu: [email protected]

📧 Joseph Gleeson: [email protected]

Return to Contents

Contributors

arzoopatel5, shishenyxx, virginiaxu


deepmosaic's Issues

CRAM input

Hi, Does DeepMosaic work with CRAM format input?

running DeepMosaic with hg38

Hi, I am trying to run DeepMosaic with hg38 and I found one resource is missing:

'Error: The requested file (/resources/all_repeats.hg38.bed) could not be opened. Error message: (No such file or directory). Exiting!'

Do you have this file ? Or could you provide description how to generate it?
Thanks!

Very high number of mosaic predictions

Hi,
My input into deepmosaic is mutect2 tumor only mode of 250x WGS healthy human tissue.
After mutect2 I run GATK LearnReadOrientationModel and FilterMutectCalls, and I keep only PASS variants. This leaves 119360 candidate variants

However, I am getting 12350 variants from Deepmosaic = mosaic prediction.

Even after filtering for gnomad = 0, all_repeat = 0 , segdup = 0, homopolymer = 0, dinucleotide = 0, I'm still getting 10,380 variants.

This seems much too high for 250x healthy tissue.

What do you recommend as input into deepmosaic instead of the above, or any other way to filter?

Thanks

Effect of adapters on the final calling

Hello Xiaoxu Yang,

I ran DeepMosaic on WES data preprocessed as recommended in the paper (duplicate marked and base recalibrated with GATK 4.0.4, indel realigned with GATK 3.8.1) but forgot to remove adapters before mapping. My dataset is quite large so before re-running all the analyses with trimmed reads, I wanted to ask:

  1. Could you confirm that you've been removing adapters before mapping when benchmarking DeepMosaic?
  2. Have you had the chance to assess the effect of adapter trimming on the final calling?

Thank you in advance!

NAs in the predictions

We have used this algorithm on deep exome data but there seems to be a lot of NAs in our predictions. Nearly half of the predictions are NAs. What could be the reason for it?

deepmosaic-draw: error: unrecognized arguments: -db gnomad_genome

Hi,
I'm getting this error:

usage: deepmosaic-draw [-h] -i INPUT_FILE [-f VCF_FILTERS] -o OUTPUT_DIR -a
                       ANNOVAR_PATH [-b BUILD]
deepmosaic-draw: error: unrecognized arguments: -db gnomad_genome

It looks like deepmosaic-draw does not have the -db option even though the documentation says it does.

I am using the singularity version of deepmosaic.

Indel predictions

Hi,
I'm testing Deepmosaic on a true positive and I have noticed that the predictions.txt file does not contain the true positive variant because it is a Indel. Predictions.txt does not contain any Indel that is present in the input vcf file and I tried also to recalculate Indel qualities in the BAM file with Dindel but it seems that this tool can handle only SNVs. Can you confirm this?

Best Regards,

Riccardo

Recommendations for calling candidate variants

Hi,

This looks like a great tool and I'd love to get started with it on some of our datasets. Do you have best practice recommendations for calling candidate variants upfront of your tool?

Thanks,
Wouter

Condition for "extra_mosaic_filters"

Hi,
I was looking at the prediction code and noticed the condition for extra_mosaic_filters filter here:

extra_mosaic_filters = (depth_fractions >= 0.6) & (depth_fractions <= 1.7) & (segdups == 0) & (all_repeats == 0) &\
(gnomads < 0.001) & (lower_CIs >= 0.5) & (upper_CIs < 0.5)

Shouldn’t the condition (lower_CIs >= 0.5) & (upper_CIs < 0.5) always evaluate to false? It seems contradictory since the lower bound of the confidence interval (CI) should not be greater than the upper bound.
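A quick check of this reasoning, as a sketch with made-up interval values (the helper name is hypothetical, and only the two confidence-interval terms of the quoted condition are reproduced):

```python
def passes_ci_condition(lower_ci, upper_ci):
    """Literal reading of the two CI terms in the quoted filter."""
    return lower_ci >= 0.5 and upper_ci < 0.5


# Made-up intervals; each satisfies lower <= upper, as any valid interval must.
intervals = [(0.03, 0.09), (0.43, 0.63), (0.97, 0.999)]
matches = [iv for iv in intervals if passes_ci_condition(*iv)]
```

No interval with lower <= upper can satisfy both terms, so `matches` stays empty.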

Thank you,
Duc

speed up step 1?

Hello,

I was trying to process a list of 4 million putative SNVs with your tool. Wondering if there's a way to speed up processing? It's been running for quite a few days. Would you advise running batches of variants independently and assembling the predictions at the end? Thanks!

Kunal

mac?

hello we are hoping to start using this soon
we actually do a lot of our work on Macs, do you know if it will work?
(apologies if it is covered somewhere, i didn't spot it)

different gnomad databases?

Hello!
Thank you for making this program!
I believe that the hg19/hg38 gnomad annotations are somewhat hardcoded into the program, specifically to v2.0.1, which for hg38 is installed with annovar as hg38_gnomad_genome.txt. I only have the newer versions of gnomad downloaded in my annovar installation.

  • hg38_gnomad312_genome.txt
  • hg38_gnomad30_genome.txt
  • hg38_gnomad211_genome.txt

Could you create an input option to control what version of gnomad is used for annotations?
I got around the error:

Error: the database file /annovar/annovar/humandb/hg38_gnomad_genome.txt is not present

by symlinking hg38_gnomad312_genome.txt to hg38_gnomad_genome.txt, which works but is not an ideal solution.

Questions about sample.vcf in step1

Hello,
I noticed that when using MuTect2 single mode you recommend "PASS" vcfs as input for DeepMosaic.
Is the MuTect2 single mode command similar to "gatk Mutect2 -R reference.fa -I sample.bam -O single_sample.vcf.gz"? And how do I generate "PASS" vcfs?
Could you help me?

Run errors

Hi, I'm running into trouble and getting several errors:

  1. I'm using the latest 1.1.1 version.

  2. Input is a BAM file with config:
    #sample_name bam vcf depth sex
    60603-W1-B2 60603-W1-B2.cram 60603-W1-B2.filtered.vcf.gz 51.3408 M

  3. input.log shows this:

ANNOVAR Version:
        $Date: 2020-06-07 23:56:37 -0400 (Sun,  7 Jun 2020) $
ANNOVAR Information:
        For questions, comments, documentation, bug reports and program update, please visit http://www.openbioinformatics.org/annovar/
ANNOVAR Command:
        /bin/annovar/annotate_variation.pl -filter -build hg38 -dbtype gnomad_genome /tmp/tmpy7mejtlt /bin/annovar/humandb -outfile 60603-W1-B2/input
ANNOVAR Started:
        Fri Aug 25 18:38:21 2023
NOTICE: Output file with variants matching filtering criteria is written to 60603-W1-B2/input.hg38_gnomad_genome_dropped, and output file with other variants is written to 60603-W1-B2/input.hg38_gnomad_genome_filtered
NOTICE: Processing next batch with 3158595 unique variants in 3158595 input lines
NOTICE: Database index loaded. Total number of bins is 28084439 and the number of bins to be scanned is 2768371
NOTICE: Scanning filter database /bin/annovar/humandb/hg38_gnomad_genome.txt...Done
  1. Output:
  • input.hg38_gnomad_genome_filtered has 201,820 lines
  • features.txt is empty
  • images and matrices folders are empty
  1. In STDERR, I'm getting this WARNING when I run deepmosaic-draw:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_CTYPE = "C.UTF-8",
        LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
  1. After this step: "NOTICE: Scanning filter database /bin/annovar/humandb/hg38_gnomad_genome.txt...Done
    /DeepMosaic/deepmosaic/gnomadAnnotation.py:30: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv(output_dir + "input." + build + "_" + dbtype + "_dropped", header=None, sep="\t")"

    I'm getting a bunch of these warnings (thousands of lines) right when the pipeline starts:

/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260008_1_2600980418: No such file or directory
/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260007_1_2601222400: No such file or directory
/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260012_1_2601368861: No such file or directory
/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260011_1_2601476278: No such file or directory
/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260010_1_2600633160: No such file or directory
/reference-files/cram-cache/6a/ef/897c3d6ff0c78aff06ac189178dd.tmp_260006_1_2601420330: No such file or directory
[E::cram_read_container] Container header CRC32 failure

It looks like there are CRAM warnings/errors, but there is no CRAM input into the pipeline. So I'm not sure what the issue is.

Can you please assist?

Thanks!

A little issue in canvasPainter.py

Hi, Dr. Yang.

The canvasPainter.py code in DeepMosaic:

On line 38, 'offset = 0' should perhaps be replaced with 'offset = int(start-start_pos)',
because a deletion may occur, which leads to 'start != start_pos'.

The window half-width ('width/2=150bp') is longer than most Illumina reads, so line 38 is rarely reached.
The issue therefore probably did not affect your training much, but it does seem to exist.

Best regards,
Weixiang

Docker image?

Hi,

Will you develop the docker image of DeepMosaic?

If input VCFs are split by chromosome, should the input BAM files also be split by chromosome?

Dear shishenyxx,

Thank you for your amazing program that applies image classification technology to the bioinformatics/genomics field. I want to use your program, and I have one question: if input VCFs are split by chromosome, should the input BAM files also be split by chromosome? I have many WGS samples that I would like to process with DeepMosaic.

If you have any comment, please answer.

Thanks.

Sincerely, June

Too many mutect2 variants

Hi,
My 250x genome (after BQSR and indel realignment) has 4 million PASS variants.

Is that normal? That seems like too much.

Anything else that you do standard to pre-filter the variants to reduce the number?

Thanks

high maf and score3 simultaneously?

Hi Xiaoxu,

I observed cases where the variant allele frequency was 1.0 and yet score3 was very high (though prediction was not "mosaic"):

grep -w -F -f pred.common.snv.txt prediction_all_batches.dlpfc.no_artif_intronic.txt | awk '$21 > .9' | cut -f3-8,11,19,21-22 | head
chr10 56646808 T C chr10_56646808_T_C 1.0 intergenic 0.08634431215220496 0.9056392422676766 alternative_homozygous
chr10 64154450 T A chr10_64154450_T_A 0.9583333333333334 intergenic 0.05680040770804196 0.9114665426861789 alternative_homozygous
chr10 64263645 A G chr10_64263645_A_G 1.0 intergenic 0.035593511213016175 0.9638000832314269 alternative_homozygous
chr10 69640841 G C chr10_69640841_G_C 1.0 intergenic 0.06964778067165234 0.9294363717543487 alternative_homozygous
chr10 64890126 C T chr10_64890126_C_T 1.0 intergenic 0.014043077162248046 0.9857749298484482 alternative_homozygous

Wondering how an maf = 1.0, which would typically indicate a germline variant, can have a high mosaic score. The 6th column is maf, 8th is score 1 and 9th is score3 above.

Thanks again!

Kunal

About SimData

Hi, I am reading your paper in Nature Biotechnology and am confused about your SimData sets.

  1. For SimData1 and SimData2, did you simulate those variants (10,000 and 7,610, respectively) via Pysim? or just manually create those variants?
  2. For SimData3, if I understand correctly, those 30,090 variants were obtained from GIAB's callset on HG002. However, what did you mean by 'original BAM file' (In Method-SimData3 section, The original BAM file was first up-sampled, and alternative reads were replaced to generate the expected AF)? How did you obtain or generate the 'original BAM file'?

Thanks and looking forward to your reply

Using DeepMosaic for diploid plants

Hi, Dr. Yang,
I'm interested in using DeepMosaic in diploid plant species.
Any guidance you could provide on using DeepMosaic for diploid plants, especially for species without sex chromosomes, would be much appreciated. Thank you for your time and for creating this useful tool!

Error in downloading new 1.1.1 singularity container

Hi,
I'm getting this error when downloading the new singularity container:

$ singularity pull library://arzoopatel5/deepmosaic/deepmosaic:v1.1.1
INFO:    Downloading library image
7.8GiB / 7.8GiB [==========================================================================] 100 % 11.4 MiB/s 0s
WARNING: integrity: signature not found for object group 1
WARNING: Skipping container verification

too many PASS for 30X WGS

Hi,
My 30x genome (after BQSR and indel realignment) has 100,000 PASS variants from GATK Mutect2 in single mode. Based on your responses to other questions, I filtered out all variants located near indels, homopolymers, and repeats, but 80,000 still remained. That seems like too many. How should I set the gnomAD frequency filter (keep variants below or above a threshold)?

Is there anything else you do as standard to filter the variants and reduce the number?

Thanks

Errors at multiprocessing and matplotlib

Hi, Dr. Yang. I am trying to test DeepMosaic with the files in demo. After launching, it shows the error below:

$ deepmosaic-draw.py -i /hdd1/DeepMosaic/demo/input.txt -o ./ -a /hdd1/annovar/

NOTICE: Output files are written to ./input.variant_function, ./input.exonic_variant_function
NOTICE: Reading gene annotation from /hdd1/annovar/humandb/hg19_refGene.txt ... Done with 72567 transcripts (including 17617 without coding sequence annotation) for 28263 unique genes
NOTICE: Processing next batch with 4 unique variants in 4 input lines
NOTICE: Output file with variants matching filtering criteria is written to ./input.hg19_gnomad_genome_dropped, and output file with other variants is written to ./input.hg19_gnomad_genome_filtered
NOTICE: Processing next batch with 4 unique variants in 4 input lines
NOTICE: Database index loaded. Total number of bins is 28127612 and the number of bins to be scanned is 4
NOTICE: Scanning filter database /hdd1/annovar/humandb/hg19_gnomad_genome.txt...Done

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/hdd1/DeepMosaic/deepmosaic/featureExtraction.py", line 120, in multiprocess_iterator
    fig1.savefig(image_file)
  File "/usr/lib/python3/dist-packages/matplotlib/figure.py", line 2180, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/usr/lib/python3/dist-packages/matplotlib/backend_bases.py", line 2021, in print_figure
    canvas = self._get_output_canvas(format)
  File "/usr/lib/python3/dist-packages/matplotlib/backend_bases.py", line 1961, in _get_output_canvas
    raise ValueError(
ValueError: Format 'jpg' is not supported (supported formats: eps, pdf, pgf, png, ps, raw, rgba, svg, svgz)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/hdd1/DeepMosaic/deepmosaic/./deepmosaic-draw.py", line 4, in <module>
    if __name__=='__main__': main()
  File "/hdd1/DeepMosaic/deepmosaic/featureExtraction.py", line 230, in main
    results = pool.map(multiprocess_iterator, all_variants, 8)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: Format 'jpg' is not supported (supported formats: eps, pdf, pgf, png, ps, raw, rgba, svg, svgz)

The matplotlib issue seems to be resolved by re-installing pillow==6.0 (I had pillow==7.0 and matplotlib==3.1.2).
Could you help me resolve the multiprocessing issue? I found similar reports on Stack Overflow and other boards, but no clear resolution.
Thank you!
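
In the traceback above, the `ValueError` raised in the worker is re-raised in the parent process by `pool.map`, so both tracebacks report the same JPEG problem. A minimal sketch for checking the environment follows. It assumes, as the pillow re-install in the post suggests, that JPEG output in this matplotlib version depends on Pillow being importable by the same interpreter. The function names are hypothetical and not part of DeepMosaic.

```python
import importlib.util
import sys
from importlib import metadata


def package_report(names=("matplotlib", "Pillow")):
    """Return {distribution name: installed version, or None if not found}."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def pil_importable():
    """True if the PIL package (provided by Pillow) can be found by this interpreter."""
    return importlib.util.find_spec("PIL") is not None


if __name__ == "__main__":
    # The interpreter path matters when system and pip packages are mixed.
    print("python:", sys.executable)
    for name, version in package_report().items():
        print(f"{name}: {version or 'not installed'}")
    print("PIL importable:", pil_importable())
```

Running this with the same interpreter that launches `deepmosaic-draw.py` shows whether the two packages it sees match the versions reported in the post.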

How DeepMosaic uses population information in training

Hi Dr. Yang,

I have a question about how DeepMosaic incorporates population AF in the trained model.
How is the population AF information at each position incorporated into the model?
Could the absolute positional information of the germline variants used in training affect the final output?

Thank you for your support

homopolymer and dinucleotide filter

Hi,
I noticed that in the Q&A you recommend that, for WGS variants, excluding annotated homopolymer and dinucleotide repeats will remove false positives and increase the validation rate, but decrease the sensitivity. However, I do not know what homopolymer=0 and dinucleotide=0 mean: is a variant more reliable or less reliable as the value gets closer to zero?
What do you recommend?

Thanks

Multi-threaded support

Hi,
Do either of the deepmosaic steps support multi-threading to help speed up the pipeline?
Thanks

How does DeepMosaic perform on PCR-amplified WES data?

DeepMosaic seems like a very powerful tool. I may have missed it, but from the paper it looks like you trained the model on PCR-amplified libraries and tested it on PCR-free libraries. How does DeepMosaic perform when tested on PCR-amplified libraries, specifically WES data?

DeepMosaic performance at 30x depth?

Hi.

My main question is "Is it recommended to run DeepMosaic to identify MVs in 30x data?"

I am planning to pass data from thousands of human genomes (BAM, VCF) to DeepMosaic, but most of them are at 30x depth.
In the DeepMosaic paper, the 30x results do not seem to look good in Extended Data Fig. 1.

And there is no data in the Main Figure 2 and Extended Data Figure 5.

It looks okay in Extended Data Fig. 1, but I am not sure because I have no experience with it.

Can I use DeepMosaic for 30x data?

I also am wondering about the runtime (cpu time, wall time...)

Thanks,
James

Large number of output files

Hi, There seems to be an output folder with a huge number of files. So many that it takes a long time to delete. Which directory is this, and is this a necessary output and/or is there a way to suppress this?
Thanks!
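
One way to find out which directory holds most of the files is to count files per directory under the output folder. The sketch below is generic and makes no assumption about DeepMosaic's output layout. The function names are hypothetical, and the root path is whatever was passed to `-o`.

```python
import os
import sys


def count_files(root):
    """Map each directory under root to the number of files directly inside it."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts


def largest_dirs(root, top=5):
    """Return the `top` directories with the most files as (path, count) pairs."""
    counts = count_files(root)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, n in largest_dirs(root):
        print(f"{n:>10}  {path}")
```

Run it with the output directory as the only argument. The first line printed is the directory that contains the most files.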
