felixkrueger / snpsplit Goto Github PK

View Code? Open in Web Editor NEW

51.0 51.0 19.0 3.4 MB

Allele-specific alignment sorting

Home Page: http://felixkrueger.github.io/SNPsplit/

License: GNU General Public License v3.0

Perl 100.00%

allele-specific bioinformatics ngs sequencing

snpsplit's People

Contributors

Stargazers

Watchers

Forkers

jinfengchen shicheng-guo vivekbhr steffenheyne 1010stone musketeer-d yixf-self nservant alaminzju hyperworker icebert xma82 andy3nieto fransua lytytli guangjunyin jing-xinxing tianwen0003 jlt3

snpsplit's Issues

Using the --full_sequence genome

Hi,

I created a hybrid genome using the --full_sequence option to avoid masking and get a higher mapping accuracy. Nevertheless, 100% of my reads are unassignable after mapping because there are no Ns, which makes perfect sense.

What's the point of having the --full_sequence option if you can't split reads to alleles after? Is it just a SNPsplit feature to create a new genome?

Thanks!

SNP split for the hybrid cross two different strains

Hello,

Thanks for the tool once again its been while I am using SNPsplit I have doubt on. I have hybrid cross data CAST vs 129S1 for this cross SNP split can be used?

prepareGenome

Hi,
When I run the SNPsplit_genome_preparation script on the complete Mouse genome (base chromosomes + all scaffolds/fixes), with --no_nmasking, the full_sequence output contains only the base chromosome.

My genome reference comes from ;

 ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz

>>grep ">" Mus_musculus.GRCm38.dna.toplevel.fa 
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
>3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
>4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
>5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
>6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
>7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
>8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
>9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
>13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
>14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
>15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
>16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
>17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
>18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
>19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
>X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
>Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
>MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
>CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX
>CHR_MG4222_MG3908_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4222_MG3908_PATCH:1:94987243:1 PATCH_FIX
>CHR_MG51_PATCH dna:chromosome chromosome:GRCm38:CHR_MG51_PATCH:1:156507375:1 PATCH_FIX
>CHR_MG3496_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3496_PATCH:1:195440828:1 PATCH_FIX
>CHR_MG4200_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4200_PATCH:1:94983374:1 PATCH_FIX
>CHR_MG4243_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4243_PATCH:1:156484188:1 PATCH_FIX
>CHR_MG4209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4209_PATCH:1:91793962:1 PATCH_FIX
>CHR_MG74_PATCH dna:chromosome chromosome:GRCm38:CHR_MG74_PATCH:1:104052134:1 PATCH_FIX
>CHR_MG4310_MG4311_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4310_MG4311_PATCH:1:156656003:1 PATCH_FIX
>CHR_MG4249_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4249_PATCH:1:61433356:1 PATCH_FIX
>CHR_MG3833_MG4220_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3833_MG4220_PATCH:1:98208654:1 PATCH_FIX
>CHR_MG3231_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3231_PATCH:1:171029545:1 PATCH_FIX
>CHR_MG4151_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4151_PATCH:1:145439975:1 PATCH_FIX
>CHR_MG104_PATCH dna:chromosome chromosome:GRCm38:CHR_MG104_PATCH:1:170913546:1 PATCH_FIX
>CHR_MMCHR1_CHORI29_IDD5_1 dna:chromosome chromosome:GRCm38:CHR_MMCHR1_CHORI29_IDD5_1:1:195506435:1 PATCH_NOVEL
>CHR_MG3700_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3700_PATCH:1:90658154:1 PATCH_FIX
>CHR_MG3530_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3530_PATCH:1:130695022:1 PATCH_FIX
>CHR_MG4261_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4261_PATCH:1:103906836:1 PATCH_FIX
>CHR_MG3251_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3251_PATCH:1:145419646:1 PATCH_FIX
>CHR_MG3562_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3562_PATCH:1:156470354:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG4 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG4:1:122190308:1 PATCH_NOVEL
>CHR_WSB_EIJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG2:1:122242168:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG2:1:122246885:1 PATCH_NOVEL
>CHR_MG3648_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3648_PATCH:1:104165524:1 PATCH_FIX
>CHR_MG3618_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3618_PATCH:1:156477076:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG5 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG5:1:122035401:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG3:1:122032376:1 PATCH_NOVEL
>CHR_MG4136_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4136_PATCH:1:156508116:1 PATCH_FIX
>CHR_MG4138_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4138_PATCH:1:130620757:1 PATCH_FIX
>CHR_MG3835_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3835_PATCH:1:90835696:1 PATCH_FIX
>CHR_MG89_PATCH dna:chromosome chromosome:GRCm38:CHR_MG89_PATCH:1:159939961:1 PATCH_FIX
>CHR_MG4213_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4213_PATCH:1:91736668:1 PATCH_FIX
>CHR_MG3829_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3829_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG209_PATCH:1:94987270:1 PATCH_FIX
>CHR_WSB_EIJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG3:1:122041104:1 PATCH_NOVEL
>CHR_MG4308_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4308_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG3609_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3609_PATCH:1:156568640:1 PATCH_FIX
>CHR_MG4180_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4180_PATCH:1:120129530:1 PATCH_FIX
>CHR_MG3686_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3686_PATCH:1:170897390:1 PATCH_FIX
>CHR_MG65_PATCH dna:chromosome chromosome:GRCm38:CHR_MG65_PATCH:1:61442615:1 PATCH_FIX
>CHR_MG3627_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3627_PATCH:1:124903046:1 PATCH_FIX
>CHR_MG3999_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3999_PATCH:1:195424274:1 PATCH_FIX
>CHR_MG3699_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3699_PATCH:1:90657263:1 PATCH_FIX

Command line ;

SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome/ --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --no_nmasking

Output ;

>>grep ">" CAST_EiJ_maternal_genome.fa 
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y

I think it would be good to export all chromosomes, even if there have no SNPs.
From ENSEMBLE ; Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are coloured green in the Chromosome summary page and Region in detail page. They are improvements on the primary assembly and should be used preferentially over the primary assembly.

Thanks @FelixKrueger !
Nicolas

MD:A tag

Hi @FelixKrueger,

Great tool; quick question-- I get some reads with a MD:A tag (see attached .bam file), which causes SNPsplit to fail. My understanding though is that one could still use this tag as it specifies a character instead of a string:
https://samtools.github.io/hts-specs/SAMv1.pdf

Any thoughts to change the regex to include MD:A: tags?

SNPsplit/SNPsplit

Line 236 in 7bed7b6

my $md_tag = $1 if ($_ =~ /MD:Z:(.+?)\s+/);

On this note, I think that a more-graceful bow-out would be just leaving reads without an MD:Z tag as unassigned rather than killing the whole script. I don't know if you have thoughts on this?

Thanks,
-Caleb

Keeping reads at CG SNPs using Bismark

I'm working on a bisulfite sequencing project with heterozygotes that have SNPs in CG sites. These SNPs in CpG sites are of primary interest to the research question. Some, but not all, of these SNPs are C/T. However, my bedgraph data never have any methylation calls at SNPs in CG sites when using the bismark2bedGraph function, which is because I was aligning reads to an N-masked genome.

I have a well-defined list of 100% fixed SNPs for genome1 and genome2. I considered aligning the bam files to genome1 then using SNPsplit (as discussed under "Utilization of SNP positions and allele assignment of bisulfite reads" in the User Guide), but then there is a mapping bias toward genome1.

Is there a way to keep methylation calls, while avoiding mapping bias, in the SNPs?

My current workflow is to align reads to an N-masked reference genome using Bismark, then the bam files are used as input for SNPsplit to assign reads to haplotype, and I then extract methylation:

SNPsplit --snp_file <SNP.file.gz> <inputfiles>
bismark_methylation_extractor <val_1_bismark_bt2_pe.genome1.bam>
bismark2bedGraph <CpG_OT_R1_val_1_bismark_bt2_pe.genome1.txt> -o <R1>

Change the BAM file sort command for Samtools v1.3+

The Samtools sort command seems to have changed from earlier versions and now fails when using Samtools version 1.3+ with the following error message:

File specified as unsorted paired-end BAM file
Sorting BAM file 'test.bam' by read IDs ...
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files

This probably just needs specifically specifying the outfile explicitly.

option to specify output directory

Dear Felix

Thanks for the useful tool. I am trying to add SNPsplit into an RNAseq pipeline but the lack of option to specify output directory of the allele-sorted files is creating a trouble.. It would be good to add this parameter in SNPsplit..

Also the documentation doesn't mention which file names/prefix/suffix to expect from the output files in case of single/dual hybrid genome preparation.. I figured this out in my case but would be good to have it in the docs..

Thanks
Vivek

a naive question

Could this software seperae Father source and Mother source sequenced reads from fertilized egg sample data?

All reads are unassignable

Dear SNPsplit developer,

I am running into an issue where all reads are unassignable, although my SNP file and bam file use the same reference. My aligner is bismark v0.18.1_dev and the version for SNPsplit is 0.3.2_dev.

I therefore made a tiny test SNP file https://github.com/RoseString/Sparrow/blob/master/test.txt and a test sam file https://github.com/RoseString/Sparrow/blob/master/test.sam to see if the assignment works.

In the test dataset, both SN996:348:HLW3VBCXY:1:1106:7653:2122_1:N:0:GCGCTA and SN996:348:HLW3VBCXY:1:1106:3931:2390_1:N:0:GCGCTA should be assigned to G1 because they overlap with the SNPs I provided; however, they are marked UA for some reason.

Allele-tagging report
=====================
Processed 6 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
6 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)

The command I am using is
SNPsplit --bisulfite --paired --sam --snp_file test.txt test.sam

I would really appreciate it if you can take some time to look into this. Thanks!

Dan

Add --high_confidence option for dual hybrid genomes

We have come across certain position in the genome where different strains appear to have the same SNP (indicated by the GT/genotype field), but one of the strains failed the FI/FILTER criterium (1 is PASS, 0 is FAIL). Here is an example:

GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1 (129) 1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0 (Cast)

For single hybrid genomes we would include this position into the 129 genome (1/1 homozygous SNP, first line), but would ignore the position for the Cast genome (also 1/1 homozygous SNP, but failed the high confidence FI filter, second line). This seems like a reasonable approach.

For dual hybrid genomes such positions might be a problem though because when the 129 and Cast SNP lists are compared with each other it looks like there is now a SNP between 129 and Cast, even though there was evidence that the genotype was the same (1/1) in and Cast, only that it did not pass the threshold to count as high confidence SNP in Cast.

As a solution to this can we change the SNPsplit genome preparation to store the FI value as well as the GT genotype and only use the position for a dual-hybrid SNP list if the position was measured with high confidence (i.e. FI=1) in both strains? Thanks to @nservant for helpful discussions in this regard.

Please return an error when runnning out of memory

In SNPsplit_genome_preparation the program stopped after the line Now reading in and storing sequence information of the genome specified in: /data/data/ref_genome/. It said complete, but I did not obtain the files I wanted. I increased the available memory and the program performed as intended. Please return an error if the genome can't be loaded into memory.

Writing sorting report for single-end files to file

Currently the sorting report is only printed to the screen but not to the sorting or SNPsplit report for Single-End files (while this works fine for Paired-End files).

C>T SNPs preventing the assignment

Hello!

I was wondering how SNPsplit behaves when a given read contains a C>T SNP as well as another one which would allow assignment. Is the read discarded altogether because of the C>T SNP or is the other "useful" SNP allowing assignment?

Thanks for your support!

Error with SNPsplit_genome_preparation

Hi Felix,

I'm having an issue with the SNPsplit_genome_preparation (v0.5.0) processing step. There seems to be an issue with variable definition, but I'm unfortunately not familiar enough with Perl to troubleshoot myself.

Command:

SNPsplit_genome_preparation --vcf_file snp.noMiss.fix.withFI.vcf --strain lane1-A1-A_S1_L001_1 --strain2 lane1-A3-C_S3_L001_1 --reference_genome $STOCKYARD/stampede2/db/Amil_chr1/

Output log:

Strain defined as 'lane1-A1-A_S1_L001_1' (strain index: 9)
Strain 2 specified, setting option '--dual_hybrid'
Dual Hybrid strain selected
Strain2 defined as 'lane1-A3-C_S3_L001_1' (strain2 index: 11)
Reference genome folder provided is /work/05703/jprippe/stampede2/db/Amil_chr1/	(absolute path is '/work2/05703/jprippe/stampede2/db/Amil_chr1/)'

Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file:		snp.noMiss.fix.withFI.vcf
Reading/filtering VCF file:		Yes (default)
Reference genome:			/work2/05703/jprippe/stampede2/db/Amil_chr1/
N-masking:				Yes
Full SNP genome:			Yes
SNP strain:				lane1-A1-A_S1_L001_1
SNP strain 2:				lane1-A3-C_S3_L001_1
Dual hybrid, new Ref/SNP:		lane1-A1-A_S1_L001_1/lane1-A3-C_S3_L001_1

Using the following chromosomes (detected from VCF file >>snp.noMiss.fix.withFI.vcf<<):
chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94

Analysing SNP fields for name >lane1-A1-A_S1_L001_1<
Using FORMAT field index: 8
Using INFO field index: 7
GT index not defined, checking...
Setting GT index to >>0<<
Setting FI index to >>11<<
Can't use an undefined value as a symbol reference at /home1/05703/jprippe/SNPsplit-0.5.0/SNPsplit_genome_preparation line 1145, <IN> line 63.

If it's worth noting, I'm using a custom VCF and I had the program working a few weeks ago after modifying the v0.4.0 code to address a different error (following #47). Since then, I've had to switch computing systems and haven't been able to reproduce the success with either the newest release or previous one.

VCF (in case it helps...):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##BisSNP-1.0.0 Program Args= -R /work/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta -T BisulfiteGenotyper -nonDirectional -nt 12 -I lane1-A1-A_S1_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A2-B_S2_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A3-C_S3_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A4-D_S4_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -vfn1 cpg.raw.vcf -vfn2 snp.raw.vcf
##FORMAT=<ID=BQ,Number=.,Type=Float,Description="Average base quality for reads supporting alleles. For each allele, in the same order as listed">
##FORMAT=<ID=BRC6,Number=.,Type=Integer,Description="Bisulfite read counts(not filtered by minConv): 1) number of C in cytosine strand, 2) number of T in cytosine strand, 3) number of A/G/N in cytosine strand, 4) number of G in guanine strand, 5) number of A in guanine strand, 6) number of C/T/N in guanine strand.">
##FORMAT=<ID=CM,Number=1,Type=Integer,Description="Number of Unconverted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=CP,Number=1,Type=String,Description="Best Cytosine Context">
##FORMAT=<ID=CU,Number=1,Type=Integer,Description="Number of Converted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth, not filtered by mapping quality and base quality score criteria yet">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Reads supporting ALT, only keep good base. Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles">
##FORMAT=<ID=GP,Number=G,Type=Integer,Description="Normalized, Phred-scaled posteriors for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Somatic status of the variant. 1) wildtype; 2) germline, somatic; 3) LOH; 4) post-transcriptional modification; 5) unknown">
##INFO=<ID=CS,Number=1,Type=Character,Description="Strand of cytosine relative to reference genome. Does not apply to non-cytosine positions">
##INFO=<ID=Context,Number=.,Type=String,Description="Cytosine Context. e.g. 'CG' means homozygous CG sites, while heterozygous CG sites will be output as IUPAC code">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth, not filtered by mapping quality and base quality score criteria yet">
##INFO=<ID=HQ,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=REF,Number=.,Type=String,Description="Reference Cytosine Context. if all cytosine context found are in reference genome, REF=0; otherwise, e.g. it will be REF=CG for position not confidently called as CG or real context is CH.">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##assembly=none
##center=none
##contig=<ID=chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94,species=null>
##fileDate=20210802
##geneAnno=none
##phasing=none
##reference=<ID=null,Source=file:/work2/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta>
##tcgaversion=1.1
##vcfProcessLog=<InputVCF=<>,InputVCFSource=<>,InputVCFVer=<1.1>,InputVCFParam=<> InputVCFgeneAnno=<>>
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -e GT[*]="mis" snp.sorted.vcf; Date=Fri Aug  6 10:56:27 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A1-A_S1_L001_1 -c CHROM,POS,FORMAT/FI snp.noMiss.fix.vcf; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A2-B_S2_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A3-C_S3_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A4-D_S4_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	lane1-A1-A_S1_L001_1	lane1-A2-B_S2_L001_1	lane1-A3-C_S3_L001_1	lane1-A4-D_S4_L001_1
chr1	193	.	T	C,G	31.76	PASS	DP=23;MQ0=2;NS=4	GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI	0/0:37,nan:0,6,1,0,0,0:.:.:.:.:2,4,0,0:0,27,28:27:5:1	0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1	1/1:37,nan:0,6,0,0,1,0:.:.:.:.:5,2,0,0:2147483647,4,0:4:5:1	1/1:.:0,6,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1	245	.	T	C,G	31.76	PASS	DP=27;MQ0=3;NS=4	GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI	0/0:37,nan:0,7,0,0,2,0:.:.:.:.:4,5,0,0:0,31,34:31:5:1	0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1	1/1:.:0,5,0,0,5,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1	1/1:.:0,5,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1	246	.	A	G,C	31.76	PASS	DP=28;MQ0=3;NS=4	GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI	0/0:37,nan:0,0,8,1,0,1:.:.:.:.:4,5,0,0:0,31,34:31:5:1	0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1	1/1:.:0,0,4,0,0,6:.:.:.:.:0,0,0,0:0,57,6:6:5:1	1/1:.:0,0,5,0,0,2:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1	247	.	A	G,C	36.03	PASS	DP=29;MQ0=3;NS=4	GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI	0/0:34.3,37:0,0,8,0,0,2:.:.:.:.:4,5,1,0:0,36,52:36:5:1	0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1	1/1:.:0,0,3,0,0,7:.:.:.:.:0,0,0,0:0,57,6:6:5:1	1/1:37,37:0,0,6,0,0,2:.:.:.:.:1,6,1,0:2147483647,8,0:8:5:1

Any ideas?

Thanks!
JP

Two much variation in the Two strain specific reads percentage

Hi Felix,

I am getting too much variation read assignment in my analysis this. Thread is from after resolving my issue. https://github.com/FelixKrueger/reStrainingOrder/issues/2 same problem observed in snpsplit that's why I asked again here. Can you please confirm modification needed here ... 11-13% and 50-60% JF1 and 129s1 respectively I am using Vcf v7 please suggest me. for the data which I have is pure hybrid strain confirm this wetlab person several times. I have another dataset with the different strain cross C57 vs 129S1 there I am not facing any issue.

please help me

Thank you

Why are there so many reads not containing one of the expected bases at known SNP positions ?

Hi, I've been using SNPsplit recently with our monkey data. The input is the output of bismark (v 0.18.0,default parameters). SNP information is generated by genome resequencing. And this is part of my allele-tagging report. SNPsplit seems working well.
What I am confused about is the bold text. Why are there so many reads not containing one of the expected bases at known SNP positions (Well I think it's a little much )? Is it because of mismatches allowed by bismark? I also wonder if there are recommended parameters for bismark when the output is used for phasing ?
Thank you very much for comments.

Allele-tagging report

Processed 366283612 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
288619297 reads were unassignable (78.80%)
35559649 reads were specific for genome 1 (9.71%)
34895376 reads were specific for genome 2 (9.53%)
30805929 reads did not contain one of the expected bases at known SNP positions (8.41%)
7209290 contained conflicting allele-specific SNPs (1.97%)

Question about homozygous SNP ?

I am a little confused with "homozygous SNP".
In SNPsplit paper, It's said that "If the genotype is not known, SNP positions may be called from the data itself, or from genome re-sequencing performed in parallel".
For example,I have bs-seq data for a monkey as well as her genome re-sequencing data. The genotype is not known. Can SNPsplit genome preparation still be used? SNPsplit genome preparation uses only high confidence homozygous positions, which confuses me. I guess it is heterozygous SNP rather than homozygous SNP that can be used to assign reads in this case.
In the case reciprocal mouse crosses reported by Xie and colleague (GSE33722), I understand the necessity of homogeneous SNP.
Thanks for any suggestion.

Stats in YAML format to enable integration into MultiQC

Phil asked if we could produce the stats in YAML format to make it easier to integrate into MultiQC MultiQC/MultiQC#593 (comment)

printing genome1 and genome2 strain name while sorting

Hi @FelixKrueger

During SNPsplit sort, the strain names of the bam files are not entered (the suffix says _genome1 and _genome2). neither is it printed in the SNPsplit sort log files. The only way to find out what genome1 and 2 correspond to is to look back at the log of genome_generate.

It would be good to either replace the suffix _genome1/2 with strain ID or (probably even easier) to print the strain name for genome1 and 2 in the log file of SNPsplit sort

Thanks..
Best,
Vivek

Phased human genomes

I was wondering how SNPsplit would handle phased human genomes. I didn't see anything in the documentation that seemed to take it into account but I could've missed it. Thanks!

The awk script makes me confused

I think the awk script you given has some mistakes
awk '{if($1 ~ "^#") {gsub("contig=<ID=", "contig=<ID=chr"); gsub("contig=<ID=chrMT", "contig=<ID=chrM"); print} else gsub("^MT", "M"); print "chr"$0}' mgp.v5.merged.snps_all.dbSNP142.vcf
I think this script should be this
awk '{if($1 ~ "^#") {gsub("contig=<ID=", "contig=<ID=chr"); gsub("contig=<ID=chrMT", "contig=<ID=chrM"); print} else {gsub("^MT", "M"); print "chr"$0}}' mgp.v5.merged.snps_all.dbSNP142.vcf
my system is centos 7

'S' CIGAR operation not allowed

I aligned reads using Bowtie2, and when SNPsplit read the bam file, it said: Found CIGAR operations other than M, I, D or N: 'S'. Not allowed at the moment
Should I specify any parameter when running Bowtie2 or filter the input reads in some reads to rule out "S" type reads?
Could I just filter the sam file to get alignments other than 'S' alignments? Does this differ with specify --end-to-end. when using Bowtie2?

Look at file type auto-detection

I'm pretty sure that there are too many @PG program lines so that the auto-detection doesn't do the right thing anymore. Related to FelixKrueger/Bismark#384?

Question about genome preparation for C57BL/6J x JF1/Ms hybrid mouse

Dear Felix,

My group is hoping to analyse some published data generated from a hybrid mouse train (C57BL/6J x JF1/Ms) in which SNP information for JF1/Ms is not available in the “mgp.v5.merged.indels.dbSNP142.normed.vcf.gz” vcf file from sanger institute website
We have however managed to find the SNP info stored in a .tsv file for this strain in the following format:

Chromosome	Start_position	End_position	Ref	Alt
1	3000176	3000176	T	G

Can I please check with you whether we can use this tsv file (or a reformatted version of this tsv file) for generation of the masked genome using the following command:

SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1/Ms --skip_filtering

Many thanks!

Best,
Pui

Output genome files contain UA flags

Hello!

First of all thanks for your great work with this tool, it's helping a lot my analyses!
I was hoping you could help me figure out what am I doing wrong while using SNPsplit: for some reason many of the reads in the output genome1 and genome2 files are flagged as unassigned (UA). Is that suppose to be the case?

Here is my command line:

SNPsplit INPUT_FILE
--snp_file SNP_FILE
--paired
--bisulfite
--conflicting

The INPUT_FILE is the output from a bismark alignment, while the SNP_FILE is the all_STRAIN2_SNPs_STRAIN1_reference.based_on_GRCm38.txt generated during a SNPsplit_genome_preparation run.

Those are the percentages resulting from this run:
Reads were unassignable (not overlapping SNPs): 49377695 (48.29%)
Reads were specific for genome 1: 28090395 (27.47%)
Reads were specific for genome 2: 23789145 (23.26%)
Reads contained conflicting SNP information: 998916 (0.98%)

and in fact my genome1.bam file contains more than double 28090395 reads.

Is that anything you encountered before?

Thanks a lot for your support!

Supporting soft-clipping

Would it be particularly difficult to support soft-clipping in reads? This would be very useful for use with alternative aligners, in my case I'm using minimap2 for Nanopore long reads.

Add option to skip sorting

To allow easier implementation of allele-specific sorting into ChIP-seq, RNA-seq or BS-seq pipelines workflows via Nextflow (e.g. our in-house Nextflow pipelines) can we add an option that will only run the allele-tagging process, and then stop at generating the allele_flagged.bam files.

This will allow adding markdup or deduplicate_bismark steps, and then run the tag2sort step at a later point in time.

High level of unassigned reads

Possibly not an issue, but wanted to double check. I'm getting very low rates mapping to either genome when looking at a number of different mouse lines.

I'm using SNPsplit to identify reads coming from specific genomes (GRCm39 vs C3H_HeJ).

Basic pipeline is:

./SNPsplit_genome_preparation --vcf_file file.vcf --strain C3H_HeJ --reference_genome

STAR --runMode genomeGenerate --genomeDir pathtodir --genomeFastaFiles pathtogenome/*.fa

STAR --genomeDir pathtogenome --sjdbGTFfile pathtoannotation/genes.gtf --readFilesIn file1.fastq file2.fastq --quantMode GeneCounts --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outSAMtype BAM Unsorted

./SNPsplit --SNP_file pathtoSNPsfile/file.txt Aligned.out.bam

I'm getting mapping to GRCm38 of around 0.6% and to C3H_HeJ around 0.6%. Has anyone done similar and is this in line with the sort of numbers you would expect for mouse strains? I guess there's not that much difference between the mouse lines, and the data is from very early development, so may not be that much expressed.

Tidy up and MD-ify docs

The documentation would benefit from markdownification, the old release notes documents could me merged into the Changelog.md file etc.

Some questions about working on plant species

Hello Felix,

I'm working on hybrid maize and also want to identify reads from specific genome. Is SNPsplit suitable for the species?
I have mapped the paried-end reads to the maize genome by using Hisat2, and the call the alleles (stored in VCFv4.2 format) by freebayes.
I followed the instruction you mentioned in the document.
First, I prepared the genome:

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain maize

But it turned out:

Strain name specified [maize] does not match any of the available strain names!

Available genomes to choose from are:
=====================================
unknown
=====================================

Please double check the name and try again (using '--strain NAME')

It may means I have set uncorrected genome name, then I found "unknown" in line #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown in the vcf file, then I ran the command again with --strain from "maize" to "unknown", :

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain unknown

The exception came out:

Strain defined as 'unknown' (strain index: 9)
Failed to move to genome folder > Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa/ <: Not a directory

SNPsplit_genome_preparation --help for more details

It may means I should put the genome in a fold, so I reran the command as below with the genome in maize_genome fold:

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome maize_genome --strain unknown

But there are so many exceptions like below:

Use of uninitialized value $fi in numeric eq (==) at /data/CrazyHsu_data/opts/SNPsplit_genome_preparation line 1041, <IN> line 125001.

There are many files generated in SNPs_unknown, but they are all empty. And the content of unknown_SNP_filtering_report.txt is:

SNP position summary for strain unknown (based on genome build GRCm38)
===========================================================================

Positions read in total:        302818

47016   SNP were homozygous. Of these:
0       SNP were homozygous and passed high confidence filters and were thus included into the unknown genome

Not included into unknown genome:
1296    had the same sequence as the reference
0               had no clearly defined alternative base
254501          Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
47016           were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_unknown_GRCm38.txt.gz<...

Head 10 line of unknown_genome_preparation_report.txt is:

0 positions on chromosome 1 were changed to 'N'
0 positions on chromosome 10 were changed to 'N'
0 positions on chromosome 2 were changed to 'N'
0 positions on chromosome 3 were changed to 'N'
0 positions on chromosome 4 were changed to 'N'
0 positions on chromosome 5 were changed to 'N'
0 positions on chromosome 6 were changed to 'N'
0 positions on chromosome 7 were changed to 'N'
0 positions on chromosome 8 were changed to 'N'
0 positions on chromosome 9 were changed to 'N'

Is there anything wrong when I use the SNPsplit_genome_preparation? And how can I do?

Best,
Feng

SNPsplit_genome_preparation processing error on custom vcf and genome

I am experiencing processing error when using SNPsplit_genome_preparation.

SNPsplit Genome Preparation version: 0.4.0

command:

SNPsplit_genome_preparation --vcf_file ~/project/zebrafish_resequencing/deepvariant/merged.vcf.gz --strain TU --dual_hybrid --strain2 TL --reference_genome ref_genome

processing log:

Strain defined as 'TU' (strain index: 9)
Dual Hybrid strain selected
Strain2 defined as 'TL' (strain2 index: 10)
Reference genome folder provided is ref_genome/	(absolute path is '/mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/)'

Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file:		/mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz
Reading/filtering VCF file:		Yes (default)
Reference genome:			/mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/
N-masking:				Yes
Full SNP genome:			Yes
SNP strain:				TU
SNP strain 2:				TL
Dual hybrid, new Ref/SNP:		TU/TL

Using the following chromosomes (detected from VCF file >>/mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz<<):
chr10	chr8	chr17	chr21	chrM	chr25	chr9	chr22	chr15	chr5	chr2	chr6	chr12	chr16	chr7	chr11	chr1	chr3	chr14	chr18	chr24	chr23	chr20	chr19	chr4	chr13

Analysing SNP fields for name >TU<
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 44.
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 45.

vcf file

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##DeepVariant_version=1.0.0
##contig=<ID=chr1,length=59578282>
##contig=<ID=chr10,length=45420867>
##contig=<ID=chr11,length=45484837>
...
##bcftools_mergeVersion=1.6-37-g9a38c20+htslib-1.6-38-g8003166
##bcftools_mergeCommand=merge -Oz -o merged.vcf.gz TU.filtered.vcf.gz TL.filtered.vcf.gz; Date=Tue Apr  6 16:26:23 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TU      TL
chr1    119     .       TG      T       0       RefCall .       GT:GQ:DP:AD:VAF:PL:FI   ./.:.:.:.:.:.:. 0/0:29:27:24,3:0.111111:0,28,48:0
chr1    187     .       A       T       0       RefCall .       GT:GQ:DP:AD:VAF:PL:FI   0/0:20:11:9,2:0.181818:0,20,53:0        ./.:.:.:.:.:.:.
chr1    235     .       C       T       0.4     RefCall .       GT:GQ:DP:AD:VAF:PL:FI   ./.:.:.:.:.:.:. ./.:10:38:32,5:0.131579:0,9,37:0

Could help me?

Consider adding single-end/paired-end auto-detection

It is quite easy to forget to specify --paired for paired-end files, so it would help to add a library strategy auto-detection, at least for the first file.

Add check for incompatible genomes (e.g. UCSC Golden Path)

Users reported issues that no SNPs were incorporated when they ran SNPsplit. The issue turned out to be caused by the fact that the Ensembl version of the vcf file uses a chromosome nomenclature like 1, 2, 3, X, Y, MT whereas UCSC uses chr1, chr2, chr3, chrX, chrY, chrM instead. This causes SNPsplit to look up chromosomes that do not actually exist.

Can we add a check to see if the chromosome names appear to be conflicting and make it die horribly.

Why all my reads are unassigned

Hi! SNPsplit seems like a perfect tool for allele specific research, nice work!

Here is my question:
I hope to look at RNA expression at individual chromosome, but although my reads are processed, no read were assigned.

my code:

#get SNPsplit format snp from sanger vcf
SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ --strain2 C57BL_6NJ

#generate star index
STAR --runMode genomeGenerate --runThreadN 80 --genomeDir ./CAST_B6_star_index --genomeFastaFiles CAST_B6/CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence/chr*.fa

#map RNA reads to ref genome
STAR --runThreadN 20 --genomeDir ./CAST_B6_star_index --readFilesIn RNA_all/umibycell.E752003.rna.R1.fq --outFileNamePrefix starAlign/star. --outSAMtype BAM Unsorted --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx

#use SNP_split
SNPsplit --snp_file ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt ./starAlign/star.Aligned.out.bam -o ./res

SNPsplit output:

Created output directory ./res/!

Output will be written into the directory: /shareb/zliu/test/RNAsplit/res/
Testing if input file './starAlign/star.sort.bam' looks like a Bisulfite-Seq file
Treating file(s) as single-end data (as extracted from @PG line)

Treating file(s) as single-end data (as extracted from @PG line)

Treating file(s) as single-end data (as extracted from @PG line)

Now starting to process file <<< './starAlign/star.sort.bam' >>>

Input file:                                     'star.sort.bam'
Writing SNPplit-tag report to:                  'star.sort.SNPsplit_report.txt'
Writing allele-flagged output file to:          'star.sort.allele_flagged.bam'


Summary of parameters for SNPsplit-tag:
========================================
SNPsplit infile:                ./starAlign/star.sort.bam
SNP annotation file:            ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
Output directory:               >/shareb/zliu/test/RNAsplit/res/<
Parent directory:               >/shareb/zliu/test/RNAsplit<
Samtools path:                  /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format:                  BAM (default)
Input format:                   Single-End


File specified as single-end BAM file
Storing SNP positions provided in './CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt'
Stored 20635313 positions in total

Reading from sorted mapping file './starAlign/star.sort.bam'

Allele-tagging report
=====================
Processed 195953 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
195953 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)


SNP coverage report
===================
SNP annotation file:    ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total:   20635313
N-containing reads:     0
non-N:                  195953
total:                  195953
Reads had a deletion of the N-masked position (and were thus called Conflicting):       0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read:     0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:        0 (N/A%)
N was not present in the list of SNPs:          0 (N/A%)

Finished allele-tagging for file <<< './starAlign/star.sort.bam' >>>


Summary of parameters for SNPsplit-sort:
========================================
SNPsplit tagged infile:         star.sort.allele_flagged.bam
Output directory:               >/shareb/zliu/test/RNAsplit/res/<
Parent directory:               >/shareb/zliu/test/RNAsplit<
Samtools path:                  /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format:                  BAM (default)
Input format:                   Single-End


Now processing input file <<< star.sort.allele_flagged.bam >>>

Writing YAML report to star.sort.SNPsplit_sort.yaml

Input file:                                             'star.sort.allele_flagged.bam'
Writing SNPsplit-sort report to:                        'star.sort.SNPsplit_sort.txt'
Writing unassigned reads to:                            'star.sort.unassigned.bam'
Writing genome 1-specific reads to:                     'star.sort.genome1.bam'
Writing genome 2-specific reads to:                     'star.sort.genome2.bam'

![image-20210703130704301](https://user-images.githubusercontent.com/16440619/124344013-8c7d4f80-dc02-11eb-8ed9-805019504a40.png)
![image-20210703130819568](https://user-images.githubusercontent.com/16440619/124344016-8e471300-dc02-11eb-9905-49ef8ba24d28.png)

Allele-specific single-end sorting report
=========================================
Read alignments processed in total:             195953
Reads were unassignable:                        195953 (100.00%)
Reads were specific for genome 1:               0 (0.00%)
Reads were specific for genome 2:               0 (0.00%)
Reads contained conflicting SNP information:    0 (0.00)


Sorting finished successfully



SNPsplit processing finished... [Allele-tagging + tag2sort]

I have checked my bam and snp file, they seems right. And I'm confident about my mouse genotype.
Here is an example from igv:

And here are sample bam file and it's index:
example.zip

Any ideas why? Thank you very much for taking the time to help！

Nmasked human genome

Update:
I think the problem is in the format of my SNP files. I used the one from mouse genome and replaced the content. Everything is good now. Sorry for the inconvenience and thank you very much!

Hi,

I'm trying to prepare an N-masked human genome with SNPsplit. I have read the issues #22 in the thread on a similar topic and I decided to follow your suggestion to use --skip_filtering option. I used the strain name 'SPRET_EiJ' for convenience and put my SNP files to a folder named 'SNPs_SPRET_EiJ'. However I still got problems as 0 positions are changed to N per chromosome. Could you help to check where is the problem?

My command is ../SNPsplit-0.3.4/SNPsplit_genome_preparation --nmasking --skip_filtering --vcf_file sample.vcf.gz --reference_genome ../hg38_genome/ --strain SPRET_EiJ --genome_build hg38_Nmasked

My vcf file is downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-common_all.vcf.gz

My reference genome is downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz. All 'chr*' are replaced by '*'.

My SNP files are in the below format.
rs544419019 1 11012 C/G
rs561109771 1 11063 T/G
rs540538026 1 13110 G/A
rs62635286 1 13116 T/G
rs62028691 1 13118 A/G
rs531730856 1 13273 G/C
rs548333521 1 13284 G/A
rs571093408 1 13380 C/G
rs568927457 1 13453 T/C
rs546169444 1 14464 A/T

My log file is

Reading/filtering VCF file: No (skipped by user)
Reference genome: ../hg38_genome/
N-masking: Yes
Full SNP genome: No
SNP strain: SPRET_EiJ

Using the following chromosomes (HARCODED IN!!!):
1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22
X Y MT

Skipped reading the VCF file and filtering SNPs again (specified by user)

Now reading in and storing sequence information of the genome specified in: ../hg38_genome/
chr 1 (248956422 bp)
chr 2 (242193529 bp)
chr 3 (198295559 bp)
chr 4 (190214555 bp)
chr 5 (181538259 bp)
chr 6 (170805979 bp)
chr 7 (159345973 bp)
chr 8 (145138636 bp)
chr 9 (138394717 bp)
chr 10 (133797422 bp)
chr 11 (135086622 bp)
chr 12 (133275309 bp)
chr 13 (114364328 bp)
chr 14 (107043718 bp)
chr 15 (101991189 bp)
chr 16 (90338345 bp)
chr 17 (83257441 bp)
chr 18 (80373285 bp)
chr 19 (58617616 bp)
chr 20 (64444167 bp)
chr X (156040895 bp)
chr Y (57227415 bp)
chr M (16569 bp)
Processing chromosome 1 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr1.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr1.N-masked.fa
0 SNPs total for chromosome 1
0 positions on chromosome 1 were changed to 'N'

Processing chromosome 2 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr2.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr2.N-masked.fa
0 SNPs total for chromosome 2
0 positions on chromosome 2 were changed to 'N'

PrepareGenome input

Hi,
Idea of improvment ;
Would be great if SNPsplit_genome_preparation could either take a folder or a fasta file directly in input.
Cheers
N

STAR provides only unassigned reads

Hi, I'm running SNPsplit with the mouse genome GRCm38 and the strain CAST_EiJ.

The workflow was as such

# genome preparations (for comparison reason I tried both full_seqnece and N-masked)
SNPsplit_genome_preparation --strain CAST_EiJ  --vcf Mmu.GrCm38/SNPsplit/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome GRCm38/ --nmasking --full_sequence --genome_build GRCm38

#  genome indexing (command added only for full sequence, but is identical to N-masked run, onlt the directories differ).
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir STARIndex_full_sequence/ --genomeFastaFiles CAST_EiJ_full_sequence/*.fa

# Mapping
STAR  --runThreadN 10 --genomeDir Mmu.GrCm38/SNPsplit/STARIndex_full_sequence --readFilesCommand zcat --readFilesIn IP_H3K4m3_WK36.conc.R1.fastq.gz IP_H3K4m3_WK36.conc.R2.fastq.gz --outFileNamePrefix SNPsplit/W36_full_sequence. --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx  --alignEndsType EndToEnd

# SNPsplit ( Am I using the correct SNP file?)
SNPsplit --snp_file Mmu.GrCm38/SNPsplit/all_SNPs_CAST_EiJ_GRCm38.txt.gz --paired --conflicting --verbose SNPsplit/W36_full_sequence.Aligned.out.bam

My first question is what is the difference in the results between a full_sequence run and an N-masked run?

Second question is my main problem, as I get only unassigned reads.

my SNP file after filtering:

cat CAST_EiJ_SNP_filtering_report.txt
SNP position summary for strain CAST_EiJ (based on genome build GRCm38)
===========================================================================

Positions read in total:        78772544

21331153        SNP were homozygous. Of these:
20668547        SNP were homozygous and passed high confidence filters and were thus included into the CAST_EiJ genome

Not included into CAST_EiJ genome:
54636251        had the same sequence as the reference
4808            had no clearly defined alternative base
2800332         Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
662606          were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_CAST_EiJ_GRCm38.txt.gz<...

so ~20M SNP were founds and passed the threshold. But how can I check how many of them overlap an exon (like you did in #24)?

Can this be a problem of using the STAR aligner? Should I try bowtie2?

Any ideas?

thanks

BWA is not supported [as BWA does not allow Ns in the index]

Hi @FelixKrueger ,

Thank you for the great tool! It is a great help to my research.

I have a question. Is it possible to add BWA support? Our lab is interested in doing population genetics with split reads. Basically, we will follow the GATK pipeline which recommends using BWA for read mapping. I have tried running bwa (with soft-clipping option off) followed by SNPsplit, but it failed. So I guess BWA output is not yet supported.

Also I'm wondering in a future release if we'll be able to run the tool with soft-clipping on? That would increase the mapping efficiency.

Thank you very much!

Question about analysis pipeline when using SNPsplit

Hi~

Recently, I took over an analysis project in which the reads from ChIPseq are needed to be splited into paternal and maternal ones(C57BL6-NJ and PWK-PhJ strain mouse). I'm wondering whether my analysis pipleline is wrong. So, could you help me to check the whole process???

filter and trim the reads before alignment
prepare the N-masked genome: in this procedure, I found the SNPsplit_genome_preparation tools in SNPsplit can process one genome one by one by specifying the strain name. So, I firstly masked the mm10 genome with PWK-PhJ SNP information downloaded from sanger institute website (file name: mgp.v5.merged.snps_all.dbSNP142.vcf.gz), and then masked the fa file outputed by previous step with C57BL6-NJ SNP information. Finally, I merged the fa files in the output directory of last maskking step into the new genome masked with maternal and paternal SNP sites.
align the reads to the new genome by bowtie2 with options [--end-to-end -N1 -q]
remove duplicates by sambamba
split the allele reads: in this step, I'm not sure how to input the snp file. I'm planning to merge the txt files from the step 2 in which SNPsplit seem to output the SNP information as txt files in each chromosome. But there are some questions: Can SNPsplit split the alignment into paternal and maternal ones at once? If true, how to provide the snp file with parameter --snp_file? or Should I run twice and split the maternal and paternal alignments seperately?

Can you help me to solve these problems???

Thank you sincerely!

Hanwen

SNP value column in input snp file correspond to which information in vcf format files?

What does the SNP value mean?

Does it mean "QUAL" in vcf files or any else information?

How to use a self prepared VCF ?

I had do WGS for my samples, and how to reform the vcf that suit for the SNPsplit ?

Which reference genome I have to use it for genomepreparation

Greetings..,
Nice tool I wanted use SNP split in my chipseq data I am new to this tool Can you please suggest me which reference file I can use here I have downloaded the vcf file v5 from the Sanger database as mentioned in user guide and reference genome from the ensemble GRCm39 latest version with GTF.

Can you please suggest can I proceed with the genome preparation or please suggest link to download reference genome and GTF it will be very helpful

Thank you

set "--dual_hybrid" did not mask reference

Hi,

Recently, I found when I want to set two strains of mouse background (use "--dual_hybrid"), it had failed to mask the reference genome. Followed is my command:
########
./SNPsplit_genome_preparation_v2VCF --vcf_file ~/Annotation/SNP/mouse/mm9/mgp.v2.snps.annot.reformat.vcf --reference_genome ~/Annotation/SNP/mouse/mm9/mm9 --strain C57BL6NJ --strain2 PWKPhJ --dual_hybrid --genome_build mm9 --nmasking
########
I didn't know what's the problem?
Could you give me some advises?

Thanks,
Best,
Garen

Running SNPsplit in parallel

Dear Felix,

I am using SNPsplit with WGS data from an F1 mouse cell (BL6/spretus) line with a very high SNP density (41mio). Mapping using the N-masked genome works very well and X-linked reads map only to the maternal chromosome, which would be expected in this cross. Also coverage is quite even across chromosomes.

I now want to separate the reads overlapping SNPs to specific alleles and am using SNPsplit for this. These are deep WGS samples (60x, paired-end 150) and this was taking forever with one sample, so I parsed the reads only overlapping unfiltered SNPs in the UV treated samples (~10k) and am running SNPsplit on them now to assign mutations to one genome for only this subset of the library. This is taking very long as well and am curious if you have a method for parallelizing SNPsplit to speed up the process.

Of course it would be possible to split reads again on chromosome and run one at a time in several processes and concatenate results, but was wondering if you had possibly developed an alternative method. I apologize in advance it I am missing something obvious and would greatly appreciate any input. Thanks!!

Cheers,
Paul

How to use the SNPsplit for higher version of VCF

Hello,
Great tool I am trying with the v6 version of VCF( ftp://ftp-mouse.sanger.ac.uk/REL-1807-SNPs_Indels/mgp.v6.merged.norm.snp.indels.sfiltered.vcf.gz ) because of desired genome its not present in V5,

I tried with seperate vcf file for 129S1 and JF1 and tried to followed steps in issues #31 , #41 still get empty files as a output command I tried as below

SNPsplit_genome_preparation --reference /media//SNPsplit/ --skip_filtering --dual_hybrid --strain 129S1_SvImJ --strain2 JF1_MsJ

output

Summary
0 Ns were newly introduced into the N-masked genome for strain 2 [JF1_MsJ] in total
Determining new Ref [129S1_SvImJ] and SNP [JF1_MsJ] annotations

Writing 129S1_SvImJ specific SNPs (relative to the GRCm38 reference) to >>129S1_SvImJ_specific_SNPs.GRCm38.txt<<
Writing JF1_MsJ specific SNPs (relative to the GRCm38 reference) to >>JF1_MsJ_specific_SNPs.GRCm38.txt<<
Writing SNPs in common between 129S1_SvImJ and JF1_MsJ (relative to the GRCm38 reference) to >>129S1_SvImJ_JF1_MsJ_SNPs_in_common.GRCm38.txt<<
Writing all new SNPs >>129S1_SvImJ/JF1_MsJ to >>all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt<<

Expected SNP file [all_SNPs_129S1_SvImJ_GRCm38.txt.gz] for strain 129S1_SvImJ did not exist! Please make sure that it is present in the current working directory

question regarding the HiC module

Hi @FelixKrueger

Could you please tell me a bit about the format of the BAM files that you need in snpsplit for read sorting from HiC data? HiCUP manual is not clear about the output BAM, but I am assuming that

It maps the R1 and R1 seperately using bowtie/bowtie2
removes invalid HiC pairs
Interleaves the R1 and R2 from valid pairs and write the sample1_2.bam, which can be taken by snpsplit

In that case I could in principle just map my R1 and 2 seperately using bowtie2, then interleave the output files, and this should serve as valid input for snpsplit. Am I correct?

Thanks,
Vivek

Relax genome preparation filtering criteria

Historically, we used to filter out any positions from VCF files where the alternative allele was not defined as as a single base (probably a relic from the days when there was one VCF file for a single strain). For the current mouse genomes project VCF file this seems overly harsh though since there may be different strains that are homozygous for different bases but at the same position.

Here is an example:

chr  //  pos  //  REF  //   ALT  //  GT strain1 // GT strain2  // GT strain3
 1      135446     G        A,T          0/0           2/2           1/1

Here all three strains would be homozygous compared to the reference, strain1 would have the same sequence as the reference, i.e. G/G, strain2 would be T/T and strain3 would be A/A. Can we please include these multiple variants as valid positions for the genome preparation.

BS-seq about SNP_split

Dear Felix，
recently I tried to use the SNP_split to split BS-seq reads from a rice hybrid, in the user guide I see that for SNP_split it adatped the bam from bismark, and before using SNP_split, the SNP position should be masked with "N", but when using the bismark, it will firstly convert "C" to "T", "G" to "A" in the BS-seq genome preparation and mapping step, so is the SNP positions masked by "N" are excluded for methylation level caculation because of no conversion?

Waiting for your reply, thanks.

Using SNPsplit to analysis Bisulfite-Seq data in plant

Dear Felix,

I am wondering if SNPsplit could be used to analyze Bisulfite-Seq data in plants? Or is SNPsplit designed for data analysis in mammalian systems?

I am asking this question because plant DNA methylation occurs in all cytosine sequence contexts: CG, CHG, and CHH. In contrast, mammalian DNA methylation is located almost exclusively in the CG context. If SNPsplit considers CG methylation only, SNPs of cytosine at CHG/CHH contexts can't be differentiated between a real SNP and methylation state. Does this make sense?

Thank you so much for your help!

Best,

Shan

SNP removal in Bismark

Hi Felix,
I wanted to know how Bismark handle positions with SNPs. I am planning to remove SNPs (C to T) in each sample after identifying them with BIS-SNP. I want to know whether Bismark infer the actual methylation proportion at these genomic positions and therefore no need to remove them. But as I understood by reading the issue #17 we need to remove them. Am I correct?
Thank You

PS: Sorry I just noticed that I have added this issue to SNPsplit, not Bismark. But I cannot delete it now.

felixkrueger / snpsplit Goto Github PK

snpsplit's People

Contributors

Stargazers

Watchers

Forkers

snpsplit's Issues

Command:

Output log:

VCF (in case it helps...):

Allele-tagging report

command:

processing log:

vcf file

Recommend Projects

Recommend Topics

Recommend Org