felixkrueger / snpsplit Goto Github PK
View Code? Open in Web Editor NEWAllele-specific alignment sorting
Home Page: http://felixkrueger.github.io/SNPsplit/
License: GNU General Public License v3.0
Allele-specific alignment sorting
Home Page: http://felixkrueger.github.io/SNPsplit/
License: GNU General Public License v3.0
Hi,
I created a hybrid genome using the --full_sequence option to avoid masking and get a higher mapping accuracy. Nevertheless, 100% of my reads are unassignable after mapping because there are no Ns, which makes perfect sense.
What's the point of having the --full_sequence option if you can't split reads to alleles after? Is it just a SNPsplit feature to create a new genome?
Thanks!
Hello,
Thanks for the tool once again its been while I am using SNPsplit I have doubt on. I have hybrid cross data CAST vs 129S1 for this cross SNP split can be used?
Hi,
When I run the SNPsplit_genome_preparation script on the complete Mouse genome (base chromosomes + all scaffolds/fixes), with --no_nmasking, the full_sequence output contains only the base chromosome.
My genome reference comes from ;
ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz
>>grep ">" Mus_musculus.GRCm38.dna.toplevel.fa
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
>3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
>4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
>5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
>6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
>7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
>8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
>9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
>13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
>14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
>15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
>16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
>17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
>18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
>19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
>X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
>Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
>MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
>CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX
>CHR_MG4222_MG3908_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4222_MG3908_PATCH:1:94987243:1 PATCH_FIX
>CHR_MG51_PATCH dna:chromosome chromosome:GRCm38:CHR_MG51_PATCH:1:156507375:1 PATCH_FIX
>CHR_MG3496_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3496_PATCH:1:195440828:1 PATCH_FIX
>CHR_MG4200_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4200_PATCH:1:94983374:1 PATCH_FIX
>CHR_MG4243_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4243_PATCH:1:156484188:1 PATCH_FIX
>CHR_MG4209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4209_PATCH:1:91793962:1 PATCH_FIX
>CHR_MG74_PATCH dna:chromosome chromosome:GRCm38:CHR_MG74_PATCH:1:104052134:1 PATCH_FIX
>CHR_MG4310_MG4311_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4310_MG4311_PATCH:1:156656003:1 PATCH_FIX
>CHR_MG4249_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4249_PATCH:1:61433356:1 PATCH_FIX
>CHR_MG3833_MG4220_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3833_MG4220_PATCH:1:98208654:1 PATCH_FIX
>CHR_MG3231_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3231_PATCH:1:171029545:1 PATCH_FIX
>CHR_MG4151_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4151_PATCH:1:145439975:1 PATCH_FIX
>CHR_MG104_PATCH dna:chromosome chromosome:GRCm38:CHR_MG104_PATCH:1:170913546:1 PATCH_FIX
>CHR_MMCHR1_CHORI29_IDD5_1 dna:chromosome chromosome:GRCm38:CHR_MMCHR1_CHORI29_IDD5_1:1:195506435:1 PATCH_NOVEL
>CHR_MG3700_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3700_PATCH:1:90658154:1 PATCH_FIX
>CHR_MG3530_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3530_PATCH:1:130695022:1 PATCH_FIX
>CHR_MG4261_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4261_PATCH:1:103906836:1 PATCH_FIX
>CHR_MG3251_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3251_PATCH:1:145419646:1 PATCH_FIX
>CHR_MG3562_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3562_PATCH:1:156470354:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG4 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG4:1:122190308:1 PATCH_NOVEL
>CHR_WSB_EIJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG2:1:122242168:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG2:1:122246885:1 PATCH_NOVEL
>CHR_MG3648_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3648_PATCH:1:104165524:1 PATCH_FIX
>CHR_MG3618_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3618_PATCH:1:156477076:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG5 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG5:1:122035401:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG3:1:122032376:1 PATCH_NOVEL
>CHR_MG4136_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4136_PATCH:1:156508116:1 PATCH_FIX
>CHR_MG4138_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4138_PATCH:1:130620757:1 PATCH_FIX
>CHR_MG3835_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3835_PATCH:1:90835696:1 PATCH_FIX
>CHR_MG89_PATCH dna:chromosome chromosome:GRCm38:CHR_MG89_PATCH:1:159939961:1 PATCH_FIX
>CHR_MG4213_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4213_PATCH:1:91736668:1 PATCH_FIX
>CHR_MG3829_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3829_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG209_PATCH:1:94987270:1 PATCH_FIX
>CHR_WSB_EIJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG3:1:122041104:1 PATCH_NOVEL
>CHR_MG4308_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4308_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG3609_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3609_PATCH:1:156568640:1 PATCH_FIX
>CHR_MG4180_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4180_PATCH:1:120129530:1 PATCH_FIX
>CHR_MG3686_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3686_PATCH:1:170897390:1 PATCH_FIX
>CHR_MG65_PATCH dna:chromosome chromosome:GRCm38:CHR_MG65_PATCH:1:61442615:1 PATCH_FIX
>CHR_MG3627_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3627_PATCH:1:124903046:1 PATCH_FIX
>CHR_MG3999_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3999_PATCH:1:195424274:1 PATCH_FIX
>CHR_MG3699_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3699_PATCH:1:90657263:1 PATCH_FIX
Command line ;
SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome/ --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --no_nmasking
Output ;
>>grep ">" CAST_EiJ_maternal_genome.fa
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y
I think it would be good to export all chromosomes, even if there have no SNPs.
From ENSEMBLE ; Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are coloured green in the Chromosome summary page and Region in detail page. They are improvements on the primary assembly and should be used preferentially over the primary assembly.
Thanks @FelixKrueger !
Nicolas
Hi @FelixKrueger,
Great tool; quick question-- I get some reads with a MD:A
tag (see attached .bam file), which causes SNPsplit to fail. My understanding though is that one could still use this tag as it specifies a character instead of a string:
https://samtools.github.io/hts-specs/SAMv1.pdf
Any thoughts to change the regex to include MD:A: tags?
Line 236 in 7bed7b6
On this note, I think that a more-graceful bow-out would be just leaving reads without an MD:Z tag as unassigned rather than killing the whole script. I don't know if you have thoughts on this?
Thanks,
-Caleb
I'm working on a bisulfite sequencing project with heterozygotes that have SNPs in CG sites. These SNPs in CpG sites are of primary interest to the research question. Some, but not all, of these SNPs are C/T. However, my bedgraph data never have any methylation calls at SNPs in CG sites when using the bismark2bedGraph
function, which is because I was aligning reads to an N-masked genome.
I have a well-defined list of 100% fixed SNPs for genome1 and genome2. I considered aligning the bam files to genome1 then using SNPsplit (as discussed under "Utilization of SNP positions and allele assignment of bisulfite reads" in the User Guide), but then there is a mapping bias toward genome1.
Is there a way to keep methylation calls, while avoiding mapping bias, in the SNPs?
My current workflow is to align reads to an N-masked reference genome using Bismark, then the bam files are used as input for SNPsplit to assign reads to haplotype, and I then extract methylation:
SNPsplit --snp_file <SNP.file.gz> <inputfiles>
bismark_methylation_extractor <val_1_bismark_bt2_pe.genome1.bam>
bismark2bedGraph <CpG_OT_R1_val_1_bismark_bt2_pe.genome1.txt> -o <R1>
The Samtools sort command seems to have changed from earlier versions and now fails when using Samtools version 1.3+ with the following error message:
File specified as unsorted paired-end BAM file
Sorting BAM file 'test.bam' by read IDs ...
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
This probably just needs specifically specifying the outfile explicitly.
Dear Felix
Thanks for the useful tool. I am trying to add SNPsplit into an RNAseq pipeline but the lack of option to specify output directory of the allele-sorted files is creating a trouble.. It would be good to add this parameter in SNPsplit..
Also the documentation doesn't mention which file names/prefix/suffix to expect from the output files in case of single/dual hybrid genome preparation.. I figured this out in my case but would be good to have it in the docs..
Thanks
Vivek
Could this software seperae Father source and Mother source sequenced reads from fertilized egg sample data?
Dear SNPsplit developer,
I am running into an issue where all reads are unassignable, although my SNP file and bam file use the same reference. My aligner is bismark v0.18.1_dev and the version for SNPsplit is 0.3.2_dev.
I therefore made a tiny test SNP file https://github.com/RoseString/Sparrow/blob/master/test.txt and a test sam file https://github.com/RoseString/Sparrow/blob/master/test.sam to see if the assignment works.
In the test dataset, both SN996:348:HLW3VBCXY:1:1106:7653:2122_1:N:0:GCGCTA and SN996:348:HLW3VBCXY:1:1106:3931:2390_1:N:0:GCGCTA should be assigned to G1 because they overlap with the SNPs I provided; however, they are marked UA for some reason.
Allele-tagging report
=====================
Processed 6 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
6 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)
The command I am using is
SNPsplit --bisulfite --paired --sam --snp_file test.txt test.sam
I would really appreciate it if you can take some time to look into this. Thanks!
Dan
We have come across certain position in the genome where different strains appear to have the same SNP (indicated by the GT/genotype field), but one of the strains failed the FI/FILTER criterium (1 is PASS, 0 is FAIL). Here is an example:
GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1 (129) 1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0 (Cast)
For single hybrid genomes we would include this position into the 129 genome (1/1 homozygous SNP, first line), but would ignore the position for the Cast genome (also 1/1 homozygous SNP, but failed the high confidence FI filter, second line). This seems like a reasonable approach.
For dual hybrid genomes such positions might be a problem though because when the 129 and Cast SNP lists are compared with each other it looks like there is now a SNP between 129 and Cast, even though there was evidence that the genotype was the same (1/1) in and Cast, only that it did not pass the threshold to count as high confidence SNP in Cast.
As a solution to this can we change the SNPsplit genome preparation
to store the FI
value as well as the GT
genotype and only use the position for a dual-hybrid SNP list if the position was measured with high confidence (i.e. FI=1
) in both strains? Thanks to @nservant for helpful discussions in this regard.
In SNPsplit_genome_preparation
the program stopped after the line Now reading in and storing sequence information of the genome specified in: /data/data/ref_genome/
. It said complete
, but I did not obtain the files I wanted. I increased the available memory and the program performed as intended. Please return an error if the genome can't be loaded into memory.
Currently the sorting report is only printed to the screen but not to the sorting or SNPsplit report for Single-End files (while this works fine for Paired-End files).
Hello!
I was wondering how SNPsplit behaves when a given read contains a C>T SNP as well as another one which would allow assignment. Is the read discarded altogether because of the C>T SNP or is the other "useful" SNP allowing assignment?
Thanks for your support!
Hi Felix,
I'm having an issue with the SNPsplit_genome_preparation (v0.5.0) processing step. There seems to be an issue with variable definition, but I'm unfortunately not familiar enough with Perl to troubleshoot myself.
SNPsplit_genome_preparation --vcf_file snp.noMiss.fix.withFI.vcf --strain lane1-A1-A_S1_L001_1 --strain2 lane1-A3-C_S3_L001_1 --reference_genome $STOCKYARD/stampede2/db/Amil_chr1/
Strain defined as 'lane1-A1-A_S1_L001_1' (strain index: 9)
Strain 2 specified, setting option '--dual_hybrid'
Dual Hybrid strain selected
Strain2 defined as 'lane1-A3-C_S3_L001_1' (strain2 index: 11)
Reference genome folder provided is /work/05703/jprippe/stampede2/db/Amil_chr1/ (absolute path is '/work2/05703/jprippe/stampede2/db/Amil_chr1/)'
Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file: snp.noMiss.fix.withFI.vcf
Reading/filtering VCF file: Yes (default)
Reference genome: /work2/05703/jprippe/stampede2/db/Amil_chr1/
N-masking: Yes
Full SNP genome: Yes
SNP strain: lane1-A1-A_S1_L001_1
SNP strain 2: lane1-A3-C_S3_L001_1
Dual hybrid, new Ref/SNP: lane1-A1-A_S1_L001_1/lane1-A3-C_S3_L001_1
Using the following chromosomes (detected from VCF file >>snp.noMiss.fix.withFI.vcf<<):
chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94
Analysing SNP fields for name >lane1-A1-A_S1_L001_1<
Using FORMAT field index: 8
Using INFO field index: 7
GT index not defined, checking...
Setting GT index to >>0<<
Setting FI index to >>11<<
Can't use an undefined value as a symbol reference at /home1/05703/jprippe/SNPsplit-0.5.0/SNPsplit_genome_preparation line 1145, <IN> line 63.
If it's worth noting, I'm using a custom VCF and I had the program working a few weeks ago after modifying the v0.4.0 code to address a different error (following #47). Since then, I've had to switch computing systems and haven't been able to reproduce the success with either the newest release or previous one.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##BisSNP-1.0.0 Program Args= -R /work/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta -T BisulfiteGenotyper -nonDirectional -nt 12 -I lane1-A1-A_S1_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A2-B_S2_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A3-C_S3_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A4-D_S4_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -vfn1 cpg.raw.vcf -vfn2 snp.raw.vcf
##FORMAT=<ID=BQ,Number=.,Type=Float,Description="Average base quality for reads supporting alleles. For each allele, in the same order as listed">
##FORMAT=<ID=BRC6,Number=.,Type=Integer,Description="Bisulfite read counts(not filtered by minConv): 1) number of C in cytosine strand, 2) number of T in cytosine strand, 3) number of A/G/N in cytosine strand, 4) number of G in guanine strand, 5) number of A in guanine strand, 6) number of C/T/N in guanine strand.">
##FORMAT=<ID=CM,Number=1,Type=Integer,Description="Number of Unconverted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=CP,Number=1,Type=String,Description="Best Cytosine Context">
##FORMAT=<ID=CU,Number=1,Type=Integer,Description="Number of Converted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth, not filtered by mapping quality and base quality score criteria yet">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Reads supporting ALT, only keep good base. Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles">
##FORMAT=<ID=GP,Number=G,Type=Integer,Description="Normalized, Phred-scaled posteriors for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Somatic status of the variant. 1) wildtype; 2) germline, somatic; 3) LOH; 4) post-transcriptional modification; 5) unknown">
##INFO=<ID=CS,Number=1,Type=Character,Description="Strand of cytosine relative to reference genome. Does not apply to non-cytosine positions">
##INFO=<ID=Context,Number=.,Type=String,Description="Cytosine Context. e.g. 'CG' means homozygous CG sites, while heterozygous CG sites will be output as IUPAC code">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth, not filtered by mapping quality and base quality score criteria yet">
##INFO=<ID=HQ,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=REF,Number=.,Type=String,Description="Reference Cytosine Context. if all cytosine context found are in reference genome, REF=0; otherwise, e.g. it will be REF=CG for position not confidently called as CG or real context is CH.">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##assembly=none
##center=none
##contig=<ID=chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94,species=null>
##fileDate=20210802
##geneAnno=none
##phasing=none
##reference=<ID=null,Source=file:/work2/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta>
##tcgaversion=1.1
##vcfProcessLog=<InputVCF=<>,InputVCFSource=<>,InputVCFVer=<1.1>,InputVCFParam=<> InputVCFgeneAnno=<>>
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -e GT[*]="mis" snp.sorted.vcf; Date=Fri Aug 6 10:56:27 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A1-A_S1_L001_1 -c CHROM,POS,FORMAT/FI snp.noMiss.fix.vcf; Date=Fri Aug 6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A2-B_S2_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug 6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A3-C_S3_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug 6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A4-D_S4_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug 6 16:00:26 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT lane1-A1-A_S1_L001_1 lane1-A2-B_S2_L001_1 lane1-A3-C_S3_L001_1 lane1-A4-D_S4_L001_1
chr1 193 . T C,G 31.76 PASS DP=23;MQ0=2;NS=4 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI 0/0:37,nan:0,6,1,0,0,0:.:.:.:.:2,4,0,0:0,27,28:27:5:1 0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1 1/1:37,nan:0,6,0,0,1,0:.:.:.:.:5,2,0,0:2147483647,4,0:4:5:1 1/1:.:0,6,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1 245 . T C,G 31.76 PASS DP=27;MQ0=3;NS=4 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI 0/0:37,nan:0,7,0,0,2,0:.:.:.:.:4,5,0,0:0,31,34:31:5:1 0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1 1/1:.:0,5,0,0,5,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1 1/1:.:0,5,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1 246 . A G,C 31.76 PASS DP=28;MQ0=3;NS=4 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI 0/0:37,nan:0,0,8,1,0,1:.:.:.:.:4,5,0,0:0,31,34:31:5:1 0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1 1/1:.:0,0,4,0,0,6:.:.:.:.:0,0,0,0:0,57,6:6:5:1 1/1:.:0,0,5,0,0,2:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1 247 . A G,C 36.03 PASS DP=29;MQ0=3;NS=4 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI 0/0:34.3,37:0,0,8,0,0,2:.:.:.:.:4,5,1,0:0,36,52:36:5:1 0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1 1/1:.:0,0,3,0,0,7:.:.:.:.:0,0,0,0:0,57,6:6:5:1 1/1:37,37:0,0,6,0,0,2:.:.:.:.:1,6,1,0:2147483647,8,0:8:5:1
Any ideas?
Thanks!
JP
Hi Felix,
I am getting too much variation read assignment in my analysis this. Thread is from after resolving my issue. https://github.com/FelixKrueger/reStrainingOrder/issues/2 same problem observed in snpsplit that's why I asked again here. Can you please confirm modification needed here ... 11-13% and 50-60% JF1 and 129s1 respectively I am using Vcf v7 please suggest me. for the data which I have is pure hybrid strain confirm this wetlab person several times. I have another dataset with the different strain cross C57 vs 129S1 there I am not facing any issue.
please help me
Thank you
Hi, I've been using SNPsplit recently with our monkey data. The input is the output of bismark (v 0.18.0,default parameters). SNP information is generated by genome resequencing. And this is part of my allele-tagging report. SNPsplit seems working well.
What I am confused about is the bold text. Why are there so many reads not containing one of the expected bases at known SNP positions (Well I think it's a little much )? Is it because of mismatches allowed by bismark? I also wonder if there are recommended parameters for bismark when the output is used for phasing ?
Thank you very much for comments.
Processed 366283612 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
288619297 reads were unassignable (78.80%)
35559649 reads were specific for genome 1 (9.71%)
34895376 reads were specific for genome 2 (9.53%)
30805929 reads did not contain one of the expected bases at known SNP positions (8.41%)
7209290 contained conflicting allele-specific SNPs (1.97%)
I am a little confused with "homozygous SNP".
In SNPsplit paper, It's said that "If the genotype is not known, SNP positions may be called from the data itself, or from genome re-sequencing performed in parallel".
For example,I have bs-seq data for a monkey as well as her genome re-sequencing data. The genotype is not known. Can SNPsplit genome preparation still be used? SNPsplit genome preparation uses only high confidence homozygous positions, which confuses me. I guess it is heterozygous SNP rather than homozygous SNP that can be used to assign reads in this case.
In the case reciprocal mouse crosses reported by Xie and colleague (GSE33722), I understand the necessity of homogeneous SNP.
Thanks for any suggestion.
Phil asked if we could produce the stats in YAML format to make it easier to integrate into MultiQC MultiQC/MultiQC#593 (comment)
During SNPsplit sort
, the strain names of the bam files are not entered (the suffix says _genome1
and _genome2
). neither is it printed in the SNPsplit sort
log files. The only way to find out what genome1 and 2 correspond to is to look back at the log of genome_generate
.
It would be good to either replace the suffix _genome1/2
with strain ID or (probably even easier) to print the strain name for genome1 and 2 in the log file of SNPsplit sort
Thanks..
Best,
Vivek
I was wondering how SNPsplit would handle phased human genomes. I didn't see anything in the documentation that seemed to take it into account but I could've missed it. Thanks!
I think the awk script you given has some mistakes
awk '{if($1 ~ "^#") {gsub("contig=<ID=", "contig=<ID=chr"); gsub("contig=<ID=chrMT", "contig=<ID=chrM"); print} else gsub("^MT", "M"); print "chr"$0}' mgp.v5.merged.snps_all.dbSNP142.vcf
I think this script should be this
awk '{if($1 ~ "^#") {gsub("contig=<ID=", "contig=<ID=chr"); gsub("contig=<ID=chrMT", "contig=<ID=chrM"); print} else {gsub("^MT", "M"); print "chr"$0}}' mgp.v5.merged.snps_all.dbSNP142.vcf
my system is centos 7
I aligned reads using Bowtie2, and when SNPsplit read the bam file, it said: Found CIGAR operations other than M, I, D or N: 'S'. Not allowed at the moment
Should I specify any parameter when running Bowtie2 or filter the input reads in some reads to rule out "S" type reads?
Could I just filter the sam file to get alignments other than 'S' alignments? Does this differ with specify --end-to-end.
when using Bowtie2?
I'm pretty sure that there are too many @PG
program lines so that the auto-detection doesn't do the right thing anymore. Related to FelixKrueger/Bismark#384?
Dear Felix,
My group is hoping to analyse some published data generated from a hybrid mouse train (C57BL/6J x JF1/Ms) in which SNP information for JF1/Ms is not available in the “mgp.v5.merged.indels.dbSNP142.normed.vcf.gz” vcf file from sanger institute website
We have however managed to find the SNP info stored in a .tsv file for this strain in the following format:
Chromosome | Start_position | End_position | Ref | Alt |
---|---|---|---|---|
1 | 3000176 | 3000176 | T | G |
Can I please check with you whether we can use this tsv file (or a reformatted version of this tsv file) for generation of the masked genome using the following command:
SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1/Ms --skip_filtering
Many thanks!
Best,
Pui
Hello!
First of all thanks for your great work with this tool, it's helping a lot my analyses!
I was hoping you could help me figure out what am I doing wrong while using SNPsplit: for some reason many of the reads in the output genome1 and genome2 files are flagged as unassigned (UA). Is that suppose to be the case?
Here is my command line:
SNPsplit INPUT_FILE
--snp_file SNP_FILE
--paired
--bisulfite
--conflicting
The INPUT_FILE is the output from a bismark alignment, while the SNP_FILE is the all_STRAIN2_SNPs_STRAIN1_reference.based_on_GRCm38.txt generated during a SNPsplit_genome_preparation run.
Those are the percentages resulting from this run:
Reads were unassignable (not overlapping SNPs): 49377695 (48.29%)
Reads were specific for genome 1: 28090395 (27.47%)
Reads were specific for genome 2: 23789145 (23.26%)
Reads contained conflicting SNP information: 998916 (0.98%)
and in fact my genome1.bam file contains more than double 28090395 reads.
Is that anything you encountered before?
Thanks a lot for your support!
Would it be particularly difficult to support soft-clipping in reads? This would be very useful for use with alternative aligners, in my case I'm using minimap2 for Nanopore long reads.
To allow easier implementation of allele-specific sorting into ChIP-seq, RNA-seq or BS-seq pipelines workflows via Nextflow (e.g. our in-house Nextflow pipelines) can we add an option that will only run the allele-tagging process, and then stop at generating the allele_flagged.bam
files.
This will allow adding markdup
or deduplicate_bismark
steps, and then run the tag2sort
step at a later point in time.
Possibly not an issue, but wanted to double check. I'm getting very low rates mapping to either genome when looking at a number of different mouse lines.
I'm using SNPsplit to identify reads coming from specific genomes (GRCm39 vs C3H_HeJ).
Basic pipeline is:
./SNPsplit_genome_preparation --vcf_file file.vcf --strain C3H_HeJ --reference_genome
STAR --runMode genomeGenerate --genomeDir pathtodir --genomeFastaFiles pathtogenome/*.fa
STAR --genomeDir pathtogenome --sjdbGTFfile pathtoannotation/genes.gtf --readFilesIn file1.fastq file2.fastq --quantMode GeneCounts --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outSAMtype BAM Unsorted
./SNPsplit --SNP_file pathtoSNPsfile/file.txt Aligned.out.bam
I'm getting mapping to GRCm38 of around 0.6% and to C3H_HeJ around 0.6%. Has anyone done similar and is this in line with the sort of numbers you would expect for mouse strains? I guess there's not that much difference between the mouse lines, and the data is from very early development, so may not be that much expressed.
The documentation would benefit from markdownification, the old release notes documents could me merged into the Changelog.md file etc.
Hello Felix,
I'm working on hybrid maize and also want to identify reads from specific genome. Is SNPsplit suitable for the species?
I have mapped the paried-end reads to the maize genome by using Hisat2, and the call the alleles (stored in VCFv4.2 format) by freebayes.
I followed the instruction you mentioned in the document.
First, I prepared the genome:
SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain maize
But it turned out:
Strain name specified [maize] does not match any of the available strain names!
Available genomes to choose from are:
=====================================
unknown
=====================================
Please double check the name and try again (using '--strain NAME')
It may means I have set uncorrected genome name, then I found "unknown" in line #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown
in the vcf file, then I ran the command again with --strain
from "maize" to "unknown", :
SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain unknown
The exception came out:
Strain defined as 'unknown' (strain index: 9)
Failed to move to genome folder > Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa/ <: Not a directory
SNPsplit_genome_preparation --help for more details
It may means I should put the genome in a fold, so I reran the command as below with the genome in maize_genome fold:
SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome maize_genome --strain unknown
But there are so many exceptions like below:
Use of uninitialized value $fi in numeric eq (==) at /data/CrazyHsu_data/opts/SNPsplit_genome_preparation line 1041, <IN> line 125001.
There are many files generated in SNPs_unknown, but they are all empty. And the content of unknown_SNP_filtering_report.txt is:
SNP position summary for strain unknown (based on genome build GRCm38)
===========================================================================
Positions read in total: 302818
47016 SNP were homozygous. Of these:
0 SNP were homozygous and passed high confidence filters and were thus included into the unknown genome
Not included into unknown genome:
1296 had the same sequence as the reference
0 had no clearly defined alternative base
254501 Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
47016 were homozygous but the filtering call was low confidence
Printed a single list of all SNPs to >all_SNPs_unknown_GRCm38.txt.gz<...
Head 10 line of unknown_genome_preparation_report.txt is:
0 positions on chromosome 1 were changed to 'N'
0 positions on chromosome 10 were changed to 'N'
0 positions on chromosome 2 were changed to 'N'
0 positions on chromosome 3 were changed to 'N'
0 positions on chromosome 4 were changed to 'N'
0 positions on chromosome 5 were changed to 'N'
0 positions on chromosome 6 were changed to 'N'
0 positions on chromosome 7 were changed to 'N'
0 positions on chromosome 8 were changed to 'N'
0 positions on chromosome 9 were changed to 'N'
Is there anything wrong when I use the SNPsplit_genome_preparation? And how can I do?
Best,
Feng
I am experiencing processing error when using SNPsplit_genome_preparation.
SNPsplit Genome Preparation version: 0.4.0
SNPsplit_genome_preparation --vcf_file ~/project/zebrafish_resequencing/deepvariant/merged.vcf.gz --strain TU --dual_hybrid --strain2 TL --reference_genome ref_genome
Strain defined as 'TU' (strain index: 9)
Dual Hybrid strain selected
Strain2 defined as 'TL' (strain2 index: 10)
Reference genome folder provided is ref_genome/ (absolute path is '/mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/)'
Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file: /mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz
Reading/filtering VCF file: Yes (default)
Reference genome: /mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/
N-masking: Yes
Full SNP genome: Yes
SNP strain: TU
SNP strain 2: TL
Dual hybrid, new Ref/SNP: TU/TL
Using the following chromosomes (detected from VCF file >>/mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz<<):
chr10 chr8 chr17 chr21 chrM chr25 chr9 chr22 chr15 chr5 chr2 chr6 chr12 chr16 chr7 chr11 chr1 chr3 chr14 chr18 chr24 chr23 chr20 chr19 chr4 chr13
Analysing SNP fields for name >TU<
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 44.
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 45.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##DeepVariant_version=1.0.0
##contig=<ID=chr1,length=59578282>
##contig=<ID=chr10,length=45420867>
##contig=<ID=chr11,length=45484837>
...
##bcftools_mergeVersion=1.6-37-g9a38c20+htslib-1.6-38-g8003166
##bcftools_mergeCommand=merge -Oz -o merged.vcf.gz TU.filtered.vcf.gz TL.filtered.vcf.gz; Date=Tue Apr 6 16:26:23 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TU TL
chr1 119 . TG T 0 RefCall . GT:GQ:DP:AD:VAF:PL:FI ./.:.:.:.:.:.:. 0/0:29:27:24,3:0.111111:0,28,48:0
chr1 187 . A T 0 RefCall . GT:GQ:DP:AD:VAF:PL:FI 0/0:20:11:9,2:0.181818:0,20,53:0 ./.:.:.:.:.:.:.
chr1 235 . C T 0.4 RefCall . GT:GQ:DP:AD:VAF:PL:FI ./.:.:.:.:.:.:. ./.:10:38:32,5:0.131579:0,9,37:0
Could help me?
It is quite easy to forget to specify --paired
for paired-end files, so it would help to add a library strategy auto-detection, at least for the first file.
Users reported issues that no SNPs were incorporated when they ran SNPsplit. The issue turned out to be caused by the fact that the Ensembl version of the vcf
file uses a chromosome nomenclature like 1, 2, 3, X, Y, MT
whereas UCSC uses chr1, chr2, chr3, chrX, chrY, chrM
instead. This causes SNPsplit to look up chromosomes that do not actually exist.
Can we add a check to see if the chromosome names appear to be conflicting and make it die horribly.
Hi! SNPsplit seems like a perfect tool for allele specific research, nice work!
Here is my question:
I hope to look at RNA expression at individual chromosome, but although my reads are processed, no read were assigned.
my code:
#get SNPsplit format snp from sanger vcf
SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --reference_genome ../raw_data/ --strain CAST_EiJ --strain2 C57BL_6NJ
#generate star index
STAR --runMode genomeGenerate --runThreadN 80 --genomeDir ./CAST_B6_star_index --genomeFastaFiles CAST_B6/CAST_EiJ_C57BL_6NJ_dual_hybrid.based_on_GRCm38_full_sequence/chr*.fa
#map RNA reads to ref genome
STAR --runThreadN 20 --genomeDir ./CAST_B6_star_index --readFilesIn RNA_all/umibycell.E752003.rna.R1.fq --outFileNamePrefix starAlign/star. --outSAMtype BAM Unsorted --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx
#use SNP_split
SNPsplit --snp_file ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt ./starAlign/star.Aligned.out.bam -o ./res
SNPsplit output:
Created output directory ./res/!
Output will be written into the directory: /shareb/zliu/test/RNAsplit/res/
Testing if input file './starAlign/star.sort.bam' looks like a Bisulfite-Seq file
Treating file(s) as single-end data (as extracted from @PG line)
Treating file(s) as single-end data (as extracted from @PG line)
Treating file(s) as single-end data (as extracted from @PG line)
Now starting to process file <<< './starAlign/star.sort.bam' >>>
Input file: 'star.sort.bam'
Writing SNPplit-tag report to: 'star.sort.SNPsplit_report.txt'
Writing allele-flagged output file to: 'star.sort.allele_flagged.bam'
Summary of parameters for SNPsplit-tag:
========================================
SNPsplit infile: ./starAlign/star.sort.bam
SNP annotation file: ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
Output directory: >/shareb/zliu/test/RNAsplit/res/<
Parent directory: >/shareb/zliu/test/RNAsplit<
Samtools path: /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format: BAM (default)
Input format: Single-End
File specified as single-end BAM file
Storing SNP positions provided in './CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt'
Stored 20635313 positions in total
Reading from sorted mapping file './starAlign/star.sort.bam'
Allele-tagging report
=====================
Processed 195953 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
195953 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)
SNP coverage report
===================
SNP annotation file: ./CAST_B6/all_C57BL_6NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total: 20635313
N-containing reads: 0
non-N: 195953
total: 195953
Reads had a deletion of the N-masked position (and were thus called Conflicting): 0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read: 0 (0.00%)
Of valid N containing reads,
N was present in the list of known SNPs: 0 (N/A%)
N was not present in the list of SNPs: 0 (N/A%)
Finished allele-tagging for file <<< './starAlign/star.sort.bam' >>>
Summary of parameters for SNPsplit-sort:
========================================
SNPsplit tagged infile: star.sort.allele_flagged.bam
Output directory: >/shareb/zliu/test/RNAsplit/res/<
Parent directory: >/shareb/zliu/test/RNAsplit<
Samtools path: /share/home/zliu/miniconda3/envs/chip/bin/samtools
Output format: BAM (default)
Input format: Single-End
Now processing input file <<< star.sort.allele_flagged.bam >>>
Writing YAML report to star.sort.SNPsplit_sort.yaml
Input file: 'star.sort.allele_flagged.bam'
Writing SNPsplit-sort report to: 'star.sort.SNPsplit_sort.txt'
Writing unassigned reads to: 'star.sort.unassigned.bam'
Writing genome 1-specific reads to: 'star.sort.genome1.bam'
Writing genome 2-specific reads to: 'star.sort.genome2.bam'
![image-20210703130704301](https://user-images.githubusercontent.com/16440619/124344013-8c7d4f80-dc02-11eb-8ed9-805019504a40.png)
![image-20210703130819568](https://user-images.githubusercontent.com/16440619/124344016-8e471300-dc02-11eb-9905-49ef8ba24d28.png)
Allele-specific single-end sorting report
=========================================
Read alignments processed in total: 195953
Reads were unassignable: 195953 (100.00%)
Reads were specific for genome 1: 0 (0.00%)
Reads were specific for genome 2: 0 (0.00%)
Reads contained conflicting SNP information: 0 (0.00)
Sorting finished successfully
SNPsplit processing finished... [Allele-tagging + tag2sort]
I have checked my bam and snp file, they seems right. And I'm confident about my mouse genotype.
Here is an example from igv:
And here are sample bam file and it's index:
example.zip
Any ideas why? Thank you very much for taking the time to help!
Update:
I think the problem is in the format of my SNP files. I used the one from mouse genome and replaced the content. Everything is good now. Sorry for the inconvenience and thank you very much!
Hi,
I'm trying to prepare an N-masked human genome with SNPsplit. I have read the issues #22 in the thread on a similar topic and I decided to follow your suggestion to use --skip_filtering option. I used the strain name 'SPRET_EiJ' for convenience and put my SNP files to a folder named 'SNPs_SPRET_EiJ'. However I still got problems as 0 positions are changed to N per chromosome. Could you help to check where is the problem?
My command is ../SNPsplit-0.3.4/SNPsplit_genome_preparation --nmasking --skip_filtering --vcf_file sample.vcf.gz --reference_genome ../hg38_genome/ --strain SPRET_EiJ --genome_build hg38_Nmasked
My vcf file is downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-common_all.vcf.gz
My reference genome is downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz. All 'chr*' are replaced by '*'.
My SNP files are in the below format.
rs544419019 1 11012 C/G
rs561109771 1 11063 T/G
rs540538026 1 13110 G/A
rs62635286 1 13116 T/G
rs62028691 1 13118 A/G
rs531730856 1 13273 G/C
rs548333521 1 13284 G/A
rs571093408 1 13380 C/G
rs568927457 1 13453 T/C
rs546169444 1 14464 A/T
My log file is
Reading/filtering VCF file: No (skipped by user)
Reference genome: ../hg38_genome/
N-masking: Yes
Full SNP genome: No
SNP strain: SPRET_EiJ
Using the following chromosomes (HARCODED IN!!!):
1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22
X Y MT
Skipped reading the VCF file and filtering SNPs again (specified by user)
Now reading in and storing sequence information of the genome specified in: ../hg38_genome/
chr 1 (248956422 bp)
chr 2 (242193529 bp)
chr 3 (198295559 bp)
chr 4 (190214555 bp)
chr 5 (181538259 bp)
chr 6 (170805979 bp)
chr 7 (159345973 bp)
chr 8 (145138636 bp)
chr 9 (138394717 bp)
chr 10 (133797422 bp)
chr 11 (135086622 bp)
chr 12 (133275309 bp)
chr 13 (114364328 bp)
chr 14 (107043718 bp)
chr 15 (101991189 bp)
chr 16 (90338345 bp)
chr 17 (83257441 bp)
chr 18 (80373285 bp)
chr 19 (58617616 bp)
chr 20 (64444167 bp)
chr X (156040895 bp)
chr Y (57227415 bp)
chr M (16569 bp)
Processing chromosome 1 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr1.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr1.N-masked.fa
0 SNPs total for chromosome 1
0 positions on chromosome 1 were changed to 'N'
Processing chromosome 2 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr2.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr2.N-masked.fa
0 SNPs total for chromosome 2
0 positions on chromosome 2 were changed to 'N'
Hi,
Idea of improvment ;
Would be great if SNPsplit_genome_preparation could either take a folder or a fasta file directly in input.
Cheers
N
Hi, I'm running SNPsplit with the mouse genome GRCm38 and the strain CAST_EiJ.
The workflow was as such
# genome preparations (for comparison reason I tried both full_seqnece and N-masked)
SNPsplit_genome_preparation --strain CAST_EiJ --vcf Mmu.GrCm38/SNPsplit/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome GRCm38/ --nmasking --full_sequence --genome_build GRCm38
# genome indexing (command added only for full sequence, but is identical to N-masked run, onlt the directories differ).
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir STARIndex_full_sequence/ --genomeFastaFiles CAST_EiJ_full_sequence/*.fa
# Mapping
STAR --runThreadN 10 --genomeDir Mmu.GrCm38/SNPsplit/STARIndex_full_sequence --readFilesCommand zcat --readFilesIn IP_H3K4m3_WK36.conc.R1.fastq.gz IP_H3K4m3_WK36.conc.R2.fastq.gz --outFileNamePrefix SNPsplit/W36_full_sequence. --outSAMtype BAM Unsorted --outSAMattributes NH HI NM MD --outReadsUnmapped Fastx --alignEndsType EndToEnd
# SNPsplit ( Am I using the correct SNP file?)
SNPsplit --snp_file Mmu.GrCm38/SNPsplit/all_SNPs_CAST_EiJ_GRCm38.txt.gz --paired --conflicting --verbose SNPsplit/W36_full_sequence.Aligned.out.bam
My first question is what is the difference in the results between a full_sequence
run and an N-masked
run?
Second question is my main problem, as I get only unassigned reads.
my SNP file after filtering:
cat CAST_EiJ_SNP_filtering_report.txt
SNP position summary for strain CAST_EiJ (based on genome build GRCm38)
===========================================================================
Positions read in total: 78772544
21331153 SNP were homozygous. Of these:
20668547 SNP were homozygous and passed high confidence filters and were thus included into the CAST_EiJ genome
Not included into CAST_EiJ genome:
54636251 had the same sequence as the reference
4808 had no clearly defined alternative base
2800332 Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
662606 were homozygous but the filtering call was low confidence
Printed a single list of all SNPs to >all_SNPs_CAST_EiJ_GRCm38.txt.gz<...
so ~20M SNP were founds and passed the threshold. But how can I check how many of them overlap an exon (like you did in #24)?
Can this be a problem of using the STAR aligner? Should I try bowtie2
?
Any ideas?
thanks
Hi @FelixKrueger ,
Thank you for the great tool! It is a great help to my research.
I have a question. Is it possible to add BWA support? Our lab is interested in doing population genetics with split reads. Basically, we will follow the GATK pipeline which recommends using BWA for read mapping. I have tried running bwa (with soft-clipping option off) followed by SNPsplit, but it failed. So I guess BWA output is not yet supported.
Also I'm wondering in a future release if we'll be able to run the tool with soft-clipping on? That would increase the mapping efficiency.
Thank you very much!
Hi~
Recently, I took over an analysis project in which the reads from ChIPseq are needed to be splited into paternal and maternal ones(C57BL6-NJ and PWK-PhJ strain mouse). I'm wondering whether my analysis pipleline is wrong. So, could you help me to check the whole process???
filter and trim the reads before alignment
prepare the N-masked genome: in this procedure, I found the SNPsplit_genome_preparation tools in SNPsplit can process one genome one by one by specifying the strain name. So, I firstly masked the mm10 genome with PWK-PhJ SNP information downloaded from sanger institute website (file name: mgp.v5.merged.snps_all.dbSNP142.vcf.gz), and then masked the fa file outputed by previous step with C57BL6-NJ SNP information. Finally, I merged the fa files in the output directory of last maskking step into the new genome masked with maternal and paternal SNP sites.
align the reads to the new genome by bowtie2 with options [--end-to-end -N1 -q]
remove duplicates by sambamba
split the allele reads: in this step, I'm not sure how to input the snp file. I'm planning to merge the txt files from the step 2 in which SNPsplit seem to output the SNP information as txt files in each chromosome. But there are some questions: Can SNPsplit split the alignment into paternal and maternal ones at once? If true, how to provide the snp file with parameter --snp_file? or Should I run twice and split the maternal and paternal alignments seperately?
Can you help me to solve these problems???
Thank you sincerely!
Hanwen
I had do WGS for my samples, and how to reform the vcf that suit for the SNPsplit ?
Greetings..,
Nice tool I wanted use SNP split in my chipseq data I am new to this tool Can you please suggest me which reference file I can use here I have downloaded the vcf file v5 from the Sanger database as mentioned in user guide and reference genome from the ensemble GRCm39 latest version with GTF.
Can you please suggest can I proceed with the genome preparation or please suggest link to download reference genome and GTF it will be very helpful
Thank you
Hi,
Recently, I found when I want to set two strains of mouse background (use "--dual_hybrid"), it had failed to mask the reference genome. Followed is my command:
########
./SNPsplit_genome_preparation_v2VCF --vcf_file ~/Annotation/SNP/mouse/mm9/mgp.v2.snps.annot.reformat.vcf --reference_genome ~/Annotation/SNP/mouse/mm9/mm9 --strain C57BL6NJ --strain2 PWKPhJ --dual_hybrid --genome_build mm9 --nmasking
########
I didn't know what's the problem?
Could you give me some advises?
Thanks,
Best,
Garen
Dear Felix,
I am using SNPsplit with WGS data from an F1 mouse cell (BL6/spretus) line with a very high SNP density (41mio). Mapping using the N-masked genome works very well and X-linked reads map only to the maternal chromosome, which would be expected in this cross. Also coverage is quite even across chromosomes.
I now want to separate the reads overlapping SNPs to specific alleles and am using SNPsplit for this. These are deep WGS samples (60x, paired-end 150) and this was taking forever with one sample, so I parsed the reads only overlapping unfiltered SNPs in the UV treated samples (~10k) and am running SNPsplit on them now to assign mutations to one genome for only this subset of the library. This is taking very long as well and am curious if you have a method for parallelizing SNPsplit to speed up the process.
Of course it would be possible to split reads again on chromosome and run one at a time in several processes and concatenate results, but was wondering if you had possibly developed an alternative method. I apologize in advance it I am missing something obvious and would greatly appreciate any input. Thanks!!
Cheers,
Paul
Hello,
Great tool I am trying with the v6 version of VCF( ftp://ftp-mouse.sanger.ac.uk/REL-1807-SNPs_Indels/mgp.v6.merged.norm.snp.indels.sfiltered.vcf.gz ) because of desired genome its not present in V5,
I tried with seperate vcf file for 129S1 and JF1 and tried to followed steps in issues #31 , #41 still get empty files as a output command I tried as below
SNPsplit_genome_preparation --reference /media//SNPsplit/ --skip_filtering --dual_hybrid --strain 129S1_SvImJ --strain2 JF1_MsJ
output
Summary
0 Ns were newly introduced into the N-masked genome for strain 2 [JF1_MsJ] in total
Determining new Ref [129S1_SvImJ] and SNP [JF1_MsJ] annotations
Writing 129S1_SvImJ specific SNPs (relative to the GRCm38 reference) to >>129S1_SvImJ_specific_SNPs.GRCm38.txt<<
Writing JF1_MsJ specific SNPs (relative to the GRCm38 reference) to >>JF1_MsJ_specific_SNPs.GRCm38.txt<<
Writing SNPs in common between 129S1_SvImJ and JF1_MsJ (relative to the GRCm38 reference) to >>129S1_SvImJ_JF1_MsJ_SNPs_in_common.GRCm38.txt<<
Writing all new SNPs >>129S1_SvImJ/JF1_MsJ to >>all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt<<
Expected SNP file [all_SNPs_129S1_SvImJ_GRCm38.txt.gz] for strain 129S1_SvImJ did not exist! Please make sure that it is present in the current working directory
Could you please tell me a bit about the format of the BAM files that you need in snpsplit for read sorting from HiC data? HiCUP manual is not clear about the output BAM, but I am assuming that
sample1_2.bam
, which can be taken by snpsplitIn that case I could in principle just map my R1 and 2 seperately using bowtie2, then interleave the output files, and this should serve as valid input for snpsplit. Am I correct?
Thanks,
Vivek
Historically, we used to filter out any positions from VCF files where the alternative allele was not defined as as a single base (probably a relic from the days when there was one VCF file for a single strain). For the current mouse genomes project VCF file this seems overly harsh though since there may be different strains that are homozygous for different bases but at the same position.
Here is an example:
chr // pos // REF // ALT // GT strain1 // GT strain2 // GT strain3
1 135446 G A,T 0/0 2/2 1/1
Here all three strains would be homozygous compared to the reference, strain1 would have the same sequence as the reference, i.e. G/G, strain2 would be T/T and strain3 would be A/A. Can we please include these multiple variants as valid positions for the genome preparation.
Dear Felix,
recently I tried to use the SNP_split to split BS-seq reads from a rice hybrid, in the user guide I see that for SNP_split it adatped the bam from bismark, and before using SNP_split, the SNP position should be masked with "N", but when using the bismark, it will firstly convert "C" to "T", "G" to "A" in the BS-seq genome preparation and mapping step, so is the SNP positions masked by "N" are excluded for methylation level caculation because of no conversion?
Waiting for your reply, thanks.
Dear Felix,
I am wondering if SNPsplit could be used to analyze Bisulfite-Seq data in plants? Or is SNPsplit designed for data analysis in mammalian systems?
I am asking this question because plant DNA methylation occurs in all cytosine sequence contexts: CG, CHG, and CHH. In contrast, mammalian DNA methylation is located almost exclusively in the CG context. If SNPsplit considers CG methylation only, SNPs of cytosine at CHG/CHH contexts can't be differentiated between a real SNP and methylation state. Does this make sense?
Thank you so much for your help!
Best,
Shan
Hi Felix,
I wanted to know how Bismark handle positions with SNPs. I am planning to remove SNPs (C to T) in each sample after identifying them with BIS-SNP. I want to know whether Bismark infer the actual methylation proportion at these genomic positions and therefore no need to remove them. But as I understood by reading the issue #17 we need to remove them. Am I correct?
Thank You
PS: Sorry I just noticed that I have added this issue to SNPsplit, not Bismark. But I cannot delete it now.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.