natsuhiko / rasqual Goto Github PK

View Code? Open in Web Editor NEW

36.0 36.0 19.0 457 KB

Robust Allele Specific Quantification and quality controL

R 1.38% Makefile 0.71% Shell 2.24% C 95.67%

rasqual's People

Contributors

Stargazers

Watchers

Forkers

ramsinghlab q-kim xtmgah kauralasoo jharenza eivindgl rintukutum ghuls inambioinfo liangyy iamjli zhaouu eafyounian cdadams-harvard kerwin12580 s3bad zhongchao-chen

rasqual's Issues

segmentation fault with covariates file

Hey I am trying to run rasqual on chip-seq data and whenever i use my covariates file I get segmentation fault error. it works file without covariates file

tabix $VCF chr5:150219491-150290291 | rasqual -y $H3K4me3_counts_bin_file -k $H3K4me3_offsets_bin_file -s 150226191 150233313 150283713 -e 150226801 150233614 150285429 -j 30657 -l 922 -m 150 -n 30 -z -f "ACC_Plus_K4me3" -x $H3K4me3_covariates_bin_file -t
Segmentation fault

Also
for some features I am getting -nans in one of the column as an output

for example

tabix $VCF chr3:2102865-3016784 | rasqual -y $H3K4me3_counts_bin_file -k $H3K4me3_offsets_bin_file -s 2102865 2137383 2260304 2290182 2303466 2487799 2493511 2507834 2547112 2552350 2620889 2698716 2700440 2769101 2806883 2848422 2864056 2878623 2889979 3003245 3016088 -e 2103648 2144544 2260915 2290948 2304207 2488705 2494483 2508317 2548954 2557195 2621291 2699616 2700722 2769757 2807188 2849692 2865110 2879473 2890935 3003732 3016784 -j 15305 -l 13115 -m 1000 -n 14 -z -f "ACC_Plus_K4me3" -t

I get

ACC_Plus_K4me3 . chr3 2804401 A G 0.321429 3.623884 1.000000 -0.0001814579 2.3395617589 0.672682 0.010000 0.500000 1.679557 -0.102658 0 3333 4 2 2804401 -49.655566 0 -nan 0.994438

Can you help with this error?

Different output on example run when sample size is increased by 1

Hi Natsuhiko,

I am getting familiar with RASQUAL right now and testing the example run. If I increase the sample size from 24 to 25, the top SNP changes. Does it have something to do with setting prior or some internal parameters? So that I should always set the exact number of samples rather than giving rasqual a rough estimate of the minimum like -m and -l?

$ tabix data/chr11.gz 11:2315000-2340000 | rasqual -y data/Y.bin -k data/K.bin -n 24 -j 1 -l 378 -m 600     -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279     --fix-genotype -t
1	rs2521269	11	2321095	C	A	0.604167	0.041818	0.965100	-13.4603461799	66.1521580593	0.092827	0.000033	0.525289	2.934533	4.180899	11	83	7	4	2321095	-197.774783	0	0.995314	0.993467
$ tabix data/chr11.gz 11:2315000-2340000 | rasqual -y data/Y.bin -k data/K.bin -n 25 -j 1 -l 378 -m 600     -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279     --fix-genotype -t
1	rs12224967	11	2321284	G	A	0.340000	0.978724	0.978100	-11.0378792353	55.5774037017	0.887923	0.000037	0.494012	1.963603	4.263620	11	102	7	4	2321284	-201.188234	0	0.995295	0.992414

If I further increase the sample size, I got an error or some output like

$ tabix data/chr11.gz 11:2315000-2340000 | rasqual -y data/Y.bin -k data/K.bin -n 100 -j 1 -l 378 -m 600     -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279     --fix-genotype -t
gsl: gamma.c:1180: ERROR: error
Default GSL error handler invoked.
Aborted
$ tabix data/chr11.gz 11:2315000-2340000 | rasqual -y data/Y.bin -k data/K.bin -n 200 -j 1 -l 378 -m 600     -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279     --fix-genotype -t
1	SKIPPED	11	-1	N	N	-1.0	-1.0	-1.0	0.0	0.0	-1.0	-1.0	-1.0	-1.0	-1.0	0	0	-1	-1	-1	0.0	0	-1.0	-1.0

Are they expected?

Thanks!

Yanyu

Understanding Effect Sizes

Dear Natsuhiko,

I am going through the results of Rasqual and I found some QTLs with high effect sizes (0.8) and good Q-values ( 8.33E-13 ), but when I look at the coverage across the three groups I see no difference between my three groups ( homozygous reference, heterozygous, homozygous alternative). Could this be false positives or am I missing something ( should I check the values of other columns like Delta or Phi)?

Thanks a lot,
Ariel

Multiple comparisons correction

Dear Natsuhiko,

Thanks a lot for the work on Rasqual, it is a great tool. I am working with ChIP-seq data, and would like to compute FDR.

I saw in a recent paper ( Schwartzentruber 2017) that authors performed one permutation and then estimated FDR. Do you have a recommendation on how many permutations to run? Also, if I understand correctly, this FDR will be for each peak. So, for reporting the significance of a given variant the reported BH q-value from Rasqual is sufficient, and FDR is calculated only to get significant peaks, right?

I know that this is more a conceptual question so I appreciate a lot your help and time.

Best,
Ariel

Expected distribution of Pi

Hello!

In the supplement of Kumasaka et al 2016, you have figures depicting the expected and estimated distributions of RASQUAL parameters from simulations (Supplementary Figures 8-10).

I am really curious about the expected distribution of Pi (the "genetic effect"). If I understand Pi correctly, and assuming negligible mapping error and reference bias, I would have expected the distribution of Pi to be symmetric. In other words, I would expect the alternative allele at a common variant to be equally likely to increase (Pi>0.5) or decrease (Pi<0.5) expression/accessibility. But in the figures, it looks like there are more features where Pi<0.5 than Pi>0.5

Do you think this is due to some underlying biology (for example, alternative alleles tend to be minor alleles, and alleles that are less common in the population tend to be associated with less expression/accessibility than their higher frequency counterparts)? Or is the excess in P<0.5 (relative to P>0.5) driven by reference or mapping bias? Or is there another explanation I'm missing?

Thanks for your thoughts!
Cassie

Number of fSNP is 0 but lead QTL within peak

Hi, this is one example I got from my QTL mapping results using ATAC-seq with some of my own annotation

peak  chr peak.start peak.end        sid position ref alt allele.frq      HWE ia.score log10.BH.q  chi.sqr       Pi Delta Phi overdispersion sid.within.the.region no.fsnp no.rsnp no.iter.null no.iter.alt location.ties Log.lik.null convergence.status sqr.corr.fsnp sqr.corr.rsnp perm.log10.BH.q perm.chi.sqr     q.val    emp.p.val  emp.q.val midPoint dist qtl.type
chr4_87860220-87860720 chr4   87860220 87860720 rs79243012 87860512   T   C   0.214286 4.640955 0.999705  -1.873293 9.048585 0.305516  0.01 0.5       11.88503              3.654591       0 20            6           3      87860512     3.824137                  0            NA      0.994061      -0.0668922    0.8656526 0.9999999 0.0001357602 0.06012426 87860470   42 within

Basically the number of fSNP is 0 in the output, but the lead SNP found by RASQUAL is in the peak. Does this mean the fSNP is filtered out because there is not sufficient coverage, and then this test is only the linear QTL results just like SNPs that are outside of the peak?

Thanks

Rasquals Benjamini-Hochberg Q-value output

Hello,

I was wondering why the reported Q-value shows a value that is (always) negative?
To my knowledge a Q-value is an FDR adjusted p-value, and thus could only be between 0 and 1.

While my own results mainly range from a very small number (-0.0000001) to (-9.9999999), the data samples in the data directory resulted in a Q-value of -13.3941191185.

Would it be possible to get some additional information on how the BH Q-value was calculated ?

Thanks,

Danielle

Handling peaks with influential points

Hi Natsuhiko

I am looking to remove ATAC-seq peaks from my QTL analysis that have highly influential points. I'm thinking of using Cook's distance similar to what DESeq2 does:
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

under the "Approach to count outliers" heading. Basically, if a peak has a sample with a Cook's distance that is too large, the peak is removed from the analysis.

Do you think it would be sufficient to fit negative binomial models to the peaks using the glm function in base R and then calculate Cook's distance from these fits? This doesn't quite capture the over dispersion adjustment that RASQUAL uses in model fitting, but it would be much faster. Is this OK given that this is just a data filtering step?
Thanks!

Kevin

QTL Enrichment Analysis

What is the accurate definition for QTL enrichment, and how do you define trans- region?

Thanks!

Use "#include <gsl/gsl_.h>" instead of "#include <gsl_.h>"

Example:
https://www.gnu.org/software/gsl/manual/html_node/GSL-CBLAS-Examples.html

diff --git a/src/util.h b/src/util.h
index 3f979cc..59dd073 100755
--- a/src/util.h
+++ b/src/util.h
@@ -4,13 +4,13 @@
 #include <string.h>
 //#include <cblas.h>
 #include <blaswrap.h>
-#include <gsl_cblas.h>
-#include <gsl_rng.h>
-#include <gsl_randist.h>
+#include <gsl/gsl_cblas.h>
+#include <gsl/gsl_rng.h>
+#include <gsl/gsl_randist.h>
 #include <f2c.h>
 #include <clapack.h>
-#include <gsl_sf_gamma.h>
-#include <gsl_sf_psi.h>
+#include <gsl/gsl_sf_gamma.h>
+#include <gsl/gsl_sf_psi.h>
 #include <zlib.h>
 #include <time.h>

This allows to not specify the include path for gsl header files, in case they are in a gsl subdir in the default include dir (e.g when installing gsl with the package manager of your distribution).

If gsl is installed by the package manager, the following should work:

RASQUALDIR=/path/to/rasqualdir/
cd $RASQUALDIR/src
# Not run!  Please export your environment.
export CFLAGS="-I/path/to/your/CLAPACK-*.*.*.*/INCLUDE -I/path/to/your/CLAPACK-*.*.*.*/F2CLIBS"
export LDFLAGS="-L/path/to/your/CLAPACK-*.*.*.* -L/path/to/your/CLAPACK-*.*.*.*/F2CLIBS"
make
make install

Else (without /gsl subdir in include path):

RASQUALDIR=/path/to/rasqualdir/
cd $RASQUALDIR/src
# Not run!  Please export your environment.
export CFLAGS="-I/path/to/your/CLAPACK-*.*.*.*/INCLUDE -I/path/to/your/CLAPACK-*.*.*.*/F2CLIBS -I/path/to/your/gsl-*.*"
export LDFLAGS="-L/path/to/your/CLAPACK-*.*.*.* -L/path/to/your/CLAPACK-*.*.*.*/F2CLIBS -I/path/to/your/gsl-*.*/lib"
make
make install

createASVCF.sh with 10X Genomics single cell bams

Hi Natsuhiko,

I have a question about using RASQUAL with 10X Genomics bam files. Specifically I am having issues at the createASVCF.sh step to create the allele specific vcf file. I have previously gotten this to work for 25 bulk ATAC-seq files. My output of this step normally looks like this:

I am now running createASVCF.sh on 40 single cell bam files generated using 10X Cell Ranger ATAC. bam files were extracted for each cell type. When I run this I keep getting core dumps and segmentation faults and traced this back to the qcFilterBam step. Instead of getting reads found, reads removed, and reads remaining printed there are continuous core dumps. My recent output looks like this with the single cell bam files:

I have used very large amounts of memory but keep getting segmentation faults at the qcFilterBam step. Also, the single cell bam files look normal and work with programs such as samtools. Do you have any suggestions how I can address this?

Thank you very much,
Adam

ASVCF Memory Usage

Are there any guidelines on the memory consumption of createASVCF.sh? I've run out of memory even when allocating 150GB. I don't really know how to best tell if this is normal or if I'm doing something wrong. There needs to be at least some usage guidelines on this I think.

Imputation quality score (IA)

Hi,

I was wondering if Rasqual re-calculates the imputation quality score or use the value in the INFO field of the VCF file.

Best,
Alper

condition specific eQTLs

Hi,
I would like to know if there is any way to identify condition specific eQTLs with RASQUAL. I have two conditions from same samples. I am interested to see the eQTLs that are detected in one condition but not in other and vice-versa.

If not by RASQUAL, is there an easy way of doing it ? Any pointers would be very useful.

Meta analysis of RASQUAL results from different datasets

Hi, I am looking for any recommendations to perform a meta analysis from RASQUAL considering effect size and p-values, from different studies. Its the same tissue but from different consortiums.

RASQUAL with biological replicates

Hi,
I was wondering. What is the best way to incorporate biological replicates in RASQUAL study design? I have RNA-Seq/ATAC-Seq data for 100 individuals (2 biological replicates per individual)
The scenarios I can think of:

Perform the analysis on two sets of replicates and compare the results
Include replicates in RASQUAL study design (not sure what is the best approach here though)
I suppose in eQTL scenario all samples with the same allele at the locus are considered replicates, so I can probably just lump all the data together, but I wonder if there is a better way.

What do you do in situations like that?

All the best,
Agnieszka

error building rasqual

Hello Natsuhiko,

I am having trouble installing rasqual on my local cluster. The error I have is as follows:

util.c: In function ‘lomax’:
util.c:264:9: error: non-floating-point argument in call to function ‘__builtin_isnan’
if(x[i]>th && isnan(x[i])==0 && isinf(x[i])==0){
^
util.c:264:9: error: non-floating-point argument in call to function ‘__builtin_isinf_sign’
Makefile:23: recipe for target 'util.o' failed
make: *** [util.o] Error 1

I have made sure that CLAPACK and GSL are installed on my system and changed the changed the CFLAGS and LDFLAGS to points to the right directory.

Any information would be appreciated.

Bosh

Direction of fragment count and allelic imbalance

Hello,

Does RASQUAL expect that the haplotype with higher between-sample fragment count also be the haplotype with more reads in the allelic imbalance test?

Thanks!

Kevin

sample specific offsets

Hi Natsuhiko,

I would like to know why the sample specific offsets is mandatory. I have the normalized expression values that are also corrected for technical variation ( PEER residuals ) and VCF file with allelic counts. I am trying to run rasqual just by using the residuals ( -y ) but its showing an error input files are not specified!, preseumably its expecting -k i.e sample specific offsets.

Doesnt it work with the residuals ? Does it expect to provide raw counts and it calculates sample specific offsets and needs a covariates file ( -x ) ?

Thanks,
Goutham A

Estimated computation time too long

I am currently using rasqual for ATAC data with features extended 50 000kb around each peak. Unfortunately for most of the features the computation is canceled, because the estimated computation time is too long.

Estimated computational time is too long for 8:29443453:29544444 (Sample Size=84, NofrSNPs=249, NoffSNPs=215)...aborted.
Estimated computational time is too long for 8:29466273:29567135 (Sample Size=84, NofrSNPs=233, NoffSNPs=200)...aborted.
Estimated computational time is too long for 8:29472642:29573108 (Sample Size=84, NofrSNPs=227, NoffSNPs=191)...aborted.
Estimated computational time is too long for 8:29530196:29630965 (Sample Size=84, NofrSNPs=210, NoffSNPs=185)...aborted.
Estimated computational time is too long for 8:29544525:29645849 (Sample Size=84, NofrSNPs=201, NoffSNPs=185)...aborted.
Estimated computational time is too long for 8:29607788:29708459 (Sample Size=84, NofrSNPs=195, NoffSNPs=180)...aborted.
Estimated computational time is too long for 8:29611621:29712319 (Sample Size=84, NofrSNPs=198, NoffSNPs=183)...aborted.

Is there a way to fix this or somehow parallalize the computation?

qvalue nan error

tabix $VCF chr5:150219491-150290291 | rasqual -y $H3K4me3_counts_bin_file -k $H3K4me3_offsets_bin_file -s 150226191 -e 150226801 -j 30657 -l 922 -m 55 -n 14 -z -f "ACC_Plus_K4me3" -x $H3K4me3_covariates_bin_file -t

Qval nan! 0.500000 0.500000
gsl: gamma.c:1180: ERROR: error
Default GSL error handler invoked.
Aborted

The rasqual output

Hello,

My issue is that rasqual doesn't seem to find any ASE sites. I have raw read counts for near 60K genes, from 56 samples. I have used the R script in the rasqual folder to create offset for the gene expression data (without using gc percentage), as well as the txt2bin R script to create the bin files necessary.

I created my own vcf files containing the allele specific reads, of which I've given an example below. The vcf files contain the genotype and allele specific read counts for 56 samples (not the 4 displayed as example). I created a vcf file for each chromosome, resulting in 200K to 800K snps per vcf file.
vcf file example:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10 11 12 13

chr18 10659 rs72854246 G C 100 PASS RSQ=0.9 GT:AS 0|1:0,0 0|1:0,0 0|1:0,0 0|1:0,0 ....

I have ran rasqual on 100 genes, but each output looked like:

ENSG0000000XXX.X_gene SKIPPED -1 N N -1.0 -1.0 -1.0 0.0 0.0 -1.0 -1.0 -1.0 -1.0 -1.0 0 0 -1 -1 -1 0.0 0 -1.0 -1.0

The command line I used for 1 of these 100 genes looked like this:

tabix chr_6.vcf.gz -f | bin/rasqual -y gene_expression.bin -k offSet.bin -n 56 -j 10 -l 657704 -m 100 -s 41040684 -e 41067715 -f ENSG00000XXXX.X_gene

eQTL and ASE analysis with other programs have been run on this data, which did provide results.
Could anyone maybe explain why I get no results with rasqual?

Regards,
Danielle

Wrong number of fSNPs using helper scripts

Hey, I am currently using rasqual with the helper tools from kauralasoo on some ATAC data and a lot of my peaks get skipped with this message:
The number of fSNPs is greater than that specified by '-m' option.
Aborted...
Since the -m option is calculated by the helper script, I was wondering if there are any common mistakes or special things I should look out for.

MAF filtering with -a uses > rather than >=

Hello,

When setting -a to 0.1 to get MAF >= 10%, the smallest and largest allele frequencies in my RASQUAL output files are 0.125 and 0.875, indicating that the MAF is 0.125. I have 20 samples, so 0.125 is the next highest MAF after 0.1. Would it be possible to change -a to select >= value rather than > value? Alternatively, updating the documentation to suggest using a value slightly lower than the target MAF could be a good work around. For example, -a 0.099 if 0.1 is required.

Thanks!

Kevin

QQ plot from rasqual P-values

Dear Natsuhiko,

I have a question about the results from my ATAC caQTL analysis:
If I plot the p-values to make a qq plot, I see something like this

where the P-values lay under the diagonal.
I use covariates and size factor files generated by rasqualtools , plus I add sex and 1K genomes first 4 PCs in the covariate file.
It seems that overall there is less deflation when I do not add the additional covariates.
Does it seem correct to you? DO you recommend any strategy to improve it?
Thank you very much,

Paola

Permutations and RASQUAL results

Hello,

Just want clarity on exactly what the outputs mean. My understanding is that the chi-square statistic comes from the likelihood ratio test, and thus can be converted to raw pvalues directly. What is the relation to the Q-value output?
Also, what exactly is the permutation test doing? And is it related to what is reported in the online methods multiple testing correction section?

Thanks!

Mandatory inputs

Hi,

Hope you are having a great day.

I had a question regarding the mandatory inputs for rasqual. In the documentaion, I found the followings as mandatory input for the rasqual package.

-y/--feature-counts File name of total feature counts in binary double
-k/--sample-offsets File name of sample specific offset terms in binary double
-s/--feature-starts Feature start position(s) in comma separated value (e.g., 100,300,500)
-e/--feature-ends Feature end position(s) in comma separated value (e.g., 150,350,450)
-j/--feature-id Row number of target feature in original count table
-l/--number-of-testing-snps Number of testing SNPs in VCF (candidates of a rSNP)
-m/--number-of-fsnps Number of feature SNPs (fSNPs) in VCF
-n/--sample-size Sample size

Among these, I wasn't sure how to get -s, -e, -l, and -m. I went through the page that explained how to get inputs for -y and -k, but wasn't sure about other ones.

I also found someone using rasqual with just -y, -k, and -x inputs, and was wondering if this is what I am supposed to do if I want to perform for all data, rather than few selected ones.

Thank you.

Best,
Joon Yuhl Soh

Include other covariates

Hi Natsuhiko,

For my caQTL analysis, in addition to the covariate file generated from PCA of the peak matrix, I would like to add other co-variates, such as male/female and PCA from genotype data (ancestry). How shall I do that? Shall I combine them in a unique file or do they have to be separate files?

Another question - in some cases I have outliers in the PCA[1 and 2] of the peak matrix - but I do not want to remove the samples because I only have 10 samples in total - how would you recommend to proceed in these cases?

Thank you very much for your great work!!

Paola

Issue with rasqual and/or txt2bin.R

Hello,
I'm having trouble running the rasqual program. When i ran the example command shown below with the provided sample data X.bin and Y.bin everything worked correctly. However when I used the X.txt file located in the data directory, re-created the K.txt file from the X.txt using makeOffset.R (then renamed the output to K.txt), and converted these files to .bin with txt2bin.R, and ran the example command again:

tabix data/chr11.gz 11:2315000-2340000 | bin/rasqual -y data/Y.bin -k data/K.bin -n 24 -j 1 -l 378 -m 62 -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279 -t -f C11orf21 -z

This error is displayed:

gsl: gamma.c:1180: ERROR: error
Default GSL error handler invoked.

I also tried running the program with my own dummy data, but this resulted in:

1 SKIPPED -1 N N -1.0 -1.0 -1.0 0.0 0.0 -1.0 -1.0 -1.0 -1.0 -1.0 0 0 -1 -1 -1 0.0 0 -1.0 -1.0

2707 caQTL in RASQUAL paper

Dear Natsuhiko,
Thanks so much for developing rasqual! Could you provide the 2707 caQTLs identified in RASQUAL paper?

best,
Ya

Number of test SNPs

Dear Natsuhiko,

When I run the sample code (after removing -t option to print all summary statistics)
tabix data/chr11.gz 11:2315000-2340000 | bin/rasqual -y data/Y.bin -k data/K.bin -n 24 -j 1 -l 378 -m 62 -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279 -f C11orf21 -z
I get 83 lines rather than 378 SNPs that are tested.

tabix data/chr11.gz 11:2315000-2340000 | wc -l
returns 378 for me.

I get some SNPs dropped in the results when I run my data on it, as well.

Is this expected?

Identical results for close SNP

Hi, I noticed that in my RASQUAL results, there are many cases for SNPs that are close to each other to have identical results, except for their SNP ID, REF and ALT columns. I am wondering whether this suggests there are some problems with the RASQUAL run, or is this to be expected?

Thanks so much in advance!

Do covariates need to be normalized?

Can they take all positive values for instance?

Also, if we want to model a categorical variable, say male/female, can this be accomplished by just assigning 0/1s?

Thanks,
Johnny

"nan" for squared correlation between prior and posterior genotypes (fSNPs)

Hi Natsuhiko,

I ran RASQUAL on ATAC-Seq data to get the lead SNP per peak. Out of 121,214 peaks I analyzed, 54,232 snps have -nan in the 24th column, Squared correlation between prior and posterior genotypes (fSNPs) . Is it because those SNPs are not fSNPs but rSNPs ?

Thanks,
Goutham A

Get p-value for each feature-variant test

Hi Natsuhiko,

I have a question about obtaining the raw p-value of the test. I am not sure if the q-value reported is always done by permutation. I am wondering if it is possible to get the raw p-value of each test (and presumably shut down permutation since it is not necessary for this setting).

Thank you very much!

Yanyu

Number of fSNPs greater that the specified with -m

I am getting this error

The number of fSNPs is greater than that specified by '-m' option.

I would like to know how RASQUAL determines the number of fSNPs. I calculated the fSNPs as the ones that overlaps the exons of that gene.

Output line without information

Hi Natsuhiko,

Occasionally, I got output like the following:

1	rs	1	0	N	N	-1.0	-1.0	-1.0	0.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1	2	1	-1	-1	-1	-1	-1	0.0

It seems that it ran through null and alternative model according to log file as follow (running on one test SNP with two feature SNP)

#################################
#          RASQUAL v 1.0        #
#################################

Hyper-parameters :
 ab_phi: 10.001000
 ab_pi : 10.001000
 kappa : 2.020000
 omega : 0.200000

 beta0 : 100000.000000

 sigma2: 10000.000000

Sample Size		   : 200
Init No. fSNPs	       : 2
Init No. testint SNPs : 3
Kth feature		   : 1

100
200
Effective feature length			: 10000


Memory has been allocated...
heter_var1 100 km=0.690664 af=0.312500 hwe=41.322314 asgaf=0.500000  asaf=0.498920
RSQ=1.000000 EP=0.000000
heter_var2 200 km=0.138730 af=0.060000 hwe=0.814848 asgaf=0.500000  asaf=0.493646
RSQ=1.000000 EP=0.000000

Data has been loaded...


No. fSNPs	   : 2
No. all SNPs      : 3 3
No. testing SNPs: 1
No ase/ase ratio: 0.000000


va0=5270392.205963 mu0=4915.938200 th0=4.589604

N0=200 P=1 beta=8.500238 th=4.589604 s=10000.000000 kappa=2.020000 omega=0.200000
beta0: 8.500238
init beta: 8.500238,
Init likelihood:  -1818.862952 gradb=0.000000
mean eta=8.500223 sd eta=-nan sdki=0.000507 mki=8.500223
convstatus=1 200.000000 stuck=0.000000Final likelihood: -1818.672115
Final theta: 4.874713
Final beta: 0 8.500223

Gradient th: 0.000000
Gradient be: 0.000000

Glm. results:
Lkhd GLM=-1818.672115 Theta=4.874713
8.500223
10011.000000 6268.000000 5228.000000 3378.000000 4848.000000 5189.000000 3109.000000 1797.000000 5188.000000 8745.000000 6099.000000 4713.000000 7091.000000 2145.000000 3142.000000 2216.000000 9349.000000 3249.000000 3847.000000 4452.000000 5076.000000 3476.000000 6405.000000 3852.000000 2297.000000 5627.000000 3275.000000 8271.000000 7235.000000 1294.000000 5247.000000 8161.000000 2705.000000 3953.000000 3942.000000 11778.000000 3933.000000 5353.000000 3769.000000 4687.000000 4709.000000 3020.000000 5400.000000 5575.000000 9009.000000 2114.000000 6091.000000 3217.000000 4772.000000 4711.000000 3368.000000 6272.000000 4484.000000 2819.000000 5823.000000 8490.000000 2885.000000 1626.000000 4039.000000 5681.000000 3014.000000 7146.000000 1501.000000 2398.000000 4408.000000 3926.000000 5643.000000 1557.000000 6734.000000 6684.000000 7690.000000 5468.000000 5364.000000 2433.000000 4430.000000 6716.000000 6386.000000 4930.000000 7985.000000 3847.000000 7014.000000 3062.000000 5613.000000 2767.000000
0.999713 0.999961 1.000873 0.999869 1.000015 0.999941 0.999536 0.998969 0.999522 0.999866 1.000039 0.999196 1.000505 0.999615 0.999243 1.000348 0.999589 0.999409 0.999669 0.999510 0.999800 0.999987 0.999513 0.999513 1.000216 1.000262 1.000264 1.000175 1.000510 1.000518 1.000250 0.999941 0.998979 0.999721 0.999212 0.999963 1.000535 1.000489 1.000356 0.999679 1.000227 1.000447 1.000436 1.000219 1.000168 0.999083 0.999921 0.999046 0.999778 0.999276 1.000727 0.999790 1.000329 1.000021 1.000269 1.000296 1.000017 1.000115 0.999637 1.000486 0.999774 1.000696 0.999758 0.999720 1.000804 1.000627 0.999569 1.000484 0.998979 0.999318 0.999253 0.999938 1.001283 1.000311 0.999999 1.000389 1.000409 1.000829 1.000671 0.999817 0.999706 1.000265 0.999421 0.999998
4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999 4915.864999

Grand Null w/o AS [1] pi=0.500000 delta=0.010000 phi=0.500000 beta=8.500223 theta=4.874713 theta2=4.874713 asr=0.000000 lkhd=-1843.684188 ldiff=0.000000 gpi=0.000000 gd=0.000000 gh=0.000000 gb=-0.000000 gt=0.000000 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [4] pi=0.500000 delta=0.010066 phi=0.503711 beta=8.500223 theta=7.358195 theta2=7.358195 asr=0.000000 lkhd=-2355.031560 ldiff=0.000000 gpi=0.000000 gd=0.000000 gh=-0.000000 gb=-0.000000 gt=-0.000000 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [231] pi=0.500000 delta=0.000000 phi=0.503971 beta=8.500223 theta=7.373650 theta2=7.373650 asr=0.000000 lkhd=-2371.397468 ldiff=0.055857 gpi=0.000000 gd=-0.990000 gh=-12.903892 gb=0.000000 gt=-10.653832 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [231] pi=0.500000 delta=0.000000 phi=0.478140 beta=8.500223 theta=9.494437 theta2=9.494437 asr=0.000000 lkhd=-2364.384132 ldiff=0.065054 gpi=0.000000 gd=0.010000 gh=60.307564 gb=0.000000 gt=-52.711117 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [12] pi=0.500000 delta=0.010068 phi=0.503711 beta=8.500223 theta=7.358195 theta2=7.358195 asr=0.000000 lkhd=-2355.031560 ldiff=0.000000 gpi=0.000000 gd=-0.000001 gh=0.000000 gb=-0.000000 gt=0.000002 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [17] pi=0.500000 delta=0.010071 phi=0.503711 beta=8.500223 theta=7.358195 theta2=7.358195 asr=0.000000 lkhd=-2355.031560 ldiff=0.000002 gpi=0.000000 gd=-0.000004 gh=0.000000 gb=-0.000000 gt=0.000006 gR=0.000000 tval=0.000000 kld=0.000000
Grand Null with AS [15] pi=0.500000 delta=0.010067 phi=0.503711 beta=8.500223 theta=7.358195 theta2=7.358195 asr=0.000000 lkhd=-2355.031560 ldiff=0.000000 gpi=0.000000 gd=-0.000001 gh=0.000000 gb=-0.000000 gt=0.000002 gR=0.000000 tval=0.000000 kld=0.000000
Fitting null hypothesis finished.


Alternative Start

2	testSNP 4.874713 [4] pi=0.537175 delta=0.010000 phi=0.500000 beta=8.555522 theta=5.194722 theta2=5.194722 asr=0.000000 lkhd=-1842.902025 ldiff=0.000000 gpi=-0.000004 gd=0.000000 gh=0.000000 gb=0.000000 gt=-0.000001 gR=0.000000 tval=0.000094 kld=0.982743
2	testSNP [231] pi=0.493666 delta=0.000000 phi=0.505819 beta=8.490609 theta=7.206939 theta2=7.206939 asr=0.000000 lkhd=-2355.800577 ldiff=0.063030 gpi=29.188328 gd=0.010000 gh=-4.318574 gb=0.699464 gt=1.396312 gR=0.000000 tval=0.000001 kld=0.982709
ASEQTLALL finished.

Could you clarify what happens here?

Thanks!

Yanyu

pasteFiles puts the AS counts at end of FORMAT field, when not all sample info is present

We have VCF files created by the GATK genotyping pipeline. This includes multiple columns other than GT in the FORMAT field, such as AD, DP GQ, PL (GT:AD:DP:GQ:PGT:PID:PL). When we use pasteFiles to add AS counts this gets appended to the end (GT:AD:DP:GQ:PGT:PID:PL:AS). Missing fields can be set to ., but trailing fields are dropped according to the VCF format.

E.g. a sample field has DP, GQ, PGT, PID, and PL missing, so the value is ./.:0,0:0. Then the AS gets appended to ./.:0,0:0:0,0, but now it does not match the FORMAT field.

First row of RASQUAL output

Hi Natsuhiko,

RASQUAL has been helpful so far for caQTL analysis and ATAC-seq data. It seems to be working when I use a wrapper script to run RASQUAL on all peaks in our dataset. However, I get 1.000000, 1.000000, etc. all across the first row of each output file (I attached a picture of what I get). The rest of the output all looks correct according to the Output section in the tutorial. I've tried a lot of things but always get this in the output.

Do you think is due to some memory issue or something else? Also I assume this would not affect our results?

Any help would be much appreciated.

Thank you!
Adam

Minor allele frequency filtering

Hello,

For minor allele frequency filtering with -a/--minor-allele-frequency, does RASQUAL calculate the minor allele frequency or just take it from the VCF info file? I have some VCF files where I have removed a few individuals, so the MAF in the info field is not correct for the new set of samples. If RASQUAL takes MAF from info, then I will need to make a new set of VCFs when subsetting samples.

Thanks!

Kevin

Using -m and -l

Dear Natsuhiko,

I would like to get a clarification on using -m and -l for RNA-Seq data.

I am primarily interested in allele-specific gene expression. So I am planning to use the SNPs with in exons with "--as-only" option so that I end up with genes showing allele specific expression. Is this right approach ?

In a hypothetical scenario, there could be 200 SNPs that span the genomic space of the gene ( exons + introns ) but only 10 SNPs might overlap the exons. In this case, should I say -m 10 -l 200 ?

Also, sometimes I get the "-nan" in column 12 (Squared correlation between prior and posterior genotypes (fSNPs). What does this mean ?

Thanks,
Goutham A

AS counts and permutation test

Hi All,

I am using RASQUAL on RNA-Seq data. Its been doing an amazing job so far.

I have couple of questions:

I saw in a paper that they split the genes in to "Genes with fSNPs" i.e likely to have allelic counts and "Genes without fSNPs" and use different options. For "Genes without fSNPs" is it recommended to use "--population-only" ? Is it very important to do it that way ? Also, how does it effect results if we use default options on all genes ?

Regarding "--random-permutation", when I use this option, I often noticed that the results doesnt make much sense compared with the ones ran without random permutations. I would like to get some insights into this.

Final question is: How should we filter RASQUAL output ?
RASQUAL gives a column "Log_10 Benjamini-Hochberg Q-value". Can we use this column to filter the results ? I am using all the default values to run RASQUAL.

How can reference allele bias be calculated per SNP

Hi Natsuhiko,

Could you explain how reference allele bias can be calculated per SNP? I understand how reference allele bias can be calculated genome-wide per individual by taking the average reference allele ratio across all SNPs per individual. However, I don't understand how this can be done individually per SNP. We don't know if the reference allele bias is real or technical for a given SNP unless we remove technical bias with a program like WASP. Estimating the bias between individuals for a given SNP also seems problematic because all individuals should have allelic imbalance if it is real.

Thanks,

Kevin

fSNP not tested?

I am running RASQUAL on scATAC-seq data from 172 individuals and noticed that the fSNP are never present in the output. No matter how I adjust filters (for both fSNP and rSNP), the SNPs within the ATAC peak are never included in the output. How can I ensure that SNPs with the ATAC are tested? I have poured over the documentation and past tickets and cannot seem to find an answer.

tutorial for using RASQUAL on ATAC-seq data to find ATAC-QTL

I am wondering if you have a step-by-step guide, clearly written to find ATAC-QTL? I find it difficult to use the tool to analyze based on the current documentation provided..

Thank you.

Visualizing QTLs (coverage boxplot)

Hi,

In order to visualize the association between genotype of a specific locus and the expression/chip height, I am wondering if "unprocessed raw count" should be used ?

For the detection of QTLs, I have included covariates in the analysis. Is there any function included in the package that would generate covariate-adjusted expression count?

Wilson

Correlated peaks not sharing QTLs

Hi Natsuhiko,

We have a very strong chromatin QTL that looks great between variant 1:109817590 and peak11858 in our data. After looking in the browser, we found three nearby peaks that are strongly correlated with peak 11858 (peaks 11859-61). However, RASQUAL doesn't identify any significant associations for these 3 additional peaks, including with variant 1:109817590. We are trying to figure out what could be causing this. We would be greatful for your input. Things I have tried so far are below:

I began by making sure that these peaks are correlated after normalizing for library size and adjusting for the first 4 PCs (which is what I do for the QTL analysis). I then calculated
correlations:
Pearson correlation (r) of peaks adjusted for library size and first 4 PCs
peaks 11858 and 11859:
0.7953069

peaks 11858 and 11860:
0.7622177

peaks 11858 and 11861:
0.7843578

So they are very strongly correlated even after all the transformations.

I then regressed each of these peak counts against the genotype of 1:109817590 using negative binomial regression (so no allelic imbalance). I did this using the glm.nb function in R. I normalized
the peak counts by library size and used first 4 PCs as covariates in the regression:
Peak11858:
beta: 0.70
p-value: < 2e-16

peak 11859:
beta: 0.32
p-value: 9.65e-09

peak 11860:
beta: 0.26
p-value: 8.67e-08

peak 11861:
beta: 0.44
p-value: 8.21e-11

All are very significant, with 11858 (the peak that RASQUAL identified as having a significant cQTL with 1:109817590) being the strongest by p-value and beta. However, the RASQUAL p-values are weak for everything except 11858:
(All of these are raw p-values not adjusted for multiple testing)
peak11858: 8.529386e-11
peak11859: 0.03805979
peak11860: 0.1320015
peak11861: 0.0252746

I next ran RASQUAL with these 4 peaks using --population-only to see what impact allelic imbalance has on the results. Variant 1:109817 is no longer the lead variant for any of the 4 peaks, but the p-values are below:
peak11858: 0.00209815
peak11859: 0.1014721
peak11860: 0.3730409
peak11861: 0.02556151

Removing allelic imbalance greatly reduced the significance of the association between 1:109817590 and peak11858. However, the p-value for peak11858 when --population-only is used is still much stronger than the other 3 peaks.

When --population-only is used, 1:109817590 is no longer the lead variant for any of the 4 peaks. The new lead variant for peak11858 is
1:109814880
p-value: 0.0007529742

However, variant 1:109814880 is not strongly associated with the other three peaks. The p-values are below:
peak11859: 0.08443492
peak11860: 0.2146224
peak11861: 0.1164564

Do you have any ideas about why these 4 peaks don't share cQTL variants given that they are highly correlated. I'm curious as to why RASQUAL ran in --population-only mode differs so much from the glm.nb function in R.

Thanks for any help you can provide!

Kevin

Estimated computational time is too long ...

Hi Natsuhiko,
I got a warning "Estimated computational time is too long for STPG1 (Sample Size=48, NofrSNPs=2795, NoffSNPs=13)...aborted" . I looked all open/closed issues, there are no solutions yet. Is this due to too many tested SNP, or only a problem for me?

My full command is
tabix my_as_fin.vcf.gz chr1:24183489-25243424 | rasqual -y Y_expmat.bin -k K_sizefactor.bin -n 48 -j 8 -f STPG1 -l 2796 -m 61 -s 24683489,24683490,24683495,24683495,24683527,24687341,24687341,24695194,24695533,24696164,24696164,24700192,24700192,24705763,24706143,24706143,24710357,24710392,24710392,24717693,24717748,24718051,24718051,24718052,24718414,24718414,24727569,24727809,24727809,24727809,24727809,24727809,24737193,24737193,24738131,24740164,24740164,24740164,24741401,24741401,24742493,24742967 -e 24685109,24685109,24685109,24685109,24685109,24687531,24687531,24696329,24695897,24696329,24696329,24700300,24700300,24706313,24706313,24706313,24710493,24710493,24710493,24718169,24718169,24718169,24718169,24718169,24718561,24718557,24727946,24727949,24727946,24727946,24727946,24727821,24737307,24737269,24738517,24743424,24740230,24740215,24741588,24741587,24742643,24743085 -x x_covar.bin --n-threads 5

Thanks a lot.

All best,
Liuyang

Can we use non sequencing based inputs?

This may be a dumb question but I have RT-PCR expression data or methyl array data for a set of genes that I would like to look at, could I just supplement this for “counts”? Or is there another tool to look at QTLs based on these inputs?

if Indel can be used in VCF input now

Dear Natsuhiko,
I want to run rasqual on my data. Now I want to know,if Indel can be used in VCF input now.

natsuhiko / rasqual Goto Github PK

rasqual's People

Contributors

Stargazers

Watchers

Forkers

rasqual's Issues

Thanks a lot, Ariel

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10 11 12 13

Recommend Projects

Recommend Topics

Recommend Org

Thanks a lot,
Ariel