xihaoli / staarpipeline

An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline

License: GNU General Public License v3.0

R 93.25% C 0.04% C++ 6.71%
functional-annotation rare-variant-analysis staar-pipeline statistical-genetics whole-exome-sequencing whole-genome-sequencing

staarpipeline's People

Contributors

lvmehinovic, xihaoli, zilinli1988

staarpipeline's Issues

preinstall packages

I would suggest adding the following lines for the preinstall packages. These packages seem to need manual installation in this order (for example, STAAR depends on SeqArray).

BiocManager::install("SeqArray")
BiocManager::install("SeqVarTools")
devtools::install_github("hanchenphd/GMMAT")
BiocManager::install("GENESIS")
devtools::install_github("xihaoli/STAAR",ref="main")
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
BiocManager::install("GenomicFeatures")
devtools::install_github("zilinli1988/SCANG")

I am also confused about the Intel Math Kernel Library.

docker for STAARpipeline

Hello,

I am trying to find out whether this pipeline has been implemented on Google Cloud Platform, or whether a Docker image is available for running it there, as I have seen one on the DNAnexus platform for UK Biobank.

Regards
Akhil

STAARPipeline for imbalanced binary phenotypes

Hello, and first of all many thanks for creating such a nice and useful pipeline!

I've seen that you recently improved STAAR for imbalanced binary phenotypes. Is this already integrated into STAARpipeline?

Many thanks in advance!

Best
Andi

Minimum number of samples?

I have a dataset with ~700 samples, eventually to increase to ~1000 but I'm testing out STAARpipeline on what I have so far. I know that is very small for a human GWAS study, but I thought the gene centric and/or sliding window analysis, in combination with the weights from annotations used in STAARpipeline, might bolster the power enough to be worthwhile.

I noticed that my results files from the STAARpipeline_Gene_Centric_Coding.R script just contained a list of NULL values. I went back and stepped through the script manually, and got the following message printed for the first gene:

# of selected samples: 721
# of selected variants: 103
# of selected samples: 721
# of selected variants: 12
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 9
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 721
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  genotype is not a matrix!
# of selected samples: 721
# of selected variants: 3,724,472

If I do str(results):

List of 5
 $ plof               : NULL
 $ plof_ds            : NULL
 $ missense           : NULL
 $ disruptive_missense: NULL
 $ synonymous         : NULL

Is my dataset just too small, or is there some other issue to be fixed? I am running 0.9.6 from the Docker container because I was working on this back in October, but can change to 0.9.7 if you think that would help.
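The "Number of rare variant in the set is less than 2!" messages suggest that some masks simply contain too few rare variants at this sample size. As a rough check, one could count the rare variants in a genotype matrix before calling STAAR. The sketch below is only an illustration: `count_rare_variants` is a hypothetical helper, not part of STAAR or STAARpipeline, and it assumes that "rare" means a cohort minor allele frequency above 0 and below `rare_maf_cutoff`, with dosages coded 0/1/2.

```r
# Hypothetical helper (not part of STAAR): count variants whose cohort MAF
# lies strictly between 0 and rare_maf_cutoff.
# Assumes geno is a samples x variants matrix of 0/1/2 dosages.
count_rare_variants <- function(geno, rare_maf_cutoff = 0.01) {
  stopifnot(is.matrix(geno))
  alt_freq <- colMeans(geno, na.rm = TRUE) / 2   # alternate allele frequency
  maf <- pmin(alt_freq, 1 - alt_freq)            # fold to minor allele frequency
  sum(maf > 0 & maf < rare_maf_cutoff)
}

# Toy data: 200 samples, two rare variants, one common, one monomorphic.
geno <- cbind(
  c(1, rep(0, 199)),      # one heterozygote
  c(1, 1, rep(0, 198)),   # two heterozygotes
  rep(c(0, 1), 100),      # common variant
  rep(0, 200)             # monomorphic, not counted
)
n_rare <- count_rare_variants(geno)
```

If a mask yields fewer than two such variants, an error like the ones above would be expected regardless of the software version.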

Is toy data provided?

Hi Dr. Li,

Thank you for the tool. It looks handy for users working through a GWAS analysis. I would like to know whether toy data is provided for newcomers like me.

Looking forward to your reply!

Account for population frequency

Hi Dr. Li,

When determining rare variants in the gene-centric coding pipeline, does STAARpipeline use allele frequency information from population databases such as gnomAD, 1000 Genomes, or ESP6500 to check whether a variant in an exonic region is also rare in all populations? Also, how do you define 'rare' in the other pipelines: rare in the cohort, or rare in the population databases?

Thank you very much

How do other species generate the variants list to be annotated?

Hi,

I want to use STAARpipeline to detect rare variation information in plants. I have my own vcf file and its annotation information. How can I generate the variants list to be annotated? I read the FAVORdatabase_chrsplit.csv in the example, and I don't know what it means, so I don't know how to start.

Looking forward to your reply!

Ayn

Null output in Dynamic window analysis

Hi,

I have recently been using your pipeline on Docker to run a burden test with dynamic window analysis.

After generating aGDS file, I conducted step0-step1-step5. I tested with WES data of 6 people (2 trios) with a chr1 region. Here is my original vcf file, generated aGDS file and commands:
dynamic_window_test.zip

As I only tested one chromosome, I changed some lines of code:

#### Number of jobs for each chromosome
jobs_num <- matrix(rep(0,3),nrow=1)
for(chr in 1:1)
{
	print(chr)
	gds.path <- agds_dir[1] # agds_dir[1]
	genofile <- seqOpen(gds.path)
	
	filter <- seqGetData(genofile, QC_label)
	SNVlist <- filter == "PASS" 

	position <- as.numeric(seqGetData(genofile, "position"))
	position_SNV <- position[SNVlist]
  
	jobs_num[chr,1] <- chr
	jobs_num[chr,2] <- min(position[SNVlist])
	jobs_num[chr,3] <- max(position[SNVlist])

	seqClose(genofile)
}

For groupid and arraryid in step 0, I directly set groupid = arraryid = scang_num.

Although no errors were reported, the output was null. I am not sure whether these changes were correct, and I have no idea which step went wrong. Could you give me some advice?

Thanks!

Provide test data

Hi, I'm interested in your software. I would like to use this software on other species, but there are no directly available annotation files, only vcf files. Can you provide test data in the tutorial? It would be even better to provide a pipeline that starts with the raw vcf.
Thank you.

Integration of annotations

Hello, I have a question about the STAAR code and its underlying publications.

In the code it says:
#' For each noncoding functional category, the conditional STAAR-B p-value is a p-value from an omnibus test
#' that aggregated conditional Burden(1,25) and Burden(1,1),
#' together with conditional p-values of each test weighted by each annotation using Cauchy method.
This reads to me as if, for example, separate burden tests are computed for each annotation weight, allele-frequency weight, and so on, and then combined afterwards.

But based on your publication it seems to me that you integrated the beta-allele frequency weighting directly with the functional variant annotation and the variant score to generate e.g. QBurden.

Which one is correct?

Best
Andi
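For readers unfamiliar with the "Cauchy method" named in the quoted comment, the general Cauchy combination of p-values can be sketched as below. This is a generic illustration only: `cauchy_combine` is a hypothetical function name, the equal default weights are an assumption, and the sketch does not settle which of the two descriptions above matches what STAAR actually does.

```r
# Generic Cauchy combination of p-values (illustrative sketch, not STAAR code).
# Each p-value is mapped to a standard Cauchy variate, the variates are
# averaged with weights, and the average is mapped back to a p-value.
cauchy_combine <- function(pvals,
                           weights = rep(1 / length(pvals), length(pvals))) {
  stopifnot(length(pvals) == length(weights),
            all(pvals > 0 & pvals < 1),   # exact 0 or 1 needs special handling
            all(weights >= 0), sum(weights) > 0)
  weights <- weights / sum(weights)       # normalise so the weights sum to 1
  stat <- sum(weights * tan((0.5 - pvals) * pi))
  0.5 - atan(stat) / pi
}

# Example: combine p-values from three differently weighted tests.
p_combined <- cauchy_combine(c(0.01, 0.2, 0.5))
```

Under this scheme each weighted test is run separately and only the resulting p-values are combined, which corresponds to the first reading above.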

Results are not consistent with what we got before

Hello,

I performed an RVAS analysis with STAARpipeline on a TOPMed dataset and obtained results. When I repeated the analysis on the updated dataset, however, I did not get the same results as before: genes that were reported as significant in the original analysis are either no longer significant or missing from the results files. Can you help me with this?

Thank you so much for help.

Gene_Centric_Coding Unable to Analyze Gene

Hello,

While running the Gene_Centric_Coding function, I noticed a strange issue while processing a list of genes for a specific chromosome.

On any given gene, the function seems to work properly until the internal coding function attempts to run the STAAR function:

try(pvalues <- STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, 
    rare_maf_cutoff = rare_maf_cutoff, rv_num_cutoff = rv_num_cutoff), 
    silent = silent)

I am receiving the following error, and thus no results from the current gene:

Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff,  : 
  Dimensions don't match for genotype and annotation!

This error occurs virtually for all genes. Looking into this, it appears the issue is how the annotation data is subset for the final list of variants that are lof in plof:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[lof.in.plof, ]

When I run this, lof.in.plof is a vector of NAs, TRUEs, and FALSEs, with the number of TRUEs corresponding to the final filtered number of variants to use (in my case, 5). When the annotation data in Anno.Int.PHRED.sub is subset using this vector, however, the final dimensions of the table still contain the number of rows that correspond to the previous number of variants (which, in my case, was 129).

The Geno matrix has the dimensions [n samples x 5 variants]. When Anno.Int.PHRED.sub.category is passed to the STAAR function, however, its dimensions are still [n samples x 129 variants], causing the error.

If I wrap the which function around lof.in.plof, the dimensions of the resulting table are [n samples x 5] and STAAR is able to run properly and gives no error:

Anno.Int.PHRED.sub.category <- Anno.Int.PHRED.sub[which(lof.in.plof),]

I assume this fix makes sense, and there should be no reason for Anno.Int.PHRED.sub.category to still contain rows with NA data? The final dimensions of this annotation table should indeed match those of the genotype matrix, no?
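The behaviour described here matches how base R handles NA values in a logical row index: every NA selects a row filled with NA instead of dropping it, whereas which() returns only the positions that are TRUE. A minimal sketch with toy data (the object names `anno` and `keep` are made up for illustration and are not taken from the package):

```r
# Toy annotation table: 4 variants x 3 annotation scores.
anno <- matrix(1:12, nrow = 4,
               dimnames = list(paste0("v", 1:4), c("a", "b", "c")))

# Logical filter containing an NA, like lof.in.plof in the report above.
keep <- c(TRUE, NA, FALSE, TRUE)

# Indexing with the raw logical vector keeps a row of NAs for the NA entry.
with_na <- anno[keep, , drop = FALSE]

# which() drops NA and FALSE, leaving only the rows where keep is TRUE.
without_na <- anno[which(keep), , drop = FALSE]

n_with_na <- nrow(with_na)        # TRUE rows plus one all-NA row
n_without_na <- nrow(without_na)  # TRUE rows only
```

An equivalent alternative to which(keep) is `keep %in% TRUE`, which also treats NA as FALSE.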

Forced relatedness in null model despite NULL kins

Hi all,

From reading the code and trying to understand and run the pipeline, it seems that the relatedness parameter is set to TRUE during the creation of the null model object, irrespective of whether kins is NULL or a matrix (cf. line 159 in fit_nullmodel.R).

As a direct consequence, STAAR will call STAAR_O_SMMAT (since sparse_kins is set to TRUE, c.f. line 81 in fit_nullmodel.R) rather than STAAR_O. From the outside, it would thus appear that a function designed to account for population structure/relatedness is called despite no information about the population structure/relatedness being provided, which is surprising.

Was this intended? Or should the relatedness element of obj_nullmodel be set to FALSE, such that STAAR_O would be called?

Running docker image

Hi everyone!
Is there any possibility of providing a small tutorial on how to use the docker image from STAAR-pipeline?
Without any directions I'm not sure how to execute the correct steps with it and I'm not sure how to proceed. I'm working on implementing the staar-pipeline for our cluster and the docker image would be the best way to do so.
Thanks
