mojaveazure / angsd-wrapper

This project forked from arundurvasula/angsd-wrapper

Utilities for analyzing next generation sequencing data

Home Page: https://github.com/ANGSD-wrapper/angsd-wrapper

License: MIT License

R 34.85% Shell 65.15%

angsd-wrapper's Introduction

Hello World

I am Paul Hoffman, a senior bioinformatician with the Satija and Lappalainen Labs at the New York Genome Center. My work revolves around building software for bioinformatic analyses, including an R package for analyzing single-cell data, along with several extension packages, and a pipeline for processing RNA-seq data. I work primarily in R and Python, but am always learning.

angsd-wrapper's People

Forkers

tvkent linhua-sun jinfengchen

angsd-wrapper's Issues

Help.sh?

I don't see any mention of this in the tutorial, but it looks useful. Add a mention on the front page of the wiki?

Thetas windowed analysis does not specify an output file

ANGSD documentation lists an -outnames argument that is missing in this part of thetas wrapper:

"${ANGSD_DIR}"/misc/thetaStat do_stat \
    "${OUT}"/"${PROJECT}"_Diversity.thetas.gz \
    -nChr "${N_CHROM}" \
    -win "${WIN}" \
    -step "${STEP}"

I'm guessing this is just in case you do windowed and non-windowed and don't want to overwrite, but just checking
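
A minimal sketch of how the windowed call could write to its own prefix, assuming the installed thetaStat accepts -outnames as the ANGSD documentation describes. The function name build_windowed_cmd and the "_Diversity.windowed" suffix are hypothetical; the remaining arguments mirror the wrapper snippet above. It only prints the command so it can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the windowed thetaStat call with an explicit
# output prefix, so a windowed run cannot overwrite a non-windowed one.
# -outnames comes from the ANGSD documentation; confirm your version has it.
# The "_Diversity.windowed" suffix is an assumed naming choice.
build_windowed_cmd() {
    local angsd_dir="$1" out="$2" project="$3" n_chrom="$4" win="$5" step="$6"
    local cmd=(
        "${angsd_dir}/misc/thetaStat" do_stat
        "${out}/${project}_Diversity.thetas.gz"
        -nChr "${n_chrom}"
        -win "${win}"
        -step "${step}"
        -outnames "${out}/${project}_Diversity.windowed"
    )
    # %q quotes each word so the printed line can be pasted into a shell.
    printf '%q ' "${cmd[@]}"
    printf '\n'
}
```

Calling build_windowed_cmd "${ANGSD_DIR}" "${OUT}" "${PROJECT}" "${N_CHROM}" "${WIN}" "${STEP}" shows the full command line, including the extra output prefix.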

Consistency

There are a few consistency problems, some of which I had raised before and which were fixed in the previous version of AW.

The wrapper calls and filenames don't match. The calls seem to be some sort of abbreviated name, the shell scripts they're calling have a longer name, and the config files have a long name as well, but I didn't look to see if they match the shell scripts or not. These need to match. The idea behind this wrapper is that it is easier to use than angsd, and inconsistencies, especially in things like commands and files you have to call in the same line, make it non-intuitive to use. I know it seems nitpicky, but this is an actual problem. Change to the shorter names to be consistent and cleaner.
The tutorial says to run setup, and when I run this I get a response telling me to run setup dependencies and setup data, but this doesn't match the installation tutorial. These should also match for the sake of ease of use. (Either add to the tutorial or drop from the response.)

2DSFS Fst output file mystery

Thank you for writing and updating angsd-wrapper - it has been a big help for me.

I am running the 2DSFS Fst estimation on MSI (mesabi). Although the process starts correctly, it ends before calculating Fst, with the error mv: cannot stat /panfs/roc/groups/3/grossb/grossb/scratch/dom_si_Fst_01/Fst/shared.pos.gz: No such file or directory. When I go to the project directory in scratch, the 'shared.pos.gz' file is present, but it is outside of the 'Fst' folder where all of the other output files have been created. Any idea how I am managing to get this one output file to go to the wrong place, or how to fix it? Thank you for your time.

bash_profile

Does the tutorial say it assumes the user is using bash? Might be worth mentioning. I use zsh for example, so adding:

alias angsd-wrapper='/home/jri/teaching/ecl243/angsd-wrapper/angsd-wrapper'

to my bash_profile during setup doesn't do much does it?

-Jeff
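
One way setup could handle this is to pick the startup file from the user's login shell instead of always writing to bash_profile. The sketch below is only an illustration: startup_file_for is a hypothetical name, it covers bash and zsh only, and the fallback to ~/.profile is an assumption that will not suit every shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a login shell path (for example the value of
# $SHELL) to the startup file an alias should be appended to.
# Shells other than bash and zsh fall back to ~/.profile (an assumption).
startup_file_for() {
    case "$(basename "$1")" in
        bash) echo "${HOME}/.bash_profile" ;;
        zsh)  echo "${HOME}/.zshrc" ;;
        *)    echo "${HOME}/.profile" ;;
    esac
}

# Example (not run here): append the alias to the file for the current shell.
# echo "alias angsd-wrapper='/path/to/angsd-wrapper/angsd-wrapper'" \
#     >> "$(startup_file_for "${SHELL}")"
```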

Modify Thetas Shiny Graphing file loading behavior

Feature suggestion mentioned at the end of issue #47:

Because I'm working with a non-model organism with ~11,500 scaffolds (as opposed to something nice like 23 chromosomes...) when you upload the {name}_Thetas.graph.me to Shiny it's a super long list that's initially populated due to the unique call within the server.R script (line 109).
Is there a way so that you can start by loading the file and insert the Scaffold of interest, rather than removing the 11,499 scaffolds you don't want, as a default?
My current solution is to create a subsetted graph.me file with just the scaffold(s) I want, and that works fine, but thought it would be a feature to consider.
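
The subsetting workaround described above can be sketched with awk. This assumes the graph.me file has a single header line and that the scaffold name sits in its own whitespace-delimited column; the default column number (2) and the function name subset_scaffold are assumptions, so check them against your own file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the header line plus every row whose scaffold
# column equals the requested name. The column defaults to 2, which is an
# assumption about the graph.me layout; pass a third argument to change it.
subset_scaffold() {
    local infile="$1" scaffold="$2" col="${3:-2}"
    awk -v name="${scaffold}" -v c="${col}" 'NR == 1 || $c == name' "${infile}"
}

# Example (not run here):
# subset_scaffold name_Thetas.graph.me scaffold_42 > scaffold_42.graph.me
```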

pipefail fails in zsh

rerun/rerun#160

STABLE branch

Can I suggest a stable branch that people (e.g. my class) can use that we know works but may still need tweaks in naming of things etc., and that we do those fixes on a dev branch? It seems some of the recent fixes are breaking things, which will cause havoc with users/students. I suggest making sure things work on a new install of AW before committing to master.

Help with FST in Angsd-wrapper

I am trying to do Fst with Angsd-wrapper but I run into a problem.
"WRAPPER: realSFS 2dsfs...
-> Problem opening file: '/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA_Intergenic.saf.idx'
WRAPPER: realSFS fst prep...
-> Problem opening file: '/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA_Intergenic.saf.idx'
WRAPPER: estimating global fst...
-> Assuming idxname:/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA.BB.fst.idx
-> Problem opening file: '/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA.BB.fst.idx'
WRAPPER: estimating windowed fst...
-> Assuming idxname:/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA.BB.fst.idx
-> Problem opening file: '/home/claire/Softwares/angsd-wrapper/Essai_20AA20AB20BB/Results/Essai20AA_20BB_C38/Fst/AA.BB.fst.idx'"

I think it comes from an earlier error in which it tells me "'inbreeding_20BB.indF' should contain 19 values, but has 20". It has 20 because (I checked) the BAM list file also has 20 files.

If I change it to 19 in the indF, I have the error "Mismatch between number of samples and inbreeding coefficients".
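
A quick way to see what is being compared is to count the entries in both files directly. The sketch below counts non-blank lines only, on the assumption that each sample and each coefficient sits on its own line; a stray blank line is one possible source of an off-by-one count, but that is a guess and not a diagnosis of this particular error. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Count lines that contain at least one non-whitespace character.
count_entries() {
    grep -c '[^[:space:]]' "$1"
}

# Hypothetical check: print both counts and succeed only when the sample
# list and the inbreeding coefficients file have the same number of entries.
counts_match() {
    local n_samples n_coeffs
    n_samples="$(count_entries "$1")"
    n_coeffs="$(count_entries "$2")"
    echo "samples=${n_samples} coefficients=${n_coeffs}"
    [[ "${n_samples}" -eq "${n_coeffs}" ]]
}

# Example (not run here); the BAM list name is a placeholder:
# counts_match bam_list.txt inbreeding_20BB.indF || echo "counts differ" >&2
```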

Now, I have edited the Fst.sh to make it ignore the mismatch error and I am giving it 19 inbreeding coefficients (for 20 samples), which does not make sense but just to see. It runs.

Yet, we have another error (I am giving it a fasta ref with only one scaffold, the same indicated by regions).
"Sorting the regions file
Error in FUN(X[[i]], ...) : objet 'ordered.positions' introuvable" (in English: object 'ordered.positions' not found)

I found a dirty alternative: running the 2DSFS with the other git branch works for SFS (ignoring the mismatch between the number of samples and the number of inbreeding coefficients). This one is fine with using the region given. Yet, after calculating SFS, it does not manage to finish the Fst because shared.pos.gz ends up outside the Fst folder.
Coming back to the main Fst git branch, still ignoring the mismatch between the number of samples and the number of inbreeding coefficients, we can now finish the Fst calculation because it reuses the SFS stored in the folder. Yet, I don't think this is a very nice way to do it, and I wonder whether it uses 20 individuals or 19.

Could you help me?? What am I doing wrong?
Thanks a lot
Claire

Fst graphing unused argument (col = "red")

When clicking Fst Lowess I get an error that says unused argument (col = "red")

Problems with the tutorial

Hi,
Here's the list of problems I encountered with the tutorial (though I never even got that far, as you will see):

Minor point, but Zea mays ssp. mexicana is listed as Zea mexicana--could be wrong but pretty sure this is a ssp.
I believe git comes installed on OSX, does it not? If so, there's no need to install it in the tutorial.
brew install samtools comes up empty, but it suggests an alternative. This should be fixed so no one has issues installing. Maybe link to the samtools setup if users have problems using homebrew here?
May just be a personal preference, but the command setup please is a bit off-putting/patronizing? Seems out-of-convention to use something other than setup, anyway.
Command name aside, the setup failed when I tried it on a lab computer. Nothing was set up. I can try it again and give you the output, but you should try your setup command on a few computers that you haven't used for angsd-wrapper yet and see if you can get it to work.

change span of lowess

Modify shiny code so users can change the span (f) of the lowess fit.

upgrade dependencies fails

I run "angsd-wrapper upgrade dependencies" and get:

Your branch is behind 'origin/master' by 2 commits, and can be fast-forwarded.
(use "git pull" to update your local branch)
Updating abeabb7..356cae7
Fast-forward
README.md | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
HEAD is now at abeabb7 Update README.md
make: Nothing to be done for `all'.
Your branch is up-to-date with 'origin/master'.
Already up-to-date.
fatal: Could not parse object 'c39b6ad35c8512d29f09dc4ffd7b6c30afcebd16'.

samtools dependency

This dependency setup is a bit confusing. I understand not automatically downloading this, but it seems like the wiki is avoiding telling people how to get it. brew samtools comes up with multiple options, and it's unclear if any of these are the most up-to-date version. The wiki mentions needing to install HTSlib as well, but the samtools download includes this. The instructions for downloading this are a bit confusing and could probably be more direct. Installing this locally myself was a bit confusing because the samtools site has been updated since the wiki was last updated and it redirects a lot. Perhaps add a prompt that asks the user if they want it installed and automate it? Or just lay it out clearly in the wiki.

ncores

N_CORES=32 seems a bad default. or at least we need to make users aware of this somewhere.
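
One possible alternative to a fixed value is to default to what the machine reports. A minimal sketch, assuming getconf is available (it is on Linux and macOS); default_cores is a hypothetical name, and on a shared cluster the scheduler's allocation may be smaller than the detected count, so an explicit N_CORES still takes priority.

```shell
#!/usr/bin/env bash
# Hypothetical default: use N_CORES if the caller set it, otherwise the
# number of online processors reported by getconf, otherwise 1.
default_cores() {
    local detected
    detected="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" || detected=""
    # Fall back to a single core if detection gave anything but a number.
    [[ "${detected}" =~ ^[0-9]+$ ]] || detected=1
    echo "${N_CORES:-${detected}}"
}

# Example (not run here), in a config file:
# N_CORES="$(default_cores)"
```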

correct the path in the tutorial

The path "SAMPLE_INBREEDING=${HOME}/software/angsd-wrapper/iplant/Maize_Inbreeding.indF" in the tutorial (https://github.com/mojaveazure/angsd-wrapper/wiki/Tutorial) should be "${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF"

An error when running "./angsd-wrapper setup data"

When I ran "./angsd-wrapper setup data", I got this error message: "cat: Teosinte_Samples.txt: No such file or directory". I checked and found that "Teosinte_Samples.txt" is actually in "/Example_Data/Teosinte".

2DSFS issue

Error message when running 2DSFS:

/panfs/roc/groups/9/morrellp/liux1299/ANGSD-wrapper_beta_testing/angsd-wrapper/Wrappers/2D_SFS_Fst.sh: line 187:  7536 Killed

Line 187 of 2D_SFS_Fst.sh:

    187 fi

Example_Data/Sequences/Tripsacum_TDD39103.fai looks older than .fa

Following instructions from this tutorial:

https://github.com/mojaveazure/angsd-wrapper/wiki/Tutorial

I received the error below. I fixed the problem with samtools faidx Tripsacum_TDD39103.fa but thought it was strange that Example Data provided needed to be fixed first thing.


ljcohen@farm:~/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/Configuration_Files$ angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

        -> Reading fasta: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
        -> fai index file: '/home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa.fai' looks older than corresponding fastafile: '/home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa'.
        -> Please reindex fasta file

When it did run, warnings appeared, but the script finished. Will this affect the output?


        -> Region lookup 1/1
Warning: The index file is older than the data file: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/BKN022_ZEAHRFRAKDIAAPE.bam.bai
Warning: The index file is older than the data file: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/BKN011_ZEAHRFRACDIAAPE.bam.bai
Warning: The index file is older than the data file: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/BKN014_ZEAHRFRADDIAAPE.bam.bai
Warning: The index file is older than the data file: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/BKN025_ZEAHRFRAMDIAAPE.bam.bai
Warning: The index file is older than the data file: /home/ljcohen/ECL243/Assignment3/angsd-wrapper/Example_Data/Maize/BKN026_ZEAHRFRANDIAAPE.bam.bai
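
Both the .fai message and the .bam.bai warnings are worded as timestamp comparisons ("older than"). A small sketch of a check that setup could run on the example data, using bash's -nt file test; index_is_stale is a hypothetical name, and the samtools calls in the comments assume samtools is on the PATH.

```shell
#!/usr/bin/env bash
# Succeeds when the index is missing or has an older modification time
# than the file it describes. -nt is the bash "newer than" file test.
index_is_stale() {
    local data="$1" index="$2"
    [[ ! -e "${index}" || "${data}" -nt "${index}" ]]
}

# Examples (not run here), assuming samtools is installed:
# index_is_stale ref.fa ref.fa.fai && samtools faidx ref.fa
# index_is_stale sample.bam sample.bam.bai && samtools index sample.bam
```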

Shiny - admix plot issue

Admix Plot

Currently, Shiny asks users to input Number of K ancestral populations to graph as shown:

The issue is that entering any number between 1 and 10 produces no visible change. This may or may not be an issue, but is something to note.

I will try it with different datasets to see if I get the same results.

saf file not interchangeable between wrappers

at least to my understanding, the saf file created in one wrapper will end up in that wrapper's out directory, so if you wanted to run a wrapper on the same population in the future, the overwrite=false flag doesn't really do anything because it's looking in a dir that does not have any files yet. So it's not clear to me if there's a simple way right now to save time by utilizing .saf files across wrappers

setup fail

Just downloaded recent (commit eb3d0b9) AW and ran setup. I get:

gcc -c -Wall -O3 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1 -I. knetfile.c -o knetfile.o
gcc -Wall -O3 -o bgzip bgzf.o bgzip.o knetfile.o -lz
g++ -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -c parse_args.cpp
g++ -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -c read_data.cpp
read_data.cpp:1:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
1 error generated.
make: *** [read_data.o] Error 1
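
The compiler is looking for the GSL header gsl/gsl_rng.h. A possible pre-flight check that setup could run before calling make is sketched below; find_gsl_header is a hypothetical name, and the directories in the example are assumptions about common install locations, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: look for gsl/gsl_rng.h under each include
# directory passed as an argument and print the first match.
find_gsl_header() {
    local dir
    for dir in "$@"; do
        if [[ -f "${dir}/gsl/gsl_rng.h" ]]; then
            echo "${dir}/gsl/gsl_rng.h"
            return 0
        fi
    done
    return 1
}

# Example (not run here); the directory list is an assumption:
# find_gsl_header /usr/include /usr/local/include /opt/homebrew/include \
#     || echo "GSL headers not found; install GSL before running setup" >&2
```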

test.sh?

Admittedly I didn't check, but is test.sh supposed to be in the download? It looks like an artifact to me

line numbers in common_config

I think they're off in the tutorial/wiki.

shiny file names

The tutorial still lists e.g. pestPG files rather than the graph.me files. Need to check the shiny tutorial lists the correct file names.

Example data in setup

Not all users may want the example data when they run setup on angsd-wrapper so it may be a good idea to have a separate command to install and configure it.

.txt vs .indF

Teosinte_Inbreeding.txt vs. Maize_Inbreeding.indF is inconsistent

export option feature request for Shiny app

Hi everyone,
The Shiny apps are amazing and incredibly helpful to sift through large amounts of information interactively. I can't stress how happy and excited I am to have a chance to use them.
Very simple request that perhaps doesn't require much of a change - is there a way to export any of the frames in view by using something other than a "right click + save as" option within the GUI? What I am specifically hoping for is a feature where I can export any of the plots being viewed in the GUI as a .pdf file, rather than the current (and only available) .png option.
Thanks again for so many wonderful tools,
Devon

install using 'angsd-wrapper shiny graphing'

angsd-=_shiny_graphing_install_error.txt
Hi everyone,

Loving all the detailed info provided with angsd-wrapper's Tutorial pages. However, I'm currently struggling with getting angsd-wrapper shiny graphing to work properly. For what it's worth, I've had no trouble running any of the commands (Genotype_likelihoods, SFS, etc.) using a cluster - my only issue at the moment is taking these data and getting Shiny to work with them.

The problem is that I can't get past the execution of the angsd-wrapper shiny graphing script. Here's a breakdown of what I've tried:

Installed angsd-wrapper to my laptop home directory, following Tutorial instructions.

I can run angsd-wrapper successfully. The Usage menu appears upon entering "angsd-wrapper".

Ran angsd-wrapper setup data to install all data on my laptop.

All data appears installed and accessible. Looks just like what I see installed on the cluster.

Ran angsd-wrapper shiny graphing.

The output from that installation is attached as a .txt file to this message. The complication appears to be focused on a handful of R packages (that is, many packages appear to install without issue, but due to some packages failing to install I never see a new window pop up allowing me to load the files of interest into the Shiny app).

I tried running the same command from the cluster just to see what would happen and I don't get the same error message - things all appear to load properly without issue. Instead, it seems to stall, as I would imagine, when it's trying to open up a new Shiny window and can't (Listening on http://127.0.0.1:4915)... because it's a cluster and not my local machine?

This makes me think that I've done something fundamentally wrong with the install, or need to find a way to sync my existing R library on my laptop with the places that this Shiny app wants to put newly installed packages.

Thanks for any advice you can offer,

Devon
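
For the cluster case, where Shiny reports that it is listening on 127.0.0.1 of the remote machine, one common approach is SSH local port forwarding. The sketch below only prints the command; tunnel_cmd is a hypothetical name, the login string is a placeholder, and it assumes the cluster allows port forwarding and that Shiny is running on the host you log in to.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ssh command that forwards a local port to
# the same port on the remote host's loopback interface.
# -N opens the tunnel without running a remote command.
tunnel_cmd() {
    local port="$1" login="$2"
    echo "ssh -N -L ${port}:127.0.0.1:${port} ${login}"
}

# Example: tunnel_cmd 4915 user@cluster.example.org
# With the tunnel open, browse to http://127.0.0.1:4915 on the laptop.
```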

Example data on figshare

I uploaded the example data to figshare so we don't need to worry about iPlant issues: https://dx.doi.org/10.6084/m9.figshare.2063442

Fst does not return actual base pair coordinates

We need the x axis of the Fst graph to correspond to base pair positions instead of the index.

Rplots folder documentation

Are the scripts inside documented anywhere? I couldn't find it in the wiki. My understanding is that these scripts output a PDF plot of admix, thetas, and SFS in case you don't want to use the shiny graphing app?

Wiki link to thetas

In the table of contents on the wiki it links to the thetas page on my repo instead of your repo. E.g.

https://github.com/arundurvasula/angsd-wrapper/wiki/Thetas

instead of

https://github.com/mojaveazure/angsd-wrapper/wiki/Thetas

Don't have access to fix but should be simple.

got this error running fst...

WRAPPER: creating files for Shiny graphing...
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 1411892658 rows; more than 434377244 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq
Execution halted

Add a parameter to define if the samples are haploid or diploid?

Is it possible to add a parameter to define whether the samples are haploid or diploid?

xlim error when selecting "Toggle gff annotations"

Hi again folks,

Problem 1

I noticed a minor tweak needed to load the example data in the shinyGraphing directory in the server.R script on line 36:

thetas <- fread(input = path, sep = " ", header = TRUE)

While this command works for outputs labelled graph.me, it fails to load the example provided in the tutorial package (BKN_Diversity.thetas.gz.pestPG), and it will not work for any angsd-wrapper {out}.pestPG output from one's own data. However, a short fix, changing the delimiter to a tab, does the trick:

thetas <- fread(input = path, sep = "\t", header = TRUE)
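
Before editing server.R, it can help to confirm which delimiter a given file actually uses. A small shell sketch that inspects only the header line; header_delimiter is a hypothetical name, and it assumes the file is uncompressed text with the header on line 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the first line of a table contains
# a tab character. Prints "tab" if it does and "space" otherwise.
header_delimiter() {
    if head -n 1 "$1" | grep -q $'\t'; then
        echo "tab"
    else
        echo "space"
    fi
}

# Example (not run here):
# header_delimiter name_Thetas.graph.me
```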

...

Problem 2

Once the files were able to be uploaded, I could use all the various interactive features, clicking on different tests without issue, and zooming in on certain regions without any problem (what a cool tool!). While I can also upload a .gff file without registering an error, the second I click the box "Toggle GFF annotations" I get an error message as follows:

Warning: Error in plot.window: invalid 'xlim' value
Stack trace (innermost first):
    106: plot.window
    105: localWindow
    104: plot.default
    103: plot
    102: plot
    101: renderPlot [/Users/devonorourke/Documents/Lab.Foster/bat_genomics/angsd_analyses/my_app/server.R#250]
    91: <reactive:plotObj>
    80: plotObj
    79: origRenderFunc
    78: output$thetaPlot2
    1: runApp

Once the gff file is read, I can see the interval lines appear in the top panel of the interactive screen (blue hash marks along the x-axis), however the second panel which would normally generate the zoomed regions is no longer displayed. A message on the webpage displays "Error: invalid 'xlim' value".

I receive this error using the provided example Zea_mays.AGPv3.23.chromosome.10.gff3.gz file, and the same error when I select my own GFF file.

Thanks for any advice you can offer,

Cheers

Why can I only get single value of theta and Tajima's D when I tested the Example_Data?

Is that because "Maize_Regions.txt" contains only single region?

args

should add a line with the arguments or the input that writes to stderr or stdout. would make checking which jobs had problems easier
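
A minimal sketch of what such a line could look like, assuming each wrapper calls it once at startup. The function name log_invocation and the message layout are hypothetical; it writes to stderr so the record lands in the job's error log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a timestamped record of the wrapper name and
# its arguments to stderr. %q quotes each argument so spaces stay visible.
log_invocation() {
    local wrapper="$1"
    shift
    printf '[%s] %s args:' "$(date '+%Y-%m-%d %H:%M:%S')" "${wrapper}" >&2
    printf ' %q' "$@" >&2
    printf '\n' >&2
}

# Example (not run here), near the top of a wrapper script:
# log_invocation "$(basename "$0")" "$@"
```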

Incorporating Arun's Shiny Graphing with Paul's angsd-wrapper scripts

After using Arun's shiny graphing script with Paul's angsd-wrapper scripts and generating a thetas graph, shiny graphing plots each point for each contig on separate graphs instead of all on the same graph.

Calculate SFS without reference genome

Dear

I am relatively new to ANGSD and I found ANGSD-wrapper a really easy
way to start working with ANGSD methods. I followed the tutorial to
familiarize myself with the pipeline but I didn’t see anything about how to
run the pipeline when you don’t have an ancestral reference. Is there
a way to use ANGSD-wrapper in these cases?

Thank you!

Error in installing ngsF on Mac OS X

read_data.cpp:1:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
         ^
1 error generated.
make: *** [read_data.o] Error 1

two pieces: changing delimiter for fread, and an x-lim error

angsd-wrapper ngsF.sh error

This is the error message when I ran the ngsF wrapper:

line 68: [[: ngsF.sh: syntax error: invalid arithmetic operator (error token is ".sh")
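
Bash prints this kind of message when a non-numeric string such as ngsF.sh reaches an arithmetic operator (for example -eq or -gt) inside [[ ]]. Whether that is what line 68 does is an assumption, since the line itself is not shown here. The sketch below shows one way to guard such a comparison; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds only for an optionally signed whole number, so a value can be
# validated before it is used in an arithmetic test such as -eq or -gt.
is_integer() {
    [[ "$1" =~ ^-?[0-9]+$ ]]
}

# Hypothetical guard: compare numerically only when both values are numeric;
# anything else is treated as "not equal" instead of raising a syntax error.
safe_equals() {
    is_integer "$1" && is_integer "$2" && [[ "$1" -eq "$2" ]]
}
```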

Fst is not up-to-date with the version of ANGSD we're using

Turns out Fst calculation was completely overhauled and the version in AW does not work with the wrapper in its current form. I'm going to rewrite it to do the 2 population fst estimation I need it to do since I'm already running behind because of this issue, but there's also a 3 population estimation now that we should include.

"N_CHROM" not used

In the script "angsd-wrapper/Wrappers/Site_Frequency_Spectrum.sh", "N_CHROM" is declared but not used. I am not sure whether a use of the "N_CHROM" parameter is missing.

another error in the tutorial

"REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.22.dna.chromosome.fa" should be "REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa"

config sourcing order

looks like the common config file is sourced after the method-specific one--this seems backwards to me but I think it's better than before bc it can save some file-making time if doing analyses on different pops. However, this should be made very explicit somewhere prominent, bc I almost changed file names in the method-specific config file that would have been overridden by the common config file.
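
The override behaviour described above follows from how source works in the shell: when two files assign the same variable, the one sourced last wins. A small self-contained demonstration with throwaway files; the file names, the PROJECT values and the function name are hypothetical and are not the wrapper's actual config files.

```shell
#!/usr/bin/env bash
# Two throwaway config files that both define PROJECT.
demo_dir="$(mktemp -d)"
echo 'PROJECT=from_method_config' > "${demo_dir}/method.conf"
echo 'PROJECT=from_common_config' > "${demo_dir}/common.conf"

# Source two files in the given order and print the value that survives.
# Runs in a subshell so the demo does not leak variables into the caller.
effective_project() {
    ( source "$1"; source "$2"; echo "${PROJECT}" )
}
```

Calling effective_project with the method-specific file first and the common file second prints the value from the common file, which is the situation the issue warns about; swapping the arguments reverses the result.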