biojulia / popgen.jl

Population Genetics in Julia

Home Page: https://biojulia.github.io/PopGen.jl/

License: MIT License

Julia 100.00%

genetics genomics snps microsatellites fst structure popgen

popgen.jl's Introduction

Population Genetics in Julia.

How to install:

In the Julia REPL, press ] on an empty line to enter the package manager, then run add PopGen

Cite As

Pavel V. Dimens, & Jason Selwyn. (2022). BioJulia/PopGen.jl: v0.8.0 (v0.8.0). Zenodo. https://doi.org/10.5281/zenodo.6450254

Authors

Pavel Dimens

Jason Selwyn

popgen.jl's Issues

Request: Explain populations!(::Vector) better

It's quite unclear what it actually does. Currently, it says:

Vector of new unique population names in the order that they appear in the PopData.meta

All of my students took this to mean that the input should be a vector of the same length as the dataframe, such that the nth entry in the vector becomes the name of the nth sample in the dataframe.

Two suggestions for making it better:

Do not rename missing values implicitly
Be more explicit about precisely what it does
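
One reading of the quoted docstring, sketched in plain Julia without PopGen.jl: each unique existing population name, in order of first appearance, is paired with one new name, and missing values are left untouched as the first suggestion asks. The helper rename_by_unique is hypothetical and only illustrates the semantics; it is not the package's implementation.

```julia
# Hypothetical helper (not part of PopGen.jl): map each unique existing
# population name, in order of first appearance, to a new name.
# Missing entries are passed through unchanged.
function rename_by_unique(pops::AbstractVector, newnames::AbstractVector{<:AbstractString})
    old = unique(skipmissing(pops))          # first-appearance order is preserved
    length(old) == length(newnames) ||
        throw(ArgumentError("expected $(length(old)) new names, got $(length(newnames))"))
    lookup = Dict(zip(old, newnames))
    return [ismissing(p) ? missing : lookup[p] for p in pops]
end

# Six samples, three unique populations, one sample with no population.
pops = Union{Missing, String}["north", "north", missing, "south", "north", "east"]
renamed = rename_by_unique(pops, ["A", "B", "C"])
```

Under this reading the input vector has one entry per unique population (three here), not one entry per sample (six here).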

Thanks for otherwise great docs!

Replace rand() and sample() use of MT

As per discussion on the Julia lang Slack, the Mersenne Twister RNG needs to be replaced with (possibly) Xorshift.
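
A minimal sketch of how a function could accept a caller-supplied RNG instead of relying on a global Mersenne Twister. It uses Xoshiro from the Random standard library (Julia 1.7 or later), which belongs to the same broad family of xor/shift generators but is not Xorshift itself. The helper draw_alleles is hypothetical, not a PopGen.jl function.

```julia
using Random

# Hypothetical helper: draw n alleles with replacement using an explicit RNG.
# Defaulting to Random.default_rng() keeps existing call sites working.
function draw_alleles(alleles::AbstractVector, n::Integer;
                      rng::AbstractRNG = Random.default_rng())
    return rand(rng, alleles, n)
end

# Two generators seeded identically produce identical draws.
a = draw_alleles([1, 2, 3, 4], 10; rng = Xoshiro(42))
b = draw_alleles([1, 2, 3, 4], 10; rng = Xoshiro(42))
```

Threading the RNG through as a keyword argument makes the generator swappable without touching the sampling logic.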

[bug] parallel addition must use atomic values (or Locks)

description
In some of the OnlineStats-like code that's parallelized, do a once-over to make sure there aren't data races, and replace any racy accumulators with Atomics.
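
As an illustration of the two options named in the title, here is a small sketch using only Base.Threads. The function names are hypothetical and the accumulations are toy stand-ins for the package's OnlineStats-like code.

```julia
using Base.Threads

# Count positive entries in parallel. The shared counter is an Atomic,
# so concurrent increments cannot be lost.
function count_positive(xs::AbstractVector{<:Real})
    total = Atomic{Int}(0)
    @threads for i in eachindex(xs)
        xs[i] > 0 && atomic_add!(total, 1)
    end
    return total[]
end

# Sum in parallel with a lock guarding the shared accumulator.
function locked_sum(xs::AbstractVector{<:Real})
    total = Ref(0.0)
    lk = ReentrantLock()
    @threads for i in eachindex(xs)
        lock(lk) do
            total[] += xs[i]
        end
    end
    return total[]
end

npos = count_positive(collect(-500:500))
s = locked_sum(ones(1000))
```

Atomics suit simple counters and sums on primitive types; a lock is the more general tool when the shared state is a larger structure.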

add recode flag to V/BCF importer

Following the design of PGDSpider2 output, the V/BCF importer should have an optional flag recode::Bool = false (name pending) that renames all loci to simple generic names like SNP_1...SNP_n.
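
A rough sketch of what such recoding could look like, independent of the importer. The helper recode_loci and the example locus names are hypothetical; returning the lookup alongside the new names is an added assumption so the original names stay recoverable.

```julia
# Hypothetical helper: rename loci to SNP_1 ... SNP_n in order of first appearance.
# Returns the recoded names and the old => new lookup.
function recode_loci(names::AbstractVector{<:AbstractString})
    lookup = Dict(old => "SNP_$i" for (i, old) in enumerate(unique(names)))
    return [lookup[n] for n in names], lookup
end

original = ["chr1_1045", "chr1_2210", "chr2_87", "chr1_1045"]
recoded, lookup = recode_loci(original)
```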

[feature] add PLINK file import support

Given the ubiquity, it would be sensible to have file IO support for PLINK formatted files.

[bug] PopData.meta.name incorrectly typed from vcf import

description
The type of PopData.meta.name should be Vector{String}, but it is incorrectly interpreted as PooledArray when importing vcf data.

minimal example to reproduce

using GeneticVariation

x = vcf("some_file.vcf");

x.meta.name |> typeof
PooledArrays.PooledVector{String, UInt32, Vector{UInt32}}

expected behavior

x.meta.name |> typeof
Vector{String}

add FST unit tests [feature]

This package needs unit tests for the FST functions.

[feature] consolidate file import info text

Is your feature request related to a problem and which?
Consolidate the printing of file information a bit. The idea is to have fewer lines and be more succinct overall.

Describe the solution/feature you'd like (with examples)

julia> @info "\n path_to_filename.gen\n formatting: delimiter = tab , loci = horizontal\n data: samples = xxx, populations = yy, loci = zzzz"
┌ Info: 
│  path_to_filename.gen
│  formatting: delimiter = tab , loci = horizontal
└  data: samples = xxx, populations = yy, loci = zzzz

[feature] add NaturalSort.jl as dep

Is your feature request related to a problem and which?
Not a problem per se, but it would make sense to sort the loci dataframe using NaturalSort.jl, so this doesn't happen:

julia> x.loci
3488310×4 DataFrame
     Row │ name      population  locus     genotype 
         │ String    String      String    Tuple…?  
─────────┼──────────────────────────────────────────
       1 │ ATL_1988  missing     snp_1     missing  
       2 │ ATL_1988  missing     snp_10    (4, 4)
       3 │ ATL_1988  missing     snp_100   (3, 3)
       4 │ ATL_1988  missing     snp_1000  (3, 3)
       5 │ ATL_1988  missing     snp_1001  (4, 4)
       6 │ ATL_1988  missing     snp_1002  (1, 4)
       7 │ ATL_1988  missing     snp_1003  (4, 4)
       8 │ ATL_1988  missing     snp_1004  (4, 4)
       9 │ ATL_1988  missing     snp_1005  (4, 4)
      10 │ ATL_1988  missing     snp_1006  (3, 3)
      11 │ ATL_1988  missing     snp_1007  (3, 3)

Describe the solution/feature you'd like (with examples)
Add NaturalSort.jl as a dependency and configure the read_xxx functions to use sort(__, [:name, :locus], lt = natural) on the loci dataframe before returning the PopData object. This would also make writing to files consistent with how the SNPs are likely arranged in the source data (and congruent with output from e.g. PGDSpider2).

julia> tst = sort(x.loci, [:name, :locus], lt = natural)
3488310×4 DataFrame
     Row │ name      population  locus     genotype 
         │ String    String      String    Tuple…?  
─────────┼──────────────────────────────────────────
       1 │ ATL_1988  missing     snp_1     missing  
       2 │ ATL_1988  missing     snp_2     missing  
       3 │ ATL_1988  missing     snp_3     missing  
       4 │ ATL_1988  missing     snp_4     (4, 4)
       5 │ ATL_1988  missing     snp_5     (4, 4)
       6 │ ATL_1988  missing     snp_6     (3, 3)
       7 │ ATL_1988  missing     snp_7     (2, 3)
       8 │ ATL_1988  missing     snp_8     (4, 4)
       9 │ ATL_1988  missing     snp_9     (3, 3)
      10 │ ATL_1988  missing     snp_10    (4, 4)
      11 │ ATL_1988  missing     snp_11    (1, 1)
      12 │ ATL_1988  missing     snp_12    (3, 3)
      13 │ ATL_1988  missing     snp_13    (4, 4)
      14 │ ATL_1988  missing     snp_14    (4, 4)
      15 │ ATL_1988  missing     snp_15    (1, 1)
      16 │ ATL_1988  missing     snp_16    (3, 3)
      17 │ ATL_1988  missing     snp_17    (2, 3)
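
To show the ordering difference without any dependency, here is a Base-only sketch. It is a stand-in for the natural comparator from NaturalSort.jl proposed above, not a reimplementation of it: the hypothetical locus_key only handles names that end in a run of digits.

```julia
# Hypothetical sort key: split a name into (text prefix, trailing integer).
# Names without trailing digits sort by their full text, ahead of numbered ones.
function locus_key(name::AbstractString)
    m = match(r"^(.*?)(\d+)$", name)
    m === nothing && return (String(name), -1)
    return (String(m.captures[1]), parse(Int, m.captures[2]))
end

loci = ["snp_1", "snp_10", "snp_100", "snp_2", "snp_11"]
lexical = sort(loci)                   # plain string order
numeric = sort(loci, by = locus_key)   # prefix first, then numeric suffix
```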

permutations for fst shuffle indices and return views

Rather than have _permute_FST take a matrix and 2 sizes, shuffle the row indices of the vcat'd merged matrix and index it twice (pop1, pop2) with views.

back of envelope example

using Random  # provides shuffle

new_idx = shuffle(1:size(merged, 1))  # permuted row indices of the merged matrix
pop1 = @views merged[new_idx[1:npop1], :]
pop2 = @views merged[new_idx[npop1+1:end], :]
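
The same idea as a self-contained sketch with toy data. The function name permuted_views and the example matrices are hypothetical; the real _permute_FST signature is not shown in this issue.

```julia
using Random

# Shuffle the row indices of the merged matrix once, then return two
# non-copying views: the first npop1 shuffled rows and the remainder.
function permuted_views(merged::AbstractMatrix, npop1::Integer;
                        rng::AbstractRNG = Random.default_rng())
    new_idx = shuffle(rng, 1:size(merged, 1))
    pop1 = view(merged, new_idx[1:npop1], :)
    pop2 = view(merged, new_idx[npop1+1:end], :)
    return pop1, pop2
end

# Toy data: population 1 has 3 rows, population 2 has 2 rows.
merged = vcat([1 10; 2 20; 3 30], [4 40; 5 50])
pop1, pop2 = permuted_views(merged, 3)
```

Because both results are views, no rows of merged are copied per permutation; only the index vector is allocated.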

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

[feature] Split file IO into separate package

Is your feature request related to a problem and which?
Not a problem, but it would be easier to treat file IO as a separate package that is required by and re-exported by PopGen.jl

Benefits

Simple maintenance because it will only require basic PopData functions
PopGen.jl codebase will be smaller
IO development can be independent from other package components
Contributions will be independent from main PopGen.jl codebase because it will only feature IO-specific things
cool new logo
precompile read functions with test data

are there alternatives?
Keep the package monolithic as it is now

additional info

[feature] remove `release` branch

As I'm learning more about GitHub, CI, and the Julia TagBot and Registrator, I'm learning that the release branch is redundant and makes the entire workflow cumbersome. Will be deleted with 0.7.0 release, which will address #82

[feature] Compatibility with DataFrames v1

Hi there,
Great to see this project! Thanks for implementing this!

Is your feature request related to a problem and which?
At the moment, PopGen does not work with (fast and snazzy) DataFrames v1:

(popgen) pkg> add PopGen DataFrames@1
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package DataFrames [a93c6f00]:
 DataFrames [a93c6f00] log:
 ├─possible versions are: 0.11.7-1.1.0 or uninstalled
 ├─restricted to versions 1 by an explicit requirement, leaving only versions 1.0.0-1.1.0
 └─restricted by compatibility requirements with PopGen [af524d12] to versions: 0.11.7-0.22.7 — no versions left
   └─PopGen [af524d12] log:
     ├─possible versions are: 0.0.3-0.6.3 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.0.3-0.6.3

Describe the solution/feature you'd like (with examples)
Would it be possible (within reasonable effort) to make them work together?

Many thanks!
Hannes

inconsistent VCF importing

Testing with some data shows that the VCF importer is not working 100% correctly: some individuals are imported with 100% missing genotypes. This needs to be investigated to make it at least consistent with the data produced from VCF => Genepop conversion using PGDSpider2.

[feature] standardize function names

Is your feature request related to a problem and which?

internal functions that will never be used by users should start with _
user-facing functions should not have underscores separating words.

e.g. missing_data() => missingdata()

consolidate file io APIs to use multiple dispatch

Rather than having genepop and popdata2genepop, consolidate each io function to have an input and output method, i.e.:

# file reading
function genepop(infile::String; kwargs...)
    # ...
end

# file writing
function genepop(data::PopData; kwargs...)
    # ...
end
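
A runnable toy version of this pattern, using a hypothetical ToyData type and toyio function in place of PopData and genepop. The one-sample-per-line file format is invented for the example.

```julia
# Hypothetical stand-in for PopData.
struct ToyData
    samples::Vector{String}
end

# File reading: this method is selected when the argument is a String path.
toyio(infile::String) = ToyData(readlines(infile))

# File writing: this method is selected when the argument is a ToyData object.
function toyio(data::ToyData; filename::String)
    open(filename, "w") do io
        foreach(s -> println(io, s), data.samples)
    end
    return filename
end

# Round trip through a temporary file.
path = tempname()
toyio(ToyData(["ind1", "ind2"]); filename = path)
roundtrip = toyio(path)
```

One function name then covers both directions, and the argument type decides which method runs.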

[bug] isbiallelic(::PopData) returns incorrect answer

description
The function isbiallelic(::PopData) returns false even if all isbiallelic(::GenoArray) for the PopData are true

minimal example to reproduce

x = vcf("some_file.vcf", rename_loci = true)
PopData Object
  Markers: SNP
  Ploidy: 2
  Samples: 441
  Loci: 7910
  Populations: 1
  Coordinates: absent

julia> isbiallelic(x)
false

julia> tmp = DataFrames.combine(
    groupby(x.loci, :locus),
    :genotype => isbiallelic => :bial
) ;

julia> all(tmp.bial)
true

expected behavior

julia> isbiallelic(x)
true

julia> tmp = DataFrames.combine(
    groupby(x.loci, :locus),
    :genotype => isbiallelic => :bial
) ;

julia> all(tmp.bial)
true
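
The expected relationship can be sketched without PopGen.jl: the dataset-level answer should be true exactly when every locus-level answer is true. The definition used here (a locus is biallelic when exactly two distinct alleles occur among its non-missing genotypes) is an assumption, and both function names are hypothetical.

```julia
const Genotype = Union{Missing, NTuple{2, Int}}

# Assumed definition: exactly two distinct alleles among non-missing genotypes.
locus_isbiallelic(genos) =
    length(unique(Iterators.flatten(skipmissing(genos)))) == 2

# Dataset-level check: true only if every locus passes the per-locus check.
dataset_isbiallelic(loci::AbstractDict) = all(locus_isbiallelic, values(loci))

biallelic = Dict(
    "snp_1" => Genotype[(1, 4), (4, 4), missing],
    "snp_2" => Genotype[(3, 3), (2, 3)],
)
# Adding one locus with three alleles should flip the dataset-level answer.
mixed = merge(biallelic, Dict("snp_3" => Genotype[(1, 2), (3, 3)]))
```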

[bug] export keep and keep!

PopGen.jl/src/Manipulate.jl

Line 1 in 9d72d25

    
           export add_meta!, locations, locations!, loci, genotypes, get_genotypes, get_genotype, populations, population, populations!, population!, exclude, remove, omit, exclude!, remove!, omit!, samples

keep and keep! need to be exported

[feature] locus-by-locus pairwise FST

Is your feature request related to a problem and which?
pairwise FST only returns an average across loci, but not the values for each locus

Describe the solution/feature you'd like (with examples)

pairwise_fst(::PopData; method::String, by::String = "global", iterations::Int)

where by is either "locus" or "global" (the default).

[feature] replace ProgressMeter with Term.jl ProgressBar

Replace the existing dependency on ProgressMeter.jl with the clean/lean/beautiful ones provided by Term.jl.

[feature] Merge all PopGen_.jl packages under PopGen.jl monorepo

The goal is to make PopGen.jl a monorepo like Makie

Benefits:

One repository, obviously. Currently, there is PopGen, PopGenCore and PopGenSims, the last of which lives as a repo under my personal account.
It might make CI easier, since everything depends on PopGenSims, and upstream <-> downstream testing would be super helpful.

[feature] PCA and DAPC

Is your feature request related to a problem and which?
n/a

Describe the solution/feature you'd like (with examples)

PCA
DAPC a la adegenet
rLDA (regularized LDA)
cross validation on DAPC

docs pages not loading

So far VCF and data_exclusion aren't loading
@pdimens

[feature] speed up fst permutations

Is your feature request related to a problem and which?
Not a problem, but it might be cheaper to shuffle all the indices, partition them into two vectors of sizes np1 and np2, and return the indices. The indices would then be used in the main loop of the FST calculation to index the matrices.
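
A small sketch of the index-only variant described here. The function name permuted_partition is hypothetical, and how the returned indices are consumed by the FST loop is left out because this issue does not show it.

```julia
using Random

# Shuffle 1:(np1 + np2) once and split it into two index vectors.
# No genotype data is touched; the caller indexes the matrices itself.
function permuted_partition(np1::Integer, np2::Integer;
                            rng::AbstractRNG = Random.default_rng())
    idx = shuffle(rng, 1:(np1 + np2))
    return idx[1:np1], idx[np1+1:end]
end

i1, i2 = permuted_partition(3, 2)
```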

Describe the solution/feature you'd like (with examples)

are there alternatives?
keep it as it is
