Population Genetics in Julia.
Invoke the package manager by pressing ] on an empty line, then add PopGen.
Pavel V. Dimens, & Jason Selwyn. (2022). BioJulia/PopGen.jl: v0.8.0 (v0.8.0). Zenodo. https://doi.org/10.5281/zenodo.6450254
Home Page: https://biojulia.github.io/PopGen.jl/
License: MIT License
It's quite unclear what it actually does. Currently, it says "Vector of new unique population names in the order that they appear in the PopData.meta", which all of my students took to mean that the input should be a vector of the same length as the dataframe, such that the nth entry in the vector became the name of the nth sample in the dataframe.
Two suggestions for making it better:
Thanks for otherwise great docs!
As per the discussion on the Julia lang Slack, the Mersenne Twister RNG needs to be replaced with (possibly) Xorshift.
description
In some of the parallelized OnlineStats-like code, do a once-over to make sure there aren't any data races, and replace any racy updates with Atomics.
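As a rough illustration of the kind of change meant here, the sketch below accumulates a shared counter from several threads using an atomic instead of a plain integer. The function name `count_nonmissing` is hypothetical and not part of PopGen.jl; only `Base.Threads` from the standard library is used.

```julia
using Base.Threads

# Hypothetical example (not PopGen.jl code): count non-missing entries in parallel.
# A plain `n += 1` on a shared Int inside @threads would be a data race;
# an Atomic makes each increment indivisible.
function count_nonmissing(xs::AbstractVector)
    n = Atomic{Int}(0)
    @threads for i in eachindex(xs)
        if !ismissing(xs[i])
            atomic_add!(n, 1)
        end
    end
    return n[]  # read the final value out of the atomic
end
```

For heavier accumulations, giving each thread its own partial result and merging them at the end is often cheaper than contending on a single atomic.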
Following the design of PGDSpider2 output, the V/BCF importer should have an optional flag recode::Bool = false (name pending) to rename all loci with simple generic names like SNP_1...SNP_n.
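A minimal sketch of what such renaming could look like, assuming loci are numbered in order of first appearance. The helper name `recode_loci` is hypothetical; the issue itself notes that even the flag name is still pending.

```julia
# Hypothetical helper (not PopGen.jl code): map each distinct locus name to
# SNP_1, SNP_2, ... in order of first appearance, preserving repeats.
function recode_loci(loci::AbstractVector{<:AbstractString})
    lookup = Dict{String, String}()
    for l in loci
        key = String(l)
        # the next number is only stored when `key` has not been seen before
        get!(lookup, key, "SNP_$(length(lookup) + 1)")
    end
    return [lookup[String(l)] for l in loci]
end

recode_loci(["contig12_55", "contig12_90", "contig12_55"])
```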
Given the ubiquity, it would be sensible to have file IO support for PLINK formatted files.
description
The type of PopData.meta.name should be Vector{String}, but it is incorrectly interpreted as PooledArray when importing vcf data.
minimal example to reproduce
using GeneticVariation
x = vcf("some_file.vcf");
x.meta.name |> typeof
PooledArrays.PooledVector{String, UInt32, Vector{UInt32}}
expected behavior
x.meta.name |> typeof
Vector{String}
This package needs unit tests for the FST things.
Is your feature request related to a problem and which?
Consolidate the printing of file information a bit. The idea is to have fewer lines and be more succinct overall.
Describe the solution/feature you'd like (with examples)
julia> @info "\n path_to_filename.gen\n formatting: delimiter = tab , loci = horizontal\n data: samples = xxx, populations = yy, loci = zzzz"
┌ Info:
│ path_to_filename.gen
│ formatting: delimiter = tab , loci = horizontal
└ data: samples = xxx, populations = yy, loci = zzzz
Is your feature request related to a problem and which?
Not a problem per se, but it would make sense to sort the loci dataframe using NaturalSort.jl, so this doesn't happen:
julia> x.loci
3488310×4 DataFrame
Row │ name population locus genotype
│ String String String Tuple…?
─────────┼──────────────────────────────────────────
1 │ ATL_1988 missing snp_1 missing
2 │ ATL_1988 missing snp_10 (4, 4)
3 │ ATL_1988 missing snp_100 (3, 3)
4 │ ATL_1988 missing snp_1000 (3, 3)
5 │ ATL_1988 missing snp_1001 (4, 4)
6 │ ATL_1988 missing snp_1002 (1, 4)
7 │ ATL_1988 missing snp_1003 (4, 4)
8 │ ATL_1988 missing snp_1004 (4, 4)
9 │ ATL_1988 missing snp_1005 (4, 4)
10 │ ATL_1988 missing snp_1006 (3, 3)
11 │ ATL_1988 missing snp_1007 (3, 3)
Describe the solution/feature you'd like (with examples)
Add NaturalSort.jl as a dependency and configure the read_xxx functions to use sort(__, [:name, :locus], lt = natural) on the loci dataframe before returning the PopData object. It would also make writing to files consistent with how the SNPs are likely arranged in the source data (and congruent with output from e.g. PGDSpider2).
julia> tst = sort(x.loci, [:name, :locus], lt = natural)
3488310×4 DataFrame
Row │ name population locus genotype
│ String String String Tuple…?
─────────┼──────────────────────────────────────────
1 │ ATL_1988 missing snp_1 missing
2 │ ATL_1988 missing snp_2 missing
3 │ ATL_1988 missing snp_3 missing
4 │ ATL_1988 missing snp_4 (4, 4)
5 │ ATL_1988 missing snp_5 (4, 4)
6 │ ATL_1988 missing snp_6 (3, 3)
7 │ ATL_1988 missing snp_7 (2, 3)
8 │ ATL_1988 missing snp_8 (4, 4)
9 │ ATL_1988 missing snp_9 (3, 3)
10 │ ATL_1988 missing snp_10 (4, 4)
11 │ ATL_1988 missing snp_11 (1, 1)
12 │ ATL_1988 missing snp_12 (3, 3)
13 │ ATL_1988 missing snp_13 (4, 4)
14 │ ATL_1988 missing snp_14 (4, 4)
15 │ ATL_1988 missing snp_15 (1, 1)
16 │ ATL_1988 missing snp_16 (3, 3)
17 │ ATL_1988 missing snp_17 (2, 3)
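To show why the default ordering puts snp_10 before snp_2, here is a small standard-library sketch of a natural "less than" comparison. It is only an illustration of the idea, not the NaturalSort.jl implementation, and `chunks` and `natural_lt` are hypothetical names.

```julia
# Split a string into alternating runs of digits and non-digits,
# e.g. "snp_10" -> ["snp_", "10"].
chunks(s::AbstractString) = [m.match for m in eachmatch(r"\d+|\D+", s)]

# Hypothetical comparison: digit runs compare numerically, everything else
# compares as text. Assumes digit runs are small enough to parse as Int.
function natural_lt(a::AbstractString, b::AbstractString)
    for (x, y) in zip(chunks(a), chunks(b))
        x == y && continue
        if all(isdigit, x) && all(isdigit, y)
            nx, ny = parse(Int, x), parse(Int, y)
            nx != ny && return nx < ny
        end
        return x < y
    end
    # every shared chunk was equal, so the shorter string sorts first
    return length(a) < length(b)
end

loci = ["snp_10", "snp_1", "snp_100", "snp_2"]
sort(loci)                   # lexicographic: snp_1, snp_10, snp_100, snp_2
sort(loci, lt = natural_lt)  # natural:       snp_1, snp_2, snp_10, snp_100
```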
Rather than have _permute_FST take a matrix and two sizes, shuffle the row indices of the vcat'd merged matrix and index it twice (pop1, pop2) with views:
new_idx = shuffle(1:size(merged, 1))
pop1 = @views merged[new_idx[1:npop1], :]
pop2 = @views merged[new_idx[npop1+1:end], :]
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
Is your feature request related to a problem and which?
Not a problem, but it would be easier to treat file IO as a separate package that is required by and re-exported by PopGen.jl
PopData functions
PopGen.jl codebase will be smaller
PopGen.jl codebase because it will only feature IO-specific things
are there alternatives?
Keep the package monolithic as it is now
As I'm learning more about GitHub, CI, and the Julia TagBot and Registrator, I'm learning that the release branch is redundant and makes the entire workflow cumbersome. It will be deleted with the 0.7.0 release, which will address #82.
Hi there,
Great to see this project! Thanks for implementing this!
Is your feature request related to a problem and which?
At the moment, PopGen does not work with (fast and snazzy) DataFrames v1:
(popgen) pkg> add PopGen DataFrames@1
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package DataFrames [a93c6f00]:
DataFrames [a93c6f00] log:
├─possible versions are: 0.11.7-1.1.0 or uninstalled
├─restricted to versions 1 by an explicit requirement, leaving only versions 1.0.0-1.1.0
└─restricted by compatibility requirements with PopGen [af524d12] to versions: 0.11.7-0.22.7 — no versions left
└─PopGen [af524d12] log:
├─possible versions are: 0.0.3-0.6.3 or uninstalled
└─restricted to versions * by an explicit requirement, leaving only versions 0.0.3-0.6.3
Describe the solution/feature you'd like (with examples)
Would it be possible (within reasonable effort) to make them work together?
Many thanks!
Hannes
Testing with some data, the VCF importer is not working 100% correctly. Some individuals are imported with 100% missing genotypes. This needs to be investigated to make it at least consistent with the data produced from VCF => Genepop conversion using PGDSpider2.
Is your feature request related to a problem and which?
_
missing_data() => missingdata()
Rather than having genepop and popdata2genepop, consolidate each IO function to have an input method and an output method, i.e.:
# file reading
function genepop(infile::String; kwargs...)
...
end
# file writing
function genepop(data::PopData; kwargs...)
...
end
description
The function isbiallelic(::PopData) returns false even if all isbiallelic(::GenoArray) for the PopData are true.
minimal example to reproduce
x = vcf("some_file.vcf", rename_loci = true)
PopData Object
Markers: SNP
Ploidy: 2
Samples: 441
Loci: 7910
Populations: 1
Coordinates: absent
julia> isbiallelic(x)
false
julia> tmp = DataFrames.combine(
groupby(x.loci, :locus),
:genotype => isbiallelic => :bial
) ;
julia> all(tmp.bial)
true
expected behavior
julia> isbiallelic(x)
true
julia> tmp = DataFrames.combine(
groupby(x.loci, :locus),
:genotype => isbiallelic => :bial
) ;
julia> all(tmp.bial)
true
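The expected behavior amounts to "the dataset is biallelic when every locus is". A minimal sketch of that logic on plain Julia containers follows; it does not use PopData or GenoArray, and `Genotype`, `locus_is_biallelic`, and `data_is_biallelic` are hypothetical names. Treating a locus as biallelic when exactly two distinct alleles are observed is an assumption made for this illustration.

```julia
# Diploid genotypes as allele tuples, with `missing` for uncalled samples.
const Genotype = Union{Missing, NTuple{2, Int}}

# A locus counts as biallelic here when exactly two distinct alleles are observed.
function locus_is_biallelic(genotypes::AbstractVector{<:Genotype})
    alleles = Set{Int}()
    for g in skipmissing(genotypes), a in g
        push!(alleles, a)
    end
    return length(alleles) == 2
end

# Dataset-level check: true only if every locus passes the per-locus check.
data_is_biallelic(loci::AbstractDict) = all(locus_is_biallelic, values(loci))

loci = Dict(
    "snp_1" => [(1, 2), missing, (2, 2)],
    "snp_2" => [(3, 3), (3, 4), missing],
)
data_is_biallelic(loci)
```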
Line 1 in 9d72d25
keep and keep! need to be exported
Is your feature request related to a problem and which?
pairwise FST only returns an average across loci, but not the values for each locus
Describe the solution/feature you'd like (with examples)
pairwise_fst(::PopData; method::String, by::String = "locus" | "global" (default), iterations::Int)
Replace the existing dependency on ProgressMeter.jl with the clean/lean/beautiful ones provided by Term.jl.
The goal is to make PopGen.jl a monorepo like Makie
Benefits:
Is your feature request related to a problem and which?
n/a
Describe the solution/feature you'd like (with examples)
So far VCF and data_exclusion aren't loading
@pdimens
Is your feature request related to a problem and which?
Not a problem, but it might be cheaper to just shuffle all the indices, partition them into two vectors of size [np1, p2], and return the indices. The indices will then be used in the main loop of the fst to index the matrices.
Describe the solution/feature you'd like (with examples)
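One possible shape for this, sketched under assumptions: the helper below shuffles the row indices once and returns the two partitions, which the caller then uses to take views of the merged matrix. The name `permuted_partition`, its arguments, and the toy matrix are hypothetical and not taken from the PopGen.jl source.

```julia
using Random

# Hypothetical helper: shuffle 1:(n1 + n2) and split it into two index vectors
# of lengths n1 and n2. No genotype data is copied at this stage.
function permuted_partition(rng::AbstractRNG, n1::Integer, n2::Integer)
    idx = shuffle(rng, 1:(n1 + n2))
    return idx[1:n1], idx[(n1 + 1):end]
end

# Toy stand-in for the merged matrix: 10 samples (rows) by 2 loci (columns).
merged = reshape(collect(1.0:20.0), 10, 2)

i1, i2 = permuted_partition(MersenneTwister(1), 4, 6)
pop1 = @view merged[i1, :]  # permuted "population 1", no copy
pop2 = @view merged[i2, :]  # permuted "population 2", no copy
```

Passing the generator as an argument keeps the permutation reproducible and makes it easy to swap in a different RNG later.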
are there alternatives?
keep it as it is