popgen.awk's People

Contributors

janxkoci

popgen.awk's Issues

new tool: eigselect

There is currently no tool for selecting samples from a dataset in eigenstrat format, so it may be useful to write one. Aside from converting to another format and back, the only option is convertf with the poplistname option, which selects populations; a modified ind file can be used to select at the individual rather than the population level.

The new tool needs to subset samples consistently in both the ind and geno files. By default, it could use the ind file to mark which samples to select, or a custom popfile (similar to the derivedsfs tool). Optionally, it could select a given number of random samples.

Possible options:

  • list of samples or populations from the ind file (-l)
  • a custom popfile (-p)
  • a random set of samples (-n, or perhaps -r?)
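A minimal sketch of the core subsetting step, as an awk program wrapped in a shell function so it can be tried directly. It assumes a text-format (not packed) geno file with one row per SNP and one character per sample, in the same order as the rows of the ind file. The function name eigselect_sketch and the plain list file (one sample ID or population label per line, roughly what -l could take) are hypothetical. The snp file is not touched, since selecting samples does not change the set of sites.

```shell
# Hypothetical sketch: keep only the samples (or populations) named in LIST.
# Usage: eigselect_sketch LIST IN.ind IN.geno OUT.ind OUT.geno
eigselect_sketch() {
    awk -v outind="$4" -v outgeno="$5" '
    # 1st file: one sample ID or population label per line
    FILENAME == ARGV[1] { keep[$1] = 1; next }

    # 2nd file (ind): columns are sample ID, sex, population;
    # remember the row number of every wanted sample
    FILENAME == ARGV[2] {
        if (($1 in keep) || ($3 in keep)) {
            col[++n] = FNR
            print > outind
        }
        next
    }

    # 3rd file (geno): character k of each row belongs to sample k
    {
        row = ""
        for (i = 1; i <= n; i++)
            row = row substr($0, col[i], 1)
        print row > outgeno
    }
    ' "$1" "$2" "$3"
}
```

Random selection (-n) could reuse the same geno step by filling the col array from randomly drawn ind rows instead of a list.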

add options for more genotype formats

The current vcfGTcount.gawk script could be extended to report not just the basic GT summaries, but also translated genotypes (TGT in bcftools terminology) or their IUPAC version (IUPACGT in bcftools). For example:

  • -t = count translated genotypes
  • -i = count IUPAC-formatted genotypes
  • -g = count numeric-style genotypes (default)

This would be handled by a function that is called after extracting a genotype, with an if check on the requested format.

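# Translate a numeric genotype (e.g. "0/1") into alleles (e.g. "A/G").
# Only biallelic sites are handled: 0 becomes ref and 1 becomes alt.
function translate(gt, ref, alt, iupac)
{
    gsub(/0/, ref, gt)
    gsub(/1/, alt, gt)
    # optionally collapse the translated genotype into one IUPAC code
    if (iupac == 1)
        gt = iupacdict[gt] # needs a global dict of iupac codes, e.g. iupacdict["A/G"] = "R"
    return gt
}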

A single function can handle both cases, but it may be more efficient to have two functions, so that the if (iupac == 1) check is made once rather than for every genotype.
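One way to make the choice only once is a gawk indirect function call: the formatter name is picked in BEGIN and then called through a variable. The sketch below is not the existing vcfGTcount.gawk code. The wrapper name gtcount_sketch, the three formatter functions and the mode argument (g, t or i) are all hypothetical. It assumes uncompressed VCF on standard input, biallelic SNPs, and GT as the first FORMAT key.

```shell
# Hypothetical sketch; needs gawk for the indirect call (@fmt).
# $1 = mode: g (numeric, default), t (translated), i (IUPAC)
gtcount_sketch() {
    gawk -v mode="${1:-g}" '
    function numeric(gt, ref, alt) { return gt }
    function translated(gt, ref, alt) {
        gsub(/0/, ref, gt)
        gsub(/1/, alt, gt)
        return gt
    }
    function iupac(gt, ref, alt) {
        gt = translated(gt, ref, alt)
        # genotypes without a code (e.g. missing ones) pass through unchanged
        return (gt in iupacdict) ? iupacdict[gt] : gt
    }
    BEGIN {
        FS = "\t"
        # pick the formatter once, not on every genotype
        fmt = (mode == "t") ? "translated" : ((mode == "i") ? "iupac" : "numeric")
        # build iupacdict for unphased (/) and phased (|) genotypes
        n = split("A C G T", base, " ")
        split("A M R W M C S Y R S G K W Y K T", code, " ")
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++) {
                iupacdict[base[i] "/" base[j]] = code[(i - 1) * n + j]
                iupacdict[base[i] "|" base[j]] = code[(i - 1) * n + j]
            }
    }
    /^#/ { next }                      # skip header lines
    {
        for (s = 10; s <= NF; s++) {   # sample columns start at field 10
            split($s, f, ":")
            g = @fmt(f[1], $4, $5)     # $4 = REF, $5 = ALT
            count[g]++
        }
    }
    END { for (g in count) print g, count[g] }
    '
}
```

For example, `zcat file.vcf.gz | gtcount_sketch i` would print one line per IUPAC code with its count, in unspecified order.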

eigenstrat2vcf parse string as prefix

Don't make users type the whole variable assignment syntax. Just ask for a string and put it in the variable yourself. Explain it in a usage function (see the "awksed" example in the gawk manual).

  • add usage function
  • rewrite the prefix assignment code
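A rough sketch of both items, following the ARGV handling used by the "awksed" example. It assumes the script reads PREFIX.ind, PREFIX.snp and PREFIX.geno as ordinary input files, which may not match how eigenstrat2vcf actually opens them. The wrapper name, the usage text and the placeholder main rule are hypothetical.

```shell
# Hypothetical sketch: take a bare PREFIX argument instead of prefix=PREFIX.
eigenstrat2vcf_args_sketch() {
    awk '
    function usage() {
        print "usage: eigenstrat2vcf PREFIX" > "/dev/stderr"
        print "  reads PREFIX.ind, PREFIX.snp and PREFIX.geno" > "/dev/stderr"
        exit 1
    }
    BEGIN {
        if (ARGC != 2 || ARGV[1] == "")
            usage()
        prefix = ARGV[1]
        # replace the bare prefix with real file names,
        # so that awk does not try to open the prefix itself
        ARGV[1] = prefix ".ind"
        ARGV[2] = prefix ".snp"
        ARGV[3] = prefix ".geno"
        ARGC = 4
    }
    # placeholder for the real conversion rules
    FNR == 1 { print "reading " FILENAME }
    ' "$@"
}
```

Called as `eigenstrat2vcf_args_sketch mydata`, it processes mydata.ind, mydata.snp and mydata.geno in that order; called with no argument, it prints the usage text to stderr and exits with status 1.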

tweak derived.gawk to make it portable

The script currently uses only one gawk-specific feature: arrays of arrays. I may be able to rewrite it to use regular awk "fake" multidimensional arrays instead.

This would allow using the faster mawk (even though the script takes only a few minutes to process 4.5M sites for 50 samples) or bioawk for compressed data. Not a high priority, though: it's already fast enough and input can be piped from zcat.

Links:

https://stackoverflow.com/questions/3060600/awk-array-iteration-for-multi-dimensional-arrays
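For reference, a small sketch of the portable pattern: a gawk subscript pair like count[pop][state] becomes count[pop, state], and the two indices are recovered with split() on SUBSEP when iterating. The two-column input (population, allele state) and the function name are made up for illustration and are not taken from derived.gawk.

```shell
# Hypothetical sketch: portable replacement for a gawk array of arrays.
count_by_pop_sketch() {
    awk '
    # gawk-only form would be: count[$1][$2]++
    { count[$1, $2]++ }
    END {
        for (key in count) {
            split(key, part, SUBSEP)   # recover the two indices
            print part[1], part[2], count[key]
        }
    }
    '
}
```

Membership tests translate the same way: `($1, $2) in count` replaces a check on the nested array. The output order of a for-in loop is unspecified, so pipe through sort if a stable order matters.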
