Oneliners and Scripts

This repository is an attempt at collecting tips and tricks in bash and Unix that can be of use to most people in the Human Evolution program.

Please add your oneliners to onliners.md.

If you have standalone scripts, please add them to the repository and then add the following information to the Standalone Scripts Readme:

  • The purpose of the script
  • What's the input for the script
  • What's the output
  • A typical command, e.g.:
bash my_script.sh -i my_input.txt -o my_output.txt
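
For scripts that follow the -i/-o convention of the example command, option handling could look roughly like the sketch below. The function names (usage, parse_args) and the variables input and output are placeholders chosen for illustration; nothing here is taken from a script in this repository.

```shell
# Hypothetical skeleton: parse "-i <input> -o <output>" for a standalone script.
usage() {
    echo "Usage: $0 -i input.txt -o output.txt" >&2
}

parse_args() {
    input=""
    output=""
    OPTIND=1                      # reset so the function can be called repeatedly
    while getopts "i:o:" opt; do
        case "$opt" in
            i) input=$OPTARG ;;
            o) output=$OPTARG ;;
            *) usage; return 1 ;; # unknown option or missing argument
        esac
    done
    # Both options are required in this sketch.
    if [ -z "$input" ] || [ -z "$output" ]; then
        usage
        return 1
    fi
}

# Typical use at the top of a script:
#   parse_args "$@" || exit 1
#   echo "reading $input, writing $output"
```

Keeping the parsing in one function makes it easy to copy into a new script and to document the expected input and output in the Readme.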

PSMCfrombam.sh

Script to run PSMC on UPPMAX. It loops over a file of sample names; for each sample it starts from a BAM file, creates a VCF with samtools, and outputs the final MSMC files. You need generate_multihetsep.py, bamCaller.py and utils.py from msmc-tools (https://github.com/stschiff/msmc-tools).
You also need mask files, which are available in the umbrella project (for hg19).
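
The "loop over a file of sample names" pattern can be sketched as follows. This is only an illustration of the loop, not the contents of PSMCfrombam.sh: for_each_sample and process_sample are hypothetical names, and the per-sample step is a placeholder for the actual samtools and msmc-tools commands.

```shell
# Hypothetical helper: run a command once per sample name listed in a file.
# Blank lines and lines starting with '#' are skipped.
for_each_sample() {
    local sample_file=$1
    shift
    local sample
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS= read -r sample || [ -n "$sample" ]; do
        case "$sample" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot swallow the rest of the list.
        "$@" "$sample" < /dev/null
    done < "$sample_file"
}

# Placeholder for the per-sample work (the real pipeline steps are not shown).
process_sample() {
    echo "processing ${1}.bam"
}

# Example call, assuming samples.txt holds one sample name per line:
#   for_each_sample samples.txt process_sample
```

Redirecting stdin for the inner command matters in practice: tools that read from stdin would otherwise consume the remaining sample names and end the loop early.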

explore_admixture_graphs.R

Minimal script for using admixtools2 and its find_graphs() function to systematically search for admixture graphs. Doing this in a systematic way is a major advance over the manual approach of crafting input graphs for qpGraph, which can easily be biased by someone's own ideas or end up with models that fit okay-ish but do not make sense temporally or geographically.

It reads PLINK bed/bim/fam files containing only the populations one wants to use for the reconstruction of admixture graphs. A more detailed description can be found in the admixtools2 documentation.

Some thoughts:

  • A single run of find_graphs() takes only a couple of minutes, so one can run it many times to explore the search space (especially on UPPMAX). The script uses 100 runs per number of migrations (m), but my observation is that the higher m gets, the more local optima exist. For m=4, for example, only ~5% of the runs ended up with the best likelihood. Consequently, it might make sense to run more iterations for higher values of m.
  • In my specific case, I had eight populations (7 plus outgroup). From the literature, this seems to be about the number of populations that other studies have found feasible. I haven't tried other settings (yet). I could imagine that the search space becomes so big at some point that, even if an individual run of find_graphs() is rather fast, it won't be feasible to explore the entire search space exhaustively.
  • I plot graphs with scores close to the current best solution, and the best graph per m is stored as best.Mm. Other graphs can be reproduced since I set the random seed to the iteration number.
  • qpgraph_resample_multi() and compare_fits() are useful functions for comparing how well two different graphs fit the data. A graph with higher m naturally has a higher chance of producing a better likelihood. Using these functions, one can test whether a graph with higher m is actually a significantly better fit than one with lower m; if not, one should pick the more parsimonious option and use the graph with lower m.
  • While writing this, a preprint dropped describing and benchmarking the find_graphs() function and providing some more detailed recommendations on how to use it.
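
To judge whether more iterations are needed for a given m, it can help to count how many runs actually reached the best score. The sketch below assumes the results were exported to a plain text table with the columns "m seed score"; that layout is made up for illustration and is not what explore_admixture_graphs.R writes. It also assumes that a lower score means a better fit. summarise_runs is a hypothetical helper name.

```shell
# Hypothetical helper: summarise a table of search runs.
# Input:  whitespace-separated lines "m seed score" (assumed layout).
# Assumes a LOWER score means a better fit.
# Output: one tab-separated line per m with
#         m, best score, seed of the first run reaching it, runs at best, total runs.
# Optional second argument: tolerance for treating two scores as equal.
summarise_runs() {
    awk -v tol="${2:-1e-6}" '
        {
            m = $1; seed = $2; score = $3 + 0
            n[m]++
            scores[m, n[m]] = score
            if (!(m in best) || score < best[m]) {
                best[m] = score
                best_seed[m] = seed
            }
        }
        END {
            for (m in n) {
                hits = 0
                for (i = 1; i <= n[m]; i++)
                    if (scores[m, i] - best[m] <= tol) hits++
                printf "%s\t%s\t%s\t%d\t%d\n", m, best[m], best_seed[m], hits, n[m]
            }
        }
    ' "$1" | sort -n
}

# Example call, assuming runs.txt uses the layout described above:
#   summarise_runs runs.txt
```

If only a small fraction of the runs for some m reach the best score, that is a hint that the search for this m has many local optima and deserves more iterations. Because the seed equals the iteration number, the reported seed is enough to reproduce the best run.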

Contributors

nnneeennn, cbernhardsson, tgue
