Giter Site home page Giter Site logo

ms_pike_popgen's Introduction

Northern pike population genetics study

A repository to accompany the analysis of the northern pike popgen study.

Citation:

Johnson, HA, Rondeau, EB, Minkley, DR, Leong, JS, Whitehead, J, Despins, CA, et al. (2020). Population genomics of North American northern pike: variation and sex-specific signals from a chromosome-level, long read genome assembly. bioRxiv, 2020.2006.2018.157701. doi:10.1101/2020.06.18.157701

Requirements

  • samtools
  • bedtools
  • bcftools
  • vcftools
  • Eric Normandeau's scripts repository

00. Data preparation

Download the following and put in 02_input_data:

01. Characterize overall variant counts across chromosomes

Prepare and characterize the chromosomes:

# Change directory
cd 02_input_data

# Trim scaffold names to match formats observed in the VCF e.g., LG03    
sed 's/.*LG/>LG/g' GCA_004634155.1_Eluc_v4_genomic.fna | sed 's/,.*//g' > GCA_004634155.1_Eluc_v4_genomic_renamed.fna    

# Unwrap fasta
fasta_unwrap.py GCA_004634155.1_Eluc_v4_genomic_renamed.fna GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap.fna

# Keep chromosomes only
grep -E -A1 '^>LG' GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap.fna | grep -vE '^--$' - > GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap_chr_only.fna  

# Determine lengths of chromosomes
`fasta_lengths.py GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap_chr_only.fna`    

# Index the reference
samtools faidx ./GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap_chr_only.fna 

Determine windows positions per chromosome and extract variants:

# Prepare a windows file      
bedtools makewindows -g ./GCA_004634155.1_Eluc_v4_genomic_renamed_unwrap_chr_only.fna.fai -w 1000000 > windows.bed

# Extract variants per window from each chromosome
bedtools coverage -a windows.bed -b ./Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode.vcf -counts > coverage.txt    

Use the following script interactively in R to analyze the variants per window and generate plots:
01_scripts/plotting_chr_poly_levels.R

02. Identify fixed differences across and within population clusters

Use the following script interactively in R to analyze fixed genotypic differences across and within the major groupings in the dataset:
01_scripts/fixed_diffs_across_NACD.R .

03. Per sample genotypes

Produce a stacked barplot of genotypes per sample, including REF/REF, REF/ALT, ALT/ALT, missing using the following script:
01_scripts/per_sample_genotypes.R

04. Per population statistics

Variant sites with more than two alleles are causing issues, so first, remove these:
bcftools view --max-alleles 2 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode.vcf > 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf
This file should now have 1,117,361 variants remaining.

Use the RScript 01_scripts/per_pop_stats.R interactively to calculate per population average HOBS and FIS, and the number of private alleles per population.

Tajima's D

Using the VCF above with only biallelic variants, determine the names of the samples in the VCF:
bcftools query --list-samples 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf > 02_input_data/samples.txt

Create per population sample names files:

grep '\-CR' 02_input_data/samples.txt > 02_input_data/samples_CHT.txt
grep '\-YR' 02_input_data/samples.txt > 02_input_data/samples_HOO.txt
grep '\-PL' 02_input_data/samples.txt > 02_input_data/samples_PAL.txt
grep '\-Mb' 02_input_data/samples.txt > 02_input_data/samples_WHI.txt
grep '\-S' 02_input_data/samples.txt > 02_input_data/samples_SLA.txt
grep '\-NJ' 02_input_data/samples.txt > 02_input_data/samples_HCK.txt

Create population-specific VCFs:

vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_CHT.txt --recode --recode-INFO-all --out 02_input_data/Eluc_CHT.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_HOO.txt --recode --recode-INFO-all --out 02_input_data/Eluc_HOO.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_PAL.txt --recode --recode-INFO-all --out 02_input_data/Eluc_PAL.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_WHI.txt --recode --recode-INFO-all --out 02_input_data/Eluc_WHI.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_SLA.txt --recode --recode-INFO-all --out 02_input_data/Eluc_SLA.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_HCK.txt --recode --recode-INFO-all --out 02_input_data/Eluc_HCK.vcf
vcftools --vcf 02_input_data/Eluc.variants.GATK.iteration.2.b.Full.SNP.GATK_HF_removed.minQ20.mmdp10.mxmdp60.mmc10.mac1.homsumfilt.new.samp.names.recode_biallelic_only.vcf --keep 02_input_data/samples_greater_NA.txt --recode --recode-INFO-all --out 02_input_data/Eluc_GNA.vcf

Run Tajima's D calculation on the population-specific VCF:

vcftools --vcf 02_input_data/Eluc_CHT.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_CHT
vcftools --vcf 02_input_data/Eluc_HOO.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_HOO
vcftools --vcf 02_input_data/Eluc_PAL.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_PAL
vcftools --vcf 02_input_data/Eluc_WHI.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_WHI
vcftools --vcf 02_input_data/Eluc_SLA.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_SLA
vcftools --vcf 02_input_data/Eluc_HCK.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_HCK
vcftools --vcf 02_input_data/Eluc_GNA.vcf.recode.vcf --TajimaD 10000 --out 03_results/Eluc_GNA

note: output will be, for example, Eluc_CHT.Tajima.D.

Run Tajima's D calculation on the population-specific VCF:

vcftools --vcf 02_input_data/Eluc_CHT.vcf.recode.vcf --site-pi --out 03_results/Eluc_CHT
vcftools --vcf 02_input_data/Eluc_HOO.vcf.recode.vcf --site-pi --out 03_results/Eluc_HOO
vcftools --vcf 02_input_data/Eluc_PAL.vcf.recode.vcf --site-pi --out 03_results/Eluc_PAL
vcftools --vcf 02_input_data/Eluc_WHI.vcf.recode.vcf --site-pi --out 03_results/Eluc_WHI
vcftools --vcf 02_input_data/Eluc_SLA.vcf.recode.vcf --site-pi --out 03_results/Eluc_SLA
vcftools --vcf 02_input_data/Eluc_HCK.vcf.recode.vcf --site-pi --out 03_results/Eluc_HCK
vcftools --vcf 02_input_data/Eluc_GNA.vcf.recode.vcf --site-pi --out 03_results/Eluc_GNA

Use Rscript X.R to find means for each of the Tajima's D files.

ms_pike_popgen's People

Contributors

bensutherland avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.