Giter Site home page Giter Site logo

laduplessis / treponema_pallidum_in_early_modern_europe Goto Github PK

View Code? Open in Web Editor NEW
1.0 3.0 0.0 49.29 MB

Data and scripts to reproduce the molecular clock dating analyses presented in "Ancient bacterial genomes reveal a formerly unknown diversity of Treponema pallidum strains in early modern Europe (Majander et al.)

Home Page: https://doi.org/10.1101/2020.06.09.142547

License: GNU General Public License v3.0

R 47.19% Python 52.81%
molecular-clock ancient-dna-analysis beast2 raxml

treponema_pallidum_in_early_modern_europe's Introduction

Ancient bacterial genomes reveal a high diversity of Treponema pallidum strains in early modern Europe

Kerttu Majander, Saskia Pfrengle, Judith Neukamm, Arthur Kocher, Louis du Plessis, Marta Pla-Díaz, Natasha Arora, Gülfirde Akgül, Kati Salo, Rachel Schats, Sarah Inskip, Markku Oinonen, Heiki Valk, Martin Malve, Aivar Kriiska, Päivi Onkamo, Fernando González-Candelas, Denise Kühnert, Johannes Krause, Verena J. Schuenemann

DOI


This repository contains the data files, configuration files, log files, tree files and scripts necessary to reproduce the molecular clock dating analyses and figures presented in Majander et al. https://doi.org/10.1101/2020.06.09.142547. Some of the scripts may need some adjustment depending on the local setup.

Table of Contents

  1. Abstract
  2. Reports
  3. Dependencies
  4. Data
  5. Workflow

Abstract

Syphilis is a globally re-emerging disease, which has marked European history with a devastating epidemic at the end of the 15th century. Together with non-venereal treponemal diseases, like bejel and yaws, which are found in subtropical and tropical regions, it currently poses a substantial health threat worldwide. The origins and spread of treponemal diseases remain unresolved, including syphilis’ potential introduction into Europe from the Americas. Here, we present the first genetic data from archaeological human remains reflecting a high diversity of Treponema pallidum in early modern Europe. Our study demonstrates that a variety of strains related to both venereal syphilis and yaws-causing T. pallidum subspecies were already present in Northern Europe in the early modern period. We also discovered a previously unknown T. pallidum lineage recovered as a sister group to yaws and bejel-causing lineages. These findings imply a more complex pattern of geographical distribution and etiology of early treponemal epidemics than previously understood.

Reports

Run the RMarkdown notebooks to generate the reports below:

  1. Temporal signal from root-to-tip regression: reports/TemporalSignal.Rmd
  2. BEAST2 Bayesian DRT (Dataset D only): reports/DateShuffling-D.Rmd
  3. BEAST2 molecular clock dating (Dataset D only): reports/DatingAnalysis-D.Rmd (this report also generates figures of more MCC trees that are not displayed in the report).
  4. Manuscript figures: reports/Figures-D.Rmd
  • Reports 2-4 can be easily modified for other datasets.
  • To generate PDF figures simply change the output to "pdf_document" and the device to "pdf".

Dependencies

  • Biopython, numpy
  • snp-sites
  • RAxML
  • BEAST v2.6
  • ggplot2, ggtree, treeio, ggsci, ggpubr, coda, cowplot, phytools, phangorn, ape
  • beastio: Commit #b18caa6

Data

Input data for the molecular clock dating analyses are in data/.

Workflow

The workflow and results are only given for the alignment with genomes mapped to the TPA (Nichols) reference sequence, with recombining and hypervariable genes removed, as well as all sites with >25% missing data. To use a different dataset simply change the input dataset in the workflow (and to avoid confusion all output filenames and directories).

Datasets

Define the following datasets for downstream analyses:

  • Dataset A: Full dataset (33 genomes)
  • Dataset B: Remove genomes without stable placement for finding recombining loci: KM14-7 (Kampen) (32 genomes)
  • Dataset C: Remove genomes problematic for clock-dating due to passaging: KM14-7,NIC2,Nichols (30 genomes)
  • Dataset D: Remove genomes problematic for clock-dating due to unique SNPs: KM14-7,NIC2,Nichols,94A,94B (28 genomes)
  • Dataset E: Remove genomes problematic for clock-dating due to underrepresentation: KKM14-7,NIC2,Nichols,94A,94B,BosniaA (27 genomes)
  • Dataset F: Remove genomes that could also be problematic for clock-dating due to passaging: KKM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg (26 genomes)

Only results for dataset D are provided in the manuscript.

Create alignments

  • Add metadata to alignment files and subsample to create datasets A-F.
  • Use snp-sites to get the constant site breakdown for each dataset.
mkdir results/alignments/

# Dataset A
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/A/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25

# Dataset B
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/B/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7

# Dataset C
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/C/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols

# Dataset D
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/D/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B

# Dataset E
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/E/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA

# Dataset F
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/F/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg

# Extract SNP alignments and constant sites
for DATASET in A B C D E F
do 
	snp-sites -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
	snp-sites -C -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
	tr ',' ' ' < results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv > results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt
done
  • Only use trimmed SNP alignments for subsequent analyses.

Partition files for RAxML

  • Create partition and constant sites files by hand and save in the directory where RAxML will be run, e.g. results/RAxML/D/ for dataset D.

Partition file contents for Dataset D:

[asc~../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt],ASC_DNA, p1=1-1500

Partition files for other datasets only differ in the path to the filename and the number of sites.

Build trees in RAxML

  • Use RAxML to build genetic distance trees (run from the directory where RAxML output should be stored, e.g. results/RAxML/D/).
  • After RAxML finishes midpoint root trees in FigTree and export as displayed with NEXUS block.
# Dataset A
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/A/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset B
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/B/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset C
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/C/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset D
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset E 
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/E/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset F
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/F/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

BEAST2 molecular clock dating

  • Create NEXUS files with SNP sequences, sampling dates and tip date priors.
for DATASET in B C D E F
do 

	# Narrow uniform priors
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/narrow -d 5 -l 3 -u 4

	# Normally distributed priors
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/normal -d 5 -l 3 -u 4 -n

	# Wide uniform priors 
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/wide -d 5 -l 3 -u 4 -L 1000 -U 2016 -w

done
  • The NEXUS files above can be dragged into BEAUti v2.6 to immediately load the alignment and set tip date priors.
  • Create the XML files in BEAUti v2.6 and adjust by hand (only the XML files for dataset D are provided on the repository).
  • Run on remote server using commands below (from the directory where XML files are stored, e.g. results/beast/D/).
  • Build MCC trees.
# On remote server (from directory with XML files)
mkdir output 

ls *.xml | parallel --delay 1 --jobs 30% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % & 		

cd output
mkdir mcctrees

for i in `ls *.trees | cut -f 1,2,3,4 -d "."`; do ~/BEASTv2.6.0/bin/treeannotator -burnin 30 ${i}.trees mcctrees/${i}.MCC.tree; done

Bayesian Date Randomisation Test

  • Create config files in results/beast/shuffleddates/config/ by hand.
  • Run Python scripts to produce XML files (only the XML files for dataset D are provided on the repository).
  • Run on remote server using commands below (from the directory where XML files are stored, e.g. results/beast/shuffleddates/D/).
# On local (from project root)
python scripts/MakeBEASTXML.py -i results/beast/shuffleddates/config/D/

# On remote server (from directory with XML files)
mkdir output

ls *R{0..50}.xml | parallel --delay 1 --jobs 75% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % & 		

treponema_pallidum_in_early_modern_europe's People

Contributors

laduplessis avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

treponema_pallidum_in_early_modern_europe's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.