This project forked from jbloom/sars-cov-2_prjna612766

Analysis of early Wuhan SARS-CoV-2 sequences from deleted SRA BioProject PRJNA612766

Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic

This GitHub repository analyzes SARS-CoV-2 deep sequencing data recovered from the deleted BioProject PRJNA612766. This analysis corresponds to the work described in this pre-print.

Running the analysis

The analysis is almost fully automated by the snakemake pipeline in Snakefile, with its configuration in config.yaml. Note that the pipeline is somewhat convoluted and performs several steps that are only tangentially related to the paper. The study began as an effort to validate the analyses in the joint WHO-China report on COVID-19 origins, and its goal shifted gradually after the deleted data set was discovered. As a result, some vestigial parts of the code and analysis structure remain.

The only manual step is to download existing coronavirus sequences from GISAID. This requires a GISAID account, because GISAID data sharing terms prevent distribution of their sequences. Download both the *.metadata.tsv.xz and *.fasta.xz files for the accessions in data/gisaid_sequences_through_Feb2020/accessions.txt to the subdirectory data/gisaid_sequences_through_Feb2020/, and the same two files for the accessions in data/comparator_genomes_gisaid/accessions.txt to the subdirectory data/comparator_genomes_gisaid/.
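Before starting the pipeline, it can help to confirm that the GISAID downloads landed in the right places. The following is a minimal sketch, not part of the repository: the function name missing_gisaid_downloads is hypothetical, and it assumes only what is stated above, namely that each of the two subdirectories should contain at least one file ending in .metadata.tsv.xz and one ending in .fasta.xz.

```python
from pathlib import Path

# Directories and file suffixes are taken from the instructions above.
GISAID_DIRS = [
    "data/gisaid_sequences_through_Feb2020",
    "data/comparator_genomes_gisaid",
]
REQUIRED_SUFFIXES = [".metadata.tsv.xz", ".fasta.xz"]


def missing_gisaid_downloads(repo_root="."):
    """Return (subdirectory, suffix) pairs for which no matching file exists.

    Hypothetical helper: it only checks that files with the expected
    suffixes are present, not that they cover the listed accessions.
    """
    missing = []
    for subdir in GISAID_DIRS:
        directory = Path(repo_root) / subdir
        for suffix in REQUIRED_SUFFIXES:
            # A missing directory is reported the same way as a missing file.
            if not directory.is_dir() or not any(directory.glob(f"*{suffix}")):
                missing.append((subdir, suffix))
    return missing


if __name__ == "__main__":
    for subdir, suffix in missing_gisaid_downloads():
        print(f"missing: no *{suffix} file in {subdir}/")
```

Run from the top of the repository, the script prints nothing when both file types are present in both subdirectories.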

After downloading these sequences and ensuring you have installed conda, build the main conda environment for the pipeline with:

conda env create -f environment.yml

Then activate the conda environment with:

conda activate SARS-CoV-2_PRJNA612766

You can then run the entire analysis with:

snakemake -j 1 --use-conda

Note that you need the --use-conda flag because one of the rules in Snakefile uses a separate environment, specified in environment_ete3.yml.

The above command will run the snakemake pipeline using just one computing core. If you want to use more cores, adjust the value passed by -j appropriately. If you have access to a computing cluster you can distribute the run across the cluster. For the Fred Hutch computing cluster, that can be done using cluster.yaml by running the pipeline with the commands in run_Hutch_cluster.bash.
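As an illustration of adjusting -j, the sketch below builds the same snakemake command with a core count taken from the local machine. It is an assumption-laden example rather than part of the repository: the names build_snakemake_command and run_pipeline are hypothetical, and the only snakemake options used are the -j and --use-conda flags shown above.

```python
import os
import subprocess


def build_snakemake_command(cores=None):
    """Return the snakemake command as an argument list.

    Hypothetical helper: if cores is None, use the number of CPUs that
    Python reports, falling back to 1 when it cannot be determined.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return ["snakemake", "-j", str(cores), "--use-conda"]


def run_pipeline(cores=None):
    """Run the pipeline from the repository root, raising if snakemake fails."""
    subprocess.run(build_snakemake_command(cores), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(" ".join(build_snakemake_command()))
```

Calling build_snakemake_command(1) reproduces the single-core command given above; run_pipeline should be called only with the conda environment activated.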

Input data, results, etc

The input data needed for the analysis are all available in the ./data/ subdirectory, which contains a README describing the files therein.

The results of running the pipeline are placed in the ./results/ subdirectory. Most of these results are not tracked in this GitHub repo, but some key files are tracked, as described in the Methods of the paper associated with this study.

The code used to process the Excel supplementary table of accessions from project PRJNA612766 to generate the information found in config.yaml is in ./manual_analyses/PRJNA612766/.

Paper

The LaTeX source for the paper and its figures is in the ./paper/ subdirectory.
