Giter Site home page Giter Site logo

flukit's Introduction

Flukit - simple variant caller for influenza

Used interally at WHOFLUCC. Not recommended for external usage.

install

  1. clone this repo
  2. install using pip
cd flukit && python -m pip install .
  1. Install nextclade:
conda install -c bioconda nextclade

usage

Run flukit --help to see detailed instructions

  • input
    • fasta, file with headers ending in .1... .4 which represent the gene. For example: MySample.4 is the HA gene of MySample.
    • lineage, one of h1n1, h3n2, vic
    • the batch name used as prefix for output files
    • the output path

Example:

flukit -s flu.fasta -l h3n2 -b batch150 -o ~/Desktop/
  • output
    • tsv file in the following format:
     seqno - the fasta header
     ha_aa - HA mutations called against vacc_ref
     na - H275Y mutation
     mp - S31N mutation
     pa - I38X mutation
     vacc_ref - called against ancestral strains
    

Output is formatted specifically for internal database.

flukit's People

Contributors

ammaraziz avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Forkers

vikash84 aradahir

flukit's Issues

subcommand find/rename

Subcommand for finding fasta files specific to a batch.

Example usage:

flukit find \
    --input-dir {Path} \
    --input-meta {tsv or csv} \ 
    --batch-num {num} \
    --output-dir {Path} \
    --split-by gene
  • optional split-by can be split by gene where output is ha.fasta na.fasta etc or all returning all files individually split
  • optional input-dir to the directory of fasta files. Encode default paths (see below)
  • optional input-meta from fuzee output containing seqno to retreive

Operation:

  1. Get a list all fasta files from input directory
  2. Parse meta file to extract Seq No
  3. Match lists, subset, get complete path
  4. Concat files together and write out
  5. If split-by is specified, split as appropriate
  6. Create folder in output-dir with named batch-number
  7. Write out fasta files

Default paths to search in:

  • S:/Shared/WHOFLU/mol_biol/00-New Sequences/
  • S:/Shared/WHOFLU/mol_biol/Sequencing-NGS/
  • /mnt/Sdrive/WHOFLU/mol_biol/00-New Sequences/
  • /mnt/Sdrive/WHOFLU/mol_biol/Sequencing-NGS/

Considerations:

  • How should NGS runs be handled? These files are located in a different directory and in a single fasta file. Possible solution is to search in /00-New Sequences/ first and if no matches are returned, search in .../Sequencing-NGS/ for any files matching the batch number then split files into .../00-New Sequences/ but do not overwrite existing files.
  • Use fuzee api to retreive batch metadata related to #4

Add support for access fuzee api

The fuzee 'api' is written in R. Accessing this in python for flukit would be a major boost to productivity.

The urllib is the goto package for sending GET and POST requests. Mimic the fuzee R package.

Access to the worksheets and the gisaid page would be especially handy.

new subcommand - renaming fasta files

Current method for renaming fasta files is very repetitive and cumbersome. The process is:

  1. get meta data
  2. split meta data by subtype, gene, passage
  3. grep for sequences of interest for each group above
  4. rename

This should be automated. New subcommand: rename.

Example:

flukit rename --sequences input.fasta --meta-file meta.tsv --output-dir output/ --prefix prefix

Input:

  • meta.tsv contents: seqno, passage, gene, date, designation.
  • fasta with headers: N1000.4 or N1000.ha

Output:

  • Renamed fasta file: {designation}{passage_short}_{month_abbr}.
  • clean formatted metafile

Potential additions:

  • flags --include-month, --include-passage

subcommand - gisaid/ncbi reformatter

A subcommand to handle metadata reformatting for gisaid or ncbi submission.

Possible names for subcommand:

metadata
meta-db
meta-formatter
reformat-meta

Example usage:

flukit metadata --meta input_meta.csv --format gisaid --output output_meta.tsv

flags:

-m, --meta
-f, --format
-o, --output

Formats:

  • gisaid flu
  • gisaid rsv (or 'new gisaid')
  • ncbi rsv

Input is the output of fuzee - gisaid csv.

For gisaid use Rscript fuzee2gisaid.R script.

subcommand for phylo construction and plotting

For each batch processed an annotated tree must be constructed for each gene. Currently an R script handles the alignment, tree construction and plotting.

Example usage:

flukit treeplot --sequences {Path/input.fasta} --lineage {lineage} --output-dir {Path}

Input:

  • multifasta with headers such as XXXX.4 XXXX.6
    Output:
  • pdf of annotated tree(s). references colored red, samples colored blue. Extra info such as how the trees were generated (methods), date of generation would be useful.
  • tsv of Closest Prototypic Virus (CPV) - headers are seqno, result where Result is the CPV

The subcommand handles the reference fasta (detected from the lineage). Must include fasta datasets in the package for each subtype and each gene.

Notes:

  • Reuse the alignment functions from align_frames.py
  • Use Biopython for tree generation to avoid extra/external deps https://biopython.org/wiki/Phylo
  • Use toyplot for tree plotting

Questions:

  • How to find the CPV?
    • Given a list of known CPV per lineage, first calculate the distance of all samples to known CPV to generate a matrix. The closest ancestor per sample is the CPV. For tie breakers, a priority list is needed, probably the oldest vaccine strain or CPV is used.
    • CPV.tsv with the headers cpv, gene, priority where priority is per gene. Use previous plots to create priority list.

Tasks:

  • Parse fasta file and combine with reference set
  • Parse or get from fuzee meta files
  • Get low reactors from fuzee
  • Rename sequences
  • Plot trees using toyplot
  • Calculate CPV using .get_distance method from ete3 package

subcommand split

New subcommand: split

split subcommand takes in a mutifasta file and splits into either individual or gene.
No renaming is performed.
Write protection is needed (never overwrite a file).

Usage:

flukit split -i input.fasta -o output_dir --split-by gene

The function write_sequences performs this operation currently, however there is no subcommand to handle splitting individually.

Output should be a directory and the function should NOT overwrite existing files. It should through a warning via print to the user.

write_sequences for ind needs to be changed to handle a dict or a list.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.