ammaraziz / flukit

License: GNU Lesser General Public License v3.0

Python 100.00%

flukit's Introduction

Flukit - simple variant caller for influenza

Used internally at WHOFLUCC. Not recommended for external use.

install

clone this repo
install using pip

cd flukit && python -m pip install .

Install nextclade:

conda install -c bioconda nextclade

usage

Run flukit --help to see detailed instructions

input
- fasta, a file whose headers end in .1... .4, where the suffix identifies the gene. For example, MySample.4 is the HA gene of MySample.
- lineage, one of h1n1, h3n2, vic
- the batch name used as prefix for output files
- the output path

Example:

flukit -s flu.fasta -l h3n2 -b batch150 -o ~/Desktop/

output

tsv file in the following format:

 seqno - the fasta header
 ha_aa - HA mutations called against vacc_ref
 na - H275Y mutation
 mp - S31N mutation
 pa - I38X mutation
 vacc_ref - called against ancestral strains

Output is formatted specifically for internal database.


flukit's Issues

subcommand find/rename

Subcommand for finding fasta files specific to a batch.

Example usage:

flukit find \
    --input-dir {Path} \
    --input-meta {tsv or csv} \
    --batch-num {num} \
    --output-dir {Path} \
    --split-by gene

--split-by (optional): gene writes one file per gene (ha.fasta, na.fasta, etc.); all writes every sequence to its own file.
--input-dir (optional): the directory of fasta files. Default paths should be encoded (see below).
--input-meta (optional): fuzee output containing the seqno values to retrieve.

Operation:

Get a list of all fasta files from the input directory
Parse meta file to extract Seq No
Match lists, subset, get complete path
Concat files together and write out
If split-by is specified, split as appropriate
Create a folder in output-dir named after the batch number
Write out fasta files
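
The steps above could be sketched roughly as follows. This is only an illustration, not flukit's implementation: the function names, the "Seq No" column name, and the assumption that file names begin with the seqno (for example N1000.4.fasta) are all hypothetical.

```python
import csv
from pathlib import Path


def read_seqnos(meta_path, column="Seq No"):
    """Return the set of sequence numbers listed in a tsv or csv meta file."""
    meta_path = Path(meta_path)
    # Assumption: the file extension tells us the delimiter.
    delimiter = "\t" if meta_path.suffix.lower() == ".tsv" else ","
    with meta_path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {row[column].strip() for row in reader}


def find_fasta(input_dir, seqnos):
    """Map each wanted seqno to the fasta files whose name starts with it."""
    matches = {}
    for path in sorted(Path(input_dir).rglob("*.fasta")):
        seqno = path.stem.split(".")[0]  # assumed naming: N1000.4.fasta
        if seqno in seqnos:
            matches.setdefault(seqno, []).append(path)
    return matches


def write_batch(matches, output_dir, batch_num):
    """Concatenate matched files into <output_dir>/<batch_num>/<batch_num>.fasta."""
    batch_dir = Path(output_dir) / str(batch_num)
    batch_dir.mkdir(parents=True, exist_ok=True)
    out_path = batch_dir / f"{batch_num}.fasta"
    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with out_path.open("x") as out:
        for paths in matches.values():
            for path in paths:
                out.write(path.read_text().rstrip("\n") + "\n")
    return out_path
```

Splitting by gene would then be a grouping step on the header suffix before writing.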

Default paths to search in:

S:/Shared/WHOFLU/mol_biol/00-New Sequences/
S:/Shared/WHOFLU/mol_biol/Sequencing-NGS/
/mnt/Sdrive/WHOFLU/mol_biol/00-New Sequences/
/mnt/Sdrive/WHOFLU/mol_biol/Sequencing-NGS/

Considerations:

How should NGS runs be handled? These files are located in a different directory and in a single fasta file. One possible solution: search /00-New Sequences/ first; if no matches are returned, search .../Sequencing-NGS/ for any files matching the batch number, then split those files into .../00-New Sequences/ without overwriting existing files.
Use the fuzee api to retrieve batch metadata (related to #4).

flusurver data for antiviral resistance

Add the antiviral resistance from flusurver for NA, MP and PA. Only appears for these genes.

Add support for access fuzee api

The fuzee 'api' is written in R. Accessing this in python for flukit would be a major boost to productivity.

urllib is the go-to package for sending GET and POST requests. Mimic the fuzee R package.

Access to the worksheets and the gisaid page would be especially handy.
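
A minimal sketch of building GET and POST requests with urllib is shown here. The base URL, endpoint names and parameters are placeholders, not the real fuzee routes, and authentication is not covered.

```python
import json
import urllib.parse
import urllib.request

# Placeholder address, NOT the real fuzee server.
BASE_URL = "https://fuzee.example.org/api"


def build_get(endpoint, params):
    """Build (but do not send) a GET request with URL-encoded query parameters."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{BASE_URL}/{endpoint}?{query}", method="GET")


def build_post(endpoint, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the request builders testable without network access.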

new subcommand - renaming fasta files

Current method for renaming fasta files is very repetitive and cumbersome. The process is:

get meta data
split meta data by subtype, gene, passage
grep for sequences of interest for each group above
rename

This should be automated. New subcommand: rename.

Example:

flukit rename --sequences input.fasta --meta-file meta.tsv --output-dir output/ --prefix prefix

Input:

meta.tsv contents: seqno, passage, gene, date, designation.
fasta with headers: N1000.4 or N1000.ha

Output:

Renamed fasta file: {designation}{passage_short}_{month_abbr}.
clean formatted metafile

Potential additions:

flags --include-month, --include-passage
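
The renaming pattern could look something like the sketch below. It assumes the meta rows are already loaded into a dict keyed by seqno, that the passage column already holds the short passage form, and that date is in ISO format (YYYY-MM-DD); none of this is confirmed by the issue, and the function names are hypothetical.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the locale.
MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_name(row, include_passage=True, include_month=True):
    """Build {designation}{passage_short}_{month_abbr} from one meta row."""
    name = row["designation"]
    if include_passage:
        name += row["passage"]
    if include_month:
        month = date.fromisoformat(row["date"]).month
        name += "_" + MONTH_ABBR[month - 1]
    return name


def rename_fasta(lines, meta):
    """Yield fasta lines with headers replaced by names built from meta.

    meta maps seqno -> row dict. Headers with no meta row are left unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            seqno = line[1:].strip().split(".")[0]  # N1000.4 -> N1000
            if seqno in meta:
                line = ">" + build_name(meta[seqno]) + "\n"
        yield line
```

The two keyword arguments correspond to the proposed --include-passage and --include-month flags.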

subcommand - gisaid/ncbi reformatter

A subcommand to handle metadata reformatting for gisaid or ncbi submission.

Possible names for subcommand:

metadata
meta-db
meta-formatter
reformat-meta

Example usage:

flukit metadata --meta input_meta.csv --format gisaid --output output_meta.tsv

flags:

-m, --meta
-f, --format
-o, --output

Formats:

gisaid flu
gisaid rsv (or 'new gisaid')
ncbi rsv

Input is the output of fuzee - gisaid csv.

For gisaid, use the fuzee2gisaid.R script (run with Rscript).
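
The flags listed above could be wired up as in this sketch. argparse is used purely for illustration (flukit's actual CLI framework may differ), and the spellings of the format choices are hypothetical.

```python
import argparse

# Hypothetical spellings for the three formats named in this issue.
FORMATS = ("gisaid-flu", "gisaid-rsv", "ncbi-rsv")


def build_parser():
    """Return a parser exposing the -m/--meta, -f/--format and -o/--output flags."""
    parser = argparse.ArgumentParser(
        prog="flukit metadata",
        description="Reformat fuzee metadata for gisaid or ncbi submission.",
    )
    parser.add_argument("-m", "--meta", required=True,
                        help="input metadata file (fuzee gisaid csv)")
    parser.add_argument("-f", "--format", required=True, choices=FORMATS,
                        help="target submission format")
    parser.add_argument("-o", "--output", required=True,
                        help="path of the reformatted output file")
    return parser
```

Using choices means an unsupported format is rejected at parse time rather than deep inside the reformatting code.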

subcommand for phylo construction and plotting

For each batch processed an annotated tree must be constructed for each gene. Currently an R script handles the alignment, tree construction and plotting.

Example usage:

flukit treeplot --sequences {Path/input.fasta} --lineage {lineage} --output-dir {Path}

Input:

multifasta with headers such as XXXX.4 XXXX.6

Output:

pdf of annotated tree(s): references colored red, samples colored blue. Extra information, such as how the trees were generated (methods) and the date of generation, would be useful.
tsv of Closest Prototypic Virus (CPV): headers are seqno, result, where result is the CPV

The subcommand handles the reference fasta (detected from the lineage). Must include fasta datasets in the package for each subtype and each gene.

Notes:

Reuse the alignment functions from align_frames.py
Use Biopython for tree generation to avoid extra/external deps https://biopython.org/wiki/Phylo
Use toyplot for tree plotting

Questions:

How to find the CPV?
- Given a list of known CPV per lineage, first calculate the distance from every sample to each known CPV to generate a matrix. The closest ancestor per sample is the CPV. To break ties, a priority list is needed; probably the oldest vaccine strain or CPV is used.
- CPV.tsv with the headers cpv, gene, priority where priority is per gene. Use previous plots to create priority list.
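
The closest-CPV idea with a priority tie-breaker can be sketched as below. Note the substitution: this uses a plain Hamming distance on pre-aligned sequences as a stand-in for a tree-based distance, and all names are hypothetical.

```python
def hamming(a, b):
    """Count mismatched columns between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_cpv(samples, cpvs, priority):
    """Pick the nearest CPV for each sample.

    samples, cpvs: dict of name -> aligned sequence.
    priority: dict of CPV name -> rank for one gene; the lower rank wins ties.
    """
    result = {}
    for seqno, seq in samples.items():
        # Sort key: distance first, then priority rank to break ties.
        result[seqno] = min(
            cpvs,
            key=lambda name: (hamming(seq, cpvs[name]), priority[name]),
        )
    return result


cpvs = {"cpvA": "ACGTACGT", "cpvB": "ACGTTTGT"}
priority = {"cpvA": 2, "cpvB": 1}
samples = {"N1": "ACGTACGA", "N2": "ACGTATGT"}
print(closest_cpv(samples, cpvs, priority))
# N1 is 1 mismatch from cpvA and 3 from cpvB, so cpvA is chosen.
# N2 is 1 mismatch from both, so the lower priority rank (cpvB) is chosen.
```

The same tuple-keyed min would work unchanged if the distance function were swapped for one read off a tree.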

Tasks:

Parse fasta file and combine with reference set
Parse or get from fuzee meta files
Get low reactors from fuzee
Rename sequences
Plot trees using toyplot
Calculate CPV using .get_distance method from ete3 package

subcommand split

New subcommand: split

The split subcommand takes a multifasta file and splits it either into individual sequences or by gene.
No renaming is performed.
Write protection is needed (never overwrite a file).

Usage:

flukit split -i input.fasta -o output_dir --split-by gene

The function write_sequences performs this operation currently, however there is no subcommand to handle splitting individually.

Output should be a directory and the function should NOT overwrite existing files. It should warn the user via print when a file already exists.

write_sequences for ind needs to be changed to handle a dict or a list.
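
One way to get the write protection is to open files in exclusive-create mode, as in this sketch. It is not the existing write_sequences function: the names are hypothetical, records are assumed to be (header, sequence) pairs, and gene files are named after the raw header suffix (4.fasta rather than ha.fasta).

```python
from pathlib import Path


def group_records(records, split_by):
    """Group (header, sequence) pairs by gene suffix or one group per record."""
    groups = {}
    for header, seq in records:
        # "gene": N1000.4 -> "4"; "ind": each header is its own group.
        key = header.rsplit(".", 1)[-1] if split_by == "gene" else header
        groups.setdefault(key, []).append((header, seq))
    return groups


def write_groups(groups, output_dir):
    """Write one fasta per group, skipping (and warning about) existing files."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, records in groups.items():
        target = output_dir / f"{key}.fasta"
        try:
            # Mode "x" fails if the file exists, so nothing is ever overwritten.
            with target.open("x") as handle:
                for header, seq in records:
                    handle.write(f">{header}\n{seq}\n")
        except FileExistsError:
            print(f"warning: {target} already exists, skipping")
            continue
        written.append(target)
    return written
```

Because the existence check and the file creation happen in a single open call, there is no gap in which another process could create the file between the check and the write.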
