Giter Site home page Giter Site logo

qhery's Introduction

qhery

Identification of mutations in SARS-CoV-2 associated with resistance to treatment.


release license

qhery was developed by the Q-PHIRE Genomics team at Forensic and Scientific Services, Queensland Health.


Table of contents

Installation

git

While in development qhery can only be installed by downloading from git

git clone https://github.com/mjsull/qhery.git

requirements:

  • Python >= 3.9.12
  • bcftools >= 1.10.2
  • curl >= 7.83.1
  • wget >= 1.20.3

Optional requirements:

  • ncbi-blast+ >= 2.9.0+ - This will generate a BLASTx alignment of the genome for visualization
  • lofreq >= 2.1.5 - if provided with a BAM file qhery will look for minor alleles in the alignment with lofreq
  • samtools >= 1.7 - samtools is used to determine the depth of sequence along the genome, and which resistance mutations cannot be reported on due to lack of coverage.

Example usage:

qhery run --database_dir database_dir --vcf sample.vcf --pipeline_dir output_dir --lineage Omicron/BA.1 --sample_name mysample --rx_list Sotrovimab

Determines the amino acid changes caused by the mutations listed in sample.vcf and then compares them to a list of mutations that cause a reduction in Sotrovimab binding.

qhery run --database_dir database_dir --vcf sample.vcf --pipeline_dir output_dir --lineage Omicron/BA.1 --sample_name mysample --rx_list Sotrovimab Remdesivir --fasta sample.consensus.fasta --bam sample.primertrimmed.rg.sorted.bam

Determines the amino acid changes caused by the mutations listed in sample.vcf. Additionally will use lofreq to find minor alleles in the BAM file. Finally they are compared to a list of mutations that cause a reduction in Sotrovimab binding or reduction in Remdesivir efficiency.

qhery list_rx --database_dir database_dir

List treatments for which resistance information exists.

Output

qhery produces two tables.

<sample_name>.full.tsv

Contains all mutations detected in the sample and all mutations associated with the treatments listed by the user.

<sample_name>.final.tsv

Both tables have the same format (described below in example output).

Contains all mutations detected in the sample that are both in genes assosciated with reistance to the treatments listed and are not lineage defining mutations. It also contains resistance mutation that do not have enough read depth to be called as present or absent (default 20x read depth).

Finally qhery will also produce a BLASTx alignment of the query to mature proteins and a bammix plot of the epitopes of the treatments the user listed (if available).

Example output:

Mutation alt_names in_sample in_variant covered resistance_mutation Remdesivir_average_fold_reduction Remdesivir_fold_reductions Remdesivir_in_epitope Sotrovimab_average_fold_reduction Sotrovimab_fold_reductions Sotrovimab_in_epitope
E:T9I - True True True False 0 - False 0 - False
M:D3G - True True True False 0 - False 0 - False
N:ERS31-33∆ - True True False False 0 - False 0 - False
ORF3a:L52F - True False True False 0 - False 0 - False
RdRP:802D - False False True True 2.54 =2.54 False 0 - False
S:R214ins S:R214R_EPE True True True True 0 - False 3.00 =3.0 False
S:P337T - True False True True 0 - False 8.00 =5.4,=10.6 True

Columns

column header description
1 mutation The mutation name. Gene name comes before the colon, then reference amino acid, position and sample amino acid
2 alt_names Discrepency between database mutation name and csq mutation name
3 in_sample Is mutation in the query
4 in_variant Is mutation a lineage defining mutation
5 covered Is the mutation covered by 20 or more reads
6 resistance_mutation Is there evidence the mutation may confer some resistance to one of the treatments listed
7 rx1_average_fold_reduction Average fold reduction of listed fold reductions
8 rx1_fold_reductions Fold reductions listed in the database
9 rx1_in_epitope Is the mutation in the epitope of this treatment (MABs only)
10 rx2_average_fold_reduction The previous 3 columns repeate for each treatment provided by the user
11 rx2_fold_reductions ...
12 rx2_in_epitope ...

TODO

  1. Need a phasing step between lofreq and bcftools csq (or to switch to a vcf caller that does phasing)
  2. Add allele frequency information to output table

Arguments:

-h, --help

show this help message and exit

subcommands

list_rx


List all drugs for which resistance information is available.

Takes no arguments


run


Determines mutations in samples and then checks against resistance data.

arguments

-n, --sample_name <sample_name>

Sample name, output files will be prefixed with this.

-v, --vcf <sample.vcf>

vcf file, variants called against the Wuhan-Hu-1 reference (MN908947.3)

-b, --bam <sample.sorted.bam>

Sorted bam file. File of read alignments for the sample mapped against the Wuhan-Hu-1 reference (MN908947.3)

-d, --database_dir <path/to/database_dir>

Directory with the latest version of the Stanford resistance database. If the latest version is not in this folder it will be downloaded to this location.

-p, --pipeline_dir <path/to/pipeline_dir>

All script output and intermediated files will be put here. Script will create a directory if none exists.

-l, --lineage <BA.1>

Lineage of the query (BA.1/BA.2/BA.3/Delta etc.)

-rx, --rx_list <Sotrovimab Remdesevir>

List of treatments to interrogate.

--fasta, --fasta <sample.fasta>

Fasta file of the consensus sequence of the sample, only used to generate a BLASTx alignment for double checking mutations.


mutations


arguments

Only list mutations and not resistance information.

-n, --sample_name <sample_name>

Sample name, output files will be prefixed with this.

-v, --vcf <sample.vcf>

vcf file, variants called against the Wuhan-Hu-1 reference (MN908947.3)

-b, --bam <sample.sorted.bam>

Sorted bam file. File of read alignments for the sample mapped against the Wuhan-Hu-1 reference (MN908947.3)

-d, --database_dir <path/to/database_dir>

Directory with the latest version of the Stanford resistance database. If the latest version is not in this folder it will be downloaded to this location.

-p, --pipeline_dir <path/to/pipeline_dir>

All script output and intermediated files will be put here. Script will create a directory if none exists.

-l, --lineage <BA.1>

Lineage of the query (BA.1/BA.2/BA.3/Delta etc.)

-k, --keep_lineage

report lineage defining mutations as well


Method

A flowchart of how qhery run works

flowchart

qhery's People

Contributors

wytamma avatar mjsull avatar

Stargazers

 avatar  avatar Torsten Seemann avatar Son Nguyen avatar

Watchers

 avatar

Forkers

wytamma

qhery's Issues

Contig '' is not defined in the header

I'm running qhery on ~30k covid samples and getting a few errors. Contig '' implies a missing sequence ID?

Latest covid-drdb is https://github.com/hivdb/covid-drdb-payload/releases/download/20220818/covid-drdb-20220818.db.                                                                                                                                       
Latest version already downloaded.                                                                                                                                                                                                                        
database_dir/covid-drdb-20220818.db                                                                                                                                                                                                                       
['E:T9I', 'M:D3N', 'M:Q19E', 'M:A63T', 'ORF3a:T223I', 'PLpro:T24I', 'PLpro:G489S', 'RdRP:P323L', 'S:T19I', 'S:L24S', 'S:PPA25-27∆', 'S:HV69-70∆', 'S:G142D', 'S:V213G', 'S:G339D', 'S:S371F', 'S:S373P', 'S:S375F', 'S:T376A', 'S:D405N', 'S:R408S', 'S:K4
17N', 'S:N440K', 'S:L452R', 'S:S477N', 'S:T478K', 'S:E484A', 'S:F486V', 'S:Q498R', 'S:N501Y', 'S:Y505H', 'S:T547I', 'S:D614G', 'S:H655Y', 'S:N679K', 'S:P681H', 'S:N764K', 'S:D796Y', 'S:Q954H', 'S:N969K', '_3CLpro:P132H', 'nsp1:S135R', 'nsp13:R392C', 
'nsp14:I42V', 'nsp15:T112I', 'nsp4:L264F', 'nsp6:SGF106-108∆']                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
Variant not in database.                                                                                                                                                                                                                                  
Parsing /home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/data/Sars_cov_2.ASM985889v3.101.gff3 ...                                                                                                                                         
ignored: MN908947.3     ASM985889v3     region  1       29903   .       .       .       ID=region:MN908947.3;Alias=NC_045512.2,NC_045512v2                                                                                                                
Indexed 12 transcripts, 12 exons, 12 CDSs, 0 UTRs                                                                                                                                                                                                         
Calling...                                                                                                                                                                                                                                                
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)                                                                                                                                                     
Error: the chromosome "" is not present in /home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/data/nCoV-2019.reference.fasta     

ValueError: not enough values to unpack (expected 7, got 5)

Some of my VCF files don't have every element in the Consequence column. E.g. for the start_loss mutations...

CSQ	-	-	MN908947.3	27395	start_lost|ORF7a|ENSSAST00005000009|protein_coding|+

Thus, prot_mut and nuc_mut are not returned from consequence.split("|").

Traceback (most recent call last):
  File "/home/wwirth/.conda/envs/qhery/bin/qhery", line 8, in <module>
    sys.exit(main())
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/__main__.py", line 114, in main
    mut_list_sample = mf.parse_csq()
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/get_mutants.py", line 76, in parse_csq
    mut_type, gene, acc, gene_type, strand, prot_mut, nuc_mut = consequence.split("|")
ValueError: not enough values to unpack (expected 7, got 5)

sqlite3.OperationalError: unable to open database file

Sometimes the DB fails to load? maybe it is being rate limited by github?

Latest covid-drdb is .                                                                                                                                                                                                                                    
Latest version already downloaded.                                                                                                                                                                                                                        
database_dir/                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                              
Traceback (most recent call last):                                                                                                                                                                                                                        
  File "/home/wwirth/.conda/envs/qhery/bin/qhery", line 8, in <module>                                                                                                                                                                                    
    sys.exit(main())                                                                                                                                                                                                                                      
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/__main__.py", line 92, in main                                                                                                                                                   
    gt.connect()                                                                                                                                                                                                                                          
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/get_tables_sql.py", line 39, in connect                                                                                                                                          
    self.con = sqlite3.connect(self.database)                                                                                                                                                                                                             
sqlite3.OperationalError: unable to open database file  

🐛 ValueError: no index available for pileup

I'm getting this error when I try to run qhery with a BAM file. Do I need to have an index in the same location as the BAM file?

Could not retrieve index file for '2022-112557.bam'
Traceback (most recent call last):
  File "/home/wwirth/.conda/envs/qhery/bin/qhery", line 8, in <module>
    sys.exit(main())
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/__main__.py", line 121, in main
    make_output.make_final_tables(mut_list_sample, gt.resistances, mut_list_var, gt.epitopes, args.pipeline_dir, args.sample_name, args.bam, aa_to_nuc_dict)
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/make_output.py", line 73, in make_final_tables
    codon_usage, codon_depth = get_codons(bam_file, aa_to_nuc_dict[gene][start])
  File "/home/wwirth/.conda/envs/qhery/lib/python3.9/site-packages/qhery/make_output.py", line 226, in get_codons
    for pileupcolumn in samfile.pileup("MN908947.3", start_position):
  File "pysam/libcalignmentfile.pyx", line 1340, in pysam.libcalignmentfile.AlignmentFile.pileup
ValueError: no index available for pileup

Consensus sequence

Is is possible to use a consensus sequence with qhery? Is it valid to produce a VCF from the consensus and use that?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.