carmensheppard / pneumokity

PneumoKITy - Fast sensitive Pneumococcal Capsular Serotype screening from WGS data

Python 100.00%

pneumokity's Issues

Add feature to create % for mixture multiplicities

Convert median multiplicities for mixed samples to a percentage of the total to facilitate mixture detection.
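A minimal sketch of the conversion described above, assuming the per-serotype median multiplicities are already available as a mapping. The function name `multiplicity_percentages` and the example values are hypothetical and are not taken from the PneumoKITy code base.

```python
def multiplicity_percentages(median_multiplicities):
    """Convert per-serotype median multiplicities to percentages of their total.

    `median_multiplicities` is assumed to map a serotype name to its
    median multiplicity from the mash screen output.
    """
    total = sum(median_multiplicities.values())
    if total == 0:
        # Avoid division by zero when no hit has any multiplicity.
        return {serotype: 0.0 for serotype in median_multiplicities}
    return {
        serotype: round(100.0 * value / total, 1)
        for serotype, value in median_multiplicities.items()
    }


# Hypothetical values for a two-serotype mixture.
print(multiplicity_percentages({"17F": 60, "15B": 20}))
```

Reporting each serotype as a share of the total makes samples with different sequencing depth easier to compare than raw multiplicities.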

Collation not correctly collating all data

Some samples are not making it into the collated data properly. Check object outputs for:

304271
311485
642768
655411
717880
717976
720983
753435
756628
756666
758019

Median multiplicity as quality metric

Look at using median multiplicity (MM) as a measure of mixtures in mixed samples. PneumoCaT2 is more sensitive to mixtures and the MM cut-off is set quite low; it could be set higher to reduce the number of mixtures found. Alternatively, it could potentially call both serotypes fully?

E.g. the 11D type strain appears mixed in both runs, 709662 and 917067. It is not called in PneumoCaT1 but appears in the coverage summary. The median multiplicity of the second type (15C) is only 5, but the hit hashes value is very high.

Probably should report median multiplicity of top hits in addition to hit % as a quality metric.

18F reference sequence failing with <70% hit RED rag status

Ref sequence for 18F from PneumoCaT ENA set of sequences is failing with RED rag status <70% hit. Investigate references in DB.

Check if this happens with other 18F isolates (if available), but given that the ref seq should be from the very organism this is tested against, it should work 100%.

Check 25FA determination from ENA panel

RED rag status - database not interpreting wcyD detection correctly?

List index out of range interpreting serotype 39 result

Analysing mash screen tsv for 709775 serotype 39 type strain:

Analysing mash screen output.
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 80, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 38, in main
run_parse(analysis, tsvfile)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 160, in run_parse
group_check(filtered_df, analysis.database)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 92, in group_check
stage1_result = list(get_pheno_list(results, session))
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 34, in get_pheno_list
out_res.append(pheno[0][0])
IndexError: list index out of range
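The traceback ends at `pheno[0][0]`, which raises IndexError when the lookup returns no rows. One possible guard is sketched below, assuming `rows` is a list of tuples as returned by a database query. The helper name `first_phenotype` and its arguments are hypothetical, not part of the actual code.

```python
def first_phenotype(rows, hit_name):
    """Return the first column of the first row, or fail with a clear message.

    `rows` is assumed to be a list of tuples from a database query;
    `hit_name` is only used to make the error message informative.
    """
    if not rows:
        # An empty result would otherwise surface as a bare IndexError.
        raise LookupError(f"No phenotype found in database for hit '{hit_name}'")
    return rows[0][0]


print(first_phenotype([("39",)], "39"))
```

Raising a descriptive error at this point would show which hit has no matching database entry instead of an unexplained index error.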

Check Mixed serotype determination - for hit and group in mix.

Missed mixed serotype determination for one sample with top hits 17F 99.7% and 15B 94.5% (ID 300858). Check whether this is because of the missing stage 2 for 15BC, or because it is not correctly handled, i.e. stage 2 determination is run when it should not be, or is overriding the stage 1 result.

Update outputs to suit "pneumoKity"

Update output to suit this "lite" version, e.g. it can be run standalone and give types/groups as appropriate.

Filename checking for read 1 and read 2

Implement filename checking using glob regex (user input if different from standard) to check that read 1 and read 2 are present and to get them stored correctly in file (if -i used), plus maybe remove from sample ID.
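A rough sketch of such a check using Python's `glob` module, which matches shell-style wildcards rather than full regular expressions. The function name `find_read_pair` and the default patterns are assumptions for illustration; the actual naming convention of the input files is not stated in the issue.

```python
import glob
import os


def find_read_pair(folder, r1_pattern="*_R1*.fastq.gz", r2_pattern="*_R2*.fastq.gz"):
    """Return (read1_path, read2_path) from `folder`, or raise if either is missing.

    The default patterns are hypothetical; a user-supplied pattern can be
    passed when the file names differ from this convention.
    """
    read1 = sorted(glob.glob(os.path.join(folder, r1_pattern)))
    read2 = sorted(glob.glob(os.path.join(folder, r2_pattern)))
    if len(read1) != 1 or len(read2) != 1:
        # Zero matches means a missing read file; several means an ambiguous folder.
        raise FileNotFoundError(
            f"Expected exactly one read 1 and one read 2 file in {folder}, "
            f"found {len(read1)} and {len(read2)}"
        )
    return read1[0], read2[0]
```

Requiring exactly one match per pattern catches both a missing mate and a folder that holds more than one sample.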

Add sample ID to stdout

Errors can give no indication of which sample failed, e.g.:

Running PneumoCaT 2.0a - Development Reference CTV.db database at /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb selected.
WARNING: Existing files in output dir will be overwritten
Used Mash version 2.2
Analysing mash screen output.
6A_6B_6C_6D Screen reference: /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb/6A_6B_6C_6D/wciN.msh
Analysing mash screen output.
Allele wciN unrecognised possible variant
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 79, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 49, in main
start_analysis(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/run_stage2.py", line 39, in start_analysis
sort_genes(gene, analysis, gene.var_type, session)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/screen_genes.py", line 40, in sort_genes
stage2_var = get_variant_ids(hit_genes, allele_or_gene, analysis.grp_id, session)[0]
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/utilities.py", line 231, in get_variant_ids
raise CtvdbError
exceptions.CtvdbError: CtvdbError: check CTV.db and folder integrity, missing or mismatching information may be present.

Add sample info to stdout early in run
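One way this could look, as a sketch only: print the sample ID before any analysis step runs and attach it to any exception that escapes. The wrapper name `run_with_sample_context` and its arguments are hypothetical and do not reflect how the tool is actually structured.

```python
def run_with_sample_context(sample_id, step, *args, **kwargs):
    """Run `step`, reporting the sample ID first and on any failure.

    `step` stands in for any analysis function; it is a placeholder,
    not a real PneumoKITy entry point.
    """
    # Printed before the work starts so the ID is visible even on a crash.
    print(f"Sample: {sample_id}", flush=True)
    try:
        return step(*args, **kwargs)
    except Exception as err:
        # Chain the original exception so its traceback is still shown.
        raise RuntimeError(f"Sample {sample_id}: {type(err).__name__}: {err}") from err
```

With the ID printed up front and repeated in the error, a failure in a batch run can be traced back to its sample from stdout alone.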

df.append() deprecated

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

-> Update to the concat method. Does not crash the run but gives annoying screen output!
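A minimal sketch of the replacement suggested by the warning, using `pandas.concat`. The data frame names and contents are made up for illustration and do not come from the PneumoKITy code.

```python
import pandas as pd

# Hypothetical collated results and a new single-sample result.
collated = pd.DataFrame({"Sample": ["A"], "Serotype": ["19F"]})
new_row = pd.DataFrame({"Sample": ["B"], "Serotype": ["6A"]})

# Deprecated form: collated = collated.append(new_row, ignore_index=True)
# Equivalent using concat, which takes a list of frames to stack.
collated = pd.concat([collated, new_row], ignore_index=True)

print(collated)
```

`ignore_index=True` renumbers the rows so that the combined frame has a continuous index.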

singularity image

Hello, I am just checking to see whether you are going to have a PneumoKITy Singularity image? Thanks

Error writing empty dataframe

Some samples fail when writing output files

Sample: 308597_RR1600088809-1
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 80, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 67, in main
handle_results(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/tools.py", line 314, in handle_results
collate_results(analysis.csv_collate, results)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/tools.py", line 292, in collate_results
df = pd.read_csv(collate_file)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

This occurs in handle_results, the last part of the script.
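The traceback shows `pd.read_csv` raising `EmptyDataError` on a collate file with no content. A possible guard is sketched below. The helper name `read_collate_file` is hypothetical, and falling back to an empty DataFrame is an assumption about the desired behaviour rather than the author's fix.

```python
import pandas as pd
from pandas.errors import EmptyDataError


def read_collate_file(path):
    """Read the collated CSV, returning an empty DataFrame if the file has no content."""
    try:
        return pd.read_csv(path)
    except EmptyDataError:
        # Raised by pandas when the file exists but has no columns to parse.
        return pd.DataFrame()
```

The caller can then test `df.empty` and write the first result with a header instead of failing.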

Add stage 2 for serogroup 25F/25A

25F and 25A cannot be distinguished in stage 1 alone; a second stage needs to be added.

onscreen info for serotype within groups

Excessive onscreen repeats of the same info for serotype within group, e.g. 35A_35C_42.

Investigate why this is printing twice during the initial screen.

Remove text from "predicted phenotype" field for group level determination

Remove the extra text from predicted phenotype for those which are determined to group level. It will just cause headaches for analysis downstream.

eg: Serotype within 35A_35C_42 unable to determine serotype using PneumoKITy due to requirement for sensitive sequence analysis.

Interpret similarly to the others, e.g. 24F/B or 12A/12B/44/46.
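As an illustration only, a group name such as 35A_35C_42 could be turned into the slash-separated form used for the other groups with a simple string operation. The function name `format_group` is hypothetical, and the sketch assumes group names always use underscores as separators.

```python
def format_group(group_name):
    """Turn an underscore-separated group name into a slash-separated label."""
    # Assumes every underscore separates two serotypes in the group name.
    return "/".join(group_name.split("_"))


print(format_group("35A_35C_42"))
```

A short label like this keeps the predicted phenotype field free of explanatory text, which is easier to parse downstream.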

raise KeyError(f"{not_found} not in index")

KeyError: "['Serotype', 'shared-hashes', 'median-multiplicity', 'p-value'] not in index"
I am encountering this error and don't know how to fix it, please help.

"mixed serotypes" for 19F and 32F/A when not mixed

Some serotypes are giving "mixed serotypes" when they are not mixed, probably due to detection of more than one sequence at stage 1 (as expected) when the sample is then picked out later as a type rather than mixed.

Stop low hits in stage 2 being flagged amber when gene not present

Add a cut-off below 20% for "GREEN" status, as currently they are flagged amber.
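A hedged sketch of how such a cut-off might be expressed. Only the 20% lower cut-off comes from the issue; the function name `stage2_rag`, the 90% upper threshold and the overall structure are assumptions made for illustration.

```python
def stage2_rag(hit_percent, absent_below=20.0, present_from=90.0):
    """Assign a RAG status to a stage 2 gene hit percentage.

    `absent_below` follows the 20% cut-off proposed in the issue.
    `present_from` is a hypothetical threshold chosen for this example.
    """
    if hit_percent < absent_below:
        return "GREEN"  # hit is low enough to treat the gene as absent
    if hit_percent >= present_from:
        return "GREEN"  # hit is high enough to treat the gene as present
    return "AMBER"  # in between: the result is uncertain


for value in (5.0, 50.0, 95.0):
    print(value, stage2_rag(value))
```

Treating a very low hit as a confident absence means that only the intermediate range is flagged for review.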

False CtvdbError

wciN with unrecognised allele is leading to a false ctvdb error:
Running PneumoCaT 2.0a - Development Reference CTV.db database at /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb selected.
WARNING: Existing files in output dir will be overwritten
Used Mash version 2.2
Analysing mash screen output.
6A_6B_6C_6D Screen reference: /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb/6A_6B_6C_6D/wciN.msh
Analysing mash screen output.
Allele wciN unrecognised possible variant
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 79, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 49, in main
start_analysis(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/run_stage2.py", line 39, in start_analysis
sort_genes(gene, analysis, gene.var_type, session)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/screen_genes.py", line 40, in sort_genes
stage2_var = get_variant_ids(hit_genes, allele_or_gene, analysis.grp_id, session)[0]
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/utilities.py", line 231, in get_variant_ids
raise CtvdbError
exceptions.CtvdbError: CtvdbError: check CTV.db and folder integrity, missing or mismatching information may be present.