carmensheppard / pneumokity
PneumoKITy - Fast sensitive Pneumococcal Capsular Serotype screening from WGS data
Convert median multiplicities for mixed samples to % of total to facilitate mixture detection.
Some samples are not making it into the collated data properly. Check object outputs for:
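A minimal sketch of the conversion described above, assuming the median multiplicities for a sample are already available as a mapping from serotype to value. The function name and the example numbers are hypothetical and are not taken from the PneumoKITy code base.

```python
def multiplicity_percentages(median_multiplicities):
    """Express each serotype's median multiplicity as a % of the sample total."""
    total = sum(median_multiplicities.values())
    if total <= 0:
        # Nothing to normalise against; report 0% for every hit.
        return {serotype: 0.0 for serotype in median_multiplicities}
    return {
        serotype: 100.0 * mm / total
        for serotype, mm in median_multiplicities.items()
    }


# Made-up values for a two-serotype mixture (illustration only).
shares = multiplicity_percentages({"17F": 60, "15B": 20})
print(shares)
```

Reporting shares rather than raw multiplicities would make samples of different sequencing depth comparable, which is presumably the point of the request.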
304271
311485
642768
655411
717880
717976
720983
753435
756628
756666
758019
Look at using median multiplicity (MM) as a measure of mixtures in mixed samples. PneumoCaT2 is more sensitive to mixtures and the MM cut-off is set quite low; it could be set higher to reduce the number of mixtures found. Alternatively, both serotypes could potentially be called fully?
E.g. the 11D type strain appears mixed in both runs 709662 and 917067. It is not called in PneumoCaT1 but appears in the coverage summary. The median multiplicity of the second type (15C) is only 5, but the number of hit hashes is very high.
Probably should report median multiplicity of top hits in addition to hit % as a quality metric.
The reference sequence for 18F from the PneumoCaT ENA set of sequences is failing with RED RAG status (<70% hit). Investigate references in DB.
Check whether this happens with other 18F isolates (if available), but given that the reference sequence should be from the very organism it is tested against, it should work 100%.
RED RAG status - database not interpreting wcyD detection correctly?
Analysing mash screen tsv for 709775 serotype 39 type strain:
Analysing mash screen output.
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 80, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 38, in main
run_parse(analysis, tsvfile)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 160, in run_parse
group_check(filtered_df, analysis.database)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 92, in group_check
stage1_result = list(get_pheno_list(results, session))
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/run_stage1.py", line 34, in get_pheno_list
out_res.append(pheno[0][0])
IndexError: list index out of range
Missed mixed serotype determination for one sample with top hits 17F 99.7% and 15B 94.5% (ID 300858). Check whether this is because stage 2 is missing for 15BC or because it is not handled correctly, i.e. stage 2 determination is run when it should not be, or is overriding the stage 1 result.
Update output to suit this "lite" version, e.g. so that it can be run stand-alone and give types/groups as appropriate.
Implement filename checking using a glob/regex pattern (user input if different from the standard) to check that read 1 and read 2 are both present and to get them stored correctly in the file (if -i is used). Possibly also remove the pattern from the sample ID.
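One way the read-pair check could look, as a sketch only. The function name and the default patterns (`*_R1*.fastq*` and `*_R2*.fastq*`) are assumptions for illustration, not PneumoKITy's actual naming convention; note that `glob` uses shell-style wildcards rather than regular expressions.

```python
import glob
import os


def find_read_pair(folder, pattern_r1="*_R1*.fastq*", pattern_r2="*_R2*.fastq*"):
    """Return (read1, read2) paths, or raise if either is missing or ambiguous.

    The default patterns are hypothetical; a user-supplied pattern can be
    passed in when file names differ from this layout.
    """
    read1 = sorted(glob.glob(os.path.join(folder, pattern_r1)))
    read2 = sorted(glob.glob(os.path.join(folder, pattern_r2)))
    if len(read1) != 1 or len(read2) != 1:
        raise FileNotFoundError(
            f"Expected exactly one read 1 and one read 2 file in {folder}; "
            f"found {len(read1)} and {len(read2)}"
        )
    return read1[0], read2[0]


# Usage (paths are placeholders):
# read1, read2 = find_read_pair("/path/to/sample_folder")
```

Failing early with the folder name in the message also helps with the separate complaint that errors do not say which sample failed.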
Errors can give no indication of which sample failed - eg:
Running PneumoCaT 2.0a - Development
Reference CTV.db database at /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb selected.
WARNING: Existing files in output dir will be overwritten
Used Mash version 2.2
Analysing mash screen output.
6A_6B_6C_6D
Screen reference: /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb/6A_6B_6C_6D/wciN.msh
Analysing mash screen output.
Allele wciN unrecognised possible variant
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 79, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 49, in main
start_analysis(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/run_stage2.py", line 39, in start_analysis
sort_genes(gene, analysis, gene.var_type, session)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/screen_genes.py", line 40, in sort_genes
stage2_var = get_variant_ids(hit_genes, allele_or_gene, analysis.grp_id, session)[0]
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/utilities.py", line 231, in get_variant_ids
raise CtvdbError
exceptions.CtvdbError: CtvdbError: check CTV.db and folder integrity, missing or mismatching information may be present.
Add sample info to stdout early in run
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Update to the pandas.concat method. The warning does not crash the run but gives annoying screen output.
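The change the warning asks for is mechanical. A small sketch, using made-up column names and values rather than the real PneumoKITy output frame:

```python
import pandas as pd

# Hypothetical collated results and one new result row (illustration only).
collated = pd.DataFrame({"sample": ["A"], "serotype": ["19A"]})
new_row = pd.DataFrame({"sample": ["B"], "serotype": ["6B"]})

# Deprecated form that triggers the FutureWarning:
#   collated = collated.append(new_row, ignore_index=True)
# Replacement using pandas.concat:
collated = pd.concat([collated, new_row], ignore_index=True)
print(collated)
```

`ignore_index=True` renumbers the rows, matching what `append(..., ignore_index=True)` did.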
Hello, I am just checking whether you are going to provide a PneumoKITy Singularity image? Thanks.
Some samples fail when writing output files
Sample: 308597_RR1600088809-1
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 80, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/PneumoCaT2/pneumocat2.py", line 67, in main
handle_results(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/tools.py", line 314, in handle_results
collate_results(analysis.csv_collate, results)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/PneumoCaT2/run_scripts/tools.py", line 292, in collate_results
df = pd.read_csv(collate_file)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/carmen.sheppard/.conda/envs/pneumocat2/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
This is in handle_results as last part of script.
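`pandas.read_csv` raises `EmptyDataError` when the file exists but contains no columns, so one plausible cause (an assumption, not confirmed here) is that the collated CSV was created empty before this sample was written. A defensive sketch follows; `read_collated` and its `columns` argument are hypothetical names, not functions from `tools.py`.

```python
import os

import pandas as pd


def read_collated(collate_file, columns):
    """Read the collated CSV, falling back to an empty frame if it has no data."""
    # A missing or zero-byte file has nothing to parse; start a fresh frame.
    if not os.path.exists(collate_file) or os.path.getsize(collate_file) == 0:
        return pd.DataFrame(columns=columns)
    try:
        return pd.read_csv(collate_file)
    except pd.errors.EmptyDataError:
        # File has content (e.g. blank lines) but no parseable header row.
        return pd.DataFrame(columns=columns)
```

If several samples write to the same collated file at once, the empty file may be a symptom of concurrent access, in which case this guard hides the problem rather than fixing it.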
25F and 25A cannot be distinguished in stage 1 alone; a second stage needs to be added.
Excessive on-screen repeats of the same info for serotypes within a group, e.g. 35A_35C_42.
Investigate why this is printing twice during the initial screen.
Remove the extra text from the predicted phenotype for those which are determined to group level. It will just cause headaches for downstream analysis.
eg: Serotype within 35A_35C_42 unable to determine serotype using PneumoKITy due to requirement for sensitive sequence analysis.
Report in a format similar to the others, e.g. 24F/B or 12A/12B/44/46.
KeyError: "['Serotype', 'shared-hashes', 'median-multiplicity', 'p-value'] not in index"
I am encountering this error and don't know how to fix it; please help.
Some serotypes are reported as "mixed serotypes" when they are not, probably because more than one sequence is detected at stage 1 (as expected) and the sample is then resolved later to a single type rather than a mixture.
Add a cut-off below 20% for "GREEN" status, as these are currently flagged amber.
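pandas raises this kind of `KeyError` when a list of column names is used to index a frame that lacks them. The sketch below does not fix the root cause, which is not stated in the report, but it makes the failure explicit by listing which columns are missing and which were actually found. The column names are copied from the error message; `select_required` is a hypothetical helper.

```python
import pandas as pd

REQUIRED = ["Serotype", "shared-hashes", "median-multiplicity", "p-value"]


def select_required(df, required=REQUIRED):
    """Return df restricted to the required columns, or explain what is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(
            f"Missing columns {missing}; columns found: {list(df.columns)}"
        )
    return df[required]
```

Comparing the "columns found" list with the expected names usually shows whether the input table was read without the expected column names or with a different delimiter.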
wciN with unrecognised allele is leading to a false ctvdb error:
Running PneumoCaT 2.0a - Development
Reference CTV.db database at /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb selected.
WARNING: Existing files in output dir will be overwritten
Used Mash version 2.2
Analysing mash screen output.
6A_6B_6C_6D
Screen reference: /phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/ctvdb/6A_6B_6C_6D/wciN.msh
Analysing mash screen output.
Allele wciN unrecognised possible variant
Traceback (most recent call last):
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 79, in <module>
main(args, version)
File "/home/carmen.sheppard/gitrepository/pneumocat2/pneumocat2.py", line 49, in main
start_analysis(analysis)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/run_stage2.py", line 39, in start_analysis
sort_genes(gene, analysis, gene.var_type, session)
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/screen_genes.py", line 40, in sort_genes
stage2_var = get_variant_ids(hit_genes, allele_or_gene, analysis.grp_id, session)[0]
File "/phengs/hpc_storage/home/carmen.sheppard/gitrepository/pneumocat2/run_scripts/utilities.py", line 231, in get_variant_ids
raise CtvdbError
exceptions.CtvdbError: CtvdbError: check CTV.db and folder integrity, missing or mismatching information may be present.