nuruddinkhoiry / blastmining
Mining NCBI BLAST output
License: GNU General Public License v3.0
This looks like an interesting tool and I'm looking forward to testing it out. Before going ahead with the testing, I was just curious if your tool has a coverage cutoff parameter too, for example query coverage or query/hit coverage?
Hi,
I quickly tested your tool and found that Method A (majority vote with percent identity cut-off) doesn't give the results described in the README. Every row in lca_method.tsv is assigned down to species level, rather than stopping at a higher rank (kingdom, phylum, class, order, family or genus) as the percent identity cutoffs would dictate. I ran this test on the conda installation of your package, using the command lines below.
cat test_zotus.fasta | parallel --gnu -j 100 --recstart '>' -N 10 --pipe blastn -task blastn -query - -db /home/antonycp/tools/ncbi-blast-2.3.0+/bin/nt -outfmt \'6 qseqid sseqid pident length mismatch gapopen evalue bitscore staxid\' -qcov_hsp_perc 70 -max_target_seqs 10 > test_zotus_coverage_blastn_maxtarget10_screenresults.txt
Nucleotide-Nucleotide BLAST 2.12.0+ was used above
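For anyone inspecting the BLAST table before handing it to blastMining, here is a minimal Python sketch that reads the nine columns named in the -outfmt string above. The function name read_blast_tab and the two example rows are made up for illustration. The e-value filter assumes that -e in the blastMining command is an e-value threshold. Note that this column set carries no query length or alignment coordinates, so coverage cannot be recomputed from it; in the command above, coverage is only applied at BLAST time through -qcov_hsp_perc 70.

```python
import csv
import io

# Column order copied from the -outfmt string in the blastn command above.
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                 "gapopen", "evalue", "bitscore", "staxid"]


def read_blast_tab(handle):
    """Yield one dict per hit from tab-separated BLAST output (no header line)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(BLAST_COLUMNS):
            raise ValueError(f"expected {len(BLAST_COLUMNS)} columns, got {len(row)}")
        hit = dict(zip(BLAST_COLUMNS, row))
        # Convert the numeric fields used for filtering.
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        yield hit


# Hypothetical rows, for illustration only (not taken from the attached files).
example = (
    "Zotu1\tsubjectA\t98.5\t300\t4\t0\t1e-150\t540\t8187\n"
    "Zotu2\tsubjectB\t81.0\t120\t20\t2\t0.5\t40.1\t9606\n"
)
hits = list(read_blast_tab(io.StringIO(example)))

# Keep hits at or below an e-value of 0.001 (assumed meaning of -e 0.001).
strong = [h for h in hits if h["evalue"] <= 0.001]
print([h["qseqid"] for h in strong])
```

With a real file, open it with open(path, newline="") and pass the handle to read_blast_tab in place of the io.StringIO object.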
conda activate conda_blastmining_env/
blastMining vote -i test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm 'Sample' -j 100 -p lca_method -kp -rm
head -n10 vote_method/lca_method.tsv
qseqid Kingdom Phylum Class Order Family Genus Species
Zotu1680 k__Eukaryota p__Rhodophyta c__Florideophyceae o__Ceramiales f__Dasyaceae g__Heterosiphonia s__Heterosiphonia sp. 1densiuscula
Zotu1625 k__Eukaryota p__Chordata c__Actinopteri o__ f__Centropomidae g__Lates s__Lates calcarifer
Zotu933 k__Eukaryota p__Chordata c__Actinopteri o__Scombriformes f__Scombridae g__Thunnus s__Thunnus albacares
Zotu1317 k__Eukaryota p__Annelida c__Clitellata o__Crassiclitellata f__Megascolecidae g__Pontodrilus s__Pontodrilus litoralis
Zotu791 k__Eukaryota p__Arthropoda c__Hexanauplia o__Harpacticoida f__Canthocamptidae g__Australocamptus s__Australocamptus hamondi
Zotu1561 k__Eukaryota p__Chordata c__Actinopteri o__Pempheriformes f__Lateolabracidae g__Lateolabrax s__Lateolabrax maculatus
Zotu1611 k__Eukaryota p__Arthropoda c__Insecta o__Diptera f__Phoridae g__Megaselia s__Megaselia sp. BOLD-2016
Zotu942 k__Eukaryota p__Arthropoda c__Hexanauplia o__Calanoida f__Paracalanidae g__Paracalanus s__Paracalanus aculeatus
Zotu958 k__Eukaryota p__Mollusca c__Gastropoda o__Pteropoda f__Cymbuliidae g__Corolla s__Corolla spectabilis
I'm attaching the input files here (in .txt format, to comply with GitHub upload rules):
test_zotus_coverage_blastn_maxtarget10_screenresults.txt
test_zotus.fasta.txt
and the output files I got
lca_method.summary.txt
lca_method.tsv.txt
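To make the expectation concrete, here is a small Python sketch of one possible reading of the -txl cutoffs. It assumes the seven values 99,97,95,90,85,80,75 apply to species, genus, family, order, class, phylum and kingdom in that order, and that a hit is only allowed to name ranks whose cutoff its percent identity meets. This is my interpretation of the README, not blastMining's confirmed logic, and deepest_rank is a hypothetical helper.

```python
# Assumed rank order for the -txl values (most specific rank first).
RANKS = ["Species", "Genus", "Family", "Order", "Class", "Phylum", "Kingdom"]
# Cutoffs copied from -txl in the command above.
THRESHOLDS = [99, 97, 95, 90, 85, 80, 75]


def deepest_rank(pident, thresholds=THRESHOLDS, ranks=RANKS):
    """Return the most specific rank whose cutoff pident meets, or None."""
    for rank, cutoff in zip(ranks, thresholds):
        if pident >= cutoff:
            return rank
    return None  # below every cutoff: no assignment at any rank


for value in (99.2, 96.0, 70.0):
    print(value, deepest_rank(value))
```

Under this reading, a query whose hits are all at about 96% identity would be reported to family level with genus and species left empty, which is what I expected to see in at least some rows of lca_method.tsv.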
Also, is there any way to get the taxonomy ids in the final output files? I'm sure many users would really require this info. Thank you very much in advance!
I was wondering why some sequences are excluded entirely after running Method A (vote method) but not after running Method D (besthit method). I have 25 sequences in my BLAST input file, yet only 18 of them appear in the result tsv file from Method A, whereas all 25 are present in the result tsv file from Method D. I would appreciate it very much if you could let me know the reason for this discrepancy. For example, why are the results for Zotu1 and Zotu2 missing from my vote method output tsv file but not from the besthit output tsv file? I'm attaching all the input and output files here for your perusal.
Below are the command-lines that I had used
blastMining besthit -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o new_besthit_method -e 0.001 -pi 97 -n 10 -sm test_zotus -j 100 -p besthit_method -kp
blastMining vote -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm test_zotus -j 100 -p vote_method -kp
newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt
besthit_method.tsv.txt
vote_method.tsv.txt
Here's the FASTA file of sequences that I used for the BLASTn analysis
test_select_zotus.fasta.txt
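For reference, this is roughly how the missing queries can be listed: a Python sketch that compares the query IDs in the BLAST input with those in a result tsv. It assumes the BLAST file has no header and the result file has one header line with qseqid in the first column (as in the lca_method.tsv output shown earlier). The names qseqids and missing_queries and the inline example data are hypothetical.

```python
import csv
import io


def qseqids(handle, has_header):
    """Return the set of values found in the first tab-separated column."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line
    return {row[0] for row in reader if row}


def missing_queries(blast_handle, result_handle):
    """Query IDs present in the BLAST table but absent from the result table."""
    return sorted(qseqids(blast_handle, False) - qseqids(result_handle, True))


# Made-up miniature inputs, for illustration only.
blast_text = "Zotu1\tsubjA\t98.0\nZotu2\tsubjB\t91.0\nZotu3\tsubjC\t99.5\n"
vote_text = "qseqid\tKingdom\nZotu3\tk__Eukaryota\n"

print(missing_queries(io.StringIO(blast_text), io.StringIO(vote_text)))
```

Running the same comparison against the besthit output gives an empty list for my files, while the vote output gives the seven missing IDs.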
Hi @NuruddinKhoiry, I pip installed blastMining v1.2.0 as you suggested and tried running Method A, but it still gives me an error (possibly related to a taxonkit flag?). Below are the command lines I used and the error I got. I have also attached the input file that I used. Thank you very much in advance for your help.
blastMining --version
blastMining v.1.2.0
blastMining vote -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm test_zotus -j 100 -p vote_method -kp
Error: unknown shorthand flag: 'P' in -P
Usage:
taxonkit reformat [flags]
Flags:
-d, --delimiter string field delimiter in input lineage (default ";")
-F, --fill-miss-rank fill missing rank with original lineage information (experimental)
-f, --format string output format, placeholders of rank are needed (default "{k};{p};{c};{o};{f};{g};{s}")
-h, --help help for reformat
-i, --lineage-field int field index of lineage. data should be tab-separated (default 2)
-r, --miss-rank-repl string replacement string for missing rank, if given "", "unclassified xxx xxx" will used, where "unclassified " is settable by flag -p/--miss-rank-repl-prefix
-p, --miss-rank-repl-prefix string prefix for estimated taxon level (default "unclassified ")
-R, --miss-taxid-repl string replacement string for missing taxid
-t, --show-lineage-taxids show corresponding taxids of reformated lineage
Global Flags:
--data-dir string directory containing nodes.dmp and names.dmp (default "/home/jubin/.taxonkit")
--line-buffered use line buffering on output, i.e., immediately writing to stdin/file for every line of output
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
--verbose print verbose information
unknown shorthand flag: 'P' in -P
[ERRO] empty file: -
Traceback (most recent call last):
File "/home/jubin/tools/conda_env_for_blastmining/bin/blastMining", line 8, in <module>
sys.exit(main())
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/blastMining/blastMining.py", line 65, in main
args.func(args)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/blastMining/vote/vote.py", line 85, in main
tax = pd.read_csv(str('./'+os.path.join(args.outdir, "TMPDIR")+"/TAXONKIT.out"), sep='\t', header=None, dtype=str)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Here's the input file I had used
newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt
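The help text printed with the error lists no -P shorthand for taxonkit reformat, so one plausible explanation (an assumption on my part, not confirmed) is that the installed taxonkit does not provide the flag that blastMining passes to it; the empty TAXONKIT.out and the pandas EmptyDataError would then just be a consequence. Here is a small Python sketch to check which shorthand flags an installed taxonkit reports. The helper names are hypothetical, and the excerpt is copied from the help output above.

```python
import re
import subprocess


def shorthand_flags(help_text):
    """Collect single-letter flags, e.g. 'd' from a line starting '-d, --delimiter'."""
    return set(re.findall(r"^\s*-(\w), --", help_text, flags=re.MULTILINE))


def installed_reformat_flags():
    """Run `taxonkit reformat --help` and return its shorthand flags.

    Requires taxonkit on PATH; defined here but not called automatically.
    """
    proc = subprocess.run(["taxonkit", "reformat", "--help"],
                          capture_output=True, text=True, check=False)
    return shorthand_flags(proc.stdout + proc.stderr)


# Excerpt of the help text from the error output above.
HELP_EXCERPT = """\
-d, --delimiter string field delimiter in input lineage (default ";")
-F, --fill-miss-rank fill missing rank with original lineage information (experimental)
-p, --miss-rank-repl-prefix string prefix for estimated taxon level (default "unclassified ")
-R, --miss-taxid-repl string replacement string for missing taxid
--data-dir string directory containing nodes.dmp and names.dmp
"""

flags = shorthand_flags(HELP_EXCERPT)
print("P" in flags, "p" in flags)
```

Calling installed_reformat_flags() inside the conda environment and checking for "P" would show whether the taxonkit found on PATH accepts that flag.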