nuruddinkhoiry / blastmining

Mining NCBI BLAST output

License: GNU General Public License v3.0

Python 99.66% Shell 0.34%

blast blastn lca ncbi-blast

blastmining's Issues

Coverage parameter?

This looks like an interesting tool, and I'm looking forward to testing it. Before I do, does it have a coverage cutoff parameter, for example query coverage or query/hit coverage?
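
Whether blastMining itself exposes a coverage option is not answered in this thread. As a workaround, coverage can be filtered before the table reaches the tool: blastn's `-qcov_hsp_perc` option (used in a command later on this page) filters at search time, and the minimal sketch below filters an existing table afterwards. It assumes the BLAST table was produced with an extra `qcovs` column appended to the nine fields used elsewhere on this page; the function name and the toy rows are made up for illustration.

```python
import csv
import io

# Assumed layout: the nine outfmt 6 fields used in this thread, plus `qcovs`
# (query coverage per subject) appended as a tenth column.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
           "gapopen", "evalue", "bitscore", "staxid", "qcovs"]


def filter_by_query_coverage(handle, min_qcov):
    """Return the rows whose query coverage is at least `min_qcov` percent."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["qcovs"]) >= min_qcov]


# Toy rows, invented for this example (not taken from the attached files).
example = io.StringIO(
    "Zotu1\tsubjA\t99.0\t300\t3\t0\t1e-150\t540\t9606\t95\n"
    "Zotu1\tsubjB\t98.0\t150\t3\t0\t1e-70\t270\t9606\t48\n"
)
kept = filter_by_query_coverage(example, min_qcov=70)
for row in kept:
    print(row["qseqid"], row["sseqid"], row["qcovs"])
```

Note that a table filtered this way still carries the tenth column, so it would need to be dropped again if the downstream tool expects exactly nine fields.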

Method A doesn't give expected results

Hi,

I quickly tested your tool and found that Method A (majority vote with percent identity cut-off) doesn't give the results described in the README. Every row in lca_method.tsv is resolved to a species name, whereas I expected assignments to stop at a higher rank (kingdom, phylum, class, order, family or genus) where the percent identity cutoffs call for it. I ran the test on the conda installation of your package, using the commands below.

cat test_zotus.fasta | parallel --gnu -j 100 --recstart '>' -N 10 --pipe blastn -task blastn -query - -db /home/antonycp/tools/ncbi-blast-2.3.0+/bin/nt -outfmt \'6 qseqid sseqid pident length mismatch gapopen evalue bitscore staxid\' -qcov_hsp_perc 70 -max_target_seqs 10 > test_zotus_coverage_blastn_maxtarget10_screenresults.txt
(Nucleotide-Nucleotide BLAST 2.12.0+ was used for the search above.)
conda activate conda_blastmining_env/
blastMining vote -i test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm 'Sample' -j 100 -p lca_method -kp -rm

head -n10 vote_method/lca_method.tsv

qseqid	Kingdom	Phylum	Class	Order	Family	Genus	Species
Zotu1680	k__Eukaryota	p__Rhodophyta	c__Florideophyceae	o__Ceramiales	f__Dasyaceaeg__Heterosiphonia	s__Heterosiphonia sp. 1densiuscula
Zotu1625	k__Eukaryota	p__Chordata	c__Actinopteri	o__	f__Centropomidae	g__Lates	s__Lates calcarifer
Zotu933	k__Eukaryota	p__Chordata	c__Actinopteri	o__Scombriformes	f__Scombridae	g__Thunnus	s__Thunnus albacares
Zotu1317	k__Eukaryota	p__Annelida	c__Clitellata	o__Crassiclitellata	f__Megascolecidae	g__Pontodrilus	s__Pontodrilus litoralis
Zotu791	k__Eukaryota	p__Arthropoda	c__Hexanauplia	o__Harpacticoida	f__Canthocamptidae	g__Australocamptus	s__Australocamptus hamondi
Zotu1561	k__Eukaryota	p__Chordata	c__Actinopteri	o__Pempheriformes	f__Lateolabracidae	g__Lateolabrax	s__Lateolabrax maculatus
Zotu1611	k__Eukaryota	p__Arthropoda	c__Insecta	o__Diptera	f__Phoridae	g__Megaselia	s__Megaselia sp. BOLD-2016
Zotu942	k__Eukaryota	p__Arthropoda	c__Hexanauplia	o__Calanoida	f__Paracalanidae	g__Paracalanus	s__Paracalanus aculeatus
Zotu958	k__Eukaryota	p__Mollusca	c__Gastropoda	o__Pteropoda	f__Cymbuliidae	g__Corolla	s__Corolla spectabilis

I'm attaching the input files here (in .txt format, to comply with GitHub's upload rules):
test_zotus_coverage_blastn_maxtarget10_screenresults.txt
test_zotus.fasta.txt

and the output files I got:
lca_method.summary.txt
lca_method.tsv.txt

Also, is there any way to get the taxonomy IDs in the final output files? I'm sure many users would need this information. Thank you very much in advance!
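
The behaviour expected in this issue can be sketched as follows. This is only one reading of "percent identity cut-off", not blastMining's confirmed implementation: it assumes the seven `-txl` values map to species, genus, family, order, class, phylum and kingdom in that order, and that ranks below the deepest permitted one are left empty. The function names and the `pident=96.0` value are hypothetical; the lineage is the Zotu933 row from the output above.

```python
# Assumed mapping of the -txl values (99,97,95,90,85,80,75) to ranks,
# from the most specific rank to the least specific one.
RANKS = ["Species", "Genus", "Family", "Order", "Class", "Phylum", "Kingdom"]
CUTOFFS = [99, 97, 95, 90, 85, 80, 75]


def deepest_rank(pident, cutoffs=CUTOFFS):
    """Return the most specific rank whose cutoff `pident` reaches, or None."""
    for rank, cutoff in zip(RANKS, cutoffs):
        if pident >= cutoff:
            return rank
    return None


def truncate_lineage(lineage, pident, cutoffs=CUTOFFS):
    """Blank every rank more specific than the one permitted by `pident`."""
    rank = deepest_rank(pident, cutoffs)
    if rank is None:
        return {r: "" for r in RANKS}
    keep = RANKS[RANKS.index(rank):]
    return {r: (lineage[r] if r in keep else "") for r in RANKS}


lineage = {
    "Kingdom": "k__Eukaryota", "Phylum": "p__Chordata",
    "Class": "c__Actinopteri", "Order": "o__Scombriformes",
    "Family": "f__Scombridae", "Genus": "g__Thunnus",
    "Species": "s__Thunnus albacares",
}
# Hypothetical identity value: below the genus cutoff (97), above family (95).
result = truncate_lineage(lineage, pident=96.0)
print(result)
```

Under these assumptions a hit at 96% identity would be reported down to family only, which is the kind of row the reporter expected to see in lca_method.tsv.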

Missing results when Method A is used as opposed to Method D

Why are some sequences excluded entirely after running Method A (vote method) but not Method D (besthit method)? My BLAST input file has 25 sequences, yet only 18 of them appear in the result TSV file from Method A, while all 25 are present in the result TSV file from Method D. For example, why are Zotu1 and Zotu2 missing from the vote method output but present in the besthit output? I would very much appreciate an explanation of this discrepancy. I'm attaching all the input and output files for reference.
Below are the commands I used:

blastMining besthit -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o new_besthit_method -e 0.001 -pi 97 -n 10 -sm test_zotus -j 100 -p besthit_method -kp
blastMining vote -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm test_zotus -j 100 -p vote_method -kp

newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt
besthit_method.tsv.txt
vote_method.tsv.txt
Here is the FASTA file of sequences that I used for the BLASTn analysis:
test_select_zotus.fasta.txt
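
A quick way to narrow this down is to list the queries that are in the BLAST table but absent from a result file, together with their best percent identity. The sketch below assumes the nine-column tabular layout used on this page (qseqid first, pident third) and a result TSV with a `qseqid` header column, as in the lca_method.tsv excerpt above. Whether blastMining drops queries whose hits all fall below the lowest `-txl` cutoff is not stated in this thread; that is only one hypothesis the output could help check. The function names and toy rows are invented for illustration.

```python
import csv
import io


def best_pident_per_query(blast_handle):
    """Map each qseqid in a tabular BLAST file to its highest pident."""
    best = {}
    for row in csv.reader(blast_handle, delimiter="\t"):
        qseqid, pident = row[0], float(row[2])
        best[qseqid] = max(pident, best.get(qseqid, 0.0))
    return best


def missing_queries(blast_handle, result_handle):
    """Return {qseqid: best pident} for queries absent from the result TSV."""
    best = best_pident_per_query(blast_handle)
    reader = csv.DictReader(result_handle, delimiter="\t")
    reported = {row["qseqid"] for row in reader}
    return {q: p for q, p in best.items() if q not in reported}


# Toy data, invented for this example (not taken from the attached files).
blast = io.StringIO(
    "Zotu1\tsubjA\t74.2\t200\t50\t2\t1e-20\t90\t1111\n"
    "Zotu1\tsubjB\t73.0\t190\t51\t1\t1e-18\t85\t2222\n"
    "Zotu3\tsubjC\t99.1\t300\t3\t0\t1e-150\t540\t3333\n"
)
result = io.StringIO(
    "qseqid\tKingdom\tPhylum\n"
    "Zotu3\tk__Eukaryota\tp__Chordata\n"
)
missing = missing_queries(blast, result)
print(missing)
```

With the real files, the two handles would be opened from the BLAST table and from vote_method.tsv; running the same comparison against besthit_method.tsv should then return an empty dictionary if all 25 queries are present there.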

Error with version 1.2.0

Hi @NuruddinKhoiry, I pip-installed blastMining v1.2.0 as you suggested and tried running Method A, but it still gives me an error (possibly related to a taxonkit flag?). Below are the commands I used and the error I got; the input file is attached at the end. Thank you very much in advance for your help.

blastMining --version
blastMining v.1.2.0
blastMining vote -i newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt -o vote_method -e 0.001 -txl 99,97,95,90,85,80,75 -n 10 -sm test_zotus -j 100 -p vote_method -kp
Error: unknown shorthand flag: 'P' in -P
Usage:
  taxonkit reformat [flags]

Flags:
  -d, --delimiter string               field delimiter in input lineage (default ";")
  -F, --fill-miss-rank                 fill missing rank with original lineage information (experimental)
  -f, --format string                  output format, placeholders of rank are needed (default "{k};{p};{c};{o};{f};{g};{s}")
  -h, --help                           help for reformat
  -i, --lineage-field int              field index of lineage. data should be tab-separated (default 2)
  -r, --miss-rank-repl string          replacement string for missing rank, if given "", "unclassified xxx xxx" will used, where "unclassified " is settable by flag -p/--miss-rank-repl-prefix
  -p, --miss-rank-repl-prefix string   prefix for estimated taxon level (default "unclassified ")
  -R, --miss-taxid-repl string         replacement string for missing taxid
  -t, --show-lineage-taxids            show corresponding taxids of reformated lineage

Global Flags:
      --data-dir string   directory containing nodes.dmp and names.dmp (default "/home/jubin/.taxonkit")
      --line-buffered     use line buffering on output, i.e., immediately writing to stdin/file for every line of output
  -o, --out-file string   out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
  -j, --threads int       number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
      --verbose           print verbose information

unknown shorthand flag: 'P' in -P
[ERRO] empty file: -
Traceback (most recent call last):
  File "/home/jubin/tools/conda_env_for_blastmining/bin/blastMining", line 8, in <module>
    sys.exit(main())
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/blastMining/blastMining.py", line 65, in main
    args.func(args)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/blastMining/vote/vote.py", line 85, in main
    tax = pd.read_csv(str('./'+os.path.join(args.outdir, "TMPDIR")+"/TAXONKIT.out"), sep='\t', header=None, dtype=str)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/jubin/tools/conda_env_for_blastmining/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Here's the input file I used:
newforBLASTmining_10_test_zotus_coverage_blastn_maxtarget10_screenresults.txt
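
The usage text printed in the error lists no `-P` flag for `taxonkit reformat`, so the installed TaxonKit rejects the call and writes nothing, and the later `pd.read_csv` on TAXONKIT.out then fails with `EmptyDataError`. An outdated TaxonKit is a plausible cause, though this thread does not confirm it. The sketch below checks a help text for a given shorthand flag; the helper names are hypothetical, and `taxonkit_reformat_help` assumes `taxonkit` is on the PATH, so it is defined but not called here.

```python
import re
import subprocess


def supports_short_flag(help_text, flag):
    """True if the help text lists a shorthand flag such as `-P, --...`."""
    pattern = re.compile(r"^\s*-" + re.escape(flag) + r",\s+--", re.MULTILINE)
    return bool(pattern.search(help_text))


def taxonkit_reformat_help():
    """Return the help text of the locally installed `taxonkit reformat`."""
    result = subprocess.run(
        ["taxonkit", "reformat", "--help"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr


# Three lines copied from the usage text in the error message above.
HELP_EXCERPT = """\
  -p, --miss-rank-repl-prefix string   prefix for estimated taxon level (default "unclassified ")
  -R, --miss-taxid-repl string         replacement string for missing taxid
  -t, --show-lineage-taxids            show corresponding taxids of reformated lineage
"""

has_upper_p = supports_short_flag(HELP_EXCERPT, "P")
has_upper_r = supports_short_flag(HELP_EXCERPT, "R")
print("-P supported:", has_upper_p)
print("-R supported:", has_upper_r)
```

On a machine with TaxonKit installed, `supports_short_flag(taxonkit_reformat_help(), "P")` performs the same check against the real binary; the match is case-sensitive, so `-p` does not count as `-P`.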
