KIT-IBG-5 / mdmcleaner
MDMcleaner: the assessment, classification and refinement tool for microbial dark matter SAGs and MAGs
License: GNU General Public License v3.0
(Also, a selenocysteine tRNA species should not necessarily be expected for a 100% completeness value.)
Hello
after installation of mdmcleaner from the tagged release https://github.com/KIT-IBG-5/mdmcleaner/releases/tag/v0.8.3
and a test run, I had the following error: No such file or directory: 'xxxx/python3.8/site-packages/mdmcleaner-0.8.3-py3.8.egg/mdmcleaner/hmms/cutofftable_combined.tsv'
I was able to restore the file using commit e917a07.
regards
Eric
Sometimes, when the GTDB database is updated, it can happen that the representative genome of a species changes.
If a process was started with one database version and then finished with another, a query species that was present in one db version may not be found anymore in the other...
An example is the presence of "GCF_002879875.1_NZ_PNXZ01000140.1" in gtdb_r95, which was replaced with "GCF_007097545.1" in gtdb_r202.
--> For every workflow/intermediate result, make a note of the exact database version that was used to create it. Return an error if a workflow is being continued with a different db version!
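A minimal sketch of this note-and-check logic (the file name, function names and JSON format are illustrative assumptions, not taken from the mdmcleaner codebase):

```python
import json
import os

DB_VERSION_NOTE = "db_version.json"  # hypothetical note-file name

def stamp_db_version(workdir, db_version):
    """Record the reference-db version used to create this workflow's results."""
    with open(os.path.join(workdir, DB_VERSION_NOTE), "w") as fh:
        json.dump({"db_version": db_version}, fh)

def assert_same_db_version(workdir, current_db_version):
    """Refuse to continue a workflow with a different db version than it was started with."""
    notefile = os.path.join(workdir, DB_VERSION_NOTE)
    if not os.path.exists(notefile):
        # first run in this folder: just record the version
        stamp_db_version(workdir, current_db_version)
        return
    with open(notefile) as fh:
        recorded = json.load(fh)["db_version"]
    if recorded != current_db_version:
        raise RuntimeError(
            "workflow was started with db version '{}' but is being continued "
            "with '{}'".format(recorded, current_db_version))
```

Stamping on first use and checking on every resume keeps stale intermediate results from being silently mixed with a newer database.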
Currently the output tables are a bit convoluted and not user friendly.
Instead of outputting the information "as is", digest it a bit more to make it more readable.
Hi Jon,
When running mdmcleaner on our HPC (via slurm) I get the error:
"UnboundLocalError: local variable 'prodigal_proc' referenced before assignment".
Here is the detailed mdmcleaner.log I get from the system.
The error occurs when I use the following command:
mdmcleaner clean -t 15 -i /work_beegfs/user/02_Results/02_Metagenome/Bins/bins_mdmcleaner/*.fasta
Is this a problem with my installation, or can it be solved some other way?
Thank you,
Avril
Dear developer,
It seems that the URL used for downloading SILVA, "https://www.arb-silva.de/fileadmin/silva_databases/current", present in the "read_gtdb_taxonomy.py" Python script, is no longer accessible. I had to change it in the script to the FTP link: "ftp://arb-silva.de/current".
Now it's working as expected, but I thought it would be useful to share this piece of info with you.
Best,
Ali
Dear mdmcleaner developers,
I experienced a blastp error during mdmcleaner clean, which resulted in a runtime error. Could you help me troubleshoot, please?
I was running mdmcleaner on a compute cluster using a virtual Python (3.11) environment with the full mdmcleaner database. I've attached the log file here.
Meanwhile, I will try the database used in the publication to see whether this error is caused by the database.
Thank you for your help!
Sincerely,
Rui
Dear John,
I have a question about contamination percent calculation. I saw in your NAR article a comparison between bin contamination calculated by checkm vs. that calculated by mdmcleaner, but when I used the mdmcleaner clean command, I couldn't find the contamination % value. May I know where I can find it? Or, if it's not available, how can I calculate it?
I really appreciate your help and cooperation.
Best regards,
Ali
command:
mdmcleaner.py refdb_contams -c mdmcleaner.config test_overview_refdb_ambiguities.tsv -o new_blacklist_additions.tsv -t 8
result:
Traceback (most recent call last):
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 316, in <module>
main()
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 224, in main
configs = config_object(args, read_blacklist = True)
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 94, in __init__
self.blacklist = self.read_blacklistfiles()
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 137, in read_blacklistfiles
with openfile(blacklistfile) as blf:
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/misc.py", line 26, in openfile
filehandle = open(infilename, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'new_blacklist_additions.tsv'
Hi,
Thanks for developing this tool! After exploring the (many) output files and their columns, I was under the impression that the 'trustworthiness' score (from 0 to 10) mentioned in the NAR paper was in fact the trust_index column in the output file fullcontiginfos_beforecleanup.tsv. But this score is given per contig. Unfortunately, I could not find the aggregated score at the scale of the overall genome (as mentioned in the paper).
Should it be the fraction_evaluate_high or the bin_trust column from the file overview_all_before_cleanup.tsv? Can you please point future and current users to the equivalent to use :)
Best,
cpauvert
PS: Maybe this issue is related to #5 ; )
Only one workflow (makedb) requires wget! For all other workflows, older versions of wget should not be a problem.
- [ ] check which older versions of other dependencies still work, to make installation easier for clusters that are not kept up to date
mdmcleaner.py set_configs --db_basedir /home/sandra/tools/mdmcleaner/gtdb
Traceback (most recent call last):
File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 319, in <module>
main()
File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 260, in main
configs, settings_source = read_configs(configfile_hierarchy, args)
NameError: name 'read_configs' is not defined
There are two duplicated mdmcleaner.py files. Currently the justification for both of them is to be able to build the project correctly with pip install .
and simultaneously use the mdmcleaner.py file as is.
Currently this does not work with only one of those files. The root one is needed to use the project as is, and mdmcleaner/mdmcleaner.py needs to be present for building using pip.
The duplication adds potential for errors.
I think the implemented if condition (line 993) results in unintended behavior.
mdmcleaner/mdmcleaner/read_gtdb_taxonomy.py
Lines 991 to 1000 in ec3b23d
It checks whether targetdir exists and is a directory. If that's true, the function returns (after the access check). If os.path.isdir(targetdir) returns false (the path exists but is not a directory), the else case is executed and the function tries to create a directory at a path that is actually a file.
This should raise an OSError in os.makedirs(targetdir), which is not caught, so the user sees the native error.
At the moment I sadly can't test whether the resulting error message is user-friendly or not, so I'm not totally sure about this.
But if you want, I can look into it.
Maybe use a 98% amino acid identity cut-off?
Proteins that are unique to one species in a genus would still be attributed to that individual species (but only one copy would be kept, in case of multi-copy entries).
"Redundant" proteins, which occur identically in multiple species of a genus, would be attributed to the genus instead of the species (again represented by only one copy).
--> reduces dataset size
--> increases diamond/blast search speeds
--> increases speed of LCA-classifications (a little bit)?
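The species-vs-genus attribution rule above can be sketched in a few lines; this toy version collapses only exactly identical sequences (the proposed 98% identity cut-off would additionally require a clustering tool such as CD-HIT or MMseqs2):

```python
from collections import defaultdict

def collapse_genus_proteins(genus, records):
    """records: iterable of (species, sequence) pairs within one genus.
    Keep one copy per distinct sequence; attribute it to the species if it
    occurs in only one species, otherwise to the genus."""
    species_per_seq = defaultdict(set)
    for species, seq in records:
        species_per_seq[seq].add(species)
    return {seq: (next(iter(sps)) if len(sps) == 1 else genus)
            for seq, sps in species_per_seq.items()}
```

Multi-copy entries within one species collapse automatically, because the mapping is keyed by the sequence itself.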
If the checksum test fails after downloading, make up to n (user argument, by default = 3) attempts to re-download the data and check again.
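A hedged sketch of such a retry loop, using urllib.request as a stand-in for the actual wget-based download and MD5 as the checksum (both are assumptions, not mdmcleaner's actual implementation):

```python
import hashlib
import urllib.request

def download_with_retries(url, destfile, expected_md5, max_attempts=3):
    """Download url to destfile; if the md5 checksum does not match,
    retry up to max_attempts times before giving up.
    Returns the number of the successful attempt."""
    for attempt in range(1, max_attempts + 1):
        urllib.request.urlretrieve(url, destfile)
        with open(destfile, "rb") as fh:
            if hashlib.md5(fh.read()).hexdigest() == expected_md5:
                return attempt
    raise RuntimeError("checksum still wrong after {} attempts: {}".format(
        max_attempts, url))
```

max_attempts maps directly onto the proposed user argument with default 3.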
maybe base nucleotide comparisons on minhash instead of blast searches? Maybe use sourmash for this?
https://sourmash.readthedocs.io/en/latest/
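For illustration, the minhash idea can be reduced to a few lines of stdlib Python. This is a toy bottom sketch only; sourmash uses scaled FracMinHash sketches with canonical k-mer hashing, so treat this purely as a sketch of the concept:

```python
import hashlib

def minhash_sketch(seq, k=21, num_hashes=64):
    """Toy bottom-sketch MinHash of a nucleotide sequence: hash every k-mer
    and keep the num_hashes smallest hash values."""
    hashes = {int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:num_hashes])

def jaccard_estimate(sk1, sk2):
    """Estimate sequence similarity as the Jaccard index of two sketches."""
    return len(sk1 & sk2) / len(sk1 | sk2) if sk1 | sk2 else 0.0
```

Comparing two small fixed-size sketches this way is far cheaper than a blastn search, which is the appeal for nucleotide-level comparisons.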
Generally, calling a command in mdmcleaner with insufficient parameters results in a short helper message.
This isn't the case for makedb. That might confuse new users, who will potentially use this command first.
makedb has two mutually exclusive required arguments. This is enforced by code and not by argparse, so there is no short help at the moment. It should be possible to generate the short help message and write it to the console.
Explaining that the help function (-h) works for every command, as in the README.md, could also help.
Line 55 in ec3b23d
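For reference, argparse can enforce two mutually exclusive required arguments itself, in which case it prints the short usage message automatically on bad input. The flag names below are hypothetical, not makedb's actual options:

```python
import argparse

def build_makedb_parser():
    """Sketch: let argparse enforce the mutual exclusivity, so that calling
    the command without arguments yields a short usage message for free."""
    parser = argparse.ArgumentParser(prog="mdmcleaner makedb")
    group = parser.add_mutually_exclusive_group(required=True)
    # hypothetical flags, for illustration only:
    group.add_argument("-o", "--outdir",
                       help="directory in which to create the new database")
    group.add_argument("-u", "--update",
                       help="existing database directory to update instead")
    return parser
```

With required=True, parse_args() with neither flag (or both) exits with a one-line usage hint instead of a hand-rolled error.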
Hi,
I am testing MDMcleaner (version 0.8.3 installed via conda) on a set of bins, and I receive the following error:
-->evaluating contigs and setting filter-flags
finished 60%. Currently analysing apparent reference database ambiguity
THERE WAS A EXCEPTION WHILE HANDLING ../bins_IOWseq000005_pre_anvio/Co_bin.92.orig.fa
ERROR: evaluation result not accounted for: 'silva_conflict'
Traceback (most recent call last):
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 197, in main
db_suspects = bindata.evaluate_and_flag_all_contigs(db=db, protblasts=protblasts, nucblasts=nucblasts, db_suspects=db_suspects, fast_run=args.fast_run)
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1194, in evaluate_and_flag_all_contigs
is_potential_refdb_contam = step2()
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1164, in step2
if db_suspects.last_checked_evaluations() == "OK":
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 271, in last_checked_evaluations
assert le == "OK", "\nERROR: evaluation result not accounted for: '{}'\n".format(le)
AssertionError:
ERROR: evaluation result not accounted for: 'silva_conflict'
In total, I had 144 bins and this error was thrown for 2 of them. Do you have any suggestions of how to solve this error?
Thanks!
Cheers,
Christiane
Installation with conda:
python=3.11.3
mdmcleaner=0.8.7
I got the following error when running mdmcleaner makedb:
You are running the following MDMcleaner command:
'/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner makedb -o mdmcleaner'
reading settings from configfile: "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.config"
settings:
threads = '1'
db_type = '['gtdb']'
blacklistfile = '['/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/blacklist.list']'
blastn = 'blastn'
blastp = 'blastp'
makeblastdb = 'makeblastdb'
blastdbcmd = 'blastdbcmd'
diamond = 'diamond'
barrnap = 'barrnap'
hmmsearch = 'hmmsearch'
aragorn = 'aragorn'
prodigal = 'prodigal'
checking dependencies...
makeblastdb...2.13.0 --> OK!
diamond...2.1.6 --> OK!
wget...Traceback (most recent call last):
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 230, in main
check_dependencies.check_dependencies("makeblastdb", "diamond", "wget", configs=configs)
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 115, in check_dependencies
check_external_dependency(*toolnames, configs=configs)
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 128, in check_external_dependency
isttool = version_object(get_external_dependency_version_string(tool))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 86, in get_external_dependency_version_string
output = proc.stdout.readline().strip()
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
For some reason it is decoding the wget --version text as ASCII, but in my version of wget the text contains some non-ASCII characters.
I solved this by replacing the following line of get_external_dependency_version_string:
proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True)
with:
proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True, errors="ignore")
The errors parameter is explained here: https://docs.python.org/3/library/io.html#io.TextIOWrapper. I don't think that setting it to "ignore" would cause any unexpected behaviour, but I'm not sure.
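The effect can be reproduced in isolation. The version string below is made up; Popen's errors parameter behaves like the errors argument of bytes.decode:

```python
# Simulated `wget --version` output containing a non-ASCII copyright sign:
raw = "GNU Wget 1.21.3 \u00a9 Free Software Foundation".encode("utf-8")

# Strict ASCII decoding fails exactly like in the traceback above:
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="ignore" the undecodable bytes are silently dropped, which is
# harmless here because only the version number is parsed from the line:
cleaned = raw.decode("ascii", errors="ignore")
```

errors="replace" (substituting U+FFFD) would be an equally safe alternative that at least leaves a visible marker where bytes were dropped.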
Hi @jvollme
Thanks for this tool! I ran into a small bug when running MDMcleaner (v0.8.3 from conda). It failed with the following error message.
reference-database contaminations detected during this run: 2
blasting 0 entries with blastx against reference proteins (another 1 entries were too long to blastx efficiently
Traceback (most recent call last):
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
blacklist_additions = clean.main(args, configs)
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
if "contamination" in db_suspects.collective_diamondblast():
TypeError: argument of type 'NoneType' is not iterable
By tracking it down, it seems that the culprit line is in review_refdbcontams.py L305. I have no blast jobs to be run (blasting 0 entries), so the default return value of the collective_diamondblast() function is None.
mdmcleaner/mdmcleaner/review_refdbcontams.py
Lines 302 to 305 in ec3b23d
But None is not iterable, so it makes the test in clean.py line 230 fail:
mdmcleaner/mdmcleaner/clean.py
Lines 229 to 233 in ec3b23d
I think that amending the return to return [None] in L305 should do the trick, by keeping an iterable:
foo=None
print('yes' if 'A' in foo else 'no')
# yields TypeError: argument of type 'NoneType' is not iterable
# BUT
foo=[None]
print('yes' if 'A' in foo else 'no')
# yields no correctly
Do you have any test to check that it is not breaking anything?
Let me know if you prefer a pull request or would rather do the fix yourself.
Best,
Charlie
mdmcleaner is NOT designed to rescue terrible bins. It is only meant to validate good-quality bins and to make good bins better!
This needs to be reflected in the results, so people are not encouraged to publish terrible bins (e.g. >20% contamination) after filtering
need better error messages when-
Thank you for your good tool.
Directly providing the latest constructed database might make this better and easier to access. :)
Config settings for external binaries that are not in PATH are often ignored...
improve configs:
implement lock files to prevent two users writing to the same results-folder at the same time.
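One possible sketch of such a lock, using an atomic O_CREAT | O_EXCL open as the primitive. The lock-file name is made up, and note that this simple scheme leaves a stale lock behind if a process is killed mid-run:

```python
import os

class ResultsFolderLock:
    """Minimal advisory lock: atomically create a lock file; a second process
    attempting to lock the same results folder fails immediately."""

    def __init__(self, results_folder):
        # hypothetical lock-file name:
        self.lockfile = os.path.join(results_folder, ".mdmcleaner.lock")

    def __enter__(self):
        try:
            # O_EXCL makes creation fail if the file already exists (atomic)
            fd = os.open(self.lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError(
                "results folder is already in use (lock file exists): "
                + self.lockfile)
        os.write(fd, str(os.getpid()).encode())  # record owner pid for debugging
        os.close(fd)
        return self

    def __exit__(self, *exc):
        os.remove(self.lockfile)
```

Storing the owning PID in the lock file makes stale locks diagnosable: a cleanup step could check whether that PID is still alive before refusing to run.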
If NO contig yields ANY blast results, the following error is thrown:
'bindata' object has no attribute 'consensus_tax'
Traceback (most recent call last):
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/clean.py", line 179, in main
bindata.get_topleveltax(db)
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 964, in get_topleveltax
sys.stderr.write("\tmajority tax-path: {}\n".format("; ".join(self.get_consensus_taxstringlist())))
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 977, in get_consensus_taxstringlist
if self.consensus_tax != None:
AttributeError: 'bindata' object has no attribute 'consensus_tax'
This basically just means that the bin cannot be classified.
However, this needs to be caught, translated into a consensus-tax of "unknown", and reported as a more informative warning message.
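A hedged sketch of that defensive fix; the class and method names mirror the traceback above, but this is not the actual mdmcleaner code:

```python
class bindata_sketch:
    """Treat a missing consensus_tax attribute the same as 'bin could not be
    classified' instead of crashing with an AttributeError."""

    def get_consensus_taxstringlist(self):
        # getattr with a default avoids AttributeError when no blast hit
        # ever set self.consensus_tax:
        consensus_tax = getattr(self, "consensus_tax", None)
        if consensus_tax is None:
            return ["unknown"]  # informative placeholder instead of a crash
        return consensus_tax
```

The caller can then additionally emit a warning such as "bin could not be classified (no blast hits for any contig)" when the placeholder is returned.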
Divide protein blast databases into smaller subsets (similar to nucleotide dbs).
possibilities:
A positive list should also be included, containing the exact version of the reference DB used, along with all entries that were previously assessed as "not contaminations" based on that database.
This would avoid repeating the same blast analyses with the exact same result over and over again, and would speed things up.
The list may get rather long over time, therefore:
Test how much the RAM usage increases for huge sets of accessions
Test how much slower a direct search in the text files via binary search would be
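The two lookup strategies under consideration can be compared with a small sketch (accession formats are illustrative): an in-memory set gives O(1) average lookups at full RAM cost, while bisect on a sorted list gives O(log n) lookups and maps naturally onto a sorted on-disk text file:

```python
import bisect

def in_whitelist_set(accession, whitelist_set):
    """O(1) average lookup; the whole positive list must be held in RAM."""
    return accession in whitelist_set

def in_whitelist_sorted(accession, sorted_accessions):
    """O(log n) binary-search lookup on a pre-sorted list of accessions."""
    i = bisect.bisect_left(sorted_accessions, accession)
    return i < len(sorted_accessions) and sorted_accessions[i] == accession
```

The sorted-list variant also extends to seeking in the text file itself without loading it, which is what the RAM-vs-speed test above would quantify.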
Hi,
Following issue 29, would you suggest a maximum % of contamination and/or minimum % completeness (i.e. as determined by checkm for example or any other tool) in which cases it would not be worth running MDMcleaner?
For example, I imagine that a MAG with 80% contamination will never get down to 10% after running MDMcleaner, so I would probably skip it.
Best
Greg
Hi,
I have been running MDMcleaner on 2 different bin sets. In one run, not all bins were processed without error (see #37). Using the same reference data, I am now getting the following error for the last bin to be processed:
--> writing to output files
writing detailed contig infos to ./T4-48_bin.49.orig/fullcontiginfos_beforecleanup.tsv
appending overview data to overview_all_before_cleanup.tsv
creating output fastas
creating krona input-table
reference-database contaminations detected during this run: 69
blasting 13 entries with blastx against reference proteins (another 5 entries were too long to blastx efficiently
Traceback (most recent call last):
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
blacklist_additions = clean.main(args, configs)
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
if "contamination" in db_suspects.collective_diamondblast():
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 319, in collective_diamondblast
eval_list.append(self.evaluateornot(self.blastxjobs[x], blastxdone = True))
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 283, in evaluateornot
return_category, return_note = comp.count_contradictions() #todo: redundant. streamline blastcontigs() and countcontradictions() more
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 123, in count_contradictions
domain_counts_expected = domain_counts[comparison_domain] #todo: only in try_except statement for debugging
KeyError: 'Bacteria'
Any advice would be appreciated. Thanks!
what the title says...
Hi!
I'm trying to download database using mdmcleaner makedb, and I've got an error like this:
01a: download GTDB data--
Now downloading from gtdb: "gtdb_taxfiles" (attempt 1)...
Now downloading from gtdb: "gtdb_fastas" (attempt 1)...
Now downloading from gtdb: "gtdb_vs_ncbi_lookup" (attempt 1)...
Traceback (most recent call last):
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/bin/mdmcleaner", line 10, in
sys.exit(main())
^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
read_gtdb_taxonomy.main(args, configs)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1142, in main
getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1036, in getNprepare_dbdata_nonncbi
progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 714, in _download_dbdata_nonncbi
progressdump["gtdb_download_dict"], progressdump["gtdb_version"] = download_gtdb_stuff(gtdb_source_dict, targetdir, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 316, in download_gtdb_stuff
download_dict = get_download_dict(sourcedict, targetfolder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 300, in get_download_dict
okdownloadfilelist, allisfine = check_gtdbmd5file(which_md5filename(targetfolder), targetfolder, sourcedict[x]["pattern"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 243, in which_md5filename
return glob.glob(targetdir + "/" + MD5FILEPATTERN_GTDB)[0] # --> assumes there is only one hit, therefore takes only the first of the list returned by glob.glob(); todo: make sure md5sum file is always deleted after db-setup! otherwise there may be problems if preexisting dbs are updated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Can you please help me to solve it?
Regards,
Maria
ORF calling on eukaryotes works drastically differently than for bacteria. Instead of using eukaryotic proteins as references, rather run prodigal with prokaryotic and metagenomic settings on reference eukaryotic genomes (not necessary for viral genomes) and use those as references. This is more likely to mimic what happens with eukaryotic contigs during metagenome analysis pipelines.
for this:
add workflows to:
When using "-M" with "get_markers", only contigs larger than the specified size are read into the contig_dict, BUT nonetheless all contigs are still searched for markers (at least for rRNA genes).
--> need to make sure that only the contigs of contig_dict are passed to any process that accepts stdin.
--> otherwise (e.g. for aragorn), silently drop all hits associated with contigs that are not in contig_dict...
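A minimal sketch of the first option: serialize only the contigs that survived the size filter before piping them to an external tool. The function name is hypothetical:

```python
def filtered_fasta_for_stdin(contig_dict):
    """Serialize ONLY the contigs that passed the '-M' size filter, so that
    tools reading fasta from stdin (e.g. aragorn) never even see contigs
    that were excluded from contig_dict."""
    return "".join(">{}\n{}\n".format(name, seq)
                   for name, seq in contig_dict.items())
```

Feeding this string to the subprocess's stdin guarantees that any hits the tool reports can only belong to contigs that are actually in contig_dict, making the post-hoc hit filtering unnecessary.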