KIT-IBG-5 / mdmcleaner
MDMcleaner: the assessment, classification and refinement tool for microbial dark matter SAGs and MAGs
License: GNU General Public License v3.0
(Also, a selenocysteine tRNA species should not necessarily be expected for a 100% completeness value.)
Hello
after installation of mdmcleaner from the tagged release https://github.com/KIT-IBG-5/mdmcleaner/releases/tag/v0.8.3
and a test run, I had the following error: No such file or directory: 'xxxx/python3.8/site-packages/mdmcleaner-0.8.3-py3.8.egg/mdmcleaner/hmms/cutofftable_combined.tsv'
I was able to restore the file using commit e917a07.
regards
Eric
Sometimes, when the GTDB database is updated, it can happen that the representative genome of a species changes.
If a process was started with one database version and then finished with another, a query species that was present in one db version may not be found anymore in the other...
An example is the presence of "GCF_002879875.1_NZ_PNXZ01000140.1" in gtdb_r95, which was replaced with "GCF_007097545.1" in gtdb_r202.
--> For every workflow/intermediate result, make a note of the exact database version that was used to create it. Return an error if a workflow is being continued with a different db version!
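A minimal sketch of this note-and-check logic (the file name, function names and JSON format are illustrative assumptions, not taken from the mdmcleaner codebase):

```python
import json
import os

DB_VERSION_NOTE = "db_version.json"  # hypothetical note-file name

def stamp_db_version(workdir, db_version):
    """Record the reference-db version used to create this workflow's results."""
    with open(os.path.join(workdir, DB_VERSION_NOTE), "w") as fh:
        json.dump({"db_version": db_version}, fh)

def assert_same_db_version(workdir, current_db_version):
    """Refuse to continue a workflow with a different db version than it was started with."""
    notefile = os.path.join(workdir, DB_VERSION_NOTE)
    if not os.path.exists(notefile):
        # first run in this folder: just record the version
        stamp_db_version(workdir, current_db_version)
        return
    with open(notefile) as fh:
        recorded = json.load(fh)["db_version"]
    if recorded != current_db_version:
        raise RuntimeError(
            "workflow was started with db version '{}' but is being continued "
            "with '{}'".format(recorded, current_db_version))
```

Stamping on first use and checking on every resume keeps stale intermediate results from being silently mixed with a newer database.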
Currently the output tables are a bit convoluted and not user friendly.
Instead of outputting the information "as is", digest it a bit more to make it more readable.
Hi Jon,
When running mdmcleaner on our HPC (via slurm) I get the error:
"UnboundLocalError: local variable 'prodigal_proc' referenced before assignment".
Here is the detailed mdmcleaner.log I get from the system.
The error occurs when I use the following command:
mdmcleaner clean -t 15 -i /work_beegfs/user/02_Results/02_Metagenome/Bins/bins_mdmcleaner/*.fasta
Is this a problem with my installation, or can it be solved some other way?
Thank you,
Avril
Dear developer,
It seems that the URL used for downloading SILVA, "https://www.arb-silva.de/fileadmin/silva_databases/current", present in the "read_gtdb_taxonomy.py" Python script, is no longer accessible. I had to change it in the script to the FTP link: "ftp://arb-silva.de/current".
Now it's working as expected, but I thought it would be useful to share this piece of info with you.
Best,
Ali
Dear mdmcleaner developers,
I experienced a blastp error during mdmcleaner clean, which resulted in a runtime error. Could you help me troubleshoot, please?
I was running mdmcleaner on a compute cluster using a virtual Python (3.11) environment with the full mdmcleaner database. I've attached the log file here.
Meanwhile, I will try the database used in the publication to see whether this error is caused by the database.
Thank you for your help!
Sincerely,
Rui
Dear John,
I have a question about contamination percent calculation. I saw in your NAR article a comparison between bin contamination calculated by checkm vs. that calculated by mdmcleaner, but when I used the mdmcleaner clean command, I couldn't find the contamination % value. May I know where I can find it? Or, if it's not available, how can I calculate it?
I really appreciate your help and cooperation.
Best regards,
Ali
command:
mdmcleaner.py refdb_contams -c mdmcleaner.config test_overview_refdb_ambiguities.tsv -o new_blacklist_additions.tsv -t 8
result:
Traceback (most recent call last):
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 316, in <module>
main()
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 224, in main
configs = config_object(args, read_blacklist = True)
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 94, in __init__
self.blacklist = self.read_blacklistfiles()
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 137, in read_blacklistfiles
with openfile(blacklistfile) as blf:
File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/misc.py", line 26, in openfile
filehandle = open(infilename, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'new_blacklist_additions.tsv'
Hi,
Thanks for developing this tool! After exploring the (many) output files and their columns, I was under the impression that the 'trustworthiness' score (from 0 to 10) mentioned in the NAR paper was in fact the trust_index column in the output file fullcontiginfos_beforecleanup.tsv. But this score is given per contig. Unfortunately, I could not find the aggregated score at the scale of the overall genome (as mentioned in the paper).
Should it be the fraction_evaluate_high or the bin_trust column from the file overview_all_before_cleanup.tsv? Can you please point future and current users to the equivalent to use :)
Best,
cpauvert
PS: Maybe this issue is related to #5 ; )
Only one workflow (makedb) requires wget! For all other workflows, older versions of wget should not be a problem.
- [ ] check which older versions of other dependencies still work, to make installation easier for clusters that are not kept up to date
mdmcleaner.py set_configs --db_basedir /home/sandra/tools/mdmcleaner/gtdb
Traceback (most recent call last):
File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 319, in <module>
main()
File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 260, in main
configs, settings_source = read_configs(configfile_hierarchy, args)
NameError: name 'read_configs' is not defined
There are two duplicated mdmcleaner.py files. Currently the justification for both of them is to be able to build the project correctly with pip install .
and simultaneously use the mdmcleaner.py file as is.
Currently this does not work with only one of those files. The root one is needed to use the project as is, and mdmcleaner/mdmcleaner.py needs to be present for building using pip.
The duplication adds potential for errors.
I think the implemented if condition (line 993) results in unintended behavior.
mdmcleaner/mdmcleaner/read_gtdb_taxonomy.py
Lines 991 to 1000 in ec3b23d
It checks whether targetdir exists and is a directory. If that's true, the function returns (after the access check). If os.path.isdir(targetdir) returns false (the path exists but is not a directory), the else case is executed and the function tries to create a directory at a path that is actually a file.
This should raise an OSError in os.makedirs(targetdir), which is not caught, so the user sees the native error.
At the moment I sadly can't test whether the resulting error message is user-friendly or not, so I'm not totally sure about this.
But if you want, I can look into it.
Maybe use a 98% amino acid identity cut-off?
Proteins that are unique to one species in a genus would still be attributed to that individual species (but only one copy would be kept, in case of multi-copy entries).
"Redundant" proteins, which occur identically in multiple species of a genus, would be attributed to the genus instead of the species (again represented by only one copy).
--> reduces dataset size
--> increases diamond/blast search speeds
--> increases speed of LCA-classifications (a little bit)?
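The species-vs-genus attribution rule above can be sketched in a few lines; this toy version collapses only exactly identical sequences (the proposed 98% identity cut-off would additionally require a clustering tool such as CD-HIT or MMseqs2):

```python
from collections import defaultdict

def collapse_genus_proteins(genus, records):
    """records: iterable of (species, sequence) pairs within one genus.
    Keep one copy per distinct sequence; attribute it to the species if it
    occurs in only one species, otherwise to the genus."""
    species_per_seq = defaultdict(set)
    for species, seq in records:
        species_per_seq[seq].add(species)
    return {seq: (next(iter(sps)) if len(sps) == 1 else genus)
            for seq, sps in species_per_seq.items()}
```

Multi-copy entries within one species collapse automatically, because the mapping is keyed by the sequence itself.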
If the checksum test fails after downloading, make up to n (user argument, by default = 3) attempts to re-download the data and check again.
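A hedged sketch of such a retry loop, using urllib.request as a stand-in for the actual wget-based download and MD5 as the checksum (both are assumptions, not mdmcleaner's actual implementation):

```python
import hashlib
import urllib.request

def download_with_retries(url, destfile, expected_md5, max_attempts=3):
    """Download url to destfile; if the md5 checksum does not match,
    retry up to max_attempts times before giving up.
    Returns the number of the successful attempt."""
    for attempt in range(1, max_attempts + 1):
        urllib.request.urlretrieve(url, destfile)
        with open(destfile, "rb") as fh:
            if hashlib.md5(fh.read()).hexdigest() == expected_md5:
                return attempt
    raise RuntimeError("checksum still wrong after {} attempts: {}".format(
        max_attempts, url))
```

max_attempts maps directly onto the proposed user argument with default 3.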
maybe base nucleotide comparisons on minhash instead of blast searches? Maybe use sourmash for this?
https://sourmash.readthedocs.io/en/latest/
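For illustration, the minhash idea can be reduced to a few lines of stdlib Python. This is a toy bottom sketch only; sourmash uses scaled FracMinHash sketches with canonical k-mer hashing, so treat this purely as a sketch of the concept:

```python
import hashlib

def minhash_sketch(seq, k=21, num_hashes=64):
    """Toy bottom-sketch MinHash of a nucleotide sequence: hash every k-mer
    and keep the num_hashes smallest hash values."""
    hashes = {int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:num_hashes])

def jaccard_estimate(sk1, sk2):
    """Estimate sequence similarity as the Jaccard index of two sketches."""
    return len(sk1 & sk2) / len(sk1 | sk2) if sk1 | sk2 else 0.0
```

Comparing two small fixed-size sketches this way is far cheaper than a blastn search, which is the appeal for nucleotide-level comparisons.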
Generally, calling a command in mdmcleaner with insufficient parameters results in a short helper message.
This isn't the case for makedb. That might confuse new users, who will potentially use this command first.
makedb has two mutually exclusive required arguments. This is enforced by code and not by argparse, so there is no short help at the moment. It should be possible to generate the short help message and write it to the console.
Explaining that the help function (-h) works for every command, as in the README.md, could also help.
Line 55 in ec3b23d
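For reference, argparse can enforce two mutually exclusive required arguments itself, in which case it prints the short usage message automatically on bad input. The flag names below are hypothetical, not makedb's actual options:

```python
import argparse

def build_makedb_parser():
    """Sketch: let argparse enforce the mutual exclusivity, so that calling
    the command without arguments yields a short usage message for free."""
    parser = argparse.ArgumentParser(prog="mdmcleaner makedb")
    group = parser.add_mutually_exclusive_group(required=True)
    # hypothetical flags, for illustration only:
    group.add_argument("-o", "--outdir",
                       help="directory in which to create the new database")
    group.add_argument("-u", "--update",
                       help="existing database directory to update instead")
    return parser
```

With required=True, parse_args() with neither flag (or both) exits with a one-line usage hint instead of a hand-rolled error.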
Hi,
I am testing MDMcleaner (version 0.8.3 installed via conda) on a set of bins, and I receive the following error:
-->evaluating contigs and setting filter-flags
finished 60%. Currently analysing apparent reference database ambiguity
THERE WAS A EXCEPTION WHILE HANDLING ../bins_IOWseq000005_pre_anvio/Co_bin.92.orig.fa
ERROR: evaluation result not accounted for: 'silva_conflict'
Traceback (most recent call last):
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 197, in main
db_suspects = bindata.evaluate_and_flag_all_contigs(db=db, protblasts=protblasts, nucblasts=nucblasts, db_suspects=db_suspects, fast_run=args.fast_run)
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1194, in evaluate_and_flag_all_contigs
is_potential_refdb_contam = step2()
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1164, in step2
if db_suspects.last_checked_evaluations() == "OK":
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 271, in last_checked_evaluations
assert le == "OK", "\nERROR: evaluation result not accounted for: '{}'\n".format(le)
AssertionError:
ERROR: evaluation result not accounted for: 'silva_conflict'
In total, I had 144 bins and this error was thrown for 2 of them. Do you have any suggestions of how to solve this error?
Thanks!
Cheers,
Christiane
Installation with conda:
python=3.11.3
mdmcleaner=0.8.7
I got the following error when running mdmcleaner makedb:
You are running the following MDMcleaner command:
'/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner makedb -o mdmcleaner'
reading settings from configfile: "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.config"
settings:
threads = '1'
db_type = '['gtdb']'
blacklistfile = '['/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/blacklist.list']'
blastn = 'blastn'
blastp = 'blastp'
makeblastdb = 'makeblastdb'
blastdbcmd = 'blastdbcmd'
diamond = 'diamond'
barrnap = 'barrnap'
hmmsearch = 'hmmsearch'
aragorn = 'aragorn'
prodigal = 'prodigal'
checking dependencies...
makeblastdb...2.13.0 --> OK!
diamond...2.1.6 --> OK!
wget...Traceback (most recent call last):
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 230, in main
check_dependencies.check_dependencies("makeblastdb", "diamond", "wget", configs=configs)
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 115, in check_dependencies
check_external_dependency(*toolnames, configs=configs)
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 128, in check_external_dependency
isttool = version_object(get_external_dependency_version_string(tool))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 86, in get_external_dependency_version_string
output = proc.stdout.readline().strip()
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
For some reason it is decoding the wget --version text as ASCII, but in my version of wget the text contains some non-ASCII characters.
I solved this by replacing the following line of get_external_dependency_version_string:
proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True)
with:
proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True, errors="ignore")
The errors parameter is explained here: https://docs.python.org/3/library/io.html#io.TextIOWrapper. I don't think that setting it to "ignore" would cause any unexpected behaviour, but I'm not sure.
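The effect can be reproduced in isolation. The version string below is made up; Popen's errors parameter behaves like the errors argument of bytes.decode:

```python
# Simulated `wget --version` output containing a non-ASCII copyright sign:
raw = "GNU Wget 1.21.3 \u00a9 Free Software Foundation".encode("utf-8")

# Strict ASCII decoding fails exactly like in the traceback above:
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="ignore" the undecodable bytes are silently dropped, which is
# harmless here because only the version number is parsed from the line:
cleaned = raw.decode("ascii", errors="ignore")
```

errors="replace" (substituting U+FFFD) would be an equally safe alternative that at least leaves a visible marker where bytes were dropped.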
Hi @jvollme
Thanks for this tool! I ran into a small bug when running MDMcleaner (v0.8.3 from conda). It failed with the following error message.
reference-database contaminations detected during this run: 2
blasting 0 entries with blastx against reference proteins (another 1 entries were too long to blastx efficiently
Traceback (most recent call last):
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
blacklist_additions = clean.main(args, configs)
File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
if "contamination" in db_suspects.collective_diamondblast():
TypeError: argument of type 'NoneType' is not iterable
By tracking it down, it seems that the culprit line is in review_refdbcontams.py L305. I have no blast jobs to be run (blasting 0 entries), so the default return value of the collective_diamondblast() function is None.
mdmcleaner/mdmcleaner/review_refdbcontams.py
Lines 302 to 305 in ec3b23d
But None is not iterable, so it makes the test in clean.py line 230 fail:
mdmcleaner/mdmcleaner/clean.py
Lines 229 to 233 in ec3b23d
I think that amending the return to return [None] in L305 should do the trick, by keeping an iterable:
foo=None
print('yes' if 'A' in foo else 'no')
# yields TypeError: argument of type 'NoneType' is not iterable
# BUT
foo=[None]
print('yes' if 'A' in foo else 'no')
# yields no correctly
Do you have any test to check that it is not breaking anything?
Let me know if you prefer a pull request or would rather do the fix yourself.
Best,
Charlie
mdmcleaner is NOT designed to rescue terrible bins. It is only meant to validate good-quality bins and to make good bins better!
This needs to be reflected in the results, so people are not encouraged to publish terrible bins (e.g. >20% contamination) after filtering
need better error messages when-
Thank you for your good tool.
Directly providing the latest constructed database might make this better and easier to access. :)
Config settings for external binaries that are not in PATH are often ignored...
improve configs:
implement lock files to prevent two users writing to the same results-folder at the same time.
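One possible sketch of such a lock, using an atomic O_CREAT | O_EXCL open as the primitive. The lock-file name is made up, and note that this simple scheme leaves a stale lock behind if a process is killed mid-run:

```python
import os

class ResultsFolderLock:
    """Minimal advisory lock: atomically create a lock file; a second process
    attempting to lock the same results folder fails immediately."""

    def __init__(self, results_folder):
        # hypothetical lock-file name:
        self.lockfile = os.path.join(results_folder, ".mdmcleaner.lock")

    def __enter__(self):
        try:
            # O_EXCL makes creation fail if the file already exists (atomic)
            fd = os.open(self.lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError(
                "results folder is already in use (lock file exists): "
                + self.lockfile)
        os.write(fd, str(os.getpid()).encode())  # record owner pid for debugging
        os.close(fd)
        return self

    def __exit__(self, *exc):
        os.remove(self.lockfile)
```

Storing the owning PID in the lock file makes stale locks diagnosable: a cleanup step could check whether that PID is still alive before refusing to run.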
If NO contig yields ANY blast results, the following error is thrown:
'bindata' object has no attribute 'consensus_tax'
Traceback (most recent call last):
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/clean.py", line 179, in main
bindata.get_topleveltax(db)
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 964, in get_topleveltax
sys.stderr.write("\tmajority tax-path: {}\n".format("; ".join(self.get_consensus_taxstringlist())))
File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 977, in get_consensus_taxstringlist
if self.consensus_tax != None:
AttributeError: 'bindata' object has no attribute 'consensus_tax'
This basically just means that the bin cannot be classified.
However, this needs to be caught, translated into a consensus-tax of "unknown", and reported as a more informative warning message.
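A hedged sketch of that defensive fix; the class and method names mirror the traceback above, but this is not the actual mdmcleaner code:

```python
class bindata_sketch:
    """Treat a missing consensus_tax attribute the same as 'bin could not be
    classified' instead of crashing with an AttributeError."""

    def get_consensus_taxstringlist(self):
        # getattr with a default avoids AttributeError when no blast hit
        # ever set self.consensus_tax:
        consensus_tax = getattr(self, "consensus_tax", None)
        if consensus_tax is None:
            return ["unknown"]  # informative placeholder instead of a crash
        return consensus_tax
```

The caller can then additionally emit a warning such as "bin could not be classified (no blast hits for any contig)" when the placeholder is returned.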
Divide protein blast databases into smaller subsets (similar to nucleotide dbs).
possibilities:
A positive list should also be included, containing the exact version of the reference DB used, along with all entries that were previously assessed as "not contaminations" based on that database.
This would avoid repeating the same blast analyses with the exact same result over and over again, and would speed things up.
The list may get rather long over time, therefore:
Test how much the RAM usage increases for huge sets of accessions
Test how much slower a direct search in the text files via binary search would be
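The two lookup strategies under consideration can be compared with a small sketch (accession formats are illustrative): an in-memory set gives O(1) average lookups at full RAM cost, while bisect on a sorted list gives O(log n) lookups and maps naturally onto a sorted on-disk text file:

```python
import bisect

def in_whitelist_set(accession, whitelist_set):
    """O(1) average lookup; the whole positive list must be held in RAM."""
    return accession in whitelist_set

def in_whitelist_sorted(accession, sorted_accessions):
    """O(log n) binary-search lookup on a pre-sorted list of accessions."""
    i = bisect.bisect_left(sorted_accessions, accession)
    return i < len(sorted_accessions) and sorted_accessions[i] == accession
```

The sorted-list variant also extends to seeking in the text file itself without loading it, which is what the RAM-vs-speed test above would quantify.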
Hi,
Following issue 29, would you suggest a maximum % of contamination and/or minimum % completeness (i.e. as determined by checkm for example or any other tool) in which cases it would not be worth running MDMcleaner?
For example, I imagine that a MAG with 80% contamination will never get down to 10% after running MDMcleaner, so I would probably skip it.
Best
Greg
Hi,
I have been running MDMcleaner on 2 different bin sets. In one run, not all bins were processed without error (see #37). Using the same reference data, I am now getting the following error for the last bin to be processed:
--> writing to output files
writing detailed contig infos to ./T4-48_bin.49.orig/fullcontiginfos_beforecleanup.tsv
appending overview data to overview_all_before_cleanup.tsv
creating output fastas
creating krona input-table
reference-database contaminations detected during this run: 69
blasting 13 entries with blastx against reference proteins (another 5 entries were too long to blastx efficiently
Traceback (most recent call last):
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
blacklist_additions = clean.main(args, configs)
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
if "contamination" in db_suspects.collective_diamondblast():
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 319, in collective_diamondblast
eval_list.append(self.evaluateornot(self.blastxjobs[x], blastxdone = True))
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 283, in evaluateornot
return_category, return_note = comp.count_contradictions() #todo: redundant. streamline blastcontigs() and countcontradictions() more
File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 123, in count_contradictions
domain_counts_expected = domain_counts[comparison_domain] #todo: only in try_except statement for debugging
KeyError: 'Bacteria'
Any advice would be appreciated. Thanks!
what the title says...
Hi!
I'm trying to download database using mdmcleaner makedb, and I've got an error like this:
01a: download GTDB data--
Now downloading from gtdb: "gtdb_taxfiles" (attempt 1)...
Now downloading from gtdb: "gtdb_fastas" (attempt 1)...
Now downloading from gtdb: "gtdb_vs_ncbi_lookup" (attempt 1)...
Traceback (most recent call last):
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/bin/mdmcleaner", line 10, in
sys.exit(main())
^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
read_gtdb_taxonomy.main(args, configs)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1142, in main
getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1036, in getNprepare_dbdata_nonncbi
progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 714, in _download_dbdata_nonncbi
progressdump["gtdb_download_dict"], progressdump["gtdb_version"] = download_gtdb_stuff(gtdb_source_dict, targetdir, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 316, in download_gtdb_stuff
download_dict = get_download_dict(sourcedict, targetfolder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 300, in get_download_dict
okdownloadfilelist, allisfine = check_gtdbmd5file(which_md5filename(targetfolder), targetfolder, sourcedict[x]["pattern"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 243, in which_md5filename
return glob.glob(targetdir + "/" + MD5FILEPATTERN_GTDB)[0] # --> assumes there is only one hit, therefore takes only the first of the list returned by glob.glob(); todo: make sure md5sum file is always deleted after db-setup! otherwise there may be problems if preexisting dbs are updated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Can you please help me to solve it?
Regards,
Maria
ORF calling on eukaryotes works drastically differently than for bacteria. Instead of using eukaryotic proteins as references, rather run prodigal with prokaryotic and metagenomic settings on reference eukaryotic genomes (not necessary for viral genomes) and use those as references. This is more likely to mimic what happens with eukaryotic contigs during metagenome analysis pipelines.
for this:
add workflows to:
When using "-M" with "get_markers", only contigs larger than the specified size are read into the contig_dict, BUT nonetheless all contigs are still searched for markers (at least for rRNA genes).
--> need to make sure that only the contigs of contig_dict are passed to any process that accepts stdin.
--> otherwise (e.g. for aragorn), silently drop all hits associated with contigs that are not in contig_dict...
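A minimal sketch of the first option: serialize only the contigs that survived the size filter before piping them to an external tool. The function name is hypothetical:

```python
def filtered_fasta_for_stdin(contig_dict):
    """Serialize ONLY the contigs that passed the '-M' size filter, so that
    tools reading fasta from stdin (e.g. aragorn) never even see contigs
    that were excluded from contig_dict."""
    return "".join(">{}\n{}\n".format(name, seq)
                   for name, seq in contig_dict.items())
```

Feeding this string to the subprocess's stdin guarantees that any hits the tool reports can only belong to contigs that are actually in contig_dict, making the post-hoc hit filtering unnecessary.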