kit-ibg-5 / mdmcleaner

MDMcleaner the assessment, classification and refinement tool for microbial dark matter SAGs and MAGs

License: GNU General Public License v3.0

Python 100.00%
bioinformatics genomics mag mags metagenomics sag sags single-cell

mdmcleaner's People

Contributors

did10, jvollme

mdmcleaner's Issues

When workflows are aborted and resumed later, check that the database used is still the same

Sometimes, when the GTDB database is updated, the representative genome of a species changes.
If a process was started with one database and finished with another, a query species that was present in one database version may no longer be found in the other.

An example is "GCF_002879875.1_NZ_PNXZ01000140.1" in gtdb_r95, which was replaced with "GCF_007097545.1" in gtdb_r202.

--> For every workflow/intermediate result, record the exact database version that was used to create it. Return an error if a workflow is continued with a different database version!
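
A minimal sketch of the version check proposed above, assuming each result folder gets a small JSON stamp file. The function name, the stamp file name and the exception class are hypothetical and not part of MDMcleaner:

```python
import json
from pathlib import Path


class DatabaseVersionMismatch(RuntimeError):
    """Raised when a run is resumed with a different reference database."""


def check_or_record_db_version(result_dir, db_version, stamp_name="db_version.json"):
    """Record the database version on first use, verify it on every later use.

    'stamp_name' is a hypothetical file name chosen for this sketch.
    """
    stamp = Path(result_dir) / stamp_name
    if stamp.exists():
        recorded = json.loads(stamp.read_text())["db_version"]
        if recorded != db_version:
            raise DatabaseVersionMismatch(
                "results in '{}' were created with '{}', but the current "
                "database is '{}'".format(result_dir, recorded, db_version)
            )
        return recorded
    # First run in this folder: remember which database was used.
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(json.dumps({"db_version": db_version}))
    return db_version
```

Calling this once at the start of every workflow step would turn a silent database switch into an explicit error.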

simplify report tables

Currently the output tables are a bit convoluted and not user-friendly.
Instead of outputting the information "as is", digest it a bit more so that it is easier to read.

UnboundLocalError 'prodigal_proc'

Hi Jon,
When running mdmcleaner on our HPC (via slurm) I get the error:
"UnboundLocalError: local variable 'prodigal_proc' referenced before assignment".
Here is the detailed mdmcleaner.log I get from the system.

The error occurs when I use the following command:
mdmcleaner clean -t 15 -i /work_beegfs/user/02_Results/02_Metagenome/Bins/bins_mdmcleaner/*.fasta

Is this a problem with my installation, or can it be solved some other way?
Thank you,
Avril

add workflow for evaluating potential refDB-contaminations

  • adjust nucleotide database creation to use "parse-seqids"
  • extract best hit and best contradiction from nucleotide reference DB (assuming it is not eukaryotic)
  • blast both against the nucleotide reference DB
  • for each, examine the hits with at least 50% coverage (of the shorter of subject or query, respectively) and 90% identity
  • for blastx: require 90% coverage of the subject and 90% identity
  • count phylum and domain distribution among filtered hits
  • if more contradicting domains/phyla than expected domain/phylum --> assume contig is a contamination
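
The filtering and counting steps above could be sketched roughly as follows. This is only an illustration of the nucleotide thresholds listed (50% coverage of the shorter sequence, 90% identity); the `Hit` record and both function names are hypothetical and do not reflect MDMcleaner's internal data structures:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical, simplified blast hit record used only for this sketch."""
    identity: float    # percent identity of the alignment
    aln_len: int       # alignment length
    query_len: int
    subject_len: int
    domain: str
    phylum: str


def passes_nucleotide_filter(hit, min_cov=50.0, min_ident=90.0):
    # Coverage is measured against the shorter of query and subject.
    shorter = min(hit.query_len, hit.subject_len)
    coverage = 100.0 * hit.aln_len / shorter
    return coverage >= min_cov and hit.identity >= min_ident


def looks_like_contamination(hits, expected_domain, expected_phylum):
    kept = [h for h in hits if passes_nucleotide_filter(h)]
    domains = Counter(h.domain for h in kept)
    phyla = Counter(h.phylum for h in kept)
    # Counter returns 0 for taxa that never occur among the kept hits.
    contradicting_domains = sum(domains.values()) - domains[expected_domain]
    contradicting_phyla = sum(phyla.values()) - phyla[expected_phylum]
    return (contradicting_domains > domains[expected_domain]
            or contradicting_phyla > phyla[expected_phylum])
```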

inconsistency in cleaner logs

The log message for the tRNA scan prints the wrong number of universally required tRNA types.
The automated code says there are 21 tRNAs; the handwritten message says there are 20.

image

Just a small inconsistency in logger output.

silva url

Dear developer,

It seems that the URL used for downloading SILVA, "https://www.arb-silva.de/fileadmin/silva_databases/current", present in the "read_gtdb_taxonomy.py" Python script, is no longer accessible. I had to change it in the script to the FTP link: "ftp://arb-silva.de/current".
Now, it's working as expected, but I thought it's useful to share this piece of info with you.

Best,

Ali

An error occured during blastp run with query '-'

Dear mdmcleaner developers,

I encountered a blastp error during mdmcleaner clean, which resulted in a runtime error. Could you help me troubleshoot it, please?

I was running mdmcleaner on a compute cluster using a Python (3.11) virtual environment with the full mdmcleaner database. I've attached the log file here.

Meanwhile, I will try the database used in the publication to see if this error is caused by the database.

Thank you for your help!

mdmcleaner_out.txt

Sincerely,
Rui

Contamination percent calculation

Dear John,

I have a question about contamination percent calculation. I saw in your NAR article a comparison between bin contamination calculated by checkm vs that calculated by mdmcleaner, but when I used mdmcleaner clean command, I couldn't find the contamination % value. May I know where I can find it? Or if it's not available, how I can calculate it?

I really appreciate your help and cooperation.

Best regards,

Ali

"refdb_contams" workflow exits with error if outputblacklist does not already exist

command:

mdmcleaner.py refdb_contams -c mdmcleaner.config test_overview_refdb_ambiguities.tsv -o new_blacklist_additions.tsv -t 8 

result:

Traceback (most recent call last):
  File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 316, in <module>
    main()
  File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 224, in main
    configs = config_object(args, read_blacklist = True)
  File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 94, in __init__
    self.blacklist = self.read_blacklistfiles()
  File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/mdmcleaner.py", line 137, in read_blacklistfiles
    with openfile(blacklistfile) as blf:
  File "/home/ww5070/temp_binsaga/sagabin_refiner/lib/misc.py", line 26, in openfile
    filehandle = open(infilename, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'new_blacklist_additions.tsv'
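
One possible way to avoid this traceback, assuming a blacklist file that does not exist yet can simply be treated as empty. The function name and the assumed file layout (one accession in the first whitespace-separated column, "#" for comments) are hypothetical:

```python
import os


def read_blacklist_if_present(blacklistfile):
    """Return the set of blacklisted accessions, or an empty set if the
    file has not been created yet (hypothetical helper for illustration)."""
    entries = set()
    if not os.path.exists(blacklistfile):
        return entries
    with open(blacklistfile) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            entries.add(line.split()[0])
    return entries
```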

Extracting the 'trustworthiness' score for an overall genome classification from the output files

Hi,
Thanks for developing this tool! After exploring the (many) output files and their columns, I was under the impression that the 'trustworthiness' score (from 0 to 10) mentioned in the NAR paper was in fact the column trust_index in the output file fullcontiginfos_beforecleanup.tsv. But this score is currently for each contig. Unfortunately, I could not find the aggregated score at the scale of the overall genome (as mentioned in the paper).

Should it be the fraction_evaluate_high or the bin_trust columns from the file overview_all_before_cleanup.tsv ? Can you please point future and current users to the equivalent to use :)

Best,
cpauvert
PS: Maybe this issue is related to #5 ; )

adjust dependencies

Only one workflow (makedb) requires wget! For all others, older versions of wget should not be a problem.

  • remove requirement of newest wget version for all other workflows
  • check which older versions of other dependencies still work, to make installation easier on clusters that are not up to date

failure to make reference database

Thanks a lot for your contribution to MDMcleaner.

I ran into an issue with "mdmcleaner mkdb": the error is "no MD5SUM"; please see the following screenshot.
image

Could you please help me figure it out?

Thanks,
Peter

problem with set_configs

mdmcleaner.py set_configs --db_basedir /home/sandra/tools/mdmcleaner/gtdb
Traceback (most recent call last):
  File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 319, in <module>
    main()
  File "/home/sandra/tools/mdmcleaner/mdmcleaner/mdmcleaner.py", line 260, in main
    configs, settings_source = read_configs(configfile_hierarchy, args)
NameError: name 'read_configs' is not defined

Remove duplicated mdmcleaner.py

There are two duplicated mdmcleaner.py files. Currently the justification for both of them is to be able to build the project correctly with pip install . and simultaneously use the mdmcleaner.py file as is.
Currently this does not work with only one of those files. The root one is needed to use the project as is, and mdmcleaner/mdmcleaner.py needs to be present for building using pip.
The duplication adds potential for errors.

Potential unintended behavior while directory checking and creating

I think the implemented if condition ( line 993 ) results in unintended behavior.

def test_or_create_targetdir(targetdir):
    try:
        if os.path.exists(targetdir) and os.path.isdir(targetdir):
            if not os.access(targetdir, os.W_OK):
                raise PermissionError
        else:
            os.makedirs(targetdir)
    except PermissionError as e:
        # ~ sys.stderr.write("\nERROR: insufficient write permissions for '{}'! Please choose a different target directory and try again\n".format(targetdir))
        raise PermissionError("\n\nERROR: insufficient write permissions for '{}'! Please choose a different target directory and try again\n".format(targetdir))

It checks whether targetdir exists and is a directory. If so, the function returns (after the access check). If os.path.isdir(targetdir) returns False (the path exists but is not a directory), the else branch is executed and the function tries to create a directory at a path that is already a file.

This should raise an OSError in os.makedirs(targetdir), which is not caught, so the user sees the native error.
At the moment I sadly can't test whether the resulting error message is user-friendly or not, so I'm not totally sure about this.

But if you want I can look into it.
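
A possible restructuring, sketched under the assumption that a path which exists but is not a directory should be reported explicitly. The function name `ensure_writable_dir` is hypothetical, and this is not the project's actual fix:

```python
import os


def ensure_writable_dir(targetdir):
    """Create targetdir if needed and fail with a clear message otherwise."""
    if os.path.exists(targetdir):
        # Separate the "is a file" case from the "not writable" case.
        if not os.path.isdir(targetdir):
            raise NotADirectoryError(
                "ERROR: '{}' exists but is not a directory! Please choose a "
                "different target directory and try again".format(targetdir)
            )
        if not os.access(targetdir, os.W_OK):
            raise PermissionError(
                "ERROR: insufficient write permissions for '{}'! Please choose "
                "a different target directory and try again".format(targetdir)
            )
    else:
        os.makedirs(targetdir)
```

Checking os.path.exists and os.path.isdir in separate steps keeps the three outcomes (missing, wrong type, not writable) apart, so each one can get its own message.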

dereplicate proteins on genus level to reduce database size

Maybe use a 98% amino acid identity cut-off?
proteins that are unique for one species in a genus would still be attributed to that individual species (but only one copy would be kept, in case of multi-copy entries)
"redundant" proteins, that occur identically in multiple species of a genus would be attributed to the genus instead of the species (again only represented by one copy)
--> reduces dataset size
--> increases diamond/blast search speeds
--> increases speed of LCA-classifications (a little bit)?

enhancement add short help for refdb

Generally, calling a command in mdmcleaner with insufficient parameters results in a short help message:
image

image

This isn't the case for makedb. That might confuse new users, who are likely to use this command first.
image
makedb has two mutually exclusive required arguments. This is enforced in code and not by argparse, so there is no short help at the moment. It should be possible to generate the short help message and write it to the console.

Explaining that the help function (-h) works for every command, as in the README.md, could also help:

A list of mdmcleaner commands is returned when invoking the help function of MDMcleaner as follows: ```mdmcleaner -h```. Each command has its own help function that can be invoked with ```mdmcleaner <COMMAND> -h```. The available commands are:

The current -h output of mdmcleaner does not mention this use of -h for every command.
image
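
For illustration, argparse can enforce "exactly one of two arguments" itself and then prints the short usage message automatically. The sketch below uses a stand-in program name and two hypothetical flags (`--outdir`, `--use-published`); they are not claimed to be the real makedb options:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="example_tool")
    sub = parser.add_subparsers(dest="command", required=True)

    makedb = sub.add_parser("makedb", help="create the reference database")
    # A required mutually exclusive group: argparse exits with a short
    # usage message if neither or both of these options are given.
    group = makedb.add_mutually_exclusive_group(required=True)
    group.add_argument("--outdir", help="hypothetical: target directory")
    group.add_argument("--use-published", action="store_true",
                       help="hypothetical: fetch a prebuilt database")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```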

ERROR: evaluation result not accounted for: 'silva_conflict'

Hi,

I am testing MDMcleaner (version 0.8.3 installed via conda) on a set of bins, and I receive the following error:

-->evaluating contigs and setting filter-flags
         finished 60%. Currently analysing apparent reference database ambiguity
THERE WAS A EXCEPTION WHILE HANDLING ../bins_IOWseq000005_pre_anvio/Co_bin.92.orig.fa


ERROR: evaluation result not accounted for: 'silva_conflict'

Traceback (most recent call last):
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 197, in main
    db_suspects = bindata.evaluate_and_flag_all_contigs(db=db, protblasts=protblasts, nucblasts=nucblasts, db_suspects=db_suspects, fast_run=args.fast_run)
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1194, in evaluate_and_flag_all_contigs
    is_potential_refdb_contam = step2()
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 1164, in step2
    if db_suspects.last_checked_evaluations() == "OK":
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 271, in last_checked_evaluations
    assert le == "OK", "\nERROR: evaluation result not accounted for: '{}'\n".format(le)
AssertionError:
ERROR: evaluation result not accounted for: 'silva_conflict'

In total, I had 144 bins and this error was thrown for 2 of them. Do you have any suggestions of how to solve this error?

Thanks!

Cheers,
Christiane

Error when checking dependencies in mdmcleaner makedb

Installation with conda:
python=3.11.3
mdmcleaner=0.8.7

I got the following error when running mdmcleaner makedb:

You are running the following MDMcleaner command:                                                                                
         '/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner makedb -o mdmcleaner'                                                                                                                                                             
reading settings from configfile: "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.config"                                                                                                              
                                                                                                                                                                                                                                                                  
        settings:                                                                                                                                                                                                                                                 
                threads = '1'                                                                                                                                                                                                                                     
                db_type = '['gtdb']'                                                                                                                                                                                                                              
                blacklistfile = '['/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/blacklist.list']'                                                                                                               
                blastn = 'blastn'                                                                                                
                blastp = 'blastp'                                                                                                
                makeblastdb = 'makeblastdb'                                                                                                                                                                                                                       
                blastdbcmd = 'blastdbcmd'                                                                                                                                                                                                                         
                diamond = 'diamond'                                                                                                                                                                                                                               
                barrnap = 'barrnap'                                                                                              
                hmmsearch = 'hmmsearch'                                                                                          
                aragorn = 'aragorn'                                                                                              
                prodigal = 'prodigal'                                                                                                                                                                                                                             
                                                                                                                                 
                                                                                                                                 
        checking dependencies...                                                                                                 
                makeblastdb...2.13.0 --> OK!                                                                                     
                diamond...2.1.6 --> OK!                                                                                                                                                                                                                           
                wget...Traceback (most recent call last):                                                                        
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>                                                                                                                                                              
    sys.exit(main())                                                                                                             
             ^^^^^^                                                                                                                                                                                                                                               
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 230, in main
    check_dependencies.check_dependencies("makeblastdb", "diamond", "wget", configs=configs)                                     
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 115, in check_dependencies                                                                                                    
    check_external_dependency(*toolnames, configs=configs)                                                                       
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 128, in check_external_dependency                                                                                             
    isttool = version_object(get_external_dependency_version_string(tool))                                                       
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                         
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/check_dependencies.py", line 86, in get_external_dependency_version_string
    output = proc.stdout.readline().strip()                                                                                      
             ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                               
  File "/mnt/extra/home/danielpalma/mambaforge/envs/mdmcleaner/lib/python3.11/encodings/ascii.py", line 26, in decode            
    return codecs.ascii_decode(input, self.errors)[0]

For some reason, the wget --version output is decoded as ASCII, but in my version of wget the text contains some non-ASCII characters.

I solved this by replacing the following line of get_external_dependency_version_string:

proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True)

with:

proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text = True, universal_newlines=True, errors="ignore")

The errors parameter is explained here https://docs.python.org/3/library/io.html#io.TextIOWrapper. I don't think that setting it to ignore would cause any unexpected behaviour but I'm not sure.
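
As a variation on the workaround above, the decoding can also be made explicit so that it no longer depends on the locale. This is only a sketch: the helper names are hypothetical, and errors="replace" is used here instead of "ignore" so that undecodable bytes remain visible as replacement characters:

```python
import re
import subprocess


def first_output_line(cmd):
    """Run cmd and return the first line of its combined stdout/stderr."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding="utf-8",    # explicit, independent of the system locale
        errors="replace",    # undecodable bytes become U+FFFD instead of raising
    )
    lines = proc.stdout.splitlines()
    return lines[0].strip() if lines else ""


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None
```

For example, `extract_version(first_output_line(["wget", "--version"]))` would then return a version string even if the banner contains non-ASCII characters.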

TypeError in clean.py prevents MDMcleaner from exiting gracefully when no blast jobs need to be run

Hi @jvollme

Thanks for this tool! I ran into a small bug when running MDMcleaner (v0.8.3 from conda). It failed with the following error message.

reference-database contaminations detected during this run: 2
blasting 0 entries with blastx against reference proteins (another 1 entries were too long to blastx efficiently
Traceback (most recent call last):
  File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/bin/mdmcleaner", line 10, in <module>
    sys.exit(main())
  File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
    blacklist_additions = clean.main(args, configs)
  File "/home/cpauvert/projects/.snakemake/conda/a0a4db55baa73e1a514e9cd116469487/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
    if "contamination" in db_suspects.collective_diamondblast():
TypeError: argument of type 'NoneType' is not iterable

After tracking it down, it seems that the culprit line is in review_refdbcontams.py L305. I have no blast jobs to run (blasting 0 entries), so the collective_diamondblast() function returns None by default.

blastrecords = [self.blastxjobs[x].seqrecord[0] for x in self.blastxjobs if len(self.blastxjobs[x].seqrecord[0]) < 100000] #for now skipping reference contigs larger than 100 kb (takes too long to blastx). TODO: in such cases, search for ribosomal & other markergenes to verify classification!
sys.stderr.write("\nblasting {} entries with blastx against reference proteins (another {} entries were too long to blastx efficiently\n".format(len(blastrecords), len(self.blastxjobs) - len(blastrecords)))
if len(blastrecords) == 0:
    return

But None is not iterable so it makes the test in clean.py line 230 fail:

if not args.fast_run and db_suspects != None:
    if "contamination" in db_suspects.collective_diamondblast():
        sys.stderr.write("\nWARNING: potential eukaryotic contaminants were determined in reference genomes. some classifications may need to be ajdjusted.\nIt is recommended to run the pipeline again (as is), with the updated blacklist to correct that (most intermediate results can be reused, so this will be faster than the original run)\n\n")
    if len(db_suspects.blacklist_additions) > 0:
        sys.stderr.write("\nA total of {} reference-database entries were newly detected as contaminants! Please note the updated blacklist!\n".format(len(db_suspects.blacklist_additions)))

I think that amending the return to return [None] in L305 should do the trick, by keeping an iterable:

foo=None
print('yes' if 'A' in foo else 'no')
# yields TypeError: argument of type 'NoneType' is not iterable
# BUT 
foo=[None]
print('yes' if 'A' in foo else 'no')
# yields no correctly

Do you have any tests to check that this does not break anything?
Let me know whether you prefer a pull request or would rather do the fix yourself.
Best,
Charlie

better error messages

  • better error messages if input fasta does not exist
  • better error message if database does not exist

better error messages2

Better error messages are needed when:

  • no reference database path is set in configs
  • no arguments are given to "refdb_contam" workflow

improve config file

Config settings for external binaries that are not in PATH are often ignored...
Improve configs:

  • only paths to the blast+/diamond binaries folder, not to each individual binary file
  • add default blacklist
  • more consistent use of configured binary paths in system calls

implement lock-files

implement lock files to prevent two users writing to the same results-folder at the same time.

  • lock file for database creation
  • lock file for every mag/sag during genome assessment
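
A minimal lock-file sketch based on exclusive file creation. The context manager name and the idea of storing the process ID in the file are assumptions made for illustration; note also that exclusive creation may not be reliable on every network file system:

```python
import os
from contextlib import contextmanager


@contextmanager
def lockfile(path):
    """Hold an exclusive lock file for the duration of the 'with' block.

    Raises FileExistsError if another process already holds the lock.
    """
    # O_CREAT | O_EXCL makes creation fail if the file already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("{}\n".format(os.getpid()))
        yield path
    finally:
        os.remove(path)
```

One lock per database folder and one per MAG/SAG output folder would cover both cases listed above.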

"AttributeError: 'bindata' object has no attribute 'consensus_tax'" if NO contig yields ANY blast results

If NO contig yields ANY blast results, the following error is thrown:

'bindata' object has no attribute 'consensus_tax'
Traceback (most recent call last):
  File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/clean.py", line 179, in main
    bindata.get_topleveltax(db)
  File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 964, in get_topleveltax
    sys.stderr.write("\tmajority tax-path: {}\n".format("; ".join(self.get_consensus_taxstringlist())))
  File "/home/ww5070/.conda/envs/mdmcleaner/lib/python3.10/site-packages/mdmcleaner/getmarkers.py", line 977, in get_consensus_taxstringlist
    if self.consensus_tax != None:
AttributeError: 'bindata' object has no attribute 'consensus_tax'

This basically just means that the bin cannot be classified.
However, this needs to be caught, translated into a consensus-tax of "unknown" and reported as a more informative warning message.
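
A rough sketch of that behaviour, using a simplified stand-in class rather than the real bindata object. The attribute is always initialised, and an unclassifiable bin yields "unknown" plus a warning:

```python
import sys


class BinDataSketch:
    """Simplified stand-in for illustration; not the real 'bindata' class."""

    def __init__(self, consensus_tax=None):
        # Always define the attribute so later lookups cannot raise
        # AttributeError, even if no contig produced any blast hit.
        self.consensus_tax = consensus_tax

    def get_consensus_taxstringlist(self):
        if self.consensus_tax is None:
            sys.stderr.write(
                "\nWARNING: no contig yielded any blast hits; "
                "the bin cannot be classified\n"
            )
            return ["unknown"]
        return list(self.consensus_tax)
```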

divide protein blast databases

Divide protein blast databases into smaller subsets (similar to nucleotide dbs).
possibilities:

  1. separate by component sub-db (gtdb or refseq_eukaryote/virus)
  2. separate into roughly equal numbers of proteins
  • Test if this speeds up blasts when blasting one-by-one, always devoting all threads to each single db
  • Test if this speeds up blasts when blasting all simultaneously, dividing threads over all dbs (when using option 1: prioritize larger sub-DBs)
  • select and permanently implement the faster option

add a positive list for referencedb-ambiguities to speed things up

A positive list should also be included, containing the exact version of the reference DB used along with all entries that were previously assessed as "not contaminations" based on that database.

This would avoid repeating the same blast analyses with the same exact result over and over again and speed things up.

The list may get rather long over time, therefore:
Test how much the RAM usage increases for huge sets of accessions
Test how much slower a direct search in the text files via binary search tree would be

Threshold for contamination/completeness

Hi,
Following issue 29, would you suggest a maximum % of contamination and/or minimum % completeness (i.e. as determined by checkm for example or any other tool) in which cases it would not be worth running MDMcleaner?
For example, I imagine that a MAG with 80% of contamination will never get to 10% after running MDMcleaner so I would probably skip it.
Best
Greg

KeyError: 'Bacteria'

Hi,

I have been running MDMcleaner on 2 different bin sets. In one run, not all bins were processed without error (see #37). Using the same reference data, I am now getting the following error for the last bin to be processed:

--> writing to output files
        writing detailed contig infos to ./T4-48_bin.49.orig/fullcontiginfos_beforecleanup.tsv
        appending overview data to overview_all_before_cleanup.tsv
        creating output fastas
        creating krona input-table
reference-database contaminations detected during this run: 69
blasting 13 entries with blastx against reference proteins (another 5 entries were too long to blastx efficiently
Traceback (most recent call last):
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/bin/mdmcleaner", line 10, in <module>
    sys.exit(main())
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/mdmcleaner.py", line 217, in main
    blacklist_additions = clean.main(args, configs)
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/clean.py", line 230, in main
    if "contamination" in db_suspects.collective_diamondblast():
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 319, in collective_diamondblast
    eval_list.append(self.evaluateornot(self.blastxjobs[x], blastxdone = True))
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 283, in evaluateornot
    return_category, return_note = comp.count_contradictions() #todo: redundant. streamline blastcontigs() and countcontradictions() more
  File "/bio/Software/anaconda3/envs/mdmcleaner-0.8.3/lib/python3.10/site-packages/mdmcleaner/review_refdbcontams.py", line 123, in count_contradictions
    domain_counts_expected = domain_counts[comparison_domain] #todo: only in try_except statement for debugging
KeyError: 'Bacteria'

Any advice would be appreciated. Thanks!

Error in database downloading

Hi!
I'm trying to download the database using mdmcleaner makedb, and I get an error like this:
01a: download GTDB data--

    Now downloading from gtdb: "gtdb_taxfiles" (attempt 1)...

    Now downloading from gtdb: "gtdb_fastas" (attempt 1)...

    Now downloading from gtdb: "gtdb_vs_ncbi_lookup" (attempt 1)...

Traceback (most recent call last):
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
    read_gtdb_taxonomy.main(args, configs)
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1142, in main
    getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1036, in getNprepare_dbdata_nonncbi
    progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 714, in _download_dbdata_nonncbi
    progressdump["gtdb_download_dict"], progressdump["gtdb_version"] = download_gtdb_stuff(gtdb_source_dict, targetdir, verbose=verbose)
                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 316, in download_gtdb_stuff
    download_dict = get_download_dict(sourcedict, targetfolder)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 300, in get_download_dict
    okdownloadfilelist, allisfine = check_gtdbmd5file(which_md5filename(targetfolder), targetfolder, sourcedict[x]["pattern"])
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 243, in which_md5filename
    return glob.glob(targetdir + "/" + MD5FILEPATTERN_GTDB)[0] # --> assumes there is only one hit, therefore takes only the first of the list returned by glob.glob(); todo: make sure md5sum file is always deleted after db-setup! otherwise there may be problems if preexisting dbs are updated
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Can you please help me to solve it?

Regards,
Maria
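
The IndexError above comes from indexing an empty glob result. A hedged sketch of a more explicit lookup follows; the function name and the default pattern "MD5SUM*" are assumptions, since the actual value of MD5FILEPATTERN_GTDB is not shown here:

```python
import glob
import os


def find_single_md5file(targetdir, pattern="MD5SUM*"):
    """Return the one checksum file in targetdir or raise a clear error.

    'pattern' is an assumed placeholder, not the tool's real pattern.
    """
    hits = sorted(glob.glob(os.path.join(targetdir, pattern)))
    if not hits:
        raise FileNotFoundError(
            "no checksum file matching '{}' found in '{}'; the download "
            "may have failed".format(pattern, targetdir)
        )
    if len(hits) > 1:
        raise RuntimeError(
            "expected exactly one checksum file in '{}', found {}".format(
                targetdir, len(hits))
        )
    return hits[0]
```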

use prodigal results on eukaryotic contigs rather than refseq eukaryotic proteins

ORF calling on eukaryotes works drastically differently than for bacteria. Instead of using eukaryotic proteins as references, run prodigal with prokaryotic and metagenomic settings on reference eukaryotic genomes (not necessary for viral genomes) and use those as references. This is more likely to mimic what happens with eukaryotic contigs during metagenome analysis pipelines.

For this:

  • download refseq release eukaryotic genomes (nucleotide sequences)
  • randomly cut into chunks of ~5 kb, but also cut at stretches of "N"s (discard chunks that end up smaller than 200 bp)
  • run prodigal, dereplicate proteins (95% identity? or 90% identity?) to reduce database size. always keep largest representative --> protein diamond db: eukaryotic-refprotein-db
  • extract all remaining chunks without any predicted CDS (non-coding reference chunks), dereplicate (95% or 90% identity?). always keep largest representative --> nucleotide blastn-db: eukaryotic-noncoding-chunks-db
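
The chunking step could look roughly like this. For simplicity the sketch cuts at fixed 5 kb windows rather than at random positions, which is a deliberate simplification of the plan above; the function name and parameters are hypothetical:

```python
import re


def chunk_sequence(seq, chunk_len=5000, min_len=200):
    """Split seq at runs of N, then into windows of at most chunk_len.

    Pieces shorter than min_len are discarded.
    """
    chunks = []
    # First cut at every stretch of one or more N/n characters.
    for segment in re.split(r"[Nn]+", seq):
        # Then cut each N-free segment into consecutive windows.
        for start in range(0, len(segment), chunk_len):
            piece = segment[start:start + chunk_len]
            if len(piece) >= min_len:
                chunks.append(piece)
    return chunks
```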

todo: add metagenome workflows

add workflows to:

  • classify all contigs of a large shotgun metagenome (blast contigs in portions so progress can be saved in between)
  • reuse shotgun metagenome contig classifications when analyzing/filtering MAGs produced from the same shotgun metagenome

error with -M argument in "get_markers" workflow

When using "-M" with "get_markers", only contigs larger than the specified size are read into the contig_dict, BUT all contigs are nonetheless still searched for markers (at least for rRNA genes).

--> Need to make sure that only the contigs in contig_dict are passed to any process that accepts stdin.
--> Otherwise (e.g. aragorn), silently drop all hits associated with contigs that are not in contig_dict...
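
Both options can be sketched in a few lines. All names below are hypothetical, and the hit records are simplified to dictionaries with a "contig" key:

```python
def read_contigs(records, min_len=0):
    """Keep only (name, sequence) records of at least min_len bases."""
    return {name: seq for name, seq in records if len(seq) >= min_len}


def to_fasta(contig_dict):
    """Serialise exactly the retained contigs, e.g. for piping to stdin."""
    return "".join(">{}\n{}\n".format(name, seq)
                   for name, seq in contig_dict.items())


def keep_known_hits(hits, contig_dict):
    """Drop hits on contigs that were filtered out by the length cutoff."""
    return [hit for hit in hits if hit["contig"] in contig_dict]
```

Tools that accept stdin would receive `to_fasta(contig_dict)`, while tools that must read the original file would have their results passed through `keep_known_hits`.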
