

CheckM2

Rapid assessment of genome bin quality using machine learning.

Unlike CheckM1, CheckM2 uses universally trained machine learning models that it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. This allows it to incorporate many lineages in its training set that have few, or even just one, high-quality genomic representative, by placing them in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

CheckM2 uses two distinct machine learning models to predict genome completeness. The 'general' gradient boost model is able to generalize well and is intended to be used on organisms not well represented in GenBank or RefSeq (roughly, when an organism is novel at the level of order, class or phylum). The 'specific' neural network model is more accurate when predicting completeness of organisms more closely related to the reference training set (roughly, when an organism belongs to a known species, genus or family). CheckM2 uses a cosine similarity calculation to automatically determine the appropriate completeness model for each input genome, but you can also force the use of a particular completeness model, or get the prediction outputs for both. There is only one contamination model (based on gradient boost) which is applied regardless of taxonomic novelty and works well across all cases.
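As a rough illustration of the selection step, cosine similarity between feature vectors can be computed and thresholded as below. This is only a sketch: the feature representation, vector contents, and the 0.9 cutoff are assumptions for illustration, not CheckM2's actual values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def choose_completeness_model(genome_vector, reference_vector, threshold=0.9):
    """Pick the 'specific' model when the input resembles the reference
    training set, otherwise fall back to the 'general' model.
    The threshold here is an arbitrary illustration, not CheckM2's value."""
    sim = cosine_similarity(genome_vector, reference_vector)
    return "specific" if sim >= threshold else "general"
```

In practice, passing the relevant flag to checkm2 predict overrides this automatic choice.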

Usage

Bin quality prediction

The main use of CheckM2 is to predict the completeness and contamination of metagenome-assembled genomes (MAGs) and single-amplified genomes (SAGs), although it can also be applied to isolate genomes.

You can give it a folder with FASTA files using --input and direct its output with --output-directory:

checkm2 predict --threads 30 --input <folder_with_bins> --output-directory <output_folder> 

CheckM2 can also take a list of files via its --input parameter. It will automatically work out whether it was given a folder or a list of files and process them accordingly:

checkm2 predict --threads 30 --input ../bin1.fa ../../bin2.fna /some/other/directory/bin3.fasta --output-directory <output_folder> 

CheckM2 can also handle gzipped files. If passing a folder of gzipped files, specify the extension with --extension gz. If given a list of files, CheckM2 will automatically work out what to do, and specifying an extension is unnecessary. It can also handle mixed lists of gzipped and non-gzipped files given to --input.
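Transparent handling of gzipped and plain FASTA input can be sketched in a few lines of Python; the helper name `open_fasta` is illustrative and not part of CheckM2's API.

```python
import gzip
from pathlib import Path

def open_fasta(path):
    """Open a FASTA file for text reading, whether gzipped or plain.

    Dispatches on the .gz suffix, so mixed lists of compressed and
    uncompressed files can be handled with a single code path."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")
```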

If you already have predicted protein files (ideally generated with Prodigal), you can pass them to CheckM2 with the additional --genes option to let it know to expect protein files.

By default, the output folder will contain a tab-delimited file quality_report.tsv with the completeness and contamination information for each bin. You can also print the results to stdout by passing the --stdout option to checkm2 predict.
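Because quality_report.tsv is tab-delimited with Name, Completeness and Contamination columns, it is easy to filter bins downstream. A minimal sketch (the 90%/5% thresholds are illustrative defaults, not anything CheckM2 prescribes):

```python
import csv

def high_quality_bins(report_path, min_comp=90.0, max_cont=5.0):
    """Return names of bins passing completeness/contamination thresholds
    from a CheckM2 quality_report.tsv. Thresholds are illustrative."""
    keep = []
    with open(report_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if (float(row["Completeness"]) >= min_comp
                    and float(row["Contamination"]) < max_cont):
                keep.append(row["Name"])
    return keep
```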

Low memory mode

If you are running CheckM2 on a device with limited RAM, you can use the --lowmem option to reduce DIAMOND RAM use by half at the expense of longer runtime.

Run without installing

For simplicity, you can just download CheckM2 from GitHub and run it directly without installing.

Retrieve the files:

git clone --recursive https://github.com/chklovski/checkm2.git && cd checkm2

Create an appropriate Conda environment with prerequisites using the checkm2.yml file:

conda env create -n checkm2 -f checkm2.yml
conda activate checkm2

Finally, run CheckM2:

bin/checkm2 -h

Installation

The easiest way to install is using Conda in a new environment:

conda create -n checkm2 -c bioconda -c conda-forge checkm2

However, conda can be very slow when resolving the environment's requirements. A much faster way to install CheckM2 is to create the environment with mamba:

mamba create -n checkm2 -c bioconda -c conda-forge checkm2

CheckM2 is also available on PyPI. To install via pip, first use the checkm2.yml file provided in the GitHub repository to create and activate a new conda environment:

conda env create -n checkm2 -f checkm2.yml
conda activate checkm2

then simply:

pip install CheckM2

Alternatively, retrieve the Github files:

git clone --recursive https://github.com/chklovski/checkm2.git && cd checkm2

Then create a Conda environment using the checkm2.yml file:

conda env create -n checkm2 -f checkm2.yml
conda activate checkm2

Finally, install CheckM2:

python setup.py install

Installation should take no longer than 5-10 minutes on an average computer. To run CheckM2 afterwards:

conda activate checkm2
checkm2 -h

Database

You will also need to download and install the external DIAMOND database that CheckM2 relies on for rapid annotation. Use checkm2 database --download to install it into the default /home/user/databases directory, or provide a custom location using checkm2 database --download --path /custom/path/. If CheckM2 is centrally installed, the administrator should ideally carry out this step during setup, as users may not have permission to modify CheckM2 options.

The database path can also be set by setting the environmental variable CHECKM2DB using: export CHECKM2DB="path/to/database"

Finally, the --database_path option can be used with checkm2 predict to point at an already downloaded CheckM2 database for a single predict run, e.g. checkm2 predict -i ./folder_with_MAGs -o ./output_folder --database_path /path/to/database/CheckM2_database/uniref100.KO.1.dmnd
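Putting the three mechanisms together, the resolution order sketched below (command-line path, then the CHECKM2DB environment variable, then the default home-directory location) is an assumption inferred from the options described above, not a copy of CheckM2's internals:

```python
import os

# Assumed default location, mirroring the /home/user/databases convention above.
DEFAULT_DB = os.path.expanduser(
    "~/databases/CheckM2_database/uniref100.KO.1.dmnd")

def resolve_database_path(cli_path=None):
    """Sketch of a plausible precedence: explicit --database_path flag,
    then the CHECKM2DB environment variable, then the default."""
    if cli_path:
        return cli_path
    env = os.environ.get("CHECKM2DB")
    if env:
        return env
    return DEFAULT_DB
```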

Test run

It is highly recommended to do a test run with CheckM2 after installation and database download to ensure everything works. You can verify that the CheckM2 installation was successful using checkm2 testrun. This command should complete in under 5 minutes on an average desktop computer.

Testrun runs CheckM2's genome quality prediction models on three (complete, uncontaminated) test genomes from diverse lineages to ensure the process runs to completion and the predictions are within expected margins. These are:

Genome GTDB taxonomy CheckM1 Completeness CheckM1 Contamination
TEST1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli 99.97 0.04
TEST2 d__Bacteria;p__Patescibacteria;c__Dojkabacteria;o__SC72;f__SC72;g__UBA5209;s__UBA5209 sp002840365 79.86 0.00
TEST3 d__Archaea;p__Nanohaloarchaeota;c__Nanosalinia;o__Nanosalinales;f__Nanosalinaceae;g__Nanohalobium;s__Nanohalobium sp001761425 87.77 0.00

checkm2's People

Contributors: chklovski, donovan-h-parks, lmrodriguezr, sternp, wwood


checkm2's Issues

ERROR: Saved models could not be loaded

For information: I just downloaded and installed using the "git clone" approach, but the testrun fails.

$ checkm2 testrun
[07/12/2022 10:16:33 AM] INFO: Test run: Running quality prediction workflow on test genomes with 1 threads.
[07/12/2022 10:16:33 AM] INFO: Running checksum on test genomes.
[07/12/2022 10:16:33 AM] INFO: Checksum successful.
[07/12/2022 10:16:33 AM] INFO: Calling genes in 3 bins with 1 threads:
    Finished processing 3 of 3 (100.00%) bins.
[07/12/2022 10:17:04 AM] INFO: Calculating metadata for 3 bins with 1 threads:
    Finished processing 3 of 3 (100.00%) bin metadata.
[07/12/2022 10:17:05 AM] INFO: Annotating input genomes with DIAMOND using 1 threads
[07/12/2022 10:25:04 AM] INFO: Processing DIAMOND output
[07/12/2022 10:25:04 AM] INFO: Calculating completeness of pathways and modules.
[07/12/2022 10:25:09 AM] ERROR: Saved models could not be loaded.

This is on CentOS 7.9.2009

PermissionError: diamond_path.json

checkm2 database --download --path .
[01/11/2023 09:37:17 PM] INFO: Command: Download database. Checking internal path information.
[01/11/2023 09:37:21 PM] INFO: Downloading https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz to ./checkm2_database.tar.gz.
100%|██████████████████████████████████████████████████████████████████████████████████████████| 1.74G/1.74G [12:53<00:00, 2.24MiB/s]
[01/11/2023 09:50:52 PM] INFO: Extracting files from archive...
[01/11/2023 09:55:08 PM] INFO: Verifying version and checksums...
[01/11/2023 09:55:08 PM] INFO: Verification success.
Traceback (most recent call last):
  File "/opt/conda/bin/checkm2", line 4, in <module>
    __import__('pkg_resources').run_script('CheckM2==1.0.0', 'checkm2')
  File "/opt/conda/lib/python3.9/site-packages/pkg_resources/__init__.py", line 656, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/conda/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(code, namespace, namespace)
  File "/opt/conda/lib/python3.9/site-packages/CheckM2-1.0.0-py3.9.egg/EGG-INFO/scripts/checkm2", line 244, in <module>
    fileManager.DiamondDB().download_database(args.path)
  File "/opt/conda/lib/python3.9/site-packages/CheckM2-1.0.0-py3.9.egg/checkm2/fileManager.py", line 131, in download_database
    with open(diamond_location, 'w') as dd:
PermissionError: [Errno 13] Permission denied: '/opt/conda/lib/python3.9/site-packages/CheckM2-1.0.0-py3.9.egg/checkm2/version/diamond_path.json'

The diamond_path.json file has read-all permissions (-rw-r--r--).

Is checkm2 actually trying to edit a file within the package, and not just read from the file??

If this is the case, I highly recommend placing such editable files outside of the package itself (e.g., ${HOME}/.checkm2/).

Otherwise, there will be all sorts of permissions issues when multiple users run the same checkm2 install.

CheckM2 missing files and result inconsistencies

Great tool, but I have some questions.

Is there a way to see, with checkM2, how many markers were used and found, and which ones these are? The output it gives seems very basic.

Additionally, I am getting some inconsistent results and wanted to ask your opinion about them. Especially because it concerns DPANN and it was mentioned checkM2 works better for that group.

From a Nanopore dataset of corrected reads with Canu, I made Flye assemblies. Once with supplying the corrected reads as --nano-raw and once as --nano-corr.
These two assemblies (hereafter nano-raw and nano-corr) were used in metaWRAP for binning.

The nano-raw assembly gives one Concoct bin (1 contig) of a DPANN, and a Maxbin2 bin (2 contigs) of the same DPANN. Basically, the maxbin2 bin is the contig present in the Concoct bin plus a 7.8kb contig.
The nano-corr assembly gives one Concoct bin (1 contig) of a DPANN, and a Maxbin2 bin (2 contigs) of the same DPANN. Basically, the maxbin2 bin is the contig present in the Concoct bin plus a 7.8kb contig.

The large contig of the bins is the same size (2 bp difference) in the different assemblies, and the contigs are 99.956% similar to each other.
The 7.8kb contig present in the Maxbin2 bins of the different assemblies is identical.

When I run checkM2, however, results seem a bit odd.

nano-raw concoct bin: 60.96% complete; 0.54% contamination
nano-raw Maxbin2 bin: 61.05% complete; 0.53% contamination
nano-corr concoct bin: 60.74% complete; 0.63% contamination
nano-corr Maxbin2 bin: 60.74% complete; 0.63% contamination

Could you explain why that 7.8kb contig makes a difference in completeness in one of the bins, and not the other? The contig is identical, so potential changes should be the same.

When I run checkM1 on the 4 bins, they all have a completeness of 64.65% and 0% contamination.

Thanks a lot for the help/info on this!

lib_lightgbm.so: failed to map segment from shared object

Here was my installation command:

mamba env create -n checkm2_env -f checkm2/checkm2.yml && pip install --no-deps checkm2/

Here is the error:

(checkm2_env) [jespinoz@login02 Test]$ checkm2 -h
Traceback (most recent call last):
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/bin/checkm2", line 27, in <module>
    from checkm2 import predictQuality
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/checkm2/predictQuality.py", line 7, in <module>
    from checkm2 import modelProcessing
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/checkm2/modelProcessing.py", line 11, in <module>
    import lightgbm as lgb
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/lightgbm/basic.py", line 110, in <module>
    _LIB = _load_lib()
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/lightgbm/basic.py", line 101, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /expanse/projects/jcl110/anaconda3/envs/checkm2_env/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so: failed to map segment from shared object

I reinstalled lightgbm and got the same error.

(Bio)conda

Hi together 😄

Are there any plans to make checkm2 available as a complete conda package?
Would make it easier to use a specific version in workflows.

Greetings,
Linda

Diamond database version?

Hi,

Testrun gives error "Error: Database was built with a different version of Diamond and is incompatible.". I tried both the latest Diamond v2.0.15 and the minimum required v2.0.4. The database was downloaded using checkm2 database --download.

Ideas?

Thanks!

Visualization

Is there any way to visualize checkm2 output? Can we use checkm(1) plotting features for this?

Thanks!!

UnicodeEncodeError

Hi,

I use CheckM2 (v. 1.0.1) on a HPC cluster and I get this error whenever I run CheckM2 as a submitted job (qsub):

Traceback (most recent call last):
  File "/services/tools/checkm2/1.0.1/bin/checkm2", line 211, in <module>
    args.resume, args.remove_intermediates, args.ttable)
  File "/services/tools/checkm2/1.0.1/lib/python3.6/site-packages/checkm2/predictQuality.py", line 135, in prediction_wf
    diamond_out = diamond_search.run(prodigal_files)
  File "/services/tools/checkm2/1.0.1/lib/python3.6/site-packages/checkm2/diamond.py", line 119, in run
    self.__call_diamond(protein_chunks, diamond_out)
  File "/services/tools/checkm2/1.0.1/lib/python3.6/site-packages/checkm2/diamond.py", line 74, in __call_diamond
    sequenceClasses.SeqReader().write_fasta(seq_object, temp_diamond_input.name)
  File "/services/tools/checkm2/1.0.1/lib/python3.6/site-packages/checkm2/sequenceClasses.py", line 104, in write_fasta
    fout.write('>' + seqId + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\u03a9' in position 18: ordinal not in range(128)

Have you seen this before?

The cluster uses environment modules, one of which contains the tool and all its dependencies.
I have no problem running it on an interactive node with the exact same module loaded.

Thanks!
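This class of failure typically happens because batch jobs run under a POSIX/ASCII locale while interactive shells use UTF-8. A workaround sketch is to open the output file with an explicit UTF-8 encoding; `write_fasta` below is a simplified stand-in for the failing function in sequenceClasses.py, not the actual CheckM2 code.

```python
def write_fasta(records, path):
    """Write {seq_id: sequence} records as FASTA.

    Forcing encoding='utf-8' means non-ASCII characters in sequence IDs
    (such as the Greek omega in this traceback) are written correctly
    even when the job's locale defaults the codec to ASCII."""
    with open(path, "w", encoding="utf-8") as fout:
        for seq_id, seq in records.items():
            fout.write(">" + seq_id + "\n")
            fout.write(seq + "\n")
```

Alternatively, exporting PYTHONIOENCODING=utf-8 (or a UTF-8 locale) in the job script may avoid the crash without code changes.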

Database invalid version

The database download seems to be failing to determine the downloaded version.

I'm using v1.0.0 (as modified by #30 to allow installation) and Python 3.8.15.

This reproduces the error:

$> checkm2 database --download
[12/18/2022 04:34:36 PM] INFO: Command: Download database. Checking internal path information.
[12/18/2022 04:34:38 PM] INFO: Downloading https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz to /Users/miguel/databases/checkm2_database.tar.gz.
100%|███████████████████████████████████████████████████████████████████████████████████████| 1.74G/1.74G [08:00<00:00, 3.61MiB/s]
[12/18/2022 04:42:40 PM] INFO: Extracting files from archive...
[12/18/2022 04:43:11 PM] INFO: Verifying version and checksums...
[12/18/2022 04:43:11 PM] INFO: Verification success.
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/bin/checkm2", line 4, in <module>
    __import__('pkg_resources').run_script('CheckM2==1.0.0', 'checkm2')
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/EGG-INFO/scripts/checkm2", line 244, in <module>
    fileManager.DiamondDB().download_database(args.path)
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/checkm2/fileManager.py", line 140, in download_database
    if versionControl.VersionControl().checksum_version_validate_DIAMOND():
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/checkm2/versionControl.py", line 119, in checksum_version_validate_DIAMOND
    return self.__validateVersion(self.version, cutoff_version)
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/checkm2/versionControl.py", line 58, in __validateVersion
    return v_compare.parse(str(query)) >= v_compare.parse(str(ref))
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/packaging/version.py", line 52, in parse
    return Version(version)
  File "/usr/local/Caskroom/miniconda/base/envs/checkm2/lib/python3.8/site-packages/packaging/version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '10    0.0.0
14    0.1.0
Name: incompatible_below_checkm2ver, dtype: object'

CommandNotFoundError

Hi Alex,

I tried to run it without installing, but after "conda activate checkm2" I get: CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.

Screenshot 2022-07-27 at 10 31 44

Which one should I choose? If I can't see my login node after the "conda init zsh", how can I change it back - to cancel the "conda init zsh"?

Best,

Bing

Report missing models early

Hi,

Right now the check for the required models is only done after calling and annotating genes. This can take hours when processing thousands of genomes. CheckM2 may then fail because it can't find the required models, e.g.:

[12/16/2021 03:16:17 AM] INFO: Running quality prediction workflow with 96 threads.
[12/16/2021 03:16:17 AM] INFO: Calling genes in 52 bins with 96 threads:
    Finished processing 52 of 52 (100.00%) bins.
[12/16/2021 03:16:53 AM] INFO: Calculating metadata for 52 bins with 96 threads:
    Finished processing 52 of 52 (100.00%) bin metadata.
[12/16/2021 03:18:15 AM] INFO: Annotating input genomes with DIAMOND using 96 threads
[12/16/2021 03:18:15 AM] INFO: Processing DIAMOND output
[12/16/2021 03:18:16 AM] INFO: Calculating completeness of pathways and modules.
[12/16/2021 03:18:23 AM] ERROR: Saved models could not be loaded.

As a quality-of-life improvement, it would be nice to check for all required dependencies upfront and report any issues before starting processing.

CheckM2 too conservative?

Hi,

I used CheckM2 to check the completeness and contamination of some RefSeq complete genomes. I found that even if some genomes have an anomalous (extremely low) number of ribosomal proteins, CheckM2 still marks them as high quality. Here are two examples.

Accession Name Completeness Contamination
GCF_016653575.1 Bacillus sp. TK-2 98.77 0.08
GCF_014235785.1 Bacillus sp. PAMC26568 92.50 0.00

CheckM1 gives 80.36% and 69.64% for these two genomes, which seems more reasonable. So I was wondering what could be the cause of this?

Testrun discrepancies

Hello,

I downloaded CheckM2 and did the test run, but it gives different output than what is stated. I cannot find my mistake. Otherwise it runs perfectly, also on other genomes, but I am not sure if I can trust the numbers now.

this is my output:
Name Completeness Contamination Completeness_Model_Used Translation_Table_Used
TEST1 100.00 0.74 Neural Network (Specific Model) 11
TEST2 98.54 0.21 Neural Network (Specific Model) 11
TEST3 98.75 0.51 Neural Network (Specific Model) 11

Release 1.0.0

Hello together,

I am a big fan of checkm(1) and was very excited when I heard about your new version.
I would like to use checkm2 in my new workflow and wanted to ask if you can estimate when release 1.0.0 will come?

I look forward to hearing from you!
Greetings,
Linda

Version of Numpy installed from conda

Hey Alex,

Just bringing the issue discussed here to your attention: https://github.com/rhysnewell/aviary/issues/92#issue

I have had this issue pop up once with a student running a Snakemake rule using CheckM2 in Aviary, but the issue has not occurred for others (at least recently). So it is unclear if it is an issue that will occur frequently.

AttributeError: module 'numpy' has no attribute 'typeDict'
[Fri Jan 27 06:54:28 2023]
Error in rule checkm_semibin:
    jobid: 20
    input: data/semibin_bins/done
    output: data/semibin_bins/checkm2_out, data/semibin_bins/checkm.out
    conda-env: /srv/home/USER/.conda/envs/096cdc4ad5efd911ad0acd1888e0a504_
    shell:
        touch data/semibin_bins/checkm.out; export CHECKM2DB=/srv/db/checkm2_data/0.1.3/CheckM2_database//uniref100.KO.1.dmnd; echo "Using CheckM2 database $CHECKM2DB"; checkm2 predict -i data/semibin_bins/output_recluster_bins// -x fa -o data/semibin_bins/checkm2_out -t 60 --force; cp data/semibin_bins/checkm2_out/quality_report.tsv data/semibin_bins/checkm.out
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Sequence is long

Hi,

Thanks for your nice tool, but I get a warning message, maybe because some contigs are large:
Warning: Sequence is long (max 32000000 for training). Training on the first 32000000 bases.
Will this have any influence on the results? And how can I solve this problem?

Thank you and have a nice day.
Carrie

quality report: *.gz extension removed from file name

If the user provides gzipped FASTA files as input to checkm2 predict, the .gz extension is not included in the values of the Name column in the quality report. This causes issues for downstream applications, such as dRep dereplicate --genomeInfo quality_report.tsv
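Until this is fixed upstream, a post-processing step can re-append the extension for bins known to have been gzipped. The helper below and its `gz_inputs` argument (the set of bin names as they appear in the report) are hypothetical workaround code, not part of CheckM2.

```python
import csv

def restore_gz_names(report_path, out_path, gz_inputs):
    """Copy a quality_report.tsv, re-appending '.gz' to Name values for
    bins that were originally gzipped, so tools like dRep can match
    the names back to input files."""
    with open(report_path) as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row["Name"] in gz_inputs:
                row["Name"] += ".gz"
            writer.writerow(row)
```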

scikit-learn upgrade?

Hi,

scikit-learn 0.23.2 is pretty outdated, could this (and maybe some of the other dependencies) be upgraded to a recent release?

Thanks!

grpcio dependency: No such file or directory: 'cc'

Steps to reproduce:

env.yaml file:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.9
  - scikit-learn=0.23.2
  - h5py=2.10.0
  - numpy=1.19.2
  - diamond=2.0.4
  - tensorflow>=2.1.0,<2.6.0
  - lightgbm=3.2.1
  - pandas<=1.4.0
  - scipy
  - prodigal>=2.6.3
  - setuptools
  - requests
  - packaging
  - tqdm

Dockerfile:

FROM mambaorg/micromamba:1.1.0

ARG MAMBA_DOCKERFILE_ACTIVATE=1
COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml
RUN micromamba install -y -n base -f /tmp/env.yaml git && \
    micromamba clean --all --yes

RUN git clone --recursive https://github.com/chklovski/checkm2.git && \
    cd checkm2 && \
    python setup.py install && \
    cd ../ && rm -rf checkm2

Build:

docker build --platform linux/amd64 -t checkm2:1.0.0 .

Error

Installed /opt/conda/lib/python3.9/site-packages/wheel-0.38.4-py3.9.egg
Searching for grpcio~=1.32.0
Reading https://pypi.org/simple/grpcio/
Downloading https://files.pythonhosted.org/packages/0e/5f/eeb402746a65839acdec78b7e757635f5e446138cc1d68589dfa32cba593/grpcio-1.32.0.tar.gz#sha256=01d3046fe980be25796d368f8fc5ff34b7cf5e1444f3789a017a7fe794465639
Best match: grpcio 1.32.0
Processing grpcio-1.32.0.tar.gz
Writing /tmp/easy_install-eop8d0xx/grpcio-1.32.0/setup.cfg
Running grpcio-1.32.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-eop8d0xx/grpcio-1.32.0/egg-dist-tmp-s4mfdn9f
/tmp/easy_install-eop8d0xx/grpcio-1.32.0/src/python/grpcio/commands.py:104: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if exit_code is not 0:
error: [Errno 2] No such file or directory: 'cc'

Checkm 2 fails when a genome has lowercase bases

Hi!

CheckM2 expects uppercase bases and crashes when that is not the case.

GC = sum(seq.count(x) for x in ("G", "C"))

Traceback (most recent call last):
  File "miniconda3/envs/checkm2_env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "miniconda3/envs/checkm2_env/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "miniconda3/envs/checkm2_env/lib/python3.9/site-packages/CheckM2-1.0.0-py3.9.egg/checkm2/predictQuality.py", line 322, in __set_up_prodigal_thread
    v_N50, v_avg_gene_len, v_total_bases, v_cds_count, v_GC = prodigal_thread.run(bin, ttable)
  File "miniconda3/envs/checkm2_env/lib/python3.9/site-packages/CheckM2-1.0.0-py3.9.egg/checkm2/prodigal.py", line 77, in run
    GC = float(GC/(AT + GC))
ZeroDivisionError: division by zero
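A sketch of the fix: upper-casing the sequence before counting makes the calculation case-insensitive and guards against the division by zero. This is an illustration of the repair, not the patch actually applied to CheckM2.

```python
def gc_fraction(seq):
    """GC fraction of a nucleotide sequence, tolerant of lowercase bases.

    With the original uppercase-only counting, an all-lowercase genome
    yields AT + GC == 0 and a ZeroDivisionError; normalizing case first
    avoids that, and ambiguous bases (e.g. N) are simply ignored."""
    seq = seq.upper()
    gc = sum(seq.count(x) for x in ("G", "C"))
    at = sum(seq.count(x) for x in ("A", "T"))
    if at + gc == 0:
        return 0.0
    return gc / (at + gc)
```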

insufficient for CUDA runtime version?

[10/06/2022 10:32:13 PM] ERROR: Saved models could not be loaded: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Any thoughts?

--allmodels is broken.

$ time checkm2 predict -i ~/Downloads/holden/ -o ~/Downloads/holden/temp --force --allmodels --remove_intermediates -t 16 -x fa
[12/29/2022 08:01:37 PM] INFO: Running quality prediction workflow with 16 threads.
[12/29/2022 08:01:38 PM] INFO: Calling genes in 3 bins with 16 threads:
    Finished processing 3 of 3 (100.00%) bins.
[12/29/2022 08:01:47 PM] INFO: Calculating metadata for 3 bins with 16 threads:
    Finished processing 3 of 3 (100.00%) bin metadata.
[12/29/2022 08:01:48 PM] INFO: Annotating input genomes with DIAMOND using 16 threads
[12/29/2022 08:02:47 PM] INFO: Processing DIAMOND output
[12/29/2022 08:02:48 PM] INFO: Predicting completeness and contamination using ML models.
[12/29/2022 08:02:53 PM] INFO: Parsing all results and constructing final output table.
Traceback (most recent call last):
  File "/home/user/.conda/envs/checkm2/bin/checkm2", line 4, in <module>
    __import__('pkg_resources').run_script('CheckM2==1.0.0', 'checkm2')
  File "/home/user/.conda/envs/checkm2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/user/.conda/envs/checkm2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/home/user/.conda/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/EGG-INFO/scripts/checkm2", line 182, in <module>
    predictor.prediction_wf(args.genes, mode, args.dbg_cos, args.dbg_vectors, args.stdout,
  File "/home/user/.conda/envs/checkm2/lib/python3.8/site-packages/CheckM2-1.0.0-py3.8.egg/checkm2/predictQuality.py", line 250, in prediction_wf
    final_results['Contamination'] = np.round(general_results_cont, 2)
NameError: name 'general_results_cont' is not defined

Works without --allmodels

Write log to file

Hi,

This is a non-critical, quality-of-life feature request.

It would be helpful if the CheckM2 log indicated the version of the software being run and saved the log to file in the output directory. This allows people to inspect the log on crashes, send the log when reporting issues, and provides a record of what was run (including version number) when writing up study results.

Cheers,
Donovan

Remove intermediate files

Hi,

This is a non-critical, quality-of-life feature request.

It would be nice to have a flag that allowed all intermediate files to be removed, namely the diamond_output and protein_files directories. These are often not of interest and require non-trivial disk space when running large numbers of genomes (e.g. ~2 GB for 1000 genomes). Obviously, such things can be handled by auxiliary scripts, but it would be a nice QoL feature to have built into the software.

Cheers,
Donovan
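Until such a flag exists, the cleanup described above can be scripted. A sketch, assuming the directory names reported in this issue (the mkdir line is only a stand-in to make the example self-contained; in real use the directories come from a CheckM2 run):

```shell
# Remove CheckM2 intermediate directories after a successful run.
# Directory names (diamond_output, protein_files) are taken from this
# issue report; adjust if your CheckM2 version names them differently.
out="output_folder"
mkdir -p "$out/diamond_output" "$out/protein_files"   # stand-in for a real run
rm -rf "$out/diamond_output" "$out/protein_files"
```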

[Discussion] Testing on microeukaryotic organisms as a negative control

I did a quick test on some marine plastisphere eukaryotes I've binned out using my VEBA pipeline to determine whether or not I should run Tiara.


I noticed that eukaryotes are returned with high completeness and contamination. With my default settings (completeness ≥ 50 and contamination < 10) I would have removed these automatically, but I wasn't sure if this info was useful for your research.

Good luck with the publication! Looking forward to checking out the final draft once it's out. And thanks again for putting this up on bioconda and PyPI, as it has made my prokaryotic binning pipeline MUCH more straightforward.

Please close if this too far out of scope.

Assessing the quality of viral bins

Recently I have been working on viral metagenomics. I identified viral contigs using VirSorter2 and then performed binning using vRhyme, finally obtaining some viral MAGs (vMAGs). CheckV seems to only assess the quality of single viral contigs; does CheckM2 work for vMAGs?

I cannot run CheckM2 in Diamond

Database Configuration

$ INFO: Database check successful! Database path successfully added.

Even though the database check succeeds, when I run

$ checkm2 predict --force --lowmem --threads 12 --allmodels --input checkm.bin --output-directory checkm2.out

I get the following error and the command cannot be executed:

$ ERROR: DIAMOND database not found. Please download the database using <checkm2 database --download>.

The test run, however,

$ checkm2 testrun

succeeds:

$ INFO: Checksum successful.
INFO: Annotating input genomes with DIAMOND using 1 thread
INFO: Processing DIAMOND output
INFO: Predicting completeness and contamination using ML models.
INFO: Parsing all results and constructing final output table.
INFO: CheckM2 finished successfully.
INFO: Test run successful! See README for details.
Name Completeness Contamination Completeness_Model_Used Translation_Table_Used
TEST1 100.00 0.78 Neural Network (Specific Model) 11
TEST2 98.68 0.22 Neural Network (Specific Model) 11
TEST3 98.76 0.50 Neural Network (Specific Model) 11

How can I run it?

No space left on device

Hi,

I'm running checkm2 on 3000 bins and running into the following error:

[09/01/2022 01:51:32 AM] INFO: Calculating metadata for 3000 bins with 4 threads:
[09/01/2022 01:52:28 AM] INFO: Annotating input genomes with DIAMOND using 4 threads
No space left on device
No space left on device
terminate called after throwing an instance of 'File_write_exception'
  what():  Error writing file /tmp/tmpf39hhc0_/diamond-tmp-AvchIO
No space left on device
terminate called after throwing an instance of 'File_write_exception'
  what():  Error writing file /tmp/tmp8fczbzb7/diamond-tmp-GPaMRF
No space left on device
terminate called after throwing an instance of 'File_write_exception'
  what():  Error writing file /tmp/tmpn4zhr9gd/diamond-tmp-FwJlQ5
No space left on device
terminate called after throwing an instance of 'File_write_exception'
  what():  Error writing file /tmp/tmpi3bsvtgx/diamond-tmp-jIZPmg
Traceback (most recent call last):
  File "/mnt/lscratch/users/sbusi/SnakemakeBinning/workflow/rules/../../submodules/bin/checkm2", line 154, in <module>
    predictor.prediction_wf(args.genes, mode, args.dbg_cos, args.dbg_vectors, args.stdout)
  File "/mnt/lscratch/users/sbusi/SnakemakeBinning/submodules/bin/../checkm2/predictQuality.py", line 103, in prediction_wf
    diamond_out = diamond_search.run(prodigal_files)
  File "/mnt/lscratch/users/sbusi/SnakemakeBinning/submodules/bin/../checkm2/diamond.py", line 126, in run
    self.__call_diamond(self.__concatenate_proteins(chunk), diamond_out)
  File "/mnt/lscratch/users/sbusi/SnakemakeBinning/submodules/bin/../checkm2/diamond.py", line 73, in __call_diamond
    sequenceClasses.SeqReader().write_fasta(seq_object, temp_diamond_input.name)
  File "/mnt/lscratch/users/sbusi/SnakemakeBinning/submodules/bin/../checkm2/sequenceClasses.py", line 104, in write_fasta
    fout.write('>' + seqId + '\n')
OSError: [Errno 28] No space left on device

Looks similar to the following issue: bbuchfink/diamond#267

However, the SLURM efficiency report suggests that I had more than sufficient resources:

Job ID: 2889109
Cluster: iris
User/Group: sbusi/clusterusers
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 4
CPU Utilized: 22:40:03
CPU Efficiency: 96.88% of 23:23:48 core-walltime
Job Wall-clock time: 05:50:57
Memory Utilized: 11.43 GB
Memory Efficiency: 10.84% of 105.47 GB

Does one have to use the low mem setting for this?

Thank you,
Susheel

EDIT: I've verified that I have plenty of disk space and also inodes :)

AttributeError: 'Predictor' object has no attribute '__set_up_prodigal_thread'

Hey - thanks for this tool! Really excited to start using it. I used the "Run without Installing" method to use CheckM2 and I keep getting the error:

AttributeError: 'Predictor' object has no attribute '__set_up_prodigal_thread'

I get this with both bin/checkm2 testrun and CheckM2 predict with my own genomes. I thought it was a python version compatibility issue so I tried installing the version listed in the .yml file. I also tried actually installing it. In both cases, I either get this exact same error, or something very similar like:

AttributeError: module '__main__' has no attribute '__spec__' or
AttributeError: 'Predictor' object has no attribute '__reportProgress'

Any clues as to what could be going on? Not sure if it's an issue on my end or what. Appreciate any advice/help you can give!

Latest software and database versions

It would be helpful if the README indicated what the most recent versions of the software and database are, and when each was last updated. This is particularly an issue for the database, since without downloading a new copy there is currently no way to know whether it has been updated.
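Until such version information is published, one possible stopgap is comparing a checksum of the local database file against a published value. This is only a sketch: CheckM2 does not currently expose such an endpoint, and the reference hash would be something the README or a release note would have to publish.

```python
# Sketch of a staleness check: hash the local .dmnd file and compare it with
# a published reference hash (hypothetical; not an existing CheckM2 feature).
import hashlib

def file_md5(path, chunk=1 << 20):
    """MD5 of a file, streamed so a multi-GB database never sits in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def database_is_current(path, published_md5):
    """True when the local database matches the published checksum."""
    return file_md5(path) == published_md5
```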

Unable to install scikit-learn==0.23.2

Hi,

on Ubuntu 20.04 LTS, I'm unable to install CheckM2 because the pinned, outdated scikit-learn fails to compile. This is with
Python 3.11.2, and conda or Docker currently isn't an option for us.

            creating build/temp.linux-x86_64-3.11/scipy/cluster
            INFO: compile options: '-I/tmp/pip-build-env-l2e7hsoe/overlay/lib/python3.11/site-packages/numpy/core/include -I/tmp/pip-build-env-l2e7hsoe/overlay/lib/python3.11/site-packages/numpy/core/include -Ibuild/src.linux-x86_64-3.11/numpy/distutils/include -I/vol/mgx-sw/include/python3.11 -c'
            extra options: '-msse -msse2 -msse3'
            INFO: gcc: scipy/cluster/_vq.c
            scipy/cluster/_vq.c:196:12: fatal error: longintrepr.h: No such file or directory
              196 |   #include "longintrepr.h"
                  |            ^~~~~~~~~~~~~~~
            compilation terminated.

Installation issue: wrong file version provided

Hey all,

I ran python setup.py install and am getting the following error:

installing package data to build/bdist.linux-x86_64/egg
running install_data
error: can't copy 'checkm2/version/version_hashes_0.1.1.json': doesn't exist or not a regular file

I checked the folder; the file in there is version_hashes_0.1.2.json, so please update the setup.py file.

Thank you!

[Dependency Request] Generalize the dependency versions

I'm trying to establish this in my https://github.com/jolespin/veba pipeline and the prokaryotic binning module has quite a few dependencies I need to work around.

Here's the current dependency restrictions:

    - python >=3.6, <3.9
    - scikit-learn =0.23.2
    - h5py =2.10.0
    - numpy =1.19.2
    - diamond =2.0.4
    - tensorflow >=2.1.0, <2.6.0
    - lightgbm =3.2.1
    - pandas <=1.4.0
    - scipy
    - prodigal >=2.6.3
    - setuptools
    - requests
    - packaging
    - tqdm

Diamond

In particular, can Diamond allow for more versions? Any version compatible with db version 3?

(checkm2_env) [jespinoz@exp-15-05 jcl110]$ diamond dbinfo -d db/checkm2/CheckM2_database/uniref100.KO.1.dmnd
diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

Database format version = 3
Diamond build = 142
Sequences = 6518230
Letters = 2584051404

According to this thread: bbuchfink/diamond#313 (comment)
Any of the upcoming versions should be able to handle this database version.
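If that is the case, a compatibility gate could key off the `diamond dbinfo` output quoted above rather than a pinned DIAMOND version. A rough sketch (the gating idea is my assumption, not CheckM2 code):

```python
def parse_db_format(dbinfo_output):
    """Extract the 'Database format version' field from `diamond dbinfo` text."""
    for line in dbinfo_output.splitlines():
        if line.startswith("Database format version"):
            return int(line.split("=")[1])
    raise ValueError("format version not found")

# Excerpt of the dbinfo run shown above:
dbinfo = """Database format version = 3
Diamond build = 142
Sequences = 6518230"""
print(parse_db_format(dbinfo))  # 3
```

Any installed DIAMOND whose reported database format matches could then be accepted, instead of requiring exactly 2.0.4.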

Sklearn

Also, sklearn is only used for MinMaxScaler (https://github.com/chklovski/CheckM2/blob/d40aef910f24bf7e479e5f7ff1ed326f30d2884b/checkm2/modelProcessing.py)

I noticed that it's not actually used here:

from sklearn.preprocessing import MinMaxScaler

Though, it looks like one of the models might have it in the backend:

/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-prokaryotic2_env/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator MinMaxScaler from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
...
Traceback (most recent call last):
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-prokaryotic2_env/bin/checkm2", line 265, in <module>
    predictor.prediction_wf(False, 'auto', False, False, False)
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-prokaryotic2_env/lib/python3.8/site-packages/checkm2/predictQuality.py", line 214, in prediction_wf
    specific_result_comp, scaled_features = modelProc.run_prediction_specific(vector_array, specific_model_vector_len)
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-prokaryotic2_env/lib/python3.8/site-packages/checkm2/modelProcessing.py", line 68, in run_prediction_specific
    scaled_vector = self.minmax_scaler.transform(vector_array)
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-prokaryotic2_env/lib/python3.8/site-packages/sklearn/preprocessing/_data.py", line 506, in transform
    if self.clip:
AttributeError: 'MinMaxScaler' object has no attribute 'clip'
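For context, the transform this dependency provides is tiny. A pure-Python sketch of the same min-max math (function names here are illustrative, not CheckM2's API) shows that loosening the sklearn pin mostly hinges on pickle compatibility rather than the arithmetic itself:

```python
# Minimal sketch of what sklearn's MinMaxScaler does at prediction time:
# scale each feature into [0, 1] using the min/max learned at fit time.

def fit_minmax(rows):
    """Learn per-column min and max from a list of feature rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def transform_minmax(rows, mins, maxs):
    """Apply X_scaled = (X - min) / (max - min), column-wise."""
    out = []
    for row in rows:
        out.append([
            (x - lo) / (hi - lo) if hi != lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return out

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
mins, maxs = fit_minmax(train)
print(transform_minmax([[5.0, 30.0]], mins, maxs))  # [[0.5, 1.0]]
```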

I'm trying out a conda installation where I have different dependencies. I installed CheckM2 in this environment via pip with --no-deps to test it out. Just confirmed that it works with higher versions of some of the packages:

(base) [jespinoz@exp-15-05 checkm2]$ conda activate VEBA-binning-prokaryotic2_env
(VEBA-binning-prokaryotic2_env) [jespinoz@exp-15-05 checkm2]$ checkm2 testrun --threads 16
[01/24/2023 02:26:53 PM] INFO: Test run: Running quality prediction workflow on test genomes with 16 threads.
[01/24/2023 02:26:53 PM] INFO: Running checksum on test genomes.
[01/24/2023 02:26:53 PM] INFO: Checksum successful.
[01/24/2023 02:26:54 PM] INFO: Calling genes in 3 bins with 16 threads:
    Finished processing 3 of 3 (100.00%) bins.
[01/24/2023 02:27:12 PM] INFO: Calculating metadata for 3 bins with 16 threads:
    Finished processing 3 of 3 (100.00%) bin metadata.
[01/24/2023 02:27:12 PM] INFO: Annotating input genomes with DIAMOND using 16 threads
[01/24/2023 02:28:49 PM] INFO: Processing DIAMOND output
[01/24/2023 02:28:49 PM] INFO: Predicting completeness and contamination using ML models.
[01/24/2023 02:28:56 PM] INFO: Parsing all results and constructing final output table.
[01/24/2023 02:28:56 PM] INFO: CheckM2 finished successfully.
[01/24/2023 02:28:56 PM] INFO: Test run successful! See README for details. Results:
 Name  Completeness  Contamination         Completeness_Model_Used  Translation_Table_Used
TEST1        100.00           0.74 Neural Network (Specific Model)                      11
TEST2         98.39           0.19 Neural Network (Specific Model)                      11
TEST3         98.67           0.48 Neural Network (Specific Model)                      11
(VEBA-binning-prokaryotic2_env) [jespinoz@exp-15-05 checkm2]$ ls
CheckM2_database  checkm2_database.tar.gz  download.sh  uniref100.KO.1.dmnd
(VEBA-binning-prokaryotic2_env) [jespinoz@exp-15-05 checkm2]$ conda env export | grep -E "scikit-learn|h5py|numpy|diamond|tensorflow|lightgbm|pandas|scipy"
  - diamond=2.0.8=h56fc30b_0
  - h5py=2.10.0=nompi_py38h9915d05_106
  - lightgbm=3.3.5=py38h8dc9893_0
  - numpy=1.19.5=py38h8246c76_3
  - pandas=1.4.1=py38h43a58ef_0
  - scikit-learn=0.23.2=py38h5d63f67_3
  - scipy=1.8.0=py38h56a6a73_1
  - tensorflow=2.4.0=py38h578d9bd_0
  - tensorflow-base=2.4.0=py38h01d9eeb_0
  - tensorflow-estimator=2.4.0=pyh9656e83_0

Hope this helps! Very much looking forward to cleaning up my VEBA prokaryotic binning with your tool. I had quite a few verbose workarounds to handle CPR.

failure to reproduce Completeness data from Supplementary Table 6

We are trying to test our local installation of checkm2 by reproducing Completeness data for assemblies listed in Supplementary Table 6 from the preprint: https://www.biorxiv.org/content/10.1101/2022.07.11.499243v1.supplementary-material

We tried using both the nucleotide sequences and the protein sequences from NCBI for the given assemblies, but could not reproduce the reported values. Is this expected?

Version:

(checkm2) gpipedev21:PGAP-7474-Try-CheckM2$ ./checkm2/bin/checkm2 --version
1.0.1

command line:

checkm2 predict --force --threads 30 --input nucleotide.file.fasta --output-directory my.results

for nucleotide input; for protein input, we add --genes and replace the nucleotide input file with a protein input file.

database downloading fails

Asking for help: I ran into a problem when trying to download the database.

[02/26/2023 11:59:04 PM] INFO: Command: Download database. Checking internal path information.
Traceback (most recent call last):
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f1ab1046e50>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /record/5571251 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ab1046e50>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/yanjiuyuan/Project/huangqiang/checkm2/bin/../checkm2/zenodo_backpack.py", line 164, in _retrieve_record_ID
    r = requests.get(DOI, timeout=15.)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/sessions.py", line 723, in send
    history = [resp for resp in gen]
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/sessions.py", line 723, in <listcomp>
    history = [resp for resp in gen]
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/sessions.py", line 266, in resolve_redirects
    resp = self.send(
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/yanjiuyuan/userLogin/huangqiang/miniconda3/envs/checkm2/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /record/5571251 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ab1046e50>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/checkm2", line 280, in <module>
    fileManager.DiamondDB().download_database(args.path)
  File "/yanjiuyuan/Project/huangqiang/checkm2/bin/../checkm2/fileManager.py", line 127, in download_database
    backpack_downloader.download_and_extract(download_location, DOI, progress_bar=True, no_check_version=False)
  File "/yanjiuyuan/Project/huangqiang/checkm2/bin/../checkm2/zenodo_backpack.py", line 46, in download_and_extract
    recordID = self._retrieve_record_ID(DOI)
  File "/yanjiuyuan/Project/huangqiang/checkm2/bin/../checkm2/zenodo_backpack.py", line 166, in _retrieve_record_ID
    raise ZenodoConnectionException('Connection error: {}'.format(e))
checkm2.zenodo_backpack.ZenodoConnectionException: Connection error: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /record/5571251 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ab1046e50>: Failed to establish a new connection: [Errno 111] Connection refused'))

pypi and bioconda install

Are there any plans to get CheckM2 on pypi (and then bioconda)?

Given how many pipelines/workflows include CheckM v1, it will be very helpful to have CheckM2 on bioconda

tensorflow

Hi, I followed the protocol to install the software and came across the following warning message. Is everything fine? Should I just ignore it?

I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
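The warning is informational: oneDNN kernels can reorder floating-point operations, which may perturb the last digits of predictions. If bit-identical runs matter to you, the message's own suggestion can be applied from Python, as long as the variable is set before TensorFlow is imported:

```python
# Disable oneDNN custom ops for reproducible floating-point results.
# This must run before `import tensorflow`, or it has no effect.
import os

os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
# import tensorflow as tf  # imported afterwards, picks up the setting
```

Exporting `TF_ENABLE_ONEDNN_OPTS=0` in the shell before launching CheckM2 achieves the same thing.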

checkm2 predict: diamond annotations as input

As stated in diamond.py:

Diamond only accepts single inputs, so we concat protein files and chunk them as input using tempfile

For large numbers of genomes (e.g., 10k or 100k MAGs), it would be best to annotate genomes in batches, with each batch annotated in a separate job. The merged annotations could then be provided as input to checkm2 predict. This should scale better than a single DIAMOND job for all genes in all genomes.

All that would likely be necessary to implement this is to allow gene annotation files as input (similar to --genes in checkm2 predict) and to skip the gene calling and annotation steps.
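Because DIAMOND's tabular output is one hit per line, merging per-batch results is trivial. A sketch of the glue step (the flag it would feed, e.g. a `--diamond-output` style option, is hypothetical, not an existing checkm2 option):

```python
# Hedged sketch of the proposed batching: concatenate per-batch DIAMOND
# tabular (.tsv) result files into one file for a single downstream
# checkm2 predict call.
import glob
import os

def merge_diamond_batches(batch_dir, merged_path):
    """Concatenate all per-batch DIAMOND .tsv files; return total rows merged."""
    n = 0
    with open(merged_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(batch_dir, "*.tsv"))):
            with open(path) as fh:
                for line in fh:
                    out.write(line)
                    n += 1
    return n
```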

conda installation with tensorflow 2.9.1

Hey,

thanks for the awesome work. Just to let you know, I ran into issues using the latest tensorflow version on conda-forge: tensorflow 2.9.1

tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow15TensorShapeBaseINS_11TensorShapeEEC1EN4absl12lts_202111024SpanIKlEE

Downgrading to version 2.8.1 did not fix it. However, deleting the file as suggested here seems to work (checkm2 testrun runs successfully)

Cheers

Make Prodigal and Diamond outputs reusable.

Hello,

Thank you for this great tool!

I am currently adding the metawrap bin_refinement module in a metagenomics analysis pipeline. This module uses CheckM1 and I would like to update it with CheckM2.

The metawrap bin_refinement module works as follows:

  • it takes as input 3 bin sets made with different binning tools
  • it creates 4 new hybrid bin sets from combinations of them
  • it then launches CheckM on each bin set
  • finally, it finds the best version of each bin across bin sets using the CheckM contamination and completeness values.

In this workflow, CheckM is launched 7 times (on the 3 initial bin sets and on the 4 hybrid ones), yet all the bin sets come from the same assembly, which means the Prodigal and DIAMOND steps process the same contigs 7 times over.

As far as I can see, the DIAMOND step takes most of the time in a CheckM2 run. So ideally I would like to be able to reuse the Prodigal and DIAMOND outputs to analyze any other bin set made from the same assembly. This would greatly reduce metawrap's execution time!

Thanks !
Jean
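Conceptually the request is a cache keyed by contig: since every bin set comes from one assembly, each contig needs Prodigal/DIAMOND treatment only once. A toy sketch of that reuse (the names and the `annotate` callable are illustrative, not CheckM2 internals):

```python
# Annotate each contig of the shared assembly once, then build per-bin
# annotation sets by lookup instead of re-running the expensive steps.

def annotate_once(contig_ids, annotate):
    """Run the expensive annotation a single time per unique contig."""
    return {cid: annotate(cid) for cid in set(contig_ids)}

def annotations_for_bin(bin_contigs, cache):
    """Reuse cached results for any bin built from the same assembly."""
    return {cid: cache[cid] for cid in bin_contigs}

calls = []
cache = annotate_once(["c1", "c2", "c3"],
                      lambda c: calls.append(c) or f"hits:{c}")
# Two different bin sets over the same assembly share the cache:
binA = annotations_for_bin(["c1", "c2"], cache)
binB = annotations_for_bin(["c2", "c3"], cache)
print(len(calls))  # 3
```

Here the annotation ran 3 times in total, once per contig, no matter how many bin sets are evaluated.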

Database path

Hi!

I am working in an HPC environment, where CheckM2 is provided as an environment module.
Is there any other way of defining the path to the DIAMOND database than using:

$ checkm2 database --setdblocation

This causes a permission error, since the command physically modifies a config file (inside the module, where I do not have write permissions).

As far as I can see, the program doesn't have a default database path.

Best
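For what it's worth, recent CheckM2 versions appear to offer two config-free alternatives: a `CHECKM2DB` environment variable and a `--database_path` flag on `checkm2 predict` (please verify both against your installed version). The lookup order sketched below, CLI flag first, then environment, then config, is my assumption of sensible behaviour, not CheckM2's actual code:

```python
# Hedged sketch of a per-user database override resolved without touching
# the (read-only) module config file.
import os

def resolve_db_path(cli_path=None, config_path=None):
    """Prefer an explicit CLI path, then CHECKM2DB, then the module config."""
    if cli_path:
        return cli_path
    env = os.environ.get("CHECKM2DB")
    if env:
        return env
    return config_path

os.environ["CHECKM2DB"] = "/scratch/db/uniref100.KO.1.dmnd"
print(resolve_db_path())  # /scratch/db/uniref100.KO.1.dmnd
```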

Some questions about NCBI archaeal genomes predicted as Low-quality draft

Dear Chklovski,
I performed quality prediction on all NCBI RefSeq archaeal genomes (n=1360, downloaded on 2023-01-06) using CheckM2.
As a result, 52 of the RefSeq archaeal genomes were predicted to be Low-quality drafts (Completeness <50% with Contamination <10%).
I had assumed that RefSeq archaeal genomes would be predicted as High-quality drafts.
Their status in RefSeq seems to be 'suppressed', and most of them are partial, but they are still listed in RefSeq.
The quality report for the 52 genomes is as follows:
GCF_000372505.1 7.02 0.02 Neural Network (Specific Model) 11 0.944 31458 308.0 233168 0.24 239 None
GCF_000382725.2 10.42 0.02 Neural Network (Specific Model) 11 0.89 6144 265.09005628517826 474487 0.35 533 None
GCF_000192595.1 5.21 0.01 Neural Network (Specific Model) 11 0.856 24163 216.18333333333334 45346 0.35 60 None
GCF_000192615.1 15.3 0.0 Neural Network (Specific Model) 11 0.915 31882 253.72727272727272 100212 0.35 121 None
GCF_000382765.1 20.16 0.27 Neural Network (Specific Model) 11 0.938 22214 338.16582914572865 429678 0.46 398 None
GCF_000380865.1 15.27 0.09 Neural Network (Specific Model) 11 0.831 28977 183.99468085106383 248182 0.54 376 None
GCF_000380685.1 38.25 0.3 Neural Network (Specific Model) 11 0.905 25175 278.3911764705882 624620 0.63 680 None
GCF_000402935.1 20.33 0.0 Neural Network (Specific Model) 11 0.9 40278 266.9569536423841 267561 0.58 302 None
GCF_000402915.1 49.95 0.0 Neural Network (Specific Model) 11 0.914 20269 266.9151436031332 667371 0.62 766 None
GCF_000402965.1 41.12 0.24 Neural Network (Specific Model) 11 0.92 16921 269.7848258706468 703726 0.57 804 None
GCF_000398485.1 31.96 0.33 Neural Network (Specific Model) 11 0.904 24529 280.14964788732397 525966 0.63 568 None
GCF_000405905.1 46.62 0.06 Neural Network (Specific Model) 11 0.908 26075 268.1021377672209 741841 0.62 842 None
GCF_000398505.1 40.54 0.76 Neural Network (Specific Model) 11 0.902 31581 271.7974683544304 640245 0.57 711 None
GCF_000405925.1 15.44 0.04 Neural Network (Specific Model) 11 0.905 11690 244.55392156862746 164795 0.63 204 None
GCF_000398525.1 20.52 0.0 Neural Network (Specific Model) 11 0.909 18425 261.6632302405498 249792 0.6 291 None
GCF_000398545.1 11.45 0.0 Neural Network (Specific Model) 11 0.922 26696 285.12820512820514 252228 0.49 273 None
GCF_000398565.1 23.54 0.08 Neural Network (Specific Model) 11 0.922 16590 268.9789029535865 412511 0.52 474 None
GCF_000398585.1 10.93 0.0 Neural Network (Specific Model) 11 0.875 13490 256.3568281938326 198883 0.63 227 None
GCF_000405745.1 19.49 0.0 Neural Network (Specific Model) 11 0.898 29390 263.81024096385545 291490 0.45 332 None
GCF_000402475.1 18.96 0.0 Neural Network (Specific Model) 11 0.921 13811 247.81496062992127 204081 0.34 254 None
GCF_000404305.1 47.56 0.23 Gradient Boost (General Model) 11 0.895 35139 255.77304964539007 361035 0.38 423 None
GCF_000371945.1 40.24 0.75 Neural Network (Specific Model) 11 0.863 8800 217.33881897386254 775531 0.39 1033 None
GCF_000375605.1 22.4 0.03 Neural Network (Specific Model) 11 0.863 10933 218.59248554913296 523162 0.3 692 None
GCF_000372145.1 31.64 0.5 Neural Network (Specific Model) 11 0.861 12646 222.76753246753248 593616 0.33 770 None
GCF_000375665.1 17.76 0.1 Neural Network (Specific Model) 11 0.876 7044 215.02351097178683 467463 0.43 638 None
GCF_000376005.1 33.98 0.03 Neural Network (Specific Model) 11 0.875 9339 226.77879714576963 758618 0.3 981 None
GCF_000375985.1 39.12 0.87 Neural Network (Specific Model) 11 0.889 12223 247.26695842450766 758640 0.53 914 None
GCF_000376025.1 40.3 2.51 Neural Network (Specific Model) 11 0.882 10391 239.4714765100671 965738 0.64 1192 None
GCF_000349625.1 14.5 0.35 Neural Network (Specific Model) 11 0.813 13721 219.9969465648855 529171 0.43 655 None
GCF_000364885.1 25.91 0.9 Neural Network (Specific Model) 11 0.833 10367 211.4298469387755 593453 0.31 784 None
GCF_000349685.1 33.53 0.25 Neural Network (Specific Model) 11 0.857 12904 236.6001589825119 1037251 0.37 1258 None
GCF_001315925.1 46.15 1.02 Gradient Boost (General Model) 11 0.811 34784 130.14622178606476 1920275 0.59 4076 None
GCF_000484955.1 28.56 0.67 Neural Network (Specific Model) 11 0.925 12024 234.4126679462572 394893 0.36 521 None
GCF_000484915.1 20.97 0.27 Neural Network (Specific Model) 11 0.86 16111 248.33558178752108 512945 0.25 593 None
GCF_000402255.1 28.8 0.01 Neural Network (Specific Model) 11 0.905 18127 230.38832487309645 299490 0.33 394 None
GCF_000415945.1 37.89 0.21 Neural Network (Specific Model) 4 0.801 2040945 216.5785992217899 2040945 0.61 2570 None
GCF_000494185.1 40.45 0.22 Neural Network (Specific Model) 11 0.92 19616 271.79487179487177 688171 0.54 780 None
GCF_000494125.1 38.14 0.25 Neural Network (Specific Model) 11 0.907 25175 280.8230884557721 617045 0.61 667 None
GCF_000494165.1 49.89 0.0 Neural Network (Specific Model) 11 0.915 24396 268.8887399463807 654645 0.64 746 None
GCF_000496235.1 46.53 0.66 Neural Network (Specific Model) 4 0.84 304570 303.180260707635 2876249 0.59 2685 None
GCF_000746695.1 30.61 1.16 Neural Network (Specific Model) 11 0.923 64295 239.33985765124555 435918 0.33 562 None
GCF_000746705.1 41.51 0.01 Neural Network (Specific Model) 11 0.91 20264 229.83549783549785 522768 0.34 693 None
GCF_902158755.1 37.31 0.0 Neural Network (Specific Model) 11 0.909 32599 274.0403587443946 803504 0.47 892 None
GCF_012927665.1 9.09 0.0 Neural Network (Specific Model) 11 0.934 50193 357.15441176470586 155723 0.45 136 None
GCF_012927685.1 22.19 0.11 Neural Network (Specific Model) 11 0.871 28343 259.7018469656992 337773 0.47 379 None
GCF_012927695.1 45.54 0.66 Neural Network (Specific Model) 11 0.91 63768 273.8136826783115 619468 0.44 687 None
GCF_012927715.1 25.28 0.1 Neural Network (Specific Model) 11 0.937 28455 271.8103975535168 284101 0.42 327 None
GCF_012927765.1 31.69 0.01 Neural Network (Specific Model) 11 0.918 41938 281.08061002178647 420777 0.44 459 None
GCF_905067475.1 34.87 0.05 Neural Network (Specific Model) 11 0.866 16637 259.2754662840746 623357 0.41 697 None
GCF_905067525.1 24.48 0.0 Neural Network (Specific Model) 11 0.87 14935 264.9917355371901 551157 0.44 605 None
GCF_905067565.1 26.05 0.11 Neural Network (Specific Model) 11 0.872 9703 242.27331606217618 640119 0.44 772 None
GCF_920984865.1 17.86 2.06 Neural Network (Specific Model) 11 0.599 607503 89.31677465802736 607503 0.24 1389 None

Increase protein search stringency? (Low-homology problem)

The original checkm used hmmsearch with an HMM database of marker genes. This is arguably the most accurate and sensitive method for detecting remote homology in protein alignment, particularly when a sequence is notably diverged from any individual reference sequence in the database but still maintains the characteristic conservation pattern in the highly conserved sites (identified in an MSA).

The second most sensitive protein search is blastp and its ilk (pairwise sensitive alignment).

In third place is the default "fast" search mode of diamond. The difference between diamond's default "fast" and its --ultra-sensitive is well-documented (including in its own publication), with difference increasing with sequence divergence. The difference between pairwise blastp and a well-constructed HMM from a rigorous MSA is also well understood in phylogenetics and remote homology detection.

Can you justify why hmms have been dropped in favor of pairwise protein alignments, and further can you explain why the default "fast" mode is used in the more-heuristic diamond aligner rather than blastp or the near-equivalent (but faster) --ultra-sensitive mode of diamond?

Further, I see UniRef100 was used (which does not contain many proteins found in metagenomes, e.g. in MGnify) as a proxy for annotating proteins with KEGG IDs; that in turn relies on the assignment of KEGG IDs to those original UniRef proteins, likely from an even older version of KEGG and via further heuristics. Propagating heuristics from the initial KEGG assignments to those proteins, and then from those proteins to new proteins, has potential for error, especially with more distant homologies (which are less accurate anyway under DIAMOND's default fast blastp parameters).

Most strikingly, these multiple layers of abstraction are unnecessary when the actual KEGG HMMs already exist publicly (see KofamKOALA's FTP) and are the standard for assigning KEGG IDs de novo. It seems CheckM2 is playing a long game of telephone, chaining multiple loose heuristics together and relying on a machine learning black box to sort through it all. Why not simply pull or redistribute the subset of HMMs used by the model and align to those with hmmsearch? It's direct, more sensitive, more exact, and could potentially lead to one unified model for all homologies.

Curious to hear your thoughts.

Thanks!
Gabe
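For readers unfamiliar with the distinction Gabe draws, a toy profile built from an MSA illustrates why position-specific scoring behaves differently from pairwise alignment: conserved columns dominate the score, so a diverged query can still score well if it keeps the conserved sites. This is an illustration only, not CheckM2 or HMMER code (real profile HMMs also model gaps and use log-odds against a background distribution):

```python
# Build a per-column residue-frequency profile from a toy MSA, then score a
# query by summing its residue's frequency at each aligned position.
from collections import Counter

def column_profile(msa):
    """Per-column residue frequencies from equal-length aligned sequences."""
    profile = []
    for col in zip(*msa):
        counts = Counter(col)
        profile.append({aa: n / len(col) for aa, n in counts.items()})
    return profile

def profile_score(profile, query):
    """Sum of the query residue's frequency at each column."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, query))

msa = ["MKLV", "MKIV", "MKLA"]      # columns 1-2 are fully conserved
prof = column_profile(msa)
print(profile_score(prof, "MKAV"))  # conserved M and K carry the score
```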

UnicodeEncodeError

Hi all!

When I run the checkm2 program, it seems like all the calculations finish, but the final results are not written because of a UnicodeEncodeError.

Maybe someone has faced the same problem and knows how to fix it?

[10/27/2022 02:01:40 PM] INFO: Running quality prediction workflow with 30 threads.
[10/27/2022 02:01:44 PM] INFO: Calling genes in 109 bins with 30 threads:
Finished processing 109 of 109 (100.00%) bins.
[10/27/2022 02:03:32 PM] INFO: Calculating metadata for 109 bins with 30 threads:
Finished processing 109 of 109 (100.00%) bin metadata.
[10/27/2022 02:03:34 PM] INFO: Annotating input genomes with DIAMOND using 30 threads
Traceback (most recent call last):
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/bin/checkm2", line 4, in <module>
    __import__('pkg_resources').run_script('CheckM2==0.1.3', 'checkm2')
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/CheckM2-0.1.3-py3.6.egg/EGG-INFO/scripts/checkm2", line 154, in <module>
    predictor.prediction_wf(args.genes, mode, args.dbg_cos, args.dbg_vectors, args.stdout)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/CheckM2-0.1.3-py3.6.egg/checkm2/predictQuality.py", line 103, in prediction_wf
    diamond_out = diamond_search.run(prodigal_files)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/CheckM2-0.1.3-py3.6.egg/checkm2/diamond.py", line 118, in run
    self.__call_diamond(protein_chunks, diamond_out)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/CheckM2-0.1.3-py3.6.egg/checkm2/diamond.py", line 73, in __call_diamond
    sequenceClasses.SeqReader().write_fasta(seq_object, temp_diamond_input.name)
  File "/gpfs/space/home/pantiukh/.conda/envs/checkm2/lib/python3.6/site-packages/CheckM2-0.1.3-py3.6.egg/checkm2/sequenceClasses.py", line 104, in write_fasta
    fout.write('>' + seqId + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\u03a9' in position 10: ordinal not in range(128)
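The failing frame is write_fasta encoding a sequence header (here containing 'Ω', \u03a9) with the process's default codec, which under a C/POSIX locale is ASCII. A minimal reproduction and the usual fix, opening the handle with an explicit encoding="utf-8" (assumed here; an alternative workaround is exporting a UTF-8 locale, e.g. LANG=en_US.UTF-8, before running):

```python
# Reproduce the failure with an ASCII-encoded handle, then show the fix.
import tempfile

header = "contig_10_\u03a9"  # a non-ASCII character like the one in the traceback

with tempfile.NamedTemporaryFile("w", encoding="ascii", delete=False) as f:
    try:
        f.write(">" + header + "\n")
    except UnicodeEncodeError as e:
        print("ascii handle fails:", e.reason)

with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as f:
    f.write(">" + header + "\n")  # succeeds with an explicit utf-8 encoding
```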

Runs out of /tmp space. Needs option to specify temp dir

On modern systems, /tmp is often a RAM-backed tmpfs.
On a 1 TB RAM node, the 500 GB of RAM backing /tmp is easily exhausted during the DIAMOND run if multiple instances are running with hundreds or thousands of MAGs each.

The solution is to allow the user to specify a temporary directory. If this is already possible, I don't see a flag for it.

Results are sometimes silently corrupt if the diamond blast file is truncated once /tmp is exhausted.
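Such a flag would likely be cheap to support: the tempfile calls in diamond.py accept a dir= argument and also honour the TMPDIR environment variable, and DIAMOND itself has a --tmpdir option that could be passed through. A sketch of the Python side (make_scratch_file is illustrative, not CheckM2's API):

```python
# Redirect scratch files off a RAM-backed /tmp by passing an explicit
# directory to tempfile.
import tempfile

def make_scratch_file(tmpdir=None):
    """Create a scratch file in tmpdir (or the default temp location)."""
    handle = tempfile.NamedTemporaryFile(dir=tmpdir, delete=False)
    handle.close()
    return handle.name

big_disk = tempfile.mkdtemp()     # stand-in for e.g. /scratch on a large disk
path = make_scratch_file(big_disk)
print(path.startswith(big_disk))  # True
```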

OSError: [Errno 95] Operation not supported

Hi CheckM2 team,

I ran checkm2 predict and it shows the error below:

[03/09/2023 07:24:51 AM] INFO: Running CheckM2 version 1.0.1
[03/09/2023 07:24:51 AM] INFO: Custom database path provided for predict run. Checking database at CheckM2_database/uniref100.KO.1.dmnd...
[03/09/2023 07:26:00 AM] INFO: Running quality prediction workflow with 8 threads.
[03/09/2023 07:26:01 AM] INFO: Calling genes in 7 bins with 8 threads:
Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/managers.py", line 608, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/managers.py", line 154, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: [Errno 95] Operation not supported
Traceback (most recent call last):
  File "/opt/conda/envs/checkm2_env/bin/checkm2", line 210, in <module>
    predictor.prediction_wf(args.genes, mode, args.dbg_cos, args.dbg_vectors, args.stdout,
  File "/opt/conda/envs/checkm2_env/lib/python3.8/site-packages/checkm2/predictQuality.py", line 103, in prediction_wf
    GC = self.__run_prodigal(ttable)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/site-packages/checkm2/predictQuality.py", line 380, in __run_prodigal
    used_ttables = mp.Manager().dict()
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/managers.py", line 583, in start
    self._address = reader.recv()
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/checkm2_env/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Do you have any idea how to troubleshoot that?

Best regards,

Quang
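A likely cause, as a guess from the traceback: `multiprocessing.Manager()` binds an AF_UNIX socket inside the process's temp directory, and some filesystems (network or overlay mounts sometimes used for /tmp in containers and on HPC nodes) cannot host Unix sockets, so the bind fails with `OSError: [Errno 95] Operation not supported`. Pointing `TMPDIR` at a local, socket-capable filesystem before the Manager starts is a common workaround. A sketch (hypothetical helper, not part of CheckM2):

```python
import multiprocessing as mp
import os
import tempfile

def manager_with_tmpdir(tmpdir):
    """Start a multiprocessing Manager whose Unix socket lives in tmpdir.

    Redirecting TMPDIR to a local filesystem avoids Errno 95 when the
    default temp directory sits on a mount without Unix-socket support.
    """
    os.environ["TMPDIR"] = tmpdir
    tempfile.tempdir = None  # force tempfile to re-read TMPDIR
    return mp.Manager()
```

The equivalent shell-level workaround is simply `export TMPDIR=/path/to/local/dir` before invoking checkm2.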

Adding to Bioconda

Hi CheckM2 team,

Congratulations on this exciting release. Can you add it to Bioconda?

Best,
Liam
