fpusan / superpang

Non-redundant pangenome assemblies from multiple genomes or bins

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 91.25%, Cython 8.75%
Topics: bioinformatics, pangenome, genomics, metagenomics

superpang's Introduction

SuperPang: non-redundant pangenome assemblies from multiple genomes or bins

Check out our paper: Puente-Sánchez F, Hoetzinger M, Buck M and Bertilsson S. Exploring intra-species diversity through non-redundant pangenome assemblies. Molecular Ecology Resources (2023). DOI: 10.1111/1755-0998.13826

... but note that performance is now better (3x lower memory usage, 20% faster execution) than when we first benchmarked SuperPang.

Installation

Requires graph-tool, speedict, mOTUlizer v0.2.4, minimap2 and mappy. The easiest way to get it running is using conda.

# Install into a new conda environment
conda create -n SuperPang -c conda-forge -c bioconda -c fpusan superpang
# Check that it works for you!
conda activate SuperPang
test-SuperPang.py

Usage

SuperPang.py --fasta <genome1.fasta> <genome2.fasta> <genomeN.fasta> --checkm <check_results> --output-dir <output_directory>

Input files and choice of parameters

  • The input genomes can be genomes from isolates, MAGs (Metagenome-Assembled Genomes) or SAGs (Single-cell Assembled Genomes).
  • The input genomes can have different qualities; for normal usage we recommend that you provide completeness estimates for each input genome through the -q/--checkm parameter.
  • If you are certain that all your input genomes are complete, you can use the --assume-complete flag, or manually tweak the -a/--genome-assignment-threshold and -x/--default-completeness parameters instead of providing a file with completeness estimates.
  • The default parameter values in SuperPang assume that all of the input genomes come from the same species (ANI>=0.95). This can be controlled by changing the -i/--identity_threshold and -b/--bubble-identity-threshold parameters to the expected ANI; an example command is sketched after this list. However, note that SuperPang has currently only been tested on species-level clusters.
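
For instance, a hedged sketch of a run on genomes known to be complete that form a somewhat broader cluster (expected ANI around 0.90; file names are placeholders, and keep in mind the species-level caveat above):

# Hypothetical example: complete genomes, identity thresholds lowered to the expected ANI
SuperPang.py --fasta genome1.fasta genome2.fasta genomeN.fasta --assume-complete -i 0.90 -b 0.90 --output-dir broader_cluster_pangenome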

Arguments

  • -f/--fasta: Input fasta files with the sequences for each bin/genome, or a single file containing the path to one input fasta file per line (see the example after this list).
  • -q/--checkm: CheckM output for the bins. This can be the STDOUT of running CheckM on all the fasta files passed in --fasta, or a tab-delimited file ending with a .tsv extension, in the form genome1 percent_completeness (see the example after this list). Genome names should not contain the file extension (e.g. .fna). If empty, completeness will be estimated by mOTUpan, but this may lead to incorrect estimates for very incomplete genomes.
  • -i/--identity_threshold: Identity threshold (fraction) to initiate correction with minimap2. Values of 1 or higher will skip the correction step entirely. Default 0.95.
  • -m/--mismatch-size-threshold: Maximum contiguous mismatch size that will be corrected. Default 100.
  • -g/--indel-size-threshold: Maximum contiguous indel size that will be corrected. Default 100.
  • -r/--correction-repeats: Maximum iterations for sequence correction. Default 20.
  • -n/--correction-repeats-min: Minimum iterations for sequence correction. Default 5.
  • -k/--ksize: Kmer-size. Default 301.
  • -l/--minlen: Scaffold length cutoff. Default 0 (no cutoff).
  • -c/--mincov: Scaffold coverage cutoff. Default 0 (no cutoff).
  • -b/--bubble-identity-threshold: Minimum identity (matches / length) required to remove a bubble in the sequence graph. Default 0.95.
  • -a/--genome-assignment-threshold: Fraction of shared kmers required to assign a contig to an input genome (0 means a single shared kmer is enough). (DEPRECATED)
  • -x/--default-completeness: Default genome completeness to assume if a CheckM output is not provided with --checkm. Default 70.
  • -t/--threads: Number of processors to use. Default 1.
  • -o/--output: Output directory. Default output.
  • -d/--temp-dir: Directory for temp files. Default tmp.
  • -u/--header-prefix: Prefix to be added to output sequence names. No prefix is added by default.
  • --assume-complete: Assume that the input genomes are complete (--default-completeness 99).
  • --lowmem: Use disk storage instead of memory when possible; reduces memory usage at the cost of execution time.
  • --minimap2-path: Path to the minimap2 executable. Default minimap2.
  • --keep-intermediate: Keep intermediate files.
  • --keep-temporary: Keep temporary files.
  • --verbose-mOTUpan: Print out mOTUpan logs.
  • --nice-headers: Removes semicolons from non-branching-path names.
  • --output-as-file-prefix: Use the output dir name also as a prefix for output file names.
  • --force-overwrite: Write results even if the output directory already exists.
  • --debug: Run additional sanity checks (increases execution time).
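
As an illustration of the input formats described for -f/--fasta and -q/--checkm above (all file and genome names below are placeholders, not part of SuperPang):

# Hypothetical completeness table: genome name (without file extension) and completeness
# percentage, TAB-separated, one genome per line, saved with a .tsv extension
printf 'genome1\t95.3\ngenome2\t78.1\ngenomeN\t88.0\n' > completeness.tsv

# Hypothetical file list for --fasta: one path to an input fasta per line
ls /path/to/genomes/*.fasta > input_fastas.txt

SuperPang.py --fasta input_fastas.txt --checkm completeness.tsv --output-dir output -t 4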

Output

  • assembly.fasta: contigs.
  • assembly.info: core/auxiliary and path information for each contig.
  • NBPs.fasta: non-branching paths.
  • NBPs.core.fasta: non-branching paths deemed to belong to the core genome of the species by mOTUpan.
  • NBPs.accessory.fasta: non-branching paths deemed to belong to the accessory genome of the species.
  • NBP2origins.tsv: tab-separated file with, for each non-branching path: its ID, a comma-separated list of the input sequences in which it was deemed present, a comma-separated list of the input genomes in which it was deemed present, and the number of input genomes in which it was deemed present (see the example after this list).
  • graph.fastg: assembly graph in a format compatible with Bandage.
  • graph.NBP2origins.csv: file with similar structure as NBP2origins.tsv, formatted for use together with the "Load CSV file" option in Bandage. This allows using the information in the file as node labels in Bandage.
  • params.tsv: parameters used in the run.
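
For example, assuming the four-column layout described above for NBP2origins.tsv (NBP ID, input sequences, input genomes, number of genomes) and a hypothetical run with 10 input genomes, the NBPs detected in every genome could be listed with a one-liner such as:

# Keep rows whose last column (number of input genomes carrying that NBP) equals 10
awk -F'\t' '$4 == 10' NBP2origins.tsv > NBPs_in_all_10_genomes.tsv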

About

SuperPang is developed by Fernando Puente-Sánchez (Sveriges lantbruksuniversitet). Feel free to open an issue or reach out for support at [email protected].

superpang's People

Contributors

fpusan

superpang's Issues

There was an error running homogenize.py

Hi!
Thank you for the tool and your active support here.
Following another issue (from SqueezeMeta), I ran mOTUlizer on bins from numerous independent SqueezeMeta projects and got a list of mOTUs with MAGs. Now I am running SuperPang on each mOTU, as suggested. There are no issues with most of the mOTUs, but one specific mOTU contains about 8000 MAGs, so SuperPang is giving me an error:

There was an error running homogenize.py. Please open an issue

Traceback (most recent call last):
  File "/home/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/utils.py", line 61, in write_fastq
    outfile.write(f'@{name}\n{seq}\n+\n{qual}\n')
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miniconda3/envs/SuperPang/bin/homogenize.py", line 332, in <module>
    main(parse_args())
  File "/home/miniconda3/envs/SuperPang/bin/homogenize.py", line 24, in main
    fasta2fastq(args.fasta, current1)
  File "/home/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/utils.py", line 65, in fasta2fastq
    write_fastq(read_fasta(fasta), fastq)
  File "/home/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/utils.py", line 61, in write_fastq
    outfile.write(f'@{name}\n{seq}\n+\n{qual}\n')
OSError: [Errno 28] No space left on device

I run it on an HPC with 120 GB of RAM (the HPC report tells me that only around 30 GB of RAM were used when the error occurred) and several TB of storage. I also tried to point the temp directory to a larger disk, with no success.

My question is whether it is "legal" to split the 8000 MAGs into subsets, create assemblies with SuperPang, and then run SuperPang on those assemblies to get one final assembly.
Or is there a better approach to handle this?
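
For reference, a hedged sketch of how the temporary directory can be redirected with the -d/--temp-dir option documented in the README (paths and file names are placeholders):

# Point SuperPang's temporary files at a filesystem with plenty of free space
SuperPang.py --fasta motu_fastas.txt --checkm completeness.tsv --output-dir motu_pangenome --temp-dir /scratch/superpang_tmp -t 20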

errors when running test-SuperPang.py

The error message is as follows:

Reconstructing contigs
  File "/home/sl/.conda/envs/SuperPang/bin/condense-edges.py", line 254, in <module>
    main()
  File "/home/sl/.conda/envs/SuperPang/bin/condense-edges.py", line 60, in main
    GS.add_edge_list(edges)
  File "/home/sl/.conda/envs/SuperPang/lib/python3.8/site-packages/graph_tool/__init__.py", line 2603, in add_edge_list
    libcore.add_edge_list_iter(self.__graph, edge_list, eprops)
TypeError: No registered converter was able to produce a C++ rvalue of type double from this Python object of type Vertex

There was an error running condense-edges.py.

IndexError: list index out of range

Error in ".../anaconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 829, in reconstruct_sequence
    if k[:-1] == kmers[-1][1:]:
IndexError: list index out of range

multiple file paths for `--fasta`

The potential problem with multiple args for --fasta (as in the README: SuperPang.py --fasta <genome1.fasta> <genome2.fasta> <genomeN.fasta>) is that if the user has hundreds (or thousands) of genomes and uses full file paths, the command length can exceed the character limits of the OS. It would be helpful to allow the user to provide a file listing all input fasta files (one per line).
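
A hedged sketch of that workflow, assuming the genomes sit under a hypothetical genomes/ directory (the README above documents that --fasta accepts such a list file):

# Write one fasta path per line to a list file, then pass the list file to --fasta
ls genomes/*.fasta > input_fastas.txt
SuperPang.py --fasta input_fastas.txt --checkm checkm_results.tsv --output-dir output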

IndexError: list index out of range

Hello,

I have installed SuperPang (v0.8) with conda. test-SuperPang.py runs normally, but SuperPang.py gives an error:

Traceback (most recent call last):
  File "/home/miniconda3/envs/SuperPang/bin/SuperPang.py", line 270, in <module>
    main(parse_args())
  File "/home/miniconda3/envs/SuperPang/bin/SuperPang.py", line 92, in main
    bin_ = f.rsplit('/',1)[1].rsplit('.',1)[0]
IndexError: list index out of range

the pangenome size is larger than the total size of the input genomes

When generating the pangenome of the HIV-2 species, I got a 421,130 bp pangenome, which is larger than the total size of the input genomes.
How should I interpret this situation?
SuperPang.py --fasta 11709/genome/*.fa --output-dir 11709/pangenome --force-overwrite -t 20 --assume-complete -b 0.95 -i 0.95 -k 301

mOTUpan

Hey guys,

Naturally I went all in with 200 MAGs as a "test"

After 24 h, mOTUpan starts running and checks the CheckM output:

AssertionError: your completness files is badly formed, it should be TAB-separated (multispaced...) and needs a header line with 'Bin Id', 'Completeness', and 'Contamination' in it

-q/--checkm: CheckM output for the bins. This can be the STDOUT of running checkm on all the fasta files passed in --fasta, or a tab-delimited file in the form genome1 percent_completeness. If empty, completeness will be estimated by mOTUpan but this may lead to wrong estimations for very incomplete genomes.

Obviously I provided the format specified in the help, but mOTUpan says this is wrong.

In any case, maybe a check of the input files should occur at the start of the pipeline?
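
For reference, a hedged sketch of a completeness table laid out the way the AssertionError above describes (column names are taken from the error message; bin names and values are placeholders):

# Header line with 'Bin Id', 'Completeness' and 'Contamination', then one TAB-separated row per bin
printf 'Bin Id\tCompleteness\tContamination\n' > completeness_with_header.tsv
printf 'MAG_001\t96.5\t2.1\n' >> completeness_with_header.tsv
printf 'MAG_002\t88.0\t4.3\n' >> completeness_with_header.tsv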

ValueError: invalid vertex: 3; condense-edges.py error.

Hi,

I was trying to run SuperPang on 50 MAGs (supposedly one species). It returned the following error:

Traceback (most recent call last):
  File "/lustre/BIF/nobackup/hrab001/micromamba/envs/superpang/lib/python3.8/site-packages/superpang/scripts/condense-edges.py", line 278, in <module>
    main(parse_args())
  File "/lustre/BIF/nobackup/hrab001/micromamba/envs/superpang/lib/python3.8/site-packages/superpang/scripts/condense-edges.py", line 222, in main
    predecessors, successors = G2dicts(GS, name2vertex, vertex2name)
  File "/lustre/BIF/nobackup/hrab001/micromamba/envs/superpang/lib/python3.8/site-packages/superpang/scripts/condense-edges.py", line 238, in G2dicts
    for pv in GS.get_in_neighbors(v):
  File "/lustre/BIF/nobackup/hrab001/micromamba/envs/superpang/lib/python3.8/site-packages/graph_tool/__init__.py", line 2441, in get_in_neighbors
    vertices = libcore.get_in_neighbors_list(self.__graph, int(v),
ValueError: invalid vertex: 3

There was an error running condense-edges.py. Please open an issue

Can you advise on resolving it?

Best,
Pavlo

Error after identifying connected components

Hi! I have an error after the "identifying connected components" step.
I installed SuperPang using conda and I am working with 4 bins (>90% completeness, <5% contamination) from the same species (ANI > 97%). test-SuperPang.py runs without problems.
The error occurs after running:
$SuperPang.py -f genomes/*fa -t 20 -o SuperPang --force-overwrite --checkm checkm_sox.txt

Traceback (most recent call last):
  File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 291, in <module>
    main(parse_args())
  File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 144, in main
    contigs = Assembler(input_minimap2, args.ksize, args.threads).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads)
  File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 322, in run
    psets = get_psets(comp2nvs[nc1] | comp2nvs[nc2])
  File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 292, in get_psets
    assert len(nvs_) == 2
AssertionError

The CheckM file has the format Bin_Id Completeness Contamination, but I get the same error after running SuperPang with the --assume-complete flag.

There was an error running homogenize.py

Hi!

I am working with around 450 genomes. I am also getting the same error when using the 0.9.4.beta1 version.

But the latest version is giving me this error:

ImportError: /usr/lib64/libc.so.6: version `GLIBC_2.25' not found (required by /mmfs1/home/azk0151/miniconda3/envs/SuperPang/lib/python3.8/site-packages/speedict/speedict.cpython-38-x86_64-linux-gnu.so)

Can you please help me solve this issue?

Thank you!

Originally posted by @azk001 in #10 (comment)
