hecatomb's People

Contributors

beardymcjohnface, henr0089, kant, leran10, linsalrob, mihinduk, mroach-awri, nhatas, rachelrodgers, sarahbeecroft, shandley

hecatomb's Issues

Same output files from the same dataset have different sizes

Hi,

My colleague and I are running hecatomb on the same dataset, but the output tables we each generated have different numbers of rows or sequences:

mine:

wc -l bigtable.tsv
**33073** bigtable.tsv

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,014**  28,329,286       90    228.4      250

My colleague:

wc -l assembly.fasta
**23998** assembly.fasta

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,041**  28,332,842       90    228.4      250

We are not sure whether these steps actually finished, because we both failed at the "sankey_diagram" step.

I wanted to check the .err files in the LOG folder to see whether the steps that generated those files had finished, but I couldn't work out which folders were the right ones to look in.

I think it would be very helpful if a final .log file saying "The entire hecatomb pipeline has successfully finished!" were generated once the whole pipeline is done.
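In the meantime, one rough way to check whether a run went to completion (a sketch; it assumes the run was launched from the current directory and that the usual Snakemake completion line made it into the log):

LOG=$(ls -t .snakemake/log/*.snakemake.log | head -n 1)   # most recent Snakemake log
grep -E '[0-9]+ of [0-9]+ steps \(100%\) done' "$LOG" && echo "run appears complete" || echo "run did not finish"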

Thanks!
Leran

kill switch

Hi Mike,

Can you add to the documentation how to kill hecatomb? I killed the tmux session, but it is my understanding that to kill snakemake, I have to kill each running process individually.
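For anyone else hitting this, a rough manual procedure on a SLURM cluster (a sketch; whether Snakemake has already cancelled its submitted jobs depends on how it was interrupted):

squeue -u $USER    # see which jobs snakemake still has queued or running
scancel -u $USER   # cancel all of your jobs, or pass individual job IDs instead
# snakemake usually leaves a directory lock behind after being killed; clear it before the next run,
# e.g. hecatomb run <your usual arguments> --snake=--unlock (the --snake= passthrough shown elsewhere in these issues)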

Kathie

megahit -> flye reformatting

Hi Scott,
This is the current reformat command between megahit and flye:

reformat.sh in={output.rename} out={output.size} \
        ml={config[MINLENGTH]} \
        ow=t \
        -Xmx{config[System][Memory]}g;

If you try to run flye with the same MINLENGTH as megahit, you may fail immediately because of duplicate names. This can be solved by adding this to the reformat command:
uniquenames=t
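For reference, the amended command would look something like this (a sketch of the rule above with the suggested flag added):

reformat.sh in={output.rename} out={output.size} \
        ml={config[MINLENGTH]} \
        uniquenames=t \
        ow=t \
        -Xmx{config[System][Memory]}g;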

Kathie

three concerns with the convert_phyloseq_euk_viral_glom.r standardized count table (species)

1.) Check that coercion from factor to numeric is behaving as intended, for example:

# Adjust values
ps.melt.value.sp <- ps.melt.sp %>%
  select(c((ncol(MAP)+11):(ncol(ps.melt.sp)))) %>%
   mutate_if(is.factor, ~ as.numeric(levels(.x))[.x])

Coercion from factor to numeric requires coercing to character first.

2.) Merging the Baltimore classification table with the melted phyloseq data using a left_join() will likely drop taxa from your table (ie: families in your melted table that are not present in your Baltimore file), as the keep argument defaults to FALSE. Is this the intended behavior? If not, I would recommend changing this parameter to TRUE, then filtering any rows added to the table that have an empty OTU (the Baltimore file will pull in some families that may not match to any families in your melted table).

# may lose information here:
ps.melt.fixed.sp <- left_join(ps.melt.fixed.sp, baltimore, by = "Family")

3.) When generating the standardized count object at the genus level, the default option for tax_glom is to drop any taxa for which you are missing information at the specified rank. In this case it should never be an issue (NAs aren't expected) but it would be safer to go ahead and set this parameter to FALSE in the event something upstream has gone wrong:

ps0.ge.glom <- ps0.sp %>%
  speedyseq::tax_glom("Genus", NArm = FALSE)

Hecatomb v1.0.0.beta.2 crash

Hi Mike,

I only got 86% of the way through this time:

[Thu Nov 4 02:02:55 2021]
Finished job 1885.
1894 of 2200 steps (86%) done
Exiting because a job execution failed. Look above for error message

FATAL: Hecatomb encountered an error.
       Dumping all error logs to "hecatomb.errorLogs.txt"
Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log

Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log:
rule concatentate_contig_count_tables:
benchmark: hecatomb_out/BENCHMARKS/concatentate_contig_count_tables.txt
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
37.3387 0:00:37 5.26 20.88 1.20 1.48 172.50 172.50 0.68 0.25

cat hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/...
sed -i '1i sample_id contig_id length reads RPKM FPKM TPM avg_fold_cov contig_GC cov_perc cov_bases median_fold_cov' hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv; } 2> hecatomb_out/STDERR/concatentate_contig_count_tables.log
rm hecatomb_out/STDERR/concatentate_contig_count_tables.log


benchmark: hecatomb_out/BENCHMARKS/mmseqs_contig_annotation.txt
threads: 32
resources: mem_mb=64000, time=1440, jobs=100


    {
    mmseqs createdb hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB --dbtype 2;
    mmseqs search hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/sequenceDB hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results/result hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp             --start-sens 2 -s 7 --sens-steps 3 --min-length 90 -e 1e-5             --search-type 3 ; } &> hecatomb_out/STDERR/mmseqs_contig_annotation.log
    rm hecatomb_out/STDERR/mmseqs_contig_annotation.log

Error submitting jobscript (exit code 1):

Job failed, going on with independent jobs.

[Thu Nov 4 00:27:44 2021]
rule coverage_calculations:

1. Checked the presence of the db:
ls /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/

sequenceDB sequenceDB.dbtype sequenceDB_h sequenceDB_h.dbtype sequenceDB_h.index sequenceDB.index sequenceDB.lookup sequenceDB.source

2. Checking Flye output:
ls /scratch/sahlab/RC2_IBD_virome/allt_results/hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE
00-assembly 22-plasmids 40-polishing assembly_graph.gfa assembly_info.txt flye.log
20-repeat 30-contigger assembly.fasta assembly_graph.gv contig_dictionary.stats params.json

NOTE: no results or mmseqs_nt_tmp folders


Unfortunately, the crash log is empty:
wc -l hecatomb.crashreport.log
0 hecatomb.crashreport.log
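When the crash report comes up empty, the per-rule error logs are usually still on disk. A rough way to surface the most recently written non-empty ones (a sketch; assumes GNU find and the hecatomb_out/STDERR directory used in the commands above):

find hecatomb_out/STDERR -type f -size +0c -printf '%T@ %p\n' | sort -rn | head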

Error downloading databases

I followed the directions to generate databases here and ran into an error. After decompressing the database tar file, I ran the following command:

snakemake --configfile snakemake/config/sample_config.yaml --snakefile snakemake/workflow/download_databases.smk --cores 8

and got the output below. My snakemake version is 5.26.1.

Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
	count	jobs
	1	all
	1	cluster_uniprot
	1	download_id_taxonomy_mapping
	1	download_ncbi_taxonomy
	1	download_uniprot_viruses
	1	download_uniref50
	1	extract_ncbi_taxonomy
	1	line_sine_download
	1	make_bac_databases
	1	make_host_databases
	1	mmseqs_uniprot_clusters
	1	mmseqs_uniprot_taxdb
	1	mmseqs_urv
	1	mmseqs_urv_taxonomy
	1	uniprot_to_ncbi_mapping
	1	uniref_plus_viruses
	16

[Thu Oct  8 18:24:54 2020]
rule download_uniref50:
    output: databases/proteins/uniref50.fasta.gz
    jobid: 16


[Thu Oct  8 18:24:54 2020]
rule download_id_taxonomy_mapping:
    output: databases/taxonomy/idmapping.dat.gz
    jobid: 9


[Thu Oct  8 18:24:54 2020]
rule download_ncbi_taxonomy:
    output: databases/taxonomy/taxdump.tar.gz
    jobid: 14


[Thu Oct  8 18:24:54 2020]
rule make_bac_databases:
    input: databases/bac_giant_unique_species/bac_uniquespecies_giant.masked_Ns_removed.fasta
    output: databases/bac_giant_unique_species/ref
    jobid: 1
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule download_uniprot_viruses:
    output: databases/proteins/uniprot_virus.faa
    jobid: 4


[Thu Oct  8 18:24:54 2020]
rule make_host_databases:
    input: databases/human_masked/human_virus_masked.fasta
    output: databases/human_masked/ref
    jobid: 2
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule line_sine_download:
    output: databases/contaminants/line_sine.fasta
    jobid: 3

[Thu Oct  8 18:24:54 2020]
[Thu Oct  8 18:24:54 2020]
Error in rule download_id_taxonomy_mapping:
Error in rule download_uniprot_viruses:
    jobid: 9
    jobid: 4
    output: databases/taxonomy/idmapping.dat.gz
    output: databases/proteins/uniprot_virus.faa
    shell:
        
        cd databases/taxonomy;
        curl -LO "https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    shell:
        
        mkdir -p databases/proteins && curl -Lgo databases/proteins/uniprot_virus.faa "https://www.uniprot.org/uniprot/?query=taxonomy:%22Viruses%20[10239]%22&format=fasta&&sort=score&fil=reviewed:no"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)


[Thu Oct  8 18:24:55 2020]
Error in rule line_sine_download:
    jobid: 3
    output: databases/contaminants/line_sine.fasta
    shell:
        
        (curl -L http://sines.eimb.ru/banks/SINEs.bnk &&                 curl -L http://sines.eimb.ru/banks/LINEs.bnk)                 | sed -e '/^>/ s/ /_/g' | seqtk rename                 > databases/contaminants/line_sine.fasta
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job line_sine_download since they might be corrupted:
databases/contaminants/line_sine.fasta
[Thu Oct  8 18:24:57 2020]
Finished job 14.
1 of 16 steps (6%) done
[Thu Oct  8 18:29:42 2020]
Finished job 2.
2 of 16 steps (12%) done
[Thu Oct  8 18:30:36 2020]
Finished job 1.
3 of 16 steps (19%) done
[Thu Oct  8 18:40:49 2020]
Finished job 16.
4 of 16 steps (25%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log

Error in secondary_nt_lca_table step

Hi,

One of my datasets (141 samples) had been running for 5 days when it threw this error message:

Error in rule secondary_nt_lca_table:
    jobid: 2734
    output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
    log: hecatomb_out/STDERR/secondary_nt_lca_table.log (check log file(s) for error message)
    cluster_jobid: Submitted batch job 35662963
Logfile hecatomb_out/STDERR/secondary_nt_lca_table.log:
DEBUG:root:Reading alignments and extracting taxon IDs


Error executing rule secondary_nt_lca_table on cluster (jobid: 2734, external: Submitted batch job 35662963, jobscript: /scratch/sahlab/AK/.snakemake/tmp.w_b1srs3/snakejob.secondary_nt_lca_table.2734.sh). For error details see the cluster log and the log files of the involved rule(s).

Is there anywhere else I can check for more information?
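For what it's worth, the two places the error message points at can be inspected directly (a sketch; sacct availability and fields depend on the cluster's SLURM setup):

sacct -j 35662963 --format=JobID,State,Elapsed,MaxRSS,ExitCode   # the scheduler's view of the failed job (out-of-memory, timeout, etc.)
cat hecatomb_out/STDERR/secondary_nt_lca_table.log               # the rule log named in the error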

Thanks!
Leran

Partition phage and eukaryotic viruses

In an early developmental version of hecatomb I merged the 'bigtable' (it wasn't called that at the time!) with a list of phage taxonomies. This added a column called virus_type that allowed simple partitioning of phage versus non-phage viruses.

This should be easy to implement the same way we add Baltimore classifications.

I have the original phage list that I used for this (below).
Ampullaviridae
Atkinsviridae
Autographiviridae
Autolykiviridae
Bicaudaviridae
Blumeviridae
Caudovirales
Caudovirales_undefined_family
Clavaviridae
Corticoviridae
Crevaviridae
Cystoviridae
Duinviridae
Fiersviridae
Fuselloviridae
Globuloviridae
Guelinviridae
Guttaviridae
Inoviridae
Intestiviridae
Jelitoviridae
Leviviridae
Ligamenvirales
Ligamenvirales_undefined_family
Lipothrixviridae
Matshushitaviridae
Microviridae
Myoviridae
Paulinoviridae
Picobirnaviridae
Plasmaviridae
Pleolipoviridae
Podoviridae
Portogloboviridae
Rountreeviridae
Rudiviridae
Salasmaviridae
Schitoviridae
Simuloviridae
Siphoviridae
Solspiviridae
Sphaerolipoviridae
Spiraviridae
Steigviridae
Steitzviridae
Suoliviridae
Tectiviridae
Tinaiviridae
Tristromaviridae
Tubulavirales
Tubulavirales_undefined_family
Turriviridae
unidentified phage
Zobellviridae
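Using the list above, a rough command-line sketch of the join (the file name phage_families.txt and the bigtable column name "family" are assumptions; the real implementation would presumably mirror the Baltimore-classification merge):

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { phage[$1] = 1; next }                                 # first file: the phage family list
     FNR == 1  { for (i = 1; i <= NF; i++) if ($i == "family") f = i;  # header: locate the family column
                 print $0, "virus_type"; next }
     { t = ($f in phage) ? "phage" : "non-phage"; print $0, t }        # data rows: tag by family
    ' phage_families.txt bigtable.tsv > bigtable.virus_type.tsv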

Error in rule PRIMARY_NT_reformat

Hi,

I have been running hecatomb on a 190-sample dataset, and it crashed at the "rule PRIMARY_NT_reformat" step. This is what shows up in the hecatomb_out/STDERR/t09.mmseqs_PRIMARY_nt_summary.log file:

/bin/bash: line 10: 27845 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

I don't remember seeing this error when I ran a subset (10 samples) of this 190-sample dataset. Can I just rebuild hecatomb, or is there something we should tune before rebuilding?

Thanks!
Leran

missing python package

When I run the test data, it dies with this error:

Error executing rule sankey_diagram on cluster (jobid: 213, external: Submitted batch job 33397859, jobscript: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/tmp.1edmh2g2/snakejob.sankey_diagram.213.sh). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.
[Thu Oct 28 01:20:39 2021]
Finished job 196.
214 of 216 steps (99%) done
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/log/2021-10-27T220225.813094.snakemake.log
ERROR: Snakemake command failed

ERROR:snakemake.logging:RuleException:
ValueError in line 501 of /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk:

Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
$ pip install -U kaleido

File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 2357, in run
_wrapper
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk", line 501, in __rule_sankey_diagram
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/basedatatypes.py", line 3821, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 268, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 134, in to_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 574, in _cal
lback
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 560, in cach
ed_or_run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 2390, in run
_wrapper

Enhancement: Host

Hi Mike,
It would be VERY helpful if Hecatomb could link host information to the taxonomy. I have attached a file with a list of viral families and hosts that could help with this.

Thank you,
Kathie
2020_11_24_Viral_Family_host.xlsx

Example Profile Broken

Hi,

I tried to implement the example Snakemake profile following the tutorial instructions on my cluster - it did not work.

Full error message is below:

Running Hecatomb
Running snakemake command:
snakemake --profile slurm --default-resources mem_mb=2000 time=1440 jobs=100 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/conda --configfile hecatomb.config_prof.yaml -s /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/Hecatomb.smk -C Reads=reads.txt Host=human Output=hecatomb_out SkipAssembly=True Fast=False
usage: snakemake [-h] [--dry-run] [--profile PROFILE] [--cache [RULE ...]]
[--snakefile FILE] [--cores [N]] [--jobs [N]]
[--local-cores N] [--resources [NAME=INT ...]]
[--set-threads RULE=THREADS [RULE=THREADS ...]]
[--max-threads MAX_THREADS]
[--set-resources RULE:RESOURCE=VALUE [RULE:RESOURCE=VALUE ...]]
[--set-scatter NAME=SCATTERITEMS [NAME=SCATTERITEMS ...]]
[--default-resources [NAME=INT ...]]
[--preemption-default PREEMPTION_DEFAULT]
[--preemptible-rules PREEMPTIBLE_RULES [PREEMPTIBLE_RULES ...]]
[--config [KEY=VALUE ...]] [--configfile FILE [FILE ...]]
[--envvars VARNAME [VARNAME ...]] [--directory DIR] [--touch]
[--keep-going] [--force] [--forceall]
[--forcerun [TARGET ...]] [--prioritize TARGET [TARGET ...]]
[--batch RULE=BATCH/BATCHES] [--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--rerun-incomplete]
[--shadow-prefix DIR] [--scheduler [{ilp,greedy}]]
[--wms-monitor [WMS_MONITOR]]
[--wms-monitor-arg [NAME=VALUE ...]]
[--scheduler-ilp-solver {COIN_CMD}]
[--scheduler-solver-path SCHEDULER_SOLVER_PATH]
[--conda-base-path CONDA_BASE_PATH] [--no-subworkflows]
[--groups GROUPS [GROUPS ...]]
[--group-components GROUP_COMPONENTS [GROUP_COMPONENTS ...]]
[--report [FILE]] [--report-stylesheet CSSFILE]
[--draft-notebook TARGET] [--edit-notebook TARGET]
[--notebook-listen IP:PORT] [--lint [{text,json}]]
[--generate-unit-tests [TESTPATH]] [--containerize]
[--export-cwl FILE] [--list] [--list-target-rules] [--dag]
[--rulegraph] [--filegraph] [--d3dag] [--summary]
[--detailed-summary] [--archive FILE]
[--cleanup-metadata FILE [FILE ...]] [--cleanup-shadow]
[--skip-script-cleanup] [--unlock] [--list-version-changes]
[--list-code-changes] [--list-input-changes]
[--list-params-changes] [--list-untracked]
[--delete-all-output] [--delete-temp-output]
[--bash-completion] [--keep-incomplete] [--drop-metadata]
[--version] [--reason] [--gui [PORT]] [--printshellcmds]
[--debug-dag] [--stats FILE] [--nocolor] [--quiet]
[--print-compilation] [--verbose] [--force-use-threads]
[--allow-ambiguity] [--nolock] [--ignore-incomplete]
[--max-inventory-time SECONDS] [--latency-wait SECONDS]
[--wait-for-files [FILE ...]] [--wait-for-files-file FILE]
[--notemp] [--all-temp] [--keep-remote] [--keep-target-files]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
[-T RESTART_TIMES] [--attempt ATTEMPT]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob,XRootD}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX]
[--no-shared-fs] [--greediness GREEDINESS] [--no-hooks]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--debug]
[--runtime-profile FILE] [--mode {0,1,2}]
[--show-failed-logs] [--log-handler-script FILE]
[--log-service {none,slack,wms}]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--cluster-config FILE] [--immediate-submit]
[--jobscript SCRIPT] [--jobname NAME]
[--cluster-status CLUSTER_STATUS] [--drmaa-log-dir DIR]
[--kubernetes [NAMESPACE]] [--container-image IMAGE]
[--tibanna] [--tibanna-sfn TIBANNA_SFN]
[--precommand PRECOMMAND]
[--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]]
[--google-lifesciences]
[--google-lifesciences-regions GOOGLE_LIFESCIENCES_REGIONS [GOOGLE_LIFESCIENCES_REGIONS ...]]
[--google-lifesciences-location GOOGLE_LIFESCIENCES_LOCATION]
[--google-lifesciences-keep-cache] [--tes URL] [--use-conda]
[--conda-not-block-search-path-envvars] [--list-conda-envs]
[--conda-prefix DIR] [--conda-cleanup-envs]
[--conda-cleanup-pkgs [{tarballs,cache}]]
[--conda-create-envs-only] [--conda-frontend {conda,mamba}]
[--use-singularity] [--singularity-prefix DIR]
[--singularity-args ARGS] [--use-envmodules]
[target ...]
snakemake: error: Couldn't parse config file: mapping values are not allowed here
in "/home/a1667917/.config/snakemake/slurm/config.yaml", line 145, column 65

George

For NextSeq samples, maximum HTCF memory insufficient for merge seqtable step in R

I have tried with the max memory (#SBATCH --mem=250G) and 42 samples, but it ends in an out of memory error:
Failed, Run time 04:46:02, OUT_OF_MEMORY

/var/lib/slurm-llnl/slurmd/job28663469/slurm_script: line 13: 12927 Killed Rscript /scratch/sahlab/Jeffrey_NextSeq_IBD_VLP/seqtable_merge.R
slurmstepd: error: Detected 1 oom-kill event(s) in step 28663469.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

All samples make it through this step:
Parsed with column specification:
cols(
sequence = col_character(),
N646_I8216_32668_Jeffrey_Sup_10_NEBNext-Index-16C_CCGTCCGC_S11 = col_double()
)

and then the error occurs.

We may have to convert to Python or somehow decrease the memory usage of this script if we want to process NextSeq data.

Kathie

Error: EOF marker is absent.There appear to be different numbers of reads in the paired input files

Hi,

I ran Hecatomb (version beta4) on 141 paired-end samples, it crashed and gave this error message:

[W::bgzf_read_block] EOF marker is absent. The input may be truncated
java.lang.AssertionError: 
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
    at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:498)
    at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:363)
    at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:207)
    at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:183)
    at java.base/java.lang.Thread.run(Thread.java:834)

I double-checked the sequence files, 141 R1 files and 141 R2 files, and it doesn't seem that the R1 and R2 numbers differ. Also, I don't understand what "EOF marker is absent" means.
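"EOF marker is absent" generally means a bgzf/gzip file is truncated (for example an incomplete copy or transfer), which can also make the read counts within a pair disagree. A rough way to check, plus the re-pairing step the error message itself suggests (a sketch; file names are placeholders):

for f in *_R[12]*.fastq.gz; do gzip -t "$f" || echo "truncated: $f"; done   # flag any truncated files
repair.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz out=fixed_R1.fastq.gz out2=fixed_R2.fastq.gz outs=singletons.fastq.gz   # re-pair a broken pair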

Thanks!
Leran

How to convert fastq-dumps to R1 /R2

Since we use _R1 and _R2 to find mate pairs, here is how to convert fastq-dump output to R1/R2 format

for F in *pass_*; do O=$(echo "$F" | sed -e 's/pass_/pass_R/'); echo "$F" "$O"; mv "$F" "$O"; done

We should add this to the wiki

include info about potential false positives in README?

Should we work on a list of potential false positives, with reasoning, for the paper? I think we should include Ebrahim and Rob's work on the Poxviridae and the LINEs and SINEs. We could include a warning about the need for follow-up with the large dsDNA viruses (CRISPR-Cas-related functions in mimiviruses; transposons).

MANY missing output files; Input files updated by another job

Hi Mike,
After Hecatomb crashed, I ran this:
hecatomb run --reads RC2_freeze_2_samples_C.tsv --profile slurm --configfile hecatomb.config.yaml --snake=-n --snake=--reason


[Thu Feb 17 08:52:52 2022]
rule secondary_nt_lca_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.m8
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
log: hecatomb_out/STDERR/secondary_nt_lca_table.log
jobid: 2034
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_lca_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
resources: mem_mb=16000, disk_mb=893298, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule secondary_nt_calc_lca:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
log: hecatomb_out/STDERR/secondary_nt_calc_lca.log
jobid: 2033
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_calc_lca.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin

threads: 24
resources: mem_mb=64000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    {
    # calculate lca and lineage
    taxonkit lca -i 2 -s ';' --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin | taxonkit lineage -i 3 --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy | cut --complement -f 2 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage 2> hecatomb_out/STDERR/secondary_nt_calc_lca.log

    # Reformat lineages
    awk -F '\t' '$2 != 0' hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage | taxonkit reformat --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy -i 3 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F --fill-miss-rank 2>> hecatomb_out/STDERR/secondary_nt_calc_lca.log | cut --complement -f3 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
} &> hecatomb_out/STDERR/secondary_nt_calc_lca.log
rm hecatomb_out/STDERR/secondary_nt_calc_lca.log

[Thu Feb 17 08:52:52 2022]
rule SECONDARY_NT_generate_output_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/tophit.m8, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/SECONDARY_nt.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv, hecatomb_out/RESULTS/sampleSeqCounts.tsv, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tables/2020_07_27_Viral_classification_table_ICTV2019.txt
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
log: hecatomb_out/STDERR/SECONDARY_NT_generate_output_table.log
jobid: 2026
benchmark: hecatomb_out/BENCHMARKS/SECONDARY_NT_generate_output_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule combine_AA_NT:
input: hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
output: hecatomb_out/RESULTS/bigtable.tsv
log: hecatomb_out/STDERR/combine_AA_NT.log
jobid: 2036
benchmark: hecatomb_out/BENCHMARKS/combine_AA_NT.txt
reason: Missing output files: hecatomb_out/RESULTS/bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    { cat hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv > hecatomb_out/RESULTS/bigtable.tsv;
    tail -n+2 hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv >> hecatomb_out/RESULTS/bigtable.tsv; } &> hecatomb_out/STDERR/combine_AA_NT.log
rm hecatomb_out/STDERR/combine_AA_NT.log

[Thu Feb 17 08:52:52 2022]
rule tax_level_counts:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/taxonLevelCounts.tsv
log: hecatomb_out/STDERR/tax_level_counts.log
jobid: 2045
reason: Missing output files: hecatomb_report/taxonLevelCounts.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule contig_read_taxonomy:
input: hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_out/RESULTS/contigSeqTable.tsv
log: hecatomb_out/STDERR/contig_read_taxonomy.log
jobid: 2041
benchmark: hecatomb_out/BENCHMARKS/contig_read_taxonomy.txt
reason: Missing output files: hecatomb_out/RESULTS/contigSeqTable.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule krona_text_format:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/krona.txt
log: hecatomb_out/STDERR/krona_text_format.log
jobid: 2047
benchmark: hecatomb_out/BENCHMARKS/krona_text_format.txt
reason: Missing output files: hecatomb_report/krona.txt; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule contig_krona_text_format:
input: hecatomb_out/RESULTS/contigSeqTable.tsv
output: hecatomb_report/contigKrona.txt
log: hecatomb_out/STDERR/contig_krona_text_format.log
jobid: 2043
reason: Missing output files: hecatomb_report/contigKrona.txt; Input files updated by another job: hecatomb_out/RESULTS/contigSeqTable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule krona_plot:
input: hecatomb_report/krona.txt
output: hecatomb_report/krona.html
log: hecatomb_out/STDERR/krona_plot.log
jobid: 2046
benchmark: hecatomb_out/BENCHMARKS/krona_plot.txt
reason: Missing output files: hecatomb_report/krona.html; Input files updated by another job: hecatomb_report/krona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    ktImportText hecatomb_report/krona.txt -o hecatomb_report/krona.html &> hecatomb_out/STDERR/krona_plot.log
    rm hecatomb_out/STDERR/krona_plot.log

[Thu Feb 17 08:52:52 2022]
rule contig_krona_plot:
input: hecatomb_report/contigKrona.txt
output: hecatomb_report/contigKrona.html
log: hecatomb_out/STDERR/contig_krona_plot.log
jobid: 2042
reason: Missing output files: hecatomb_report/contigKrona.html; Input files updated by another job: hecatomb_report/contigKrona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    ktImportText hecatomb_report/contigKrona.txt -o hecatomb_report/contigKrona.html &> hecatomb_out/STDERR/contig_krona_plot.log
    rm hecatomb_out/STDERR/contig_krona_plot.log

[Thu Feb 17 08:52:52 2022]
localrule all:
input: hecatomb_out/RESULTS/seqtable.fasta, hecatomb_out/RESULTS/sampleSeqCounts.tsv, hecatomb_out/RESULTS/seqtable.properties.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv, hecatomb_out/RESULTS/assembly.properties.tsv, hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv, hecatomb_out/RESULTS/bigtable.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_phylum_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_class_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_order_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_family_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_genus_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_species_summary.tsv, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/Step00_counts.tsv, hecatomb_report/Step01_counts.tsv, hecatomb_report/Step02_counts.tsv, hecatomb_report/Step03_counts.tsv, hecatomb_report/Step04_counts.tsv, hecatomb_report/Step05_counts.tsv, hecatomb_report/Step06_counts.tsv, hecatomb_report/Step07_counts.tsv, hecatomb_report/Step08_counts.tsv, hecatomb_report/Step09_counts.tsv, hecatomb_report/Step10_counts.tsv, hecatomb_report/Step11_counts.tsv, hecatomb_report/Step12_counts.tsv, hecatomb_report/Step13_counts.tsv, hecatomb_report/Sankey.svg, hecatomb_report/hecatomb.samples.tsv, hecatomb_report/taxonLevelCounts.tsv, hecatomb_report/krona.html
jobid: 0
reason: Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/krona.html, hecatomb_report/taxonLevelCounts.tsv, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

Job stats:
job count min threads max threads


SECONDARY_NT_generate_output_table 1 1 1
all 1 1 1
combine_AA_NT 1 1 1
contig_krona_plot 1 1 1
contig_krona_text_format 1 1 1
contig_read_taxonomy 1 2 2
krona_plot 1 1 1
krona_text_format 1 1 1
secondary_nt_calc_lca 1 24 24
secondary_nt_lca_table 1 1 1
tax_level_counts 1 2 2
total 11 1 24

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

What the hecatomb?

Documentation

Hi,

For the documentation, can you add a link under each output file to a table that defines each column?
Ex: bigtable.tsv
seqID UID for each representative sequence

This would make the tool more easily accessible, but not clutter the doc.

Thank you,
Kathie

Problem with running

When I run hecatomb, the program stops at:
Downloading and installing remote packages...

No processes are running and the program does not continue.

snakemake profiles

"Non-standard" default snakemake profiles are giving a hecatomb error.

I have followed Mike's snakemake profile tutorial, but then added things to both the sbatch command and the default-resources.

For example, our cluster has a partition setting (general vs GPU) with a default resource of general that has been missed by hecatomb.

We need to make sure that hecatomb is compatible with any profile, and I think we need to ensure we incorporate the user's default-resources settings, and perhaps other settings too.

read pair handling

bbtools appears to only recognize :1 & :2 or /1 & /2 for read pairs. This breaks everything at step 6 when we run repair.sh

Many reads downloaded from SRA use .1 and .2 to identify paired-end reads.

A simple solution is to rename all the reads before you begin to ensure that they end with /1 and /2 as appropriate. You can use change_fastq_pair_symbol to do that, but it is not the most elegant or efficient solution (but it works).

I am not sure where/how to implement a better fix for this, or if it is our responsibility to check read names.

updating hecatomb

Hi Mike,

Can you add to the documentation how to update hecatomb? Do we have to reinstall when there is an update? Will hecatomb alert us to major updates?

Thank you,
Kathie

flye "Looks like the system ran out of memory"

I am trying to do the contig assembly for 523 samples and have run into this issue:

[2020-10-28 16:49:19] INFO: Simplifying the graph
[2020-10-28 19:21:09] ERROR: Looks like the system ran out of memory
[2020-10-28 19:21:09] ERROR: Command '['flye-modules', 'repeat', '--disjointigs', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/00-assembly/draft_assembly.fasta', '--reads', './assembly/contig_dictionary/all.mh.contigs_for_flye.fa', '--out-dir', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/20-repeat', '--config', '/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/py-flye-2.7.1-36mvt7vew5klvjj37weoxusoqe4l33ka/lib/python3.6/site-packages/flye/config/bin_cfg/asm_subasm.cfg', '--log', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/flye.log', '--threads', '24', '--meta', '--min-ovlp', '1000', '--kmer', '31']' died with <Signals.SIGKILL: 9>.
[2020-10-28 19:21:09] ERROR: Pipeline aborted

This is what the next step SHOULD have been:
[2020-06-03 23:00:27] INFO: >>>STAGE: plasmids
[2020-06-03 23:00:27] INFO: Recovering short unassembled sequences

Here is the command that was running when it crashed:
flye --subassemblies $OUT/contig_dictionary/all.mh.contigs_for_flye.fa -t 24 --meta --plasmids -o $OUT/contig_dictionary -g 1g

Here are my memory and node requests:
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G

This is my version of flye:
module load py-flye/2.7.1-python-3.6.5

I have looked for this issue re flye. One recommendation was to update to flye 2.5, but we are more up to date than that.
fenderglass/Flye#142
The second recommendation was "Hard to tell what is going on, because all cluster environments are usually configured very differently. I would suggest to try to resume run with an increased number of requested threads (maybe 25 in PBS, but use -t 20 in Flye) and specify maximum RAM (say ~500Gb should be sufficient)."
fenderglass/Flye#138

Would appreciate any suggestions of what to try next. I would try the flye command in an interactive session. I can't go above 250G, but I could try increasing the CPUs (-t 24 in flye).
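If it helps, Flye can resume from the last completed stage rather than restarting from scratch (a sketch based on the command above; --resume reuses the same output directory, and it is not certain this avoids the memory spike in the repeat stage):

flye --subassemblies $OUT/contig_dictionary/all.mh.contigs_for_flye.fa -t 24 --meta --plasmids -o $OUT/contig_dictionary -g 1g --resume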

missing fields in outfile

Hi Mike,

The tophit.m8 should have these columns:
query | target | evalue | pident | fident | nident | mismatch | qcov | tcov | qstart | qend | qlen | tstart | tend | tlen | alnlen | bits | qheader | theader | taxid | taxname | lineage

But the last 3 columns are empty: taxid, taxname, lineage. They will be important for parsing the contig taxonomy by kingdom, family, etc.

From: rule PRIMARY_AA_taxonomy_assignment

Kathie

slurm and space

Hi Rob,

I am not sure if this will happen with Snakemake, but if I run out of space on the server and contaminant removal quits after sample 42 of 527, it exits with status 0 and no out-of-space message because those 42 samples completed.

Kathie

Java Runtime Error when running test databases on VM

Hi,

I'm currently trying to get hecatomb working on a VM, but I've run into a Java-related error. The support for said VM and storage servers told me that this error was not related to the VM, and that I should contact the hecatomb developers. I've attached the error log. Please let me know if there's any further information that needs to be provided.
hs_err_pid170.log

settings in '--configfile' file not passed to 'script:' rules

I specify a custom config file with --configfile e.g.

hecatomb run --configfile hecatomb.config.yaml ...

which specifies a custom location for the databases.

The pipeline runs fine until it hits the first rule that uses the script: directive (instead of shell: or run:). The DB check function reverts to the old database location and throws the 'database not installed' error.

Run Offline

Hi,

The cluster I am trying to run hecatomb on blocks all internet access to compute nodes, which is very frustrating.

It looks like a couple of rules in 00_functions.smk rely on Snakemake wrappers (for the bam_index and fasta_index rules) and so require internet access.

The entire log file I get when I run hecatomb on the compute node is:

Config file /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
WorkflowError:
Failed to open source file https://github.com/snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml
ConnectTimeout: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f65c0836c80>, 'Connection to github.com timed out. (connect timeout=None)'))

I commented out the wrapper for the 2 rules and hecatomb ran fine other than the very final indexing step (I then uncommented and ran the indexing steps on the login node - which worked to complete the pipeline just fine).
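For reference, the two wrapper steps appear to amount to standard samtools indexing and can be run by hand once everything else has finished (a sketch; the file paths are placeholders, not the pipeline's actual outputs):

samtools faidx path/to/seqtable.fasta          # roughly what a fasta_index wrapper rule does
samtools index path/to/assembly.seqtable.bam   # roughly what a bam_index wrapper rule does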

Therefore, as far as I can tell, this is the only section of hecatomb that requires online access - perhaps you could implement an option to run offline only?

I tried to manually change the snakemake rule to something like samtools index {input} but that didn't work.

George

Add human taxonomy to a taxtable

We should record which reads map to humans (and perhaps bacteria and other things) so we can use them in downstream analysis (e.g. contig oarfishing)

mouse

Can the documentation list which mouse genome is used? C57BL6 or something else?

Mappers for host removal

There is an ongoing discussion around which mapper to use to remove host sequences. This seems like a simple issue, but as with all things it is multi-factorial.

  1. Memory efficiency / speed: This is a big issue for running hecatomb on multiple architectures. Java memory management causes lots of user issues. This is not a simple issue as it is dependent on I/O, indexing, zipping/unzipping, etc.

  2. Mapper outputs: bbmap has by far the best 'built-in' options for outputs. Lots of summary logs and files to be mined eliminating the need to generate parsing scripts to generate the tables you want. In particular, bbmap has nice functionality for directly incorporating taxonomy. This was used to annotate the taxonomy of the bacteria reads were hitting in step_8 (base/bacterial_contaminants.sh). Simple and clean.

  3. Specificity: A lot of mappers are assessed based on their specificity. Correct mappings are required for accurate gene counting in RNA seq and SNP calling. However, neither of these are a real issue for host removal so this is somewhat of a non-issue for this specific application but may have issues in other parts of the pipeline.

  4. Dependency creep: I know that conda fixes a lot of the issues with dependency installation, but there will always be issues with keeping up with updates for multiple dependencies. There is just no way around that. I think it is a noble goal to keep dependencies to a minimum even if some other minor issues are complicated because of this.

There are faster mappers than both bowtie and bbmap. For example, minimap2 is typically 10X faster than bowtie and is likely to be a better fit for hecatomb than either bowtie or bbmap (https://github.com/lh3/minimap2).

Another option is pseudoalignment mappers such as kallisto (https://pachterlab.github.io/kallisto/about). This would likely be our fastest option, and isn't an additional dependency as it is already used for abundance estimation of assembled contigs. Kallisto is so efficient in both index generation and alignment that a switch to this pseudoalignment approach (if it works!) would get us closer to the magical 'laptop version' of hecatomb.
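A minimal sketch of what a kallisto-based host screen might look like (the kallisto and samtools commands are standard, but using pseudoalignment for host removal is the proposal here, not current hecatomb behaviour; file names are placeholders):

kallisto index -i host.idx host_masked.fasta                                                # build the host index once
kallisto quant -i host.idx -o host_screen --pseudobam reads_R1.fastq.gz reads_R2.fastq.gz   # pseudoalign read pairs against the host
# pairs with no pseudoalignment (putative non-host) could then be pulled from the resulting
# BAM, e.g. with samtools fastq -f 12 on the name-collated pseudoalignments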

So I wanted to propose the following. If you have comments or issues on this strategy please post them as comments here:

  1. Remove step_8 mapping the virus masked bacterial genomes. This isn't exactly the topic at hand, but is an elimination of a major, somewhat crippling component in terms of computation and database curation. This should be a separate module.

  2. Swap bbmap/bt2 to kallisto for host mapping

This was a long post, but there are several issues in our weekly calls and on GitHub that deal with mapping and memory issues so I wanted to try and summarize them here as a single place for continued discussion.

SH

taxonomy improvement

Hi Mike,

In working with the SIV data, I realized that there are spaces in taxonomic fields. For example, for families:
Verrucomicrobia subdivision 3
Verrucomicrobia subdivision 6

This will make pulling reads by family more difficult. Could all spaces in taxonomy fields be replaced with underscores in the next update?
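As a stopgap, spaces can be squashed to underscores on the way into downstream tools (a rough sketch; note it touches every space in the file, so it is only safe where the non-taxonomy columns contain no meaningful spaces):

sed 's/ /_/g' bigtable.tsv > bigtable.underscored.tsv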

Thank you,
Kathie

convert_phyloseq.R

This block of code

# Summarise per taxon counts
merged_table <- stmerge %>%
  select(-id) %>%
  unite("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
  group_by(lineage) %>%
  summarise_if(is.numeric, funs(sum(as.numeric(.)))) %>%
  separate("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
  ungroup()

Generates the warning:

Warning messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
2: Expected 7 pieces. Additional pieces discarded in 2 rows [137, 138]. 

Error executing PRIMARY_NT_reformat

Cluster error:

/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

Log file error report (same as above):

/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

hecatomb install giving error

hecatomb install
Checking and downloading database files
Running snakemake command:
snakemake -j 32 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/conda -s /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/DownloadDB.smk -C Output=hecatomb_out
Config file /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
/usr/bin/bash: line 10: __conda_exe: command not found
Traceback (most recent call last):
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/init.py", line 699, in snakemake
success = workflow.execute(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 973, in execute
self.scheduler = JobScheduler(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/scheduler.py", line 348, in init
self._executor = CPUExecutor(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/init.py", line 453, in init
self.exec_job += self.get_additional_args()
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/init.py", line 298, in get_additional_args
if self.workflow.conda_base_path and self.assume_shared_fs:
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 276, in conda_base_path
return Conda().prefix_path
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 439, in init
shell.check_output(self._get_cmd("conda info --json"))
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/shell.py", line 63, in check_output
return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.
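The "__conda_exe: command not found" line suggests the shell Snakemake spawns cannot see conda's shell integration. One possible workaround (a sketch; the conda.sh path is inferred from the miniconda3 prefix in the traceback and may differ on your system):

source /home/vsingh/miniconda3/etc/profile.d/conda.sh   # load conda's shell functions
conda activate hecatomb                                 # the environment the traceback paths point to
hecatomb install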
