shandley / hecatomb
hecatomb is a virome analysis pipeline for analysis of Illumina sequence data
License: MIT License
Hi,
My colleague and I are running hecatomb on the same dataset, but the output tables we each generated have different numbers of rows or sequences:
mine:
wc -l bigtable.tsv
**33073** bigtable.tsv
seqkit stat seqtable.fasta
file format type num_seqs sum_len min_len avg_len max_len
seqtable.fasta FASTA DNA **124,014** 28,329,286 90 228.4 250
My colleague:
wc -l assembly.fasta
**23998** assembly.fasta
seqkit stat seqtable.fasta
file format type num_seqs sum_len min_len avg_len max_len
seqtable.fasta FASTA DNA **124,041** 28,332,842 90 228.4 250
We are not sure whether these steps finished, because we both failed at the "sankey_diagram" step.
I wanted to check the .err files in the LOG folder to see whether the steps that generated those files had finished, but I couldn't tell which folders are the right ones to look in.
I think it would be very helpful if a final .log file saying "The entire hecatomb pipeline has successfully finished!" were generated once the whole pipeline is done.
Thanks!
Leran
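Until the pipeline writes such a message, one rough way to tell whether a run finished is to look at the last progress line in the Snakemake log, which has the form "1894 of 2200 steps (86%) done" in the logs quoted in these issues. The sketch below is only an illustration: the function names are made up, and it assumes the progress lines in your log match that format.

```python
import re

# Matches Snakemake progress lines such as "1894 of 2200 steps (86%) done".
PROGRESS = re.compile(r"(\d+) of (\d+) steps \(\d+(?:\.\d+)?%\) done")


def last_progress(log_lines):
    """Return (done, total) from the last progress line, or None if there is none."""
    last = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if match:
            last = (int(match.group(1)), int(match.group(2)))
    return last


def is_complete(log_lines):
    """True only if the last progress line reports every step as done."""
    progress = last_progress(log_lines)
    return progress is not None and progress[0] == progress[1]
```

To use it, open the newest file under .snakemake/log/ and pass the file handle to is_complete(). A log with no progress line at all is treated as not complete.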
Hi Mike,
Can you add to the documentation how to kill hecatomb? I killed the tmux session, but it is my understanding that to kill snakemake, I have to kill each running process.
Kathie
Hi Scott,
This is the current reformat command between megahit and flye:
reformat.sh in={output.rename} out={output.size} \
ml={config[MINLENGTH]} \
ow=t \
-Xmx{config[System][Memory]}g;
If you try to run flye with the same MINLENGTH as megahit, it may fail immediately because of duplicate names. This can be solved by adding this to the reformat command:
uniquenames=t
Kathie
1.) Check that coercion from factor to numeric is behaving as intended, for example:
# Adjust values
ps.melt.value.sp <- ps.melt.sp %>%
select(c((ncol(MAP)+11):(ncol(ps.melt.sp)))) %>%
mutate_if(is.factor, ~ as.numeric(levels(.x))[.x])
Coercion from factor to numeric requires you coerce to a character first.
2.) Merging the Baltimore classification table with the melted phyloseq data using a left_join() will likely drop taxa from your table (ie: families in your melted table that are not present in your Baltimore file), as the keep argument defaults to FALSE. Is this the intended behavior? If not, I would recommend changing this parameter to TRUE, then filtering any rows added to the table that have an empty OTU (the Baltimore file will pull in some families that may not match to any families in your melted table).
# may lose information here:
ps.melt.fixed.sp <- left_join(ps.melt.fixed.sp, baltimore, by = "Family")
3.) When generating the standardized count object at the genus level, the default option for tax_glom is to drop any taxa for which you are missing information at the specified rank. In this case it should never be an issue (NAs aren't expected) but it would be safer to go ahead and set this parameter to FALSE in the event something upstream has gone wrong:
ps0.ge.glom <- ps0.sp %>%
speedyseq::tax_glom("Genus", NArm = FALSE)
Hi Mike,
I only got 86% of the way through this time:
[Thu Nov 4 02:02:55 2021]
Finished job 1885.
1894 of 2200 steps (86%) done
Exiting because a job execution failed. Look above for error message
FATAL: Hecatomb encountered an error.
Dumping all error logs to "hecatomb.errorLogs.txt"
Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log
Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log:
rule concatentate_contig_count_tables:
benchmark: hecatomb_out/BENCHMARKS/concatentate_contig_count_tables.txt
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
37.3387 0:00:37 5.26 20.88 1.20 1.48 172.50 172.50 0.68 0.25
cat hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/...
sed -i '1i sample_id contig_id length reads RPKM FPKM TPM avg_fold_cov contig_GC cov_perc cov_bases median_fold_cov' hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv; } 2> hecatomb_out/STDERR/concatentate_contig_count_tables.log
rm hecatomb_out/STDERR/concatentate_contig_count_tables.log
benchmark: hecatomb_out/BENCHMARKS/mmseqs_contig_annotation.txt
threads: 32
resources: mem_mb=64000, time=1440, jobs=100
{
mmseqs createdb hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB --dbtype 2;
mmseqs search hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/sequenceDB hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results/result hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp --start-sens 2 -s 7 --sens-steps 3 --min-length 90 -e 1e-5 --search-type 3 ; } &> hecatomb_out/STDERR/mmseqs_contig_annotation.log
rm hecatomb_out/STDERR/mmseqs_contig_annotation.log
Error submitting jobscript (exit code 1):
Job failed, going on with independent jobs.
[Thu Nov 4 00:27:44 2021]
rule coverage_calculations:
1. Checked the presence of the db:
ls /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/
sequenceDB sequenceDB.dbtype sequenceDB_h sequenceDB_h.dbtype sequenceDB_h.index sequenceDB.index sequenceDB.lookup sequenceDB.source
2. Checking Flye output:
ls /scratch/sahlab/RC2_IBD_virome/allt_results/hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE
00-assembly 22-plasmids 40-polishing assembly_graph.gfa assembly_info.txt flye.log
20-repeat 30-contigger assembly.fasta assembly_graph.gv contig_dictionary.stats params.json
NOTE: no results or mmseqs_nt_tmp folders
Unfortunately, the crash log is empty:
wc -l hecatomb.crashreport.log
0 hecatomb.crashreport.log
I followed the directions to generate databases here and ran into an error. After decompressing the database tar file and running the following command: snakemake --configfile snakemake/config/sample_config.yaml --snakefile snakemake/workflow/download_databases.smk --cores 8
I get the output below. My snakemake version is 5.26.1.
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
count jobs
1 all
1 cluster_uniprot
1 download_id_taxonomy_mapping
1 download_ncbi_taxonomy
1 download_uniprot_viruses
1 download_uniref50
1 extract_ncbi_taxonomy
1 line_sine_download
1 make_bac_databases
1 make_host_databases
1 mmseqs_uniprot_clusters
1 mmseqs_uniprot_taxdb
1 mmseqs_urv
1 mmseqs_urv_taxonomy
1 uniprot_to_ncbi_mapping
1 uniref_plus_viruses
16
[Thu Oct 8 18:24:54 2020]
rule download_uniref50:
output: databases/proteins/uniref50.fasta.gz
jobid: 16
[Thu Oct 8 18:24:54 2020]
rule download_id_taxonomy_mapping:
output: databases/taxonomy/idmapping.dat.gz
jobid: 9
[Thu Oct 8 18:24:54 2020]
rule download_ncbi_taxonomy:
output: databases/taxonomy/taxdump.tar.gz
jobid: 14
[Thu Oct 8 18:24:54 2020]
rule make_bac_databases:
input: databases/bac_giant_unique_species/bac_uniquespecies_giant.masked_Ns_removed.fasta
output: databases/bac_giant_unique_species/ref
jobid: 1
resources: time_min=240, mem_mb=100000, cpus=16
[Thu Oct 8 18:24:54 2020]
rule download_uniprot_viruses:
output: databases/proteins/uniprot_virus.faa
jobid: 4
[Thu Oct 8 18:24:54 2020]
rule make_host_databases:
input: databases/human_masked/human_virus_masked.fasta
output: databases/human_masked/ref
jobid: 2
resources: time_min=240, mem_mb=100000, cpus=16
[Thu Oct 8 18:24:54 2020]
rule line_sine_download:
output: databases/contaminants/line_sine.fasta
jobid: 3
[Thu Oct 8 18:24:54 2020]
[Thu Oct 8 18:24:54 2020]
Error in rule download_id_taxonomy_mapping:
Error in rule download_uniprot_viruses:
jobid: 9
jobid: 4
output: databases/taxonomy/idmapping.dat.gz
output: databases/proteins/uniprot_virus.faa
shell:
cd databases/taxonomy;
curl -LO "https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz"
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
shell:
mkdir -p databases/proteins && curl -Lgo databases/proteins/uniprot_virus.faa "https://www.uniprot.org/uniprot/?query=taxonomy:%22Viruses%20[10239]%22&format=fasta&&sort=score&fil=reviewed:no"
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Thu Oct 8 18:24:55 2020]
Error in rule line_sine_download:
jobid: 3
output: databases/contaminants/line_sine.fasta
shell:
(curl -L http://sines.eimb.ru/banks/SINEs.bnk && curl -L http://sines.eimb.ru/banks/LINEs.bnk) | sed -e '/^>/ s/ /_/g' | seqtk rename > databases/contaminants/line_sine.fasta
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job line_sine_download since they might be corrupted:
databases/contaminants/line_sine.fasta
[Thu Oct 8 18:24:57 2020]
Finished job 14.
1 of 16 steps (6%) done
[Thu Oct 8 18:29:42 2020]
Finished job 2.
2 of 16 steps (12%) done
[Thu Oct 8 18:30:36 2020]
Finished job 1.
3 of 16 steps (19%) done
[Thu Oct 8 18:40:49 2020]
Finished job 16.
4 of 16 steps (25%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log
Hi,
One of my datasets (141 samples) had been running for 5 days when it threw this error message:
Error in rule secondary_nt_lca_table:
jobid: 2734
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
log: hecatomb_out/STDERR/secondary_nt_lca_table.log (check log file(s) for error message)
cluster_jobid: Submitted batch job 35662963
Logfile hecatomb_out/STDERR/secondary_nt_lca_table.log:
DEBUG:root:Reading alignments and extracting taxon IDs
Error executing rule secondary_nt_lca_table on cluster (jobid: 2734, external: Submitted batch job 35662963, jobscript: /scratch/sahlab/AK/.snakemake/tmp.w_b1srs3/snakejob.secondary_nt_lca_table.2734.sh). For error details see the cluster log and the log files of the involved rule(s).
Is there anywhere else that I can check for more information?
Thanks!
Leran
In an early developmental version of hecatomb I merged the 'bigtable' (it wasn't called that at the time!) with a list of phage taxonomies. This added a column called virus_type that allowed for simple partitioning of phage from non-phage viruses.
This should be easy to implement the same way we add Baltimore classifications.
I have the original phage list that I used for this (below).
Ampullaviridae
Atkinsviridae
Autographiviridae
Autolykiviridae
Bicaudaviridae
Blumeviridae
Caudovirales
Caudovirales_undefined_family
Clavaviridae
Corticoviridae
Crevaviridae
Cystoviridae
Duinviridae
Fiersviridae
Fuselloviridae
Globuloviridae
Guelinviridae
Guttaviridae
Inoviridae
Intestiviridae
Jelitoviridae
Leviviridae
Ligamenvirales
Ligamenvirales_undefined_family
Lipothrixviridae
Matshushitaviridae
Microviridae
Myoviridae
Paulinoviridae
Picobirnaviridae
Plasmaviridae
Pleolipoviridae
Podoviridae
Portogloboviridae
Rountreeviridae
Rudiviridae
Salasmaviridae
Schitoviridae
Simuloviridae
Siphoviridae
Solspiviridae
Sphaerolipoviridae
Spiraviridae
Steigviridae
Steitzviridae
Suoliviridae
Tectiviridae
Tinaiviridae
Tristromaviridae
Tubulavirales
Tubulavirales_undefined_family
Turriviridae
unidentified phage
Zobellviridae
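A minimal sketch of how the virus_type column could be added from a list like the one above. The column name "family", the labels "phage" and "non-phage", and the function names are assumptions for illustration only; they are not taken from the hecatomb code, and the set shown is a small subset of the full list.

```python
import csv

# Illustrative subset of the phage list above; a real run would load every entry.
PHAGE_TAXA = {"Microviridae", "Siphoviridae", "Inoviridae"}


def add_virus_type(rows, phage_taxa, family_column="family"):
    """Yield each row with an extra 'virus_type' value (column name is assumed)."""
    for row in rows:
        out = dict(row)
        is_phage = row.get(family_column) in phage_taxa
        out["virus_type"] = "phage" if is_phage else "non-phage"
        yield out


def annotate_tsv(in_handle, out_handle, phage_taxa, family_column="family"):
    """Copy a tab-separated table, appending a virus_type column."""
    reader = csv.DictReader(in_handle, delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["virus_type"]
    writer = csv.DictWriter(
        out_handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in add_virus_type(reader, phage_taxa, family_column):
        writer.writerow(row)
```

Note that this labels every row whose family is not in the list as "non-phage", including non-viral hits, so it would probably need to be combined with a filter on the kingdom or an equivalent column.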
Hi,
I have been running hecatomb on a 190-sample dataset, and it crashed at the "rule PRIMARY_NT_reformat" step. This is what the hecatomb_out/STDERR/t09.mmseqs_PRIMARY_nt_summary.log file shows:
/bin/bash: line 10: 27845 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1
I don't remember seeing this error when I ran a subset (10 samples) of this 190-sample dataset. Can I just rebuild hecatomb, or is there something we should tune before rebuilding?
Thanks!
Leran
When I run the test data, it dies with this error:
Error executing rule sankey_diagram on cluster (jobid: 213, external: Submitted batch job 33397859, jobscript: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/tmp.1edmh2g2/snakejob.sankey_diagram.213.sh). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.
[Thu Oct 28 01:20:39 2021]
Finished job 196.
214 of 216 steps (99%) done
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/log/2021-10-27T220225.813094.snakemake.log
ERROR: Snakemake command failed
ERROR:snakemake.logging:RuleException:
ValueError in line 501 of /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk:
Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
$ pip install -U kaleido
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk", line 501, in __rule_sankey_diagram
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/basedatatypes.py", line 3821, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 268, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 134, in to_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 574, in _callback
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in cached_or_run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2390, in run_wrapper
Hi Mike,
It would be VERY helpful if Hecatomb could link host information to the taxonomy. I have attached a file with a list of viral families and hosts that could help with this.
Thank you,
Kathie
2020_11_24_Viral_Family_host.xlsx
Hi,
I tried to implement the example Snakemake profile following the tutorial instructions on my cluster - it did not work.
Full error message is below:
Running Hecatomb
Running snakemake command:
snakemake --profile slurm --default-resources mem_mb=2000 time=1440 jobs=100 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/conda --configfile hecatomb.config_prof.yaml -s /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/Hecatomb.smk -C Reads=reads.txt Host=human Output=hecatomb_out SkipAssembly=True Fast=False
usage: snakemake [-h] [--dry-run] [--profile PROFILE] [--cache [RULE ...]]
[--snakefile FILE] [--cores [N]] [--jobs [N]]
[--local-cores N] [--resources [NAME=INT ...]]
[--set-threads RULE=THREADS [RULE=THREADS ...]]
[--max-threads MAX_THREADS]
[--set-resources RULE:RESOURCE=VALUE [RULE:RESOURCE=VALUE ...]]
[--set-scatter NAME=SCATTERITEMS [NAME=SCATTERITEMS ...]]
[--default-resources [NAME=INT ...]]
[--preemption-default PREEMPTION_DEFAULT]
[--preemptible-rules PREEMPTIBLE_RULES [PREEMPTIBLE_RULES ...]]
[--config [KEY=VALUE ...]] [--configfile FILE [FILE ...]]
[--envvars VARNAME [VARNAME ...]] [--directory DIR] [--touch]
[--keep-going] [--force] [--forceall]
[--forcerun [TARGET ...]] [--prioritize TARGET [TARGET ...]]
[--batch RULE=BATCH/BATCHES] [--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--rerun-incomplete]
[--shadow-prefix DIR] [--scheduler [{ilp,greedy}]]
[--wms-monitor [WMS_MONITOR]]
[--wms-monitor-arg [NAME=VALUE ...]]
[--scheduler-ilp-solver {COIN_CMD}]
[--scheduler-solver-path SCHEDULER_SOLVER_PATH]
[--conda-base-path CONDA_BASE_PATH] [--no-subworkflows]
[--groups GROUPS [GROUPS ...]]
[--group-components GROUP_COMPONENTS [GROUP_COMPONENTS ...]]
[--report [FILE]] [--report-stylesheet CSSFILE]
[--draft-notebook TARGET] [--edit-notebook TARGET]
[--notebook-listen IP:PORT] [--lint [{text,json}]]
[--generate-unit-tests [TESTPATH]] [--containerize]
[--export-cwl FILE] [--list] [--list-target-rules] [--dag]
[--rulegraph] [--filegraph] [--d3dag] [--summary]
[--detailed-summary] [--archive FILE]
[--cleanup-metadata FILE [FILE ...]] [--cleanup-shadow]
[--skip-script-cleanup] [--unlock] [--list-version-changes]
[--list-code-changes] [--list-input-changes]
[--list-params-changes] [--list-untracked]
[--delete-all-output] [--delete-temp-output]
[--bash-completion] [--keep-incomplete] [--drop-metadata]
[--version] [--reason] [--gui [PORT]] [--printshellcmds]
[--debug-dag] [--stats FILE] [--nocolor] [--quiet]
[--print-compilation] [--verbose] [--force-use-threads]
[--allow-ambiguity] [--nolock] [--ignore-incomplete]
[--max-inventory-time SECONDS] [--latency-wait SECONDS]
[--wait-for-files [FILE ...]] [--wait-for-files-file FILE]
[--notemp] [--all-temp] [--keep-remote] [--keep-target-files]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
[-T RESTART_TIMES] [--attempt ATTEMPT]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob,XRootD}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX]
[--no-shared-fs] [--greediness GREEDINESS] [--no-hooks]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--debug]
[--runtime-profile FILE] [--mode {0,1,2}]
[--show-failed-logs] [--log-handler-script FILE]
[--log-service {none,slack,wms}]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--cluster-config FILE] [--immediate-submit]
[--jobscript SCRIPT] [--jobname NAME]
[--cluster-status CLUSTER_STATUS] [--drmaa-log-dir DIR]
[--kubernetes [NAMESPACE]] [--container-image IMAGE]
[--tibanna] [--tibanna-sfn TIBANNA_SFN]
[--precommand PRECOMMAND]
[--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]]
[--google-lifesciences]
[--google-lifesciences-regions GOOGLE_LIFESCIENCES_REGIONS [GOOGLE_LIFESCIENCES_REGIONS ...]]
[--google-lifesciences-location GOOGLE_LIFESCIENCES_LOCATION]
[--google-lifesciences-keep-cache] [--tes URL] [--use-conda]
[--conda-not-block-search-path-envvars] [--list-conda-envs]
[--conda-prefix DIR] [--conda-cleanup-envs]
[--conda-cleanup-pkgs [{tarballs,cache}]]
[--conda-create-envs-only] [--conda-frontend {conda,mamba}]
[--use-singularity] [--singularity-prefix DIR]
[--singularity-args ARGS] [--use-envmodules]
[target ...]
snakemake: error: Couldn't parse config file: mapping values are not allowed here
in "/home/a1667917/.config/snakemake/slurm/config.yaml", line 145, column 65
George
I have tried with the max memory (#SBATCH --mem=250G) and 42 samples, but it ends in an out of memory error:
Failed, Run time 04:46:02, OUT_OF_MEMORY
/var/lib/slurm-llnl/slurmd/job28663469/slurm_script: line 13: 12927 Killed Rscript /scratch/sahlab/Jeffrey_NextSeq_IBD_VLP/seqtable_merge.R
slurmstepd: error: Detected 1 oom-kill event(s) in step 28663469.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
All samples make it through this step:
Parsed with column specification:
cols(
sequence = col_character(),
N646_I8216_32668_Jeffrey_Sup_10_NEBNext-Index-16C_CCGTCCGC_S11
= col_double()
)
and then the error occurs.
We may have to convert to python or somehow decrease memory usage for this script if we want to process NextSeq data.
Kathie
It didn't seem to be working
Hi there,
Great job on this awesome pipeline! I'd really like to try it, but I'm stuck on installing the databases.
I'm working on a cluster and tried:
hecatomb install --profile slurm
hecatomb install
Please see the log files enclosed.
Thanks in advance,
Patricia
2022-02-03T082640.845628.snakemake.log
2022-02-03T082606.247106.snakemake.log
Hi,
I ran Hecatomb (version beta4) on 141 paired-end samples; it crashed and gave this error message:
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:498)
at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:363)
at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:207)
at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:183)
at java.base/java.lang.Thread.run(Thread.java:834)
I double-checked the sequence files: there are 141 R1 files and 141 R2 files, so the number of R1 and R2 files does not seem to differ. I also don't understand what "EOF marker is absent" means.
Thanks!
Leran
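The assertion above is about the number of reads inside each pair of files rather than the number of files, and a missing EOF marker usually points to a compressed file that was cut short. A minimal sketch for comparing read counts within one pair follows; the function names are illustrative, and it assumes standard four-line FASTQ records and that compressed files end in .gz.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def compare_pair(r1_path, r2_path):
    """Return (reads in R1, reads in R2, whether the counts match)."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    return n1, n2, n1 == n2
```

Running compare_pair() over every R1/R2 pair should show which sample is out of step. With Python's gzip module, a truncated .gz file typically raises EOFError while being read, which is itself a sign that the file needs to be copied or downloaded again.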
Since we use _R1 and _R2 to find mate pairs, here is how to convert fastq-dump output to R1/R2 format:
for F in *pass_*; do O=$(echo "$F" | sed -e 's/pass_/pass_R/'); echo "$F" "$O"; mv "$F" "$O"; done
We should add this to the wiki
Add functionality to find paired-end sample pairs (R1 and R2) from a folder containing many different samples.
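One possible shape for this, sketched under the assumption that mates are marked by _R1 and _R2 followed by "." or "_" in the file name. The function name and the return format are hypothetical and are not part of hecatomb.

```python
import re
from pathlib import Path

# "_R1" or "_R2" followed by "." or "_", so that e.g. "_R10" is not matched.
PAIR_TAG = re.compile(r"_R([12])(?=[._])")


def find_read_pairs(filenames):
    """Return ({sample: (r1, r2)}, [files whose mate is missing])."""
    mates = {}
    for name in filenames:
        base = Path(name).name
        match = PAIR_TAG.search(base)
        if match is None:
            continue  # not recognisable as a paired-end read file
        sample = base[: match.start()]  # everything before the _R1/_R2 tag
        mates.setdefault(sample, {})[match.group(1)] = name
    pairs = {s: (m["1"], m["2"]) for s, m in mates.items() if len(m) == 2}
    unpaired = sorted(n for m in mates.values() if len(m) != 2 for n in m.values())
    return pairs, unpaired
```

The sample name is taken as everything before the first _R1 or _R2 tag, so sample names that themselves contain such a tag would need a stricter pattern. Files without a mate are reported rather than silently dropped.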
Should we work on a list of potential false positives, with reasoning, for the paper? I think we should include Ebrahim and Rob's work on the Poxviridae and the LINEs and SINEs. We could include a warning about the need for follow-up with the large dsDNA viruses (CRISPR-Cas related function in mimiviruses; transposons)?
Hi,
When I first ran hecatomb, it failed at the beginning because it could not find the --conda-frontend parameter that the pipeline requires.
This issue went away after I updated conda to version 4.10.3.
Thanks!
Leran
Hi Mike,
After Hecatomb crashed, I ran this:
hecatomb run --reads RC2_freeze_2_samples_C.tsv --profile slurm --configfile hecatomb.config.yaml --snake=-n --snake=--reason
[Thu Feb 17 08:52:52 2022]
rule secondary_nt_lca_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.m8
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
log: hecatomb_out/STDERR/secondary_nt_lca_table.log
jobid: 2034
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_lca_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
resources: mem_mb=16000, disk_mb=893298, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule secondary_nt_calc_lca:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
log: hecatomb_out/STDERR/secondary_nt_calc_lca.log
jobid: 2033
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_calc_lca.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
threads: 24
resources: mem_mb=64000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
{
# calculate lca and lineage
taxonkit lca -i 2 -s ';' --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin | taxonkit lineage -i 3 --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy | cut --complement -f 2 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage 2> hecatomb_out/STDERR/secondary_nt_calc_lca.log
# Reformat lineages
awk -F ' ' '$2 != 0' hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage | taxonkit reformat --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy -i 3 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F --fill-miss-rank 2>> hecatomb_out/STDERR/secondary_nt_calc_lca.log |
cut --complement -f3 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
} &> hecatomb_out/STDERR/secondary_nt_calc_lca.log
rm hecatomb_out/STDERR/secondary_nt_calc_lca.log
[Thu Feb 17 08:52:52 2022]
rule SECONDARY_NT_generate_output_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/tophit.m8, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/SECONDARY_nt.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv, hecatomb_out/RESULTS/sampleSeqCounts.tsv, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tables/2020_07_27_Viral_classification_table_ICTV2019.txt
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
log: hecatomb_out/STDERR/SECONDARY_NT_generate_output_table.log
jobid: 2026
benchmark: hecatomb_out/BENCHMARKS/SECONDARY_NT_generate_output_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule combine_AA_NT:
input: hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
output: hecatomb_out/RESULTS/bigtable.tsv
log: hecatomb_out/STDERR/combine_AA_NT.log
jobid: 2036
benchmark: hecatomb_out/BENCHMARKS/combine_AA_NT.txt
reason: Missing output files: hecatomb_out/RESULTS/bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
{ cat hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv > hecatomb_out/RESULTS/bigtable.tsv;
tail -n+2 hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv >> hecatomb_out/RESULTS/bigtable.tsv; } &> hecatomb_out/STDERR/combine_AA_NT.log
rm hecatomb_out/STDERR/combine_AA_NT.log
[Thu Feb 17 08:52:52 2022]
rule tax_level_counts:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/taxonLevelCounts.tsv
log: hecatomb_out/STDERR/tax_level_counts.log
jobid: 2045
reason: Missing output files: hecatomb_report/taxonLevelCounts.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule contig_read_taxonomy:
input: hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_out/RESULTS/contigSeqTable.tsv
log: hecatomb_out/STDERR/contig_read_taxonomy.log
jobid: 2041
benchmark: hecatomb_out/BENCHMARKS/contig_read_taxonomy.txt
reason: Missing output files: hecatomb_out/RESULTS/contigSeqTable.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule krona_text_format:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/krona.txt
log: hecatomb_out/STDERR/krona_text_format.log
jobid: 2047
benchmark: hecatomb_out/BENCHMARKS/krona_text_format.txt
reason: Missing output files: hecatomb_report/krona.txt; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule contig_krona_text_format:
input: hecatomb_out/RESULTS/contigSeqTable.tsv
output: hecatomb_report/contigKrona.txt
log: hecatomb_out/STDERR/contig_krona_text_format.log
jobid: 2043
reason: Missing output files: hecatomb_report/contigKrona.txt; Input files updated by another job: hecatomb_out/RESULTS/contigSeqTable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
[Thu Feb 17 08:52:52 2022]
rule krona_plot:
input: hecatomb_report/krona.txt
output: hecatomb_report/krona.html
log: hecatomb_out/STDERR/krona_plot.log
jobid: 2046
benchmark: hecatomb_out/BENCHMARKS/krona_plot.txt
reason: Missing output files: hecatomb_report/krona.html; Input files updated by another job: hecatomb_report/krona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
ktImportText hecatomb_report/krona.txt -o hecatomb_report/krona.html &> hecatomb_out/STDERR/krona_plot.log
rm hecatomb_out/STDERR/krona_plot.log
[Thu Feb 17 08:52:52 2022]
rule contig_krona_plot:
input: hecatomb_report/contigKrona.txt
output: hecatomb_report/contigKrona.html
log: hecatomb_out/STDERR/contig_krona_plot.log
jobid: 2042
reason: Missing output files: hecatomb_report/contigKrona.html; Input files updated by another job: hecatomb_report/contigKrona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
ktImportText hecatomb_report/contigKrona.txt -o hecatomb_report/contigKrona.html &> hecatomb_out/STDERR/contig_krona_plot.log
rm hecatomb_out/STDERR/contig_krona_plot.log
[Thu Feb 17 08:52:52 2022]
localrule all:
input: hecatomb_out/RESULTS/seqtable.fasta, hecatomb_out/RESULTS/sampleSeqCounts.tsv, hecatomb_out/RESULTS/seqtable.properties.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv, hecatomb_out/RESULTS/assembly.properties.tsv, hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv, hecatomb_out/RESULTS/bigtable.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_phylum_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_class_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_order_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_family_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_genus_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_species_summary.tsv, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/Step00_counts.tsv, hecatomb_report/Step01_counts.tsv, hecatomb_report/Step02_counts.tsv, hecatomb_report/Step03_counts.tsv, hecatomb_report/Step04_counts.tsv, hecatomb_report/Step05_counts.tsv, hecatomb_report/Step06_counts.tsv, hecatomb_report/Step07_counts.tsv, hecatomb_report/Step08_counts.tsv, hecatomb_report/Step09_counts.tsv, hecatomb_report/Step10_counts.tsv, hecatomb_report/Step11_counts.tsv, hecatomb_report/Step12_counts.tsv, hecatomb_report/Step13_counts.tsv, hecatomb_report/Sankey.svg, hecatomb_report/hecatomb.samples.tsv, hecatomb_report/taxonLevelCounts.tsv, hecatomb_report/krona.html
jobid: 0
reason: Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/krona.html, hecatomb_report/taxonLevelCounts.tsv, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100
Job stats:
job                                   count    min threads    max threads
SECONDARY_NT_generate_output_table        1              1              1
all                                       1              1              1
combine_AA_NT                             1              1              1
contig_krona_plot                         1              1              1
contig_krona_text_format                  1              1              1
contig_read_taxonomy                      1              2              2
krona_plot                                1              1              1
krona_text_format                         1              1              1
secondary_nt_calc_lca                     1             24             24
secondary_nt_lca_table                    1              1              1
tax_level_counts                          1              2              2
total                                    11              1             24
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
What the hecatomb?
Adjust to fasta over fastq whenever possible
Hi,
For the documentation, can you add a link under each output file to a table that defines each column?
Ex: bigtable.tsv
seqID UID for each representative sequence
This would make the tool more easily accessible, but not clutter the doc.
Thank you,
Kathie
When I run hecatomb, the program stops at:
Downloading and installing remote packages...
No processes are running, the program is not executed.
Create a release for the base install of hecatomb before we start
There are issues with the mmseqs scripts in the /base dir. They were created using an older version of mmseqs2, so the logic doesn't work anymore (particularly with --shuffle). They need to be updated to work with the current version of mmseqs2.
We should use the ICTV resources for phage lineages. If there are missing lineages, post an issue on that website.
The ICTV will update this from time to time, but we should work with them to have a single URL we can download from.
"Non-standard" default snakemake profiles are giving a hecatomb error.
I have followed Mike's snakemake profile tutorial, but then added things to both the sbatch command and the default-resources.
For example, our cluster has a partition setting (general vs GPU) with a default resource of general that has been missed by hecatomb.
We need to make sure that hecatomb is compatible with any profile, and I think we need to ensure we incorporate the user's default-resources settings, and perhaps other settings too(?)
bbtools appears to only recognize :1 & :2 or /1 & /2 for read pairs. This breaks everything at step 6 when we run repair.sh.
Many reads downloaded from SRA use .1 and .2 to identify paired-end reads.
A simple solution is to rename all the reads before you begin to ensure that they end with /1 and /2 as appropriate. You can use change_fastq_pair_symbol to do that; it is not the most elegant or efficient solution, but it works.
I am not sure where/how to implement a better fix for this, or if it is our responsibility to check read names.
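As a rough illustration of the renaming workaround described above, here is a minimal Python sketch. It is not the change_fastq_pair_symbol implementation. It assumes four-line FASTQ records, that the read name is the first space-delimited token of each header line, and that mates are marked by a literal trailing .1 or .2 on that token; the function names fix_header and fix_fastq_lines are hypothetical.

```python
import re

# Trailing ".1" or ".2" on the read name (assumed mate marker).
_PAIR_SUFFIX = re.compile(r"\.([12])$")


def fix_header(header: str) -> str:
    """Rewrite '@name.1 desc' as '@name/1 desc'; other headers pass through."""
    name, sep, rest = header.rstrip("\n").partition(" ")
    name = _PAIR_SUFFIX.sub(r"/\1", name)
    return name + sep + rest


def fix_fastq_lines(lines):
    """Yield FASTQ lines (newline stripped) with header suffixes rewritten.

    Only every fourth line, starting at the first, is treated as a header,
    so quality strings that begin with '@' are left alone.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield fix_header(line) if i % 4 == 0 else line
```

Note that a header whose name ends in .1 or .2 for some other reason (for example a spot number with no mate suffix) would also be rewritten, so the assumption about the naming scheme should be checked against the actual files first.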
Hi Mike,
Can you add to the documentation how to update hecatomb? Do we have to reinstall when there is an update? Will hecatomb alert us to major updates?
Thank you,
Kathie
I am trying to do the contig assembly for 523 samples and have run into this issue:
[2020-10-28 16:49:19] INFO: Simplifying the graph
[2020-10-28 19:21:09] ERROR: Looks like the system ran out of memory
[2020-10-28 19:21:09] ERROR: Command '['flye-modules', 'repeat', '--disjointigs', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/00-assembly/draft_assembly.fasta', '--reads', './assembly/contig_dictionary/all.mh.contigs_for_flye.fa', '--out-dir', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/20-repeat', '--config', '/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/py-flye-2.7.1-36mvt7vew5klvjj37weoxusoqe4l33ka/lib/python3.6/site-packages/flye/config/bin_cfg/asm_subasm.cfg', '--log', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/flye.log', '--threads', '24', '--meta', '--min-ovlp', '1000', '--kmer', '31']' died with <Signals.SIGKILL: 9>.
[2020-10-28 19:21:09] ERROR: Pipeline aborted
This is what the next step SHOULD have been:
[2020-06-03 23:00:27] INFO: >>>STAGE: plasmids
[2020-06-03 23:00:27] INFO: Recovering short unassembled sequences
Here is the command that was running when it crashed:
flye --subassemblies $OUT/contig_dictionary/all.mh.contigs_for_flye.fa -t 24 --meta --plasmids -o $OUT/contig_dictionary -g 1g
Here are my memory and node requests:
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G
This is my version of flye:
module load py-flye/2.7.1-python-3.6.5
I have looked for this issue re flye. One recommendation was to update to flye 2.5, but we are more up to date than that.
fenderglass/Flye#142
The second recommendation was "Hard to tell what is going on, because all cluster environments are usually configured very differently. I would suggest to try to resume run with an increased number of requested threads (maybe 25 in PBS, but use -t 20 in Flye) and specify maximum RAM (say ~500Gb should be sufficient)."
fenderglass/Flye#138
Would appreciate any suggestions of what to try next. I would try the flye command in an interactive session
I can't go above 250G, but I could try increasing the cpus, -t 24 in flye
Hi Mike,
The tophit.m8 should have these columns:
query | target | evalue | pident | fident | nident | mismatch | qcov | tcov | qstart | qend | qlen | tstart | tend | tlen | alnlen | bits | qheader | theader | taxid | taxname | lineage
But the last 3 columns are empty: taxid taxname lineage. They will be important for parsing the contig taxonomy by kingdom, family etc.
From: rule PRIMARY_AA_taxonomy_assignment
Kathie
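To quantify how many rows are affected, a small check along these lines could be run on the file. This is only a sketch: it assumes tophit.m8 is tab-delimited with no header line and with columns in the order listed above, and the function name count_missing_taxonomy is made up for illustration.

```python
import csv
import io

# Column order as listed in the report above.
COLUMNS = [
    "query", "target", "evalue", "pident", "fident", "nident", "mismatch",
    "qcov", "tcov", "qstart", "qend", "qlen", "tstart", "tend", "tlen",
    "alnlen", "bits", "qheader", "theader", "taxid", "taxname", "lineage",
]
TAX_FIELDS = ("taxid", "taxname", "lineage")


def count_missing_taxonomy(handle, columns=COLUMNS):
    """Return (total_rows, rows whose taxid, taxname and lineage are all empty)."""
    tax_idx = [columns.index(c) for c in TAX_FIELDS]
    total = missing = 0
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        total += 1
        # Pad short rows so absent trailing fields count as empty.
        row = row + [""] * (len(columns) - len(row))
        if all(not row[i].strip() for i in tax_idx):
            missing += 1
    return total, missing


# Tiny in-memory demo: one complete row, one row lacking the last three fields.
demo = "\t".join(["x"] * 22) + "\n" + "\t".join(["x"] * 19) + "\n"
demo_counts = count_missing_taxonomy(io.StringIO(demo))
```

For a real file, the same function can be called on an open file handle instead of the in-memory demo.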
update the master branch to reflect the snakemake changes.
Hi Rob,
I am not sure if this will happen with Snakemake, but if I run out of space on the server and contaminant removal quits after sample 42 of 527, it exits with status 0 and no out-of-space message, because those 42 samples completed.
Kathie
Keep zipping / unzipping to a minimum for efficiency.
Hi,
I'm currently trying to get hecatomb working on a VM, but I've run into a Java-related error. The support for said VM and storage servers told me that this error was not related to the VM, and that I should contact the hecatomb developers. I've attached the error log. Please let me know if there's any further information that needs to be provided.
hs_err_pid170.log
I specify a custom config file with --configfile, e.g.
hecatomb run --configfile hecatomb.config.yaml ...
which specifies a custom location for the databases.
The pipeline runs fine until it hits the first rule that uses the script: directive (instead of shell: or run:). The DB check function reverts to the old database location and throws the 'database not installed' error.
Hi,
The cluster I am trying to run hecatomb on blocks all internet access to compute nodes, which is very frustrating.
It looks like a couple of rules in 00_functions.smk rely on Snakemake wrappers - for bam_index and fasta_index rules and so require internet access.
The entire log file I get when I run hecatomb on the compute node is:
Config file /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
WorkflowError:
Failed to open source file https://github.com/snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml
ConnectTimeout: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f65c0836c80>, 'Connection to github.com timed out. (connect timeout=None)'))
I commented out the wrapper for the 2 rules and hecatomb ran fine other than the very final indexing step (I then uncommented and ran the indexing steps on the login node - which worked to complete the pipeline just fine).
Therefore, as far as I can tell, this is the only section of hecatomb that requires online access - perhaps you could implement an option to run offline only?
I tried to manually change the snakemake rule to something like samtools index {input} but that didn't work.
George
We should record which reads map to humans (and perhaps bacteria and other things) so we can use them in downstream analysis (e.g. contig oarfishing)
Can the documentation list which mouse genome is used? C57BL6 or something else?
There is an ongoing discussion around which mapper to use to remove host sequences. This seems like a simple issue, but as with all things it is multi-factorial.
Memory efficiency / speed: This is a big issue for running hecatomb on multiple architectures. Java memory management causes lots of user issues. This is not a simple issue as it is dependent on I/O, indexing, zipping/unzipping, etc.
Mapper outputs: bbmap has by far the best 'built-in' options for outputs. Lots of summary logs and files to be mined eliminating the need to generate parsing scripts to generate the tables you want. In particular, bbmap has nice functionality for directly incorporating taxonomy. This was used to annotate the taxonomy of the bacteria reads were hitting in step_8 (base/bacterial_contaminants.sh). Simple and clean.
Specificity: A lot of mappers are assessed based on their specificity. Correct mappings are required for accurate gene counting in RNA seq and SNP calling. However, neither of these are a real issue for host removal so this is somewhat of a non-issue for this specific application but may have issues in other parts of the pipeline.
Dependency creep: I know that conda fixes a lot of the issues with dependency installation, but there will always be issues with keeping up with updates for multiple dependencies. There is just no way around that. I think it is a noble goal to keep dependencies to a minimum, even if this complicates some other minor issues.
There are faster mappers than both bowtie and bbmap. For example, minimap2 is typically 10X faster than bowtie and is likely to be a better fit for hecatomb than either bowtie or bbmap (https://github.com/lh3/minimap2).
Another option is pseudoalignment mappers such as kallisto (https://pachterlab.github.io/kallisto/about). This would likely be our fastest option, and isn't an additional dependency as it is already used for abundance estimation of assembled contigs. Kallisto is so efficient in both index generation and alignment that a switch to this pseudoalignment approach (if it works!) would get us closer to the magical 'laptop version' of hecatomb.
So I wanted to propose the following. If you have comments or issues on this strategy please post them as comments here:
Remove step_8 mapping the virus masked bacterial genomes. This isn't exactly the topic at hand, but is an elimination of a major, somewhat crippling component in terms of computation and database curation. This should be a separate module.
Swap bbmap/bt2 to kallisto for host mapping
This was a long post, but there are several issues in our weekly calls and on GitHub that deal with mapping and memory issues so I wanted to try and summarize them here as a single place for continued discussion.
SH
If we had a script that would allow us to merge new results with a previous seq table, this could make processing projects with additional rounds of sequencing (or combining several projects related to the same disease or body site) more efficient.
Hi Mike,
In working with the SIV data, I realized that there are spaces in taxonomic fields. For example, for families:
Verrucomicrobia subdivision 3
Verrucomicrobia subdivision 6
This will make pulling reads by family more difficult. Could all spaces in taxonomy fields be replaced with underscores in the next update?
Thank you,
Kathie
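Until such a change lands in the pipeline, the replacement could be applied downstream. The Python sketch below shows the idea on a single row represented as a dict. The rank column names are borrowed from the R code elsewhere in this tracker and are an assumption about the table being processed; underscore_taxonomy is a hypothetical helper, not part of hecatomb.

```python
# Assumed names of the taxonomy columns.
TAX_COLUMNS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def underscore_taxonomy(row: dict, columns=TAX_COLUMNS) -> dict:
    """Return a copy of row with whitespace runs in taxonomy fields replaced by '_'."""
    out = dict(row)
    for col in columns:
        value = out.get(col)
        if isinstance(value, str):
            # split() with no argument also collapses repeated or stray whitespace.
            out[col] = "_".join(value.split())
    return out


example = underscore_taxonomy({"seqID": "read 1", "Family": "Verrucomicrobia subdivision 3"})
```

Only the listed taxonomy columns are touched; other fields, such as the seqID in the example, are left as they are.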
This block of code
# Summarise per taxon counts
merged_table <- stmerge %>%
select(-id) %>%
unite("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
group_by(lineage) %>%
summarise_if(is.numeric, funs(sum(as.numeric(.)))) %>%
separate("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
ungroup()
Generates the warning:
Warning messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: Expected 7 pieces. Additional pieces discarded in 2 rows [137, 138].
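The second warning is consistent with a taxon name that itself contains an underscore: uniting seven ranks with "_" and then separating on "_" gives more than seven pieces for such rows. Whether that is what happens in rows 137 and 138 is an assumption. The Python sketch below only illustrates the failure mode and one possible workaround (a separator that cannot occur in the names); the lineage in it is made up.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def unite(fields, sep):
    """Join the rank values into one lineage string."""
    return sep.join(fields)


def separate(lineage, sep):
    """Split a lineage string back into its pieces."""
    return lineage.split(sep)


# Made-up lineage whose species name contains an underscore.
fields = ["Viruses", "P", "C", "O", "F", "G", "sp_1"]

# Round trip with "_" produces one piece more than there are ranks.
lossy = separate(unite(fields, "_"), "_")

# Round trip with a tab, which does not occur in the names, is lossless.
safe = separate(unite(fields, "\t"), "\t")
```

In the R code itself, grouping directly by the seven rank columns instead of uniting and separating them would likely avoid the round trip altogether.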
this typo causes errors.
Cluster error:
/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1
Log file error report (same as above):
/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1
hecatomb install
Checking and downloading database files
Running snakemake command:
snakemake -j 32 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/conda -s /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/DownloadDB.smk -C Output=hecatomb_out
Config file /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
/usr/bin/bash: line 10: __conda_exe: command not found
Traceback (most recent call last):
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/__init__.py", line 699, in snakemake
success = workflow.execute(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 973, in execute
self.scheduler = JobScheduler(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/scheduler.py", line 348, in __init__
self._executor = CPUExecutor(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 453, in __init__
self.exec_job += self.get_additional_args()
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 298, in get_additional_args
if self.workflow.conda_base_path and self.assume_shared_fs:
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 276, in conda_base_path
return Conda().prefix_path
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 439, in __init__
shell.check_output(self._get_cmd("conda info --json"))
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/shell.py", line 63, in check_output
return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.
The downloading is the same, but we need to add the indexing to make sure we have the bt2 files if we have the use_bowtie option.