hecatomb's People

Contributors

beardymcjohnface, henr0089, kant, leran10, linsalrob, mihinduk, mroach-awri, nhatas, rachelrodgers, sarahbeecroft, shandley

hecatomb's Issues

Same output files from the same dataset have different sizes

Hi,

My colleague and I are running hecatomb on the same dataset, but the output tables we each generated have different numbers of rows or sequences:

mine:

wc -l bigtable.tsv
**33073** bigtable.tsv

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,014**  28,329,286       90    228.4      250

My colleague:

wc -l assembly.fasta
**23998** assembly.fasta

seqkit stat seqtable.fasta
file            format  type  num_seqs     sum_len  min_len  avg_len  max_len
seqtable.fasta  FASTA   DNA    **124,041**  28,332,842       90    228.4      250

We are not sure whether these steps actually finished, because we both failed at the "sankey_diagram" step.

I wanted to check the .err files in the LOG folder to see whether the steps that generated those files had finished, but I couldn't work out which folders were the right ones to look in.

I think it would be very helpful if a final .log file saying "The entire hecatomb pipeline has successfully finished!" were generated once the whole pipeline is done.
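In the meantime, one rough way to check whether a run went to completion (a sketch; it assumes the run was launched from the current directory and that the usual Snakemake completion line made it into the log):

LOG=$(ls -t .snakemake/log/*.snakemake.log | head -n 1)   # most recent Snakemake log
grep -E '[0-9]+ of [0-9]+ steps \(100%\) done' "$LOG" && echo "run appears complete" || echo "run did not finish"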

Thanks!
Leran

kill switch

Hi Mike,

Can you add to the documentation how to kill hecatomb? I killed the tmux session, but it is my understanding that to kill snakemake, I have to kill each running process individually.
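For anyone else hitting this, a rough manual procedure on a SLURM cluster (a sketch; whether Snakemake has already cancelled its submitted jobs depends on how it was interrupted):

squeue -u $USER    # see which jobs snakemake still has queued or running
scancel -u $USER   # cancel all of your jobs, or pass individual job IDs instead
# snakemake usually leaves a directory lock behind after being killed; clear it before the next run,
# e.g. hecatomb run <your usual arguments> --snake=--unlock (the --snake= passthrough shown elsewhere in these issues)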

Kathie

megahit -> flye reformatting

Hi Scott,
This is the current reformat command between megahit and flye:

reformat.sh in={output.rename} out={output.size} \
        ml={config[MINLENGTH]} \
        ow=t \
        -Xmx{config[System][Memory]}g;

If you try to run flye with the same MINLENGTH as megahit, you may fail immediately because of duplicate names. This can be solved by adding this to the reformat command:
uniquenames=t
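For reference, the amended command would look something like this (a sketch of the rule above with the suggested flag added):

reformat.sh in={output.rename} out={output.size} \
        ml={config[MINLENGTH]} \
        uniquenames=t \
        ow=t \
        -Xmx{config[System][Memory]}g;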

Kathie

three concerns with the convert_phyloseq_euk_viral_glom.r standardized count table (species)

1.) Check that coercion from factor to numeric is behaving as intended, for example:

# Adjust values
ps.melt.value.sp <- ps.melt.sp %>%
  select(c((ncol(MAP)+11):(ncol(ps.melt.sp)))) %>%
   mutate_if(is.factor, ~ as.numeric(levels(.x))[.x])

Coercion from factor to numeric requires coercing to character first.

2.) Merging the Baltimore classification table with the melted phyloseq data using a left_join() will likely drop taxa from your table (ie: families in your melted table that are not present in your Baltimore file), as the keep argument defaults to FALSE. Is this the intended behavior? If not, I would recommend changing this parameter to TRUE, then filtering any rows added to the table that have an empty OTU (the Baltimore file will pull in some families that may not match to any families in your melted table).

# may lose information here:
ps.melt.fixed.sp <- left_join(ps.melt.fixed.sp, baltimore, by = "Family")

3.) When generating the standardized count object at the genus level, the default option for tax_glom is to drop any taxa for which you are missing information at the specified rank. In this case it should never be an issue (NAs aren't expected) but it would be safer to go ahead and set this parameter to FALSE in the event something upstream has gone wrong:

ps0.ge.glom <- ps0.sp %>%
  speedyseq::tax_glom("Genus", NArm = FALSE)

Hecatomb v1.0.0.beta.2 crash

Hi Mike,

I only got 86% of the way through this time:

[Thu Nov 4 02:02:55 2021]
Finished job 1885.
1894 of 2200 steps (86%) done
Exiting because a job execution failed. Look above for error message

FATAL: Hecatomb encountered an error.
       Dumping all error logs to "hecatomb.errorLogs.txt"
Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log

Complete log: /scratch/sahlab/RC2_IBD_virome/allt_results/.snakemake/log/2021-11-03T210405.313772.snakemake.log:
rule concatentate_contig_count_tables:
benchmark: hecatomb_out/BENCHMARKS/concatentate_contig_count_tables.txt
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time
37.3387 0:00:37 5.26 20.88 1.20 1.48 172.50 172.50 0.68 0.25

cat hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/...
sed -i '1i sample_id contig_id length reads RPKM FPKM TPM avg_fold_cov contig_GC cov_perc cov_bases median_fold_cov' hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv; } 2> hecatomb_out/STDERR/concatentate_contig_count_tables.log
rm hecatomb_out/STDERR/concatentate_contig_count_tables.log


benchmark: hecatomb_out/BENCHMARKS/mmseqs_contig_annotation.txt
threads: 32
resources: mem_mb=64000, time=1440, jobs=100


    {
    mmseqs createdb hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB --dbtype 2;
    mmseqs search hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/sequenceDB hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results/result hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp             --start-sens 2 -s 7 --sens-steps 3 --min-length 90 -e 1e-5             --search-type 3 ; } &> hecatomb_out/STDERR/mmseqs_contig_annotation.log
    rm hecatomb_out/STDERR/mmseqs_contig_annotation.log

Error submitting jobscript (exit code 1):

Job failed, going on with independent jobs.

[Thu Nov 4 00:27:44 2021]
rule coverage_calculations:

1. Checked the presence of the db:
ls /home/mihindu/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/

sequenceDB sequenceDB.dbtype sequenceDB_h sequenceDB_h.dbtype sequenceDB_h.index sequenceDB.index sequenceDB.lookup sequenceDB.source

2. Checking Flye output:
ls /scratch/sahlab/RC2_IBD_virome/allt_results/hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE
00-assembly 22-plasmids 40-polishing assembly_graph.gfa assembly_info.txt flye.log
20-repeat 30-contigger assembly.fasta assembly_graph.gv contig_dictionary.stats params.json

NOTE: no results or mmseqs_nt_tmp folders


Unfortunately, the crash log is empty:
wc -l hecatomb.crashreport.log
0 hecatomb.crashreport.log
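When the crash report comes up empty, the per-rule error logs are usually still on disk. A rough way to surface the most recently written non-empty ones (a sketch; assumes GNU find and the hecatomb_out/STDERR directory used in the commands above):

find hecatomb_out/STDERR -type f -size +0c -printf '%T@ %p\n' | sort -rn | head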

Error downloading databases

I followed the directions to generate databases here and ran into an error. After decompressing the database tar file, I ran the following command:

snakemake --configfile snakemake/config/sample_config.yaml --snakefile snakemake/workflow/download_databases.smk --cores 8

and got the output below. My snakemake version is 5.26.1.

Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
	count	jobs
	1	all
	1	cluster_uniprot
	1	download_id_taxonomy_mapping
	1	download_ncbi_taxonomy
	1	download_uniprot_viruses
	1	download_uniref50
	1	extract_ncbi_taxonomy
	1	line_sine_download
	1	make_bac_databases
	1	make_host_databases
	1	mmseqs_uniprot_clusters
	1	mmseqs_uniprot_taxdb
	1	mmseqs_urv
	1	mmseqs_urv_taxonomy
	1	uniprot_to_ncbi_mapping
	1	uniref_plus_viruses
	16

[Thu Oct  8 18:24:54 2020]
rule download_uniref50:
    output: databases/proteins/uniref50.fasta.gz
    jobid: 16


[Thu Oct  8 18:24:54 2020]
rule download_id_taxonomy_mapping:
    output: databases/taxonomy/idmapping.dat.gz
    jobid: 9


[Thu Oct  8 18:24:54 2020]
rule download_ncbi_taxonomy:
    output: databases/taxonomy/taxdump.tar.gz
    jobid: 14


[Thu Oct  8 18:24:54 2020]
rule make_bac_databases:
    input: databases/bac_giant_unique_species/bac_uniquespecies_giant.masked_Ns_removed.fasta
    output: databases/bac_giant_unique_species/ref
    jobid: 1
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule download_uniprot_viruses:
    output: databases/proteins/uniprot_virus.faa
    jobid: 4


[Thu Oct  8 18:24:54 2020]
rule make_host_databases:
    input: databases/human_masked/human_virus_masked.fasta
    output: databases/human_masked/ref
    jobid: 2
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule line_sine_download:
    output: databases/contaminants/line_sine.fasta
    jobid: 3

[Thu Oct  8 18:24:54 2020]
[Thu Oct  8 18:24:54 2020]
Error in rule download_id_taxonomy_mapping:
Error in rule download_uniprot_viruses:
    jobid: 9
    jobid: 4
    output: databases/taxonomy/idmapping.dat.gz
    output: databases/proteins/uniprot_virus.faa
    shell:
        
        cd databases/taxonomy;
        curl -LO "https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    shell:
        
        mkdir -p databases/proteins && curl -Lgo databases/proteins/uniprot_virus.faa "https://www.uniprot.org/uniprot/?query=taxonomy:%22Viruses%20[10239]%22&format=fasta&&sort=score&fil=reviewed:no"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)


[Thu Oct  8 18:24:55 2020]
Error in rule line_sine_download:
    jobid: 3
    output: databases/contaminants/line_sine.fasta
    shell:
        
        (curl -L http://sines.eimb.ru/banks/SINEs.bnk &&                 curl -L http://sines.eimb.ru/banks/LINEs.bnk)                 | sed -e '/^>/ s/ /_/g' | seqtk rename                 > databases/contaminants/line_sine.fasta
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job line_sine_download since they might be corrupted:
databases/contaminants/line_sine.fasta
[Thu Oct  8 18:24:57 2020]
Finished job 14.
1 of 16 steps (6%) done
[Thu Oct  8 18:29:42 2020]
Finished job 2.
2 of 16 steps (12%) done
[Thu Oct  8 18:30:36 2020]
Finished job 1.
3 of 16 steps (19%) done
[Thu Oct  8 18:40:49 2020]
Finished job 16.
4 of 16 steps (25%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log

Error in secondary_nt_lca_table step

Hi,

One of my datasets (141 samples) had been running for 5 days when it threw this error message:

Error in rule secondary_nt_lca_table:
    jobid: 2734
    output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
    log: hecatomb_out/STDERR/secondary_nt_lca_table.log (check log file(s) for error message)
    cluster_jobid: Submitted batch job 35662963
Logfile hecatomb_out/STDERR/secondary_nt_lca_table.log:
DEBUG:root:Reading alignments and extracting taxon IDs


Error executing rule secondary_nt_lca_table on cluster (jobid: 2734, external: Submitted batch job 35662963, jobscript: /scratch/sahlab/AK/.snakemake/tmp.w_b1srs3/snakejob.secondary_nt_lca_table.2734.sh). For error details see the cluster log and the log files of the involved rule(s).

Is there anywhere else I can check for more information?
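For what it's worth, the two places the error message points at can be inspected directly (a sketch; sacct availability and fields depend on the cluster's SLURM setup):

sacct -j 35662963 --format=JobID,State,Elapsed,MaxRSS,ExitCode   # the scheduler's view of the failed job (out-of-memory, timeout, etc.)
cat hecatomb_out/STDERR/secondary_nt_lca_table.log               # the rule log named in the error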

Thanks!
Leran

Partition phage and eukaryotic viruses

In an early developmental version of hecatomb I merged the 'bigtable' (it wasn't called that at the time!) with a list of phage taxonomies. This added a column called virus_type that allowed simple partitioning of phage versus non-phage viruses.

This should be easy to implement the same way we add Baltimore classifications.

I have the original phage list that I used for this (below).
Ampullaviridae
Atkinsviridae
Autographiviridae
Autolykiviridae
Bicaudaviridae
Blumeviridae
Caudovirales
Caudovirales_undefined_family
Clavaviridae
Corticoviridae
Crevaviridae
Cystoviridae
Duinviridae
Fiersviridae
Fuselloviridae
Globuloviridae
Guelinviridae
Guttaviridae
Inoviridae
Intestiviridae
Jelitoviridae
Leviviridae
Ligamenvirales
Ligamenvirales_undefined_family
Lipothrixviridae
Matshushitaviridae
Microviridae
Myoviridae
Paulinoviridae
Picobirnaviridae
Plasmaviridae
Pleolipoviridae
Podoviridae
Portogloboviridae
Rountreeviridae
Rudiviridae
Salasmaviridae
Schitoviridae
Simuloviridae
Siphoviridae
Solspiviridae
Sphaerolipoviridae
Spiraviridae
Steigviridae
Steitzviridae
Suoliviridae
Tectiviridae
Tinaiviridae
Tristromaviridae
Tubulavirales
Tubulavirales_undefined_family
Turriviridae
unidentified phage
Zobellviridae
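Using the list above, a rough command-line sketch of the join (the file name phage_families.txt and the bigtable column name "family" are assumptions; the real implementation would presumably mirror the Baltimore-classification merge):

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { phage[$1] = 1; next }                                 # first file: the phage family list
     FNR == 1  { for (i = 1; i <= NF; i++) if ($i == "family") f = i;  # header: locate the family column
                 print $0, "virus_type"; next }
     { t = ($f in phage) ? "phage" : "non-phage"; print $0, t }        # data rows: tag by family
    ' phage_families.txt bigtable.tsv > bigtable.virus_type.tsv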

Error in rule PRIMARY_NT_reformat

Hi,

I have been running hecatomb on a 190-sample dataset, and it crashed at the "rule PRIMARY_NT_reformat" step. This is what shows up in the hecatomb_out/STDERR/t09.mmseqs_PRIMARY_nt_summary.log file:

/bin/bash: line 10: 27845 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

I don't remember seeing this error when I ran a subset (10 samples) of this 190-sample dataset. Can I just rebuild hecatomb, or is there something we should tune before rebuilding?

Thanks!
Leran

missing python package

When I run the test data, it dies with this error:

Error executing rule sankey_diagram on cluster (jobid: 213, external: Submitted batch job 33397859, jobscript: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/tmp.1edmh2g2/snakejob.sankey_diagram.213.sh). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.
[Thu Oct 28 01:20:39 2021]
Finished job 196.
214 of 216 steps (99%) done
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/sahlab/RC2_IBD_virome/test/.snakemake/log/2021-10-27T220225.813094.snakemake.log
ERROR: Snakemake command failed

ERROR:snakemake.logging:RuleException:
ValueError in line 501 of /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk:

Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
$ pip install -U kaleido

File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 2357, in run
_wrapper
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/snakemake/workflow/rules/04_summaries.smk", line 501, in __rule_sankey_diagram
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/basedatatypes.py", line 3821, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 268, in write_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/plotly/io/_kaleido.py", line 134, in to_image
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 574, in _cal
lback
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 560, in cach
ed_or_run
File "/opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb2/lib/python3.9/site-packages/snakemake/executors/init.py", line 2390, in run
_wrapper

Enhancement: Host

Hi Mike,
It would be VERY helpful if Hecatomb could link host information to the taxonomy. I have attached a file with a list of viral families and hosts that could help with this.

Thank you,
Kathie
2020_11_24_Viral_Family_host.xlsx

Example Profile Broken

Hi,

I tried to implement the example Snakemake profile following the tutorial instructions on my cluster - it did not work.

Full error message is below:

Running Hecatomb
Running snakemake command:
snakemake --profile slurm --default-resources mem_mb=2000 time=1440 jobs=100 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/conda --configfile hecatomb.config_prof.yaml -s /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/Hecatomb.smk -C Reads=reads.txt Host=human Output=hecatomb_out SkipAssembly=True Fast=False
usage: snakemake [-h] [--dry-run] [--profile PROFILE] [--cache [RULE ...]]
[--snakefile FILE] [--cores [N]] [--jobs [N]]
[--local-cores N] [--resources [NAME=INT ...]]
[--set-threads RULE=THREADS [RULE=THREADS ...]]
[--max-threads MAX_THREADS]
[--set-resources RULE:RESOURCE=VALUE [RULE:RESOURCE=VALUE ...]]
[--set-scatter NAME=SCATTERITEMS [NAME=SCATTERITEMS ...]]
[--default-resources [NAME=INT ...]]
[--preemption-default PREEMPTION_DEFAULT]
[--preemptible-rules PREEMPTIBLE_RULES [PREEMPTIBLE_RULES ...]]
[--config [KEY=VALUE ...]] [--configfile FILE [FILE ...]]
[--envvars VARNAME [VARNAME ...]] [--directory DIR] [--touch]
[--keep-going] [--force] [--forceall]
[--forcerun [TARGET ...]] [--prioritize TARGET [TARGET ...]]
[--batch RULE=BATCH/BATCHES] [--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--rerun-incomplete]
[--shadow-prefix DIR] [--scheduler [{ilp,greedy}]]
[--wms-monitor [WMS_MONITOR]]
[--wms-monitor-arg [NAME=VALUE ...]]
[--scheduler-ilp-solver {COIN_CMD}]
[--scheduler-solver-path SCHEDULER_SOLVER_PATH]
[--conda-base-path CONDA_BASE_PATH] [--no-subworkflows]
[--groups GROUPS [GROUPS ...]]
[--group-components GROUP_COMPONENTS [GROUP_COMPONENTS ...]]
[--report [FILE]] [--report-stylesheet CSSFILE]
[--draft-notebook TARGET] [--edit-notebook TARGET]
[--notebook-listen IP:PORT] [--lint [{text,json}]]
[--generate-unit-tests [TESTPATH]] [--containerize]
[--export-cwl FILE] [--list] [--list-target-rules] [--dag]
[--rulegraph] [--filegraph] [--d3dag] [--summary]
[--detailed-summary] [--archive FILE]
[--cleanup-metadata FILE [FILE ...]] [--cleanup-shadow]
[--skip-script-cleanup] [--unlock] [--list-version-changes]
[--list-code-changes] [--list-input-changes]
[--list-params-changes] [--list-untracked]
[--delete-all-output] [--delete-temp-output]
[--bash-completion] [--keep-incomplete] [--drop-metadata]
[--version] [--reason] [--gui [PORT]] [--printshellcmds]
[--debug-dag] [--stats FILE] [--nocolor] [--quiet]
[--print-compilation] [--verbose] [--force-use-threads]
[--allow-ambiguity] [--nolock] [--ignore-incomplete]
[--max-inventory-time SECONDS] [--latency-wait SECONDS]
[--wait-for-files [FILE ...]] [--wait-for-files-file FILE]
[--notemp] [--all-temp] [--keep-remote] [--keep-target-files]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
[-T RESTART_TIMES] [--attempt ATTEMPT]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob,XRootD}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX]
[--no-shared-fs] [--greediness GREEDINESS] [--no-hooks]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--debug]
[--runtime-profile FILE] [--mode {0,1,2}]
[--show-failed-logs] [--log-handler-script FILE]
[--log-service {none,slack,wms}]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--cluster-config FILE] [--immediate-submit]
[--jobscript SCRIPT] [--jobname NAME]
[--cluster-status CLUSTER_STATUS] [--drmaa-log-dir DIR]
[--kubernetes [NAMESPACE]] [--container-image IMAGE]
[--tibanna] [--tibanna-sfn TIBANNA_SFN]
[--precommand PRECOMMAND]
[--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]]
[--google-lifesciences]
[--google-lifesciences-regions GOOGLE_LIFESCIENCES_REGIONS [GOOGLE_LIFESCIENCES_REGIONS ...]]
[--google-lifesciences-location GOOGLE_LIFESCIENCES_LOCATION]
[--google-lifesciences-keep-cache] [--tes URL] [--use-conda]
[--conda-not-block-search-path-envvars] [--list-conda-envs]
[--conda-prefix DIR] [--conda-cleanup-envs]
[--conda-cleanup-pkgs [{tarballs,cache}]]
[--conda-create-envs-only] [--conda-frontend {conda,mamba}]
[--use-singularity] [--singularity-prefix DIR]
[--singularity-args ARGS] [--use-envmodules]
[target ...]
snakemake: error: Couldn't parse config file: mapping values are not allowed here
in "/home/a1667917/.config/snakemake/slurm/config.yaml", line 145, column 65

George

For NextSeq samples, maximum HTCF memory insufficient for merge seqtable step in R

I have tried with the max memory (#SBATCH --mem=250G) and 42 samples, but it ends in an out of memory error:
Failed, Run time 04:46:02, OUT_OF_MEMORY

/var/lib/slurm-llnl/slurmd/job28663469/slurm_script: line 13: 12927 Killed Rscript /scratch/sahlab/Jeffrey_NextSeq_IBD_VLP/seqtable_merge.R
slurmstepd: error: Detected 1 oom-kill event(s) in step 28663469.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

All samples make it through this step:
Parsed with column specification:
cols(
sequence = col_character(),
N646_I8216_32668_Jeffrey_Sup_10_NEBNext-Index-16C_CCGTCCGC_S11 = col_double()
)

and then the error occurs.

We may have to convert to Python or somehow decrease the memory usage of this script if we want to process NextSeq data.

Kathie

Error: EOF marker is absent.There appear to be different numbers of reads in the paired input files

Hi,

I ran Hecatomb (version beta4) on 141 paired-end samples, it crashed and gave this error message:

[W::bgzf_read_block] EOF marker is absent. The input may be truncated
java.lang.AssertionError: 
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
    at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:498)
    at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:363)
    at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:207)
    at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:183)
    at java.base/java.lang.Thread.run(Thread.java:834)

I double-checked the sequence files, 141 R1 files and 141 R2 files, and it doesn't seem that the R1 and R2 numbers differ. Also, I don't understand what "EOF marker is absent" means.
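"EOF marker is absent" generally means a bgzf/gzip file is truncated (for example an incomplete copy or transfer), which can also make the read counts within a pair disagree. A rough way to check, plus the re-pairing step the error message itself suggests (a sketch; file names are placeholders):

for f in *_R[12]*.fastq.gz; do gzip -t "$f" || echo "truncated: $f"; done   # flag any truncated files
repair.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz out=fixed_R1.fastq.gz out2=fixed_R2.fastq.gz outs=singletons.fastq.gz   # re-pair a broken pair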

Thanks!
Leran

How to convert fastq-dumps to R1 /R2

Since we use _R1 and _R2 to find mate pairs, here is how to convert fastq-dump output to R1/R2 format

for F in *pass_*; do O=$(echo "$F" | sed -e 's/pass_/pass_R/'); echo "$F" "$O"; mv "$F" "$O"; done

We should add this to the wiki

include info about potential false positives in README?

Should we work on a list of potential false positives, with reasoning, for the paper? I think we should include Ebrahim and Rob's work on the Poxviridae and the LINEs and SINEs. We could include a warning about the need for follow-up with the large dsDNA viruses (CRISPR-Cas-related functions in mimiviruses; transposons).

MANY missing output files; Input files updated by another job

Hi Mike,
After Hecatomb crashed, I ran this:
hecatomb run --reads RC2_freeze_2_samples_C.tsv --profile slurm --configfile hecatomb.config.yaml --snake=-n --snake=--reason


[Thu Feb 17 08:52:52 2022]
rule secondary_nt_lca_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.m8
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
log: hecatomb_out/STDERR/secondary_nt_lca_table.log
jobid: 2034
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_lca_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin
resources: mem_mb=16000, disk_mb=893298, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule secondary_nt_calc_lca:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
log: hecatomb_out/STDERR/secondary_nt_calc_lca.log
jobid: 2033
benchmark: hecatomb_out/BENCHMARKS/secondary_nt_calc_lca.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin

threads: 24
resources: mem_mb=64000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    {
    # calculate lca and lineage
    taxonkit lca -i 2 -s ';' --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/all.lin | taxonkit lineage -i 3 --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy | cut --complement -f 2 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage 2> hecatomb_out/STDERR/secondary_nt_calc_lca.log

    # Reformat lineages
    awk -F '\t' '$2 != 0' hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/lca.lineage | taxonkit reformat --data-dir /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tax/taxonomy -i 3 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F --fill-miss-rank 2>> hecatomb_out/STDERR/secondary_nt_calc_lca.log | cut --complement -f3 > hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv
} &> hecatomb_out/STDERR/secondary_nt_calc_lca.log
rm hecatomb_out/STDERR/secondary_nt_calc_lca.log

[Thu Feb 17 08:52:52 2022]
rule SECONDARY_NT_generate_output_table:
input: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/tophit.m8, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/SECONDARY_nt.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv, hecatomb_out/RESULTS/sampleSeqCounts.tsv, /opt/apps/labs/sahlab/software/miniconda3/envs/hecatomb/snakemake/workflow/../../databases/tables/2020_07_27_Viral_classification_table_ICTV2019.txt
output: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
log: hecatomb_out/STDERR/SECONDARY_NT_generate_output_table.log
jobid: 2026
benchmark: hecatomb_out/BENCHMARKS/SECONDARY_NT_generate_output_table.txt
reason: Missing output files: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/results/secondary_nt_lca.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule combine_AA_NT:
input: hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv
output: hecatomb_out/RESULTS/bigtable.tsv
log: hecatomb_out/STDERR/combine_AA_NT.log
jobid: 2036
benchmark: hecatomb_out/BENCHMARKS/combine_AA_NT.txt
reason: Missing output files: hecatomb_out/RESULTS/bigtable.tsv; Input files updated by another job: hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    { cat hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv > hecatomb_out/RESULTS/bigtable.tsv;
    tail -n+2 hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv >> hecatomb_out/RESULTS/bigtable.tsv; } &> hecatomb_out/STDERR/combine_AA_NT.log
rm hecatomb_out/STDERR/combine_AA_NT.log

[Thu Feb 17 08:52:52 2022]
rule tax_level_counts:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/taxonLevelCounts.tsv
log: hecatomb_out/STDERR/tax_level_counts.log
jobid: 2045
reason: Missing output files: hecatomb_report/taxonLevelCounts.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule contig_read_taxonomy:
input: hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_out/RESULTS/contigSeqTable.tsv
log: hecatomb_out/STDERR/contig_read_taxonomy.log
jobid: 2041
benchmark: hecatomb_out/BENCHMARKS/contig_read_taxonomy.txt
reason: Missing output files: hecatomb_out/RESULTS/contigSeqTable.tsv; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
threads: 2
resources: mem_mb=16000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule krona_text_format:
input: hecatomb_out/RESULTS/bigtable.tsv
output: hecatomb_report/krona.txt
log: hecatomb_out/STDERR/krona_text_format.log
jobid: 2047
benchmark: hecatomb_out/BENCHMARKS/krona_text_format.txt
reason: Missing output files: hecatomb_report/krona.txt; Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule contig_krona_text_format:
input: hecatomb_out/RESULTS/contigSeqTable.tsv
output: hecatomb_report/contigKrona.txt
log: hecatomb_out/STDERR/contig_krona_text_format.log
jobid: 2043
reason: Missing output files: hecatomb_report/contigKrona.txt; Input files updated by another job: hecatomb_out/RESULTS/contigSeqTable.tsv
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

[Thu Feb 17 08:52:52 2022]
rule krona_plot:
input: hecatomb_report/krona.txt
output: hecatomb_report/krona.html
log: hecatomb_out/STDERR/krona_plot.log
jobid: 2046
benchmark: hecatomb_out/BENCHMARKS/krona_plot.txt
reason: Missing output files: hecatomb_report/krona.html; Input files updated by another job: hecatomb_report/krona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    ktImportText hecatomb_report/krona.txt -o hecatomb_report/krona.html &> hecatomb_out/STDERR/krona_plot.log
    rm hecatomb_out/STDERR/krona_plot.log

[Thu Feb 17 08:52:52 2022]
rule contig_krona_plot:
input: hecatomb_report/contigKrona.txt
output: hecatomb_report/contigKrona.html
log: hecatomb_out/STDERR/contig_krona_plot.log
jobid: 2042
reason: Missing output files: hecatomb_report/contigKrona.html; Input files updated by another job: hecatomb_report/contigKrona.txt
resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

    ktImportText hecatomb_report/contigKrona.txt -o hecatomb_report/contigKrona.html &> hecatomb_out/STDERR/contig_krona_plot.log
    rm hecatomb_out/STDERR/contig_krona_plot.log

[Thu Feb 17 08:52:52 2022]
localrule all:
input: hecatomb_out/RESULTS/seqtable.fasta, hecatomb_out/RESULTS/sampleSeqCounts.tsv, hecatomb_out/RESULTS/seqtable.properties.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/MAPPING/contig_count_table.tsv, hecatomb_out/RESULTS/assembly.properties.tsv, hecatomb_out/RESULTS/MMSEQS_AA_SECONDARY/AA_bigtable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv, hecatomb_out/RESULTS/bigtable.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_phylum_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_class_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_order_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_family_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_genus_summary.tsv, hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/SECONDARY_nt_species_summary.tsv, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam, hecatomb_out/PROCESSING/MAPPING/assembly.seqtable.bam.bai, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/Step00_counts.tsv, hecatomb_report/Step01_counts.tsv, hecatomb_report/Step02_counts.tsv, hecatomb_report/Step03_counts.tsv, hecatomb_report/Step04_counts.tsv, hecatomb_report/Step05_counts.tsv, hecatomb_report/Step06_counts.tsv, hecatomb_report/Step07_counts.tsv, hecatomb_report/Step08_counts.tsv, hecatomb_report/Step09_counts.tsv, hecatomb_report/Step10_counts.tsv, hecatomb_report/Step11_counts.tsv, hecatomb_report/Step12_counts.tsv, hecatomb_report/Step13_counts.tsv, hecatomb_report/Sankey.svg, hecatomb_report/hecatomb.samples.tsv, hecatomb_report/taxonLevelCounts.tsv, hecatomb_report/krona.html
jobid: 0
reason: Input files updated by another job: hecatomb_out/RESULTS/bigtable.tsv, hecatomb_report/contigKrona.html, hecatomb_report/krona.html, hecatomb_report/taxonLevelCounts.tsv, hecatomb_out/RESULTS/contigSeqTable.tsv, hecatomb_out/RESULTS/MMSEQS_NT_SECONDARY/NT_bigtable.tsv

resources: mem_mb=2000, disk_mb=, tmpdir=/tmp, time=1440, jobs=100

Job stats:
job count min threads max threads


SECONDARY_NT_generate_output_table 1 1 1
all 1 1 1
combine_AA_NT 1 1 1
contig_krona_plot 1 1 1
contig_krona_text_format 1 1 1
contig_read_taxonomy 1 2 2
krona_plot 1 1 1
krona_text_format 1 1 1
secondary_nt_calc_lca 1 24 24
secondary_nt_lca_table 1 1 1
tax_level_counts 1 2 2
total 11 1 24

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

What the hecatomb?

Documentation

Hi,

For the documentation, can you add a link under each output file to a table that defines each column?
Ex: bigtable.tsv
seqID UID for each representative sequence

This would make the tool more easily accessible, but not clutter the doc.

Thank you,
Kathie

Problem with running

When I run hecatomb, the program stops at:
Downloading and installing remote packages...

No processes are running and the program does not continue.

snakemake profiles

"Non-standard" default snakemake profiles are giving a hecatomb error.

I have followed Mike's snakemake profile tutorial, but then added things to both the sbatch command and the default-resources.

For example, our cluster has a partition setting (general vs GPU) with a default resource of general that has been missed by hecatomb.

We need to make sure that hecatomb is compatible with any profile, and I think we need to ensure we incorporate the user's default-resources settings, and perhaps other settings too.

read pair handling

bbtools appears to only recognize :1 & :2 or /1 & /2 for read pairs. This breaks everything at step 6 when we run repair.sh

Many reads downloaded from SRA use .1 and .2 to identify paired-end reads.

A simple solution is to rename all the reads before you begin to ensure that they end with /1 and /2 as appropriate. You can use change_fastq_pair_symbol to do that, but it is not the most elegant or efficient solution (but it works).

I am not sure where/how to implement a better fix for this, or if it is our responsibility to check read names.

updating hecatomb

Hi Mike,

Can you add to the documentation how to update hecatomb? Do we have to reinstall when there is an update? Will hecatomb alert us to major updates?

Thank you,
Kathie

flye "Looks like the system ran out of memory"

I am trying to do the contig assembly for 523 samples and have run into this issue:

[2020-10-28 16:49:19] INFO: Simplifying the graph
[2020-10-28 19:21:09] ERROR: Looks like the system ran out of memory
[2020-10-28 19:21:09] ERROR: Command '['flye-modules', 'repeat', '--disjointigs', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/00-assembly/draft_assembly.fasta', '--reads', './assembly/contig_dictionary/all.mh.contigs_for_flye.fa', '--out-dir', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/20-repeat', '--config', '/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/py-flye-2.7.1-36mvt7vew5klvjj37weoxusoqe4l33ka/lib/python3.6/site-packages/flye/config/bin_cfg/asm_subasm.cfg', '--log', '/scratch/sahlab/RC2_IBD_virome/assembly/contig_dictionary/flye.log', '--threads', '24', '--meta', '--min-ovlp', '1000', '--kmer', '31']' died with <Signals.SIGKILL: 9>.
[2020-10-28 19:21:09] ERROR: Pipeline aborted

This is what the next step SHOULD have been:
[2020-06-03 23:00:27] INFO: >>>STAGE: plasmids
[2020-06-03 23:00:27] INFO: Recovering short unassembled sequences

Here is the command that was running when it crashed:
flye --subassemblies $OUT/contig_dictionary/all.mh.contigs_for_flye.fa -t 24 --meta --plasmids -o $OUT/contig_dictionary -g 1g

Here are my memory and node requests:
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G

This is my version of flye:
module load py-flye/2.7.1-python-3.6.5

I have looked for this issue re flye. One recommendation was to update to flye 2.5, but we are more up to date than that.
fenderglass/Flye#142
The second recommendation was "Hard to tell what is going on, because all cluster environments are usually configured very differently. I would suggest to try to resume run with an increased number of requested threads (maybe 25 in PBS, but use -t 20 in Flye) and specify maximum RAM (say ~500Gb should be sufficient)."
fenderglass/Flye#138

Would appreciate any suggestions of what to try next. I would try the flye command in an interactive session. I can't go above 250G, but I could try increasing the CPUs (-t 24 in flye).
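If it helps, Flye can resume from the last completed stage rather than restarting from scratch (a sketch based on the command above; --resume reuses the same output directory, and it is not certain this avoids the memory spike in the repeat stage):

flye --subassemblies $OUT/contig_dictionary/all.mh.contigs_for_flye.fa -t 24 --meta --plasmids -o $OUT/contig_dictionary -g 1g --resume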

missing fields in outfile

Hi Mike,

The tophit.m8 should have these columns:
query | target | evalue | pident | fident | nident | mismatch | qcov | tcov | qstart | qend | qlen | tstart | tend | tlen | alnlen | bits | qheader | theader | taxid | taxname | lineage

But the last 3 columns are empty: taxid, taxname, lineage. They will be important for parsing the contig taxonomy by kingdom, family, etc.

From: rule PRIMARY_AA_taxonomy_assignment

Kathie

slurm and space

Hi Rob,

I am not sure if this will happen with Snakemake, but if I run out of space on the server and contaminant removal quits after sample 42 of 527, it exits with status 0 and no out-of-space message because those 42 samples completed.

Kathie

Java Runtime Error when running test databases on VM

Hi,

I'm currently trying to get hecatomb working on a VM, but I've run into a Java-related error. The support for said VM and storage servers told me that this error was not related to the VM, and that I should contact the hecatomb developers. I've attached the error log. Please let me know if there's any further information that needs to be provided.
hs_err_pid170.log

settings in '--configfile' file not passed to 'script:' rules

I specify a custom config file with --configfile e.g.

hecatomb run --configfile hecatomb.config.yaml ...

which specifies a custom location for the databases.

The pipeline runs fine until it hits the first rule that uses the script: directive (instead of shell: or run:). The DB check function reverts to the old database location and throws the 'database not installed' error.

Run Offline

Hi,

The cluster I am trying to run hecatomb on blocks all internet access to compute nodes, which is very frustrating.

It looks like a couple of rules in 00_functions.smk rely on Snakemake wrappers (for the bam_index and fasta_index rules) and so require internet access.

The entire log file I get when I run hecatomb on the compute node is:

Config file /hpcfs/users/a1667917/myconda/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
WorkflowError:
Failed to open source file https://github.com/snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml
ConnectTimeout: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /snakemake/snakemake-wrappers/raw/0.77.0/bio/samtools/index/environment.yaml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f65c0836c80>, 'Connection to github.com timed out. (connect timeout=None)'))

I commented out the wrapper for the 2 rules and hecatomb ran fine other than the very final indexing step (I then uncommented and ran the indexing steps on the login node - which worked to complete the pipeline just fine).
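For reference, the two wrapper steps appear to amount to standard samtools indexing and can be run by hand once everything else has finished (a sketch; the file paths are placeholders, not the pipeline's actual outputs):

samtools faidx path/to/seqtable.fasta          # roughly what a fasta_index wrapper rule does
samtools index path/to/assembly.seqtable.bam   # roughly what a bam_index wrapper rule does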

Therefore, as far as I can tell, this is the only section of hecatomb that requires online access - perhaps you could implement an option to run offline only?

I tried to manually change the snakemake rule to something like samtools index {input} but that didn't work.

George

Add human taxonomy to a taxtable

We should record which reads map to humans (and perhaps bacteria and other things) so we can use them in downstream analysis (e.g. contig oarfishing)

mouse

Can the documentation list which mouse genome is used? C57BL6 or something else?

Mappers for host removal

There is an ongoing discussion around which mapper to use to remove host sequences. This seems like a simple issue, but as with all things it is multi-factorial.

  1. Memory efficiency / speed: This is a big issue for running hecatomb on multiple architectures. Java memory management causes lots of user issues. This is not a simple issue as it is dependent on I/O, indexing, zipping/unzipping, etc.

  2. Mapper outputs: bbmap has by far the best 'built-in' options for outputs. Lots of summary logs and files to be mined eliminating the need to generate parsing scripts to generate the tables you want. In particular, bbmap has nice functionality for directly incorporating taxonomy. This was used to annotate the taxonomy of the bacteria reads were hitting in step_8 (base/bacterial_contaminants.sh). Simple and clean.

  3. Specificity: A lot of mappers are assessed based on their specificity. Correct mappings are required for accurate gene counting in RNA seq and SNP calling. However, neither of these are a real issue for host removal so this is somewhat of a non-issue for this specific application but may have issues in other parts of the pipeline.

  4. Dependency creep: I know that conda fixes a lot of the issues with dependency installation, but there will always be issues with keeping up with updates for multiple dependencies. There is just no way around that. I think it is a noble goal to keep dependencies to a minimum even if some other minor issues are complicated because of this.

There are faster mappers than both bowtie and bbmap. For example, minimap2 is typically 10X faster than bowtie and is likely to be a better fit for hecatomb than either bowtie or bbmap (https://github.com/lh3/minimap2).

Another option is pseudoalignment mappers such as kallisto (https://pachterlab.github.io/kallisto/about). This would likely be our fastest option, and isn't an additional dependency as it is already used for abundance estimation of assembled contigs. Kallisto is so efficient in both index generation and alignment that a switch to this pseudoalignment approach (if it works!) would get us closer to the magical 'laptop version' of hecatomb.
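A minimal sketch of what a kallisto-based host screen might look like (the kallisto and samtools commands are standard, but using pseudoalignment for host removal is the proposal here, not current hecatomb behaviour; file names are placeholders):

kallisto index -i host.idx host_masked.fasta                                                # build the host index once
kallisto quant -i host.idx -o host_screen --pseudobam reads_R1.fastq.gz reads_R2.fastq.gz   # pseudoalign read pairs against the host
# pairs with no pseudoalignment (putative non-host) could then be pulled from the resulting
# BAM, e.g. with samtools fastq -f 12 on the name-collated pseudoalignments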

So I wanted to propose the following. If you have comments or issues on this strategy please post them as comments here:

  1. Remove step_8 mapping the virus masked bacterial genomes. This isn't exactly the topic at hand, but is an elimination of a major, somewhat crippling component in terms of computation and database curation. This should be a separate module.

  2. Swap bbmap/bt2 to kallisto for host mapping

This was a long post, but there are several issues in our weekly calls and on GitHub that deal with mapping and memory issues so I wanted to try and summarize them here as a single place for continued discussion.

SH

taxonomy improvement

Hi Mike,

In working with the SIV data, I realized that there are spaces in taxonomic fields. For example, for families:
Verrucomicrobia subdivision 3
Verrucomicrobia subdivision 6

This will make pulling reads by family more difficult. Could all spaces in taxonomy fields be replaced with underscores in the next update?
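As a stopgap, spaces can be squashed to underscores on the way into downstream tools (a rough sketch; note it touches every space in the file, so it is only safe where the non-taxonomy columns contain no meaningful spaces):

sed 's/ /_/g' bigtable.tsv > bigtable.underscored.tsv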

Thank you,
Kathie

convert_phyloseq.R

This block of code

# Summarise per taxon counts
merged_table <- stmerge %>%
  select(-id) %>%
  unite("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
  group_by(lineage) %>%
  summarise_if(is.numeric, funs(sum(as.numeric(.)))) %>%
  separate("lineage", c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "_") %>%
  ungroup()

Generates the warning:

Warning messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
2: Expected 7 pieces. Additional pieces discarded in 2 rows [137, 138]. 

Error executing PRIMARY_NT_reformat

Cluster error:

/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

Log file error report (same as above):

/bin/bash: line 10: 18485 Illegal instruction (core dumped) mmseqs filterdb hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/result hecatomb_out/RESULTS/MMSEQS_NT_PRIMARY/results/firsthit --extract-lines 1

hecatomb install giving error

hecatomb install
Checking and downloading database files
Running snakemake command:
snakemake -j 32 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/conda -s /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/DownloadDB.smk -C Output=hecatomb_out
Config file /home/vsingh/miniconda3/envs/hecatomb/snakemake/workflow/../config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
/usr/bin/bash: line 10: __conda_exe: command not found
Traceback (most recent call last):
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/init.py", line 699, in snakemake
success = workflow.execute(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 973, in execute
self.scheduler = JobScheduler(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/scheduler.py", line 348, in init
self._executor = CPUExecutor(
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/init.py", line 453, in init
self.exec_job += self.get_additional_args()
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/executors/init.py", line 298, in get_additional_args
if self.workflow.conda_base_path and self.assume_shared_fs:
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/workflow.py", line 276, in conda_base_path
return Conda().prefix_path
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 439, in init
shell.check_output(self._get_cmd("conda info --json"))
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/site-packages/snakemake/shell.py", line 63, in check_output
return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/vsingh/miniconda3/envs/hecatomb/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.
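The "__conda_exe: command not found" line suggests the shell Snakemake spawns cannot see conda's shell integration. One possible workaround (a sketch; the conda.sh path is inferred from the miniconda3 prefix in the traceback and may differ on your system):

source /home/vsingh/miniconda3/etc/profile.d/conda.sh   # load conda's shell functions
conda activate hecatomb                                 # the environment the traceback paths point to
hecatomb install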
