
bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters.

Home Page: http://huttenhower.sph.harvard.edu/biobakery_workflows

License: Other


biobakery_workflows's Introduction

bioBakery Workflows

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes. Workflows are run directly from the command line and tasks can be imported to create your own custom workflows. The workflows and tasks are built with AnADAMA2 which allows for parallel task execution locally and in a grid compute environment.

For additional information, see the bioBakery workflows tutorial.

Table of contents


Getting Started


Requirements

  1. AnADAMA2 (installed automatically)
  2. Python v2.7+
  3. See individual workflows and tasks for additional software requirements.

Installation

Install software

bioBakery workflows can be installed with Conda, Docker, or pip.

To install with Conda:

$ conda install -c biobakery biobakery_workflows
  • Before installing the tool, configure your conda channels so that biobakery is at the top of the list; one way to do this is sketched below.
  • This will install almost all of the dependencies for all workflows (e.g., KneadData, MetaPhlAn, etc.), excluding those dependencies that require licenses.
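
One way to configure the channel priority, as a sketch (conda config --add channels prepends each channel, so the last channel added, biobakery, ends up at the top of the list):

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --add channels biobakery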

To install and run with Docker:

$ docker run -it biobakery/workflows bash
  • The image will include all dependencies for all workflows (e.g., KneadData, MetaPhlAn, etc.), excluding those dependencies that require licenses.

To install with pip:

$ pip install biobakery_workflows

Install databases

Install automatically

Once the software and dependencies are installed, the databases can be installed automatically.

Run the following command to install the databases required for a workflow:

$ biobakery_workflows_databases --install $WORKFLOW

  • Replace $WORKFLOW with the workflow name (e.g., wmgx, wmgx_demo, wmgx_wmtx, 16s, or isolate_assembly).
  • The databases will be installed at $HOME/biobakery_workflow_databases/ or /opt/biobakery_workflow_databases/ depending on permissions.
  • To install to a custom location, add the option --location $FOLDER. With this option you will also need to set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to that folder so the workflows can find the installed databases (see the example after this list).
  • The database install requires some of the dependencies from the corresponding workflow to build and install the databases. For example, installing the wmgx databases requires HUMAnN, KneadData, StrainPhlAn and bowtie2. Please install these dependencies prior to installing the databases. Depending on the method used to install the workflows you might need to install these dependencies in addition to the workflow.
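
For example, to install the wmgx databases to a custom folder and point the workflows at it (the folder path below is only a placeholder):

$ biobakery_workflows_databases --install wmgx --location /path/to/databases
$ export BIOBAKERY_WORKFLOWS_DATABASES=/path/to/databases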

Install manually

Alternatively the databases can be installed manually and then referenced with environment variables. The shotgun data processing workflows require Kneaddata (human, human transcriptome, and SILVA), HUMAnN (utility mapping, nucleotide, and protein databases), and StrainPhlAn (reference and marker) databases while the 16s data processing workflow requires the GreenGenes fasta, taxonomy, and usearch formatted files.

When manually installing the databases, the following environment variables need to be set (a sketch of exporting them follows the lists below).

  • Shotgun workflows: KNEADDATA_DB_HUMAN_GENOME, KNEADDATA_DB_RIBOSOMAL_RNA, KNEADDATA_DB_HUMAN_TRANSCRIPTOME, STRAINPHLAN_DB_REFERENCE, and STRAINPHLAN_DB_MARKERS.
  • 16s workflows: GREEN_GENES_USEARCH_DB, GREEN_GENES_FASTA_DB, and GREEN_GENES_TAXONOMY_DB.
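
As a sketch, after downloading the databases you might export the variables in your shell before running a workflow (all paths below are placeholders; point them at your own download locations):

$ export KNEADDATA_DB_HUMAN_GENOME=/path/to/kneaddata/human_genome
$ export KNEADDATA_DB_RIBOSOMAL_RNA=/path/to/kneaddata/silva
$ export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/path/to/kneaddata/human_transcriptome
$ export STRAINPHLAN_DB_REFERENCE=/path/to/strainphlan/reference
$ export STRAINPHLAN_DB_MARKERS=/path/to/strainphlan/markers

The 16s variables (GREEN_GENES_USEARCH_DB, GREEN_GENES_FASTA_DB, and GREEN_GENES_TAXONOMY_DB) are exported the same way.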

How to Run

Basic Usage

All workflows follow the general command format:

$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT

For a list of all available workflows, run:

$ biobakery_workflows --help

For specific options for a workflow, run:

$ biobakery_workflows $WORKFLOW --help

Data Processing Workflows

The basic command to run a data processing workflow, replacing $WORKFLOW with the workflow name, is:

$ biobakery_workflows $WORKFLOW --input $INPUT_DIR --output $DATA_OUTPUT_DIR

This command will run the workflow on the files in the input folder (replace $INPUT_DIR with the path to the folder containing the fastq files) and write results to the output folder (replace $DATA_OUTPUT_DIR with the path to the folder that should receive the output files).

Visualization Workflow

A single visualization workflow exists that can be used for any data processing workflow. The basic command to run a visualization workflow, replacing $WORKFLOW_VIS with the visualization workflow name, is:

$ biobakery_workflows $WORKFLOW_VIS --input $DATA_OUTPUT_DIR --project-name $PROJECT --output $OUTPUT_DIR

The input folder ($DATA_OUTPUT_DIR, to be replaced with the path to the folder) in this command is a subset of the output folder from the data processing workflow; run the workflow with the option --help to determine which files are required and which are optional. The output folder ($OUTPUT_DIR, to be replaced with the path to the output folder) will contain the output files from the visualization workflow. The project name should replace $PROJECT in the command so the report can include it.

Parallelization Options

When running any workflow you can add the following command line options to make use of existing computing resources (a combined example follows the list):

  • --local-jobs <1> : Run multiple tasks locally in parallel. Provide the max number of tasks to run at once. The default is one task running at a time.
  • --grid-jobs <0> : Run multiple tasks on a grid in parallel. Provide the max number of grid jobs to run at once. The default is zero, meaning no tasks are submitted to a grid and all tasks run locally.
  • --grid <slurm> : Set the grid available on your machine. This will default to the grid found on the machine with options of slurm and sge.
  • --partition <serial_requeue> : Jobs will be submitted to the partition selected. The default partition selected is based on the default grid.
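
For example, a hypothetical run that executes up to eight tasks locally and submits up to ten jobs to a SLURM partition would be:

$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --local-jobs 8 --grid-jobs 10 --grid slurm --partition serial_requeue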

For additional workflow options, see the AnADAMA2 user manual.


Data Processing Workflows


bioBakery workflows includes a collection of workflows for shotgun sequences and 16s data processing. Most workflows can be run on the command line with the following syntax:

$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT

See the section on parallelization options to optimize the workflow run based on your computing resources.

Whole Metagenome Shotgun (wmgx)

Super Tasks

  1. [Quality control]
  2. [Taxonomic profiling]
  3. [Functional profiling]
  4. Strain profiling
  5. Assembly (not run by default)

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. MetaPhlAn
    1. Install with: $ conda install -c bioconda metaphlan
  3. HUMAnN
    1. Install with: $ conda install -c biobakery humann OR $ pip install humann
  4. StrainPhlAN
    1. Install with: $ conda install -c bioconda strainphlan
  5. Prokka (Only required if running in assembly mode)
  6. MegaHit (Only required if running in assembly mode)
  7. seqtk (Only required if running in assembly mode)

Inputs

  1. A set of fastq (or fastq.gz) files (single-end or paired-end). The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing ".R1" and ".R2" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.
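
For example, a sketch of a run on uncompressed paired files whose first-read names end in _1.fastq (the pair identifier shown is hypothetical) would be:

$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --input-extension fastq --pair-identifier _1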

To run the workflow

  • $ biobakery_workflows wmgx --input $INPUT --output $OUTPUT
  • In the command replace $INPUT with the path to the folder containing your fastq input files and $OUTPUT with the path to the folder to write output files.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings for all main tool subtasks. These settings will work for most data sets. However, if you need to customize your workflow settings for the KneadData and StrainPhlAn subtasks, please read the documentation for each tool to determine the optimum settings and then apply those settings with the corresponding options. For example, --qc-options="$OPTIONS" will modify the default settings when running the KneadData subtask and --strain-profiling-options="$OPTIONS" will modify the options when running the StrainPhlAn subtask (replacing the $OPTIONS in each with your selected settings); a sketch follows this list.
  • Add the option --run-assembly to add the tasks to run assembly.
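
As a sketch, a run that enables assembly and passes custom quality-control settings through to KneadData might look like the following; the quoted KneadData/Trimmomatic options are illustrative assumptions rather than recommendations, so check the KneadData documentation for the exact flags and values to use:

$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --run-assembly --qc-options="--trimmomatic-options='SLIDINGWINDOW:4:20 MINLEN:50'"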

To run a demo

  • Single-end
    • $ biobakery_workflows wmgx --input examples/wmgx/single/ --output workflow_output
  • Paired-end
    • $ biobakery_workflows wmgx --input examples/wmgx/paired/ --output workflow_output
  • Demo input files can be found in the biobakery_workflow source examples folder.

Whole Metagenome and Metatranscriptome Shotgun (wmgx_wmtx)

Super Tasks

  1. [Quality control]
  2. [Taxonomic profiling]
  3. [Functional profiling]
  4. Strain profiling

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. MetaPhlAn
    1. Install with: $ conda install -c bioconda metaphlan
  3. HUMAnN
    1. Install with: $ conda install -c biobakery humann OR $ pip install humann
  4. StrainPhlAN
    1. Install with: $ conda install -c bioconda strainphlan

Inputs

  1. Two sets of fastq (or fastq.gz) files (single-end or paired-end). One set is of whole metagenome shotgun data and the other is whole metatranscriptome shotgun data. The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
  2. Optionally, provide a mapping file. This file will have two columns and be tab delimited. The first column is the sample names for the metatranscriptomes and the second is the corresponding metagenome sample. See the demo mapping file for an example, and the small sketch after this list.
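
As a sketch, a minimal mapping file (sample names here are hypothetical; the two columns are separated by a tab) would look like:

sample1_wts	sample1_wms
sample2_wts	sample2_wms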

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing ".R1" and ".R2" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.

To run the workflow

  • $ biobakery_workflows wmgx_wmtx --input-metagenome $INPUT_WMS --input-metatranscriptome $INPUT_WTS --input-mapping $INPUT_MAPPING --output $OUTPUT
  • In the command replace $INPUT_WMS with the path to the folder containing your whole metagenome shotgun fastq.gz input files, $INPUT_WTS with the path to the folder containing your whole metatranscriptome shotgun fastq.gz input files, and $OUTPUT with the path to the folder to write output files. Replace $INPUT_MAPPING with your file of mapping between the metagenome and metatranscriptome samples.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings for all main tool subtasks. These settings will work for most data sets. However, if you need to customize your workflow settings for the KneadData subtasks please read the documentation for KneadData to determine the optimum settings. Then apply these settings by using the option --qc-options="$OPTIONS".

To run a demo

  • $ biobakery_workflows wmgx_wmtx --input-metagenome examples/wmgx_wmtx/wms/ --input-metatranscriptome examples/wmgx_wmtx/wts/ --input-mapping examples/wmgx_wmtx/mapping.tsv --output workflow_output
  • Demo input files can be found in the biobakery_workflow source examples folder.

16S rRNA (16s)

The 16s workflow has two methods that can be used: UPARSE (with either USEARCH or VSEARCH (default)) and DADA2. All methods perform quality control and generate taxonomic tables.

Workflow diagrams

Super Tasks

  1. Demultiplex (only for raw reads)
  2. Merge Samples and Rename
  3. Quality Control (Learn Error Rates for DADA2 method)
  4. Taxonomic Profiling
  5. Functional Profiling (with PICRUSt v1 or v2; v2 requires python3)

Requirements

  1. Vsearch
  2. Usearch (v9.0.2132) (Not required for DADA2 or VSEARCH method)
  3. PICRUSt v1.1 (or v2)
    1. Install with: $ conda install -c bioconda picrust
    2. OR Install with: $ conda install -c bioconda picrust2
  4. BIOM v2 (Not required for DADA2 method)
    1. Install with: $ pip install biom-format
  5. Clustal Omega
    1. Install with: $ conda install -c bioconda clustal-omega
  6. EA-Utils (Only required for raw reads)
    1. Install with: $ conda install -c bioconda ea-utils
  7. FastTree
    1. Install with: $ conda install -c bioconda fasttree
  8. R and the following packages: dada2, gridExtra, tools, ggplot2, seqinr (Only required for DADA2 method)
    1. Follow the DADA2 install directions
  9. FIGARO (Only required if using FIGARO to estimate truncation lengths for DADA2)

Inputs

  1. A set of fastq (or fastq.gz) files (single-end or paired-end; only paired-end for DADA2), raw or demultiplexed. If the files are demultiplexed, the files are expected to be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
  2. A barcode file (only required for raw reads).

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.

To run the workflow

  • $ biobakery_workflows 16s --input $INPUT --output $OUTPUT
  • In the command replace $INPUT with the path to the folder containing your fastq input files and $OUTPUT with the path to the folder to write output files.
  • By default the workflow implements the UPARSE workflow using VSEARCH plus additional tasks. To run with USEARCH as the executable instead of VSEARCH, see the workflow's --help output for the corresponding option.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with default settings for the subtasks. Depending on the lengths of your reads you might want to change the setting --trunc-len-max 200, if running the VSEARCH/USEARCH method, to a smaller value. Reading through the maxee table will help to determine the length to use for trimming based on the joined reads and their quality scores. For other default settings, please run the workflow with the --help option. All of the other settings will work for most data sets. If there are any you would like to change, please review the usearch documentation to determine the optimal settings.
  • Add the option --method dada2 to run the DADA2 method instead of VSEARCH.
  • Add the option --amplicon-length <N> to run FIGARO to estimate truncation lengths when running with the DADA2 method (see the example after this list).
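
For example, a hypothetical DADA2 run that also uses FIGARO (the amplicon length shown is a placeholder for your own amplicon size) would be:

$ biobakery_workflows 16s --input $INPUT --output $OUTPUT --method dada2 --amplicon-length 450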

Isolate Assembly (isolate_assembly)

This workflow will assemble and annotate sequenced microbial isolate genomes. It runs the raw sequences through quality control, assembly (with SPAdes), annotation (with Prokka), functional annotation, quality assessment, and then creates a final annotated contig file.

Workflow diagram

Super Tasks

  1. Quality control
  2. Assembly
  3. Annotation
  4. Functional annotation
  5. Quality assessment

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. SPAdes
    1. Install with: $ conda install -c bioconda spades
  3. Prokka
    1. Install with: $ conda install -c bioconda prokka
  4. Quast
    1. Install with: $ conda install -c bioconda quast
  5. CheckM (plus databases)
  6. run_DBcan (plus databases)
  7. Emapper (version 2+) (databases installed with biobakery utility, biobakery_workflows_databases)

Inputs

  1. A set of raw fastq (or fastq.gz) files (single-end or paired-end). This workflow does not allow multiplexed files as input. The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.

To run the workflow

  • $ biobakery_workflows isolate_assembly --input $INPUT --species-name $SPECIES --output $OUTPUT
  • In the command replace $INPUT with the path to the folder containing your fastq input files, $SPECIES with the name of the isolate sequenced and $OUTPUT with the path to the folder to write output files. The $SPECIES input string is used as the basename of the contig files.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.

Visualization Workflows

bioBakery workflows includes a single universal visualization workflow for shotgun sequences and 16s data. The workflow can be run on the command line with the following syntax:

$ biobakery_workflows vis --input $INPUT --project-name $PROJECT --output $OUTPUT

A subset of files from the $OUTPUT folder of a data processing workflow can be used in the $INPUT folder to the visualization workflow. For detailed information on the input files required for the visualization workflow, see the help message for the workflow by running the command:

$ biobakery_workflows vis --help

Visualization for Whole Metagenome Shotgun and 16S (vis)

This workflow generates a document of tables, bar plots, a PCoA plot, scatter plots, and heatmaps using the output of the wmgx or 16S workflows as input.

Requirements

  1. Pweave (installed automatically)
  2. NumPy and SciPy
    1. Install with: $ pip install numpy AND $ pip install scipy
  3. Matplotlib
    1. Install with: $ pip install matplotlib
  4. LaTeX
  5. Pandoc (version < 2 required)
    1. Install with: $ conda install pandoc
  6. Hclust2
    1. Install with: $ conda install -c biobakery hclust2
  7. R with the vegan package

Inputs

  1. An input folder containing the final products from the wmgx or 16S data workflow.
    1. OPTIONAL: A file of the KneadData read counts for the wmgx samples (single or paired end).
    2. A file of the merged taxonomic profile or closed reference OTU table or ASV table.
    3. OPTIONAL: A file of the merged pathway abundances (normalized).
    4. OPTIONAL: A file of the HUMAnN alignment counts.
    5. OPTIONAL: A file of the HUMAnN feature counts.
    6. OPTIONAL: A file of the read counts per sample (including total reads, classified, and unclassified).
    7. OPTIONAL: A file of the eestats for all samples.
    8. The log file from the corresponding data processing workflow.
  2. The project name (Optional).
  3. Introduction text (Optional).
  4. The report format (Options: pdf/html, pdf is default).

Outputs

  1. A pdf (or html) report.
  2. A directory of figures included in the report.
    1. Quality control section
      1. Paired end read count table and barchart
      2. Orphan read count table and barchart
      3. Microbial read proportion table
    2. Taxonomy section
      1. Species count table
      2. Ordination (PCoA)
      3. Heatmap of top species abundances
      4. Stacked barplot of top species abundances
    3. Functional profiling section
      1. Pathway abundance
        1. Top pathways by abundance (heatmap and table)
        2. Top pathways by variance (heatmap and table)
      2. Feature counts section
        1. Scatter plot of the aligned reads (nucleotide/translated)
        2. Scatter plot of the gene families counts (nucleotide/translated)
        3. Scatter plot of the ECs counts (nucleotide/translated)
        4. Scatter plot of the pathway counts (nucleotide/translated)
  3. A directory of read count tables
  4. A zip archive containing the pdf (or html) report, figures, and data files

To run the workflow

  • $ biobakery_workflows vis --input $INPUT --project-name $PROJECT --output $OUTPUT
  • In the command replace $INPUT with a folder containing files from the output folder created by running the wmgx or 16S data processing workflow, $PROJECT with the name of the project, and $OUTPUT with the path to the folder to write output files.

Stats workflow

The stats workflow takes as input feature tables generated from the wmgx or 16s workflows. It can also be used with any tab-delimited feature table.

$ biobakery_workflows stats --input $INPUT --input-metadata $INPUT_METADATA --output $OUTPUT --project-name $PROJECT

Requirements

  1. MaAsLin2
  2. HUMAnN (for pathways barcharts stratified by species)
  3. HAllA
  4. R plus vegan, ggplot2, optparse, gridExtra, permute, reshape2, RColorBrewer, cowplot, plyr, ade4, viridis

Inputs

The workflow requires four arguments and allows for eighteen optional arguments.

Workflow arguments can be provided on the command line or with an optional config file using the AnADAMA2 built-in option "--config".

Required

  1. INPUT: The folder containing the input data files.

    • Can be txt, tsv or biom (also compressed okay, like PICRUSt2 outputs)
    • Looks for taxonomic profile, pathway abundance, ec abundance, etc.
    • Can identify filetype based on contents.
    • Works with wmgx and 16s input files.
    • Data file can have the samples as rows or columns.
  2. INPUT_METADATA: The metadata file for input.

    • File can have the samples as rows or columns.
  3. OUTPUT: The folder to write the output files.

  4. PROJECT: The name of the project (string for the report header).

Optional

  1. Transform: Used with MaAsLin2.
  2. Longitudinal: Set if the study is longitudinal (default is univariate/multivariate).
  3. Fixed effects: Used to generate the model (MaAsLin2 and beta diversity).
  4. Multivariate fixed effects: Used to generate the model (MaAsLin2 and beta diversity). These variables are listed first in the beta diversity covariate equation.
  5. Random effects: Used to generate the model (MaAsLin2).
  6. Bypass Maaslin: Skip running MaAsLin2 on all input files.
  7. Permutations: Total number of permutations to apply with the permanova (default=4999). (Only for longitudinal studies)
  8. Static covariates: Metadata variable names. (Only for longitudinal studies)
  9. Scale: The scale to apply to the permanova. (Only for longitudinal studies)
  10. Min abundance: Min filtering value (default 0.0001)
  11. Min prevalence: Min filtering value (default 0.1)
  12. Max missing: Max percentage of values for a metadata variable that are missing (default 20%)
  13. Format: Format for the report (html or pdf)
  14. Top pathways: Max number of pathways to display for stratified abundance plots.
  15. Metadata categorical: Names of metadata variables that are categorical. (Only for stratified pathways plots)
  16. Metadata continuous: Names of the metadata variables that are continuous. (Only for stratified pathways plots)
  17. Metadata exclude: Exclude these variables from the metadata. (Only for stratified pathways plots)
  18. Introduction text: The text written to the beginning of the report.

Workflow steps

  1. Identify data file types.

    • Reads through the start of all tsv/txt and biom files in the input folder to find taxonomic profiles, pathway abundances, ec abundances, etc. (files of unknown type are referenced in the report by name).
    • Looks for files of those types for both wmgx and 16s.
    • Requires at least one taxonomic profile.
  2. Determine the study type based on the input data files (ie. wmgx or 16s).

  3. If biom files are provided, convert them to tsv (equivalent to the manual conversion sketched after this list).

  4. Check for sample names in the feature tables that are not included in the metadata file; if any are found, throw an error asking the user to add those sample names to the metadata file.

  5. Create feature tables for all input files. These files are compatible with all downstream processing tasks (e.g., MaAsLin2, humann_barplots).

  6. Run mantel tests comparing all of the input data files.

  7. If pathway abundance files are provided, generate stratified pathways plots.

    • For the top N pathways (based on significant results from MaAsLin2), generate a plot for each categorical metadata variable.
  8. If longitudinal, run the permanova script.

  9. If not longitudinal, run the beta diversity script to generate stacked barplots of the R-squared and p-values from adonis run on each metadata variable individually and again on all variables together.

  10. Run MaAsLin2 on all input files.

  11. Run HAllA on all input files (minus gene families due to size).

  12. Create a report with figures.
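
The biom-to-tsv conversion in step 3 is handled by the workflow itself; as a reference for what it does, an equivalent manual conversion with the biom-format command line tool (file names are placeholders) would be:

$ biom convert -i otu_table.biom -o otu_table.tsv --to-tsv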

Run a demo

Using the HMP2 (IBDMDB) merged data files provided by the project, run the stats workflow to generate a report.

$ biobakery_workflows stats --input HMP2_data/ --input-metadata HMP2_metadata.tsv --fixed-effects="diagnosis,dysbiosisnonIBD,dysbiosisUC,dysbiosisCD,antibiotics,age" --random-effects="site,subject" --project-name HMP2 --output HMP2_stats_output --longitudinal --static-covariates="age" --permutations 10 --maaslin-options="reference='diagnosis,nonIBD'"

The files in the input folder are the taxonomic profile, pathway abundance, and ec abundance tables. Fixed and random effect variables are specified for the MaAsLin2 runs. The metadata type selected is longitudinal, and the static covariate in the study metadata is "age". The reduced number of permutations shortens the runtime for the three permanova calculations. The reference is provided to MaAsLin2 because diagnosis is a variable with more than two levels; it sets "nonIBD" as the reference level for the model and the resulting box plots.

Outputs include a folder for each MaAsLin2 run plus figures folders and a report.

WDL workflow

An MTX workflow based on the bioBakery AnADAMA2 wmgx workflows.

This workflow is currently installed in the Terra workspace: https://app.terra.bio/#workspaces/rjxmicrobiome/mtx_workflow .

The WDL is located in this repository at: biobakery_workflows/workflows/wtx.wdl .

Inputs

The workflow has eleven required inputs and ten optional inputs.

Required inputs The workflow requires eleven inputs for each run. Five inputs can be modified for each project, whereas the other six would only be modified with software version changes.

  • ProjectName : The name of the sequencing project. The final output report and zip archive will use this name (only alphanumeric characters allowed).
  • InputExtension : The extension for all of the input files (example ".fastq.gz")
  • InputRead1Identifier : The identifier in the file name for those files that are read1 (example ".R1")
  • InputRead2Identifier : The identifier in the file name for those files that are read2 (example ".R2")
  • InputRead1Files : A file path (in a google bucket) with a list of all of the read1 files. This file must have the full paths to all of the files and is only expected to include the read1 files (not those for read2). The names for each of the samples will be computed based on the read pair identifier and the input file extension provided. For example, a file named SAMPLE1.R1.fastq.gz would have a sample name of "SAMPLE1", a read1 identifier of ".R1", and an extension of ".fastq.gz". It is expected that each sample will have two files (one file for each read of the pair).

To generate a file to use as input for InputRead1Files, follow the Terra instructions at https://support.terra.bio/hc/en-us/articles/360033353952-Creating-a-list-file-of-reads-for-input-to-a-workflow , adding the InputRead1Identifier and the InputExtension to command #2. For example, with InputRead1Identifier = ".R1" and InputExtension = ".fastq.gz", command #2 would become gsutil ls gs://your_data_Google_bucket_id/ | grep ".fastq.gz" | grep ".R1" > ubams.list . Also, since this workflow expects fastq or fastq.gz input files, you might change the name of the file list in this command from "ubams.list" to "fastq_list.txt".
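
For example, with those settings the full file-list command would be (the bucket id is a placeholder for your own bucket):

$ gsutil ls gs://your_data_Google_bucket_id/ | grep ".fastq.gz" | grep ".R1" > fastq_list.txt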

These six required inputs would only be modified if the versions of KneadData and HUMAnN change, since these databases are specifically tied to the software versions.

  • versionSpecificChocophlan : The Chocophlan database used by HUMAnN. This is located at databases/humann/full_chocophlan_plus_viral.v0.1.1.tar.gz in this workspace google bucket.
  • versionSpecifichumanDB : The human reference database used by Kneaddata. This is located at databases/kneaddata/Homo_sapiens_hg37_human_contamination_Bowtie2_v0.1.tar.gz in this workspace google bucket.
  • versionSpecifictranscriptDB : The human transcriptome reference database used by Kneaddata. This is located at databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz in this workspace google bucket.
  • versionSpecificrrnaDB : The SILVA rrna reference database used by Kneaddata. This is located at databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz in this workspace google bucket.
  • versionSpecificUniRef90 : The uniref90 reference database used by HUMAnN. This is located at databases/humann/uniref90_annotated_1_1.tar.gz in this workspace google bucket.
  • versionSpecificUtilityMapping : The utility mapping database used by HUMAnN. This is located at databases/humann/full_utility_mapping_1_1.tar.gz in this workspace google bucket.

Optional inputs There are an additional ten optional inputs for each workflow run. These are not required. If not set, the default values will be used.

  • bypassFunctionalProfiling (Default = false): Setting this to true will bypass functional profiling and all of the downstream tasks, including normalization and merging of the functional data products (gene families, ecs, and pathways).
  • dataType (Default = "mtx"): This is the sequencing data type (mtx or mgx).
  • inputMetadataFile (Default = None) : This file is used with the visualization task to annotate the figures with metadata.
  • MaxMemGB_FunctionalProfileTasks (Default = 32 Gb): This is the max memory to request for each HUMAnN v3.0 task. This might need to be increased for larger input files.
  • MaxMemGB_QualityControlTasks (Default = 8 Gb): This is the max memory to request for each Kneaddata task. This might need to be increased for larger input files.
  • MaxMemGB_TaxonomicProfileTasks (Default = 24 Gb): This is the max memory to request for each MetaPhlAn v3.0 task. This might need to be increased for larger input files.
  • preemptibleAttemptsOverride (Default = 2): This setting determines how many times to rerun one of the main compute tasks (HUMAnN v3.0, Kneaddata, and MetaPhlAn v3.0) on pre-emptible instances. If set to zero a non-pre-emptible instance will be used.

There are three additional optional inputs that can be used to run with one or more custom databases.

  • customQCDB1 (Default = None) : Provide a custom bowtie2-formatted database to use for the quality control step instead of the default human reference.
  • customQCDB2 (Default = None) : Provide a second custom bowtie2-formatted database to be used in addition to the first custom database.
  • customQCDB3 (Default = None) : Provide a third custom bowtie2-formatted database to be used in addition to the other custom databases.

Outputs

The workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. Each task has its own folder in the google bucket output folder with a sub-folder for each time it is run. The outputs of interest, including their respective folders, are described below. $SAMPLE_NAME is the name of the sample included in the original raw files. For example, SAMPLE1.R1.fastq.gz would have a sample name of "SAMPLE1".

  • call-QualityControl (with sub-folders for each sample, in each subfolder there are the following files)
    • $SAMPLE_NAME.fastq.gz : This is the file of reads after running through QC.
    • $SAMPLE_NAME.log : This is the log from Kneaddata that includes read counts.
    • glob*/$SAMPLE_NAME_DB_contam*.fastq.gz : These are the reads that mapped to the reference database (with name $DB) for this sample.
    • glob*/$SAMPLE_NAME_[R1|R2].[html|zip] : These are the output files from running fastqc on read1 and read2 prior to running quality control.
  • call-FunctionalProfile (with sub-folders for each sample, in each subfolder there are the following files)
    • $SAMPLE_NAME.log : This is the log from HUMAnN that includes read alignment counts.
    • glob*/$SAMPLE_NAME_bowtie2_unaligned.fa : These are the unaligned reads from running the nucleotide search.
    • glob*/$SAMPLE_NAME_diamond_unaligned.fa : These are the unaligned reads from running the translated search.
  • call-VisualizationReport
    • $PROJECT_NAME_visualization.zip : This folder contains a visualization report plus final compiled data tables.
      • wmgx_report.pdf : This is the exploratory report of tables and figures.
      • data/humann2_feature_counts.tsv : This contains the feature counts (pathways, gene families, ecs) for each sample.
      • data/humann2_read_and_species_counts.tsv : This contains the counts of reads aligning at each step plus the total number of species identified for each sample.
      • data/kneaddata_read_count_table.tsv : This contains the read counts (split into pairs and orphans) for each step in the quality control process for each sample.
      • data/metaphlan2_taxonomic_profiles.tsv : This contains the merged taxonomic profiles for all samples.
      • data/microbial_counts_table.tsv : This table includes counts ratios for each step of the quality control process for all samples.
      • data/pathabundance_relab.tsv : This is a merged table of the pathway abundances for all samples normalized to relative abundance.
      • data/qc_counts_orphans_table.tsv : This is a table with the total number of orphan reads not aligning to each of the reference databases.
      • data/qc_counts_pairs_table.tsv : This is a table with the total number of paired reads not aligning to each of the reference databases.
      • data/taxa_counts_table.tsv: This table includes the total number of species and genera before and after filtering.
      • data/top_average_pathways_names.tsv : This table includes the top pathways by average abundance, with their full names, including average abundance and variance.

Run a demo

A demo data set is included in the Terra workspace. The demo set includes six paired samples (three MTX and three MGX) from IBDMDB plus a small metadata file. Using preemptible instances, this demo set will cost about $5 to run.

IBDMDB (6 sample) demo run configuration:

  • ProjectName : "ibdmdb_demo" (this can be any string you would like)
  • InputExtension : ".fastq.gz"
  • InputRead1Identifier : "_R1"
  • InputRead2Identifier : "_R2"
  • InputRead1Files : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/ibdmdb_file_list.txt"
  • inputMetadataFile : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/ibdmdb_demo_metadata.txt"

Required software specific databases:

  • versionSpecificChocophlan : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/humann/full_chocophlan_plus_viral.v0.1.1.tar.gz"
  • versionSpecifichumanDB : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Homo_sapiens_hg37_human_contamination_Bowtie2_v0.1.tar.gz"
  • versionSpecifictranscriptDB : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz"
  • versionSpecificrrnaDB : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz"
  • versionSpecificUniRef90 : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/humann/uniref90_annotated_1_1.tar.gz"
  • versionSpecificUtilityMapping : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/humann/full_utility_mapping_1_1.tar.gz"

Optional custom databases (to run with one or more custom databases instead of the default references used in QC)

  • customQCDB1 : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Clupus_bowtie2.tar.gz"
  • customQCDB2 : "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/ClupusRNA_bowtie2.tar.gz"

Refer to the section above for descriptions of the output files generated by running the workflow.

Example output files from running the IBDMDB data set with metadata can be found in this workspace in the folder IBDMDB/final_outputs/ibdmdb_demo_visualizations.zip.

Contributions

Thanks go to these wonderful people:

biobakery_workflows's People

Contributors

anamailyan, carze, ljmciver, sagun98, tkuntz-hsph


biobakery_workflows's Issues

TypeError: object of type 'dict_keyiterator' has no len()

Dear developers,
I'm trying to install on an HPC:

  • OS: Oracle Linux (Red Hat) 8.6
  • installation method: conda create -y -p ${TARGETDIR} -c conda-forge -c bioconda -c biobakery "biobakery_workflows=3.0.0a7"
  • postinstall: I downloaded kneaddata databases with kneaddata_database command

I tried to execute with the example data from this repository:

biobakery_workflows wmgx \
        --input biobakery_workflows/examples/wmgx/paired \
        --output outdir \
        --bypass-strain-profiling

but I always get this error:

Traceback (most recent call last):
  File "/apps/biobakery-workflows/3.0.0a7/bin/wmgx.py", line 184, in <module>
    workflow.go()
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/workflow.py", line 800, in go
    _runner.run_tasks(task_idxs)
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/runners.py", line 141, in run_tasks
    self.ctx._handle_task_started(idx)
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/workflow.py", line 884, in _handle_task_started
    self._reporter.task_started(task_no)
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/reporters.py", line 196, in task_started
    r.task_started(task_no)
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/reporters.py", line 641, in task_started
    self.log_event("ready and waiting for resources", task_no, self._daginfo(task_no))
  File "/apps/biobakery-workflows/3.0.0a7/lib/python3.9/site-packages/anadama2/reporters.py", line 597, in _daginfo
    return msg.format(len(parents), parents, len(children), children)
TypeError: object of type 'dict_keyiterator' has no len()

I don't know how to troubleshoot. Any direction, hint, or idea would be welcome.
I couldn't find any related issues here, on the anadama2 GitHub page, or on the bioBakery forum.

Thanks

Visualization for Whole Metagenome Shotgun (wmgx_vis)

Dear maintainers,
I'm using bioBakery for shotgun metagenomics analysis. I am profiling viruses and bacteria, but I would like to separate the visualization output to obtain two outputs, one for bacteria and one for viruses. Is it possible to perform such a task using wmgx_vis? Would you rather suggest conducting two separate workflow runs?
Thank you in advance for your answer and consideration.

Marco

About the parameters of kneaddata

When I use the biobakery workflow, I can't change the parameters of kneaddata. I can only use the default parameters of the workflow, right?

nanopore reads?

Hi
Just to confirm, none of these tools is appropriate for amplicon sequencing (e.g. whole 16S sequencing on Oxford Nanopore platforms)?
Wherever I look I only find mentions of "shotgun"...

Thanks in advance for any reply
Martin

biobakery_workflows_databases not working

Dear Whom It May Concern,

I am attempting to run the workflows tutorial (https://github.com/biobakery/biobakery/wiki/biobakery_workflows#121-install-with-conda) on Ubuntu 20.04 LTS on Windows Subsystem for Linux 2. The conda script was not working, so I moved on to the pip method and everything seemed to work installation-wise. But I get the following error when attempting to download the database with biobakery_workflows v0.15.1:

(base) root@DESKTOP-6BMVSG8:/mnt/c/Users/bradshawd# biobakery_workflows_databases --install wmgx_demo
Traceback (most recent call last):
  File "/usr/local/bin/biobakery_workflows_databases", line 5, in <module>
    from biobakery_workflows.biobakery_workflows_databases import main
  File "/usr/local/lib/python3.8/dist-packages/biobakery_workflows/biobakery_workflows_databases.py", line 30, in <module>
    from . import utilities
  File "/usr/local/lib/python3.8/dist-packages/biobakery_workflows/utilities.py", line 33, in <module>
    from anadama2.tracked import TrackedDirectory
  File "/usr/local/lib/python3.8/dist-packages/anadama2/__init__.py", line 83, in <module>
    from .workflow import Workflow
  File "/usr/local/lib/python3.8/dist-packages/anadama2/workflow.py", line 24, in <module>
    from . import backends
  File "/usr/local/lib/python3.8/dist-packages/anadama2/backends.py", line 7, in <module>
    import leveldb
SystemError: bad call flags

How do I solve this issue?

Thank you for your time and help. Please let me know if you need more information.

Sincerely,

David

Cannot install full shotgun databases (wmgx) in the biobakery VM image

Dear developer(s),

I am new to the biobakery VM images and currently I am going through the biobakery_workflows tutorials following:

https://github.com/biobakery/biobakery/wiki/biobakery_workflows

However, I found it difficult to install the full shotgun databases following "1.2.5 Install databases" in the tutorial. Below is my command and the output as shown in the terminal:

vagrant@biobakery:~$ biobakery_workflows_databases --install wmgx
Installing humann utility mapping database
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
Downloading file of size: 2.55 GB

2.55 GB 100.00 % 9.49 MB/sec 0 min -0 sec
Extracting: /home/vagrant/biobakery_workflows_databases/humann/full_mapping_v201901.tar.gz

Database installed: /home/vagrant/biobakery_workflows_databases/humann/utility_mapping

Unable to write to the HUMAnN config file.
Unable to install database. Error running command: humann_databases --download utility_mapping full /home/vagrant/biobakery_workflows_databases/humann

Maybe something is wrong with HUMAnN? I tried re-downloading humann but it didn't work. Could you please advise how to address this issue?

Thanks in advance

Warm regards

Hong

error installing databases "no module named 'leveldb'"

Hi,
I hope you can help me with this: I installed the workflow with conda, and when I try to install the databases I keep getting the error below. Thanks!!

$ biobakery_workflows_databases --install wmgx
Traceback (most recent call last):
  File "/hpc/group/mp345/miniconda3/envs/shotgun/bin/biobakery_workflows_databases", line 6, in <module>
    from biobakery_workflows.biobakery_workflows_databases import main
  File "/hpc/group/mp345/miniconda3/envs/shotgun/lib/python3.9/site-packages/biobakery_workflows/biobakery_workflows_databases.py", line 30, in <module>
    from . import utilities
  File "/hpc/group/mp345/miniconda3/envs/shotgun/lib/python3.9/site-packages/biobakery_workflows/utilities.py", line 34, in <module>
    from anadama2.tracked import TrackedDirectory
  File "/hpc/group/mp345/miniconda3/envs/shotgun/lib/python3.9/site-packages/anadama2/__init__.py", line 83, in <module>
    from .workflow import Workflow
  File "/hpc/group/mp345/miniconda3/envs/shotgun/lib/python3.9/site-packages/anadama2/workflow.py", line 24, in <module>
    from . import backends
  File "/hpc/group/mp345/miniconda3/envs/shotgun/lib/python3.9/site-packages/anadama2/backends.py", line 7, in <module>
    import leveldb
ModuleNotFoundError: No module named 'leveldb'

Problem with databases install

Hi all, we have installed biobakery_workflows with Conda on a cluster, together with all the dependencies it needs for whole genome sequencing (MetaPhlAn, kneaddata, humann), but I am getting this error when running biobakery_workflows_databases --install wmgx:

ERROR: Unable to find tool required for database install: strainphlan.py. Please install StrainPhlAn.

StrainPhlan is installed and the file is also present. We are working in a Conda environment, could it be a problem related to this?

Many thanks for your help.

error: biobakery_workflows_databases --install 16s_usearch

(biobakery_workflows) lizhihua@lizhihua-S2600CP:~$ biobakery_workflows --version
biobakery_workflows v3.0.0-alpha.4
(biobakery_workflows) lizhihua@lizhihua-S2600CP:~$ kneaddata --version
kneaddata v0.7.4
(biobakery_workflows) lizhihua@lizhihua-S2600CP:~$ biobakery_workflows_databases --install 16s_usearch
Downloading green genes database files
Downloading ftp://greengenes.microbio.me/greengenes_release/gg_13_8_otus/rep_set/97_otus.fasta
Downloading file of size: 136.58 MB

136.59 MB 100.00 % 3.75 MB/sec 0 min -0 sec

Downloading ftp://greengenes.microbio.me/greengenes_release/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
Downloading file of size: 9.61 MB

9.61 MB 100.01 % 1.20 MB/sec 0 min -0 sec

Traceback (most recent call last):
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/bin/biobakery_workflows_databases", line 11, in <module>
    sys.exit(main())
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/site-packages/biobakery_workflows/biobakery_workflows_databases.py", line 244, in main
    run_command(["download_picrust_files.py"])
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/site-packages/biobakery_workflows/biobakery_workflows_databases.py", line 73, in run_command
    id=subprocess.check_call(command, shell=shell)
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/subprocess.py", line 358, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/subprocess.py", line 339, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/home/lizhihua/miniconda3/envs/biobakery_workflows/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'download_picrust_files.py': 'download_picrust_files.py'

Incorrect download_eggnog_data.py arguments

Hello,

I recently came across your tool and wanted to give it a try. I've successfully installed biobakery_workflows using conda via the conda command: conda install -c biobakery biobakery_workflows. Next I tried to download databases corresponding to the isolate_assembly workflow:

biobakery_workflows_databases --install isolate_assembly --location $LOCATION

When running the above command, I encounter a FileNotFoundError pointing to download_eggnog_data.py. I have tried and failed to find this file in the GitHub repo. Is the repo not synchronized with the biobakery conda recipe? Is download_eggnog_data.py on another branch that I may download and place in the required location? The received traceback is below. Thanks!

Downloading eggnog mapper databases
Traceback (most recent call last):
  File "/home/evan/miniconda3/envs/project/bin/biobakery_workflows_databases", line 11, in <module>
    sys.exit(main())
  File "/home/evan/miniconda3/envs/project/lib/python3.7/site-packages/biobakery_workflows/biobakery_workflows_databases.py", line 238, in main
    run_command(["download_eggnog_data.py","--data_dir",eggnog_install_path,"-y"])
  File "/home/evan/miniconda3/envs/project/lib/python3.7/site-packages/biobakery_workflows/biobakery_workflows_databases.py", line 73, in run_command
    id=subprocess.check_call(command, shell=shell)
  File "/home/evan/miniconda3/envs/project/lib/python3.7/subprocess.py", line 358, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/home/evan/miniconda3/envs/project/lib/python3.7/subprocess.py", line 339, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/evan/miniconda3/envs/project/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/home/evan/miniconda3/envs/project/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'download_eggnog_data.py': 'download_eggnog_data.py'

Download biobakery_workflows v3.0.0a7 and find that linux-aarch64 is not supported

Dear developer(s),

I am new to biobakery_workflows. I need to use linux-aarch64 biobakery_workflows v3.0.0a7 for business development, but I found at this address (https://anaconda.org/biobakery/biobakery_workflows) that linux-aarch64 is not supported. The screenshot of the download address is as follows :
image

What should I do if I want to support linux-aarch64? May you please advise how to address this issue?

Thanks in advance

warm regards

wj

mpa_v30_CHOCOPhlAn_201901 missing (but mpa_v31_CHOCOPhlAn_201901 present instead)

Hello,

While installing the biobakery-workflow dependencies according to https://github.com/biobakery/biobakery_workflows/wiki/Installing-the-workflows-with-pip, during the step:

$ biobakery_workflows_databases --install wmgx

I get:

Could not locate a Bowtie index corresponding to basename "/u/local/apps/PYTHON-VIRT-ENVS/3.6.8/biobakery/lib/python3.6/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901"

looking in /u/local/apps/PYTHON-VIRT-ENVS/3.6.8/biobakery/lib/python3.6/site-packages/metaphlan/metaphlan_databases/ I see the following:

$ ls /u/local/apps/PYTHON-VIRT-ENVS/3.6.8/biobakery/lib/python3.6/site-packages/metaphlan/metaphlan_databases/mpa_v31_CHOCOPhlAn_201901.
mpa_v31_CHOCOPhlAn_201901.1.bt2      mpa_v31_CHOCOPhlAn_201901.3.bt2      mpa_v31_CHOCOPhlAn_201901.fna.bz2    mpa_v31_CHOCOPhlAn_201901.pkl        mpa_v31_CHOCOPhlAn_201901.rev.2.bt2
mpa_v31_CHOCOPhlAn_201901.2.bt2      mpa_v31_CHOCOPhlAn_201901.4.bt2      mpa_v31_CHOCOPhlAn_201901.md5        mpa_v31_CHOCOPhlAn_201901.rev.1.bt2  mpa_v31_CHOCOPhlAn_201901.tar

and if I try to reinstall the DB I get:

$ metaphlan --install --nproc 2 
The database is installed

Is it possible that the MetaPhlAn DB has been updated from version 30 to 31 and that biobakery_workflows needs to be updated?

Thanks,

RD

biobakery_workflows wmgx_vis error

biobakery_workflows wmgx_vis --input workflow_output/ --project-name PRJNA685168 --output PRJNA685168_vis_no_metadata
(May 15 15:57:34) [0/2 - 0.00%] **Ready ** Task 0: document
(May 15 15:57:34) [0/2 - 0.00%] **Started ** Task 0: document
Exception:
float division by zero
Error messages will be included in output document
/home/lk/miniconda3/envs/metagenome/bin/hclust2.py:588: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. This has been deprecated since 3.3 and in 3.6, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = mpl.cm.get_cmap("bbcry").copy()
cm.set_under( bottom_col )
/home/lk/miniconda3/envs/metagenome/bin/hclust2.py:594: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. This has been deprecated since 3.3 and in 3.6, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = mpl.cm.get_cmap("bbcry").copy()
cm.set_over( top_col )
[the two MatplotlibDeprecationWarning messages above are repeated several more times in the log]
(May 15 15:58:43) [1/2 - 50.00%] **Failed ** Task 0: document
(May 15 15:58:43) [2/2 - 100.00%] **Failed ** Task 6: archive
Run Finished
Task 0 failed
Name: document
Original error:
Error executing action 0. Original Exception:
Traceback (most recent call last):
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/runners.py", line 201, in _run_task_locally
    action_func(task)
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/document.py", line 290, in create
    sh(pandoc_command.format(intermediate_template, temp_report),log_command=True)()
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/helpers.py", line 89, in actually_sh
    ret = _sh(s, **kwargs)
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/util/__init__.py", line 320, in sh
    raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
anadama2.util.ShellException: [Errno 2] Command `pandoc /home/lk/PRJNA685168/PRJNA685168_vis_no_metadata/tmp3g_k7qej/wmgx_report.md -o /home/lk/PRJNA685168/PRJNA685168_vis_no_metadata/tmp3g_k7qej/wmgx_report..pdf --variable=linkcolor:Blue --variable=toccolor:Blue --pdf-engine=pdflatex --standalone --toc' failed. Out: b'' Err: b"pandoc: unrecognized option `--pdf-engine=pdflatex'\nTry pandoc --help for more information.\n"

Task 6 failed
Name: archive
Original error:
Task failed because parent task `0' failed
Traceback (most recent call last):
  File "/home/lk/.local/bin/wmgx_vis.py", line 133, in <module>
    workflow.go()
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/workflow.py", line 801, in go
    self._handle_finished()
  File "/home/lk/.local/lib/python3.7/site-packages/anadama2/workflow.py", line 833, in _handle_finished
    raise RunFailed()
anadama2.workflow.RunFailed

ValueError: Unable to find binary or script `metaphlan'

Hi all,

I am trying to run the bioBakery workflow and ran into this issue:

(base) (squirrel) [arh0899@qhimem0014 ~]$ biobakery_workflows wmgx --input ~/Saimiri_shotgun --output ~/squirrel/biobakery_workflow
Traceback (most recent call last):
  File "/home/arh0899/squirrel/bin/wmgx.py", line 132, in <module>
    merged_taxonomic_profile, taxonomy_tsv_files, taxonomy_sam_files = shotgun.taxonomic_profile(workflow,
  File "/home/arh0899/squirrel/lib/python3.9/site-packages/biobakery_workflows/tasks/shotgun.py", line 350, in taxonomic_profile
    depends=[depend_fastq,TrackedExecutable("metaphlan")],
  File "/home/arh0899/squirrel/lib/python3.9/site-packages/anadama2/tracked.py", line 199, in __new__
    real_key = cls.key(key)
  File "/home/arh0899/squirrel/lib/python3.9/site-packages/anadama2/tracked.py", line 689, in key
    raise ValueError(
ValueError: Unable to find binary or script `metaphlan'
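
For context, a minimal sketch of the kind of lookup that raises this error (illustrative, not the actual anadama2 code): the workflow registers metaphlan as a tracked executable, and the dependency fails if the program cannot be found on the PATH.

    import shutil

    # shutil.which returns None when "metaphlan" is not on the PATH,
    # which is the situation the ValueError above is reporting.
    if shutil.which("metaphlan") is None:
        raise ValueError("Unable to find binary or script `metaphlan'")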

Any idea what it could be? Thank you!

Error while removing intermediate files after running biobakery 16s pipeline

Error log:

[Errno 1] Command `remove_if_exists.py /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2 --is-folder ; picrust2_pipeline.py -s /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/all_samples_closed_reference_picrust_reformatted_input.fasta -i /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/all_samples_taxonomy_closed_reference_picrust_reformatted_input.tsv -o /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2 -p 8 ' failed. 
  Out: b''
  Err: b'\nError running this command:\nplace_seqs.py --study_fasta /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/all_samples_closed_reference_picrust_reformatted_input.fasta --ref_dir /home/tgb2/picrust2/picrust2/default_files/prokaryotic/pro_ref --out_tree /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/out.tre --processes 8 --intermediate /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs --min_align 0.8 --chunk_size 5000 --placement_tool epa-ng\n\nStandard error of the above failed command:\n\nError running this command:\nepa-ng --tree /home/tgb2/picrust2/picrust2/default_files/prokaryotic/pro_ref/pro_ref.tre --ref-msa /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/ref_seqs_hmmalign.fasta --query /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/study_seqs_hmmalign.fasta --chunk-size 5000 -T 8 -m /home/tgb2/picrust2/picrust2/default_files/prokaryotic/pro_ref/pro_ref.model -w /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/epa_out --filter-acc-lwr 0.99 --filter-max 100\n\nStandard output of the above failed command:\nINFO Selected: Output dir: /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/epa_out/\nINFO Selected: Query file: /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/study_seqs_hmmalign.fasta\nINFO Selected: Tree file: /home/tgb2/picrust2/picrust2/default_files/prokaryotic/pro_ref/pro_ref.tre\nINFO Selected: Reference MSA: /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/ref_seqs_hmmalign.fasta\nINFO Selected: Filtering by accumulated threshold: 0.99\nINFO Selected: Maximum number of placements per query: 100\nINFO Selected: Automatic switching of use of per rate scalers\nINFO Selected: Preserving the root of the input tree\nINFO Selected: Specified model file: /home/tgb2/picrust2/picrust2/default_files/prokaryotic/pro_ref/pro_ref.model\nINFO Selected: Reading queries in chunks of: 5000\nINFO Selected: Using threads: 8\nINFO     ______ ____   ___           _   __ ______\n        / ____// __ \\ /   |         / | / // ____/\n       / __/  / /_/ // /| | ______ /  |/ // / __  \n      / /___ / ____// ___ |/_____// /|  // /_/ /  \n     /_____//_/    /_/  |_|      /_/ |_/ \\____/ (v0.3.8)\nINFO Using model parameters:\nINFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 0.453141 (user),  weights&rates: (0.25,0.0250674) (0.25,0.220229) (0.25,0.782933) (0.25,2.97177) \n        Base frequencies (user): 0.229585 0.22008 0.298596 0.251739 \n        Substitution rates (user): 1.00319 2.79077 1.5301 0.87441 3.83966 1\nINFO Output file: /mnt/D4C4F524C4F5098E/Tanay/tanay/16s_stool_sample/output1/picrust2/intermediate/place_seqs/epa_out/epa_result.jplace\n\n\n'
  
Traceback (most recent call last):
  File "/home/tgb2/anaconda3/envs/biobakery/bin/16s.py", line 184, in <module>
    workflow.go()
  File "/home/tgb2/anaconda3/envs/biobakery/lib/python3.6/site-packages/anadama2/workflow.py", line 801, in go
    self._handle_finished()
  File "/home/tgb2/anaconda3/envs/biobakery/lib/python3.6/site-packages/anadama2/workflow.py", line 833, in _handle_finished
    raise RunFailed()
anadama2.workflow.RunFailed

If you could help me with a possible fix, it would be really helpful!

problems with writing results to a file in rna_dna_norm.py

When I ran bioBakery with the wmgx_wmtx workflow, it crashed when it reached the rna_dna_norm.py task.
You can follow the description of my problem here:
(https://forum.biobakery.org/t/wmgx-wmtx-worfklow-rna-dna-norm-py-failure-why-is-metagenome-needed-in-metatranscriptome-analysis/2856/2)

I will copy the error message I got:

Error executing action 0. Original Exception: 
  Traceback (most recent call last):
    File "~/miniconda3/envs/$ENV_NAME/lib/python3.7/site-packages/anadama2/runners.py", line 201, in _run_task_locally
      action_func(task)
    File "~/miniconda3/envs/$ENV_NAME/lib/python3.7/site-packages/anadama2/helpers.py", line 89, in actually_sh
      ret = _sh(s, **kwargs)
    File "~/miniconda3/envs/$ENV_NAME/lib/python3.7/site-packages/anadama2/util/__init__.py", line 320, in sh
      raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
  anadama2.util.ShellException: [Errno 1] Command `rna_dna_norm.py --input-dna $OUTPUT_PATH/whole_metagenome_shotgun/humann/merged/genefamilies.tsv --input-rna $OUTPUT_PATH/whole_metatranscriptome_shotgun/humann/merged/genefamilies.tsv --output $OUTPUT_PATH/humann/rna_dna_norm/genes --reduce-sample-name' failed. 
  Out: b'Reading RNA table\nReading DNA table\nCompute unstratified features\nNormalize DNA\nNormalize RNA\nCompute stratified features\nNormalize DNA\nNormalize RNA\nCompute only classified features\nNormalize DNA\nNormalize RNA\nWriting unstratified table\n'
  Err: b'Traceback (most recent call last):\n  File "~/miniconda3/envs/$ENV_NAME/bin/rna_dna_norm.py", line 272, in <module>\n    main()\n  File "~/miniconda3/envs/$ENV_NAME/bin/rna_dna_norm.py", line 258, in main\n    output_unstrat_file)\n  File "~/miniconda3/envs/$ENV_NAME/bin/rna_dna_norm.py", line 180, in write_file\n    file_handle.write("\\t".join(column_labels)+"\\n")\nTypeError: a bytes-like object is required, not \'str\'\n'

I think one of the reasons this task crashed is that the results file is opened in 'wb' mode (i.e. binary format) instead of simply 'w'.
I changed the code in rna_dna_norm.py, line 179 from:

with open(file, "wb") as file_handle:

to:

with open(file, "w") as file_handle:

This stopped the crash; however, I don't think it's the last of the problems. As I mentioned in the linked post, I also had problems when I included a mapping file that maps RNA sample IDs to the DNA ones.
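
For reference, a minimal sketch of the text-mode pattern described above (illustrative names, not the actual rna_dna_norm.py source): a file opened in text mode ("w") accepts str objects in Python 3, while binary mode ("wb") requires bytes and produces the TypeError shown in the log.

    # Illustrative sketch only; column_labels and the output path are made up.
    column_labels = ["# gene", "sample1_dna", "sample1_rna"]

    with open("genes_unstratified.tsv", "w") as file_handle:
        # text mode: write() accepts str directly
        file_handle.write("\t".join(column_labels) + "\n")

    with open("genes_unstratified.tsv", "ab") as file_handle:
        # binary mode: the same data must be encoded to bytes first
        file_handle.write(("\t".join(column_labels) + "\n").encode("utf-8"))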

workflow 16s error

Dear maintainers,
I'd like to use the biobakery workflow to perform my analysis on 16s data.
I'm currently using the Docker container (https://hub.docker.com/r/biobakery/workflows).
However, after (1) downloading the gg references,
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
tar -xf gg_13_8_otus.tar.gz
(2) properly setting the environment variables
export GREEN_GENES_DADA2_DB="./Ref_Biobakery_DB/green_genes_dada2_db"

export GREEN_GENES_FASTA_DB="./Ref_Biobakery_DB/green_genes_dada2_db/99_otus.fasta" # extracted from 'gg_13_8_otus/rep_set_aligned'

export GREEN_GENES_TAXONOMY_DB="./Ref_Biobakery_DB/green_genes_dada2_db/99_otu_taxonomy.txt" # extracted from 'gg_13_8_otus/taxonomy'

and, (3) launching the workflow, I got stuck with this error:
Traceback (most recent call last):
  File "/usr/local/bin/16s.py", line 182, in <module>
    workflow.go()
  File "/usr/local/lib/python3.6/dist-packages/anadama2/workflow.py", line 775, in go
    task_idxs = nx.algorithms.dag.topological_sort(self.dag, reverse=True)
TypeError: topological_sort() got an unexpected keyword argument 'reverse'
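
For reference, networkx 2.x removed the reverse keyword from topological_sort and made it return a generator, so older calls like the one above fail with a newer networkx. A minimal sketch of the equivalent call under that assumption (not the workflow's own code):

    import networkx as nx

    g = nx.DiGraph([(1, 2), (2, 3)])
    # networkx 1.x allowed: nx.topological_sort(g, reverse=True)
    # networkx 2.x+: reverse the generator output explicitly.
    order = list(reversed(list(nx.topological_sort(g))))
    print(order)  # [3, 2, 1]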
How can I solve this issue? I expected that everything would work out of the box inside the container.

Thanks in advance,
Mattia

Error installing

I'm installing biobakery workflows in a conda environment

It started well:

conda create --name biobakery
conda activate biobakery
conda install -c biobakery biobakery_workflows
conda install -c biobakery leveldb

but when I tried to install the databases, errors appeared.

biobakery_workflows_databases --install wmgx --location /media/microviable/e/bwfdb/
Installing humann utility mapping database
Download URL: http://huttenhower.sph.harvard.edu/humann_data/full_mapping_v201901b.tar.gz
CRITICAL ERROR: Unable to download and extract from URL: http://huttenhower.sph.harvard.edu/humann_data/full_mapping_v201901b.tar.gz
WARNING: Unable to install database. Error running command: humann_databases --download utility_mapping full /media/microviable/e/bwfdb/humann
Unable to find strainphlan install.

Any idea?

Thank you

Problem with setting environment variable $BIOBAKERY_WORKFLOWS_DATABASES

Hi

I have been trying to install the bioBakery workflows with pip (https://github.com/biobakery/biobakery_workflows/wiki/Installing-the-workflows-with-pip) and then the workflow databases using $ biobakery_workflows_databases --install --wmgx. After solving many errors, I have now hit a wall in the form of "setting environment variable $BIOBAKERY_WORKFLOWS_DATABASES".

As suggested in "Installing the workflows with pip", I have installed the workflows in a static environment: install_workflows. However, due to disk-space restrictions from the university management, the databases need to be installed at a custom location (in my case, /scratch/users/nus/saxena/miniconda3/envs/biobakery/opt/biobakery_workflows_databases). For that, I have set the environment variable by adding
"export BIOBAKERY_WORKFLOWS_DATABASES=/scratch/users/nus/saxena/miniconda3/envs/biobakery/opt/biobakery_workflows_databases" to my .bashrc file. However, I am still getting the error "A custom install location was selected. Please set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the install location."

Requesting your help
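
As a quick sanity check of the export described above (a minimal sketch, assuming the variable must be visible to the process that runs the install), it can be verified from the same shell session:

    import os

    # Prints None if the variable is not exported in the current session,
    # e.g. because .bashrc was not re-sourced or the job runs in a fresh shell.
    print(os.environ.get("BIOBAKERY_WORKFLOWS_DATABASES"))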

Regards
Gourvendu


Humann Failed! Error message : Error executing action 0. Original Exception:

Hi:

I am having problems with this workflow. It could run kneaddata and metaphlan (with the --bypass-functional-profiling --bypass-strain-profiling options), but when I try to run humann or strainphlan, I always receive task failed errors.

My last code was this:

biobakery_workflows wmgx --input /media/microviable/d/DATOSSECUENCIACION/MVT-prj10_rendimiento_deportivo/BIKE/ --output /media/microviable/e/workflows_output --local-jobs 2 --threads 4 --bypass-strain-profiling

And this is the file:
anadama.log

I would appreciate any help.

This is the relevant part:

2022-02-02 22:35:28,286 LoggerReporter log_event INFO: task 135, humann____BIKE011B_lib527520_7698_3 : starting to run
2022-02-02 22:35:28,415 LoggerReporter task_command INFO: Tracked executable version: humann v3.0.0.alpha.4

2022-02-02 22:35:28,415 LoggerReporter task_command INFO: Executing with shell: humann --input /media/microviable/e/workflows_output/kneaddata/main/BIKE011B_lib527520_7698_3.fastq --output /media/microviable/e/workflows_output/humann/main --o-log /media/microviable/e/workflows_output/humann/main/BIKE011B_lib527520_7698_3.log --threads 4 --taxonomic-profile /media/microviable/e/workflows_output/metaphlan/main/BIKE011B_lib527520_7698_3_taxonomic_profile.tsv
2022-02-02 22:35:28,416 LoggerReporter log_event INFO: task 136, humann____BIKE011C_lib527528_7698_2 : starting to run
2022-02-02 22:35:28,536 LoggerReporter task_command INFO: Tracked executable version: humann v3.0.0.alpha.4

2022-02-02 22:35:28,548 LoggerReporter task_command INFO: Executing with shell: humann --input /media/microviable/e/workflows_output/kneaddata/main/BIKE011C_lib527528_7698_2.fastq --output /media/microviable/e/workflows_output/humann/main --o-log /media/microviable/e/workflows_output/humann/main/BIKE011C_lib527528_7698_2.log --threads 4 --taxonomic-profile /media/microviable/e/workflows_output/metaphlan/main/BIKE011C_lib527528_7698_2_taxonomic_profile.tsv
2022-02-03 00:21:34,856 LoggerReporter log_event INFO: task 137, humann____BIKE011D_lib527536_7690_4 : starting to run
2022-02-03 00:21:34,855 LoggerReporter task_failed ERROR: task 135, humann____BIKE011B_lib527520_7698_3 : Failed! Error message : Error executing action 0. Original Exception:
Traceback (most recent call last):
  File "/home/microviable/.local/lib/python3.8/site-packages/anadama2/runners.py", line 201, in _run_task_locally
    action_func(task)
  File "/home/microviable/.local/lib/python3.8/site-packages/anadama2/helpers.py", line 89, in actually_sh
    ret = _sh(s, **kwargs)
  File "/home/microviable/.local/lib/python3.8/site-packages/anadama2/util/__init__.py", line 320, in sh
    raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
anadama2.util.ShellException: [Errno 137] Command `humann --input /media/microviable/e/workflows_output/kneaddata/main/BIKE011B_lib527520_7698_3.fastq --output /media/microviable/e/workflows_output/humann/main --o-log /media/microviable/e/workflows_output/humann/main/BIKE011B_lib527520_7698_3.log --threads 4 --taxonomic-profile /media/microviable/e/workflows_output/metaphlan/main/BIKE011B_lib527520_7698_3_taxonomic_profile.tsv ' failed.
Out: b''
Err: b'Killed\n'
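
For context, exit status 137 corresponds to 128 + 9 (SIGKILL), which together with the "Killed" message usually means the process was terminated externally, often by the Linux out-of-memory killer. A minimal sketch of decoding the status (illustrative only):

    import signal

    status = 137
    # Exit statuses above 128 encode the terminating signal number.
    if status > 128:
        print(signal.Signals(status - 128).name)  # prints "SIGKILL"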

groups: invalid choice: 'uniref90_level4ec' (choose from 'uniref90_rxn', 'uniref50_rxn')

Workflow error when running "biobakery_workflows wmgx". The error message is below. It looks like humann3 should use "uniref90_rxn", but the shotgun.py code uses uniref90_level4ec.

Task 29 failed
Name: humann_renorm_ecs_relab____HD42R4_subsample
Original error:
Task failed because parent task `20' failed
Task 19 failed
Name: humann_regroup_UniRef2EC____HD32R1_subsample
Original error:
Error executing action 0. Original Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/anadama2/runners.py", line 201, in _run_task_locally
    action_func(task)
  File "/usr/local/lib/python3.6/dist-packages/anadama2/helpers.py", line 89, in actually_sh
    ret = _sh(s, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/anadama2/util/__init__.py", line 320, in sh
    raise ShellException(proc.returncode, msg.format(cmd, ret[0], ret[1]))
anadama2.util.ShellException: [Errno 2] Command `humann_regroup_table --input /data/result_wmgs_1/humann/main/HD32R1_subsample_genefamilies.tsv --output /data/result_wmgs_1/humann/regrouped/HD32R1_subsample_ecs.tsv --groups uniref90_level4ec' failed. Out: b'' Err: b"usage: humann_regroup_table [-h] [-i INPUT] [-g {uniref90_rxn,uniref50_rxn}]\n [-c CUSTOM] [-r] [-f {sum,mean}] [-e PRECISION]\n [-u {Y,N}] [-p {Y,N}] [-o OUTPUT]\nhumann_regroup_table: error: argument -g/--groups: invalid choice: 'uniref90_level4ec' (choose from 'uniref90_rxn', 'uniref50_rxn')\n"

workflow issue using conda

Hello, I have been trying to create an environment with biobakery workflows.
When trying to use conda I get the following error:

conda create -n biobakery -c biobakery biobakery_workflows

Traceback (most recent call last):
      File "/home/bench-user/.apps/conda/lib/python3.10/site-packages/conda/gateways/repodata/__init__.py", line 161, in conda_http_errors
        yield
      File "/home/bench-user/.apps/conda/lib/python3.10/site-packages/conda/gateways/repodata/__init__.py", line 127, in repodata
        response.raise_for_status()
      File "/home/bench-user/.apps/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/biobakery/linux-64/current_repodata.json

So I tried installing with pip instead. But then the error is about the database:
Unable to find database KNEADDATA_DB_HUMAN_GENOME

I tried running the following, without success:
KNEADDATA_DB_HUMAN_GENOME=/home/biobakery/biobakery_workflows/kneaddata_db_human_genome

Can I get help getting the pipeline set up, please?

Errors: "biobakery_workflows_databases --install isolate_assembly"

root@ce16e5f0ff77:/# biobakery_workflows_databases --install isolate_assembly
Downloading eggnog mapper databases
Traceback (most recent call last):
  File "/usr/local/bin/biobakery_workflows_databases", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/biobakery_workflows/biobakery_workflows_databases.py", line 238, in main
    run_command(["download_eggnog_data.py","--data_dir",eggnog_install_path,"-y"])
  File "/usr/local/lib/python3.6/dist-packages/biobakery_workflows/biobakery_workflows_databases.py", line 73, in run_command
    id=subprocess.check_call(command, shell=shell)
  File "/usr/lib/python3.6/subprocess.py", line 306, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'download_eggnog_data.py': 'download_eggnog_data.py'
