
Nextflow Pipeline for processing Streptococcus pneumoniae sequencing raw reads (FASTQ files) by the GPS Project (Global Pneumococcal Sequencing Project)

License: GNU General Public License v3.0


GPS Pipeline


The GPS Pipeline is a Nextflow pipeline designed for processing raw reads (FASTQ files) of Streptococcus pneumoniae samples. After preprocessing, the pipeline performs an initial assessment based on the total bases in the reads. Samples that pass are further assessed based on assembly, mapping, and taxonomy. If the sample passes all quality controls (QC), the pipeline also provides the sample's serotype, multi-locus sequence typing (MLST), lineage (based on the Global Pneumococcal Sequence Cluster (GPSC)), and antimicrobial resistance (AMR) against multiple antimicrobials.

The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Additionally, the pipeline only downloads the files essential for the analysis, and no data is uploaded from the local environment, making it an ideal option for cases where the FASTQ files being analysed are confidential. After initialisation or the first successful complete run, the pipeline can be used offline unless you have changed the selection of any database or container image.

The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).

 


Workflow

(Workflow diagram image)

 

Usage

Requirements

Software

  • A container engine is required: Docker or Singularity (see Profile)

Hardware

It is recommended to have at least 16GB of RAM and 50GB of free storage

ℹ️ Details on storage

  • The pipeline core files use ~5MB
  • All default databases use ~8GB in total
  • All Docker images use ~13GB in total; alternatively, Singularity images use ~4.5GB in total
  • The pipeline generates ~1.8GB intermediate files for each sample on average
    (These files can be removed once the pipeline run has completed; please refer to Clean Up)
    (To further reduce the storage requirement at the cost of losing the ability to resume the pipeline, please refer to Experimental)
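
As a quick check before a run, you could confirm the available memory and free storage from a terminal (a minimal sketch for Linux systems; the commands and output format may differ elsewhere):

    # Show total and available RAM
    free -h
    # Show free storage on the file system of the current directory
    df -h .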

Accepted Inputs

  • Only Illumina paired-end short reads are supported
  • Each sample is expected to be a pair of raw reads following this file name pattern:
    • *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
      • example 1: SampleName_R1_001.fastq.gz, SampleName_R2_001.fastq.gz
      • example 2: SampleName_1.fastq.gz, SampleName_2.fastq.gz
      • example 3: SampleName_R1.fq, SampleName_R2.fq
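
If you are unsure whether your files match the accepted pattern, a quick check such as the following may help (a minimal shell sketch, not part of the pipeline; replace the directory path with your own):

    # List files whose names match the accepted read file name pattern
    find /path/to/raw-reads-directory -maxdepth 1 -type f | grep -E '_R?[12](_001)?\.(fq|fastq)(\.gz)?$'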

Setup

  1. Clone the repository (if Git is installed on your system)

    git clone https://github.com/sanger-bentley-group/gps-pipeline.git
    

    or

    Download and unzip/extract the latest release

  2. Go into the local directory of the pipeline; it is ready to use without installation (the directory name might be different)

    cd gps-pipeline
    
  3. (Optional) You could perform an initialisation to download all required additional files and container images, so that the pipeline can subsequently be used at any time, with or without an Internet connection.

    ⚠️ Docker or Singularity must be running, and an Internet connection is required.

    • Using Docker as the container engine
      ./run_pipeline --init
      
    • Using Singularity as the container engine
      ./run_pipeline --init -profile singularity
      

Run

⚠️ Docker or Singularity must be running.

⚠️ If this is the first run and initialisation was not performed, an Internet connection is required.

ℹ️ By default, Docker is used as the container engine and all the processes are executed by the local machine. See Profile for details on running the pipeline with Singularity or on an HPC cluster.

  • You can run the pipeline without options. It will attempt to get the raw reads from the default location (i.e. the input directory inside the gps-pipeline local directory)
    ./run_pipeline
    
  • You can also specify the location of the raw reads by adding the --reads option
    ./run_pipeline --reads /path/to/raw-reads-directory
    
  • For a test run, you could obtain a small test dataset by running the included download_test_input script. The dataset will be saved to the test_input directory inside the pipeline local directory. You can then run the pipeline on the test data
    ./download_test_input
    ./run_pipeline --reads test_input
    
    • 9870_5#52 will fail Taxonomy QC, and hence Overall QC, so it will have no analysis results
    • 17175_7#59 and 21127_1#156 should pass Overall QC, and will therefore have analysis results

Profile

  • By default, Docker is used as the container engine and all the processes are executed by the local machine. To change this, you could use Nextflow's built-in -profile option to switch to other available profiles

    ℹ️ -profile is a built-in Nextflow option; it takes only one leading -

    ./run_pipeline -profile [profile name]
    
  • Available profiles:
    • standard (default)
      Docker is used as the container engine.
      Processes are executed locally.
    • singularity
      Singularity is used as the container engine.
      Processes are executed locally.
    • lsf
      The pipeline should be launched from an LSF cluster head node with this profile.
      Singularity is used as the container engine.
      Processes are submitted to your LSF cluster via bsub by the pipeline.
      (Tested on the Wellcome Sanger Institute farm5 LSF cluster only.)
      (The default of option --kraken2_memory_mapping changes to false.)
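
For example, to run the pipeline on an LSF cluster using the lsf profile (an illustrative combination of options described in this document; replace the path with your own):

    ./run_pipeline --reads /path/to/raw-reads-directory -profile lsf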

Resume

  • If the pipeline is interrupted mid-run, Nextflow's built-in -resume option can be used to resume the pipeline execution instead of starting from scratch again
  • You should use the same command as the original run, only adding -resume at the end (i.e. all pipeline options should be identical)

    ℹ️ -resume is a built-in Nextflow option; it takes only one leading -

    • If the original command is
      ./run_pipeline --reads /path/to/raw-reads-directory
      
    • The command to resume the pipeline execution should be
      ./run_pipeline --reads /path/to/raw-reads-directory -resume
      

Clean Up

  • During a pipeline run, Nextflow generates a considerable number of intermediate files
  • If the run has completed and you do not intend to use the -resume option or those intermediate files, you can remove them in one of the following ways:
    • Run the included clean_pipeline script
      • It runs the commands listed under Manual removal for you
      • It removes the work directory and log files within the gps-pipeline local directory
      ./clean_pipeline
      
    • Manual removal
      • Remove the work directory and log files within the gps-pipeline local directory
      rm -rf work
      rm -rf .nextflow.log*
      
    • Run nextflow clean command
      • This built-in command cleans up cache and work directories
      • By default, it only cleans up the latest run
      • For details and available options of nextflow clean, refer to the Nextflow documentation
      ./nextflow clean
      

Nextflow Tower (Optional)

The pipeline is compatible with the Launchpad of Nextflow Tower and the Nextflow -with-tower option. For more information, please refer to the Nextflow Tower documentation.

 

Pipeline Options

  • The sections below list the available options that can be used when you run the pipeline
  • Usage:
    ./run_pipeline [option] [value]
    

ℹ️ To permanently change the value of an option, edit the nextflow.config file inside the gps-pipeline local directory.

ℹ️ $projectDir is a built-in Nextflow implicit variable; it is defined as the local directory of gps-pipeline.

ℹ️ Pipeline options are not built-in Nextflow options; they are prefixed with -- instead of -.
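
For example, built-in Nextflow options and pipeline options can be combined in a single command (an illustrative command using only options documented in this section; replace the paths with your own):

    ./run_pipeline --reads /path/to/raw-reads-directory --output /path/to/output-directory -profile singularity -resume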

Alternative Workflows

  • --init
    Values: true or false (Default: false)
    Use the alternative workflow for initialisation, which means downloading all required additional files and container images, and creating databases.
    Can be enabled by including --init without a value.
  • --version
    Values: true or false (Default: false)
    Use the alternative workflow for showing versions of the pipeline, container images, tools and databases.
    Can be enabled by including --version without a value.
    (This workflow pulls the required container images if they are not yet available locally.)
  • --help
    Values: true or false (Default: false)
    Show the help message.
    Can be enabled by including --help without a value.
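
For example, to show the versions of the pipeline, container images, tools and databases without processing any reads:

    ./run_pipeline --version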

Input and Output

⚠️ --output overwrites existing results in the target directory if there are any

⚠️ --db does not accept user-provided local databases; the directory content will be overwritten

  • --reads
    Values: any valid path (Default: "$projectDir/input")
    Path to the input directory that contains the reads to be processed.
  • --output
    Values: any valid path (Default: "$projectDir/output")
    Path to the output directory where the results are saved.
  • --db
    Values: any valid path (Default: "$projectDir/databases")
    Path to the directory where the databases used by the pipeline are saved.
  • --assembly_publish
    Values: "link", "symlink" or "copy" (Default: "link")
    Method used by Nextflow to publish the generated assemblies.
    (The default setting "link" means hard link, and will therefore fail if the output directory is outside of the working file system.)

QC Parameters

ℹ️ Read QC does not have directly accessible parameters. The minimum base count in reads for Read QC is the product of --length_low and --depth of Assembly QC (i.e. the default is 1,900,000 × 20 = 38,000,000).

  • --spneumo_percentage
    Values: any integer or float value (Default: 60.00)
    Minimum S. pneumoniae percentage in reads to pass Taxonomy QC.
  • --non_strep_percentage
    Values: any integer or float value (Default: 2.00)
    Maximum non-Streptococcus genus percentage in reads to pass Taxonomy QC.
  • --ref_coverage
    Values: any integer or float value (Default: 60.00)
    Minimum reference coverage percentage by the reads to pass Mapping QC.
  • --het_snp_site
    Values: any integer value (Default: 220)
    Maximum non-cluster heterozygous SNP (Het-SNP) site count to pass Mapping QC.
  • --contigs
    Values: any integer value (Default: 500)
    Maximum contig count in assembly to pass Assembly QC.
  • --length_low
    Values: any integer value (Default: 1900000)
    Minimum assembly length to pass Assembly QC.
  • --length_high
    Values: any integer value (Default: 2300000)
    Maximum assembly length to pass Assembly QC.
  • --depth
    Values: any integer or float value (Default: 20.00)
    Minimum sequencing depth to pass Assembly QC.
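
As an illustration (hypothetical values; both options are documented above), raising the assembly thresholds also raises the Read QC minimum base count, since that minimum is the product of --length_low and --depth:

    # With these values, Read QC would require at least 2000000 x 30 = 60,000,000 bases
    ./run_pipeline --length_low 2000000 --depth 30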

Assembly

ℹ️ The output of a SPAdes-based assembler is deterministic for a given thread count. Hence, setting --assembler_thread to a specific value guarantees that the generated assemblies are reproducible for anyone using the same value.

  • --assembler
    Values: "shovill" or "unicycler" (Default: "shovill")
    Which SPAdes-based assembler to use for assembling the reads.
  • --assembler_thread
    Values: any integer value (Default: 0)
    Number of threads used by the assembler; 0 means all available.
  • --min_contig_length
    Values: any integer value (Default: 500)
    Minimum length of a contig to be included in the assembly.
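
For example, to make the generated assemblies reproducible across machines as described above, you could pin the assembler thread count to a fixed value (4 is only an illustrative choice):

    ./run_pipeline --assembler shovill --assembler_thread 4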

Mapping

  • --ref_genome
    Values: any valid path to a .fa or .fasta file (Default: "$projectDir/data/ATCC_700669_v1.fa")
    Path to the reference genome for mapping.

Taxonomy

  • --kraken2_db_remote
    Values: any valid URL to a Kraken2 database in .tar.gz or .tgz format (Default: Minikraken v1)
    URL to a Kraken2 database.
  • --kraken2_memory_mapping
    Values: true or false (Default: true)
    Whether to use the memory mapping option of Kraken2.
    true means the database is not loaded into RAM, which is suitable for memory-limited environments or those with fast storage.
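
For example, on a machine with ample RAM and slower storage, you might disable memory mapping so that the Kraken2 database is loaded into RAM (an illustrative choice; the default is true):

    ./run_pipeline --kraken2_memory_mapping false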

Serotype

  • --seroba_db_remote
    Values: any valid URL to a SeroBA release in .tar.gz or .tgz format (Default: SeroBA v1.0.7)
    URL to a SeroBA release.
  • --seroba_kmer
    Values: any integer value (Default: 71)
    Kmer size for creating the KMC database of SeroBA.

Lineage

  • --poppunk_db_remote
    Values: any valid URL to a PopPUNK database in .tar.gz or .tgz format (Default: GPS v8 - Reference Only)
    URL to a PopPUNK database.
  • --poppunk_ext_remote
    Values: any valid URL to a PopPUNK external clusters file in .csv format (Default: GPS v8 GPSC Designation)
    URL to a PopPUNK external clusters file.

Other AMR

  • --ariba_ref
    Values: any valid path to a .fa or .fasta file (Default: "$projectDir/data/ariba_ref_sequences.fasta")
    Path to the reference sequences for preparing the ARIBA database.
  • --ariba_metadata
    Values: any valid path to a .tsv file (Default: "$projectDir/data/ariba_metadata.tsv")
    Path to the metadata file for preparing the ARIBA database.
  • --resistance_to_mic
    Values: any valid path to a .tsv file (Default: "$projectDir/data/resistance_to_MIC.tsv")
    Path to the resistance phenotype to MIC (minimum inhibitory concentration) lookup table.

Singularity

ℹ️ This section is only valid when Singularity is used as the container engine

  • --singularity_cachedir
    Values: any valid path (Default: "$projectDir/singularity_cache")
    Path to the directory where Singularity images should be saved.
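
For example, to share one Singularity image cache between several copies of the pipeline (the path below is purely illustrative):

    ./run_pipeline -profile singularity --singularity_cachedir /path/to/shared/singularity_cache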

Experimental

  • --lite
    Values: true or false (Default: false)
    ⚠️ Enabling this option breaks the Nextflow resume function.
    Reduces the storage requirement by removing intermediate .sam and .bam files once they are no longer needed while the pipeline is still running.
    The amount of storage saved cannot be guaranteed.
    Can be enabled by including --lite without a value.

Output

  • By default, the pipeline outputs the results into the output directory inside the gps-pipeline local directory
  • It can be changed by adding the option --output
    ./run_pipeline --output /path/to/output-directory
    

Output Content

  • The following directories and files are output into the output directory:
    • assemblies - This directory contains all assemblies (.fasta) generated by the pipeline
    • results.csv - This file contains all the information generated by the pipeline on each sample
    • info.txt - This file contains information regarding the pipeline and the parameters of the run

Details of results.csv

  • The following fields can be found in the output results.csv

    ℹ️ The output fields of the Other AMR / Virulence types depend on the provided ARIBA reference sequences, metadata file and resistance phenotype to MIC lookup table; the table below is based on the defaults.

    ℹ️ The inferred Minimum Inhibitory Concentration (MIC) range of an antimicrobial in "Other AMR" type is only provided if it is included in the resistance phenotypes to MIC lookup table. The default lookup table is based on 2014 CLSI guidelines.

    ℹ️ For resistance phenotypes: S = Sensitive/Susceptible; I = Intermediate; R = Resistant

    ℹ️ For virulence genes: POS = Positive; NEG = Negative

    ⚠️ If the result of Overall_QC of a sample is READ_ONE_CORRUPTED, READ_TWO_CORRUPTED or both, the corresponding read file was found to be corrupted (e.g. an incomplete/damaged Gzip file, or mismatches between read length and quality-score length). You might want to reacquire the read file from its source, or discard the sample if the source file is corrupted as well.

    ⚠️ If the result of Overall_QC of a sample is ASSEMBLER FAILURE, the assembler crashed when trying to assemble the reads. You might want to re-run the sample with another assembler, or discard the sample if it is of low quality.

    ⚠️ If the result of Serotype of a sample is SEROBA FAILURE, SeroBA has crashed when trying to serotype the sample.

    Field Type Description
    Sample_ID Identification Sample ID based on the raw reads file name
    Read_QC QC Read quality control result
    Assembly_QC QC Assembly quality control result
    Mapping_QC QC Mapping quality control result
    Taxonomy_QC QC Taxonomy quality control result
    Overall_QC QC Overall quality control result
    (Based on Assembly_QC, Mapping_QC and Taxonomy_QC)
    Bases Read Number of bases in the reads
    (Default: ≥ 38 Mb to pass Read QC)
    Contigs# Assembly Number of contigs in the assembly
    (Default: ≤ 500 to pass Assembly QC)
    Assembly_Length Assembly Total length of the assembly
    (Default: 1.9 - 2.3 Mb to pass Assembly QC)
    Seq_Depth Assembly Sequencing depth of the assembly
    (Default: ≥ 20x to pass Assembly QC)
    Ref_Cov_% Mapping Percentage of reference covered by reads
    (Default: ≥ 60% to pass Mapping QC)
    Het-SNP# Mapping Non-cluster heterozygous SNP (Het-SNP) site count
    (Default: ≤ 220 to pass Mapping QC)
    S.Pneumo_% Taxonomy Percentage of reads assigned to Streptococcus pneumoniae
    (Default: ≥ 60% to pass Taxonomy QC)
    Top_Non-Strep_Genus Taxonomy The most abundant non-Streptococcus genus in reads
    Top_Non-Strep_Genus_% Taxonomy Percentage of reads assigned to the most abundant non-Streptococcus genus
    (Default: ≤ 2% to pass Taxonomy QC)
    GPSC Lineage GPSC Lineage
    Serotype Serotype Serotype
    ST MLST Sequence Type (ST)
    aroE MLST Allele ID of aroE
    gdh MLST Allele ID of gdh
    gki MLST Allele ID of gki
    recP MLST Allele ID of recP
    spi MLST Allele ID of spi
    xpt MLST Allele ID of xpt
    ddl MLST Allele ID of ddl
    pbp1a PBP AMR Allele ID of pbp1a
    pbp2b PBP AMR Allele ID of pbp2b
    pbp2x PBP AMR Allele ID of pbp2x
    AMO_MIC PBP AMR Estimated minimum inhibitory concentration (MIC) of amoxicillin (AMO)
    AMO_Res PBP AMR Inferred resistance phenotype against AMO
    CFT_MIC PBP AMR Estimated MIC of ceftriaxone (CFT)
    CFT_Res(Meningital) PBP AMR Inferred resistance phenotype against CFT in meningital form
    CFT_Res(Non-meningital) PBP AMR Inferred resistance phenotype against CFT in non-meningital form
    TAX_MIC PBP AMR Estimated MIC of cefotaxime (TAX)
    TAX_Res(Meningital) PBP AMR Inferred resistance phenotype against TAX in meningital form
    TAX_Res(Non-meningital) PBP AMR Inferred resistance phenotype against TAX in non-meningital form
    CFX_MIC PBP AMR Estimated MIC of cefuroxime (CFX)
    CFX_Res PBP AMR Inferred resistance phenotype against CFX
    MER_MIC PBP AMR Estimated MIC of meropenem (MER)
    MER_Res PBP AMR Inferred resistance phenotype against MER
    PEN_MIC PBP AMR Estimated MIC of penicillin (PEN)
    PEN_Res(Meningital) PBP AMR Inferred resistance phenotype against PEN in meningital form
    PEN_Res(Non-meningital) PBP AMR Inferred resistance phenotype against PEN in non-meningital form
    CHL_MIC Other AMR Inferred MIC of Chloramphenicol (CHL)
    CHL_Res Other AMR Estimated resistance phenotype against CHL
    CHL_Determinant Other AMR Known determinants that estimated the CHL resistance phenotype
    CLI_MIC Other AMR Inferred MIC of Clindamycin (CLI)
    CLI_Res Other AMR Estimated resistance phenotype against CLI
    CLI_Determinant Other AMR Known determinants that estimated the CLI resistance phenotype
    COT_MIC Other AMR Inferred MIC of Co-Trimoxazole (COT)
    COT_Res Other AMR Estimated resistance phenotype against COT
    COT_Determinant Other AMR Known determinants that estimated the COT resistance phenotype
    DOX_MIC Other AMR Inferred MIC of Doxycycline (DOX)
    DOX_Res Other AMR Estimated resistance phenotype against DOX
    DOX_Determinant Other AMR Known determinants that estimated the DOX resistance phenotype
    ERY_MIC Other AMR Inferred MIC of Erythromycin (ERY)
    ERY_Res Other AMR Estimated resistance phenotype against ERY
    ERY_Determinant Other AMR Known determinants that estimated the ERY resistance phenotype
    ERY_CLI_Res Other AMR Estimated resistance phenotype against Erythromycin (ERY) and Clindamycin (CLI)
    ERY_CLI_Determinant Other AMR Known determinants that estimated the ERY and CLI resistance phenotype
    FQ_Res Other AMR Estimated resistance phenotype against Fluoroquinolones (FQ)
    FQ_Determinant Other AMR Known determinants that estimated the FQ resistance phenotype
    KAN_Res Other AMR Estimated resistance phenotype against Kanamycin (KAN)
    KAN_Determinant Other AMR Known determinants that estimated the KAN resistance phenotype
    LFX_MIC Other AMR Inferred MIC of Levofloxacin (LFX)
    LFX_Res Other AMR Estimated resistance phenotype against LFX
    LFX_Determinant Other AMR Known determinants that estimated the LFX resistance phenotype
    RIF_MIC Other AMR Inferred MIC of Rifampin (RIF)
    RIF_Res Other AMR Estimated resistance phenotype against RIF
    RIF_Determinant Other AMR Known determinants that estimated the RIF resistance phenotype
    SMX_Res Other AMR Estimated resistance phenotype against Sulfamethoxazole (SMX)
    SMX_Determinant Other AMR Known determinants that estimated the SMX resistance phenotype
    TET_MIC Other AMR Inferred MIC of Tetracycline (TET)
    TET_Res Other AMR Estimated resistance phenotype against TET
    TET_Determinant Other AMR Known determinants that estimated the TET resistance phenotype
    TMP_Res Other AMR Estimated resistance phenotype against Trimethoprim (TMP)
    TMP_Determinant Other AMR Known determinants that estimated the TMP resistance phenotype
    VAN_MIC Other AMR Inferred MIC of Vancomycin (VAN)
    VAN_Res Other AMR Estimated resistance phenotype against VAN
    VAN_Determinant Other AMR Known determinants that estimated the VAN resistance phenotype
    PILI1 Virulence Expression of PILI-1
    PILI1_Determinant Virulence Known determinants that estimated the PILI-1 expression
    PILI2 Virulence Expression of PILI-2
    PILI2_Determinant Virulence Known determinants that estimated the PILI-2 expression
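
After a run, the per-sample QC outcomes can be summarised directly from results.csv. The sketch below is an illustration rather than part of the pipeline; it assumes the default output location, an unquoted header row, and no embedded commas in the columns preceding Overall_QC:

    # Print Sample_ID and Overall_QC for every sample in results.csv
    awk -F',' 'NR==1 { for (i=1; i<=NF; i++) { if ($i=="Sample_ID") s=i; if ($i=="Overall_QC") q=i } } NR>1 { print $s, $q }' output/results.csv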

 

Credits

This project uses open-source components. You can find the homepage or source code of each of these open-source projects, along with their license information, below. I acknowledge and am grateful to these developers for their contributions to open source.

ARIBA

BCFtools and SAMtools

BWA

Docker Images of ARIBA, BCFtools, BWA, fastp, Kraken 2, mlst, PopPUNK, QUAST, SAMtools, Shovill, Unicycler

Docker Image of network-multitool

Docker Image of Pandas

fastp

GPSC_pipeline_nf

Kraken 2

mecA-HetSites-calculator

mlst

Nextflow

PopPUNK

QUAST

SeroBA

  • SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data. Epping L, van Tonder, AJ, Gladstone RA, GPS Consortium, Bentley SD, Page AJ, Keane JA, Microbial Genomics 2018, doi: 10.1099/mgen.0.000186
  • License (GPL-3.0): https://github.com/sanger-pathogens/seroba/blob/master/LICENSE
  • This project uses a Docker image of a fork
    • The fork provides SeroBA with the latest updates as the original repository is no longer maintained
    • The Docker image provides the containerised environment with SeroBA for the GET_SEROBA_DB and SEROTYPE processes of the serotype.nf module

resistanceDatabase

Shovill

SPN-PBP-AMR

Unicycler

  • Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017.
  • License (GPL-3.0): https://github.com/rrwick/Unicycler/blob/main/LICENSE
  • This tool is used in the ASSEMBLY_UNICYCLER process of the assembly.nf module

gps-pipeline's People

Contributors

harryhung


gps-pipeline's Issues

Remove Nextflow workaround for string variables containing equal sign `=`

Currently, Nextflow cannot output string variables containing an equal sign (=).

Reported at nextflow-io/nextflow#3553; a fix is available at nextflow-io/nextflow@40d5673 and will be released as part of 23.04.0 (ETA April 2023).

Remove the workarounds from the sections below when the included Nextflow is updated to 23.04.0:
https://github.com/HarryHung/gps-unified-pipeline/blob/faf1e5c0d9a9a941a8e5ba73bb024357dffae254/main.nf#L148
https://github.com/HarryHung/gps-unified-pipeline/blob/faf1e5c0d9a9a941a8e5ba73bb024357dffae254/modules/amr.nf#L16-L47

Remove Nextflow workaround for Docker CPU Limit

Currently, when using Docker as the container engine, if --cpus is not specified in runOptions and there is no cpus directive in the process, the underlying docker run launched by Nextflow uses --cpus=1 by default.

Queried at nextflow-io/nextflow#3808; a fix is available at nextflow-io/nextflow@b38c388 and will be released as part of 23.04.0 (ETA April 2023).

Remove the workarounds from the sections below when the included Nextflow is updated to 23.04.0:
https://github.com/HarryHung/gps-unified-pipeline/blob/121cb37f68040381f6e67bef491999f0c6e8d63f/nextflow.config#L110-L112

Cannot download SeroBA database

When I ran "./run_pipeline --init", the following error message was displayed. What should I do?


ERROR ~ Error executing process > 'INIT:GET_SEROBA_DB'

Caused by:
Process INIT:GET_SEROBA_DB terminated with an error exit status (4)

Command executed:

DB_REMOTE="https://github.com/sanger-bentley-group/seroba/archive/refs/tags/v1.0.7.tar.gz"
DB_LOCAL="databases/seroba"
KMER="71"
JSON_FILE="done_seroba.json"

source check-create_seroba_db.sh

Command exit status:
4

Command output:
(empty)

Command error:
Retrying.

--2024-03-29 06:58:21-- (try: 9) https://github.com/sanger-bentley-group/seroba/archive/refs/tags/v1.0.7.tar.gz
Connecting to github.com (github.com)|20.27.177.113|:443... failed: Connection timed out.
Retrying.

////

Work dir:
Path_to/gps-pipeline/work/f9/9a442f870c02dd24788dc35af570b4

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

headers missing

We need to insert headers in the output file when creating the final version

ERROR ~ Error executing process > 'PIPELINE:SEROTYPE (NP-0087-IDRL-AKU_S92_trimmed)'

Hello Harry, while executing the gps-unified-pipeline on a batch of 12 samples (FASTQ sequences), I encountered the following error on one of my samples:

ERROR ~ Error executing process > 'PIPELINE:SEROTYPE (NP-0087-IDRL-AKU_S92_trimmed)'

Caused by:
  Process `PIPELINE:SEROTYPE (NP-0087-IDRL-AKU_S92_trimmed)` terminated with an error exit status (1)

Command executed:

  SEROBA_DIR="seroba"
  DATABASE="database"
  READ1="processed-NP-0087-IDRL-AKU_S92_trimmed_1.fastq.gz"
  READ2="processed-NP-0087-IDRL-AKU_S92_trimmed_2.fastq.gz"
  SAMPLE_ID="NP-0087-IDRL-AKU_S92_trimmed"
  
  source get_serotype.sh

Command exit status:
  1

Command output:
  cluster detected 1 threads available to it
  cluster reported completion
  cluster_3 detected 1 threads available to it
  cluster_3 reported completion
  cluster_4 detected 1 threads available to it
  cluster_4 reported completion
  cluster_6 detected 1 threads available to it
  cluster_6 reported completion
  
  0.15055493003056136
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/Swiss_NT/Swiss_NT intersect temp.kmco93qiglx/inter
  0.0
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/43/43 intersect temp.kmco93qiglx/inter
  0.01530841963079694
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/39X/39X intersect temp.kmco93qiglx/inter
  0.03521542396222703
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/25A/25A intersect temp.kmco93qiglx/inter
  0.023278764658075005
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/17A/17A intersect temp.kmco93qiglx/inter
  0.02832162954006418
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/47F/47F intersect temp.kmco93qiglx/inter
  0.0592279596075342
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/10A/10A intersect temp.kmco93qiglx/inter
  0.018366189193022266
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/28F/28F intersect temp.kmco93qiglx/inter
  0.12212593297600562
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/37/37 intersect temp.kmco93qiglx/inter
  0.0
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/11C/11C intersect temp.kmco93qiglx/inter
  0.0007471607890017932
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/42/42 intersect temp.kmco93qiglx/inter
  0.035696073431922486
  15C
  {'15A': 0, '15B': 0, '15C': 0, '15F': 16}
  15A
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15B
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15C
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15C
  15F
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  {'15A': -1, '15B': 0, '15C': -2.5, '15F': 15}
  {'15A': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}, '15B': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}, '15C': {'genes': [], 'pseudo': ['wciZ'], 'allele': [], 'snps': []}, '15F': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}}
  ['15A', '15C']
  15A
  15B/15C
  15C
  None

Command error:
  0.01530841963079694
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/39X/39X intersect temp.kmco93qiglx/inter
  0.03521542396222703
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/25A/25A intersect temp.kmco93qiglx/inter
  0.023278764658075005
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/17A/17A intersect temp.kmco93qiglx/inter
  0.02832162954006418
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/47F/47F intersect temp.kmco93qiglx/inter
  0.0592279596075342
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/10A/10A intersect temp.kmco93qiglx/inter
  0.018366189193022266
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/28F/28F intersect temp.kmco93qiglx/inter
  0.12212593297600562
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/37/37 intersect temp.kmco93qiglx/inter
  0.0
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/11C/11C intersect temp.kmco93qiglx/inter
  0.0007471607890017932
  /seroba-1.0.2/build/kmc_tools simple temp.kmco93qiglx/NP-0087-IDRL-AKU_S92_trimmed seroba/database/kmer_db/42/42 intersect temp.kmco93qiglx/inter
  0.035696073431922486
  15C
  {'15A': 0, '15B': 0, '15C': 0, '15F': 16}
  15A
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15B
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15C
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  15C
  15F
  {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}
  {'15A': -1, '15B': 0, '15C': -2.5, '15F': 15}
  {'15A': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}, '15B': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}, '15C': {'genes': [], 'pseudo': ['wciZ'], 'allele': [], 'snps': []}, '15F': {'genes': [], 'pseudo': [], 'allele': [], 'snps': []}}
  ['15A', '15C']
  15A
  15B/15C
  15C
  None
  Traceback (most recent call last):
    File "/usr/local/bin/ser
![image](https://github.com/HarryHung/gps-unified-pipeline/assets/87472403/973c6172-850f-47d4-badb-061373d7c6df)
oba", line 4, in <module>
      __import__('pkg_resources').run_script('seroba==1.0.2', 'seroba')
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
      self.require(requires)[0].run_script(script_name, ns)
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1445, in run_script
      exec(script_code, namespace, namespace)
    File "/usr/local/lib/python3.6/dist-packages/seroba-1.0.2-py3.6.egg/EGG-INFO/scripts/seroba", line 86, in <module>
    File "/usr/local/lib/python3.6/dist-packages/seroba-1.0.2-py3.6.egg/seroba/tasks/sero_run.py", line 19, in run
    File "/usr/local/lib/python3.6/dist-packages/seroba-1.0.2-py3.6.egg/seroba/serotyping.py", line 481, in run
    File "/usr/local/lib/python3.6/dist-packages/seroba-1.0.2-py3.6.egg/seroba/serotyping.py", line 453, in _prediction
    File "/usr/local/lib/python3.6/dist-packages/seroba-1.0.2-py3.6.egg/seroba/serotyping.py", line 397, in _find_serotype
  TypeError: argument of type 'NoneType' is not iterable

Work dir:
  /home/samiah/Desktop/Desktop/AKU_System/gps-unified-pipeline/work/18/310a53cf47065e514606a58c8a7669

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

The error shows that SeroBA failed to assign a serotype to this particular sample, but I am unable to understand the reason. Moreover, the pipeline halted without creating the final result.tsv file. Is there any way to get the AMR, MLST and other analyses that executed successfully?

suggestions to singularity preflight

singularityPreflight(workflow.configFiles[0], params.singularity_cachedir)

@HarryHung thanks very much for sharing your code. A good idea to predownload images prior to running workflows.

Here, it seems that you're searching the first config file for a 'container' flag. However, if someone else were to adopt this and run with multiple config files, this would be quite unstable.

Nextflow seems to have the workflow.container variable that would be useful in this case?

Better still, I think the preflight code deserves its own git repo, so you can use it in multiple pipelines with something like a git submodule. Would this be something you're interested in? Let's discuss this further?

process LINEAGE fatal error when not all CPUs set to available to Docker

Issue:

On macOS with Docker Desktop, when Resources -> Advanced -> CPUs is not set to the maximum available, executing the LINEAGE process results in a fatal error.

For example

Error executing process > 'LINEAGE (1)'

Caused by:
  Process `LINEAGE (1)` terminated with an error exit status (125)

Command executed:

  sed 's/^/prefix_/' qfile.txt > safe_qfile.txt
  poppunk_assign --db GPS_v6 --external-clustering GPS_v6_external_clusters.csv --query safe_qfile.txt --output output --threads 8
  sed 's/^prefix_//' output/output_external_clusters.csv > result.csv

Command exit status:
  125

Command output:
  (empty)

Command error:
  docker: Error response from daemon: Range of CPUs is from 0.01 to 6.00, as there are only 6 CPUs available.
  See 'docker run --help'.

Work dir:
  /Users/user/local-repo/gps-unified-pipeline/work/8d/7645f6bc5ce546d4c07e369a4074d9

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Potential cause:
The pipeline tries to use all CPUs on the host, instead of only the CPUs available to the container:
https://github.com/HarryHung/gps-unified-pipeline/blob/e3e5afde30492aea44d9d32ae69a3e335d5ba15b/nextflow.config#L91
