fmalmeida / mpgap

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads

Home Page: https://mpgap.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Nextflow 94.00% Dockerfile 6.00%

hybrid-assemblies illumina pipeline genome-assembly polish pacbio nanopore unicycler spades flye

mpgap's Introduction

Hello 😁 👋

Hello there, my name is Felipe Almeida. I'm a Brazilian scientist, bioinformatician, pipeline developer and problem solver. My main interests are bioinformatics, genomic surveillance, precision medicine, and microbial genomics. You can also find me on Twitter (@fmarquesalmeida), Stack Overflow and LinkedIn.

Academic info

I'm a PhD student at the University of Brasilia, in the CompGen (Computational Genomics) laboratory, under the academic guidance of Prof. Georgios J. Pappas Jr., PhD.

mpgap's People

Forkers

pythseq vikash84 jennomics mxrcon bennuru edwardbirdlab fredrickkebaso adamtaranto santosrac

mpgap's Issues

Add an option for multiple samples

Add an option to facilitate and organize the execution of the pipeline for multiple samples. Maybe create something using a YAML syntax as it is done in bacannot.

add automatic samplesheet for bacannot

Include the automatic generation of a samplesheet that can be readily used for bacannot.
The open question is which genomes to include: all generated genomes, only the polished ones, or only the final ones?

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page that helps users understand the output structure and points to the correct tool-specific links, as is done in the bacannot documentation page. That page explains how to interpret the generated results, including the directory structure and the relevant links to tool-specific reference material.

add trycycler

Add Trycycler as an optional step to generate a consensus assembly from the outputs of the long-read assemblers.

add polypolish tool

Pilon is the tool currently used for polishing long-read assemblies in the pipeline.

It would be nice to also add Polypolish as a second short-read polisher for long-read assemblies, alongside Pilon.

By default, the pipeline would polish long-read assemblies with both, but users could choose to skip either one.

Update CLI help message

Some of the workflow parameters are explained in the online documentation (readthedocs) but not in the command line help. This should be fixed for at least:

- nf tower options
- parallel jobs options

add a third hybrid strategy

Add another hybrid strategy for samples where this might be the best option.

This strategy is to perform a short reads assembly and then scaffold with long reads.

Incomplete pipeline and different errors when using nanopore reads files with different sizes (900 mb vs 11Gb)

Describe the bug
I encountered an issue while running the pipeline with two barcoded genome samples (Barcode04 and Barcode06). These samples produced exceptionally large output files: Barcode04 resulted in an 11GB file, while Barcode06 generated a massive 980GB file. Both runs also exhibited different errors. It's worth noting that the reference genome size for these samples, which are from hummingbirds, is approximately 1.5GB. The sequencing was performed using the Nanopore Promethion 10.4.1 platform, and basecalling was done with the Super Accurate algorithm.

To Reproduce
Steps to reproduce the behavior:
Run the following command line with the files in the respective folders
nextflow run fmalmeida/mpgap --output output_barcode04_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode04.yml" -profile docker

nextflow run fmalmeida/mpgap --output output_barcode06_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode06.yml" -profile docker

Expected behavior
Output folders with the results of the pipeline

Archive.zip

mkdir: cannot create directory

Hi, thanks again for the help. I'm slowly working through the pipeline on my HPC. It seems to be only a matter of specifying environment variables to make sure the root directory and /tmp/ are not filled, and then allocating enough resources (CPUs and memory).
So at last, it's running the Pilon processes, but I've encountered an error I wanted to ask you about. Maybe it is related to the pipeline not resuming properly even when -resume is provided to the nextflow command?

Could this be related to the pipeline and not my system? Checking the logs, I see the same error for other folders, such as "wtdbg2" instead of "flye". I guess these correspond to the different Pilon runs on the different assemblies? There may be some mkdir or mv commands that should be forced to replace existing files when resuming runs? Thanks!

quast generating empty files

I've consistently observed that the quast step fails. Although the pipeline points to the work directory so the files can be checked, these appear to be empty.
This is the folder:

I attach the logs, and the files that were not 0-sized.
I will let you know if it keeps failing during my tests, and whether the fix you provided in the config file bypasses the error and avoids the pipeline crash. Thank you so much.
nextflow.log.txt
output.log.txt
quast_files.zip

Requesting support with error "Explicit 'name separator' in class"

Hi,

Thanks for your kind support in the past. I've been using mpgap routinely in our projects, but now I've been struggling for a while with a new installation... maybe trivial but I'm stuck and wondered if you could comment on it.
I'm using singularity. Everything seems to be working, and the pipeline starts, but then ends with the error:

Explicit 'name separator' in class near index 8
[dataset/pacbio.fastq]
^
-- Check script '/.nextflow/assets/fmalmeida/mpgap/./workflows/parse_samples.nf' at line: 66 or see '.nextflow.log' file for more details

I believe I've used the same syntax as always, and the one suggested in the manual in the yml samplesheet:
samplesheet:
  - id: Sol_test_1
    pacbio:
      - 'dataset/pacbio.fastq'
    genome_size: 39.11m
    wtdbg2_technology: rs
    corrected_long_reads: false

I'm attaching the nextflow.log. It may be something trivial and Nextflow-related, but could you please comment? I've unsuccessfully tried using different quotes in the yml file, and even placing the file in the same folder.

Thanks!

Add option hifi

Add an option to use PacBio HiFi reads in assemblers where such an option is available, such as Canu and Flye.

Check new assemblers?

A short list of candidate assemblers to add to the pipeline.

Obs: Read up on each assembler to evaluate whether or not to add it.

Add hifiasm for long-read assembly

Hi Dr Almeida,

Thank you for the detailed genome assembly workflow. Could you please include another assembler, hifiasm (https://github.com/chhylp123/hifiasm)? It is particularly suited for long-read data.

Best
Chia-Wei

problem with longreads_only assembly

[intergalactic_knuth] Nextflow Workflow Report.pdf

Hi,

I am trying to assemble a plant genome (~800m) from PacBio Revio reads.

Here is the command I use:
nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker

here is the yml file contents

samplesheet:
  - id: sample_5
    pacbio: HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz

The process started but at some points I get the error messages similar to the following for all the assemblers

[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]

A pdf version of mpgap report is attached

Can you please advise?

Thank you in advance.
Guy

add testing GitHub Actions for PRs

To facilitate contributions and updates, it would be great to create GitHub Actions that test the pipeline with the available profiles and technologies.

update tool versions

Check whether new versions of the tools are available and update them.

Add skip parameter for sreads polishers and fix multiqc report names

Currently, in hybrid mode, all short-read polishers are used. We should add a parameter that allows skipping any one of them.
Also, the new MultiQC report produced in the dev branch now has some odd entries for ".stat" and ".err" files. This should be fixed.

add homopolish tool

Add the possibility to run homopolish if desired by the user:

https://github.com/ythuang0522/homopolish

Improve quality assessment

Add an option to the pipeline, such as a parameter called --eukaryotes, that tells the quality assessment tools (Quast and Busco) to run with their configurations for eukaryotes.

use nf-core framework for CLI help and log messages

Change the pipeline configuration a little bit to:

- Reorganize the config files, updating the standard profile so that it does not load any other profile, as is common for NF pipelines
- Better separate default params from the main config and script
- Use label resources to better manage parallel jobs, as is done by nf-core
- Use more of the nf-core framework and Groovy libs to provide nicer and cleaner CLI help and log messages

100% missing in Busco

Hi Dr Almeida,

I was playing around with the pipeline and everything went well. However, I found an issue with Busco.
The output summary showed that no BUSCOs were found in the query genome (100% missing) against bacteria_odb9. I also found some issues in the quast GitHub repository that seem related to LD_LIBRARY_PATH (ablab/quast#88).

I also tried a standalone Busco container (v5.4.7) with bacteria_odb10, and got 100% complete single copy. This confirms that the assembled genome is fine.

Could you please kindly check if it could be fixed?

Best,
CW

new directory called "final_output"

Add a rule to the pipeline so that all the assemblies of a sample have a copy stored in a single folder, e.g. final_output, making it easier for users to select and retrieve the assemblies they want.

Add more parallel jobs

Add the option to execute more jobs in parallel, with each job using up to N threads, as is done in bacannot.
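
As a rough illustration of the idea, Nextflow's own `executor` scope can already cap how many tasks run side by side. The fragment below is a sketch of a user-supplied config, not the pipeline's implementation. The numbers are placeholders, and it assumes the local executor and that the assembly processes carry a `process_assembly` label.

```groovy
// custom.config: sketch only, all values are placeholders
executor {
    name      = 'local'
    cpus      = 32     // total CPUs Nextflow may use on this machine
    queueSize = 4      // at most 4 tasks handled in parallel
}

process {
    // Assumption: assembly processes are tagged with this label
    withLabel:process_assembly {
        cpus = 8       // each assembly job gets up to N = 8 threads
    }
}
```

With these placeholder values, up to four 8-thread assembly jobs could run at once within the 32-CPU budget. A dedicated pipeline parameter would mainly save users from writing such a file by hand.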

update documentation about the configuration in either config or samplesheet

The pipeline now has a few configuration parameters that can be set globally to all samples at once (when set via the CLI or config file), or specifically to a single sample (when passed inside the samplesheet).

Although some of them are documented in the manual, others are not.

So we need to revise the documentation to make sure all these parameters are properly described in the manual and not only inside the config file.

Also revise the help message.
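
To make the two levels concrete, here is a minimal sketch. It assumes `genome_size` and `corrected_long_reads` are among the parameters that accept both forms. Both names appear in samplesheets and commands reported by users, but which parameters behave this way is exactly what the documentation needs to state. The sample id and file name are placeholders.

```groovy
// Global: applies to every sample (same effect as --genome_size 5m on the CLI)
params {
    genome_size          = '5m'
    corrected_long_reads = false
}

/*
 Per sample: the same keys inside the YAML samplesheet,
 affecting only that entry (id and file name are placeholders)

 samplesheet:
   - id: sample_1
     pacbio: sample_1.fastq.gz
     genome_size: 4m
     corrected_long_reads: true
*/
```

The manual should also say which value applies when a parameter is set in both places, since the sketch above does not settle that.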

Include option for high quality long reads

Add a parameter that tells the pipeline to treat high-quality input long reads as corrected reads. Whenever available, this should trigger the parameters in each assembler that are specific to corrected long reads.

Examples:

Flye:
- --pacbio-corr
- --nano-corr
Canu:
- -corrected

etc.

change to unicycler v0.5.0?

Unicycler has now made a huge release, v0.5.0, so it would be nice to have the pipeline use this version.

For that, a few fixes in the pipeline's environment and scripts would be required:

- Unicycler now accepts the newest SPAdes version, so the v3.13 binaries would no longer be necessary
- Unicycler no longer corrects reads prior to assembly, so the information about --no_correct should be removed
- Unicycler no longer polishes the assembly at the end, so a new Pilon polishing step is required after its assembly
  - This already happens for hybrid assemblies, but it should also be performed for Illumina assemblies.
- Unicycler has discontinued its unicycler_polish script, which was used inside MpGAP's Pilon polish module for paired-end reads
  - Thus, this module needs to be updated to stop using this script and perform only a single round of polishing with Pilon, with either single-end reads (which already works this way) or paired-end reads.
  - This removes the dependency on the ALE binaries

Obs: For now, this release will not impact the pipeline, since it is pinned to v0.4.8. However, to use the new version, these points should be addressed.

No such variable: USER

Dear developers of MpGAP,

I'm trying to run your assembly pipeline on ONT reads (after adapter removal with Porechop and length filtering with Filtlong) using the following command:

./nextflow run fmalmeida/mpgap --longreads /storage/ONT_results_FLO-MIN/pass/ONT_FLO-MIN_filtered.fastq.gz --lr_type nanopore --assembly_type longreads-only --try_canu --try_flye --try_unicycler --genomeSize 5m --outdir /storage/ONT_results_FLO-MIN/assemblies --threads 8

And I get the following error message:

N E X T F L O W ~ version 20.10.0
Launching fmalmeida/mpgap [nasty_pasteur] - revision: 9860b84 [master]
Docker-based, fmalmeida/mpgap, generic genome assembly pipeline

No such variable: USER

Is there a way to fix this in the CLI so that the pipeline runs correctly?

Thanks in advance for your kind help.

Best wishes

Shovill with all assemblers?

To date, Shovill is executed by default with "spades" as its base assembler. However, the software also supports megahit and skesa.

Users can already change the default assembler for Shovill, e.g. --shovill_additional_parameters " --assembler skesa ". However, this only swaps the assembler, and only the selected one is executed. The idea is:

Is it possible to create a default rule that makes the pipeline create a Shovill assembly with each possible assembler?

add the possibility of running directly from SRA IDs

Include a way to automatically download data from SRA and run the pipeline.

The bottleneck here is finding a way for the pipeline to fetch multiple SRA IDs for a single sample, as needed for a hybrid assembly, for example.

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it has the necessary command lines from input to output, so that one can reproduce it, and also add an overview of the generated results to the web page.

Once done, check how easily we can update the paper to provide an additional Zenodo record for the non-bacterial analysis (ngs-preprocess + MpGAP).

Add a simple parameter to handle starting memory settings

This issue relates to issues #52 and #59 where users seemed to face memory errors and had to adapt the config so that they could use more memory from the first try, instead of having to wait for retries.

By default, the pipeline first tries with a small amount, and on retry it uses the full amount specified by the max parameter:

    // Assemblies will first try to adjust themselves to a parallel execution
    // If it is not possible, then it waits to use all the resources allowed
    withLabel:process_assembly {
      cpus   = {  if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

    // Quast sometimes can take too long
    withName:quast {
      cpus   = {  if (task.attempt == 1) { check_max( 4 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 10.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 12.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

It would probably be good to also define parameters that configure the starting memory and thread counts used in the first attempt of these modules.

Maybe --start_asm_mem and --start_asm_cpus.
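
A minimal sketch of how this could look in the pipeline's own config, where `check_max` is available. The parameters `start_asm_cpus` and `start_asm_mem` are hypothetical (they are only proposed in this issue), the defaults mirror the current first-attempt values, and only the assembly label is shown.

```groovy
// Hypothetical parameters, proposed in this issue and not yet implemented
params {
    start_asm_cpus = 6         // CPUs requested on the first attempt
    start_asm_mem  = '20.GB'   // memory requested on the first attempt
}

process {
    withLabel:process_assembly {
        // First attempt: user-defined starting values, capped by check_max.
        // Retry: fall back to the maxima, as in the current config.
        cpus   = { task.attempt == 1 ? check_max( params.start_asm_cpus as int, 'cpus' ) : params.max_cpus }
        memory = { task.attempt == 1 ? check_max( params.start_asm_mem as nextflow.util.MemoryUnit, 'memory' ) : params.max_memory }
    }
}
```

Under this sketch, a user who already knows the assembly needs more memory could pass something like --start_asm_mem 100.GB and skip the failed first try. The explicit casts are there because values given on the command line may arrive as strings.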

conda?

Looks like a great pipeline. Any chance you can have it in conda?

No valid choice in the pipeline parameters for hybrid strategies and problem with -c config

Hi, thanks for the impressive work and pipeline

I'm trying to use it on our data, and I'm getting a couple of errors for starters.
If I run the pipeline with --input XXXX.yml and --hybrid_strategy 2, I get the error:

Launching https://github.com/fmalmeida/mpgap [golden_bell] DSL2 - revision: c1d2ab6 [master]
ERROR: Validation of pipeline parameters failed!

--hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)

--hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)

"both" seems to work with this syntax, but neither 1 nor 2 does.

If I run the pipeline while providing the config file with -c, I get the error:

N E X T F L O W ~ version 22.04.5
Unknown method invocation call on BigDecimal type -- Did you mean?
scale

I don't have much experience with Nextflow, so I may be missing something "easy". I hope you can comment and help. Thanks for the support.