fmalmeida / mpgap
Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads
Home Page: https://mpgap.readthedocs.io/en/latest/
License: GNU General Public License v3.0
Add the possibility to run homopolish, if desired by the user.
A small list of candidate assemblers to add to the pipeline.
Obs: read up on each assembler to evaluate whether or not to add it.
When I tried to test the complete workflow of your code, I found that the test dataset at https://figshare.com/ndownloader/articles/14036585/versions/4 passes through the ngs-preprocess pipeline, but when I continue with the MpGAP pipeline, an error occurs. The command I tried is as follows:
nextflow run fmalmeida/mpgap \
    --output mpgap_assmbly \
    --max_cpus 20 \
    --genome_size 6m \
    --input ./mpgap_samplesheet.yml \
    --hybrid-strategy both \
    -profile docker
Running log:
.nextflow.log
The pipeline now has a few configuration parameters that can be set globally to all samples at once (when set via the CLI or config file), or specifically to a single sample (when passed inside the samplesheet).
Although a few of them are documented in the manual, others are not.
So we need to revise the documentation to make sure all these parameters are properly described in the manual and not only inside the config file.
Also revise the help message.
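As an illustration of the two levels, the same parameter can be set globally on the CLI or overridden per sample inside the samplesheet (the sample ids and read files below are made up for the example; genome_size is a real per-sample field, as shown elsewhere in these issues):

```yaml
# globally, for every sample:
#   nextflow run fmalmeida/mpgap --genome_size 4m --input samplesheet.yml ...
# or per sample, inside the samplesheet (overrides the global value for that sample):
samplesheet:
  - id: sample_a
    nanopore: sample_a.fastq.gz
    genome_size: 4m
  - id: sample_b
    nanopore: sample_b.fastq.gz   # no genome_size here; the CLI/config value applies
```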
Describe the bug
I encountered an issue while running the pipeline with two barcoded genome samples (Barcode04 and Barcode06). These samples produced exceptionally large output files: Barcode04 resulted in an 11GB file, while Barcode06 generated a massive 980GB file. Both runs also exhibited different errors. It's worth noting that the reference genome size for these samples, which are from hummingbirds, is approximately 1.5 Gb. The sequencing was performed on the Nanopore PromethION (R10.4.1), and basecalling was done with the super-accurate model.
To Reproduce
Steps to reproduce the behavior:
Run the following command line with the files in the respective folders
nextflow run fmalmeida/mpgap --output output_barcode04_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode04.yml" -profile docker
or
nextflow run fmalmeida/mpgap --output output_barcode06_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode06.yml" -profile docker
Expected behavior
Output folders with the results of the pipeline
Add an option to facilitate and organize the execution of the pipeline for multiple samples. Maybe create something using YAML syntax, as is done in bacannot.
Hi,
Thanks for your kind support in the past. I've been using mpgap routinely in our projects, but now I've been struggling for a while with a new installation... maybe trivial but I'm stuck and wondered if you could comment on it.
I'm using singularity. Everything seems to be working, and the pipeline starts, but then ends with the error:
Explicit 'name separator' in class near index 8
[dataset/pacbio.fastq]
^
-- Check script '/.nextflow/assets/fmalmeida/mpgap/./workflows/parse_samples.nf' at line: 66 or see '.nextflow.log' file for more details
I believe I've used the same syntax as always, and the one suggested in the manual in the yml samplesheet:
samplesheet:
  - id: Sol_test_1
    pacbio:
      - 'dataset/pacbio.fastq'
    genome_size: 39.11m
    wtdbg2_technology: rs
    corrected_long_reads: false
I'm attaching the nextflow.log. It may be something trivial and nextflow-related... but could you comment please? I've unsuccessfully tried using different quotes in the yml file, and even placing the file in the same folder.
Thanks!
Add a rule to the pipeline so that all the assemblies of a sample have a copy stored in a single folder, e.g. final_output, so that it is easier for users to further select and retrieve the assemblies they want.
So, I've consistently observed that the quast step fails, and even though the pipeline points to the work directory so the files can be checked, these appear to be empty.
This is the folder:
I attach the logs, and the files that were not 0-sized.
Will let you know if it keeps failing during my tests, and whether the fix you provided in the config file allows bypassing the error and avoids the pipeline crash. Thank you so much!
nextflow.log.txt
output.log.txt
quast_files.zip
Hi Dr Almeida,
Thank you for the detailed genome assembly workflow. Could you please include another assembler, hifiasm (https://github.com/chhylp123/hifiasm)? It is particularly suited for PacBio HiFi long-read data.
Best
Chia-Wei
Include the automatic generation of a samplesheet that can be readily used for bacannot.
The question here is: include all generated genomes, only the polished ones, or only the final ones?
Add the option to execute more jobs in parallel, with each job using up to N threads. As it happens in bacannot!
Change the pipeline configurations a little bit to:
Hi, thanks for the impressive work and pipeline
So I'm trying to use it on our data, and getting a couple of errors for starters.
If I run the pipeline with --input XXXX.yml and --hybrid_strategy 2, I get the error:
Launching https://github.com/fmalmeida/mpgap [golden_bell] DSL2 - revision: c1d2ab6 [master]
ERROR: Validation of pipeline parameters failed!
- --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)
- --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)
both seems to work with this syntax, but neither 1 nor 2 does.
If I run the pipeline trying to provide the config file with -c, I get the error:
N E X T F L O W ~ version 22.04.5
Unknown method invocation `call` on BigDecimal type -- Did you mean?
  scale
I don't have much experience with nextflow, so I may be missing something "easy". Hope you can comment and help. Thanks for the support.
Add another hybrid strategy for samples where this might be the best option.
This strategy is to perform a short reads assembly and then scaffold with long reads.
When running a samplesheet that has long-reads assemblies, if one forgets to give the --genome_size parameter, the pipeline exits without showing the error message in the console. The message appears only in .nextflow.log.
The problem is in https://github.com/fmalmeida/MpGAP/blob/master/nf_functions/writeCSV.nf: instead of println it should use log.error, and instead of exit 1 it should use System.exit(1).
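A sketch of the proposed change (the surrounding condition and the error message are illustrative, not the actual writeCSV.nf code):

```groovy
// before (proposed to change): println + exit 1, so the message never reaches the console
// after (proposed):
if (!params.genome_size) {
    log.error "ERROR: --genome_size is required for long-reads assemblies"  // goes to console and .nextflow.log
    System.exit(1)                                                          // terminates the run immediately
}
```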
Add an option in the pipeline, such as a parameter called --eukaryotes, which will tell the quality assessment tools (Quast and Busco) to run using their configurations for eukaryotes.
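A minimal sketch of how such a flag could translate into tool options (QUAST's --eukaryote flag and BUSCO's -l lineage option are real options of those tools; the variable names, lineage choice and output paths here are illustrative):

```shell
eukaryotes=true   # hypothetical --eukaryotes parameter

if [ "$eukaryotes" = true ]; then
  quast_opts="--eukaryote"             # QUAST eukaryote mode
  busco_lineage="eukaryota_odb10"      # BUSCO eukaryote lineage dataset
else
  quast_opts=""
  busco_lineage="bacteria_odb10"
fi

# the commands the pipeline would run (echoed here for illustration)
echo "quast.py $quast_opts -o quast_dir assembly.fasta"
echo "busco -i assembly.fasta -l $busco_lineage -m genome -o busco_dir"
```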
This issue relates to issues #52 and #59 where users seemed to face memory errors and had to adapt the config so that they could use more memory from the first try, instead of having to wait for retries.
By default, the pipeline first tries with a small amount of resources, then uses the full amount specified by the max parameters:
// Assemblies will first try to adjust themselves to a parallel execution
// If it is not possible, then it waits to use all the resources allowed
withLabel:process_assembly {
    cpus   = { if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus' ) } else { params.max_cpus } }
    memory = { if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
    time   = { if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time' ) } else { params.max_time } }

    // retry at least once to try it with full resources
    errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'
}

// Quast sometimes can take too long
withName:quast {
    cpus   = { if (task.attempt == 1) { check_max( 4 * task.attempt, 'cpus' ) } else { params.max_cpus } }
    memory = { if (task.attempt == 1) { check_max( 10.GB * task.attempt, 'memory' ) } else { params.max_memory } }
    time   = { if (task.attempt == 1) { check_max( 12.h * task.attempt, 'time' ) } else { params.max_time } }

    // retry at least once to try it with full resources
    errorStrategy = { task.exitStatus in [21,143,137,104,134,139,247] ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'
}
It would probably also be good to define parameters that configure the starting memory amount and thread count used in the first attempt of these modules. Maybe --start_asm_mem & --start_asm_cpus.
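A sketch of how such parameters could plug into the config above (--start_asm_mem and --start_asm_cpus are the proposal in this issue, not existing pipeline options; the defaults match the currently hard-coded values):

```groovy
// hypothetical defaults for the proposed params
params.start_asm_cpus = 6
params.start_asm_mem  = 20.GB

withLabel:process_assembly {
    // first attempt uses the user-configurable starting amounts; retries fall back to the max params
    cpus   = { task.attempt == 1 ? check_max( params.start_asm_cpus * task.attempt, 'cpus' )   : params.max_cpus   }
    memory = { task.attempt == 1 ? check_max( params.start_asm_mem  * task.attempt, 'memory' ) : params.max_memory }
}
```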
Background
This issue is meant to address the comments received on the paper review here.
Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it has the necessary command lines from input to output, so one can reproduce it, but also add an overview of the generated results to the web page.
Once done, check how easily we can update the paper to provide an additional Zenodo record for the non-bacterial analysis (ngs-preprocess + MpGAP).
Add a parameter that tells the pipeline to treat the high-quality input long reads as corrected reads. This should trigger, whenever available, the option in each assembler that is specific to corrected long reads.
Examples:
--pacbio-corr
--nano-corr
-corrected
etc.
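To illustrate, a sketch of how one pipeline-level flag could fan out to assembler-specific options (Flye's --nano-corr/--pacbio-corr and Canu's -corrected are real flags of those assemblers; the variable names and the mapping logic are illustrative):

```shell
corrected_longreads=true    # hypothetical pipeline-level parameter
lr_type=nanopore            # or: pacbio

if [ "$corrected_longreads" = true ]; then
  # pick the corrected-reads input option per assembler
  if [ "$lr_type" = pacbio ]; then flye_opt="--pacbio-corr"; else flye_opt="--nano-corr"; fi
  canu_opt="-corrected"
else
  if [ "$lr_type" = pacbio ]; then flye_opt="--pacbio-raw"; else flye_opt="--nano-raw"; fi
  canu_opt=""
fi

# the assembler invocations the pipeline would build (echoed for illustration)
echo "flye $flye_opt reads.fastq.gz --out-dir flye_out"
echo "canu -p sample -d canu_out genomeSize=4m $canu_opt -nanopore reads.fastq.gz"
```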
To date, Shovill is executed by default with the "spades" assembler as its base. However, the software also supports using megahit and skesa.
It is possible for users to change the default assembler for shovill, e.g. --shovill_additional_parameters " --assembler skesa ". However, this will only change the selected assembler and execute only that one. The idea is:
Is it possible to create a rule that, by default, makes the pipeline create a shovill assembly with each possible assembler?
Include a way to automatically download data from SRA and run the pipeline.
The bottleneck here is identifying a way for the pipeline to fetch multiple SRA accessions for a single sample, e.g. in the case of a hybrid assembly.
The pipeline (in the dev branch) now has two params: --corrected_longreads and --high_quality_longreads.
Maybe it would be worth making hifiasm only run when one of these is available?
This is a follow up of #53
@scintilla9 any strong opinion?
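A sketch of how that gating could look inside the hifiasm process definition (the param names come from this issue; the exact placement of the when: block is illustrative):

```groovy
// inside the hifiasm process (sketch):
when:
params.corrected_longreads || params.high_quality_longreads
```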
Check if there are new versions of the tools available and update them.
To facilitate contributions and updates, it would be great to create GitHub Actions to test the pipeline with the available profiles and technologies.
Some of the workflow parameters are explained in the online documentation (readthedocs) but not in the command line help! Fix it!
Dear developers of MpGAP,
I'm trying to run your assembly pipeline on ONT reads (after adapter removal with Porechop and length filtering with Filtlong) using the following command:
./nextflow run fmalmeida/mpgap --longreads /storage/ONT_results_FLO-MIN/pass/ONT_FLO-MIN_filtered.fastq.gz --lr_type nanopore --assembly_type longreads-only --try_canu --try_flye --try_unicycler --genomeSize 5m --outdir /storage/ONT_results_FLO-MIN/assemblies --threads 8
And I get the following error message:
N E X T F L O W ~ version 20.10.0
Launching fmalmeida/mpgap [nasty_pasteur] - revision: 9860b84 [master]
Docker-based, fmalmeida/mpgap, generic genome assembly pipeline
No such variable: USER
Is there a way to fix this in the CLI so that the pipeline runs correctly?
Thanks in advance for your kind help.
Best wishes
JL
Hi Dr Almeida,
I was playing around with the pipeline and everything went well; however, I found an issue in Busco.
The output summary showed that no BUSCOs were found in the query genome (100% missing) against bacteria_odb9. I also found some related issues in the quast GitHub that seem related to LD_LIBRARY_PATH (ablab/quast#88).
I also tried a standalone Busco container (v5.4.7) with bacteria_odb10 and got 100% complete single-copy BUSCOs. This confirms that the assembled genome is fine.
Could you please kindly check if it could be fixed?
Best,
CW
Hi, so thanks again for the help. I'm slowly going through the pipeline on my HPC. It seems to be only a matter of specifying environment variables to make sure the root directory and /tmp are not filled, and then allocating enough resources (CPUs and memory).
So at last, it's running the Pilon processes, but I've encountered an error I wanted to ask you about. Maybe it is related to the pipeline not resuming properly even when -resume is provided to the nextflow command?
Could this be related to the pipeline rather than my system? Checking the logs I see the same error with other folders, such as "wtdbg2" instead of "flye"; I guess for the different Pilon runs on the different assemblies? Maybe some mkdir or mv commands should be forced to allow replacing existing files when resuming runs? Thanks!
[intergalactic_knuth] Nextflow Workflow Report.pdf
Hi,
I am trying to assemble a plant genome (~800m) from PacBio Revio reads.
Here is the command I use:
nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker
Here are the contents of the yml file:
samplesheet:
  - id: sample_5
    pacbio: HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz
The process started, but at some point I get error messages similar to the following for all the assemblers:
[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]
A pdf version of mpgap report is attached
Can you please advise?
Thank you in advance.
Guy
Add the Trycycler tool as an option, to generate a consensus assembly from the outputs of the long-read assembly tools.
Unicycler has now made a huge release, v0.5.0. It would be nice to have the pipeline use this version.
For that, a few fixes in the pipeline's environment and scripts would be required:
--no_correct should be removed
unicycler_polish, which was used inside MpGAP's pilon polish module for paired-end reads
the ALE binaries
Obs: for now, this release will not impact the pipeline since it is pinned to v0.4.8. However, for using the new one, these observations should be addressed.
Background
This issue is meant to address the comments received on the paper review here.
Description
Create an "Output" page to help users with the output structure and refer to the correct tool-specific links, as is done in the bacannot documentation page, which gives users an interpretation of the generated results, including the directory structure and the relevant links to tool-specific reference material.
Assess what is required and how to implement some of the nice features of https://github.com/gbouras13/hybracter, such as the coupled polishing of polypolish + pypolca, which seem to work nicely in a complementary manner. Maybe also the plasmid assembly step, and maybe the chromosome reorientation step.
Check what can be done, and how.
Add an option to use PacBio HiFi in assemblers where an option for it is available, such as Canu, Flye, etc.
Looks like a great pipeline. Any chance you can have it in conda?
Currently, in hybrid mode, all short-read polishers are used. A parameter should be added to allow skipping one or another.
Also, the new MultiQC report produced in the dev branch now has some weird entries with ".stat" and ".err" files... this should be fixed.
Pilon is the tool used for polishing long-read assemblies in the pipeline.
It would be nice to also add Polypolish as a second short-reads polisher for long-read assemblies, alongside Pilon.
By default, the pipeline would polish long-read assemblies with both, but users could choose to skip one of them.