
Comments (18)

SergioChile81 commented on May 25, 2024

Hello @fmalmeida,

Thank you for improving the pipeline with the new --hq-longreads option. I wonder if, later on, you could include a QC step for the reads (adapter and quality trimming with Porechop, quality filtering with Filtlong, lambda removal with NanoLyse, and evaluation with NanoPlot) -- see the beginning of this other great pipeline: https://nf-co.re/mag/2.5.0
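Roughly, the chain I have in mind would look something like the sketch below (file names are placeholders and the flags are only illustrative, not a tested recipe):

porechop -i reads.fastq.gz -o trimmed.fastq.gz --threads 4        # adapter trimming
gunzip -c trimmed.fastq.gz | NanoLyse | gzip > nolambda.fastq.gz  # remove lambda control reads (NanoLyse ships a bundled lambda reference)
filtlong --min_length 1000 --keep_percent 90 nolambda.fastq.gz \
    | gzip > filtered.fastq.gz                                    # length/quality filtering
NanoPlot --fastq filtered.fastq.gz -o nanoplot_qc                 # QC report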

I will get the folder contents and upload them here.

Cheers,

Sergio


SergioChile81 commented on May 25, 2024

Thank you @fmalmeida,

I will give the pipelines a try. About the work files of barcode04: I had to rerun the pipeline because I was using the same folder for the outputs of barcode04 and barcode06, and I think the first run's results got lost. I am attaching the files for barcode04 from the new run in this link. I also included the directory that shows up in the first error (in the HTML pipeline report).
Let me know if you need more information.

https://drive.google.com/drive/folders/1mxBNmZ7g7clanqaeLvQ6Yy9irYK5Dp-t?usp=share_link

Thanks,


SergioChile81 commented on May 25, 2024

Hi @fmalmeida,

Thank you for the new version of the pipeline. I tried to follow your instructions but got this error:

N E X T F L O W ~ version 23.10.0
Pulling fmalmeida/mpgap ...
fmalmeida/mpgap contains uncommitted changes -- cannot pull from repository
-c: command not found

Perhaps I am using a different version of Nextflow. Let me know how to proceed.


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,
Thanks for sharing.
I will take a look at the files during the week and will get back to you shortly.

Also, if you have an example of public hummingbird Nanopore data that I could give a try, please let me know 😄


fmalmeida commented on May 25, 2024

I think the problem relates to the fact that the pipeline currently only has settings for raw long reads and corrected long reads.

However, I have not yet implemented the parameter for high-quality long reads.

I will do so this week: I will update the tools that have new versions and add a new parameter to the pipeline for when the input long reads are high quality.

Then I will let you know so we can test this new version before releasing 🙂


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,
When inspecting the .nextflow.log for barcode4, I saw that the error message was not complete. Can you send me the contents of the following working directory?

/data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23/work/41/c93dc1f1f22a9ea43c902f1fbf1055

Also, for barcode06, I saw that Canu complained about read coverage. I believe this is happening because it applies algorithms for uncorrected reads to your dataset, which consists of high-quality long reads...

For this, I am now updating the pipeline to have a parameter --hq-longreads that you can use to pass this information on to assemblers that have algorithms for such reads, like Canu and Flye.

I am currently developing the update with the publicly available high-quality ONT R10.4 reads from D. melanogaster: https://www.ncbi.nlm.nih.gov/sra/SRX19162819
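For context, this is roughly how the new parameter should map onto Flye's read-type flags (a sketch of the intent, not the final implementation; the file name and genome size are placeholders):

flye --nano-raw  reads.fastq.gz --genome-size 180m --out-dir flye   # default: uncorrected ONT reads
flye --nano-corr reads.fastq.gz --genome-size 180m --out-dir flye   # corrected ONT reads
flye --nano-hq   reads.fastq.gz --genome-size 180m --out-dir flye   # high-quality ONT reads (R10.4 / SUP basecalls)

Canu would get an analogous adjustment so that its raw-read correction algorithms are not applied to already high-quality input.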


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,
Thanks for that. I’ll be waiting for them.

Also, thanks for the enhancement suggestion. At first, it was a design choice to keep it separate. You will see that we have developed a separate pipeline for preprocessing and QC: https://github.com/fmalmeida/ngs-preprocess

Right now, due to limited resources, I would probably not be able to commit to adding it to this pipeline. However, I would invite you to try running this other pipeline, https://github.com/fmalmeida/ngs-preprocess, and see if it fits your needs. If so, I would also invite you to open an issue in its GitHub repo to provide feedback and enhancement suggestions. It is always very much welcome.

About the present issue, I have now added the parameter and am assembling the Drosophila reads with the 3 algorithm modes (uncorrected, corrected and high-quality), so I can compare how the outputs look depending on the mode and then suggest a new command line for your data 😃


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,

I am currently doing the benchmarking/comparison between the assembly algorithm modes using D. melanogaster. I would like to understand how the normal/corrected/high_quality algorithms behave in the assemblers that have such options, so I can properly write a section in the documentation on what to do and what to expect if you have this data type.

In the meantime, I would like to ask if you could try running the branch using only the Flye assembler, just to check that the selection of the high-quality parameter is working fine and that the chosen assembly algorithm is being passed to the assembler.

The following command line should do the trick for testing:

nextflow run fmalmeida/mpgap \
    -r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb \
    -latest \
    --input input.yml \
    --output mpgap_results \
    --tracedir mpgap_results/pipeline_info \
    -profile docker \
    --max_cpus 10 --max_memory '60.GB' \
    --quast_additional_parameters ' --eukaryote --large ' \
    --skip_unicycler --skip_canu --skip_shasta --skip_wtdbg2 --skip_raven \
    --high_quality_longreads

The --skip_* params will allow only Flye to run.


fmalmeida commented on May 25, 2024

Also, @SergioChile81 ,

About the error shown here: https://drive.google.com/drive/folders/1mxBNmZ7g7clanqaeLvQ6Yy9irYK5Dp-t?usp=share_link
I unfortunately could not draw any useful conclusion from it, since the error from wtdbg2 is not very clear about the problem.

My best guess would be memory, as I saw that you only specified the parameter --max_cpus but not the parameter --max_memory.

By default, the pipeline makes a first attempt using a small amount of resources, in order to try multiple assemblies at the same time; if that fails, it launches a second attempt using the maximum values set by the user. As you can see, this was happening, as all your assemblies failed on the first attempt, most probably due to memory.

I would advise setting a value for --max_memory as well, as the default maximum is only 14.GB, which will be low for your genome.

Finally, if you want the first attempt to be bigger, you can also increase the amount of resources the pipeline uses on the 1st attempt.

Currently, this is the configuration the pipeline uses for assemblies:

process {
    // Assemblies will first try to adjust themselves to a parallel execution
    // If it is not possible, then it waits to use all the resources allowed
    withLabel:process_assembly {
      cpus   = {  if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if(type == 'memory'){
    try {
      if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'time'){
    try {
      if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'cpus'){
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}
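To illustrate what check_max does on the first attempt (a worked example, assuming the default --max_memory of 14.GB and a --max_cpus of 10 as in your command):

check_max( 20.GB * 1, 'memory' )  // 20.GB exceeds params.max_memory (14.GB), so only 14.GB is requested
check_max( 6 * 1, 'cpus' )        // 6 is below params.max_cpus (10), so 6 CPUs are requested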

You can possibly change it to

process {
    // Assemblies will first try to adjust themselves to a parallel execution
    // If it is not possible, then it waits to use all the resources allowed
    withLabel:process_assembly {
      cpus   = {  if (task.attempt == 1) { check_max( 12 * task.attempt, 'cpus'      ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 40.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if(type == 'memory'){
    try {
      if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'time'){
    try {
      if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'cpus'){
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}

You can save these lines to a file called custom.config and pass it to the pipeline with -c, like this:

nextflow run fmalmeida/mpgap -c custom.config <etc.>


SergioChile81 commented on May 25, 2024

Hello Felipe,

Sorry for the late response. The computer where I do the analyses is the same one where I run the Nanopore sequencing, and we have been busy. Here are the results, without the actual FASTA assembly, from your previous request for testing.

https://drive.google.com/drive/folders/1qdlh7F14_CNg3HdNYeDync0JL40g6pd7?usp=sharing

Our server has 64 CPUs and 512 GB of RAM. I will adjust the custom.config file and let you know. Thanks!


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,

No need to run it right now. I will most probably finish the parameter testing this weekend, so you can then launch a real run to test the enhancement code in the new branch for real, already using the high_quality reads parameters, as in your data.

I will let you know at the beginning of next week, when you can launch the run to use and check the new parameter.


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,

Now it seems to be properly passing the information to the assemblers. From what I saw, the only assemblers that have special parameters for high-quality ONT reads are Canu and Flye, thus I would advise you to skip all the others and allow only these two.

Finally, please make sure to use the configuration for increased memory usage.

Your command line would look more like this:

nextflow run fmalmeida/mpgap \
    -r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb \
    -latest \
    --input input.yml \
    --output mpgap_results \
    --tracedir mpgap_results/pipeline_info \
    -profile docker \
    --max_cpus 10 --max_memory '60.GB' \
    --quast_additional_parameters ' --eukaryote --large ' \
    --skip_unicycler --skip_shasta --skip_wtdbg2 --skip_raven \
    --high_quality_longreads \
    -c custom.config

Here, -c custom.config is the file containing the memory customisation setup I sent before.

Finally, the selection of read quality can happen globally, using the parameter --high_quality_longreads, or per sample, by setting it in the samplesheet as below:

samplesheet:
- id: highquality_algorithm
  nanopore: in_reads/final_output/nanopore/SRR23215008.filtered.fq.gz
  high_quality_longreads: true
  genome_size: 180m
  medaka_model: r1041_e82_400bps_sup_v4.2.0

Please let me know how it goes, because if things work, and the high-quality parameter really propagates the information to the assemblers and all, I will then start working on the documentation, to make sure to add full information about these features to the manual.


fmalmeida commented on May 25, 2024

Hi @SergioChile81 ,
I had forgotten to add a " in the command above. Please try the following:

1. rm -rf ~/.nextflow/assets/fmalmeida/mpgap
2. Then save the file custom.config with the amount of memory you need, as here: #52 (comment)
3. Then please try it out as here: #52 (comment)

I have updated the code here so it has the missing "".
Finally, remember to set the high_quality_longreads parameter, either on the command line or in the samplesheet, as shown above.

The idea is to test two things: first, that you can run the pipeline with more memory; and second, that the 'high-quality' parameters are being properly passed to the assemblers that have special params for them (only Flye and Canu). If you want to double-check the second point yourself, see the sketch below.
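A quick way to verify the propagation (assuming Nextflow's standard work-directory layout, where each task's rendered script is saved as .command.sh):

# confirm the '-hq' flag reached the flye command line ('--' stops grep parsing it as an option)
grep -- '--nano-hq' work/*/*/.command.sh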

I am currently working on another ticket to add hifiasm.

😄


SergioChile81 commented on May 25, 2024

Thanks again for the fast response. I have updated the files and commands following the instructions you provided. However, I got this new error... Sorry :/

(mpgap_nf) ubuntu@AGROSAVIA:~/mpgap$ nextflow   run fmalmeida/mpgap         -r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb         -latest --input MPGAP_samplesheet_barcode04.yml --output mpgap_results_barcode04 --tracedir mpgap_results_barcode04/pipeline_info -profile docker --max_cpus 10 --max_memory '60.GB' --quast_additional_parameters ' --eukaryote --large ' --skip_unicycler --skip_shasta --skip_wtdbg2 --skip_raven --high_quality_longreads -c custom.config
N E X T F L O W  ~  version 23.10.0
Pulling fmalmeida/mpgap ...
 Already-up-to-date
Launching `https://github.com/fmalmeida/mpgap` [fabulous_boyd] DSL2 - revision: 19b743abbc [52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb]

WARN: Found unexpected parameters:
* --pilon_polish_rounds: 4
- Ignore this warning: params.schema_ignore_params = "pilon_polish_rounds" 



------------------------------------------------------
  fmalmeida/mpgap v3.2
------------------------------------------------------
Core Nextflow options
  revision                   : 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb
  runName                    : fabulous_boyd
  containerEngine            : docker
  container                  : fmalmeida/mpgap@sha256:0439466a52a3aef70c3e3b2b8ba5504bf167db2437a7fbb85d40f94c95a67fb9
  launchDir                  : /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23
  workDir                    : /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23/work
  projectDir                 : /home/ubuntu/.nextflow/assets/fmalmeida/mpgap
  userName                   : ubuntu
  profile                    : docker
  configFiles                : /home/ubuntu/.nextflow/assets/fmalmeida/mpgap/nextflow.config, /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23/custom.config

Input/output options
  input                      : MPGAP_samplesheet_barcode04.yml
  output                     : mpgap_results_barcode04

Computational options
  max_cpus                   : 10
  max_memory                 : 60.GB

Long reads assemblers parameters
  high_quality_longreads     : true

Turn assemblers and modules on/off
  skip_unicycler             : true
  skip_raven                 : true
  skip_wtdbg2                : true
  skip_shasta                : true

Software' additional parameters
  quast_additional_parameters:  --eukaryote --large

Generic options
  tracedir                   : mpgap_results_barcode04/pipeline_info

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use fmalmeida/mpgap for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.3445485

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/fmalmeida/mpgap#citation
------------------------------------------------------

    Launching defined workflows!
    By default, all workflows will appear in the console "log" message.
    However, the processes of each workflow will be launched based on the inputs received.
    You can see that processes that were not launched have an empty [-       ].
  
[-        ] process > SHORTREADS_ONLY:spades       -
[-        ] process > SHORTREADS_ONLY:shovill      -
[-        ] process > SHORTREADS_ONLY:megahit      -
[-        ] process > LONGREADS_ONLY:canu          -
[-        ] process > LONGREADS_ONLY:flye          -
[-        ] process > LONGREADS_ONLY:medaka        -
[-        ] process > LONGREADS_ONLY:nanopolish    -
[-        ] process > LONGREADS_ONLY:gcpp          -
[-        ] process > HYBRID:strategy_1_spades     -
[-        ] process > HYBRID:strategy_1_haslr      -
[-        ] process > HYBRID:strategy_2_canu       -
[-        ] process > HYBRID:strategy_2_flye       -
[-        ] process > HYBRID:strategy_2_medaka     -
[-        ] process > HYBRID:strategy_2_nanopolish -
[-        ] process > HYBRID:strategy_2_gcpp       -
[-        ] process > HYBRID:strategy_2_pilon      -
[-        ] process > HYBRID:strategy_2_polypolish -
[-        ] process > ASSEMBLY_QC:quast            -
[-        ] process > ASSEMBLY_QC:multiqc          -
Execution cancelled -- Finishing pending tasks before exit
Pipeline completed at: 2023-11-17T10:22:38.681112228-05:00
Execution status: failed
Execution duration: 2.4s

Do not give up, we can fix it!


ERROR ~ Error executing process > 'LONGREADS_ONLY:flye (highquality_algorithm)'

Caused by:
  No signature of method: nextflow.script.ScriptBinding.check_max() is applicable for argument types: () values: [] -- Check script '/home/ubuntu/.nextflow/assets/fmalmeida/mpgap/./workflows/../modules/LongReads/flye.nf' at line: 26

Source block:
  lr        = (lr_type == 'nanopore') ? '--nano' : '--pacbio'
  if (corrected_longreads.toBoolean())    { lrparam = lr + '-corr' }
    else if (high_quality_longreads.toBoolean()) {
      lrsuffix = (lr_type == 'nanopore') ? '-hq' : '-hifi'
      lrparam  = lr + lrsuffix
    }
    else { lrparam = lr + '-raw' }
  gsize     = (genome_size) ? "--genome-size ${genome_size}" : ""
  additional_params = (params.flye_additional_parameters) ? params.flye_additional_parameters : ""
  """
    # run flye
    flye \\
        ${lrparam} $lreads \\
        ${gsize} \\
        --out-dir flye \\
        $additional_params \\
        --threads $task.cpus &> flye.log ;
  
    # rename results
    mv flye/assembly.fasta flye/flye_assembly.fasta
    """

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details


fmalmeida commented on May 25, 2024

What are the contents of your -c custom.config file?


fmalmeida commented on May 25, 2024

I think there might be a problem with your custom.config file. Can you try without this option (-c custom.config)?

Or share it here so we can check?

@SergioChile81
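In the meantime, if the check_max function is what is tripping things up, a minimal custom.config that avoids it altogether would look like the sketch below (the values are only examples, and note that hard-coding resources like this drops the two-step retry-with-maximum behaviour):

process {
    withLabel:process_assembly {
        cpus   = 12     // fixed request, no check_max call
        memory = 40.GB
        time   = 24.h
    }
}

That would at least isolate whether custom.config itself is causing the failure.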


fmalmeida commented on May 25, 2024

Closing this ticket due to lack of activity.
Note: if the error persists, or a new one appears, please feel welcome to open a new issue in the pipeline referencing this one (if it relates or is the same).

Results from the ticket:

A new parameter to handle high-quality long reads and activate the corresponding options in the assemblers that have them. Merged into the dev branch by #63 and it shall come in the next release.

Finally, a new issue was created to make it easier to modify the amount of memory that the pipeline requests from the start. This should make it easier to run datasets of bigger genomes that require more memory, without having to first fail with a starting 20.GB assembly job --> #61


fmalmeida commented on May 25, 2024

Added some new parameters in the latest release to allow users to quickly modify the amount of memory of the starting assembly jobs, select different BUSCO dbs, and say whether long reads are corrected or high quality.

https://github.com/fmalmeida/MpGAP/releases/tag/v3.2.0

Hope it helps.

If the error persists, we can open a new ticket to tackle it.

