sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to a genome assembly. This workflow is part of the Tree of Life production suite.

Home Page: https://pipelines.tol.sanger.ac.uk/readmapping

License: MIT License

Languages: HTML 2.34%, Python 8.47%, Nextflow 48.35%, Groovy 40.62%, Shell 0.23%
Topics: genomics, nextflow, pipeline, read-alignment

readmapping's People

Contributors

gq1, muffato, priyanka-surana

readmapping's Issues

Pre script

Description of feature

  • Check if specimen level file exists
  • Check @RG values in existing file
  • If expected @RG value not found, run the alignment and merge
  • If expected @RG value found, skip and end
  • Check if index exists, if yes skip step
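
The @RG check in the list above could be sketched in shell as follows. This is a minimal sketch, assuming the header of the existing specimen-level file is obtained with `samtools view -H`; the function name has_read_group and the file and ID names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if the SAM header read from stdin contains an @RG line
# whose ID field equals the first argument, and 1 otherwise.
has_read_group() {
    local wanted="$1"
    awk -F '\t' -v id="$wanted" '
        $1 == "@RG" {
            # Header fields are TAG:VALUE pairs separated by tabs.
            for (i = 2; i <= NF; i++) if ($i == "ID:" id) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

For example, `samtools view -H specimen.cram | has_read_group run1` would succeed when the read group is already present, in which case the alignment and merge could be skipped.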

Linting & Testing

Description of feature

Either remove existing tests or modify to match sanger-tol templates

Add multiple output options

Description of feature

To close this issue:

  1. Add a new parameter --outfmt which takes the values bam or cram, and a new parameter --compression which takes the values none or crumble. Users can pass multiple values separated by commas.
  2. In the main workflow or the input_check sub-workflow, parse and validate the parameter values, and pass these to the convert_stats sub-workflow.
  3. Configure the convert_stats sub-workflow to only run CRUMBLE if --compression crumble is passed, and to only run SAMTOOLS_VIEW if --outfmt cram is passed.
  4. Run tests to make sure --outfmt works as expected.
  5. Update the documentation.
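
The validation in step 2 would live in the pipeline's Groovy code; the shell sketch below only illustrates the logic of splitting a comma-separated value and checking each item against the allowed set. The function name validate_choices is hypothetical.

```shell
#!/usr/bin/env bash
# validate_choices VALUE ALLOWED...
# Split VALUE on commas and return 1 if any item is not among ALLOWED.
validate_choices() {
    local value="$1"; shift
    local item allowed ok
    local IFS=','
    for item in $value; do
        ok=0
        for allowed in "$@"; do
            if [ "$item" = "$allowed" ]; then ok=1; fi
        done
        if [ "$ok" -eq 0 ]; then
            echo "invalid value: $item" >&2
            return 1
        fi
    done
    return 0
}
```

With this sketch, `validate_choices "bam,cram" bam cram` succeeds, while `validate_choices "bam,sam" bam cram` fails and reports the offending item.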

Retries not happening

Description of the bug

When a Nextflow step fails, retries are not happening.

In a previous commit I removed the errorStrategy from the Sanger-specific config files, assuming it would be handled by the nf-core config files, but that is apparently not the case.

I also tried using farm5.config on tol and it worked, so it makes sense to combine the 3 Sanger-specific config files into one.

Command used and terminal output

No response

Relevant files

  1. conf/farm5.config
  2. conf/tol.config
  3. conf/gen3.config

System information

Nextflow version: 22.03.0-edge build 5693
Hardware: HPC
Executor: LSF
Container engine: Singularity
OS: Linux Ubuntu
Version of sanger-tol/readmapping: dev

Post script

Description of feature

  • Check all expected files are created
  • Delete genomic data except IRODS.*.fofn
  • Keep genomic data file structure
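
A minimal shell sketch of these three points follows, assuming the expected files are passed as explicit paths and that "genomic data" means every regular file under a given directory. Both function names are hypothetical, and the deletion is irreversible, so it should be tried on a copy first.

```shell
#!/usr/bin/env bash
# Report each missing path on stderr; return 1 if any expected file is absent.
check_expected_files() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Delete every regular file under directory $1 except IRODS.*.fofn.
# Directories are left untouched, so the file structure is kept.
prune_genomic_data() {
    find "$1" -type f ! -name 'IRODS.*.fofn' -exec rm -f {} +
}
```

Running check_expected_files first and calling prune_genomic_data only when it succeeds keeps the clean-up from removing inputs for a run that did not finish.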

Create config for single threaded processes

Description of feature

Design a config definition for single threaded processes

    withLabel:process_nompi {
        cpus   = { check_max( 1, 'cpus' ) }
        memory = { check_max( 16.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 8.h   * task.attempt, 'time'    ) }
    }

Support PacBio CLR

Description of feature

The current minimap2 preset for PacBio is map-hifi, which matches the data we produce now. Before HiFi was available, we used CLR reads, and for those the minimap2 preset should apparently be map-pb.
We need to support CLR in order to align the legacy data.
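
As a sketch of how the preset could be chosen per datatype, the function below maps a datatype label to a minimap2 preset. The labels pacbio and pacbio_clr and the function name are assumptions for illustration; map-hifi and map-pb are existing minimap2 presets.

```shell
#!/usr/bin/env bash
# Print the minimap2 preset for a datatype label (labels are hypothetical).
minimap2_preset() {
    case "$1" in
        pacbio)     echo "map-hifi" ;;   # HiFi reads
        pacbio_clr) echo "map-pb" ;;     # legacy CLR reads
        *)          echo "unknown datatype: $1" >&2; return 1 ;;
    esac
}
```

The result could then be passed to the aligner, for example `minimap2 -ax "$(minimap2_preset pacbio_clr)" ref.fa reads.fq`.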

Meta being read as a number

Description of the bug

Somewhere meta is causing this error, but I cannot find the source.

Cannot cast object '{id=sample1_T2, datatype=hic}' with class 'java.util.LinkedHashMap' to class 'int'

You can read the initial discussion on the nf-core Slack and the later discussion on the Nextflow Slack.

Command used and terminal output

No response

Relevant files

nextflow.log

System information

Nextflow: 21.10.6.5660
Hardware: HPC
Executor: LSF
Container: Singularity
OS: 18.04.1-Ubuntu
Pipeline: 0.1

References to `ps22` in the `tol_configs` files

Description of the bug

The singularity section of tol_configs contains references to ps22, which need to be removed. In fact, it would be ideal to remove the entire singularity section, since this is already handled in nextflow.config.

NB: Consider moving all tol_configs files to conf/

Command used and terminal output

No response

Relevant files

tol_configs/
tol_configs/analysis.config  
tol_configs/aws.config  
tol_configs/farm5.config  
tol_configs/gen3.config	
tol_configs/tol.config

System information

Nextflow version: 21.10.6
Hardware: HPC
Executor: LSF
Container engine: Singularity
OS: Linux
Version of sanger-tol/readmapping: dev

Filter PacBio adapter sequences

Description of feature

Create a module for HiFiAdapterFilt and replace bam2fastq with this. It will trim adapters and convert PacBio BAM to .fastq.

Steps:

  1. Create a container for HiFiAdapterFilt
  2. Create a module for HiFiAdapterFilt. Usage:
     bash pbadapterfilt.sh [ -p file Prefix ] [ -l minimum Length of adapter match to remove. Default=44 ] [ -m minimum percent Match of adapter to remove. Default=97 ] [ -t Number of threads for blastn. Default=8 ] [ -o outdirectory prefix Default=. ]
  3. Test before adding to workflow

Also see: https://sangertreeoflife.slack.com/archives/C015874DF7U/p1621329515004000

This is the command for creating the filtered FASTA files from the BAM and the filter list:

    samtools view -u -N $base.blocklist -o /dev/null -U- ../$base.bam | samtools fasta - | bgzip -c@4 > $base.filtered.fasta.gz

If you change samtools fasta to samtools fastq, you can write your own files to your working space.

Delete Wiki

Description of feature

Remove wiki pages and move text to docs

Create module for `bwa-mem2 align` only

Description of feature

Create bwa-mem2 only module using container:

  • Docker: quay.io/biocontainers/bwa-mem2:2.2.1--hd03093a_2
  • Singularity: https://depot.galaxyproject.org/singularity/bwa-mem2%3A2.2.1--hd03093a_2

Update `bwamem2_index` config settings

Description of the bug

For large genomes, the last retry of bwamem2_index requests 48 GB of memory, which is not enough. Either raise the final limit by starting from a larger value, or scale the starting value with genome size.
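
Scaling the starting value with genome size could look like the sketch below. The GB-per-Gbp factor is deliberately left as a parameter, because the right value is an assumption that should be measured on real runs; the function name and the floor argument are hypothetical.

```shell
#!/usr/bin/env bash
# estimate_index_mem_gb GENOME_BASES GB_PER_GBP [MIN_GB]
# Print a starting memory request in whole GB, rounded up, never below MIN_GB.
estimate_index_mem_gb() {
    local bases="$1" per_gbp="$2" min_gb="${3:-1}"
    # Integer ceiling of bases * per_gbp / 1e9.
    local gb=$(( (bases * per_gbp + 999999999) / 1000000000 ))
    if [ "$gb" -lt "$min_gb" ]; then gb="$min_gb"; fi
    echo "$gb"
}
```

For instance, with an illustrative factor of 28 GB per Gbp, a 3 Gbp genome would start at 84 GB, while a 100 Mbp genome with a floor of 16 would start at 16 GB.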

Reorganise subworkflow execution

Description of feature

To close this issue:

  1. Move convert_stats from the subworkflow to the workflow
  2. Move samtools_view from convert_stats to the workflow
  3. Rename convert_stats to alignment_statistics
  4. Run tests to make sure the output is still consistent

Name the output files with the assembly name

Description of feature

All output files are currently named assembly.*. As Kerstin suggested, it'd be nicer to use the actual assembly name instead of the fixed string assembly, so that all files can be recognised even outside of the directory structure.

Add CRAM compression option

Description of feature

To close this issue:

  1. Add the updated nf-core/crumble (0.9.1) to the pipeline.
  2. Move samtools_view from workflow to new subworkflow alignment_output.
  3. Integrate nf-core/crumble into alignment_output.
  4. Add custom module configurations to run for different technologies with different settings. (difficult)

Nf-tower integration

Description of feature

Set up sanger-tol/readmapping on Sanger nf-tower without credentials. This will only be used as an alternative way to view pipeline reports generated by Nextflow. This system will not be used to launch the pipeline.

Fail to send email when the pipeline completes - dev branch not main branch

Description of the bug

Here is the error message:

[main] ERROR nextflow.script.WorkflowMetadata - Failed to invoke `workflow.onComplete` event handler

groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
GStringTemplateScript2.groovy: 99: Unexpected character: '"' @ line 99, column 48.
ut << summary.collect{ k,v -> "
^

1 error

    at groovy.text.GStringTemplateEngine$GStringTemplate.<init>(GStringTemplateEngine.java:200)
    at groovy.text.GStringTemplateEngine.createTemplate(GStringTemplateEngine.java:114)
    at groovy.text.TemplateEngine.createTemplate(TemplateEngine.java:58)
    at groovy.text.TemplateEngine$createTemplate.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
    at NfcoreTemplate.email(NfcoreTemplate.groovy:108)
    at NfcoreTemplate.email(NfcoreTemplate.groovy:38)
    at NfcoreTemplate$email$4.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
    at Script_91722f0b$_runScript_closure2.doCall(Script_91722f0b:140)
    at Script_91722f0b$_runScript_closure2.doCall(Script_91722f0b)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:406)
    at nextflow.script.WorkflowMetadata$_invokeOnComplete_closure4.doCall(WorkflowMetadata.groovy:389)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:428)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2359)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2344)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2385)
    at nextflow.script.WorkflowMetadata.invokeOnComplete(WorkflowMetadata.groovy:387)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1268)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1029)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1012)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:101)
    at nextflow.script.WorkflowMetadata$_closure3.doCall(WorkflowMetadata.groovy:252)
    at nextflow.script.WorkflowMetadata$_closure3.call(WorkflowMetadata.groovy)
    at groovy.lang.Closure.run(Closure.java:493)
    at nextflow.Session.shutdown0(Session.groovy:694)
    at nextflow.Session.destroy(Session.groovy:644)
    at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:240)
    at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:135)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:354)
    at nextflow.cli.Launcher.run(Launcher.groovy:487)
    at nextflow.cli.Launcher.main(Launcher.groovy:646)

Failed to invoke workflow.onComplete event handler

Command used and terminal output

No response

Relevant files

No response

System information

nextflow version 22.10.0.5826

Try minimap2 splice

Description of feature

The BlastN step in PacBio read filtering is slow. A suggestion from Yumi was that we could try minimap2 splice.

Compare run time between BlastN vs Minimap2 Splice and select the more efficient option.

Filtering and alignments of Hi-C reads

Description of feature

We need to understand whether the Hi-C reads need to be filtered prior to being aligned, and what filter should be used. The Hi-C pipeline used in production in GRIT involves a locally modified copy of https://github.com/ArimaGenomics/mapping_pipeline which severely filters the reads compared to the official Arima pipeline, or the Hi-C option of BWA MEM2.

Questions to answer:

  • Arima filter or not? If yes, which version?
  • BWA or BWA-MEM2?

Remove repeat masking from genome files

Description of feature

Remove soft repeat masking from genome files as part of the prepare genome subworkflow before it moves to alignment.

Possible options (a module needs to be written for either):

  1. awk
     awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $0}}' in.fna > out.fna
  2. BBtools reformat.sh
     reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g

Disable BWAMEM2_INDEX when no short read data

Description of feature

To close this issue:

  1. Add conditional check to see if align_short will be run
  2. If not, BWAMEM2_INDEX should not be run

Currently it runs irrespective of the input read technology.
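
The conditional check itself belongs in the workflow code, but its logic can be sketched in shell. The sketch assumes a comma-separated samplesheet with a header row containing a datatype column, and treats illumina and hic as the short-read datatypes; the column name and function name are assumptions.

```shell
#!/usr/bin/env bash
# Return 0 if the samplesheet $1 has at least one short-read row
# (datatype illumina or hic), and 1 otherwise.
needs_short_read_index() {
    awk -F ',' '
        # Locate the datatype column from the header row.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "datatype") col = i; next }
        col && ($col == "illumina" || $col == "hic") { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

BWAMEM2_INDEX would then be run only when this check succeeds.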

Check number of files before merging

Description of feature

Check if more than 1 file is being passed into STATS_MARKDUP:MARKDUPLICATE_ALL:SAMTOOLS_MERGE. If only 1 file, just copy the files, and skip subworkflows MARKDUPLICATE_ALL and CONVERT_STATS_ALL.
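
A minimal sketch of the copy-or-merge decision, assuming samtools is on the PATH for the multi-file case. The function name merge_or_copy is hypothetical.

```shell
#!/usr/bin/env bash
# merge_or_copy OUT IN...
# Copy the single input to OUT, or merge several inputs with samtools.
merge_or_copy() {
    local out="$1"; shift
    if [ "$#" -eq 0 ]; then
        echo "no input files given" >&2
        return 2
    elif [ "$#" -eq 1 ]; then
        # Nothing to merge: a plain copy avoids an unnecessary samtools run.
        cp -- "$1" "$out"
    else
        samtools merge -o "$out" "$@"
    fi
}
```

In the pipeline the same test would decide whether the MARKDUPLICATE_ALL and CONVERT_STATS_ALL subworkflows are entered at all.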

Bwa-mem2 mem parallelisation

Description of feature

For illumina and hic, I want to split the input fastq files into chunks and align each chunk against the genome. All chunks must be aligned with the same @RG tag.

  1. Split fastq using split
  2. Make sure all split files have the same @RG tag
  3. Run bwa-mem2 mem on them individually
  4. Sort them with samtools individually
  5. Merge all split alignments from the same individual with the -c option
  6. Merge at the specimen level.
  7. Run through the rest of the markdup_stats subworkflow

It may be possible to combine steps 5 and 6.
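
Step 1 could be sketched as below, assuming uncompressed FASTQ with exactly four lines per record. The function name and chunk prefix are hypothetical. For paired-end data, both files would need the same chunk size so that mates stay in matching chunks.

```shell
#!/usr/bin/env bash
# split_fastq FASTQ READS_PER_CHUNK PREFIX
# Write chunks named PREFIXaa, PREFIXab, ... with whole records only.
split_fastq() {
    local fastq="$1" reads_per_chunk="$2" prefix="$3"
    # A FASTQ record is 4 lines, so split on a multiple of 4.
    split -l "$(( reads_per_chunk * 4 ))" "$fastq" "$prefix"
}
```

Each chunk can then be aligned and sorted on its own, with the same @RG string supplied to every bwa-mem2 mem call.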

Move subworkflow `stats_markdup` to workflow `readmapping`

Description of feature

Check if it is possible to move subworkflow stats_markdup to workflow readmapping. Here it will accept the input from the different align_datatype subworkflows.

Make sure that, for a samplesheet with mixed datatypes, the inputs are not mixed.

Change config for fixmate and merge

Description of feature

Fixmate and merge are currently set to process_low, which gives them only 2 cpus but 14GB of memory. I think samtools uses roughly 2GB per cpu, so raising the number of cpus to 4 or 6, with retry options included, may improve efficiency.

Example code:

process {
    withName:SAMTOOLS_FIXMATE {
        cpus   = { check_max( 6 * task.attempt, 'cpus'    ) }
    }
}
