sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to a genome assembly. This workflow is part of the Tree of Life production suite.

Home Page: https://pipelines.tol.sanger.ac.uk/readmapping

License: MIT License

HTML 2.34% Python 8.47% Nextflow 48.35% Groovy 40.62% Shell 0.23%

genomics nextflow pipeline read-alignment

readmapping's People

sateeshperi ccaio amakunin tkchafin reichan1998

readmapping's Issues

Large genome test with `Sciurus vulgaris` for ONT

Description of feature

Check if specimen level file exists
Check @RG values in existing file
If expected @RG value not found, run the alignment and merge
If expected @RG value found, skip and end
Check if index exists, if yes skip step
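
The decision described in the checklist above could be sketched as follows. This is a minimal illustration in Python, not the pipeline's implementation: the function names are hypothetical, and the header text is assumed to have been extracted beforehand (for example with samtools view -H).

```python
from typing import Optional, Set


def read_group_ids(sam_header: str) -> Set[str]:
    """Return the ID value of every @RG line in a SAM header."""
    ids = set()
    for line in sam_header.splitlines():
        if not line.startswith("@RG\t"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.add(field[3:])
    return ids


def needs_alignment(sam_header: Optional[str], expected_rg: str) -> bool:
    """True when the specimen-level file is missing or lacks the expected @RG."""
    if sam_header is None:  # hypothetical convention: None means no file exists yet
        return True
    return expected_rg not in read_group_ids(sam_header)
```

When needs_alignment returns False, the alignment and merge steps can be skipped.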

Description of feature

Either remove existing tests or modify to match sanger-tol templates

Description of feature

Assign location based on genome path to meta
Update publishDir using meta

Forking to assign sub workflows not working

Description of feature

I would like to assign a sub workflow based on the value of meta.datatype.

When using attached readmapping.nf workflow, I get the error:

Cause: No such property: meta for class: Script_401b8f1e

Description of feature

Implement nf-validation for parameters and sample sheets

Description of feature

To close this issue:

Add a new parameter --outfmt which takes values bam or cram, and a new parameter --compression which takes values none or crumble. Users can pass multiple values separated by commas.
In the main workflow or the input_check sub-workflow, parse and validate the parameter values, and pass these to the convert_stats sub-workflow.
Configure the convert_stats sub-workflow to only run CRUMBLE if --compression crumble is passed, and to only run SAMTOOLS_VIEW if --outfmt cram is passed.
Run tests to make sure --outfmt works as expected.
Update the documentation.
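
The parsing and validation step could look roughly like this. It is a sketch in Python rather than the Groovy the pipeline would actually use; the allowed values come from the issue text, while the function name and the error messages are assumptions.

```python
from typing import List

# Allowed values as listed in the issue.
ALLOWED = {
    "outfmt": {"bam", "cram"},
    "compression": {"none", "crumble"},
}


def parse_option(name: str, raw: str) -> List[str]:
    """Split a comma-separated parameter value and validate each entry."""
    values = [v.strip().lower() for v in raw.split(",") if v.strip()]
    if not values:
        raise ValueError(f"--{name} requires at least one value")
    invalid = [v for v in values if v not in ALLOWED[name]]
    if invalid:
        raise ValueError(f"--{name} got unsupported value(s): {', '.join(invalid)}")
    # Drop duplicates while keeping the order the user gave.
    return list(dict.fromkeys(values))
```

The validated lists can then be passed to the convert_stats sub-workflow to decide whether CRUMBLE and SAMTOOLS_VIEW should run.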

Filtering out non-primary alignments

Description of feature

Command to remove non-primary alignments:

samtools view -F 256 input.bam
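
For reference, -F 256 excludes records with the secondary-alignment bit (0x100) set in the SAM FLAG field; supplementary alignments (0x800) are not affected by this filter. The small Python sketch below illustrates the bit test only; the function name is hypothetical.

```python
SECONDARY = 0x100  # 256: "secondary alignment" bit in the SAM FLAG field


def kept_by_F256(flag: int) -> bool:
    """Mimic `samtools view -F 256`: keep a record unless bit 0x100 is set."""
    return (flag & SECONDARY) == 0
```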

Description of the bug

When a Nextflow step fails, retries are not happening.

I removed the errorStrategy from the Sanger-specific config files in a previous commit, assuming it would be handled by the nf-core config files, but that is apparently not the case.

I also tried using farm5.config on tol and it worked, so it makes sense to combine the three Sanger-specific config files into one.

Command used and terminal output

No response

Relevant files

conf/farm5.config
conf/tol.config
conf/gen3.config

System information

Nextflow version: 22.03.0-edge build 5693
Hardware: HPC
Executor: LSF
Container engine: Singularity
OS: Linux Ubuntu
Version of sanger-tol/readmapping: dev

Description of feature

Check all expected files are created
Delete genomic data except IRODS.*.fofn
Keep genomic data file structure

Create config for single threaded processes

Description of feature

Design a config definition for single threaded processes

    withLabel:process_nompi {
        cpus   = { check_max( 1, 'cpus' ) }
        memory = { check_max( 16.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 8.h   * task.attempt, 'time'    ) }
    }

Large size tests for pacbio and hic with mCerEla1

Location: /lustre/scratch123/tol/teams/tolit/users/ps22/pipelines/nf-core-readmapping_v2/mCerEla1

Medium genome test with `Erithacus rubecula` for Illumina

Description of feature

The current minimap2 setting for pacbio is map-hifi, as this is what we are producing now. In the past, however, we did not have HiFi and were using CLR reads, for which the minimap2 setting should apparently be map-pb.
We need to support CLR in order to align the legacy data.
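
One way to express the choice is a lookup from read type to minimap2 preset. The sketch below is in Python for illustration; the read-type labels "hifi" and "clr" are hypothetical, and only the two presets named above are covered.

```python
# Preset names as passed to `minimap2 -x`; the keys are hypothetical labels.
PACBIO_PRESETS = {
    "hifi": "map-hifi",
    "clr": "map-pb",
}


def minimap2_preset(read_type: str) -> str:
    """Return the minimap2 preset for a PacBio read type."""
    try:
        return PACBIO_PRESETS[read_type.lower()]
    except KeyError:
        raise ValueError(f"unknown PacBio read type: {read_type!r}") from None
```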

Create automatic csv samplesheet for ToL

Description of feature

Write a module that takes the fasta genome as input and creates a csv samplesheet
Merge into input check subworkflow

Modify stats markdup workflow

Description of feature

Remove convert_stats for individual libraries
Remove all steps post specimen level

Meta being read as a number

Description of the bug

Somewhere meta is causing this error, but I cannot find the source.

Cannot cast object '{id=sample1_T2, datatype=hic}' with class 'java.util.LinkedHashMap' to class 'int'

You can read the initial discussion on Nf-core Slack and later on Nextflow Slack.

Command used and terminal output

No response

Relevant files

nextflow.log

System information

Nextflow: 21.10.6.5660
Hardware: HPC
Executor: LSF
Container: Singularity
OS: 18.04.1-Ubuntu
Pipeline: 0.1

Merge HiC and Illumina subworkflows

Description of feature

The subworkflows align_hic and align_illumina are identical, so it makes sense to combine them.

References to `ps22` in the `tol_configs` files

Description of the bug

In the singularity section of tol_configs, there are mentions of ps22 that need to be removed. In fact, it would be ideal to remove the entire singularity section, as it is already handled in nextflow.config.

NB: Consider moving all tol_configs files to conf/

Command used and terminal output

No response

Relevant files

tol_configs/
tol_configs/analysis.config  
tol_configs/aws.config  
tol_configs/farm5.config  
tol_configs/gen3.config	
tol_configs/tol.config

System information

Nextflow version: 21.10.6
Hardware: HPC
Executor: LSF
Container engine: Singularity
OS: Linux
Version of sanger-tol/readmapping: dev

Large genome test with `Meles meles` for Illumina

Filter PacBio adapter sequences

Description of feature

Create a module for HiFiAdapterFilt and replace bam2fastq with this. It will trim adapters and convert PacBio BAM to .fastq.

Steps:

Create a container for HiFiAdapterFilt
Create a module for HiFiAdapterFilt

bash pbadapterfilt.sh [ -p file Prefix ] [ -l minimum Length of adapter match to remove. Default=44 ] [ -m minimum percent Match of adapter to remove. Default=97 ] [ -t Number of threads for blastn. Default=8 ] [ -o outdirectory prefix Default=. ]

Test before adding to workflow

Also see: https://sangertreeoflife.slack.com/archives/C015874DF7U/p1621329515004000

This is the command for creating the filtered fasta from the bam and the filter list:

samtools view -u -N $base.blocklist -o /dev/null -U- ../$base.bam | samtools fasta - | bgzip -c@4 > $base.filtered.fasta.gz

If you change samtools fasta to samtools fastq, you can output your own files to your working space.

Medium genome test with `Pararge aegeria` for ONT

Delete Wiki

Description of feature

Remove wiki pages and move text to docs

Create module for `bwa-mem2 align` only

Description of feature

Create bwa-mem2 only module using container:

Docker: quay.io/biocontainers/bwa-mem2:2.2.1--hd03093a_2
Singularity: https://depot.galaxyproject.org/singularity/bwa-mem2%3A2.2.1--hd03093a_2

Update `bwamem2_index` config settings

Description of the bug

For large genomes, the last try of bwamem2_index requests 48 GB of memory, but that is not enough. Either raise the final limit by starting from a larger value, or adjust the starting value according to genome size.
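
The second option could be sketched as below. This is only an illustration in Python: the function name, the linear scaling with the attempt number (borrowed from the `* task.attempt` pattern used in the pipeline's configs), and the memory-per-gigabase factor are all assumptions, and the factor is deliberately left as a required argument because no measured value is given here.

```python
def index_memory_gb(genome_size_bp: int, gb_per_gbp: float,
                    attempt: int = 1, floor_gb: float = 1.0) -> float:
    """Scale the memory request with genome size and retry attempt.

    gb_per_gbp is a hypothetical tuning factor (GB of RAM per gigabase)
    that would have to be measured for bwa-mem2 index.
    """
    size_gbp = genome_size_bp / 1e9
    return max(floor_gb, size_gbp * gb_per_gbp) * attempt
```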

Change config based on input

Description of feature

It would save time and compute resources if the resource config were set based on input size. For example, for large genomes the pipeline could start with a higher time value.

See: https://nfcore.slack.com/archives/CE6SDBX2A/p1648466297611599

Reorganise subworkflow execution

Description of feature

To close this issue:

Move convert_stats from the subworkflow to the workflow
Move samtools_view from convert_stats to the workflow
Rename convert_stats to alignment_statistics
Run tests to make sure the output is still consistent

Name the output files with the assembly name

Description of feature

All output files are currently named assembly.*. As Kerstin suggested, it'd be nicer to use the actual assembly name instead of the fixed string assembly, so that all files can be recognised even outside of the directory structure.

Small genome test with `Anthocharis cardamines` for ONT

Add CRAM compression option

Description of feature

To close this issue:

Add the updated nf-core/crumble (0.9.1) to the pipeline.
Move samtools_view from workflow to new subworkflow alignment_output.
Integrate nf-core/crumble into alignment_output.
Add custom module configurations to run for different technologies with different settings. (difficult)

Nf-tower integration

Description of feature

Set up sanger-tol/readmapping on Sanger nf-tower without credentials. This will only be used as an alternative way to view pipeline reports generated by Nextflow. This system will not be used to launch the pipeline.

Fail to send email when the pipeline completes - dev branch not main branch

Description of the bug

Here is the error message:

[main] ERROR nextflow.script.WorkflowMetadata - Failed to invoke `workflow.onComplete` event handler

groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
GStringTemplateScript2.groovy: 99: Unexpected character: '"' @ line 99, column 48.
ut << summary.collect{ k,v -> "
^

1 error

    at groovy.text.GStringTemplateEngine$GStringTemplate.<init>(GStringTemplateEngine.java:200)
    at groovy.text.GStringTemplateEngine.createTemplate(GStringTemplateEngine.java:114)
    at groovy.text.TemplateEngine.createTemplate(TemplateEngine.java:58)
    at groovy.text.TemplateEngine$createTemplate.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
    at NfcoreTemplate.email(NfcoreTemplate.groovy:108)
    at NfcoreTemplate.email(NfcoreTemplate.groovy:38)
    at NfcoreTemplate$email$4.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
    at Script_91722f0b$_runScript_closure2.doCall(Script_91722f0b:140)
    at Script_91722f0b$_runScript_closure2.doCall(Script_91722f0b)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:406)
    at nextflow.script.WorkflowMetadata$_invokeOnComplete_closure4.doCall(WorkflowMetadata.groovy:389)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:428)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2359)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2344)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2385)
    at nextflow.script.WorkflowMetadata.invokeOnComplete(WorkflowMetadata.groovy:387)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1268)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1029)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1012)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:101)
    at nextflow.script.WorkflowMetadata$_closure3.doCall(WorkflowMetadata.groovy:252)
    at nextflow.script.WorkflowMetadata$_closure3.call(WorkflowMetadata.groovy)
    at groovy.lang.Closure.run(Closure.java:493)
    at nextflow.Session.shutdown0(Session.groovy:694)
    at nextflow.Session.destroy(Session.groovy:644)
    at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:240)
    at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:135)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:354)
    at nextflow.cli.Launcher.run(Launcher.groovy:487)
    at nextflow.cli.Launcher.main(Launcher.groovy:646)

Failed to invoke workflow.onComplete event handler

Command used and terminal output

No response

Relevant files

No response

System information

nextflow version 22.10.0.5826

tracedir not being defined properly when outdir defined

Description of the bug

readmapping/nextflow.config

Line 31 in 0e694aa

tracedir = "${params.outdir}/pipeline_info"

outdir's value is not being passed to tracedir; the pipeline uses '${params.outdir}/pipeline_info' instead.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Wrong location for test data for test_full profile

Description of the bug

The test data locations do not exist.

When this problem is fixed, update the LSF full test workflow to use the test_full profile.

Command used and terminal output

No response

Relevant files

https://github.com/sanger-tol/readmapping/blob/main/conf/test_full.config

System information

No response

Try minimap2 splice

Description of feature

The BlastN step in PacBio read filtering is slow. A suggestion from Yumi was that we could try minimap2 splice.

Compare run time between BlastN vs Minimap2 Splice and select the more efficient option.

Filtering and alignments of Hi-C reads

Description of feature

We need to understand whether the Hi-C reads need to be filtered prior to being aligned, and what filter should be used. The Hi-C pipeline used in production in GRIT involves a locally modified copy of https://github.com/ArimaGenomics/mapping_pipeline which severely filters the reads compared to the official Arima pipeline, or the Hi-C option of BWA MEM2.

Questions to answer:

Arima filter or not? If yes, which version?
BWA or BWA-MEM2?

Move `samtools sort` from `bwa-mem2 align` to `stats_markdup`

Description of feature

Once the bwa-mem2-only module is designed, move samtools sort from subworkflows align_pacbio and align_ont to subworkflow convert_stats. samtools/sort will replace samtools/view in the convert_stats subworkflow.

Remove repeat masking from genome files

Description of feature

Remove soft repeat masking from genome files as part of the prepare genome subworkflow before it moves to alignment.

Possible options (a module needs to be written for either):

awk

awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $0}}' in.fna > out.fna

BBtools reformat.sh

reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g
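
For illustration, the awk option above corresponds to the following Python sketch: header lines are passed through unchanged and sequence lines are upper-cased, which removes soft masking. The function name is hypothetical, and headers are assumed to start with ">" as in well-formed FASTA.

```python
from typing import Iterable, Iterator


def unmask_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Upper-case sequence lines; leave FASTA header lines untouched."""
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()
```

Because it works line by line, it can be applied to an open file handle without loading the whole genome into memory.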

Update pipeline template

Description of feature

Update to the latest nf-core pipeline template

PacBio alignment sub workflow

Description of the bug

The PacBio alignment sub workflow is in progress and must not be used at this point.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Disable BWAMEM2_INDEX when no short read data

Description of feature

To close this issue:

Add conditional check to see if align_short will be run
If not, BWAMEM2_INDEX should not be run

Currently it runs irrespective of the input read technology.
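
The conditional could be as simple as the check sketched below (Python, for illustration only). The datatype labels are assumptions based on the illumina and hic datatypes mentioned elsewhere in these issues, and the function name is hypothetical.

```python
from typing import Iterable

# Datatypes assumed to be aligned with bwa-mem2 (hypothetical labels).
SHORT_READ_TYPES = {"illumina", "hic"}


def needs_bwamem2_index(datatypes: Iterable[str]) -> bool:
    """True when at least one input would go through the short-read aligner."""
    return any(d.lower() in SHORT_READ_TYPES for d in datatypes)
```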

Pipeline logo is too big, making the email hard to read

Description of feature

Can we change the image size or update the email template?

Check number of files before merging

Description of feature

Check if more than 1 file is being passed into STATS_MARKDUP:MARKDUPLICATE_ALL:SAMTOOLS_MERGE. If only 1 file, just copy the files, and skip subworkflows MARKDUPLICATE_ALL and CONVERT_STATS_ALL.
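
The check amounts to a small branching decision, sketched here in Python with hypothetical names; in the pipeline this would be expressed on the channel feeding SAMTOOLS_MERGE.

```python
from typing import List


def merge_plan(alignment_files: List[str]) -> str:
    """Return "copy" for a single input file and "merge" for several."""
    if not alignment_files:
        raise ValueError("no alignment files supplied")
    return "copy" if len(alignment_files) == 1 else "merge"
```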

s3 integration issues

Description of the bug

The pipeline has trouble accessing s3 buckets; this could be a farm issue or something else.

https://sanger-openstack.slack.com/archives/CB66C3G3X/p1648036674389259

Command used and terminal output

No response

Relevant files

No response

System information

No response

Bwa-mem2 mem parallelisation

Description of feature

For illumina and hic, I want to split the input fastq files and align the pieces against the genome. The split files need to be aligned with the same @RG tag.

1. Split fastq using split
2. Make sure all split files have the same @RG tag
3. Run bwa-mem2 mem on them individually
4. Sort them with samtools individually
5. Merge all split alignments from the same individual with -c tag
6. Merge at the specimen level.
7. Run through the rest of the markdup_stats subworkflow

Maybe possible to combine steps (5) and (6).
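
The splitting step has one constraint worth making explicit: chunks must break on record boundaries. The Python sketch below stands in for the split utility and assumes the common four-lines-per-record FASTQ layout; the function name and the reads_per_chunk parameter are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding up to reads_per_chunk records."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        # Four lines per record: header, sequence, '+', qualities.
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("truncated FASTQ record")
        yield chunk
```

With the split utility, the same effect is obtained by choosing a line count that is a multiple of four.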

Small genome test with `Asterias rubens` for Illumina

Move subworkflow `stats_markdup` to workflow `readmapping`

Description of feature

Check if it is possible to move subworkflow stats_markdup to workflow readmapping. Here it will accept the input from the different align_datatype subworkflows.

Make sure that a sample sheet with mixed datatypes does not cause the inputs to be mixed.

change config for fixmate and merge

Description of feature

Fixmate and merge are currently set to process_low, which means they only have access to 2 cpus but 14GB of memory. I think samtools assigns about 2GB per cpu, so there may be an efficiency gain from changing the number of cpus to 4 or 6, with retry options included.

Example code:

process {
    withName:SAMTOOLS_FIXMATE {
        cpus   = { check_max( 6 * task.attempt, 'cpus'    ) }
    }
}

Calculate PacBio filtered data percentage

Description of feature

Calculate the percentage of PacBio read data filtered before alignment. This was flagged by Kamil.
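
A minimal sketch of the calculation in Python, assuming the read counts before and after filtering are already available (for example from the filtering step's logs); the function and argument names are hypothetical.

```python
def filtered_percentage(total_reads: int, retained_reads: int) -> float:
    """Percentage of reads removed by filtering before alignment."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= retained_reads <= total_reads:
        raise ValueError("retained_reads must be between 0 and total_reads")
    return 100.0 * (total_reads - retained_reads) / total_reads
```

The same formula applies if the percentage is to be computed on bases rather than reads.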

Modify minimap command

Description of feature

Add argument --cs=short to minimap2 align command
