epigen / atacseq_pipeline

Ultimate ATAC-seq Data Processing & Quantification Workflow. A Snakemake implementation of the BSF's ATAC-seq Data Processing Pipeline extended by downstream quantification and annotation steps using bash and Python.

Home Page: https://epigen.github.io/atacseq_pipeline/

License: MIT License

Python 100.00%
snakemake atac-seq python bioinformatics workflow ngs pipeline biomedical-data-science bash

atacseq_pipeline's People

Contributors

sreichl

atacseq_pipeline's Issues

quick aggregation of counts and support: reduce runtime from 4-8h to minutes

  • check whether the order is guaranteed to always be the same

  • think of alternatives

  • speed up aggregation by using a bash two-liner for aggregate_counts and aggregate_support; needs testing beforehand and a comparison to my all_counts file (datamash can be installed with conda)

    #Merge: keep the header only from the first file
    awk 'FNR==1{if (NR==1) print $0; next} {print $0}' ${allFiles} > ${mergeFile}
    #Transpose the merged comma-separated matrix
    datamash transpose -t ',' < ${mergeFile} > ${mergeFileTranspose}
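
The naive merge above is only safe if every per-sample file lists features in the same order (the first open point). A minimal sketch of such an order check, assuming hypothetical per-sample CSVs with a header row and the feature identifier in the first column:

```python
from pathlib import Path

def check_same_feature_order(count_files, id_column=0):
    """Verify all per-sample count files list features in identical order.

    count_files: paths to CSV files with one feature per row (assumed layout);
    id_column: index of the feature-identifier column.
    Returns True if all orders match, otherwise raises ValueError naming the file.
    """
    reference = None
    for path in count_files:
        # skip the header line, keep only the feature identifiers
        ids = [line.split(",")[id_column]
               for line in Path(path).read_text().splitlines()[1:]]
        if reference is None:
            reference = ids
        elif ids != reference:
            raise ValueError(f"feature order differs in {path}")
    return True
```

Running this once before the awk/datamash merge would turn a silent misalignment into a hard error.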

old notes

  • add a config parameter for quick aggregation of counts and support → either simply concatenate without checking the dimensions/features, or check them up front and throw an error if one sample does not match
  • update atacseq_analysis.yaml to pin pandas version 1.1.4 (because it is much faster than newer versions) & test before committing the changes
  • pandas 1.3.0 (and the newest pandas 1.3.2) is extremely slow compared to pandas 1.1.4 → look for a ticket or issue on GitHub that reports this
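
The quick-aggregation-with-error idea from the first note could look like the following sketch, assuming a hypothetical layout of one pandas Series of counts per sample, indexed by feature:

```python
import pandas as pd

def aggregate_counts(sample_series):
    """Concatenate per-sample count Series into one features x samples matrix,
    failing fast if any sample's feature index does not match the first one.

    sample_series: dict mapping sample name -> pandas Series indexed by feature
    (layout assumed for illustration).
    """
    items = list(sample_series.items())
    _, reference = items[0]
    for name, series in items[1:]:
        if not series.index.equals(reference.index):
            raise ValueError(f"sample {name} does not match the expected features")
    return pd.concat(dict(items), axis=1)
```

Because the index check is a single `Index.equals` comparison per sample, this keeps the fast concatenation path while still catching a mismatched sample.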

test on ATAChom within JakStruct

  • plug into project
    • load module
    • annotation
    • config
    • copy mm10 resources
  • usage as module: fix relative paths that break when used as a module (unclear what is meant)
  • promoter region quantification
    • promoter
    • homer
    • compare using pearson correlation
      • 16895 features (genes) overlap
      • sample correlation (96 samples)
        • mean 0.83
        • median 0.86
      • feature correlation (16895 features)
        • mean 0.83
        • median 0.97
  • Homer motif enrichment aggregation
  • compare
    • consensus regions bed
    • consensus counts (columns ordered differently, need to compare using e.g., pandas)
    • MultiQC report metrics
      • exactly the same apart from 2 metrics
      • %Duplication was always ~>10% higher before
      • %Adapter was before ~1% and now between ~4% and ~13%
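
The sample/feature Pearson comparison above (e.g., counts from columns ordered differently) can be sketched with pandas, assuming two features x samples count matrices with shared sample names:

```python
import pandas as pd

def correlation_summary(counts_a, counts_b):
    """Pearson correlation between two count matrices (features x samples),
    restricted to overlapping features.

    Returns (per-sample, per-feature) correlation Series; mean/median of these
    give summaries like the ones reported above.
    """
    shared = counts_a.index.intersection(counts_b.index)  # overlapping features
    a, b = counts_a.loc[shared], counts_b.loc[shared]
    per_sample = a.corrwith(b, axis=0)   # correlate matching sample columns
    per_feature = a.corrwith(b, axis=1)  # correlate matching feature rows
    return per_sample, per_feature
```

`DataFrame.corrwith` matches columns (or rows) by label, so it also handles the "columns ordered differently" case noted for the consensus counts.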

consider changing consensus region generation, quantification, and aggregation to the R package GenomicRanges to make it more citable/understandable

example methods: Peaks were aggregated into a list of 730 consensus peaks using the function reduce of the package GenomicRanges (version 1.38.0) in R (version 3.6.1). Consensus peaks that overlapped with blacklisted genomic regions (downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists) were discarded. Quantitative measurements were obtained by counting reads within consensus peaks using the function summarizeOverlaps from the GenomicAlignments (version 1.22.1) package in R (version 3.6.1).

address adapter “confusion”

  • discuss shortly the pipeline and its configuration with MS
  • double check BE's original pipeline
    • config → using the same?
    • logic/task/rule: does it decide between them at some point? Fastp command different?
  • run pipeline with all potential different settings
    • no adapter info
    • "original” nextera.fa file and config adapter sequence
    • "original” nextera.fa file w/o config adapter sequence
    • "corrected” nextera.fa file and config adapter sequence
    • "corrected” nextera.fa file w/o config adapter sequence
  • share all four MultiQC reports w/ all ATAC-seq users e.g., BSF, RtH, Team-Titan, MS
    • the observation of how the fastp command currently trims adapters (sequentially and redundantly with the short adapter sequence in the config file)
    • the nextera.fa file seems to be incorrect/incomplete. Ask for opinions and/or meeting in what we should change.
      • 8 adapter sequences in the nextera.fa file seem to have one nucleotide too many
    • attach reports for all 4 different settings
      • no adapter info
      • old adapter info
      • new adapter file and sequence
      • only new adapter file

Duplicated empty rows in promoter_counts.csv

In a human data set with 60 samples, promoter_counts.csv contained 44 duplicated rownames/Ensembl IDs with all-zero counts.

In a large mouse data set, no duplicate features were found.

  • check the difference between the genomes; maybe duplicated ENSG IDs with different genomic regions
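
A small diagnostic sketch for the symptom described above, assuming the promoter counts are loaded as a features x samples DataFrame indexed by Ensembl ID:

```python
import pandas as pd

def find_duplicated_zero_rows(promoter_counts):
    """Return the duplicated row labels (e.g., Ensembl IDs) whose rows are
    entirely zero, matching the 44-duplicates symptom reported above."""
    dup_mask = promoter_counts.index.duplicated(keep=False)  # all copies of dups
    candidates = promoter_counts[dup_mask]
    all_zero = (candidates == 0).all(axis=1)
    return sorted(set(candidates.index[all_zero]))
```

Comparing this list between the human and mouse annotations would show whether the same ENSG IDs appear at multiple genomic regions.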

make the hub shareable

  • properly link it in the MultiQC report so the report can be shared and correctly links to UCSC browser tracks etc.
  • instructions on how to configure it and what to do (e.g., correct URL https://medical-epigenomics.org/ and copy atacseq_hub to public_html, because symlinks do not work across partitions)

simplify configuration to 2 files

  • maybe reduce from 4 files to 3 or 2
  • pipeline config + project config → config
  • sample/unit config + metadata config (would lead to redundancy in rows, but neater) → annot
    • metadata maybe even less needed without unsupervised analysis
    • metadata is then also an output (sample x metadata)
  • configurator function: check whether any row has skip_preprocess in the annotation file; if yes, exclude the whole sample (i.e., all rows of that sample) → still needed? No, that column has been removed

revamp MultiQC report

  • get it to run without atacseq_module
  • transfer software/package versions to env.yaml
  • drill into atacseq_report.py if/what necessary...
  • Compare current MultiQC results & HUB with previous
    • -> HUB generation required -> symlinks and some special txt files
    • -> Extraction/symlinking of stats required inside or report/ dir
  • try Copilot Chat for help
  • implement HUB generation as rule in multiqc.smk
  • implement report/ symlinking as rule in multiqc.smk
  • address: throws an error if the field flowcell is not present in sample_annotation.csv
  • debug remaining atacseq.py code
    • fastp error messages
    • file name is not correct -> snakemake can't find it
    • check if some of the commented configs are directly used by multiqc (and not the atacseq report module)
    • make every annot column an exploratory column in the report (no config), but also used for the track hub
    • check why multiqc folder is called multiqc_report_data/ instead of multiqc_data/ per docs
  • compare statistics of samples
  • go through report and check hyperlinks: same?, correct?, working?
  • go through atacseq report module and update everything e.g., metadata etc.
  • try to put remaining configs into bash command, so that config file dependency is gone
  • change back the environment specs multiqc.yaml, push the changes, and test one more time

installation error on 64-bit Ubuntu 20.04.6 LTS

Hi there, I am attempting to install the pipeline but am encountering errors. According to the documentation, I first installed mamba and created the snakemake conda environment. I cloned the repository, and from it I did the following:

conda activate snakemake
snakemake -p -n

The homer installation began, and I think it finished but ended with the following, which did not seem promising...any advice?

finished homer installation
FileNotFoundError in file /home/anna/Desktop/atacseq_pipeline/workflow/Snakefile, line 50:
[Errno 2] No such file or directory: '/research/home/sreichl/projects/imstruct/config/atac_atacseq_config.yaml'
  File "/home/anna/Desktop/atacseq_pipeline/workflow/Snakefile", line 171, in <module>
  File "/home/anna/Desktop/atacseq_pipeline/workflow/Snakefile", line 50, in configurator

make test data larger to include peaks and motifs

so that the whole workflow can be tested using the test data

something like this
If working with FASTQ data, a straightforward way to generate a small test set is to subsample the first million lines of a file (the first 250,000 reads) as follows: head -n 1000000 FASTQ_FILE.fq > test_fastq.fq

or just sampling more reads
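
The head-based subsampling above can also be done in Python; a minimal sketch, assuming uncompressed FASTQ files with exactly 4 lines per read (apply it to R1 and R2 with the same n_reads to keep pairs in sync):

```python
from itertools import islice

def head_reads(fastq_in, fastq_out, n_reads=250_000):
    """Write the first n_reads FASTQ records (4 lines each) of an
    uncompressed FASTQ file to fastq_out."""
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        dst.writelines(islice(src, n_reads * 4))
```

`islice` stops after the requested number of lines, so this never loads the whole file into memory.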

remove tmp bam files before bowtie execution

  • exception handling OR before starting the bowtie alignment
  • RtH: Not sure if I mentioned this, but it seems bowtie does not like it when temporary BAM files are still present after jobs were killed/the queue time was not long enough. Before rerunning I needed to remove all of these from atac_results. I did this using:
find . -type f -name '*.bam.tmp.*' -exec rm {} + 
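
If the cleanup should live inside the workflow rather than be run manually, the find command above could be wrapped as a small helper (the '*.bam.tmp.*' pattern is taken from the report; the function name is hypothetical):

```python
from pathlib import Path

def remove_tmp_bams(results_dir):
    """Delete leftover temporary BAM files matching '*.bam.tmp.*'
    (the pattern from the find command above) and return the removed paths."""
    removed = []
    for path in Path(results_dir).rglob("*.bam.tmp.*"):
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Calling this before (re)starting the bowtie alignment would cover the killed-job case without manual intervention.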

Building white lists

Hi,

Thank you for sharing your pipeline. I was wondering if you have an already ready-to-use script for retrieving white lists. It is actually not tricky to do with GenomicRanges but that would save some coding.

Thanks a lot.

Nicolas

fix multiqc.yaml installation error

  • during installation it does not resolve with mamba!
  • → when conda channel priority is set to strict it worked for me → check again!
  • multiqc 1.9 does not like python 3.8
  • multiqc 1.9 wants python 3.6.15
  • the pytables installation causes problems!
  • figure out the right combination of versions and pin them permanently

resource download

automatic download of resources from Ensembl (like the snakemake RNA pipeline does) and/or Zenodo
-> no! not ETC for powerusers

  • provide commands for downloading hg38 and mm10 from Zenodo to /resources folder
  • fill default config with the respective paths to increase reproducibility and accessibility (not BSF anymore, because even CeMMies do not have access)

Adapt result folder and report to MR.PARETO standard

  • Structure
  • Only rules have report flag, not rule all
  • Environments and annotation files are exported and put into the report (by cp), i.e., make an env_export.smk rule
  • Projectname_atacseq
  • Adapt result folder structure to Pareto standard
  • go through MR.PARETO checklist

mitochondrial fraction metric is missing in report

  • should be determined during processing in rule align
    samtools idxstats "{params.bam_dir}/{params.sample_name}.bam" | awk '{{ sum += $3 + $4; if($1 == "{params.chrM}") {{ mito_count = $3; }}}}END{{ print "mitochondrial_fraction\t"mito_count/sum }}' > "{params.sample_dir}/{params.sample_name}.stats.tsv";

  • stats.tsv is missing it as the first line

  • -> MultiQC does not find it -> not in the report

  • check in other reports mm10 and hg38

    • mm10 bmdm-stim column exists but all 0
    • hg38 cll-progression column exists, including meaningful values
  • fix

  • compare all QC metrics one by one (for existence)
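
For debugging the missing metric, the awk one-liner above can be replicated in Python on the idxstats output (a sketch; the function name is hypothetical, the logic mirrors the rule's command: sum mapped+unmapped over all contigs, divide the chrM mapped count by the total):

```python
def mitochondrial_fraction(idxstats_text, chrM="chrM"):
    """Compute the mitochondrial read fraction from `samtools idxstats` output
    (tab-separated columns: name, length, mapped, unmapped)."""
    total = mito = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, unmapped = line.split("\t")
        total += int(mapped) + int(unmapped)
        if name == chrM:
            mito = int(mapped)
    return mito / total
```

Running this on a sample's idxstats output and comparing against the stats.tsv line would show whether the value is computed but lost, or never computed.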

Save TSS_regions table also

For downstream analysis on the TSS matrix you eventually need to know exactly which peaks were mapped (e.g., for cqn normalization feature calculation)

refactor, reduce scope/remove redundancy to other modules, and integrate into MR.PARETO for v1.0.0

  • reorder and split into issues all tasks below
  • Save all installed “working” environments (computational)
  • Make sure to have tests in place to compare new pipeline with
    • annotation and results for hg38 and mm10 samples are present
    • test data from BSF requested -> does not exist anymore.
    • bsf: /nobackup/lab_bsf/samples/{flowcell}/{flowcell}_{lane}samples/{flowcell}{lane}#{BSF_name}.bam
    • bsf_test: /nobackup/lab_bsf/users/berguener/junk/atac_test_data/{flowcell}_{lane}#{BSF_name}.bam
  • #24
  • #25
  • #26
  • #12
  • #23
  • #22
  • #4
  • #21
  • remove atacseq_pipeline/tmp folder after region annotation aggregation
  • rerun start to finish and compare results systematically with results of previous pipeline output
  • clean up: remove all commented code and unused/deprecated files
  • update README.md according to the changes and latest standards (see MR.PARETO README template)

reduce scope/remove redundancy to other modules

  • remove cemm slurm profile from repo
  • Remove all redundant downstream analyses
    • unsupervised analysis
    • split
    • region filter
    • HRV selection
    • normalization
    • var mean plots
  • Remove singularity and dockerfile
  • adapt and reduce documentation ie no redundancy to MR.PARETO docs
    • refer to spilterlize and other modules for downstream analysis
  • Adapt workflow catalog yaml (remove Singularity)

promoter region quantification

2 different flavors (code exists)

  • Promoter region identification & quantification: with a GenomicRanges function for given up- and downstream distances (config parameter → maybe the TSS_slop configs can be reused?) → as done in the macroStim project for integrative analysis

  • Consensus region to gene mapping: with Homer annotations, map regions within the smallest given distances (up- & downstream) to the TSS to genes → as done in the macroStim project for the DEA results correlation comparison

  • output:

    • counts/homerTSS_counts.csv?
    • counts/promoter_counts.csv

Promoter quant:
Yes, do it. Leverage existing quantification scripts and commands (agg).
Reuse the TSS config info.
Add it to the docs.

remove UCSC genome browser track hub

DRY principle dictates to not have the same functionality duplicated (hard to maintain)

  • remove all genome browser track and bigWig rules/code in favor of the genome_tracks module, i.e., the hub/ result folder
    • tracks are in multiqc report: Does that change the decision??
  • adapt README accordingly
    • remove Genome Browser section (was moved to genome_tracks)
    • adapt hyperlink in QC section to point to genome_tracks
    • add to recommended MRP modules the QC aspect of genome_tracks
  • new release with "less" functionality and point to genome_tracks in release notes

extract MultiQC statistics

  • extract multiQC stats automatically from multiqc report JSON (RtH has code) and add to the sample annot/metadata in the result folder
  • make separate rule for sample_annotation file creation
  • counts/annotation.csv -> sample x MultiQC metrics OR sample_metrics.csv or QC_metrics?
  • convert this from TSV to CSV: atacseq_pipeline/report/multiqc_report_data/multiqc_general_stats.txt
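
The last bullet, converting multiqc_general_stats.txt from TSV to CSV, can be sketched with the standard library (file paths are examples):

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    """Rewrite a tab-separated stats file (e.g., multiqc_general_stats.txt)
    as CSV, quoting fields as needed."""
    with open(tsv_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```

Using the csv module rather than a plain `.replace("\t", ",")` keeps fields that themselves contain commas intact.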

document config of UROPA

necessary to enable use as module!
either add it to the config file, i.e., provide the paths to these configs as resources, or document that they have to be in the respective config folder.
Option 1 is more consistent and easier to understand...

MissingInputException in rule ATAChom_atacseq_pipeline_uropa_prepare in line 4 of /research/home/sreichl/projects/atacseq_pipeline/workflow/rules/region_annotation.smk:
Missing input files for rule ATAChom_atacseq_pipeline_uropa_prepare:
output: results/ATAChom/atacseq_pipeline/tmp/consensus_regions_gencode.json, results/ATAChom/atacseq_pipeline/tmp/consensus_regions_reg.json
affected files:
config/uropa/regulatory_config_TEMPLATE.txt
config/uropa/gencode_config_TEMPLATE_V4.txt

make Snakemake 7 compatible

  • Make Snakemake 7 compatible (eg mem to mem_mb)
    • change resource vocabulary to the current standard eg mem_mb
      • Processes ran out of memory even though the "mem" parameter was sufficient. It started working when this was hard-coded into the pipeline: mem_mb=32000, disk_mb=32000
