epigen / atacseq_pipeline

Ultimate ATAC-seq Data Processing & Quantification Workflow. A Snakemake implementation of the BSF's ATAC-seq Data Processing Pipeline extended by downstream quantification and annotation steps using bash and Python.

Home Page: https://epigen.github.io/atacseq_pipeline/

License: MIT License

Python 100.00%
snakemake atac-seq python bioinformatics workflow ngs pipeline biomedical-data-science bash

atacseq_pipeline's People

Contributors

sreichl

atacseq_pipeline's Issues

quick aggregation of counts and support: reduce runtime from 4-8h to minutes

  • check whether the order is guaranteed to always be the same

  • think of alternatives

  • speed up aggregation by using a bash two-liner for aggregate_counts and aggregate_support; needs testing beforehand and a comparison to my all_counts file (datamash can be installed with conda)

    #Merge: keep the header only from the first file
    awk 'FNR==1{if (NR==1) print $0; next} {print $0}' ${allFiles} > ${mergeFile}
    #Transpose the merged comma-separated matrix
    datamash transpose -t ',' < ${mergeFile} > ${mergeFileTranspose}
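
The naive merge above is only safe if every per-sample file lists features in the same order (the first open point). A minimal sketch of such an order check, assuming hypothetical per-sample CSVs with a header row and the feature identifier in the first column:

```python
from pathlib import Path

def check_same_feature_order(count_files, id_column=0):
    """Verify all per-sample count files list features in identical order.

    count_files: paths to CSV files with one feature per row (assumed layout);
    id_column: index of the feature-identifier column.
    Returns True if all orders match, otherwise raises ValueError naming the file.
    """
    reference = None
    for path in count_files:
        # skip the header line, keep only the feature identifiers
        ids = [line.split(",")[id_column]
               for line in Path(path).read_text().splitlines()[1:]]
        if reference is None:
            reference = ids
        elif ids != reference:
            raise ValueError(f"feature order differs in {path}")
    return True
```

Running this once before the awk/datamash merge would turn a silent misalignment into a hard error.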

old notes

  • add a config parameter for quick aggregation of counts and support → either simply concatenate without checking the dimensions/features, or check them up front and throw an error if one sample does not match
  • update atacseq_analysis.yaml to pin pandas version 1.1.4 (because it is much faster than newer versions) & test before committing the changes
  • pandas 1.3.0 (and the newest pandas 1.3.2) is extremely slow compared to pandas 1.1.4 → look for a ticket or issue on GitHub that reports this
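
The quick-aggregation-with-error idea from the first note could look like the following sketch, assuming a hypothetical layout of one pandas Series of counts per sample, indexed by feature:

```python
import pandas as pd

def aggregate_counts(sample_series):
    """Concatenate per-sample count Series into one features x samples matrix,
    failing fast if any sample's feature index does not match the first one.

    sample_series: dict mapping sample name -> pandas Series indexed by feature
    (layout assumed for illustration).
    """
    items = list(sample_series.items())
    _, reference = items[0]
    for name, series in items[1:]:
        if not series.index.equals(reference.index):
            raise ValueError(f"sample {name} does not match the expected features")
    return pd.concat(dict(items), axis=1)
```

Because the index check is a single `Index.equals` comparison per sample, this keeps the fast concatenation path while still catching a mismatched sample.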

test on ATAChom within JakStruct

  • plug into project
    • load module
    • annotation
    • config
    • copy mm10 resources
  • usage as module: fix relative paths that break when used as a module (unclear what is meant)
  • promoter region quantification
    • promoter
    • homer
    • compare using pearson correlation
      • 16895 features (genes) overlap
      • sample correlation (96 samples)
        • mean 0.83
        • median 0.86
      • feature correlation (16895 features)
        • mean 0.83
        • median 0.97
  • Homer motif enrichment aggregation
  • compare
    • consensus regions bed
    • consensus counts (columns ordered differently, need to compare using e.g., pandas)
    • MultiQC report metrics
      • exactly the same apart from 2 metrics
      • %Duplication was always ~>10% higher before
      • %Adapter was before ~1% and now between ~4% and ~13%
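
The sample/feature Pearson comparison above (e.g., counts from columns ordered differently) can be sketched with pandas, assuming two features x samples count matrices with shared sample names:

```python
import pandas as pd

def correlation_summary(counts_a, counts_b):
    """Pearson correlation between two count matrices (features x samples),
    restricted to overlapping features.

    Returns (per-sample, per-feature) correlation Series; mean/median of these
    give summaries like the ones reported above.
    """
    shared = counts_a.index.intersection(counts_b.index)  # overlapping features
    a, b = counts_a.loc[shared], counts_b.loc[shared]
    per_sample = a.corrwith(b, axis=0)   # correlate matching sample columns
    per_feature = a.corrwith(b, axis=1)  # correlate matching feature rows
    return per_sample, per_feature
```

`DataFrame.corrwith` matches columns (or rows) by label, so it also handles the "columns ordered differently" case noted for the consensus counts.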

consider changing consensus region generation, quantification, and aggregation to the R package GenomicRanges to make it more citable/understandable

example methods: Peaks were aggregated into a list of 730 consensus peaks using the function reduce of the package GenomicRanges (version 1.38.0) in R (version 3.6.1). Consensus peaks that overlapped with blacklisted genomic regions (downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists) were discarded. Quantitative measurements were obtained by counting reads within consensus peaks using the function summarizeOverlaps from the GenomicAlignments (version 1.22.1) package in R (version 3.6.1).

address adapter “confusion”

  • discuss shortly the pipeline and its configuration with MS
  • double check BE's original pipeline
    • config → using the same?
    • logic/task/rule: does it decide between them at some point? Fastp command different?
  • run pipeline with all potential different settings
    • no adapter info
    • "original” nextera.fa file and config adapter sequence
    • "original” nextera.fa file w/o config adapter sequence
    • "corrected” nextera.fa file and config adapter sequence
    • "corrected” nextera.fa file w/o config adapter sequence
  • share all four MultiQC reports w/ all ATAC-seq users e.g., BSF, RtH, Team-Titan, MS
    • the observation of how the fastp command currently trims adapters (sequentially and redundantly with the short adapter sequence in the config file)
    • the nextera.fa file seems to be incorrect/incomplete. Ask for opinions and/or meeting in what we should change.
      • 8 adapter sequences in the nextera.fa file seem to have one nucleotide too many
    • attach reports for all 4 different settings
      • no adapter info
      • old adapter info
      • new adapter file and sequence
      • only new adapter file

Duplicated empty rows in promoter_counts.csv

In a human data set with 60 samples, promoter_counts.csv contained 44 duplicated rownames/Ensembl IDs with all-zero counts.

In a large mouse data set, no duplicate features were found.

  • check the difference between the genomes; maybe duplicated ENSG IDs with different genomic regions
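
A small diagnostic sketch for the symptom described above, assuming the promoter counts are loaded as a features x samples DataFrame indexed by Ensembl ID:

```python
import pandas as pd

def find_duplicated_zero_rows(promoter_counts):
    """Return the duplicated row labels (e.g., Ensembl IDs) whose rows are
    entirely zero, matching the 44-duplicates symptom reported above."""
    dup_mask = promoter_counts.index.duplicated(keep=False)  # all copies of dups
    candidates = promoter_counts[dup_mask]
    all_zero = (candidates == 0).all(axis=1)
    return sorted(set(candidates.index[all_zero]))
```

Comparing this list between the human and mouse annotations would show whether the same ENSG IDs appear at multiple genomic regions.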

make the hub shareable

  • properly link it in the MultiQC report so the report can be shared and correctly links to UCSC browser tracks etc.
  • instructions on how to configure it and what to do (e.g., correct URL https://medical-epigenomics.org/ and copy atacseq_hub to public_html, because symlinks do not work across partitions)

simplify configuration to 2 files

  • maybe reduce from 4 files to 3 or 2
  • pipeline config + project config → config
  • sample/unit config + metadata config (would lead to redundancy in rows, but neater) → annot
    • metadata maybe even less needed without unsupervised analysis
    • metadata is then also an output (sample x metadata)
  • configurator function: check whether any row has skip_preprocess in the annotation file; if yes, exclude the whole sample (i.e., all rows of that sample) → still needed? No, that column has been removed

revamp MultiQC report

  • get it to run without atacseq_module
  • transfer software/package versions to env.yaml
  • drill into atacseq_report.py if/what necessary...
  • Compare current MultiQC results & HUB with previous
    • -> HUB generation required -> symlinks and some special txt files
    • -> Extraction/symlinking of stats required inside or report/ dir
  • try Copilot Chat for help
  • implement HUB generation as rule in multiqc.smk
  • implement report/ symlinking as rule in multiqc.smk
  • address: throws an error if the field flowcell is not present in sample_annotation.csv
  • debug remaining atacseq.py code
    • fastp error messages
    • file name is not correct -> snakemake can't find it
    • check if some of the commented configs are directly used by multiqc (and not the atacseq report module)
    • make every annot column an exploratory column in the report (no config), but also used for the track hub
    • check why multiqc folder is called multiqc_report_data/ instead of multiqc_data/ per docs
  • compare statistics of samples
  • go through report and check hyperlinks: same?, correct?, working?
  • go through atacseq report module and update everything e.g., metadata etc.
  • try to put remaining configs into bash command, so that config file dependency is gone
  • change back the environment specs multiqc.yaml, push the changes, and test one more time

installation error on 64-bit Ubuntu 20.04.6 LTS

Hi there, I am attempting to install the pipeline but am encountering errors. According to the documentation, I first installed mamba and created the snakemake conda environment. I cloned the repository, and from it I did the following:

conda activate snakemake
snakemake -p -n

The homer installation began, and I think it finished but ended with the following, which did not seem promising...any advice?

finished homer installation
FileNotFoundError in file /home/anna/Desktop/atacseq_pipeline/workflow/Snakefile, line 50:
[Errno 2] No such file or directory: '/research/home/sreichl/projects/imstruct/config/atac_atacseq_config.yaml'
  File "/home/anna/Desktop/atacseq_pipeline/workflow/Snakefile", line 171, in <module>
  File "/home/anna/Desktop/atacseq_pipeline/workflow/Snakefile", line 50, in configurator

make test data larger to include peaks and motifs

so that the whole workflow can be tested using the test data

something like this
If working with FASTQ data, a straightforward way to generate a small test set is to subsample the first million lines of a file (the first 250,000 reads) as follows: head -n 1000000 FASTQ_FILE.fq > test_fastq.fq

or just sampling more reads
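
The head-based subsampling above can also be done in Python; a minimal sketch, assuming uncompressed FASTQ files with exactly 4 lines per read (apply it to R1 and R2 with the same n_reads to keep pairs in sync):

```python
from itertools import islice

def head_reads(fastq_in, fastq_out, n_reads=250_000):
    """Write the first n_reads FASTQ records (4 lines each) of an
    uncompressed FASTQ file to fastq_out."""
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        dst.writelines(islice(src, n_reads * 4))
```

`islice` stops after the requested number of lines, so this never loads the whole file into memory.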

remove tmp bam files before bowtie execution

  • exception handling OR before starting the bowtie alignment
  • RtH: Not sure if I mentioned this, but it seems bowtie does not like it when temporary BAM files are still present after jobs were killed/the queue time was not long enough. Before rerunning I needed to remove all of these from atac_results. I did this using:
find . -type f -name '*.bam.tmp.*' -exec rm {} + 
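
If the cleanup should live inside the workflow rather than be run manually, the find command above could be wrapped as a small helper (the '*.bam.tmp.*' pattern is taken from the report; the function name is hypothetical):

```python
from pathlib import Path

def remove_tmp_bams(results_dir):
    """Delete leftover temporary BAM files matching '*.bam.tmp.*'
    (the pattern from the find command above) and return the removed paths."""
    removed = []
    for path in Path(results_dir).rglob("*.bam.tmp.*"):
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Calling this before (re)starting the bowtie alignment would cover the killed-job case without manual intervention.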

Building white lists

Hi,

Thank you for sharing your pipeline. I was wondering if you have an already ready-to-use script for retrieving white lists. It is actually not tricky to do with GenomicRanges but that would save some coding.

Thanks a lot.

Nicolas

fix multiqc.yaml installation error

  • during installation it does not resolve with mamba!
  • → when conda channel priority is set to strict it worked for me → check again!
  • multiqc 1.9 does not like python 3.8
  • multiqc 1.9 wants python 3.6.15
  • the pytables installation causes problems!
  • figure out the right combination of versions and pin them permanently

resource download

automatic download of resources from Ensembl (like the snakemake RNA pipeline does) and/or Zenodo
-> no! not ETC for powerusers

  • provide commands for downloading hg38 and mm10 from Zenodo to /resources folder
  • fill default config with the respective paths to increase reproducibility and accessibility (not BSF anymore, because even CeMMies do not have access)

Adapt result folder and report to MR.PARETO standard

  • Structure
  • Only rules have report flag, not rule all
  • Environments and annotation files are exported and put into the report (by cp), i.e., make an env_export.smk rule
  • Projectname_atacseq
  • Adapt result folder structure to Pareto standard
  • go through MR.PARETO checklist

mitochondrial fraction metric is missing in report

  • should be determined during processing in rule align
    samtools idxstats "{params.bam_dir}/{params.sample_name}.bam" | awk '{{ sum += $3 + $4; if($1 == "{params.chrM}") {{ mito_count = $3; }}}}END{{ print "mitochondrial_fraction\t"mito_count/sum }}' > "{params.sample_dir}/{params.sample_name}.stats.tsv";

  • stats.tsv is missing it as the first line

  • -> MultiQC does not find it -> not in the report

  • check in other reports mm10 and hg38

    • mm10 bmdm-stim column exists but all 0
    • hg38 cll-progression column exists, including meaningful values
  • fix

  • compare all QC metrics one by one (for existence)
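
For debugging the missing metric, the awk one-liner above can be replicated in Python on the idxstats output (a sketch; the function name is hypothetical, the logic mirrors the rule's command: sum mapped+unmapped over all contigs, divide the chrM mapped count by the total):

```python
def mitochondrial_fraction(idxstats_text, chrM="chrM"):
    """Compute the mitochondrial read fraction from `samtools idxstats` output
    (tab-separated columns: name, length, mapped, unmapped)."""
    total = mito = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, unmapped = line.split("\t")
        total += int(mapped) + int(unmapped)
        if name == chrM:
            mito = int(mapped)
    return mito / total
```

Running this on a sample's idxstats output and comparing against the stats.tsv line would show whether the value is computed but lost, or never computed.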

Save TSS_regions table also

For downstream analysis on the TSS matrix you eventually need to know exactly which peaks were mapped (e.g., for cqn normalization feature calculation)

refactor, reduce scope/remove redundancy to other modules, and integrate into MR.PARETO for v1.0.0

  • reorder and split into issues all tasks below
  • Save all installed “working” environments (computational)
  • Make sure to have tests in place to compare new pipeline with
    • annotation and results for hg38 and mm10 samples are present
    • test data from BSF requested -> does not exist anymore.
    • bsf: /nobackup/lab_bsf/samples/{flowcell}/{flowcell}_{lane}samples/{flowcell}{lane}#{BSF_name}.bam
    • bsf_test: /nobackup/lab_bsf/users/berguener/junk/atac_test_data/{flowcell}_{lane}#{BSF_name}.bam
  • #24
  • #25
  • #26
  • #12
  • #23
  • #22
  • #4
  • #21
  • remove atacseq_pipeline/tmp folder after region annotation aggregation
  • rerun start to finish and compare results systematically with results of previous pipeline output
  • clean up: remove all commented code and unused/deprecated files
  • update README.md according to the changes and latest standards (see MR.PARETO README template)

reduce scope/remove redundancy to other modules

  • remove cemm slurm profile from repo
  • Remove all redundant downstream analyses
    • unsupervised analysis
    • split
    • region filter
    • HRV selection
    • normalization
    • var mean plots
  • Remove singularity and dockerfile
  • adapt and reduce documentation ie no redundancy to MR.PARETO docs
    • refer to spilterlize and other modules for downstream analysis
  • Adapt workflow catalog yaml (remove Singularity)

promoter region quantification

2 different flavors (code exists)

  • Promoter region identification & quantification: with a GenomicRanges function for given up- and downstream distances (config parameter → maybe the TSS_slop configs can be reused?) → as done in the macroStim project for integrative analysis

  • Consensus region to gene mapping: with Homer annotations, map regions within the smallest given distances (up- & downstream) to the TSS to genes → as done in the macroStim project for the DEA results correlation comparison

  • output:

    • counts/homerTSS_counts.csv?
    • counts/promoter_counts.csv

Promoter quant:
Yes, do it. Leverage existing quantification scripts and commands (agg).
Reuse the TSS config info.
Add it to the docs.

remove UCSC genome browser track hub

DRY principle dictates to not have the same functionality duplicated (hard to maintain)

  • remove all genome browser track and bigWig rules/code in favor of the genome_tracks module, i.e., the hub/ result folder
    • tracks are in multiqc report: Does that change the decision??
  • adapt README accordingly
    • remove Genome Browser section (was moved to genome_tracks)
    • adapt hyperlink in QC section to point to genome_tracks
    • add to recommended MRP modules the QC aspect of genome_tracks
  • new release with "less" functionality and point to genome_tracks in release notes

extract MultiQC statistics

  • extract multiQC stats automatically from multiqc report JSON (RtH has code) and add to the sample annot/metadata in the result folder
  • make separate rule for sample_annotation file creation
  • counts/annotation.csv -> sample x MultiQC metrics OR sample_metrics.csv or QC_metrics?
  • convert this from TSV to CSV: atacseq_pipeline/report/multiqc_report_data/multiqc_general_stats.txt
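
The last bullet, converting multiqc_general_stats.txt from TSV to CSV, can be sketched with the standard library (file paths are examples):

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    """Rewrite a tab-separated stats file (e.g., multiqc_general_stats.txt)
    as CSV, quoting fields as needed."""
    with open(tsv_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```

Using the csv module rather than a plain `.replace("\t", ",")` keeps fields that themselves contain commas intact.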

document config of UROPA

necessary to enable use as module!
either add it to the config file, i.e., provide the paths to these configs as resources, or document that they have to be in the respective config folder.
Option 1 is more consistent and easier to understand...

MissingInputException in rule ATAChom_atacseq_pipeline_uropa_prepare in line 4 of /research/home/sreichl/projects/atacseq_pipeline/workflow/rules/region_annotation.smk:
Missing input files for rule ATAChom_atacseq_pipeline_uropa_prepare:
output: results/ATAChom/atacseq_pipeline/tmp/consensus_regions_gencode.json, results/ATAChom/atacseq_pipeline/tmp/consensus_regions_reg.json
affected files:
config/uropa/regulatory_config_TEMPLATE.txt
config/uropa/gencode_config_TEMPLATE_V4.txt

make Snakemake 7 compatible

  • Make Snakemake 7 compatible (eg mem to mem_mb)
    • change resource vocabulary to the current standard eg mem_mb
      • Processes ran out of memory even though the "mem" parameter was sufficient. It started working when this was hard-coded into the pipeline: mem_mb=32000, disk_mb=32000
