liulab-dfci / chips

A Snakemake pipeline for quality control and reproducible processing of chromatin profiling data

License: MIT License


CHIPS Introduction

CHIPS (CHromatin enrIchment ProceSsor) is an analysis pipeline, built on Snakemake, that streamlines the processing of ChIP-seq, ATAC-seq, and DNase-seq data.

This is a mirror of the repo on bitbucket https://bitbucket.org/plumbers/CHIPS/src/master/

Installing CHIPS

You will only need to install CHIPS once: either for your own use or, if you are a system administrator, for the entire system (see Appendix C). In other words, the steps described in this section need to be performed only once.
NOTE: this section ends where Using CHIPS begins (below)

Required software

We assume that the following tools are already installed on your system and that you have some basic familiarity with using them: git and wget.

Installing Miniconda

CHIPS uses the Conda packaging system to manage and install all of its required software packages. To install miniconda:

  1. download the Miniconda installer:

    $ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    
  2. run the installer:

    $ bash Miniconda3-latest-Linux-x86_64.sh
    
  3. update channels of conda:

    $ conda config --add channels defaults
    
    $ conda config --add channels bioconda
    
    $ conda config --add channels conda-forge
    

Installing the CHIPS conda environments

Conda environments are briefly explained here. If you are familiar with Python virtual environments or Docker containers, then Conda environments will be a familiar concept.

If you are not familiar with these concepts, a conda environment is simply a self-contained package space composed of various packages. For example, a bioinformatics conda environment might include packages such as R, samtools, and bedtools.

CHIPS depends on a single conda environment, named chips.

  1. clone the chips source code:

    git clone https://github.com/liulab-dfci/CHIPS
    

    NOTE: this command will create a directory called 'CHIPS'. After the next five steps, this directory can be safely deleted; we explain how to set up a CHIPS project below.

  2. Installing chips:
    After cloning the git repository, create the chips environment by running:

    $ cd CHIPS  
    $ conda env create -f environment.yml -n chips
    

    Or if you have mamba installed in your base environment, a faster method is:

    $ mamba env create -f environment.yml -n chips
    

    Activate chips Conda Environment:

    $ conda activate chips
    
  3. Post-installation steps: configuring homer. NOTE: CHIPS uses the homer software for motif analysis. It can also use the MDSeqPos motif finder for a similar analysis; if you are interested in using MDSeqPos, please see Appendix D.

    To activate/initialize homer:

    • Run the configure script:
    $ perl ~/miniconda3/envs/chips/share/homer/configureHomer.pl -install
    
    • Install the required assemblies:

    For human samples:

    $ perl ~/miniconda3/envs/chips/share/homer/configureHomer.pl -install hg38
    
    $ perl ~/miniconda3/envs/chips/share/homer/configureHomer.pl -install hg19
    

    For mouse samples:

    $ perl ~/miniconda3/envs/chips/share/homer/configureHomer.pl -install mm9
    

Downloading the CHIPS static reference files

CHIPS comes pre-packaged with static reference files (e.g. bwa indices, refSeq tables, etc.) for hg38/hg19 and mm9/mm10. You can download those files from the ref_files link. Many of these are commonly used static reference files; if you would like to use files that you already have, OR if you are interested in supporting a new assembly, please see Appendix E.

Using CHIPS

Anatomy of a CHIPS project

All work in CHIPS is done in a PROJECT/ directory, which is simply a directory that contains a single CHIPS analysis run. A PROJECT/ directory can be named anything (it usually starts with a simple mkdir command, e.g. mkdir chips_for_paper), but what is CRITICAL is that you fill it with the following core components. (We first lay out the directory structure and then explain each element below.)

PROJECT/
├── CHIPS/
├── data/ - optional
├── config.yaml
├── metasheet.csv
├── ref.yaml - only if you are using an assembly OTHER THAN hg19 and mm9
└── ref_files/

The 'CHIPS' directory contains all of the chips source code. We'll explain how to download that directory below. The 'data' directory is an optional directory that contains all of your raw data. It is optional because those paths may be fully specified in the config.yaml; however, it is best practice to gather your raw data within 'data' using symbolic links.

The config.yaml and metasheet.csv are the configurations for your CHIPS run (also explained below).

The ref.yaml file is explained in Appendix E.

After a successful Chips run, another 'analysis' folder is generated which contains all of the resulting output files.

Setting up a CHIPS project

  1. Create Project Directory As explained above, the PROJECT directory is simply a directory to contain an entire Chips run. It can be named anything, but for this section, we'll simply call it 'PROJECT'

    $ mkdir PROJECT
    
    $ cd PROJECT
    
  2. Create Data Directory. As explained above, the data directory is a place to gather all of your raw data files (.fastq, .fastq.gz, .bam). It is optional, but highly recommended.

    $ mkdir data
    

    And in 'data', copy over or make symbolic links to your raw data files
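For example, symbolic links keep the raw data in one place while making it visible to CHIPS. The sketch below runs in a throwaway temporary directory, with two empty files standing in for real FASTQ files (the sample file names are hypothetical):

```shell
# Demo in a temporary directory: two fake FASTQ files stand in for real
# sequencing output, then get gathered into data/ via symbolic links.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p raw data
touch raw/MCF7_ER_R1.fastq.gz raw/MCF7_input_R1.fastq.gz

for fq in "$workdir"/raw/*.fastq.gz; do
    ln -sf "$fq" data/   # symlink, not copy: raw data stays in one place
done
ls -l data/
```

In a real project you would point the loop at your sequencing output directory instead of the demo raw/ folder.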

  3. Clone CHIPS Repository. In your PROJECT directory:

    $ git clone https://github.com/liulab-dfci/CHIPS
    

    (Alternatively, if you still have the CHIPS/ directory from the installation step, you can move it into your PROJECT/ directory instead of cloning again.)
    
  4. Create config.yaml and metasheet.csv

    a. copy CHIPS/config.yaml and CHIPS/metasheet.csv into the PROJECT dir:

    In the PROJECT directory:

    $ cp CHIPS/config.yaml .
    
    $ cp CHIPS/metasheet.csv .
    

    b. setup config.yaml The config.yaml is where you define Chips run parameters and the ChIP-seq samples for analysis.

    • genes_to_plot: if set, the genomic region and TSS of each gene will be displayed in the Genome Trackview figure. Multiple genes should be separated by spaces (default: GAPDH ACTB TP53).
    • upstream/downstream: how far to extend the genomic region upstream and downstream, for a better view of the peaks.
    • output_path: directory in which to save all the output files (default: analysis).
    • assembly: typically hg19/hg38 for human or mm9/mm10 for mouse (default: hg19).
    • motif software: choose either homer or MDSeqPos (default: homer).
    • contamination panel: a panel of assemblies that CHIPS will check for "cross-species" contamination. Out of the box, the config.yaml has hg19 and mm9 as the assemblies to check. If you would like to add other species/assemblies, simply add as many BWA indices as you would like.
    • cnv_analysis: set to 'true' to enable copy-number variation analysis.
    • samples: the most important part of the config file is defining the samples for CHIPS analysis. Each sample is given an arbitrary name, e.g. MCF7_ER, MCF7_input, etc. Sample names, however, cannot start with a number, and cannot contain '.' or '-' (dash; use underscores instead) (POSSIBLY others). For each sample, define the path to the raw data file (.fastq, .fastq.gz, .bam). For paired-end samples, simply add another line to define the path to the second pair.
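    As a sketch of how the samples section might look (the sample names and paths here are hypothetical; check the config.yaml shipped with CHIPS for the exact layout):

```yaml
samples:
  MCF7_ER:
    - data/MCF7_ER_R1.fastq.gz
    - data/MCF7_ER_R2.fastq.gz   # second line only for paired-end samples
  MCF7_input:
    - data/MCF7_input_R1.fastq.gz
```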

    c. setup metasheet.csv: The metasheet.csv is where you group the samples (defined in config.yaml) into Treatment and Control (and, if applicable, replicates). In CHIPS, each of these groupings is called a run.

    Open metasheet.csv in Excel or in a text editor. You will notice the first (uncommented) line is:

    RunName,Treat1,Cont1,Treat2,Cont2

    RunName- an arbitrary name for the run, e.g. MCF7_ER_run
    Treat1- the sample name of the treatment sample. It must exactly match the sample name used in config.yaml
    Cont1- (optional) the input-control sample that should be paired with Treat1
    Treat2- (optional) if you have replicates, define the treatment sample of the replicate here
    Cont2- (optional) the input-control, if available, for Treat2
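    Putting that together, a metasheet.csv for one run with a replicate might look like this (the sample names are hypothetical and must match config.yaml exactly):

```
RunName,Treat1,Cont1,Treat2,Cont2
MCF7_ER_run,MCF7_ER_rep1,MCF7_input_rep1,MCF7_ER_rep2,MCF7_input_rep2
```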

  5. Set Up Refs

    • Download the pre-built ref_files bundle from the link given above.
    • Place the downloaded files (or symbolic links to them) in your PROJECT directory as ref_files/.
    • Make sure that in config.yaml the ref entry points to the ref definitions: ref: "CHIPS/ref.yaml".
    • If you are using your own static refs, see Appendix E for linking to them and editing ref.yaml.
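Before moving on, a quick unofficial sanity check (not part of CHIPS) that the sample names you chose obey the naming rules from step 4b — no leading digit, no '.' or '-':

```shell
# Hypothetical helper, not part of CHIPS: flag sample names that break the
# documented rules (no leading digit, no '.' or '-'; underscores are fine).
check_sample_name() {
    case "$1" in
        [0-9]*|*.*|*-*) echo "BAD  $1" ;;
        *)              echo "OK   $1" ;;
    esac
}

check_sample_name MCF7_ER      # OK
check_sample_name MCF7_input   # OK
check_sample_name 7MCF_ER      # BAD: starts with a number
check_sample_name MCF7-ER      # BAD: contains a dash
```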

Running CHIPS

  1. Activate the environment:

    $ conda activate chips
    
  2. Dry run:

    $ snakemake -np -s CHIPS/chips.snakefile --rerun-incomplete
    
  3. Full run:

    $ nohup snakemake -s CHIPS/chips.snakefile --rerun-incomplete -j 8 > run.out &
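Before launching the full run, a small unofficial pre-flight check that the PROJECT directory has the core components from the setup steps can save a failed submission (the file names below follow the steps above; the demo builds a scratch layout so the sketch is self-contained):

```shell
# Unofficial pre-flight check: verify the core PROJECT/ components exist
# before invoking snakemake. Run from inside the PROJECT directory.
preflight() {
    ok=0
    for f in CHIPS/chips.snakefile config.yaml metasheet.csv; do
        if [ -e "$f" ]; then
            echo "found   $f"
        else
            echo "MISSING $f"
            ok=1
        fi
    done
    return $ok
}

# Demo against a scratch PROJECT layout:
demo=$(mktemp -d)
cd "$demo"
mkdir -p CHIPS
touch CHIPS/chips.snakefile config.yaml metasheet.csv
preflight && echo "ready to run snakemake"
```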

More information for using snakemake can be found here.

Appendix A: System requirements

Appendix B: Recommended requirements

Appendix C: Installing Chips system-wide

For system administrators, or anyone who wishes to share their CHIPS installation.

Appendix D: Installing the MDSeqPos motif finder for chips

$ conda activate chips
$ cd mdseqpos/lib
$ cp settings.py.example settings.py

Modify settings.py like below:

# This should be the absolute path of the directory where your ref_files folder is.
ASSEMBLY_DIR = '***/***/ref_files'
BUILD_DICT = { "hg19": "hg19/",
               "hg38": "hg38/",
               "mm9":"mm9/",
               "mm10": "mm10/"
               }

Then do:

$ cd ..
$ ./version_updater.py
$ python setup.py install

Finally, run MDSeqPos.py to confirm that MDSeqPos is installed and to see its usage.

Appendix E: Generating static reference files for CHIPS

  • all of the required files
  • using your own files
  • supporting something new
  • adding to ref.yaml

chips's People

Contributors: baigal628, crazyhottommy, liulab-dfci

chips's Issues

InputFunctionException in line 242 of .../CHIPS/chips.snakefile

Hi. I'm trying CHIPS on single-end ATAC-seq data, and got the following error:

(chips) -bash-4.1$ snakemake -np  -s CHIPS/chips.snakefile --rerun-incomplete
fastp 0.20.1
/home/../miniconda3/envs/chips/lib/python3.6/site-packages/snakemake/workflow.py:743: DtypeWarning: Columns (1,2,7) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(compile(code, snakefile, "exec"), self.globals)
Gapdh found in bed
Actb found in bed
Trp53 found in bed
Homer module is being used and details file was copied.
Homer module is being run.
Building DAG of jobs...
InputFunctionException in line 242 of /home/sequencing/ATAC_seq/CHIPS/chips.snakefile:
KeyError: 'S1_S1_R1_001.fastq.gz'
Wildcards:

I'm rather new to Snakemake, so I'm not sure what to do about it. The filenames look correct in config.yaml, and this is not the first file for which the error is thrown.

Thanks,
Mikhail

core dumped when run examples from F1000 research

Hi,
I ran the example from F1000 research, Case 2, with the data recommended in the paper, but I get the error:

Segmentation fault (core dumped) /public/home/anaconda3/envs/chips/bin/python3.6 -m snakemake analysis/trim_adaptor/PANC1_1/PANC1_1_fastp.json --snakefile /01.data/PROJECT/CHIPS/chips.snakefile --force -j10 --keep-target-files --keep-remote --attempt 1 --force-use-threads --wrapper-prefix https://bitbucket.org/snakemake/snakemake-wrappers/raw/ --latency-wait 5 --allowed-rules trim_fastp --notemp --quiet --no-hooks --nolock --mode 1
Could you help me deal with it? Thanks a lot.

keyword version is deprecated

Once I go into the chips directory and run the snakemake command, I get the following error:

/scratch/Scel_ATACseq/Project/CHIPS/chips.snakefile:156: SyntaxWarning: invalid escape sequence '.'
#NOTE: Template class allows for _ in the variable names, we want to DISALLOW
SyntaxError in file /scratch/Scel_ATACseq/Project/CHIPS/./modules/trim_adapter.snakefile, line 73:
Keyword version is deprecated. Use conda or container directive instead (see docs). (trim_adapter.snakefile, line 73).

I believe I have properly edited the config, ref, and metasheet for my samples. I am in the chips environment when running this, so I was hoping someone could help me troubleshoot.

non-model species user for this software

Hi,
Thanks for your convenient software. I ran it with my species and hit some issues. How do I produce the GDC_hg38.refGene file from my genome annotation files? I don't understand the meaning of the #bin column in the content below. Do you have scripts to directly convert a genome annotation file to meet this requirement? Thanks a lot.

"""#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
0 ENST00000371007.5 chr1 - 67092164 67231852 67093004 67127240 8 67092164,67095234,67096251,67115351,67125751,67127165,67131141,67231845, 67093604,67095421,67096321,67115464,67125909,67127257,67131227,6723185
0 ENST00000371006.4 chr1 - 67092175 67127261 67093004 67127240 6 67092175,67095234,67096251,67115351,67125751,67127165, 67093604,67095421,67096321,67115464,67125909,67127261, 0 C1orf141 cmpl cmpl """

Example not Working

Hi,

Thank you for creating this analysis software. I am trying to run the example with some fq.gz files and it is failing with the following error:

fastp 0.20.1
SyntaxError:
Not all output, log and benchmark files of rule peaks_getBroadStats contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
File "/home/username/PROJECT/CHIPS/chips.snakefile", line 258, in
File "/home/username/PROJECT/CHIPS/modules/peaks.snakefile", line 314, in

I am running the command from the "/home/username/PROJECT/" directory:

nohup snakemake -s CHIPS/chips.snakefile --rerun-incomplete -j 8 > run.out &

ref file linking missing

Thank you for creating this analysis software. I am trying to download the ref files, but the link is missing, could you help to update it?
Thanks.

Segmentation fault sambamba

Hi,

I am trying to run the pipeline on some files from encode. I keep getting a segmentation fault killing the job when running sambamba to sort the files. However, if I just run the shell command on its own it runs just fine.

[Thu Jul 22 17:58:54 2021]
Job 4: ALIGN: sort bam file for analysis/align/IRF1_K562_input/IRF1_K562_input.bam

/usr/bin/bash: line 1:  2913 Segmentation fault      (core dumped) sambamba sort analysis/align/IRF1_K562_input/IRF1_K562_input.bam -o analysis/align/IRF1_K562_input/IRF1_K562_input.sorted.bam -t 8 -m 4G 2>> analysis/logs/align/IRF1_K562_input.log
[Thu Jul 22 17:59:14 2021]
Error in rule align_sortBams:
    jobid: 4
    output: analysis/align/IRF1_K562_input/IRF1_K562_input.sorted.bam
    log: analysis/logs/align/IRF1_K562_input.log (check log file(s) for error message)
    shell:
        sambamba sort analysis/align/IRF1_K562_input/IRF1_K562_input.bam -o analysis/align/IRF1_K562_input/IRF1_K562_input.sorted.bam -t 8 -m 4G 2>>analysis/logs/align/IRF1_K562_input.log

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

However, if I just run this on a development node it runs fine:
(chips) [jfreimer@dev3 test_chip]$ sambamba sort analysis/align/IRF1_K562_input/IRF1_K562_input.bam -o analysis/align/IRF1_K562_input/IRF1_K562_input.sorted.bam -t 8 -m 8G 2>>analysis/logs/align/IRF1_K562_input.log

KeyError Unnamed chips.snakefile

Hi,
I am trying to use chips to analyse some ChIP-seq data. I have managed to troubleshoot most of the errors, but I think there is something wrong when calling all the inputs to compile in the chips.snakefile. The error I get is:

InputFunctionException in line 242 of /Users/cvara/Documents/PD2.0/ChIPseq/Analysis/PROJECT/cidc_chips/chips.snakefile:
KeyError: 'Unnamed:'
Wildcards:

Can you help me? Thank you!
