liulab-dfci / maestro Goto Github PK

Single-cell Transcriptome and Regulome Analysis Pipeline

License: GNU General Public License v3.0

Python 12.51% R 5.77% HTML 5.32% Batchfile 0.01% Shell 0.81% Makefile 0.85% C 70.02% JavaScript 2.03% CSS 0.37% Roff 0.52% M4 0.33% Perl 0.84% Scilab 0.07% Ruby 0.46% Dockerfile 0.11%

scatac-seq scrna-seq epigenetics snakemake-workflows transcription-factors gene-regulatory-networks

maestro's People

Contributors

Stargazers

Watchers

Forkers

kunju-pitt pythseq kant booew nsteinau dongqingsun96 crazyhottommy gt7901b jhuanglabreferences deepbody-me shulp2211 amsszlh zorrodong yixf-self khandaud15 yuewangpanda yiwang12 bicnasong dadongz yjchen1201 haroon123 baigal628 mengchengyao shenyang1981 labinaili mcbrlab qiangtu bzdfci shanwai1234 bit-vs-it taoliu mengxiao ssmadha jphe f6v chengwenxuan1997 anderkj2 sophial05 han-tun philloidin chenfeiwang mudappathir gsarfo-boateng sca-di rb56 macs3-project tarela nailouzhang ranojoychatterjee jianguozhou3 ariakid phycomlab omar-almolla209 altingia jhuanglabsctools jiahuaqu qapf ruthwww l-z-l kerwin12580 tenlives guochengying-7824 programming-sbcg853-cohort3 qiupinghust wenmm starlitnightly gladelephant chenjunjie1996 harishkgpbiopython24 huizhong1993 imahabub

maestro's Issues

Incompatibility with Seurat 4.0

When installing MAESTRO (774ce46) on top of satijalab/seurat/release/4.0.0:

** byte-compile and prepare package for lazy loading
Error: object ‘CreateGeneActivityMatrix’ is not exported by 'namespace:Seurat'
Execution halted
ERROR: lazy loading failed for package ‘MAESTRO’

Where to find predicted regulatory transcription factors and corresponding targets in the output

The file XXX.PredictedTFTop10.txt has list of transcription factors. The tutorial mentions target genes should be in XXX.GIGGLE folder, but I couldn't find it. Could you point out which file I should look for? Thanks!

Yuping

Multi-sample analysis!

Hi!
A very good tool.
I now have five scATAC samples.How do I merge multiple samples?

Feature request: Add recipe on bioconda

Hi @liulab-dfci, I'm interested in trying out MAESTRO, and the Dockerfile recipe works great. One thing I noticed is that the conda recipe is currently managed under a custom channel (see https://anaconda.org/liulab-dfci/maestro). I was wondering if it would be possible to submit the recipe to bioconda, which makes developing other recipes that depend on MAESTRO easier to manage.

Best,
Mike

Enhancement: Description in the HTML report

The current (v1.0.2) HTML report only includes figures and tables without explanations. Please add more texts explaining how to interpret the results in the report.

https://github.com/liulab-dfci/MAESTRO/tree/master/MAESTRO/html

add doublet removal step

according to https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02136-7
"scDblFinder achieved comparable or better accuracy than top alternatives while being the fastest method"

Bug: a typo in the name of "mouce.brain.ALLEN"

Please rename it as "mouse.brain.ALLEN"

Bug: Hardcoded separator "_" doesn't work for 10X H5 format, in scATAC_genescore.py

The following line for calculating RP scores assumes that peaks are named as "chr_start_end" in peak matrix.

MAESTRO/MAESTRO/scATAC_genescore.py

Line 97 in ffdfffe

peaks_tmp = peak.decode().split("_")

However, the standard output "filtered_peak_bc_matrix.h5" from 10X cellranger-atac may use chr:start-end", according to

https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/advanced/h5_matrices

Suggest replacing the separator in the peak column to a standard character (either "_" or "-") in MAESTRO R script after loading the H5 file.

FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'

Hi Chenfei,

Thank you for making MAESTRO available! I encountered an error while processing some 10x scRNA-seq data from mice.

It seems that MAESTRO couldn't find PredictedTFTop10.txt. I have RABIT installed and specified the RABIT index in the yaml file.

[Mon Dec  9 15:03:42 2019]
rule scrna_report:
    input: Result/Analysis, Result/QC/M1R1_MAESTRO_scRNA_read_distr.png, Result/QC/M1R1_MAESTRO_scRNA_read_quality.png, Result/QC/M1R1_MAESTRO_scRNA_NVC.png, Result/QC/M1R1_MAESTRO_scRNA_GCcontent.png, Result/QC/M1R1_MAESTRO_scRNA_genebody_cov.png, Result/QC/M1R1_MAESTRO_scRNA_cell_filtering.png
    output: Result/M1R1_MAESTRO_scRNA_report.html
    jobid: 1

Traceback (most recent call last):
  File "/omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py", line 36, in <module>
    for line in open(cluster_regulator_file,"r").readlines():
FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'
[Mon Dec  9 15:03:42 2019]
Error in rule scrna_report:
    jobid: 1
    output: Result/M1R1_MAESTRO_scRNA_report.html
    shell:
        python /omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py M1R1_MAESTRO /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/M1R1 GRCm38 10x-genomics True

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/.snakemake/log/2019-12-09T150340.656779.snakemake.log

This appears to the last missing step in MAESTRO as rerunning the pipeline gave me:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       scrna_report
        2

Enhancement: Documentation of APIs

We need documentation of APIs. Since users need to know the options of each function, the data structures, in order to fine-tune their pipeline or skip certain steps. Currently, the only available documentation is the tutorial, which is not enough.

Bug: incorrect use of MACS2 -n option

As shown in the current code to invoke macs2:

MAESTRO/MAESTRO/Snakemake/scATAC/Snakefile

Lines 138 to 152 in ffdfffe

    
           rule scatac_allpeakcall: 
        
               input: 
        
                   bam = "Result/BWA/" + config["outprefix"] + ".merged.sortedByPos.rmdp.bam" 
        
               output: 
        
                   peak = "Result/Analysis/" + config["outprefix"] + "_all_peaks.narrowPeak", 
        
                   bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg", 
        
               params: 
        
                   name = "Result/Analysis/" + config["outprefix"] + "_all" 
        
               log: 
        
                   "Result/Log/" + config["outprefix"] + "_macs2_allpeak.log" 
        
               conda: 
        
                   ENV_PATH + "/environment_py2.yaml" 
        
               shell: 
        
                   "macs2 callpeak -g hs -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}"

The option '-n' contains a folder structure. This is not a recommended way to specify an output directory of macs2. Please use '--outdir' instead and keep the '-n' simple such as:

name = config["outprefix"] + "_all"

and

macs2 callpeak -g hs --outdir Result/Analysis -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}

The issue is also related to other places in this Snakemake file where MACS2 is invoked.

install_github("liulab-dfci/MAESTRO") error

Dear Tao,

When I ran install_github("liulab-dfci/MAESTRO"), I encountered the below error. Have you ever encountered the same error?

Error: Failed to install 'unknown package' from GitHub:
Failed to connect to api.github.com port 443: Connection refused

Thanks very much for your help.

Best regards,
Leon.

Installation of MEATRO R codes is not tested.

Could you add it to travis.yml? Currently, only the conda installation part is tested.

genedistance is not used in calculating RP

the genedistance parameter https://github.com/liulab-dfci/MAESTRO/blob/master/MAESTRO/Snakemake/scATAC/Snakefile#L1166

is not used in the shell command https://github.com/liulab-dfci/MAESTRO/blob/master/MAESTRO/Snakemake/scATAC/Snakefile#L1174

Using default 10,000 bp.

error at scatac_qcfilter step

Hi,
I got the error below when running the pipeline starting from bam. Any suggestion what I should try to fix this? Also I wonder if the pipeline can be started at any point. In my case here, I would like to resume the pipeline from this scatac_qcfilter step once I figure out the problem, rather than starting over the whole pipeline again.

rule scatac_qcfilter:
input: Result/Analysis/SI_25417_peak_count.h5
output: Result/QC/SI_25417_filtered_peak_count.h5
jobid: 4
benchmark: Result/Benchmark/SI_25417_QCFilter.benchmark

Traceback (most recent call last):
File "/root/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in
import('pkg_resources').run_script('MAESTRO==1.2.1', 'MAESTRO')
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 112, in
main()
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 94, in main
scatac_qc(directory = args.directory, outprefix = args.outprefix, fileformat = args.format, peakcount = args.peakcount, feature = args.feature, barcode = args.barcode, single_stat = args.single_stat, peak_cutoff = args.peak_cutoff, count_cutoff = args.count_cutoff, frip_cutoff = args.frip_cutoff, cell_cutoff = args.cell_cutoff, species = args.species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 147, in scatac_qc
Filter(rawmatrix = peakmatrix, feature = features, barcode = barcodes, peak_cutoff = peak_cutoff, cell_cutoff = cell_cutoff, validcell = validcells_list, outprefix = filename, species = species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 84, in Filter
passed_cell_matrix = rawmatrix[np.ix_(passed_peak.tolist()[0], passed_cell.tolist()[0])]
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/_index.py", line 75, in getitem
return self._get_arrayXarray(row, col)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 665, in _get_arrayXarray
major.size, major.ravel(), minor.ravel(), val)
ValueError: could not convert integer scalar
[Wed Sep 2 20:52:51 2020]
Error in rule scatac_qcfilter:
jobid: 4
output: Result/QC/SI_25417_filtered_peak_count.h5
shell:
MAESTRO scatac-qc --format h5 --peakcount Result/Analysis/SI_25417_peak_count.h5 --peak-cutoff 100 --cell-cutoff 10 --directory Result/QC --outprefix SI_25417
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Thanks,
Yuping

two data files not accessible

Hello,

I found two data files were not accessible:

http://cistrome.org/~chenfei/MAESTRO/Refdata_scATAC_MAESTRO_GRCh38_1.1.0.tar.gz
http://cistrome.org/~chenfei/MAESTRO/giggle.all.tar.gz

The error is 403 Forbidden.

Why I can't open the HTML report

Several sets of data were tested, and it was very convenient.
When I open HTML report, but I can't see any contents of the report only sample information.
Is there anything I haven't noticed?

Dockerfile recipe error

I'm currently hitting this error on the last step of the Dockerfile recipe:

Step 10/10 : RUN cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz
 ---> Running in a935b7a7a681
tar (child): rabit.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The command '/bin/sh -c cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz' returned a non-zero code: 2

Multiple Samples

How would one go about using MAESTRO (starting at Step 4 of the pipeline) with multiple sc-RNA-Seq samples?

possible typo in the ATAC instruction page

Hi,

I ran MAESTRO by the Docker image successfully, then I tried to explore the data using R following the instruction. In the step 4, there is:

p <- VisualizeTFenrichment(TFs = tfs,
                             cluster.1 = 0,
                             type = "ATAC",
                             SeuratObj = pbmc.ATAC.res$ATAC,
                             GIGGLE.table = "10X_PBMC_10k_giggle.txt",
                             visual.totalnumber = 100,
                             name = "10X_PBMC_10k_Monocyte_filtered")

I think the GIGGLE.table should be '10X_PBMC_10k.PredictedTFScore.txt', because in the directory, there is no such file named '10X_PBMC_10k_giggle.txt'.

Thanks!

Install MAESTRO version 1.2.1.9999

I followed the instruction https://github.com/liulab-dfci/MAESTRO to install MAESTRO on HPC cluster for our institution research community.

My python version is 3.7.3-anaconda; conda version 4.9.2-py37h89c1867_0

I followed the instruction “ Installing the full solution of MAESTRO workflow through conda “, here is my procedures

[ris_hpc_apps@tdragon4 ~]$ module load python/3.7.3-anaconda

[ris_hpc_apps@tdragon4 ~]$ which conda

/risapps/rhel7/python/3.7.3/bin/conda

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels defaults

Warning: 'defaults' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels bioconda

Warning: 'bioconda' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels conda-forge

Warning: 'conda-forge' already in 'channels' list, moving to the top

   -- no need to add those channels; they are already there.

 

[ris_hpc_apps@tdragon4 ~]$ conda install mamba -c conda-forge

Collecting package metadata (repodata.json): done

Solving environment: done

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3

  added / updated specs:

    - mamba

The following packages will be downloaded:

 

    package                    |            build

    ---------------------------|-----------------

    ca-certificates-2020.11.8  |       ha878542_0         145 KB  conda-forge

    certifi-2020.11.8          |   py37h89c1867_0         150 KB  conda-forge

    conda-4.9.2                |   py37h89c1867_0         3.0 MB  conda-forge

    libsolv-0.7.16             |       h8b12597_0         455 KB  conda-forge

    ------------------------------------------------------------

                                           Total:         3.8 MB

 

The following NEW packages will be INSTALLED:

  libsolv            conda-forge/linux-64::libsolv-0.7.16-h8b12597_0

  mamba              conda-forge/linux-64::mamba-0.1.2-py37h99015e2_0

 

The following packages will be UPDATED:

  ca-certificates                      2020.6.20-hecda079_0 --> 2020.11.8-ha878542_0

  certifi                          2020.6.20-py37he5f6b98_2 --> 2020.11.8-py37h89c1867_0

  conda                                4.9.0-py37he5f6b98_1 --> 4.9.2-py37h89c1867_0

 

Proceed ([y]/n)? *y*

# I choose yes to update the conda to the latest version here

 

 Downloading and Extracting Packages

libsolv-0.7.16       | 455 KB    | ########################################################### | 100%

 certifi-2020.11.8    | 150 KB    | ########################################################### | 100%

 ca-certificates-2020 | 145 KB    | ########################################################### | 100%

 conda-4.9.2          | 3.0 MB    | ########################################################### | 100%

 Preparing transaction: done

Verifying transaction: done

Executing transaction: done

 

[ris_hpc_apps@tdragon4 ~]$ mamba create -n MAESTRO-1.2.1 -c liulab-dfci

  (since we already have MAESTO created for version 1.0.1, I choose a different conda environment name MAESTRO-1.2.1  )

  .......

   Getting  liulab-dfci linux-64

Getting  liulab-dfci noarch

Getting  conda-forge linux-64

Getting  conda-forge noarch

Getting  bioconda linux-64

Getting  bioconda noarch

Getting  pkgs/main linux-64

Getting  pkgs/main noarch

Getting  pkgs/r noarch

Getting  r linux-64

Getting  pkgs/r linux-64

Getting  dranew linux-64

Getting  r noarch

Getting  dranew noarch

 

Looking for: []

 

9 packages in https://conda.anaconda.org/liulab-dfci/linux-64

0 packages in https://conda.anaconda.org/liulab-dfci/noarch

139892 packages in https://conda.anaconda.org/conda-forge/linux-64

43184 packages in https://conda.anaconda.org/conda-forge/noarch

33562 packages in https://conda.anaconda.org/bioconda/linux-64

21237 packages in https://conda.anaconda.org/bioconda/noarch

18648 packages in https://repo.anaconda.com/pkgs/main/linux-64

2356 packages in https://repo.anaconda.com/pkgs/main/noarch

6637 packages in https://repo.anaconda.com/pkgs/r/linux-64

5025 packages in https://repo.anaconda.com/pkgs/r/noarch

6620 packages in https://conda.anaconda.org/r/linux-64

5025 packages in https://conda.anaconda.org/r/noarch

170 packages in https://conda.anaconda.org/dranew/linux-64

0 packages in https://conda.anaconda.org/dranew/noarch

 

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

 

Proceed ([y]/n)? y

 #
  # To activate this environment, use
  #
  #     $ conda activate MAESTRO-1.2.1
  #
  # To deactivate an active environment, use
  #
  #     $ conda deactivate

Preparing transaction: done

Verifying transaction: done

Executing transaction: done

At this point, I check the environment directory /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1;

I only notice one directory was generated, is this normal?

[ris_hpc_apps@tdragon4 MAESTRO]$ ls -l /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

total 0

drwxr-xr-x 2 ris_hpc_apps rists 29 Nov 20 16:07 conda-meta

Note: I used our software installation account ris_hpc_apps to install this on HPC cluster. I suppose to let user to do "conda activate MAESTRO-1.2.1" ( "conda init bash" is needed before this)

Regards,
Rong

Error in integrating Seurat scRNA and Signac scATAC objects

First, I readRDS for my Seurat/harmony and Signac clustered scRNA and scATAC objects, respectively.

Then, I tried to run the Incorporate function:
coembed <- Incorporate(RNA = rna, ATAC = atac, project = "Maestro_integrated", method = "MAESTRO")

But I get the following error:
Error: Cannot find 'ACTIVITY' in this Seurat object

I also tried:
coembed <- Incorporate(RNA = rna$RNA, ATAC = atac$ATAC, project = "Maestro_integrated", method = "MAESTRO")

But,
Error: Cannot find 'ATAC' in this Seurat object

scrna-init creates a problematic snakefile for 10x-genomics

I used the following command to create a Snakefile following the instructions at https://github.com/liulab-dfci/MAESTRO/blob/master/example/RNA_infrastructure_10x/RNA_infrastructure_10x.md.

Unfortunately, the Snakefile does not work since I get the following error:

IndexError in line 34 of /storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile:
list index out of range
  File "/storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile", line 34, in <module>
  File "/lsc/MAESTRO/1.2.2/envs/MAESTRO/lib/python3.8/site-packages/MAESTRO/scRNA_utility.py", line 44, in getfastq_10x

The problem comes from the fact that the scrna-init command did not provide values for the transcript, barcode, and decompress variable; and these variables are used in the Snakefile for the STAR command.

This should not be the case given your documentation that specifies that --fastq-barcode and --fastq-transcript should only be provided for the Dropseq format (not 10x-genomics)

You can find the Snakefile and config.xml files in the archive enclosed.
snakefile_and_config.zip

Calculating gene variances: Error in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : invalid 'x'

I ran MAESTRO workflow using fragment file as the input file and encountered the error.

Could you let me know why I have such an error? or how to solve it?

Thank you.

Bicna

Enhancement: Provide options to set customized cell signatures in Snakemake configuration file

In the current version 1.0.2, if users use the snakemake pipeline, there is no way to set their customized cell signatures, and they are forced to use CIBERSORT immune cell signatures. The pipeline would be more useful if the cell annotation results from snakemake pipeline can make more sense.

https://github.com/liulab-dfci/MAESTRO/tree/master/MAESTRO/Snakemake

Question: Can we do TF annotation without doing cell annotation?

In some cases, we do not know the actual cell types, or we do not know the signature genes of possible cell types, so we'd like to skip cell-type annotation step, and plan to identify the driver TFs for each cluster. Can we do that? I am asking this since according to the tutorial, the regulation potentials (RP, or gene scores) are only to be included in the 'cell type annotation' step, but RPs are needed in the TF identification step.

Bug: scATAC_genescore.py and ATACCalculateGenescore.R can't handle peaks from unusual chromosomes

When peaks are from unusual chromosomes such as "chr11_KI270721v1_random", scATAC_genescore.py and ATACCalculateGenescore.R can't correctly split the peak name such as "chr11_KI270721v1_random_12345_67890" to find the chromosome name, start and end positions since it will use "_" as the separator:

MAESTRO/MAESTRO/scATAC_genescore.py

Lines 96 to 98 in 4af5c11

    
           for ipeak, peak in enumerate(peaks_list): 
        
               peaks_tmp = peak.decode().split("_") 
        
               peaks_info.append([peaks_tmp[0], (int(peaks_tmp[1]) + int(peaks_tmp[2])) / 2.0, 0, ipeak])

Please use rsplit("_",maxsplit=2) instead. Here is an example:

>>> a="chr11_KI270721v1_random_12345_67890"
>>> a.split("_") # this will incorrectly split the chromosome name
['chr11', 'KI270721v1', 'random', '12345', '67890']
>>> a.rsplit("_",maxsplit=2) # this will give exactly three values
['chr11_KI270721v1_random', '12345', '67890']

Pass in existing Seurat/SingleCellExperiment Object

Is it possible to pass in an existing Seurat object (or converted SingleCellExperiment object)? I have a pretty robust pipeline that does QC, clustering and such, would like to have some level of preservation of those clusters and such.

cluster resolution other than 0.6 with error in coembedding

Hello,

I tried to cluster my scRNA and scATAC datasets using a resolution other than 0.6 (this number was used in the MAESTRO vignettes), but I got an error saying that "ATAC_snn_res.0.6" or "RNA_snn_res.0.6" cannot be found in the Seurat object when I used the "Incorporate" function to coembed the two datasets. Do I need to use a default resolution of 0.6 before coembedding? But can I change the resolution after coembedding?

Thank you!

Question: Any plan to upgrade R to 3.6?

Currently, R 3.5 is specifically required. Any reason that we should keep using the old version? Or is there any plan to upgrade? The reason why I asked is that the latest Bioconductor (3.10) requires R 3.6.

MAESTRO/conda/MAESTRO/meta.yaml

Line 32 in da64dcb

- r-base=3.5

Bug: bdg to bw in Snakemake

The current implementation has a potential problem. The bedtools intersect with -f 1.00 will get rid of the region that is not ENTIRELY included in the chr_limit file. It would be more reasonable to clip instead of discarding the whole region.

MAESTRO/MAESTRO/Snakemake/scATAC/Snakefile

Lines 637 to 651 in ffdfffe

    
           rule scatac_bdg2bw: 
        
               input: 
        
                   bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg", 
        
                   chrlimit = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr_limit.bed", 
        
                   chrlen = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr.len", 
        
               output: 
        
                   tmp = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg.tmp", 
        
                   bigwig = "Result/Analysis/" + config["outprefix"] + "_all.bw" 
        
               params: 
        
                   outdir = "Result/Analysis/" + config["outprefix"] 
        
               shell: 
        
                   "bedtools intersect -a {input.bdg} -b {input.chrlimit} -wa -f 1.00 > {output.tmp};" 
        
                   "bedSort {output.tmp} {input.bdg};" 
        
                   "bedGraphToBigWig {input.bdg} {input.chrlen} {output.bigwig};"

Check my gist: https://gist.github.com/taoliu/2469050

gene_peak matrix in RP_model

The value range of gene_peak matrix in RP_model should between 0 and 1. Why does "genes_peaks_score_csr" in scATAC_Genescore.py have a value greater than 1, with a maximum of 1600+.

Python scripts should use argparse

Scripts like scRNA_qc.py use many positional arguments, which is not particularly user friendly if the user needs to run the script outside of the snakemake pipeline. It would be easier for the user if argparse was used so that they can specify command line flags and have a help doc for the script.

Bug: Bugs of relative pathes in the MAESTRO scrna-/scatac-init scripts and Errors in Tutorial

The version of MAESTRO: v1.1.0

Command line bugs:

If --fastq-dir ends with /, the snakemake cmd will give a ‘double /’ warning.
If --fastq-dir is a relative path, it won’t work since this relative path will be added to the config file in a new child directory; the same problem can be found for the option for whitelist file.

Tutorial errors:

The instruction, there is no --directory option but -d.
The assembly names GRCh38 or GRCm38 are not species names!
Please do not hardcode path names in instruction.
Errors in the scRNA instruction: the download link for the STAR index only contains human data. The RSEM library download link is pointing to giggle download.

Is MAESTRO compatible with 10X data derived from nuclear RNA?

Hello,

I'm currently installing the MAESTRO prerequisites and, after reading the paper, I'd like to ask if MAESTRO is compatible with 10X data derived from nuclear RNA, particularly if I'm looking to integrate single-modal snRNA- and snATAC-seq data?

And more specifically, could the use of a pre-mRNA reference and GTF files for alignment, as opposed to standard reference/annotation files, impact a MAESTRO analysis at all?

Until now I have been using Cell Ranger 4 for my analysis which recommends using a pre-mRNA reference and GTF file for nuclear RNA. I had started creating STARsolo compatible versions of these files for my MAESTRO analysis and wondered if this is the best course of action, particularly as 10X have recently released v5 which includes a new function for dealing with intronic reads without the need of a pre-RNA reference, and STARsolo also provides a similar function.

Regardless, it would be useful to hear if you have any recommendations or points of interest that I should consider when running MAESTRO using single-nuclear data.

Many Thanks,

Darren

Enhancement: Optimize the memory usage for calculating gene score (RP) for scATAC-seq with a large number of cells

I have to process a scATAC-seq dataset with over 48,000 cells after merging three conditions and 2 replicates each condition. I kept 200,000 top peaks. The scATAC_cellranger_count.py can finish successfully, however, even our high memory node (260G mem) of our HPC keeps killing scATAC_genescore.py. We need to optimize the memory usage of this script. This issue ticket will track the progress of the optimization.

Documentation clarification

It would be appreciated if the documentation explicitly mentioned that rownames of the input matrixes need to be gene symbols and not other annotations. Took me a while to figure out that LISA requires symbols, and that the mismatch in rownames of my ATAC and RNA data kept causing Incorporate() to segfault. However, I also admit I am using a slightly processed counts matrix for my RNA input, which used ensemble ids as the rownames.

A hardcoded path of LISA in the docker image

Hello,

I am using MAESTRO docker image winterdongqing/maestro, 1.2.1, 57f41ffbe200. When I ran the scRNA-seq module, it reported errors like:

unable to open file: name = '/project/dev/qqin/LISA/lisa_web/cistromedb_data/hg38/cluster_human/DNase_median_for_each_cluster.h5'

This error should be caused by a hardcoded path of LISA. I changed to web LISA, then this module finished successfully.

Bug: Let reticulate find the current Python

If users use MAESTRO for the first time or more accurately use "reticulate" package through ATACCalculateGenescore script for the first time, they will be prompted to decide whether "reticulate" should install an independent miniconda environment by itself.

No non-system installation of Python could be found.
Would you like to download and install Miniconda?
Miniconda is an open source environment management system for Python.
See https://docs.conda.io/en/latest/miniconda.html for more details.

Would you like to install Miniconda? [Y/n]: n

Assuming users are utilizing conda/miniconda version of MAESTRO for their analysis, they should stick to the python in that environment, so answer "n". This has to be done just once. We'd better describe it in our tutorial otherwise users may have a lot of trouble later on if they accidentally answer "Y" to the above question. See the discussion regarding this in rstudio/reticulate#607.

The code related to the issue is

MAESTRO/R/ATACCalculateGenescore.R

Lines 29 to 30 in 92b1d3f

    
           library(reticulate) 
        
           source_python(paste(system.file(package="MAESTRO"), "ATACCalculateGenescore.py", sep="/"))

Bug: Version number has to be corrected

The version should be set as 1.0.2.

MAESTRO/setup.py

Line 50 in 1eba0a2

version = "1.0.1",

Problem with ATACCalculateGenescore when choosing "GRCm38"

genes = ATACCalculateGenescore(peaks, organism = "GRCm38")

this just give me a zero matrix. And the column names are human gene names not mouse gene names.

enhance scATAC_HTMLReport.py for multisample

Hi @DongqingSun96 ,
can you please fix this command for multisample scATACseq? https://github.com/liulab-dfci/MAESTRO/blob/feat/multisample/MAESTRO/Snakemake/multisample/scATAC/rules/sc_atac_report.smk#L27
currently, it assumes the files are in Result/Analysis/, but for multisample, the files are in /Result/Analysis/{sample}/

Thanks a lot!

Enhancement: Suggestions on MAESTRO/R/ATACRunSeurat.R

If we look at MAESTRO/R/ATACRunSeurat.R script, it includes many of Seurat functions, such as RunLSI, FindCluster, and RunUMAP. However, the parameters for these functions are mainly fixed with only few tunable parameters for users. How can users tweak the parameter such as n.neighbors or min.dist for UMAP with this script, or the maximum number of components from LSI (n=50 is fixed)? My suggestion is to break this big script into at least three modules -- LSI, UMAP, and DE, and provide more options!

How to generate those QC metrics plots?

Hi,

Thank you for this great work! I just have one question when I go through the Galleries & Tutorials. For the scRNA-seq example, after the RNARunSeurat() function, you use str() to check those features. How to plot those figures below the line of str(pbmc.RNA.res$RNA)?

Thanks,
Frank

Bug: Missing default value or required status for `MAESTRO init` command

The main executable script MAESTRO will throw exceptions when no argument is set for init command:

$ MAESTRO init 
Traceback (most recent call last):
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in <module>
    __import__('pkg_resources').run_script('MAESTRO==1.0.1', 'MAESTRO')
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 79, in <module>
    main()
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 72, in main
    init_workflow(args.directory, args.module)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 21, in init_workflow
    os.makedirs(directory)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/os.py", line 206, in makedirs
    head, tail = path.split(name)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/posixpath.py", line 107, in split
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The solution is either to add required=True for "-m" and "-d", or to set default= for these options.

MAESTRO/MAESTRO/MAESTRO

Lines 61 to 62 in 1eba0a2

    
           workflow.add_argument("-d", "--directory", help = "Path to the directory where the workflow shall be initialized and results shall be stored.") 
        
           workflow.add_argument("-m", "--module", choices = ["scATAC","scRNA","integrate"], help="Analysis of scATAC-seq or scRNA-seq, or integrattion analysis")

giggle and lisa tables for drosophila

Is it possible to generate the giggle and lisa tables for drosophila melanogaster? I would like to use these to generate the driver TFs for both scRNA and scATAC-seq samples.

Bug: a typo in the name of mouce.brain.ALLEN

--giggleannotation for other species

Hi Chenfei,

Does MASTRO allow to process data of other species, like zebrafish? If so, how should I prepare the reference genome files, like giggle annotation files?

Thank you!
Yujie

Suggestion to set default assay, fix hard coded lines and more in Incorporate.R

To make this script more flexible for customized analysis, please add DefaultAssay(ATAC) <- "ACTIVITY" to the end of this code block.

MAESTRO/R/Incorporate.R

Lines 47 to 64 in d58fc18

    
           if(is.null(ATAC[["ACTIVITY"]])&method!="Seurat"){ 
        
              RPmatrix <- RPmatrix[,intersect(colnames(ATAC), colnames(RPmatrix))] 
        
              ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(RPmatrix))) 
        
              ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = RPmatrix) 
        
              DefaultAssay(ATAC) <- "ACTIVITY" 
        
              ATAC <- FindVariableFeatures(ATAC) 
        
              ATAC <- NormalizeData(ATAC) 
        
              ATAC <- ScaleData(ATAC)} 
        
            if(method=="Seurat"){ 
        
              activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf",  seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE) 
        
              activity.matrix <- activity.matrix[,intersect(colnames(ATAC), colnames(activity.matrix))] 
        
              ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(activity.matrix))) 
        
              ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix) 
        
              DefaultAssay(ATAC) <- "ACTIVITY" 
        
              ATAC <- FindVariableFeatures(ATAC) 
        
              ATAC <- NormalizeData(ATAC) 
        
              ATAC <- ScaleData(ATAC)}

Also, please fix the hard code at:

MAESTRO/R/Incorporate.R

Line 56 in d58fc18

    
           activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf",  seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)

MAESTRO/R/Incorporate.R

Line 84 in d58fc18

    
           p2 <- DimPlot(CombinedObj, reduction = "umap", group.by = "RNA_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='RNA'),]), label = TRUE, repel = TRUE)

MAESTRO/R/Incorporate.R

Line 86 in d58fc18

    
           p3 <- DimPlot(CombinedObj, reduction = "umap", group.by = "ATAC_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='ATAC'),]), label = TRUE, repel = TRUE)

Another suggestion is to separate the plotting functions from this script since when users get the CombinedObj, they can plot whatever they want. To combine plotting with computation is unnecessary and will potentially cause some bugs.

lisa unpack error

Hi, I am running MAESTRO 1.2.2 following the tutorial. The scRNA-seq module complained:

lisa unpack: error: argument tarball: ERROR: /home/qtu/work/maestro/lisa/hg38_2.1.tar.gz is not a valid file.

Actually I could directly execute lisa unpack hg38_2.1.tar.gz successful, not sure what caused the error.

BTW, does the option --lisadir lisa/hg38_2.1.tar.gz mean the pipeline will unpack the huge file each time?

Thanks!

	rule scatac_allpeakcall:
	input:
	bam = "Result/BWA/" + config["outprefix"] + ".merged.sortedByPos.rmdp.bam"
	output:
	peak = "Result/Analysis/" + config["outprefix"] + "_all_peaks.narrowPeak",
	bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
	params:
	name = "Result/Analysis/" + config["outprefix"] + "_all"
	log:
	"Result/Log/" + config["outprefix"] + "_macs2_allpeak.log"
	conda:
	ENV_PATH + "/environment_py2.yaml"
	shell:
	"macs2 callpeak -g hs -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}"

	for ipeak, peak in enumerate(peaks_list):
	peaks_tmp = peak.decode().split("_")
	peaks_info.append([peaks_tmp[0], (int(peaks_tmp[1]) + int(peaks_tmp[2])) / 2.0, 0, ipeak])

	rule scatac_bdg2bw:
	input:
	bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
	chrlimit = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr_limit.bed",
	chrlen = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr.len",
	output:
	tmp = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg.tmp",
	bigwig = "Result/Analysis/" + config["outprefix"] + "_all.bw"
	params:
	outdir = "Result/Analysis/" + config["outprefix"]
	shell:
	"bedtools intersect -a {input.bdg} -b {input.chrlimit} -wa -f 1.00 > {output.tmp};"
	"bedSort {output.tmp} {input.bdg};"
	"bedGraphToBigWig {input.bdg} {input.chrlen} {output.bigwig};"

	library(reticulate)
	source_python(paste(system.file(package="MAESTRO"), "ATACCalculateGenescore.py", sep="/"))

	workflow.add_argument("-d", "--directory", help = "Path to the directory where the workflow shall be initialized and results shall be stored.")
	workflow.add_argument("-m", "--module", choices = ["scATAC","scRNA","integrate"], help="Analysis of scATAC-seq or scRNA-seq, or integrattion analysis")

	if(is.null(ATAC[["ACTIVITY"]])&method!="Seurat"){
	RPmatrix <- RPmatrix[,intersect(colnames(ATAC), colnames(RPmatrix))]
	ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(RPmatrix)))
	ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = RPmatrix)
	DefaultAssay(ATAC) <- "ACTIVITY"
	ATAC <- FindVariableFeatures(ATAC)
	ATAC <- NormalizeData(ATAC)
	ATAC <- ScaleData(ATAC)}
	if(method=="Seurat"){
	activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)
	activity.matrix <- activity.matrix[,intersect(colnames(ATAC), colnames(activity.matrix))]
	ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(activity.matrix)))
	ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
	DefaultAssay(ATAC) <- "ACTIVITY"
	ATAC <- FindVariableFeatures(ATAC)
	ATAC <- NormalizeData(ATAC)
	ATAC <- ScaleData(ATAC)}

liulab-dfci / maestro Goto Github PK

maestro's People

Contributors

Stargazers

Watchers

Forkers

maestro's Issues

Recommend Projects

Recommend Topics

Recommend Org