Giter Site home page Giter Site logo

liulab-dfci / maestro Goto Github PK

View Code? Open in Web Editor NEW
274.0 12.0 75.0 208.21 MB

Single-cell Transcriptome and Regulome Analysis Pipeline

License: GNU General Public License v3.0

Python 12.51% R 5.77% HTML 5.32% Batchfile 0.01% Shell 0.81% Makefile 0.85% C 70.02% JavaScript 2.03% CSS 0.37% Roff 0.52% M4 0.33% Perl 0.84% Scilab 0.07% Ruby 0.46% Dockerfile 0.11%
scatac-seq scrna-seq epigenetics snakemake-workflows transcription-factors gene-regulatory-networks

maestro's People

Contributors

baigal628 avatar chenfeiwang avatar crazyhottommy avatar dongqingsun avatar dongqingsun96 avatar kant avatar liulab-dfci avatar lzy604 avatar mengxiao avatar qiangtu avatar taoliu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

maestro's Issues

Incompatibility with Seurat 4.0

When installing MAESTRO (774ce46) on top of satijalab/seurat/release/4.0.0:

** byte-compile and prepare package for lazy loading
Error: object ‘CreateGeneActivityMatrix’ is not exported by 'namespace:Seurat'
Execution halted
ERROR: lazy loading failed for package ‘MAESTRO’

Multi-sample analysis!

Hi!
A very good tool.
I now have five scATAC samples.How do I merge multiple samples?

Bug: Hardcoded separator "_" doesn't work for 10X H5 format, in scATAC_genescore.py

The following line for calculating RP scores assumes that peaks are named as "chr_start_end" in peak matrix.

peaks_tmp = peak.decode().split("_")

However, the standard output "filtered_peak_bc_matrix.h5" from 10X cellranger-atac may use chr:start-end", according to

https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/advanced/h5_matrices

Suggest replacing the separator in the peak column to a standard character (either "_" or "-") in MAESTRO R script after loading the H5 file.

FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'

Hi Chenfei,

Thank you for making MAESTRO available! I encountered an error while processing some 10x scRNA-seq data from mice.

It seems that MAESTRO couldn't find PredictedTFTop10.txt. I have RABIT installed and specified the RABIT index in the yaml file.

[Mon Dec  9 15:03:42 2019]
rule scrna_report:
    input: Result/Analysis, Result/QC/M1R1_MAESTRO_scRNA_read_distr.png, Result/QC/M1R1_MAESTRO_scRNA_read_quality.png, Result/QC/M1R1_MAESTRO_scRNA_NVC.png, Result/QC/M1R1_MAESTRO_scRNA_GCcontent.png, Result/QC/M1R1_MAESTRO_scRNA_genebody_cov.png, Result/QC/M1R1_MAESTRO_scRNA_cell_filtering.png
    output: Result/M1R1_MAESTRO_scRNA_report.html
    jobid: 1

Traceback (most recent call last):
  File "/omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py", line 36, in <module>
    for line in open(cluster_regulator_file,"r").readlines():
FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'
[Mon Dec  9 15:03:42 2019]
Error in rule scrna_report:
    jobid: 1
    output: Result/M1R1_MAESTRO_scRNA_report.html
    shell:
        python /omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py M1R1_MAESTRO /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/M1R1 GRCm38 10x-genomics True

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/.snakemake/log/2019-12-09T150340.656779.snakemake.log

This appears to the last missing step in MAESTRO as rerunning the pipeline gave me:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       scrna_report
        2

Enhancement: Documentation of APIs

We need documentation of APIs. Since users need to know the options of each function, the data structures, in order to fine-tune their pipeline or skip certain steps. Currently, the only available documentation is the tutorial, which is not enough.

Bug: incorrect use of MACS2 -n option

As shown in the current code to invoke macs2:

rule scatac_allpeakcall:
input:
bam = "Result/BWA/" + config["outprefix"] + ".merged.sortedByPos.rmdp.bam"
output:
peak = "Result/Analysis/" + config["outprefix"] + "_all_peaks.narrowPeak",
bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
params:
name = "Result/Analysis/" + config["outprefix"] + "_all"
log:
"Result/Log/" + config["outprefix"] + "_macs2_allpeak.log"
conda:
ENV_PATH + "/environment_py2.yaml"
shell:
"macs2 callpeak -g hs -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}"

The option '-n' contains a folder structure. This is not a recommended way to specify an output directory of macs2. Please use '--outdir' instead and keep the '-n' simple such as:

name = config["outprefix"] + "_all"

and

macs2 callpeak -g hs --outdir Result/Analysis -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}

The issue is also related to other places in this Snakemake file where MACS2 is invoked.

install_github("liulab-dfci/MAESTRO") error

Dear Tao,

When I ran install_github("liulab-dfci/MAESTRO"), I encountered the below error. Have you ever encountered the same error?

Error: Failed to install 'unknown package' from GitHub:
Failed to connect to api.github.com port 443: Connection refused

Thanks very much for your help.

Best regards,
Leon.

error at scatac_qcfilter step

Hi,
I got the error below when running the pipeline starting from bam. Any suggestion what I should try to fix this? Also I wonder if the pipeline can be started at any point. In my case here, I would like to resume the pipeline from this scatac_qcfilter step once I figure out the problem, rather than starting over the whole pipeline again.

rule scatac_qcfilter:
input: Result/Analysis/SI_25417_peak_count.h5
output: Result/QC/SI_25417_filtered_peak_count.h5
jobid: 4
benchmark: Result/Benchmark/SI_25417_QCFilter.benchmark

Traceback (most recent call last):
File "/root/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in
import('pkg_resources').run_script('MAESTRO==1.2.1', 'MAESTRO')
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 112, in
main()
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 94, in main
scatac_qc(directory = args.directory, outprefix = args.outprefix, fileformat = args.format, peakcount = args.peakcount, feature = args.feature, barcode = args.barcode, single_stat = args.single_stat, peak_cutoff = args.peak_cutoff, count_cutoff = args.count_cutoff, frip_cutoff = args.frip_cutoff, cell_cutoff = args.cell_cutoff, species = args.species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 147, in scatac_qc
Filter(rawmatrix = peakmatrix, feature = features, barcode = barcodes, peak_cutoff = peak_cutoff, cell_cutoff = cell_cutoff, validcell = validcells_list, outprefix = filename, species = species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 84, in Filter
passed_cell_matrix = rawmatrix[np.ix_(passed_peak.tolist()[0], passed_cell.tolist()[0])]
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/_index.py", line 75, in getitem
return self._get_arrayXarray(row, col)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 665, in _get_arrayXarray
major.size, major.ravel(), minor.ravel(), val)
ValueError: could not convert integer scalar
[Wed Sep 2 20:52:51 2020]
Error in rule scatac_qcfilter:
jobid: 4
output: Result/QC/SI_25417_filtered_peak_count.h5
shell:
MAESTRO scatac-qc --format h5 --peakcount Result/Analysis/SI_25417_peak_count.h5 --peak-cutoff 100 --cell-cutoff 10 --directory Result/QC --outprefix SI_25417
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Thanks,
Yuping

Why I can't open the HTML report

Several sets of data were tested, and it was very convenient.
When I open HTML report, but I can't see any contents of the report only sample information.
Is there anything I haven't noticed?
image

Dockerfile recipe error

I'm currently hitting this error on the last step of the Dockerfile recipe:

Step 10/10 : RUN cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz
 ---> Running in a935b7a7a681
tar (child): rabit.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The command '/bin/sh -c cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz' returned a non-zero code: 2

Multiple Samples

How would one go about using MAESTRO (starting at Step 4 of the pipeline) with multiple sc-RNA-Seq samples?

possible typo in the ATAC instruction page

Hi,

I ran MAESTRO by the Docker image successfully, then I tried to explore the data using R following the instruction. In the step 4, there is:

p <- VisualizeTFenrichment(TFs = tfs,
                             cluster.1 = 0,
                             type = "ATAC",
                             SeuratObj = pbmc.ATAC.res$ATAC,
                             GIGGLE.table = "10X_PBMC_10k_giggle.txt",
                             visual.totalnumber = 100,
                             name = "10X_PBMC_10k_Monocyte_filtered")

I think the GIGGLE.table should be '10X_PBMC_10k.PredictedTFScore.txt', because in the directory, there is no such file named '10X_PBMC_10k_giggle.txt'.

Thanks!

Install MAESTRO version 1.2.1.9999

I followed the instruction https://github.com/liulab-dfci/MAESTRO to install MAESTRO on HPC cluster for our institution research community.

My python version is 3.7.3-anaconda; conda version 4.9.2-py37h89c1867_0

I followed the instruction “ Installing the full solution of MAESTRO workflow through conda “, here is my procedures

[ris_hpc_apps@tdragon4 ~]$ module load python/3.7.3-anaconda

[ris_hpc_apps@tdragon4 ~]$ which conda

/risapps/rhel7/python/3.7.3/bin/conda

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels defaults

Warning: 'defaults' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels bioconda

Warning: 'bioconda' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels conda-forge

Warning: 'conda-forge' already in 'channels' list, moving to the top

   -- no need to add those channels; they are already there.

 

[ris_hpc_apps@tdragon4 ~]$ conda install mamba -c conda-forge

Collecting package metadata (repodata.json): done

Solving environment: done

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3

  added / updated specs:

    - mamba

The following packages will be downloaded:

 

    package                    |            build

    ---------------------------|-----------------

    ca-certificates-2020.11.8  |       ha878542_0         145 KB  conda-forge

    certifi-2020.11.8          |   py37h89c1867_0         150 KB  conda-forge

    conda-4.9.2                |   py37h89c1867_0         3.0 MB  conda-forge

    libsolv-0.7.16             |       h8b12597_0         455 KB  conda-forge

    ------------------------------------------------------------

                                           Total:         3.8 MB

 

The following NEW packages will be INSTALLED:

  libsolv            conda-forge/linux-64::libsolv-0.7.16-h8b12597_0

  mamba              conda-forge/linux-64::mamba-0.1.2-py37h99015e2_0

 

The following packages will be UPDATED:

  ca-certificates                      2020.6.20-hecda079_0 --> 2020.11.8-ha878542_0

  certifi                          2020.6.20-py37he5f6b98_2 --> 2020.11.8-py37h89c1867_0

  conda                                4.9.0-py37he5f6b98_1 --> 4.9.2-py37h89c1867_0

 

Proceed ([y]/n)? *y*

# I choose yes to update the conda to the latest version here

 

 Downloading and Extracting Packages

libsolv-0.7.16       | 455 KB    | ########################################################### | 100%

 certifi-2020.11.8    | 150 KB    | ########################################################### | 100%

 ca-certificates-2020 | 145 KB    | ########################################################### | 100%

 conda-4.9.2          | 3.0 MB    | ########################################################### | 100%

 Preparing transaction: done

Verifying transaction: done

Executing transaction: done

 

[ris_hpc_apps@tdragon4 ~]$ mamba create -n MAESTRO-1.2.1 -c liulab-dfci

  (since we already have MAESTO created for version 1.0.1, I choose a different conda environment name MAESTRO-1.2.1  )

  .......

   Getting  liulab-dfci linux-64

Getting  liulab-dfci noarch

Getting  conda-forge linux-64

Getting  conda-forge noarch

Getting  bioconda linux-64

Getting  bioconda noarch

Getting  pkgs/main linux-64

Getting  pkgs/main noarch

Getting  pkgs/r noarch

Getting  r linux-64

Getting  pkgs/r linux-64

Getting  dranew linux-64

Getting  r noarch

Getting  dranew noarch

 

Looking for: []

 

9 packages in https://conda.anaconda.org/liulab-dfci/linux-64

0 packages in https://conda.anaconda.org/liulab-dfci/noarch

139892 packages in https://conda.anaconda.org/conda-forge/linux-64

43184 packages in https://conda.anaconda.org/conda-forge/noarch

33562 packages in https://conda.anaconda.org/bioconda/linux-64

21237 packages in https://conda.anaconda.org/bioconda/noarch

18648 packages in https://repo.anaconda.com/pkgs/main/linux-64

2356 packages in https://repo.anaconda.com/pkgs/main/noarch

6637 packages in https://repo.anaconda.com/pkgs/r/linux-64

5025 packages in https://repo.anaconda.com/pkgs/r/noarch

6620 packages in https://conda.anaconda.org/r/linux-64

5025 packages in https://conda.anaconda.org/r/noarch

170 packages in https://conda.anaconda.org/dranew/linux-64

0 packages in https://conda.anaconda.org/dranew/noarch

 

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

 

Proceed ([y]/n)? y

 #
  # To activate this environment, use
  #
  #     $ conda activate MAESTRO-1.2.1
  #
  # To deactivate an active environment, use
  #
  #     $ conda deactivate

Preparing transaction: done

Verifying transaction: done

Executing transaction: done

At this point, I check the environment directory /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1;

I only notice one directory was generated, is this normal?

[ris_hpc_apps@tdragon4 MAESTRO]$ ls -l /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

total 0

drwxr-xr-x 2 ris_hpc_apps rists 29 Nov 20 16:07 conda-meta

Note: I used our software installation account ris_hpc_apps to install this on HPC cluster. I suppose to let user to do "conda activate MAESTRO-1.2.1" ( "conda init bash" is needed before this)

Regards,
Rong

Error in integrating Seurat scRNA and Signac scATAC objects

First, I readRDS for my Seurat/harmony and Signac clustered scRNA and scATAC objects, respectively.

Then, I tried to run the Incorporate function:
coembed <- Incorporate(RNA = rna, ATAC = atac, project = "Maestro_integrated", method = "MAESTRO")

But I get the following error:
Error: Cannot find 'ACTIVITY' in this Seurat object

I also tried:
coembed <- Incorporate(RNA = rna$RNA, ATAC = atac$ATAC, project = "Maestro_integrated", method = "MAESTRO")

But,
Error: Cannot find 'ATAC' in this Seurat object

scrna-init creates a problematic snakefile for 10x-genomics

I used the following command to create a Snakefile following the instructions at https://github.com/liulab-dfci/MAESTRO/blob/master/example/RNA_infrastructure_10x/RNA_infrastructure_10x.md.

Unfortunately, the Snakefile does not work since I get the following error:

IndexError in line 34 of /storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile:
list index out of range
  File "/storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile", line 34, in <module>
  File "/lsc/MAESTRO/1.2.2/envs/MAESTRO/lib/python3.8/site-packages/MAESTRO/scRNA_utility.py", line 44, in getfastq_10x

The problem comes from the fact that the scrna-init command did not provide values for the transcript, barcode, and decompress variable; and these variables are used in the Snakefile for the STAR command.

This should not be the case given your documentation that specifies that --fastq-barcode and --fastq-transcript should only be provided for the Dropseq format (not 10x-genomics)

You can find the Snakefile and config.xml files in the archive enclosed.
snakefile_and_config.zip

Question: Can we do TF annotation without doing cell annotation?

In some cases, we do not know the actual cell types, or we do not know the signature genes of possible cell types, so we'd like to skip cell-type annotation step, and plan to identify the driver TFs for each cluster. Can we do that? I am asking this since according to the tutorial, the regulation potentials (RP, or gene scores) are only to be included in the 'cell type annotation' step, but RPs are needed in the TF identification step.

Bug: scATAC_genescore.py and ATACCalculateGenescore.R can't handle peaks from unusual chromosomes

When peaks are from unusual chromosomes such as "chr11_KI270721v1_random", scATAC_genescore.py and ATACCalculateGenescore.R can't correctly split the peak name such as "chr11_KI270721v1_random_12345_67890" to find the chromosome name, start and end positions since it will use "_" as the separator:

for ipeak, peak in enumerate(peaks_list):
peaks_tmp = peak.decode().split("_")
peaks_info.append([peaks_tmp[0], (int(peaks_tmp[1]) + int(peaks_tmp[2])) / 2.0, 0, ipeak])

Please use rsplit("_",maxsplit=2) instead. Here is an example:

>>> a="chr11_KI270721v1_random_12345_67890"
>>> a.split("_") # this will incorrectly split the chromosome name
['chr11', 'KI270721v1', 'random', '12345', '67890']
>>> a.rsplit("_",maxsplit=2) # this will give exactly three values
['chr11_KI270721v1_random', '12345', '67890']

Pass in existing Seurat/SingleCellExperiment Object

Is it possible to pass in an existing Seurat object (or converted SingleCellExperiment object)? I have a pretty robust pipeline that does QC, clustering and such, would like to have some level of preservation of those clusters and such.

cluster resolution other than 0.6 with error in coembedding

Hello,

I tried to cluster my scRNA and scATAC datasets using a resolution other than 0.6 (this number was used in the MAESTRO vignettes), but I got an error saying that "ATAC_snn_res.0.6" or "RNA_snn_res.0.6" cannot be found in the Seurat object when I used the "Incorporate" function to coembed the two datasets. Do I need to use a default resolution of 0.6 before coembedding? But can I change the resolution after coembedding?

Thank you!

Bug: bdg to bw in Snakemake

The current implementation has a potential problem. The bedtools intersect with -f 1.00 will get rid of the region that is not ENTIRELY included in the chr_limit file. It would be more reasonable to clip instead of discarding the whole region.

rule scatac_bdg2bw:
input:
bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
chrlimit = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr_limit.bed",
chrlen = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr.len",
output:
tmp = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg.tmp",
bigwig = "Result/Analysis/" + config["outprefix"] + "_all.bw"
params:
outdir = "Result/Analysis/" + config["outprefix"]
shell:
"bedtools intersect -a {input.bdg} -b {input.chrlimit} -wa -f 1.00 > {output.tmp};"
"bedSort {output.tmp} {input.bdg};"
"bedGraphToBigWig {input.bdg} {input.chrlen} {output.bigwig};"

Check my gist: https://gist.github.com/taoliu/2469050

gene_peak matrix in RP_model

The value range of gene_peak matrix in RP_model should between 0 and 1. Why does "genes_peaks_score_csr" in scATAC_Genescore.py have a value greater than 1, with a maximum of 1600+.

Python scripts should use argparse

Scripts like scRNA_qc.py use many positional arguments, which is not particularly user friendly if the user needs to run the script outside of the snakemake pipeline. It would be easier for the user if argparse was used so that they can specify command line flags and have a help doc for the script.

Bug: Bugs of relative pathes in the MAESTRO scrna-/scatac-init scripts and Errors in Tutorial

The version of MAESTRO: v1.1.0

Command line bugs:

  1. If --fastq-dir ends with /, the snakemake cmd will give a ‘double /’ warning.
  2. If --fastq-dir is a relative path, it won’t work since this relative path will be added to the config file in a new child directory; the same problem can be found for the option for whitelist file.

Tutorial errors:

  1. The instruction, there is no --directory option but -d.
  2. The assembly names GRCh38 or GRCm38 are not species names!
  3. Please do not hardcode path names in instruction.
  4. Errors in the scRNA instruction: the download link for the STAR index only contains human data. The RSEM library download link is pointing to giggle download.

Is MAESTRO compatible with 10X data derived from nuclear RNA?

Hello,

I'm currently installing the MAESTRO prerequisites and, after reading the paper, I'd like to ask if MAESTRO is compatible with 10X data derived from nuclear RNA, particularly if I'm looking to integrate single-modal snRNA- and snATAC-seq data?

And more specifically, could the use of a pre-mRNA reference and GTF files for alignment, as opposed to standard reference/annotation files, impact a MAESTRO analysis at all?

Until now I have been using Cell Ranger 4 for my analysis which recommends using a pre-mRNA reference and GTF file for nuclear RNA. I had started creating STARsolo compatible versions of these files for my MAESTRO analysis and wondered if this is the best course of action, particularly as 10X have recently released v5 which includes a new function for dealing with intronic reads without the need of a pre-RNA reference, and STARsolo also provides a similar function.

Regardless, it would be useful to hear if you have any recommendations or points of interest that I should consider when running MAESTRO using single-nuclear data.

Many Thanks,

Darren

Enhancement: Optimize the memory usage for calculating gene score (RP) for scATAC-seq with a large number of cells

I have to process a scATAC-seq dataset with over 48,000 cells after merging three conditions and 2 replicates each condition. I kept 200,000 top peaks. The scATAC_cellranger_count.py can finish successfully, however, even our high memory node (260G mem) of our HPC keeps killing scATAC_genescore.py. We need to optimize the memory usage of this script. This issue ticket will track the progress of the optimization.

Documentation clarification

It would be appreciated if the documentation explicitly mentioned that rownames of the input matrixes need to be gene symbols and not other annotations. Took me a while to figure out that LISA requires symbols, and that the mismatch in rownames of my ATAC and RNA data kept causing Incorporate() to segfault. However, I also admit I am using a slightly processed counts matrix for my RNA input, which used ensemble ids as the rownames.

A hardcoded path of LISA in the docker image

Hello,

I am using MAESTRO docker image winterdongqing/maestro, 1.2.1, 57f41ffbe200. When I ran the scRNA-seq module, it reported errors like:

unable to open file: name = '/project/dev/qqin/LISA/lisa_web/cistromedb_data/hg38/cluster_human/DNase_median_for_each_cluster.h5'

This error should be caused by a hardcoded path of LISA. I changed to web LISA, then this module finished successfully.

Bug: Let reticulate find the current Python

If users use MAESTRO for the first time or more accurately use "reticulate" package through ATACCalculateGenescore script for the first time, they will be prompted to decide whether "reticulate" should install an independent miniconda environment by itself.

No non-system installation of Python could be found.
Would you like to download and install Miniconda?
Miniconda is an open source environment management system for Python.
See https://docs.conda.io/en/latest/miniconda.html for more details.

Would you like to install Miniconda? [Y/n]: n

Assuming users are utilizing conda/miniconda version of MAESTRO for their analysis, they should stick to the python in that environment, so answer "n". This has to be done just once. We'd better describe it in our tutorial otherwise users may have a lot of trouble later on if they accidentally answer "Y" to the above question. See the discussion regarding this in rstudio/reticulate#607.

The code related to the issue is

library(reticulate)
source_python(paste(system.file(package="MAESTRO"), "ATACCalculateGenescore.py", sep="/"))

Enhancement: Suggestions on MAESTRO/R/ATACRunSeurat.R

If we look at MAESTRO/R/ATACRunSeurat.R script, it includes many of Seurat functions, such as RunLSI, FindCluster, and RunUMAP. However, the parameters for these functions are mainly fixed with only few tunable parameters for users. How can users tweak the parameter such as n.neighbors or min.dist for UMAP with this script, or the maximum number of components from LSI (n=50 is fixed)? My suggestion is to break this big script into at least three modules -- LSI, UMAP, and DE, and provide more options!

How to generate those QC metrics plots?

Hi,

Thank you for this great work! I just have one question when I go through the Galleries & Tutorials. For the scRNA-seq example, after the RNARunSeurat() function, you use str() to check those features. How to plot those figures below the line of str(pbmc.RNA.res$RNA)?

Thanks,
Frank

Bug: Missing default value or required status for `MAESTRO init` command

The main executable script MAESTRO will throw exceptions when no argument is set for init command:

$ MAESTRO init 
Traceback (most recent call last):
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in <module>
    __import__('pkg_resources').run_script('MAESTRO==1.0.1', 'MAESTRO')
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 79, in <module>
    main()
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 72, in main
    init_workflow(args.directory, args.module)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 21, in init_workflow
    os.makedirs(directory)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/os.py", line 206, in makedirs
    head, tail = path.split(name)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/posixpath.py", line 107, in split
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The solution is either to add required=True for "-m" and "-d", or to set default= for these options.

MAESTRO/MAESTRO/MAESTRO

Lines 61 to 62 in 1eba0a2

workflow.add_argument("-d", "--directory", help = "Path to the directory where the workflow shall be initialized and results shall be stored.")
workflow.add_argument("-m", "--module", choices = ["scATAC","scRNA","integrate"], help="Analysis of scATAC-seq or scRNA-seq, or integrattion analysis")

giggle and lisa tables for drosophila

Is it possible to generate the giggle and lisa tables for drosophila melanogaster? I would like to use these to generate the driver TFs for both scRNA and scATAC-seq samples.

--giggleannotation for other species

Hi Chenfei,

Does MASTRO allow to process data of other species, like zebrafish? If so, how should I prepare the reference genome files, like giggle annotation files?

Thank you!
Yujie

Suggestion to set default assay, fix hard coded lines and more in Incorporate.R

To make this script more flexible for customized analysis, please add DefaultAssay(ATAC) <- "ACTIVITY" to the end of this code block.

MAESTRO/R/Incorporate.R

Lines 47 to 64 in d58fc18

if(is.null(ATAC[["ACTIVITY"]])&method!="Seurat"){
RPmatrix <- RPmatrix[,intersect(colnames(ATAC), colnames(RPmatrix))]
ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(RPmatrix)))
ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = RPmatrix)
DefaultAssay(ATAC) <- "ACTIVITY"
ATAC <- FindVariableFeatures(ATAC)
ATAC <- NormalizeData(ATAC)
ATAC <- ScaleData(ATAC)}
if(method=="Seurat"){
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)
activity.matrix <- activity.matrix[,intersect(colnames(ATAC), colnames(activity.matrix))]
ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(activity.matrix)))
ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
DefaultAssay(ATAC) <- "ACTIVITY"
ATAC <- FindVariableFeatures(ATAC)
ATAC <- NormalizeData(ATAC)
ATAC <- ScaleData(ATAC)}

Also, please fix the hard code at:

activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)

p2 <- DimPlot(CombinedObj, reduction = "umap", group.by = "RNA_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='RNA'),]), label = TRUE, repel = TRUE)

p3 <- DimPlot(CombinedObj, reduction = "umap", group.by = "ATAC_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='ATAC'),]), label = TRUE, repel = TRUE)

Another suggestion is to separate the plotting functions from this script since when users get the CombinedObj, they can plot whatever they want. To combine plotting with computation is unnecessary and will potentially cause some bugs.

lisa unpack error

Hi, I am running MAESTRO 1.2.2 following the tutorial. The scRNA-seq module complained:

lisa unpack: error: argument tarball: ERROR: /home/qtu/work/maestro/lisa/hg38_2.1.tar.gz is not a valid file.

Actually I could directly execute lisa unpack hg38_2.1.tar.gz successful, not sure what caused the error.

BTW, does the option --lisadir lisa/hg38_2.1.tar.gz mean the pipeline will unpack the huge file each time?

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.