dpeerlab / seacells


SEACells algorithm for Inference of transcriptional and epigenomic cellular states from single-cell genomics data

License: GNU General Public License v2.0

Python 3.23% Jupyter Notebook 96.64% R 0.13%

seacells's Introduction

SEACells:

Single-cEll Aggregation for High Resolution Cell States

Installation and dependencies

SEACells is implemented in Python 3.8 and can be installed via pip:

$> pip install cmake
$> pip install SEACells

It can also be installed directly from source:
```
$> git clone https://github.com/dpeerlab/SEACells.git
$> cd SEACells
$> python setup.py install
```
If you are using conda, you can use the environment.yaml to create a new environment and install SEACells.

conda env create -n seacells --file environment.yaml
conda activate seacells

You can also use pip to install the requirements

pip install -r requirements.txt

Then install SEACells itself via pip or from source, as described above.

MulticoreTSNE issues can be solved using

conda create --name seacells -c conda-forge -c bioconda cython python=3.8
conda activate seacells
pip install git+https://github.com/settylab/Palantir@removeTSNE
git clone https://github.com/dpeerlab/SEACells.git
cd SEACells
python setup.py install

SEACells depends on a number of python3 packages available on pypi and these dependencies are listed in setup.py.

All the dependencies will be automatically installed using the above commands.

To uninstall:

$> pip uninstall SEACells

To install the developer version of SEACells, run

git clone https://github.com/dpeerlab/SEACells.git
cd SEACells

pip install -e ".[dev]"
pre-commit install

Usage

ATAC preprocessing: notebooks/ArchR folder contains the preprocessing scripts and notebooks including peak calling using NFR fragments. See notebook here to get started. A version of ArchR that supports NFR peak calling is available here.
Computing SEACells: A tutorial on SEACells usage and results visualization for single-cell data can be found in the [SEACell computation notebook](https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_computation.ipynb).
Gene regulatory toolkit: Peak-gene correlations, gene scores and gene accessibility scores can be computed using the [ATAC analysis notebook](https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_ATAC_analysis.ipynb).
TF activity inference: TF activities along differentiation trajectories can be computed using the [TF activity notebook](https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_tf_activity.ipynb).
Large-scale data integration using SEACells: Details are available in the [COVID integration notebook](https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_COVID_integration.ipynb).
Cross-modality integration : Integration between scRNA and scATAC can be performed following the Integration notebook

Citations

The SEACells manuscript is available on bioRxiv. If you use SEACells for your work, please cite our paper.

@article {Persad2022.04.02.486748,
	author = {Persad, Sitara and Choo, Zi-Ning and Dien, Christine and Masilionis, Ignas and Chalign{\'e}, Ronan and Nawy, Tal and Brown, Chrysothemis C and Pe{\textquoteright}er, Itsik and Setty, Manu and Pe{\textquoteright}er, Dana},
	title = {SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data},
	elocation-id = {2022.04.02.486748},
	year = {2022},
	doi = {10.1101/2022.04.02.486748},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/04/03/2022.04.02.486748},
	eprint = {https://www.biorxiv.org/content/early/2022/04/03/2022.04.02.486748.full.pdf},
	journal = {bioRxiv}
}

Release Notes

seacells's Issues

conda environment yaml file dependency seems conflict

It seems that scanpy=1.8.2 is not compatible with python=3.5?

Question regarding ATAC Pipeline

In the preprint you claim that using fragments from the NFR (nucleosome-free region) "leads to substantially better sensitivity in identification of regulatory elements", referring to Supplementary Fig. 7.
I was wondering which figure you mean, since the supplemental figures do not have captions.

Cannot initialize_archetypes

Hello!
I'm trying to run the example, but I'm facing the following error...
I've installed the package by cloning the repository and running the setup.py without errors.
After downloading the example data I encounter:

>>> model.initialize_archetypes()
Building kernel on X_pca
Computing diffusion components from X_pca for waypoint initialization ... 
Determing nearest neighbor graph...
Done.
Sampling waypoints ...
Done.
Selecting 81 cells from waypoint initialization.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/SEACells/SEACells/core.py", line 127, in initialize_archetypes
    greedy_ix = self._get_greedy_centers(n_mcs=from_greedy + 10)
  File "/root/SEACells/SEACells/core.py", line 243, in _get_greedy_centers
    n = K.shape[0]
AttributeError: 'NoneType' object has no attribute 'shape'

What can be happening here?

Python version: Python 3.8.16

cell-cell affinity matrix

Hi,

Thanks for this fantastic tool!
I read the paper introducing SEACells and found the cell-cell affinity matrix ordered by metacells to be a useful visualization (Fig. 1h in the paper). I'm wondering if the package has a plotting function to make such a visualization.

Thanks!
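
A plot like this can be approximated by hand. The sketch below is not a SEACells function: it assumes you already have a dense cell-cell affinity matrix and one metacell label per cell (the names `affinity` and `labels` are placeholders), and it simply reorders rows and columns so that cells from the same metacell sit next to each other.

```python
import numpy as np


def order_by_metacell(affinity, labels):
    """Reorder a square cell-cell matrix so cells sharing a label are adjacent."""
    affinity = np.asarray(affinity)
    labels = np.asarray(labels)
    if affinity.shape[0] != affinity.shape[1] or affinity.shape[0] != labels.shape[0]:
        raise ValueError("affinity must be square with one label per cell")
    # A stable sort keeps the original cell order within each metacell.
    order = np.argsort(labels, kind="stable")
    return affinity[np.ix_(order, order)], labels[order]


# Toy example: 4 cells assigned to two hypothetical metacells "a" and "b".
toy_affinity = np.array([
    [1.0, 0.1, 0.8, 0.0],
    [0.1, 1.0, 0.2, 0.9],
    [0.8, 0.2, 1.0, 0.1],
    [0.0, 0.9, 0.1, 1.0],
])
toy_labels = np.array(["a", "b", "a", "b"])
sorted_affinity, sorted_labels = order_by_metacell(toy_affinity, toy_labels)
```

The reordered matrix can then be passed to any heatmap function, for example matplotlib's `imshow`, where metacells should show up as blocks along the diagonal.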

Reproducibility issue

Hello and thanks for a great tool!

I would like to re-open this closed issue #16 - I am also seeing different initialized archetypes and therefore SEACells when I repeat the analysis with the same parameters on the same data. I would really like to avoid this randomness as I am computing SEACells as part of a wider pipeline which I would like to make totally reproducible from start to finish. I was wondering if anyone had been able to pinpoint where the randomness is coming from and where we could fix it to get reproducible results?

Thanks very much,
Eva

ArchR code tutorial - addGroupCoverage has no maxFragmentLength param

Hello,

With ArchR 1.0.1+, the function addGroupCoverage has no maxFragmentLength parameter, rendering it impossible to call NFR peaks.

Is there a workaround to this?

Thank you

PWM for mouse genome

Hi,
First of all, amazing tool! I just wanted to enquire if you are thinking of making a pwm available for the mouse genome. I can figure out how to make one on my own, but was wondering if there is one available.

Gene-peak association fails with Cellranger GTF file

I'm trying to use the SEACells.genescores.get_gene_peak_correlations function with the GTF file included in the Cell Ranger GRCh38-2020-A reference. With this GTF file, all of the gene-peak correlations are 0, but I can successfully get the gene-peak correlations if I use the hg38 GTF included in the tutorial, without changing anything else about the function call. I'm not exactly sure why this is happening, since it seems like this function should work with GTF files other than the one in the tutorial. This is an example of my code and output:

Code:

gene_set = rna_meta_ad.var_names[:1000]
gene_peak_cors = SEACells.genescores.get_gene_peak_correlations(atac_meta_ad, rna_meta_ad, 
                                           path_to_gtf=path_to_repository + '/data/genes.gtf', 
                                           span=100000, 
                                           n_jobs=1,
                                           gene_set=gene_set)
gene_peak_cors[gene_peak_cors == 0]

Output:

LINC00115    0
NOC2L        0
KLHL17       0
ISG15        0
            ..
FCER1G       0
TOMM40L      0
NR1I3        0
MPZ          0
SDHC         0
Length: 1000, dtype: int64

RuntimeError: cannot cache function 'sparse_mean_var_minor_axis'

I am running SEACells into a singularity environment, and installation is smooth with

pip3 install cmake
pip3 install SEACells

But when I then open a singularity shell and I try to import SEACells in the python terminal I get the following error:

$ apptainer run docker://registry.git.embl.de/procacci/seacells_docker:latest 
INFO:    Using cached SIF image
Python 3.8.16 (default, Apr 12 2023, 15:00:48) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import SEACells
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-2opxlic0 because the default path (/home/procacci/.cache/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/SEACells/__init__.py", line 1, in <module>
    from . import core
  File "/usr/local/lib/python3.8/site-packages/SEACells/core.py", line 3, in <module>
    import palantir
  File "/usr/local/lib/python3.8/site-packages/palantir/__init__.py", line 3, in <module>
    from . import io
  File "/usr/local/lib/python3.8/site-packages/palantir/io.py", line 5, in <module>
    import scanpy as sc
  File "/usr/local/lib/python3.8/site-packages/scanpy/__init__.py", line 14, in <module>
    from . import tools as tl
  File "/usr/local/lib/python3.8/site-packages/scanpy/tools/__init__.py", line 1, in <module>
    from ..preprocessing import pca
  File "/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/__init__.py", line 1, in <module>
    from ._recipes import recipe_zheng17, recipe_weinreb17, recipe_seurat
  File "/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/_recipes.py", line 8, in <module>
    from ._deprecated.highly_variable_genes import (
  File "/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/_deprecated/highly_variable_genes.py", line 11, in <module>
    from .._utils import _get_mean_var
  File "/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/_utils.py", line 46, in <module>
    def sparse_mean_var_minor_axis(data, indices, major_len, minor_len, dtype):
  File "/usr/local/lib/python3.8/site-packages/numba/core/decorators.py", line 212, in wrapper
    disp.enable_caching()
  File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 863, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/usr/local/lib/python3.8/site-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/usr/local/lib/python3.8/site-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'sparse_mean_var_minor_axis': no locator available for file '/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/_utils.py'

Has anyone had this problem before? I don't understand whether this is a singularity-specific problem or a
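
One thing that may be worth trying: the traceback ends in Numba's on-disk function cache, and the Matplotlib warning above it already points at a non-writable home directory. The sketch below points both caches at a writable temporary directory before anything imports scanpy. `NUMBA_CACHE_DIR` and `MPLCONFIGDIR` are the environment variables those two libraries read; the helper name `configure_cache_dirs` is made up for this example, and whether this resolves the error in this particular container is an assumption.

```python
import os
import tempfile


def configure_cache_dirs(root=None):
    """Point the Numba and Matplotlib caches at writable directories."""
    if root is None:
        # Assumption: the system temp directory is writable inside the container.
        root = tempfile.mkdtemp(prefix="seacells_cache_")
    paths = {
        "NUMBA_CACHE_DIR": os.path.join(root, "numba"),
        "MPLCONFIGDIR": os.path.join(root, "matplotlib"),
    }
    for var, path in paths.items():
        os.makedirs(path, exist_ok=True)
        os.environ[var] = path
    return paths


cache_paths = configure_cache_dirs()
# Import only after the variables are set, e.g.:
# import SEACells
```

The same variables can also be exported in the shell or container definition instead of being set from Python.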

Computing SEACells on individual samples

@sitarapersad et al., thanks for this incredibly interesting package! I'm really excited about metacells, cluster-free DE, etc., coming through in scRNA-seq analyses to better dissect true biological differences between health states rather than risking blunting both technical and biological differences with current integration methods.

As someone who mostly uses R rather than Python, I was hoping to get some guidance on the following:

For the COVID example, did you fully process each individual sample (e.g., removing low quality cells/doublets, normalization, feature selection, and dim. reduction with cluster annotation) before running SEACells on them? And then merge the 20 .h5ad files together in order to generate the aggregated metacell x gene expression matrix? Was curious, as there were already cell state labels in the vignette (https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_COVID_integration.ipynb). The reason I ask is that I currently have a fully integrated dataset from ~22 samples from the Multiome kit (spanning two conditions: healthy and disease) - a Seurat object where I did batch correction using Harmony with each individual patient sample as a batch. Figure 6 in your paper was really illuminating, leading me here. I wanted to just work with the RNA side first to get the hang for this tool. I'm wondering whether I can subset my 22 patient Seurat object into individual samples and run the initial SEACells on each one individually without needing to re-run all the pre-processing. Happy to start from scratch, if that's your recommendation however. EDIT: I assume the approach would be the same for ATAC in the paired Multiome dataset?
If different from the above, could you please share the pre-processing code that you used on each individual sample?
Would you happen to have example scripts of how you used meta2cells in downstream analyses? For example, I'm interested in performing DEG analysis using 1) voom-limma and 2) miloDE (for "cluster"-free DEG) and wondering how I should modify my typical input - comparing across individual meta2cells rather than the clusters that I'm used to?

Thanks!

Type checking needed in initialize_archetypes

Hi, I was following along with this tutorial, and I kept coming across this error message triggered on line 194 of core.py:

‘numpy.float64’ object cannot be interpreted as an integer

Based on the tutorial, it was recommended to select one metacell for every 75 single cells, so I thought I was being clever by doing this to automatically select the number of metacells:

n_SEACells = np.floor(adata.obs.shape[0] / 75)

However, this code returns a numpy.float64, which screws up a bunch of the downstream steps such as initialize_archetypes. Basically, this parameter n_SEACells throws an error if it is not an int, so I am suggesting to add in a type check for this parameter so you can throw a more informative error in the future.
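
A minimal sketch of both points, with made-up numbers: computing the count with integer division so it is an int from the start, and a check of the kind suggested above. `check_n_seacells` is a hypothetical helper, not part of SEACells.

```python
import numbers

import numpy as np


def check_n_seacells(n_SEACells):
    """Raise a clear error unless n_SEACells is a positive integer."""
    # numbers.Integral covers Python int and NumPy integer types, but not floats.
    if isinstance(n_SEACells, bool) or not isinstance(n_SEACells, numbers.Integral):
        raise TypeError(
            f"n_SEACells must be an integer, got {type(n_SEACells).__name__}"
        )
    if n_SEACells <= 0:
        raise ValueError("n_SEACells must be positive")
    return int(n_SEACells)


n_cells = 593  # stand-in for adata.obs.shape[0]
as_float = np.floor(n_cells / 75)  # numpy.float64, the kind of value that triggers the error
as_int = n_cells // 75             # integer division gives a plain int directly
n_SEACells = check_n_seacells(as_int)
```

Wrapping the original expression as `int(np.floor(adata.obs.shape[0] / 75))` would work equally well; the check only makes the failure mode explicit.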

get_sizes() error

Hi,

The Counter class is not imported, so a NameError is thrown when calling get_sizes():

    487 def get_sizes(self):
    488     """Return size of each SEACell as array
    489     """
--> 490     return Counter(np.argmax(self.A_, axis=0))

NameError: name 'Counter' is not defined

Thanks!
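
The fix is presumably a one-line import from `collections`. The standalone sketch below mirrors the method body shown in the traceback; `A` stands in for the model's `A_` matrix (with SEACells as rows and cells as columns, as the `axis=0` argmax suggests), and the free function `get_sizes` is for illustration only.

```python
from collections import Counter

import numpy as np


def get_sizes(A):
    """Count how many cells have their largest weight on each SEACell (row of A)."""
    hard_assignment = np.argmax(A, axis=0)
    # tolist() yields plain Python ints as Counter keys.
    return Counter(hard_assignment.tolist())


# Toy assignment matrix: 2 SEACells (rows) x 3 cells (columns).
A = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.8, 0.9]])
sizes = get_sizes(A)
```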

Peak calling on NFR fragments

Hi,

Great paper, thanks for developing SEACells!

I'm trying to follow your approach to peak call on NFR fragments using the following functions in ArchR:
proj <- addGroupCoverages(proj, maxFragmentLength=147)
proj <- addReproduciblePeakSet(proj)
proj <- addPeakMatrix(proj, maxFragmentLength=147, ceiling=10^9)

But when I try to install the modified ArchR pipeline from GitHub directly from the Greenleaf lab I'm getting:
Skipping install of 'ArchR' from a github remote, the SHA1 (GreenleafLab/ArchR@79953a9) has not changed since last install.

Forcing to install from github results in:
installation of package ‘.../tmp/RtmpM2couO/file28dbc3dd83145/ArchR_1.0.2.tar.gz’ had non-zero exit status

I tried pulling directly from https://github.com/dpeerlab/ArchR as well and it didn't work.

Can you help? I'm not sure how to get it otherwise.

Thanks
Ivan

Issue with model.initialize_archetypes()

I have an AnnData object (adata) from which I am trying to create a SEACells model. The AnnData object has 593 rows corresponding to 593 cells.
I'm running into an error at this line of code which I pulled from one of the tutorial notebooks:
`

    # SEACells parameter setup
    n_SEACells = 1 + int(len(adata)/75)
    build_kernel_on = "XPCA"
    n_waypoint_eigs = 10 # Number of eigenvalues to consider when initializing metacells
    
    # Build SEACells model
    model = SEACells.core.SEACells(adata,
              build_kernel_on=build_kernel_on,
              n_SEACells=n_SEACells,
              use_gpu=True,
              n_waypoint_eigs=n_waypoint_eigs,
              convergence_epsilon = 1e-5)
    model.construct_kernel_matrix()
    M = model.kernel_matrix

    # Initialize archetypes
    model.initialize_archetypes()`

The error I'm getting is an IndexError that seems to be caused by Palantir:

Traceback (most recent call last):
File "1_make_metacells.py", line 176, in main
model.initialize_archetypes()
File "/users/salil512/miniconda3/envs/seacells/lib/python3.8/site-packages/SEACells/gpu.py", line 162, in initialize_archetypes
waypoint_ix = self._get_waypoint_centers(k)
File "/users/salil512/miniconda3/envs/seacells/lib/python3.8/site-packages/SEACells/gpu.py", line 286, in _get_waypoint_centers
waypoint_init = palantir.core._max_min_sampling(data=dc_components, num_waypoints=k)
File "/users/salil512/miniconda3/envs/seacells/lib/python3.8/site-packages/palantir/core.py", line 145, in _max_min_sampling
dists[:, 0] = abs(vec - data[ind].values[iter_set])
IndexError: index 0 is out of bounds for axis 1 with size 0

I have the latest version of SEACells and my palantir version is 1.1

Some help debugging this would be appreciated. Thanks.

Fail to install seacells via conda due to conflicts in environment.yaml

Hi! I tried to install SEACells via option 2, i.e.

conda env create -n seacells --file environment.yaml
conda activate seacells

However, there seem to be conflicts in the specifications, and below are the error messages from conda.

Any insights will be greatly appreciated!!

Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed
Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package fa2-0.3.5-py36h8c4c3a4_0 requires python_abi 3.6.* *_cp36m, but none of the providers can be installed

Could not solve for environment specs
The following packages are incompatible
├─ fa2 is installable with the potential options
│  ├─ fa2 0.3.5 would require
│  │  ├─ python >=3.6,<3.7.0a0  with the potential options
│  │  │  ├─ python [3.6.0|3.6.1|...|3.6.9], which can be installed;
│  │  │  └─ python 3.6.12 would require
│  │  │     └─ pypy3.6 7.3.3.* , which can be installed;
│  │  └─ python_abi 3.6.* *_cp36m, which can be installed;
│  ├─ fa2 0.3.5 would require
│  │  ├─ python >=3.10,<3.11.0a0 , which can be installed;
│  │  └─ python_abi 3.10.* *_cp310, which can be installed;
│  ├─ fa2 0.3.5 would require
│  │  ├─ python >=3.11,<3.12.0a0 , which can be installed;
│  │  └─ python_abi 3.11.* *_cp311, which can be installed;
│  ├─ fa2 0.3.5 would require
│  │  ├─ python >=3.7,<3.8.0a0 , which can be installed;
│  │  └─ python_abi 3.7.* *_cp37m, which can be installed;
│  ├─ fa2 0.3.5 would require
│  │  ├─ python >=3.8,<3.9.0a0 , which can be installed;
│  │  └─ python_abi 3.8.* *_cp38, which can be installed;
│  └─ fa2 0.3.5 would require
│     ├─ python >=3.9,<3.10.0a0  with the potential options
│     │  ├─ python [3.9.0|3.9.1|...|3.9.9], which can be installed;
│     │  └─ python 3.9.18 would require
│     │     └─ pypy3.9 7.3.13.* , which can be installed;
│     └─ python_abi 3.9.* *_cp39, which can be installed;
├─ louvain >=0.6,!=0.6.2  is installable with the potential options
│  ├─ louvain 0.6.1 would require
│  │  └─ python_abi * *_cp27mu, which conflicts with any installable versions previously reported;
│  ├─ louvain 0.6.1 would require
│  │  └─ python_abi * *_cp35m, which conflicts with any installable versions previously reported;
│  ├─ louvain 0.6.1, which can be installed;
│  ├─ louvain [0.6.1|0.7.0] would require
│  │  └─ python_abi 3.6.* *_cp36m, which can be installed;
│  ├─ louvain 0.6.1 would require
│  │  └─ python_abi 3.6 *_pypy36_pp73, which can be installed;
│  ├─ louvain [0.6.1|0.7.0|0.7.1|0.8.0] would require
│  │  └─ python_abi 3.7.* *_cp37m, which can be installed;
│  ├─ louvain 0.6.1 would require
│  │  └─ python_abi * *_cp37m, which conflicts with any installable versions previously reported;
│  ├─ louvain [0.6.1|0.7.0|0.7.1|0.8.0|0.8.1] would require
│  │  └─ python_abi 3.8.* *_cp38, which can be installed;
│  ├─ louvain 0.6.1 would require
│  │  └─ python_abi * *_cp38, which conflicts with any installable versions previously reported;
│  ├─ louvain [0.6.1|0.7.0|0.7.1|0.8.0|0.8.1] would require
│  │  └─ python_abi 3.9.* *_cp39, which can be installed;
│  ├─ louvain [0.7.1|0.8.0|0.8.1] would require
│  │  └─ python_abi 3.10.* *_cp310, which can be installed;
│  ├─ louvain 0.7.1 would require
│  │  └─ python_abi 3.7 *_pypy37_pp73, which can be installed;
│  ├─ louvain [0.7.1|0.8.0|0.8.1] would require
│  │  └─ python_abi 3.8 *_pypy38_pp73, which can be installed;
│  ├─ louvain [0.7.1|0.8.0|0.8.1] would require
│  │  └─ python_abi 3.9 *_pypy39_pp73, which can be installed;
│  ├─ louvain 0.8.0 would require
│  │  └─ python_abi 3.11.* *_cp311, which can be installed;
│  └─ louvain 0.8.1 would require
│     ├─ python >=3.11,<3.12.0a0 , which can be installed;
│     └─ python_abi 3.11.* *_cp311, which can be installed;
├─ python 3.5**  is not installable because there are no viable options
│  ├─ python [3.5.1|3.5.2|3.5.3|3.5.4|3.5.5] would require
│  │  └─ python_abi * *_cp35m, which conflicts with any installable versions previously reported;
│  └─ python [3.5.4|3.5.5|3.5.6] conflicts with any installable versions previously reported;
└─ python_abi is requested and can be installed.

Aggregation of expression data ends up in matrix filled with NaN's

Dear SEACell team,

first of all thank you for such an interesting and versatile tool.
I have been recently using it for creating metacells from a scRNA-seq dataset with cells coming from different studies and in turn from different patients.
I wanted to try to repeat the workflow shown for the COVID dataset integration, but I am still at the first round of metacells.
I am running the basic pipeline shown in notebooks/SEACell_computation.ipynb iteratively across the samples, and I am using the soft assignment for binning the cells.
Everything seems to run smoothly, except for some samples which have no apparent difference (in data) from the other ones. In those cases, the expression matrix of the metacell (X slot) is completely filled with NaN's, even though the X slot of the starting anndata object, the anndata layer used for aggregation, and the X_pca are not.
This is an example:

adata.X.toarray() :
adata.layers['norm_counts'].toarray() :
adata.obsm['X_pca'] :
whereas this the output of metacell.X.toarray() :

I have also inspected the figures produced in the workflow, but none of them looks abnormal based on my understanding (should I pay attention to one of them specifically in this case? If so, what should I look at?)

Finally, this is the code I have used for producing the metacell object:

for sample in rerun_these :
    print("Analyzing", sample)
    ad_tmp = adata_big[adata_big.obs['Sample'] == sample].copy()
    
    n_SEACells = ceil(ad_tmp.n_obs / 75)
    
    # renormalize 
    ad_tmp.X = ad_tmp.layers['counts'].copy()
    sc.pp.normalize_total(ad_tmp, target_sum=1e4)
    ad_tmp.layers['norm_counts'] = ad_tmp.X.copy()
    
    # rerun pca
    sc.pp.log1p(ad_tmp)
    sc.pp.pca(ad_tmp, n_comps=50)
    
    model = SEACells.core.SEACells(ad_tmp, 
                  build_kernel_on=build_kernel_on, 
                  n_SEACells= n_SEACells , 
                  n_waypoint_eigs=n_waypoint_eigs,
                  convergence_epsilon = 1e-5)
    
    model.construct_kernel_matrix()
    M = model.kernel_matrix
    
    model.initialize_archetypes()
                
    model.fit(min_iter=10, max_iter=1000)
           
    SEACell_soft_ad = SEACells.core.summarize_by_soft_SEACell(ad_tmp, model.A_, celltype_label='celltype_col',summarize_layer='norm_counts', minimum_weight=0.05)    
    
    rerun_dict[sample] = SEACell_soft_ad

Thank you for any help or suggestions you can provide!

Vittorio

anndata     0.7.6
scanpy      1.8.1
sinfo       0.3.4
-----
PIL                 8.2.0
SEACells            NA
backcall            0.2.0
bottleneck          1.3.2
cairo               1.20.1
cffi                1.14.5
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.1
debugpy             1.3.0
decorator           5.0.7
fcsparser           0.2.3
h5py                3.2.1
igraph              0.9.6
ipykernel           6.0.0
ipython_genutils    0.2.0
ipywidgets          7.6.3
jedi                0.18.0
joblib              1.0.1
kiwisolver          1.3.1
leidenalg           0.8.7
llvmlite            0.36.0
loompy              3.0.7
matplotlib          3.4.2
matplotlib_inline   NA
mpl_toolkits        NA
natsort             7.1.1
ncls                0.0.67
netifaces           0.10.9
networkx            2.5.1
numba               0.53.1
numexpr             2.7.3
numpy               1.20.3
numpy_groupies      0.9.14
packaging           20.9
palantir            1.2
pandas              1.2.4
parso               0.8.2
pexpect             4.8.0
phenograph          1.5.7
pickleshare         0.7.5
pkg_resources       NA
progressbar         4.2.0
prompt_toolkit      3.0.19
psutil              5.8.0
ptyprocess          0.7.0
pycparser           2.20
pyexpat             NA
pygam               0.8.0
pygments            2.9.0
pynndescent         0.5.4
pyparsing           2.4.7
pyranges            0.0.110
pyrle               0.0.33
python_utils        NA
pytoml              NA
pytz                2021.1
scipy               1.6.3
seaborn             0.11.2
setuptools_scm      NA
simplejson          3.17.2
sitecustomize       NA
six                 1.16.0
sklearn             0.24.2
sorted_nearest      0.0.32
sphinxcontrib       NA
statsmodels         0.12.2
storemagic          NA
tables              3.6.1
tabulate            0.8.9
texttable           1.6.4
tornado             6.1
tqdm                4.61.2
traitlets           5.0.5
typing_extensions   NA
umap                0.5.1
wcwidth             0.2.5
zmq                 22.1.0
-----
IPython             7.25.0
jupyter_client      6.1.12
jupyter_core        4.7.1
notebook            6.4.0
-----
Python 3.9.5 (default, Dec 21 2022, 10:33:37)

Scale or not scale be more suitable for computation of SEACells?

Dear investigators,

First of all, thank you for developing such a great tool! I read the SEACell_computation.ipynb notebook carefully and found that the input for computing SEACells is normalized but not scaled. I tried both (with and without scaling) and so far I have not found any advantages or disadvantages to either. Do you have any recommendations on this?

Many thanks in advance,

Anlin

maxFragmentLength=147

Hi all,

When using addGroupCoverages(proj, maxFragmentLength=147) in ArchR, I get the error "unused argument: maxFragmentLength".
I also tried setting maxFragSize in createArrowFiles, but it didn't work.

Any suggestions please?
Many thanks

Save Model

Hello,

How do you recommend we save the model object in the tutorials?

SEACells on integrated dataset

Hi,
I'd like to apply SEACells to a dataset with multiple samples from the same condition. It's very heterogeneous, as they are samples from patients with cancer, but still the same condition. I'm wondering which is the best way to use SEACells, and I'm debating between two options:

1- Replicate the COVID dataset workflow in your preprint: run SEACells independently per sample => merge metacell matrices => integrate with Harmony => run SEACells on Harmony reduction.

2- Alternative pipeline: Integrate every sample with Harmony at the cell level => run SEACells on Harmony reduction.

I'm personally inclined to use the second option, as I see more compact metacells across the UMAP plot and more metacells with cells from just 1-2 clusters. But I'd like to ask whether you think it is appropriate to do this.

Thanks a lot!

Reproducible example for the PBMC dataset

Hi,
Could it be possible to make public the script used to produce the results for the 10x PBMC dataset? Or at least provide the matrix of barcode/Metacells.

Thanks

PWM.h5ad in SEACells/notebooks/SEACell_tf_activity.ipynb

How was the pwm.h5ad file used in the SEACell_tf_activity.ipynb notebook constructed?

R version of SEACells

Dear author,

I wonder if SEACells will roll out an R version in addition to the current python implementation? Thanks!

wonder why the @parm `ceiling` can be set to 10^9

Thanks for developing such a powerful tool!
I'm wondering why ceiling can be set to 10^9. That is far from the default setting of 4, and this parameter is used to prevent large biases in peak counts.
https://github.com/dpeerlab/SEACells/blob/3462c624ffae0df6d3930490f345f00196c3503e/notebooks/ArchR/ArchR-preprocessing-nfr-peaks.R#L67C9-L67C22
Sorry for the bother, but it is a little confusing.

Which modality to use in Multiome for metacell inference

Hi,

In Multiome data which data modality would you recommend to use for the inference of metacells? RNA or ATAC?
I'm asking because this tutorial uses ATAC, but your bioRxiv preprint suggests that this is actually harder:

As a further challenge, we ran Palantir on aggregated RNA from metacells computed on the ATAC modality, since the sparsity of scATAC-seq data renders cell-state identification much more difficult

I know it depends on the biological context but, would you recommend as a rule of thumb to use RNA instead for metacell inference? What do you think?

Thank you for your time!

Custom Neighborhood Graph for Metacells

Hello,

I was wondering if it's possible to use a custom neighborhood graph for metacell construction. Looking through the source code and the tutorials, it seems that the methods depend on providing an anndata object with a low-dimension representation as input.

For example, if I wanted to build metacells using the wnn graph, would that be possible? Are there methods that I could use to create metacells by just providing the distances array from running scanpy.pp.neighbors or muon.pp.neighbors?

This would also be helpful for trying different values of k or different metrics for neighborhood graph construction.

Thanks for your input!

model.fit taking too long?

Hello,
I've been going through the computation notebook of SEACells 0.2.0, but I get stuck at this step:

model = SEACells.core.SEACells(ad, 
                  build_kernel_on=build_kernel_on, 
                  n_SEACells=n_SEACells, 
                  n_waypoint_eigs=n_waypoint_eigs,
                  waypt_proportion=waypoint_proportion,
                  convergence_epsilon = 1e-5)

model.fit(n_iter=1)

The tutorial states that it should take ~5 min, but it has been running for hours on an 8-core machine. I tried downsampling to 1.5k cells and it has still been running for an hour. Any idea what is going on?

Here is the output:

Building kernel...
Computing kNN graph using scanpy NN ...
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Computing radius for adaptive bandwidth kernel...
100%
1720/1720 [00:00<00:00, 3748.03it/s]
Making graph symmetric...
Computing RBF kernel...
100%
1720/1720 [00:00<00:00, 1737.40it/s]
Building similarity LIL matrix...
100%
1720/1720 [00:00<00:00, 2886.20it/s]
Constructing CSR matrix...
Building kernel on X_pca
Computing diffusion components from X_pca for waypoint initialization ... 
Determing nearest neighbor graph...

Thanks,
Ricard

Tutorial Data for "SEACell_domain_adapt.ipynb" not available

$ wget https://dp-lab-data-public.s3.amazonaws.com/SEACells/cd34_multiome_rna_with_labels.h5ad -O cd34_multiome_rna_with_labels.h5ad
--2022-04-12 16:44:06-- https://dp-lab-data-public.s3.amazonaws.com/SEACells/cd34_multiome_rna_with_labels.h5ad
Resolving dp-lab-data-public.s3.amazonaws.com (dp-lab-data-public.s3.amazonaws.com)... 52.217.101.236
Connecting to dp-lab-data-public.s3.amazonaws.com (dp-lab-data-public.s3.amazonaws.com)|52.217.101.236|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-04-12 16:44:07 ERROR 404: Not Found.

The files for the other tutorials are available.

Way too strict dependency pins

Hi,

The dependencies pinned in https://github.com/dpeerlab/SEACells/blob/main/requirements.txt are way too strict, which makes it impossible to add SEACells to another Python package that depends on a few other scverse tools. Please unpin them (e.g. cython, anndata, h5py, ...).

Issues of model.initialize_archetypes

Hi, I encountered the issue below with model.initialize_archetypes when following the SEACells analysis tutorial notebook (In [17]) using my own data. Do you have any advice on this? Thank you so much! (By the way, I am able to go through the entire tutorial with the example data using my current setup.)

Issue with cupyx

Hi, I'm running the model.fit() command with gpu and am getting the following error:

File /ref/smlab/software/jindalk/.conda/envs/sc_env/lib/python3.10/site-packages/SEACells/core.py:574, in SEACells.fit(self, max_iter, min_iter, initial_archetypes)
    572 if max_iter < min_iter:
    573     raise ValueError("The maximum number of iterations specified is lower than the minimum number of iterations specified.")
--> 574 self._fit(max_iter=max_iter, min_iter=min_iter, initial_archetypes=initial_archetypes, initial_assignments=None)

File /ref/smlab/software/jindalk/.conda/envs/sc_env/lib/python3.10/site-packages/SEACells/core.py:531, in SEACells._fit(self, max_iter, min_iter, initial_archetypes, initial_assignments)
    519 def _fit(self, max_iter: int = 50, min_iter:int=10, initial_archetypes=None, initial_assignments=None):
    520     """
    521     Compute archetypes and loadings given kernel matrix K. Iteratively updates A and B matrices until maximum
    522     number of iterations or convergence has been achieved.
   (...)
    529 
    530     """
--> 531     self.initialize(initial_archetypes=initial_archetypes, initial_assignments=initial_assignments)
    533     converged = False
    534     n_iter = 0

File /ref/smlab/software/jindalk/.conda/envs/sc_env/lib/python3.10/site-packages/SEACells/core.py:172, in SEACells.initialize(self, initial_archetypes, initial_assignments)
    170 A = np.random.random((k, n))
    171 A /= A.sum(0)
--> 172 A = self._updateA(B, A)
    174 if self.verbose:
    175     print('Randomly initialized A matrix.')

File /ref/smlab/software/jindalk/.conda/envs/sc_env/lib/python3.10/site-packages/SEACells/core.py:335, in SEACells._updateA(self, B, A_prev)
    329 t = 0  # current iteration (determine multiplicative update)
    332 if self.gpu:
    333     # Use the GPU version of the update step
--> 335     K = cupyx.scipy.sparse.csc_matrix(self.K)
    337     Ag = cp.array(A)
    338     Kg = K

NameError: name 'cupyx' is not defined

I do have cupyx installed and working, so I think that for some reason the model is not loading cupyx despite use_gpu being set to True.

Thanks!
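
A NameError of this kind typically appears when a module imports its GPU libraries inside a guarded block and the import did not succeed, so the name is never bound. Whether SEACells does this is an assumption here; the traceback does not confirm it. The sketch below uses only the standard library to check whether `cupy` and `cupyx` can be located from the active environment. `gpu_modules_available` is a hypothetical helper, not part of SEACells.

```python
import importlib.util


def gpu_modules_available(names=("cupy", "cupyx")):
    """Return {module_name: True/False} for each top-level module name."""
    # find_spec returns None when a top-level module cannot be located
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in gpu_modules_available().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```

Running this in the same kernel that raises the error shows whether that interpreter, rather than another environment on the machine, can see the GPU packages.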

gene score matrix with sparse format

In the tutorial, the gene score matrix is exported by

write.csv(scores, "genescore.csv", quote=FALSE)

and imported to python using

gene_scores = pd.read_csv(data_dir + 'gene_scores.csv', index_col=0).T

However, when the number of cells is large, say 150,000, the following command fails:

scores <- as.matrix(scores)

with the error

Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102

This command converts a sparse matrix to a dense matrix, and it seems R cannot handle the conversion of such a large matrix. Is there a way to export the gene score matrix from R and import it into Python as a sparse matrix, given that the gene score matrix is very sparse?
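
One common way to move a sparse matrix between R and Python without densifying it is the Matrix Market format: the R Matrix package provides `writeMM`, and SciPy provides `scipy.io.mmread` and `scipy.io.mmwrite`. The sketch below shows only the Python side. To stay self-contained it first writes a small random sparse matrix as a stand-in for the R export; the file name, the genes-by-cells orientation, and the matrix itself are assumptions for illustration. Row and column names are not stored in the file and would need to be exported separately.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Stand-in for the file R would produce, e.g. with Matrix::writeMM(scores, "gene_scores.mtx").
# Assumed orientation: genes in rows, cells in columns.
rng = np.random.default_rng(0)
dense = rng.random((5, 8))
dense[dense < 0.7] = 0.0  # zero out most entries to mimic sparsity
scores = sparse.csc_matrix(dense)

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "gene_scores.mtx")
io.mmwrite(path, scores)

# Python side: read as sparse, transpose to cells x genes, and keep it sparse.
gene_scores = sparse.csr_matrix(io.mmread(path).T)

print(gene_scores.shape)             # (8, 5): cells x genes
print(sparse.issparse(gene_scores))  # True
```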

peak calling for each metacell

Hi, thank you so much for providing such a valuable tool! I have a couple of questions, and thank you in advance for your help!

Can I perform peak calling on a pseudo-bulk sample of each metacell?
How many cells or fragments per metacell are sufficient for reliable peak calling?
Is there anything else I should pay attention to for this purpose?

model.fit hangs indefinitely

Hi,

I was able to run this tutorial on the provided sample data. However, I am running into problems when using my own dataset (~58k cells and ~30k genes). Here is my SEACells code:


# they recommend one metacell for every 75 real cells
n_SEACells = int(np.floor(adata.obs.shape[0] / 75))

#build_kernel_on = 'X_pca' # key in ad.obsm to use for computing metacells
build_kernel_on = 'X_harmony' # key in ad.obsm to use for computing metacells
                          # This would be replaced by 'X_svd' for ATAC data

## Additional parameters
n_waypoint_eigs = 10 # Number of eigenvalues to consider when initializing metacells
waypoint_proportion = 0.9 # Proportion of metacells to initialize using waypoint analysis,
                        # the remainder of cells are selected by greedy selection


model = SEACells.core.SEACells(adata,
                  build_kernel_on=build_kernel_on,
                  n_SEACells=n_SEACells,
                  n_waypoint_eigs=n_waypoint_eigs,
                  waypt_proportion=waypoint_proportion,
                  convergence_epsilon = 1e-5)

# Initialize archetypes
model.initialize_archetypes()

model.fit(n_iter=20)

Randomly initialized A matrix.
Setting convergence threshold at 0.05610640134547047
Starting iteration 1.
Completed iteration 1.

It hangs for several hours after completing iteration 1, and eventually I killed it. #7 references model.fit taking too long, but this seems like a different issue. Please let me know if you have any insights!

ValueError: row, column, and data array must all be the same length when running model.fit

I get the following error when running model.fit unless I set model.k = len(model.archetypes):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[38], line 1
----> 1 model.fit(min_iter=1, max_iter=1)

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells-0.3.3-py3.8.egg/SEACells/cpu.py:608, in SEACellsCPU.fit(self, max_iter, min_iter, initial_archetypes, initial_assignments)
    605 if max_iter < min_iter:
    606     raise ValueError(
    607         "The maximum number of iterations specified is lower than the minimum number of iterations specified.")
--> 608 self._fit(max_iter=max_iter, min_iter=min_iter, initial_archetypes=initial_archetypes,
    609           initial_assignments=initial_assignments)

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells-0.3.3-py3.8.egg/SEACells/cpu.py:562, in SEACellsCPU._fit(self, max_iter, min_iter, initial_archetypes, initial_assignments)
    550 def _fit(self, max_iter: int = 50, min_iter: int = 10, initial_archetypes=None, initial_assignments=None):
    551     """
    552     Internal method to compute archetypes and loadings given kernel matrix K.
    553     Iteratively updates A and B matrices until maximum number of iterations or convergence has been achieved.
   (...)
    560     :return: None
    561     """
--> 562     self.initialize(initial_archetypes=initial_archetypes, initial_assignments=initial_assignments)
    564     converged = False
    565     n_iter = 0

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells-0.3.3-py3.8.egg/SEACells/cpu.py:220, in SEACellsCPU.initialize(self, initial_archetypes, initial_assignments)
    218 rows = self.archetypes
    219 shape = (n, k)
--> 220 B0 = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=shape)
    222 self.B0 = B0
    223 B = self.B0.copy()

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/scipy/sparse/_compressed.py:53, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
     49 else:
     50     if len(arg1) == 2:
     51         # (data, ij) format
     52         other = self.__class__(
---> 53             self._coo_container(arg1, shape=shape, dtype=dtype)
     54         )
     55         self._set_self(other)
     56     elif len(arg1) == 3:
     57         # (data, indices, indptr) format

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/scipy/sparse/_coo.py:196, in coo_matrix.__init__(self, arg1, shape, dtype, copy)
    193 if dtype is not None:
    194     self.data = self.data.astype(dtype, copy=False)
--> 196 self._check()

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/scipy/sparse/_coo.py:281, in coo_matrix._check(self)
    278 self.col = np.asarray(self.col, dtype=idx_dtype)
    279 self.data = to_native(self.data)
--> 281 if self.nnz > 0:
    282     if self.row.max() >= self.shape[0]:
    283         raise ValueError('row index exceeds matrix dimensions')

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/scipy/sparse/_base.py:299, in spmatrix.nnz(self)
    291 @property
    292 def nnz(self):
    293     """Number of stored values, including explicit zeros.
    294 
    295     See also
    296     --------
    297     count_nonzero : Number of non-zero entries
    298     """
--> 299     return self.getnnz()

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/scipy/sparse/_coo.py:243, in coo_matrix.getnnz(self, axis)
    241 nnz = len(self.data)
    242 if nnz != len(self.row) or nnz != len(self.col):
--> 243     raise ValueError('row, column, and data array must all be the '
    244                      'same length')
    246 if self.data.ndim != 1 or self.row.ndim != 1 or \
    247         self.col.ndim != 1:
    248     raise ValueError('row, column, and data arrays must be 1-D')

ValueError: row, column, and data array must all be the same length
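
The traceback shows `B0` being built from `rows = self.archetypes`, a data array of length `len(rows)`, and shape `(n, k)`. If `cols` has `k` entries (an assumption, since its definition is not visible in the traceback) and fewer than `k` archetypes were initialized, the three arrays disagree in length. That would be consistent with the workaround of setting `model.k = len(model.archetypes)`. A minimal sketch of the mismatch, with `build_B0` as a hypothetical stand-in for the library code:

```python
import numpy as np
from scipy.sparse import csr_matrix

n, k = 10, 4                      # cells, requested number of SEACells
archetypes = np.array([0, 3, 7])  # only 3 archetypes were initialized


def build_B0(rows, k, n):
    # Assumption: one column index per requested SEACell
    cols = np.arange(k)
    return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, k))


try:
    build_B0(archetypes, k, n)
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print("k = 4:", mismatch_error)

# With k matched to the number of archetypes the construction succeeds
B0 = build_B0(archetypes, len(archetypes), n)
print("k = 3:", B0.shape)
```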

Error in model.summarize_by_metacell when raw count is stored at other place

When I run model.summarize_by_metacell, I get the following error. My raw counts are stored in another location, and it would be good if this function provided an argument that lets users point to the raw counts instead of looking in ad.raw.X by default.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-b9b826f2be54> in <module>
----> 1 metacell_ad = model.summarize_by_metacell(aggregate_by='sum')
      2 metacell_ad

~/miniconda3/envs/scrna/lib/python3.8/site-packages/metacells/core.py in summarize_by_metacell(self, aggregate_by)
    529         assert aggregate_by in ['sum', 'mean'], 'aggregate_by must be either sum or mean'
    530 
--> 531         features = pd.DataFrame(self.ad.raw.X.todense()).set_index(self.ad.obs_names)
    532         features = features.join(self.ad.obs[['Metacell']])
    533 

AttributeError: 'NoneType' object has no attribute 'X'
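
Until such an argument exists, the aggregation can be done outside the library. The sketch below sums a cells-by-genes count matrix per metacell label using NumPy only. `sum_by_label`, the toy counts, and the labels are hypothetical; in practice the counts would come from wherever the raw data is stored and the labels from the `Metacell` column that the traceback shows being joined.

```python
import numpy as np


def sum_by_label(counts, labels):
    """Sum the rows of `counts` (cells x genes) within each label."""
    labels = np.asarray(labels)
    names, inverse = np.unique(labels, return_inverse=True)
    out = np.zeros((names.size, counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: rows sharing a label accumulate into the same output row
    np.add.at(out, inverse, counts)
    return names, out


counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
labels = ["SEACell-0", "SEACell-1", "SEACell-0", "SEACell-1"]

names, summed = sum_by_label(counts, labels)
print(names)   # ['SEACell-0' 'SEACell-1']
print(summed)  # [[1 3], [6 5]]
```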

The data cannot be downloaded

The data at https://dp-lab-data-public.s3.amazonaws.com/SEACells/covid_pbmcs.h5ad cannot be downloaded. Could you please check?

Different metacells gained when repeating analysis on same data set

Hi,
Thanks again for your package :).

I've noticed that whenever I run the metacell calculation on the same dataset, I get different initialized archetypes and because of that different metacells. I tried to add a random seed via numpy in the notebook but I'm still getting varying results.

Is there a solution for this that I have overlooked?
Thanks a lot and best regards,
Marie
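
A traceback quoted elsewhere on this page shows the A matrix being initialized with `np.random.random`, which draws from NumPy's global (legacy) state and is controlled by `np.random.seed`. That call does not affect generators created with `np.random.default_rng()` or Python's own `random` module. Whether SEACells or its dependencies use such other sources during archetype initialization is not established here; the sketch only illustrates why seeding NumPy alone can leave results varying.

```python
import random

import numpy as np


def draw_all():
    """Draw one number from three independent random sources."""
    legacy = np.random.random()                   # NumPy global (legacy) state
    generator = np.random.default_rng().random()  # fresh, OS-seeded Generator
    stdlib = random.random()                      # Python's own random module
    return legacy, generator, stdlib


np.random.seed(0)
first = draw_all()
np.random.seed(0)
second = draw_all()

print("legacy reproducible:   ", first[0] == second[0])
print("generator reproducible:", first[1] == second[1])
print("stdlib reproducible:   ", first[2] == second[2])
```

Only the first source repeats after reseeding; the other two need their own seeds (`np.random.default_rng(seed)` and `random.seed(seed)`).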

MemoryError when fitting model.fit

Thank you for a great tool! I tested the tutorial workflow on a subset of my data and it worked like a charm. Now I'm running it on the full dataset of 440k cells. When running model.fit(min_iter=10, max_iter=50), I got the following memory error after 2.5 hours:

MemoryError                               Traceback (most recent call last)
Cell In[12], line 1
----> 1 model.fit(min_iter=10, max_iter=50)

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells/core.py:574, in SEACells.fit(self, max_iter, min_iter, initial_archetypes)
    572 if max_iter < min_iter:
    573     raise ValueError("The maximum number of iterations specified is lower than the minimum number of iterations specified.")
--> 574 self._fit(max_iter=max_iter, min_iter=min_iter, initial_archetypes=initial_archetypes, initial_assignments=None)

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells/core.py:531, in SEACells._fit(self, max_iter, min_iter, initial_archetypes, initial_assignments)
    519 def _fit(self, max_iter: int = 50, min_iter:int=10, initial_archetypes=None, initial_assignments=None):
    520     """
    521     Compute archetypes and loadings given kernel matrix K. Iteratively updates A and B matrices until maximum
    522     number of iterations or convergence has been achieved.
   (...)
    529 
    530     """
--> 531     self.initialize(initial_archetypes=initial_archetypes, initial_assignments=initial_assignments)
    533     converged = False
    534     n_iter = 0

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells/core.py:181, in SEACells.initialize(self, initial_archetypes, initial_assignments)
    178 self.B_ = B
    180 # Create convergence threshold
--> 181 RSS = self.compute_RSS(A, B)
    182 self.RSS_iters.append(RSS)
    184 if self.convergence_threshold is None:

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells/core.py:464, in SEACells.compute_RSS(self, A, B)
    461 if B is None:
    462     B = self.B_
--> 464 reconstruction = self.compute_reconstruction(A, B)
    465 return np.linalg.norm(self.kernel_matrix - reconstruction)

File ~/anaconda3/envs/scanpy_env/lib/python3.8/site-packages/SEACells/core.py:446, in SEACells.compute_reconstruction(self, A, B)
    444 if A is None or B is None:
    445     raise RuntimeError('Either assignment matrix A or archetype matrix B is None.')
--> 446 return (self.kernel_matrix.dot(B)).dot(A)

MemoryError: Unable to allocate 1.41 TiB for an array with shape (440911, 440911) and data type float64

Is it not using sparse matrices?
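
The size in the error message follows directly from the shape: a dense float64 array needs 8 bytes per entry, so an n-by-n array needs 8·n² bytes. The second half of the sketch estimates a CSR matrix of the same shape; the figure of 30 stored values per row and the 4-byte indices are assumptions for illustration, not values taken from SEACells.

```python
n_cells = 440_911
bytes_per_float64 = 8

# Dense n x n array, as in the error message
dense_bytes = n_cells * n_cells * bytes_per_float64
dense_tib = dense_bytes / 2**40
print(f"dense {n_cells} x {n_cells} float64: {dense_tib:.2f} TiB")  # 1.41 TiB

# Rough CSR estimate: float64 data + int32 column indices + int32 row pointers
nnz_per_row = 30  # assumed number of stored values per row
nnz = n_cells * nnz_per_row
sparse_bytes = nnz * (8 + 4) + (n_cells + 1) * 4
print(f"CSR with {nnz_per_row} values per row: {sparse_bytes / 2**30:.2f} GiB")
```

The traceback shows the allocation happening in `compute_reconstruction`, where the product of the kernel with B and A has shape n-by-n, so a sparse kernel alone does not keep that intermediate result sparse.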

Multiome Dataset

Hello,

I hope you are doing well. Thank you for the amazing tool.

I have a multiome dataset measured on the same cells and would like to take advantage of SEACells. Would you recommend that I run the SEACells algorithm on both modalities separately and map the ATAC SEACells to the RNA SEACells?

[Question] Algorithm for PWM computation

Dear authors,

First of all, thank you very much for this great work and package!

I have a question regarding the TF-activity inference. In your tutorial (https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_tf_activity.ipynb), you use a precomputed PWM matrix. In the methods of the associated paper, you mention you used FIMO to compute the scores.

Is it mandatory to use FIMO, or would any motif matching algorithm that provides confidence scores suffice? I've tried using motifmatchr (implemented natively in the Signac package) to compute the PWM matrix; however, the results I obtain are somewhat confusing. Given that this is the only step in which I differ from your analysis, I was wondering if this might be the cause.

If FIMO is the only motif matching algorithm you'd recommend using, could you give a bit more information on how to run it? I couldn't find any Python/R package to apply it. Any help would be appreciated!

Thank you again,
Best,

Plotting functions cannot be silenced

Hi,

Very nice package and documentation, it's very straightforward to use!

I have an issue with the plotting functions though. When I use them in a pipeline, I save the resulting plots as pdfs but I do not want to show the plot at runtime because then it pauses my script until I close the shown plot.

This behavior occurs because many of your plotting functions call plt.show() unconditionally. Instead, you could add a show argument, as many scanpy functions do, so that the user can toggle it. One example would be:

def plot_convergence(self, save_as=None, show=True):
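    """
    Plot behaviour of squared error over iterations.
    :param save_as: (str) name of file which figure is saved as. If None, no plot is saved.
    :param show: (bool) whether to call plt.show() after plotting. Set to False in non-interactive pipelines.
    """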
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure()
    plt.plot(self.RSS_iters)
    plt.title('Reconstruction Error over Iterations')
    plt.xlabel('Iterations')
    plt.ylabel("Squared Error")
    if save_as is not None:
        plt.savefig(save_as, dpi=150)
    if show:
        plt.show()

I know it's a minor complaint, but it is very frustrating to have to close the plots manually. Thanks!

Add pre-commit hooks

For increased readability and easier contributions, a standardized coding style would be ideal. These can easily be enforced using pre-commit hooks.

number of SEACells

Hi,

I ran SEACells requesting 700 SEACells, but I am getting only 543. Any suggestions as to why this is happening? It also takes a long time on my data with default parameters (I have about 60K cells), so I have not yet tried whether simply re-running would change the number of SEACells I obtain.
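
One possible explanation, which nothing on this page confirms: another issue here quotes the hard assignment as `self.A_.argmax(0)`, that is, each cell goes to the archetype with its largest weight. Under that rule, an archetype that is never the top weight for any cell ends up with no cells, so the number of distinct labels can be smaller than the number requested. A toy sketch with a made-up assignment matrix:

```python
import numpy as np

# Toy assignment matrix A with k = 4 archetypes (rows) and n = 6 cells (columns).
# Each column sums to 1. Archetype 2 never has the largest weight in any column.
A = np.array([
    [0.6, 0.5, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.5, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.3, 0.3, 0.3],
    [0.1, 0.1, 0.1, 0.1, 0.4, 0.5],
])

hard = A.argmax(axis=0)  # hard assignment: top archetype per cell
used = np.unique(hard)   # archetypes that received at least one cell

print(hard)  # [0 0 1 1 3 3]
print(f"requested {A.shape[0]} archetypes, {used.size} have cells")
```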

Compute gene scores without GC content annotation

Hey,
Thanks a lot for your package. It is very useful :).

I'm currently working with a multiome dataset for which I want to calculate gene scores for the ATAC data to compare them to the gene expression data. I'm using the dataset from the NeurIPS Competition Open Problems in Single Cell Analysis (https://openproblems.bio/neurips_docs/data/dataset/). However, my dataset has no GC annotation. As a result, the method genescore.prepare_multiome_anndata() crashes and I'm not able to use the follow-up methods to compute the gene scores.
Do you provide a method to calculate this annotation, or is there another easy way to add it to the data? Or is there a way to calculate the gene scores without the GC annotation?

Thanks a lot for your help.
Best regards,
Marie Becker

Local paths in requirements.txt

I've noticed that requirements.txt mostly consists of @ file references, so it's not really usable.

Would be nice to have it since I don't use conda for this.

No speed increase was observed when use_gpu=True

The time taken by the model.fit step seems almost the same with or without the GPU. What's going on here?

Renaming the Seacells

After your commit "Renaming SEACells and fixing plotting" (03cd61c), every SEACell name is generated as [f'SEACell-{i}' for i in self.A_.argmax(0)]. You may therefore need to update the output of get_hard_assignments shown in the tutorial, which would otherwise be confusing. Thank you!

Many cells have zero non-trivial (>0.1) assignments

Hello,

I am running SEACells on a sc-ATAC-seq dataset of 31,000 cells and I am inferring 310 SEACells.

When checking the quality of the results, I am surprised at the number of cells that have zero non-trivial assignments. Is this normal? Is there a way I can maximise the number of cells that have at least one SEACell assigned?

This is the plot I get:

Thank you.
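
A traceback elsewhere on this page shows the columns of A being normalized to sum to 1 (`A /= A.sum(0)`), so a cell whose weight is spread across more than ten archetypes can have every entry at or below 0.1. The sketch below counts non-trivial assignments per cell on a toy matrix; `count_nontrivial` is a hypothetical helper and the matrix is made up.

```python
import numpy as np


def count_nontrivial(A, threshold=0.1):
    """Number of archetypes per cell (column) with weight above threshold."""
    return (np.asarray(A) > threshold).sum(axis=0)


# Toy matrix: 20 archetypes x 3 cells, columns sum to 1
A = np.zeros((20, 3))
A[0, 0] = 1.0        # cell 0: a single archetype
A[[1, 2], 1] = 0.5   # cell 1: split between two archetypes
A[:, 2] = 1.0 / 20   # cell 2: spread evenly, every weight is 0.05

counts = count_nontrivial(A)
print(counts)                # [1 2 0]
print((counts == 0).mean())  # fraction of cells with no non-trivial assignment
```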

SeaCells analysis for each cell type

Given that the number of cells is high and the runtime and memory required increase exponentially, can I perform the SEACells analysis for each cell type separately, or split the dataset into parts and process them individually? Thank you!
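
Whether per-group metacells are appropriate is not answered on this page. As a sketch of the bookkeeping such a split would involve, the code below groups cell indices by label and allocates a number of metacells per group using the one-per-75-cells rule of thumb quoted in an earlier issue. `plan_per_group` and the labels are hypothetical; each index list would then be used to subset the data before building a separate model.

```python
from collections import defaultdict

CELLS_PER_METACELL = 75  # rule of thumb quoted in an issue above


def plan_per_group(labels, cells_per_metacell=CELLS_PER_METACELL):
    """Map each label to (cell indices, number of metacells to request)."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return {
        # Request at least one metacell even for very small groups
        label: (indices, max(1, len(indices) // cells_per_metacell))
        for label, indices in groups.items()
    }


labels = ["T"] * 300 + ["B"] * 160 + ["NK"] * 40  # hypothetical annotation
for label, (indices, n_metacells) in plan_per_group(labels).items():
    print(label, len(indices), n_metacells)
```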
