scanorama's Issues

Maybe fix __version__?

__version__ = 1.5.1

>>> import scanorama
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "anaconda3/lib/python3.6/site-packages/scanorama/__init__.py", line 3
    __version__ = 1.5.1
                      ^
SyntaxError: invalid syntax

If I change it to __version__ = "1.5.1", it works.

How to run different datasets through Scanorama?

Hi,

I'm attempting to integrate spatial data with scRNA-seq using scanorama, following this paper (https://www.nature.com/articles/s41587-019-0113-3). Is it possible to use csv files as input rather than the h5 (spatial data) and h5ad (scRNA-seq) files? For the scRNA-seq data, I currently have one csv file with the numeric cluster assignment for each cell and another in barcodes-by-gene-counts format. For my spatial data, I have a csv file with the spatial locations of the pixels and a DGE file (gene counts by barcodes) containing the raw counts at each pixel.
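
One possible way to do this, sketched with hypothetical file names: wrap each csv in an AnnData object so the data can be handed to scanorama.correct_scanpy() directly, rather than going through h5/h5ad files.

# A sketch, not Scanorama's documented input path; file names are hypothetical.
import pandas as pd
import anndata as ad
import scanorama

# scRNA-seq counts: barcodes x genes.
rna = pd.read_csv("rna_counts.csv", index_col=0)
adata_rna = ad.AnnData(rna.values)
adata_rna.obs_names = rna.index.astype(str)
adata_rna.var_names = rna.columns.astype(str)

# Spatial DGE: genes x barcodes in the file, so transpose to barcodes x genes.
spatial = pd.read_csv("spatial_dge.csv", index_col=0).T
adata_spatial = ad.AnnData(spatial.values)
adata_spatial.obs_names = spatial.index.astype(str)
adata_spatial.var_names = spatial.columns.astype(str)

integrated, corrected = scanorama.correct_scanpy(
    [adata_rna, adata_spatial], return_dimred=True
)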

Integration vs Preprocessing

Very excited to try this method on my data! I have a design question that is not clear to me. Which methods can be replaced by any pre-processing pipeline and which are core methods for integration? For instance, if I process raw data through PCA in Seurat or Scanpy, can I run the assemble method on the PC matrices and expect similar results? It would be helpful to have this distinction reflected in the code base.

correct and assemble methods

This is a wonderful method and thank you for sharing the software!!
I am trying to apply it to my sets of scRNA-seq data, but I got a bit confused when trying to understand the correct and assemble methods.

What I want in the end is the expression values of all genes corrected by scanorama. As far as I understand, the assemble method (which I assume is the main part of scanorama?) runs on a lower-dimensional representation rather than the original expression values. How can I use the assembled results to correct the entire expression matrix? I got confused because the correct method returns 'datasets', while the results of assemble are stored in 'datasets_dimred'. So the output of the correct method is the output of the merge_datasets method. Am I right? Or am I missing something?
Could you maybe explain how I can get scanorama-corrected expression values for all the genes that exist across all data sets?

I am also wondering whether it is OK to mix data sets with different kinds of expression values. For example, some of my data sets have raw read counts, some have UMI counts, and some have only processed values such as CPM or TPM. I see that load_names performs l2 normalization, so are CPM and TPM unsuitable since they have already been corrected for library size?

Thank you for your help in advance!!
Best,
Kyoko
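
A minimal sketch of the call patterns in question, assuming datasets and genes_list are lists of cells-by-genes matrices and gene-name arrays: correct() returns corrected expression values for the genes shared across data sets, and the low-dimensional embeddings come back as well when return_dimred=True.

# Corrected expression only (genes shared across all data sets).
corrected, genes = scanorama.correct(datasets, genes_list)

# Corrected expression plus the low-dimensional panorama embeddings.
datasets_dimred, corrected, genes = scanorama.correct(
    datasets, genes_list, return_dimred=True
)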

Reversing L2-normalisation in corrected data

Hi Brian, thanks for the good work on scanorama! It is mentioned in your paper that expression values are L2-normalised before batch correction. I presume that this is the reason why the corrected values are much smaller than the input values. Is there an option in the correct function that would allow us to reverse the L2-normalisation? Thank you

Does scanorama require scaled data or not?

Thank you so much for this great tool! I have a simple question. After finding highly variable genes, should the data be scaled prior to implementing scanorama or not?

Failed to correct two scanpy objects

Hi there,

I tried to correct two scanpy objects with scanorama, but ran into an error. The two matrices have different numbers of genes. Here is my code.

data_10X = sc.read_h5ad("./wholeCell_10x_day00.h5ad")
data_iDrop = sc.read_h5ad("wholeCell_iDrop_day00.h5ad")
adatas = [ data_10X, data_iDrop ]

len(data_10X.X[0])
# 13374

len(data_iDrop.X[0])
# 15381

len(data_iDrop.var_names.values)
# 15381

len(data_10X.var_names.values)
# 13374

integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)

And the error log:

Found 12953 genes among all datasets
[[0.         0.56709452]
 [0.         0.        ]]
Processing datasets (0, 1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-54723a00efc8> in <module>
      1 # Integration and batch correction.
----> 2 integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)

/usr/local/lib/python3.6/site-packages/scanorama-1.2-py3.6.egg/scanorama/scanorama.py in correct_scanpy(adatas, **kwargs)
    211     new_adatas = []
    212     for i, adata in enumerate(adatas):
--> 213         adata.X = datasets[i]
    214         new_adatas.append(adata)
    215 

/usr/local/lib/python3.6/site-packages/anndata/base.py in X(self, value)
   1055         else:
   1056             raise ValueError('Data matrix has wrong shape {}, need to be {}'
-> 1057                              .format(value.shape, self.shape))
   1058 
   1059     @property

ValueError: Data matrix has wrong shape (2571, 12953), need to be (2571, 13374)

Thanks for your time and this wonderful program.
Quan
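
A possible workaround, sketched under the assumption that the mismatch comes from Scanorama intersecting the gene sets (the 12953 genes in the log) while this version of correct_scanpy writes the result back into the original-width adata.X: restrict both objects to their shared genes before correcting.

# Workaround sketch: subset to shared genes up front so the corrected matrix
# shape matches what anndata expects.
shared = data_10X.var_names.intersection(data_iDrop.var_names)
adatas = [data_10X[:, shared].copy(), data_iDrop[:, shared].copy()]
integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)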

_binary_search_perplexity takes exactly 3 positional arguments (4 given)

Hello, I tried to use the t-SNE visualization in the package. However, I came across this error:

TypeError: _binary_search_perplexity() takes exactly 3 positional arguments (4 given)

conditional_P = _utils._binary_search_perplexity(
    distances, None, desired_perplexity, verbose)

I realized that sklearn's _binary_search_perplexity function changed its signature from 4 to 3 parameters.

From this (currently assumed by Scanorama) to this (currently in the latest sklearn).

Is there any way to make Scanorama work? Thank you.

Both sklearn 0.22.2 and 0.22.1 have this problem.

batch correction order

Dear,
Thanks for your great work, it's always a trouble to integrate different datasets.

I have read the scanorama supplementary materials in detail. It is known that MNN and CCA are sensitive to the order in which the data sets are considered. According to the scanorama paper, pairs of data sets Di and Dj are processed in decreasing order of their overlap percentages rij. The first Di and Dj are merged to initialize a panorama, and successive pairs are considered afterwards. It seems that the order depends on the algorithm itself and cannot be changed by the user. So the first Di and Dj (I am not sure whether Di or Dj) is chosen as the first reference data set, and its expression values would be the same as its raw values. Am I right? In summary, I wonder which data set is chosen as the reference data set.

Besides, is the colorbar in the t-SNE plot based on raw expression values or corrected expression values? What is the function of assemble(datasets)? I cannot find any tutorial explaining how to use correct(datasets) and assemble(datasets). It seems that correct() is enough for users: given the corrected data sets, there are a lot of tools to perform the downstream analysis. Perhaps you could provide a detailed document.
Best regards

Recover Gene/Feature Loadings from the computed PCA (SC) for downstream analysis?

Hi,
Thanks for developing Scanorama. Despite my having little coding experience in R and Python, it has been relatively easy to use and produced a great integration of my dataset.
I am now wondering if it would be possible to get the gene/feature loadings for the computed scanorama PCAs, as these are required for other downstream analyses (e.g. finding gene modules in Monocle 3).

I am sorry if this is a silly question; I am posting it because I couldn't find an answer.

Many thanks in advance.
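
Scanorama does not appear to expose the loadings directly, but they can be approximated with a least-squares sketch, assuming the integrated embedding is roughly linear in the corrected expression (integrated and corrected here being the outputs of correct_scanpy with return_dimred=True):

# Approximation sketch, not a Scanorama API: regress the corrected expression
# onto the integrated embedding to recover per-gene loadings.
import numpy as np
from scipy.sparse import issparse

Z = np.concatenate(integrated)                     # cells x 100 embedding
X = np.concatenate([c.X.toarray() if issparse(c.X) else c.X for c in corrected])
loadings, *_ = np.linalg.lstsq(Z, X, rcond=None)   # 100 x genes, one row per dimension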

ValueError: Can only assign numpy ndarrays to .obsm['X_scanorama'], not objects of class <class 'scipy.sparse.csr.csr_matrix'>

Hi! I would like to perform batch correction on data which has already been pre-processed with library-size normalization and log-transformation. The input is a list of AnnData objects, with the adata.X field storing the aforementioned data as np.ndarray. Yet when running this through scanorama.correct_scanpy() I get the ValueError:

"ValueError: Can only assign numpy ndarrays to .obsm['X_scanorama'], not objects of class <class 'scipy.sparse.csr.csr_matrix'>".

Code:

import scanorama
import scanpy as sc
from scanpy import *
import numpy as np
#increase width of cells
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))
from scipy import sparse


# import data
train4ds_adata = sc.read("./tests/data/pancreas.h5ad", backup_url="https://goo.gl/V29FNk")

# scanorama takes separate adata objects, therefore split the data into per-sample subsets.
def subsetBatches(adata):
    """
        Arguments:

        adata(obj): Annotation data object from scanpy package containing all batches, cell-labels and batch labels. 
                    note: batches should be stored under adata.obs['sample']. Celltype should be stored under adata.obs['celltype'].

        returns: 

        batches(dict): A dict of {batchName: batchSubset}.
    
    """
    print("Seperating into Batches...")
    batches = {}
    samples = np.unique(adata.obs['sample'])
    
    for sample in samples:
        #subset the data
        data = adata[adata.obs['sample'] == sample]
        #deepcopy
        data.uns = data._adata_ref._uns
        batches.update({sample: data})
    
    for batchName, batchData in  batches.items():
        print(batchName)
        print(batchData)
    print()
    print("Complete.")    
    return batches

#subset
batches = subsetBatches(train4ds_adata)
print()
#generate list of adatas
adatas = [batch for batch in batches.values()]

#check input type is ndarray
print("The data type for each adata.X field is: ")

for adata in adatas:
    print(adata)
    print()
    print(type(adata.X))
    print()

print(adatas[0].X)

#Batch correction
#important note, gene order is not preserved in scanorama
corrected = scanorama.correct_scanpy(adatas)

Output:

Separating into Batches...
Baron
AnnData object with n_obs × n_vars = 8569 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
Muraro
AnnData object with n_obs × n_vars = 2126 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
Segerstolpe
AnnData object with n_obs × n_vars = 3363 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
Wang
AnnData object with n_obs × n_vars = 635 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'

Complete.

The data type for each adata.X field is:
AnnData object with n_obs × n_vars = 8569 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'

<class 'numpy.ndarray'>

AnnData object with n_obs × n_vars = 2126 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'

<class 'numpy.ndarray'>

AnnData object with n_obs × n_vars = 3363 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'

<class 'numpy.ndarray'>

AnnData object with n_obs × n_vars = 635 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'

<class 'numpy.ndarray'>

[[-0.18548188  1.2636875  -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]
 [-0.18548188 -0.41321668 -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]
 [-0.18548188 -0.41321668 -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]
 ...
 [-0.18548188 -0.41321668 -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]
 [-0.18548188 -0.41321668 -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]
 [-0.18548188 -0.41321668 -0.27988526 ... -0.1822604  -0.12235923
  -0.11705018]]
Found 2448 genes among all datasets
[[0.         0.10959548 0.17633066 0.02362205]
 [0.         0.         0.55644403 0.06771654]
 [0.         0.         0.         0.36062992]
 [0.         0.         0.         0.        ]]
Processing datasets (1, 2)
Processing datasets (2, 3)
Processing datasets (0, 2)
Processing datasets (0, 1)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in <module>
     53 #Batch correction
     54 #important note, gene order is not preserved in scanorama
---> 55 corrected = scanorama.correct_scanpy(adatas)

~/miniconda3/envs/keras2/lib/python3.6/site-packages/scanorama/scanorama.py in correct_scanpy(adatas, **kwargs)
    216     new_adatas = []
    217     for i, adata in enumerate(adatas):
--> 218         adata.obsm['X_scanorama'] = datasets[i]
    219         new_adatas.append(adata)
    220 

~/miniconda3/envs/keras2/lib/python3.6/site-packages/anndata/base.py in __setitem__(self, key, arr)
    117             raise ValueError(
    118                 'Can only assign numpy ndarrays to .{}[{!r}], not objects of class {}'
--> 119                 .format(self._attr, key, type(arr))
    120             )
    121         if arr.ndim == 1:

ValueError: Can only assign numpy ndarrays to .obsm['X_scanorama'], not objects of class <class 'scipy.sparse.csr.csr_matrix'>
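
A possible workaround sketch: correct_scanpy() forwards its keyword arguments to correct(), so requesting dense output avoids assigning a scipy.sparse matrix to .obsm, which this anndata version refuses.

# Workaround sketch: ask Scanorama for dense arrays instead of sparse matrices.
corrected = scanorama.correct_scanpy(adatas, return_dense=True)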

Keep Anndata .obs and .var information in batch correction

Hi!

Thank you very much for your great tool, Unfortunately, I am still having some problems with it and I hope you can help me with that.

I am using the scanorama.correct_scanpy() method for data integration and batch correction on a list of AnnData objects. As a result I get the corrected expression values but all the additional information that was stored in AnnData.obs and AnnData.var is lost. Is this to be expected?

If so, how can I add the corrected expression values to my original AnnData objects, given that the gene order is changed after batch correction?

Thank you!
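
One sketch for the second question, assuming the corrected objects preserve cell order and contain a subset of the original genes: subset each original AnnData to Scanorama's gene order, so the original .obs/.var metadata is kept, and attach the corrected values as a layer (the layer name here is hypothetical).

corrected = scanorama.correct_scanpy(adatas)
realigned = []
for orig, corr in zip(adatas, corrected):
    # Original metadata, reordered to Scanorama's gene order.
    sub = orig[:, list(corr.var_names)].copy()
    sub.layers["scanorama"] = corr.X
    realigned.append(sub)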

downstream analysis

Dear,
Are there any special requirements for the input data (raw counts, CPM, TPM, RPKM, or log transformation)? Is the scanorama-corrected data compatible with routine downstream analysis, such as hierarchical clustering, differential expression, diffusion maps, pseudotime inference, etc.?

Logic error in `correct` function

Hey,

I think there's a logic error in the current implementation of correct. The default value for hvg is None, which gets passed to process_data, which tries hvg > 0, which throws (on Python 3.6):

TypeError: '>' not supported between instances of 'NoneType' and 'int'

I'm really liking the updates to the api and the documentation by the way!
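
A minimal sketch of a guard that would avoid the TypeError (not necessarily the fix that was shipped):

# Sketch of the fix: treat hvg=None as "skip HVG filtering" instead of
# comparing None with an int.
def wants_hvg(hvg):
    return hvg is not None and hvg > 0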

Could not find: data/macrophage/monocytes

Hi,

I was running monocyte_macrophage.py and got this error:

Warning: Could not find data/macrophage/monocytes
Successfully processed data/pbmc/10x/cd14_monocytes
Successfully processed data/macrophage/mcsf_day3_1
Successfully processed data/macrophage/mcsf_day3_2
Successfully processed data/macrophage/mcsf_day6_1
Successfully processed data/macrophage/mcsf_day6_2
Could not find: data/macrophage/monocytes

The data I downloaded in the data/macrophage directory are:

infected.txt     mcsf_day6_2.npz     mono_macro_diffexpr.txt
mcsf_day3_1.npz  mcsf_day6_2.txt     mono_macro_genes.txt
mcsf_day3_1.txt  mixed_infected.txt  mono_macro_hours.txt
mcsf_day3_2.npz  monocytes_1.txt     mono_macro.txt
mcsf_day3_2.txt  monocytes_2.txt     uninfected_donor2.txt
mcsf_day6_1.npz  monocytes_3.txt     uninfected.txt
mcsf_day6_1.txt  monocytes_4.txt

Could you please give me some advice on how to merge monocytes_(1,2,3,4).txt into monocytes.txt, or should I add these four monocytes_(1,2,3,4).txt files as extra inputs for integration?

Thanks in advance,
BP

Use all genes or highly variable genes for scanorama?

Hi Brian,
I am trying to integrate 4 single-cell datasets across different platforms (10X, Drop-seq and inDrop).
While mnnCorrect() in scran recommends using HVGs as input, I didn't see any discussion of feature selection here.
I want to ask: should I use all genes (after filtering low-abundance genes) or just the HVGs of each dataset for scanorama.correct()?

Any advice would be appreciated !

Illegal instruction (core dumped) on HPC cluster

Hi Brian

Thanks for your work. I am running into an 'Illegal instruction' error when running scanorama on a >100000 cell dataset on an HPC. I downgraded annoy to 1.11.5 and have tried playing with both sketch and batch_size to no avail. The same command runs fine on my local Mac using a 5% subset of the total dataset intended for the HPC. Any info you might have would be really appreciated.

Thanks again

>>> import os
>>> import scanpy as sc
>>> import anndata as ad
>>> import scanorama

>>> scrm = scanorama.integrate_scanpy(adatas, knn=20)
Found 5907 genes among all datasets
Illegal instruction (core dumped)

HPC info -
CentOS release 6.10 (Final)

Python info -
Python 3.6.7
annoy 1.11.5
scanorama 1.6

how to set reduced dimension

I used the following code to get the integrated and corrected data.

datasets=[adata_b1,adata_b2]
integrated, corrected = scanorama.correct_scanpy(datasets, return_dimred=True)

Is there a way to set the number of reduced dimensions?
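
For reference, a sketch using the dimred keyword (default 100), which correct() and integrate() accept and correct_scanpy() forwards:

# dimred controls the size of the reduced representation.
integrated, corrected = scanorama.correct_scanpy(
    datasets, return_dimred=True, dimred=50
)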

Minimal example with working R code

I am working in R using scanorama with the reticulate package, and tried running the integrate routine and got this error:

> integrated_data <- scanorama$integrate(scale_matrix_list, genes_list)
Found 4335 genes among all datasets
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  IndexError: index (4334) out of range

Detailed traceback: 
  File "/home/ramezqui/.local/lib/python3.7/site-packages/scanorama-0.4-py3.7.egg/scanorama/scanorama.py", line 147, in integrate
    ds_names=ds_names)
  File "/home/ramezqui/.local/lib/python3.7/site-packages/scanorama-0.4-py3.7.egg/scanorama/scanorama.py", line 282, in merge_datasets
    datasets[i] = datasets[i][:, uniq_idx]
  File "/home/ramezqui/.local/lib/python3.7/site-packages/scipy-1.1.0-py3.7-linux-x86_64.egg/scipy/sparse/csr.py", line 307, in __getitem__
    P = extractor(col,self.shape[1]).T        # [1:2,[1,2]]
  File "/home/ramezqui/.local/lib/python3.7/site-packages/scipy-1.1.0-py3.7-linux-x86_64.egg/scipy/sparse/csr.py", line 270, in extractor
    min_indx, max_indx = check_bounds(indices, N)
  File "/home/ramezqui/.local/lib/python3.7/site-packages/scipy-1.1.0-py3.7-linux-x86_64.egg/scipy/sparse/csr.py", line 256, in check_bounds
    raise IndexError('index (%d) out of range' % max_indx)

The matrices are contained in a list, each 4335 x [variable number of cells], and the genes are character vectors each of length 4335 and contained within a list.

Is there a more detailed manual available with minimal working R examples? Maybe something is up on my end, not sure.

inputs as parameters

This really isn't an issue, but more of a suggestion. Currently, the data_names variable for the input data must be hard-coded. Is it possible to make that a command-line argument? That way, you can just modify the command instead of having to edit a file every time you run scanorama.
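
A sketch of the suggestion using argparse (the interface itself is hypothetical): take the dataset names from the command line instead of hard-coding them.

import argparse

parser = argparse.ArgumentParser(description="Run Scanorama on the given datasets")
parser.add_argument("data_names", nargs="+", help="paths to the input data sets")
args = parser.parse_args()
data_names = args.data_names  # replaces the hard-coded list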

Adding new dataset into previously integrated datasets

I am wondering about the feasibility of using scanorama (R version) in the face of new datasets. For example, I had 8 datasets that I wanted to integrate, and used scanorama's correct method to do so (let's call the integrated dataset X). The method seems to have done a nice job at integrating and correcting the datasets, and I would now like to use X as a reference for any new data to be integrated into the same space. My understanding of the method is that if I try to integrate X and a new dataset, both will be "corrected." Also, I'm not sure if it would be a bad idea to try to integrate X, which has already been corrected with scanorama, and a new dataset that has not been normalized or corrected in the same way. Would you suggest that I try to use the correct() method on X and the new dataset, and therefore X will have been corrected twice?

I also thought about using scanorama as a method to integrate the reference space itself, as I have done, and then use a different integration method that does take a reference to place new datasets within the space of X without changing X itself. Any thoughts?

Workflow Question

Hi Brian,

I read over the manuscript, and wanted to ask if you could explain some finer details of your preprocessing workflow. I got some preliminary results from scanorama, but unfortunately some of my cell populations that should definitely be discrete got mixed in with each other. Thank you in advance!

In your README you say that scanorama is downstream of other preprocessing steps (and the manuscript mentions filtering out low-quality cells and l2 normalization). Did you also reduce the matrices to only the genes that are highly variable? Should technical effects (or any other covariates), such as the number of UMIs, be regressed out before feeding the data into scanorama?

Also, once the scanorama-corrected matrices are returned, do you think imputation on this matrix would be appropriate? Additionally, setting dimred = TRUE returns a lower-dimensional representation of the data. Is this the result of the SVD calculation, and would it be appropriate to then feed into a tSNE/UMAP calculation?

Thanks for your time and any other tips you might have to offer to get the data to behave nicely!

caught illegal operation: addreses 0x7f5438636bda, cause 'illegal operand'

I'm getting a cryptic error message when trying to run Scanorama through reticulate in R on an HPC cluster. I've been using the Seurat-compatible method suggested by @gdagstn on the open Seurat compatibility issue (#38 (comment)).

After running scanorama$integrate(assaylist, genelist), I get the error message

*** caught illegal operation ***
addreses 0x7f5438636bda, cause 'illegal operand'

I've tried generating the assaylist in a few different ways using a few different datasets, so I do not believe it is an issue with the data or a memory issue.

I'll include possibly relevant information below, happy to provide any further details
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

reticulate v 1.16
Seurat v 3.1.4
Python3 v 3.7.2
scanorama v 1.6

Matplotlib backend settings prevent plot display in jupyter notebook

Importing scanorama sets the Agg backend, which prevents plots from being displayed in Jupyter notebooks.

sc.pl.umap(adata, color = ["clusters"])
>>> /Users/giovanni.palla/Projects/scanpy/scanpy/plotting/_utils.py:307: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
  pl.show()

Is this line needed?

mpl.use('Agg')

Setting back to standard jupyter inline backend works:
mpl.use('module://ipykernel.pylab.backend_inline')
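
One possible guard, sketched: fall back to Agg only when no interactive backend is plausible, leaving notebook backends untouched.

import os
import matplotlib as mpl

# Keep whatever backend the host (e.g. a Jupyter kernel) already configured.
if not os.environ.get("DISPLAY") and "inline" not in mpl.get_backend().lower():
    mpl.use("Agg")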

Session info:

seaborn==0.9.0
scanorama==1.5
pandas==0.25.3
numpy==1.18.1
matplotlib==3.1.1
scanpy==1.4.5.post3.dev27+gcea5f4f.d20200212
anndata==0.7.2.dev13+g4440b90

pip install

What she said.

  1. The authors should consider submitting the software to something like PyPI so that users will be able to install it via pip, to make Scanorama more accessible to the scientific community.

The easier it is to install Scanorama on research servers at university, the better.

Integrating scanorama with Seurat?

I am interested in trying out Scanorama to see how it fares with our data, but I have the issue that we have been working with our data and doing our processing and analysis in Seurat in R. Are there any plans to integrate Scanorama in a way that makes it easily usable with Seurat? Or is this already possible and I have not been able to figure it out?

Question about the values returned by scanorama.correct() with and without return_dimred

Hi Brian,

thanks for the great tool! I've tried out the example code you gave for using scanorama via reticulate and I just have a question regarding the output returned by scanorama.correct.
In your example code, you run that function twice, once with return_dimred turned on and once turned off. I assumed the only difference would be that one returns a bigger object than the other, but that the corrected values would be the same. That is not quite the case when I look at my data, though, and I was wondering where the differences come from.

integrated.corrected.data <- scanorama$correct(datasets, genes_list,
                                               return_dimred=TRUE, 
                                               return_dense=TRUE)
                                               
corrected.data <- scanorama$correct(datasets, genes_list, 
                                    return_dense=TRUE)
                                    
 > str(corrected.data)
List of 2
 $ :List of 3
  ..$ : num [1:4337, 1:17281] 2.20e-04 1.92e-05 1.77e-05 1.64e-05 1.66e-05 ...
  ..$ : num [1:5839, 1:17281] 2.71e-04 6.28e-05 4.65e-05 6.58e-05 5.86e-05 ...
  ..$ : num [1:2835, 1:17281] 3.36e-05 3.58e-05 4.41e-05 5.40e-05 3.28e-04 ...
 $ : chr [1:17281(1d)] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
 
 
 > str(integrated.corrected.data)
List of 3
 $ :List of 3
  ..$ : num [1:4337, 1:100] -0.0513 -0.0731 -0.0565 -0.107 -0.2204 ...
  ..$ : num [1:5839, 1:100] 0.2674 0.2699 -0.0853 0.2077 0.1268 ...
  ..$ : num [1:2835, 1:100] -0.0854 -0.1937 -0.4252 -0.4026 -0.2517 ...
 $ :List of 3
  ..$ : num [1:4337, 1:17281] 2.22e-04 2.13e-05 1.97e-05 1.86e-05 1.95e-05 ...
  ..$ : num [1:5839, 1:17281] 2.71e-04 6.26e-05 4.81e-05 6.52e-05 5.90e-05 ...
  ..$ : num [1:2835, 1:17281] 3.54e-05 3.83e-05 4.84e-05 6.08e-05 3.31e-04 ...
 $ : chr [1:17281(1d)] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...

I was under the impression that the content of the second list of integrated.corrected.data should be the same as the first list of corrected.data, but that doesn't seem to be the case. Can you explain to me what I'm missing? Thanks a lot!

pip release 0.6.2 ModuleNotFoundError

I used pip to install scanorama version 0.6.2 and ran

import scanorama

with Python 3.6, which resulted in

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-4ea9f8d58e85> in <module>()
----> 1 import scanorama

/opt/conda/lib/python3.6/site-packages/scanorama/__init__.py in <module>()
----> 1 from .scanorama import *

/opt/conda/lib/python3.6/site-packages/scanorama/scanorama.py in <module>()
     15 
     16 from .t_sne_approx import TSNEApprox
---> 17 from .integration import harmony_integrate
     18 from .utils import plt, dispersion, reduce_dimensionality
     19 from .utils import visualize_cluster, visualize_expr, visualize_dropout

/opt/conda/lib/python3.6/site-packages/scanorama/integration.py in <module>()
----> 1 from ample import gs, uniform, srs, kmeanspp
      2 import numpy as np
      3 from subprocess import Popen
      4 from time import time
      5 

ModuleNotFoundError: No module named 'ample'

Google searches aren't immediately showing what the "ample" module might be - is it misspelled? Uninstalling and reverting to version 0.6 fixes this issue. Let me know if I can provide any more information.

Multiple CPUs

Could we run scanorama.correct_scanpy on multiple CPUs to speed it up?

Command usage

In the readme, it mentions using the following commands to integrate and correct the adata files. What is the difference between these commands? They seem to be doing the same thing for integration and batch correction. How can I get one integrated and corrected AnnData, and then do plotting and clustering with Scanorama? Thanks!

# Integration.
integrated = scanorama.integrate_scanpy(adatas)

# Batch correction.
corrected = scanorama.correct_scanpy(adatas)

# Integration and batch correction.
integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)
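
For the last question, a downstream sketch (ad.concat needs a recent anndata, and the obsm key and use of scanpy here are assumptions): concatenate the corrected objects, store the integrated embedding, and cluster/plot on it.

import numpy as np
import anndata as ad
import scanpy as sc

integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)
adata = ad.concat(corrected, label="batch")      # one integrated+corrected AnnData
adata.obsm["X_scanorama"] = np.concatenate(integrated)

sc.pp.neighbors(adata, use_rep="X_scanorama")    # neighbors on the embedding
sc.tl.leiden(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["batch", "leiden"])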

Insights about corrected expression values

Hi Brian,

I was wondering whether you could explain the interpretation and downstream usage of the batch-corrected expression values returned by scanorama a bit more. I understand that you're adding weights based on a translation vector to the individual values, but I'm not sure I totally understand the nature of the returned values and how to use them for typical differential gene expression analyses.

I noticed that the range of the values seems to be capped at 1, but the lower limit is somewhat elusive.

> lapply(integrated.corrected.data[[2]], min)
[[1]]
[1] -0.0006699665

[[2]]
[1] -0.006567805

[[3]]
[1] -0.01890116

Did you test them with any of the "gene marker" finding tools, e.g. Seurat's or scater's findMarkers, which are basically wrappers around tools developed for bulk RNA-seq and often expect integer count values? Do you have any comments on that?

Could not find function "add.Gaussian.noise" in simulate_nonoverlap.R

Hi,

I was running scanorama/bin/R/simulate_nonoverlap.R and got this error:

Error in add.Gaussian.noise(counts(sim.groups.A)) : could not find function "add.Gaussian.noise"

Is this function implemented elsewhere, or is it imported from another package?
I found a function of that name here: https://www.rdocumentation.org/packages/RMThreshold/versions/1.1/topics/add.Gaussian.noise, but I am not sure whether I should use it.

By the way, I found the following code a little strange, since sim.groups.A has already been written out to files and needs no further operations. Should sim.groups.A be replaced with sim.groups.B here?

sim.groups.B <- splatSimulate(
    batchCells = c(1000), group.prob = c(0, 0, 0.50, 0.50),
    method = "groups", verbose = FALSE
)
sim.groups.A <- splatSimBatchEffects(sim.groups.A, newSplatParams())

Thanks in advance,
BP

ValueError: blocks must be 2-D

Hi,

I encounter the following error when running scanorama:

Found 105476 cells among all datasets
After data/293t_jurkat/293t: 32639 genes
After data/293t_jurkat/jurkat: 32639 genes
After data/293t_jurkat/jurkat_293t_50_50: 32639 genes
After data/293t_jurkat/jurkat_293t_99_1: 32639 genes
After data/brain/neuron_9k: 15642 genes
After data/macrophage/infected: 12731 genes
After data/macrophage/mixed_infected: 11622 genes
After data/macrophage/uninfected: 10510 genes
After data/macrophage/uninfected_donor2: 9414 genes
After data/hsc/hsc_mars: 5386 genes
After data/hsc/hsc_ss2: 5328 genes
After data/pancreas/pancreas_inDrop: 5311 genes
After data/pancreas/pancreas_multi_celseq2_expression_matrix: 5298 genes
After data/pancreas/pancreas_multi_celseq_expression_matrix: 5286 genes
After data/pancreas/pancreas_multi_fluidigmc1_expression_matrix: 5216 genes
After data/pancreas/pancreas_multi_smartseq2_expression_matrix: 5216 genes
After data/pbmc/10x/68k_pbmc: 5216 genes
After data/pbmc/10x/b_cells: 5216 genes
After data/pbmc/10x/cd14_monocytes: 5216 genes
After data/pbmc/10x/cd4_t_helper: 5216 genes
After data/pbmc/10x/cd56_nk: 5216 genes
After data/pbmc/10x/cytotoxic_t: 5216 genes
After data/pbmc/10x/memory_t: 5216 genes
After data/pbmc/10x/regulatory_t: 5216 genes
After data/pbmc/pbmc_kang: 5216 genes
After data/pbmc/pbmc_10X: 5216 genes
Found 5216 genes among all datasets
Traceback (most recent call last):
  File "bin/panorama.py", line 17, in <module>
    return_dimred=True
  File "/storage/home/sturm/projects/single_cell_data_integration/scanorama/scanorama/scanorama.py", line 89, in correct
    datasets_dimred, genes = process_data(datasets, genes, hvg=hvg)
  File "/storage/home/sturm/projects/single_cell_data_integration/scanorama/scanorama/scanorama.py", line 343, in process_data
    datasets_dimred = dimensionality_reduce(datasets, dimred=dimred)
  File "/storage/home/sturm/projects/single_cell_data_integration/scanorama/scanorama/scanorama.py", line 308, in dimensionality_reduce
    X = vstack(datasets)
  File "/storage/home/sturm/.conda/envs/scanorama2/lib/python2.7/site-packages/scipy/sparse/construct.py", line 498, in vstack
    return bmat([[b] for b in blocks], format=format, dtype=dtype)
  File "/storage/home/sturm/.conda/envs/scanorama2/lib/python2.7/site-packages/scipy/sparse/construct.py", line 547, in bmat
    raise ValueError('blocks must be 2-D')
ValueError: blocks must be 2-D

Steps to reproduce:

git clone git@github.com:brianhie/scanorama.git
cd scanorama
pip install -e . 
wget http://scanorama.csail.mit.edu/data_light.tar.gz
tar xvzf data_light.tar.gz
python bin/preprocess.py conf/panorama.txt
python bin/panorama.py conf/panorama.txt

Environment

Here's a conda environment in which the problem occurs.

scanorama.yml

name: scanorama
channels:
  - conda-forge
  - bioconda
  - defaults
  - grst
dependencies:
  - python=2.7
  - docopt=0.6.2
  - python-annoy=1.11.5
  - intervaltree=2.1.0
  - matplotlib=2.0.2
  - networkx=2.1
  - numpy=1.12.0
  - scipy=1.0.0
  - scikit-learn=0.19.1
  - statsmodels=0.8.0
  - pip:
    - fbpca==1.0

Actually, I also tried with python 3.6 and more recent versions of the packages - with the same result.

Example for 293t_jurkat does not work

Very small issue; the data needs to be loaded with load_mtx rather than load_data as called within the load_names function in process.py. A simple fix, but not sure if this would break any other examples currently in your code.

failed in Integrating 1 million cells

I used the scripts in ./bin/mouse_brain.py

BATCH_SIZE = 10000
t0 = time()
datasets_dimred, genes = integrate(
    datasets, genes_list, ds_names=data_names,
    batch_size=BATCH_SIZE,
)
print('Integrated panoramas in {:.3f}s'.format(time() - t0))

t0 = time()
datasets_dimred, datasets, genes = correct(
    datasets, genes_list, ds_names=data_names,
    return_dimred=True, batch_size=BATCH_SIZE,
)

The error looks like this:
-packages/scipy/sparse/compressed.py", line 88, in __init__
    self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
File "/home/jialu/packages/anaconda3/envs/scanoramaEnv/lib/python3.7/site-packages/scipy/sparse/coo.py", line 192, in __init__
    self.data = M[self.row, self.col]
MemoryError: Unable to allocate array with shape (2123370384,) and data type float64

The machine has 256G memory. It should be enough for running Scanorama on this data, right?

BLAS/LAPACK routine 'DLASWP' gave error code -6

Hi,

I have used scanorama through reticulate on a set of Seurat objects, basically following the Rscript you provided. On our cluster the code ran into this error:

corrected.data <- scanorama$correct(assaylist, genelist, return_dense=TRUE)
Found 60662 genes among all datasets
Error in py_call_impl(callable, dots$args, dots$keywords) :
BLAS/LAPACK routine 'DLASWP' gave error code -6

The strange thing is that other packages depending on numpy, such as scanpy and scvelo, did not encounter a similar issue. We first wondered whether it was a problem with BLAS, but np.__config__.show() does not look abnormal.

Environment:
R 3.6.2
reticulate_1.16
python 3.8.3 via anaconda

Thanks a lot!
Best,
Yi

Integration different seq data

Is it possible to use Scanorama to integrate single-cell DNA methylation and single-cell RNA-seq data, and then use the RNA-seq data to identify the cell types that cluster with the methylation dataset?

Thanks!

Cross species comparison

Hi there,

I have single-cell data from 3 different species. Is Scanorama a proper tool for cross-species comparison?

Is cell order preserved in corrected data?

Hi Brian,

I have a naive question on the scanorama correction methods. In a test case, I fed data from several batches into the correct() method and then, out of curiosity, analysed scatterplots of (original data for marker X, scanorama-corrected data for marker X) for several genes/markers. I would expect there to be some kind of relationship; not necessarily linear, maybe not even monotonous, not free of noise, but some kind of structure. What I get is fairly wild though, and this persists if I look within one batch only.

I was wondering whether cells/events are still ordered the same way after scanorama correction as they were before? Or would you think there is some other kind of user error here? Are the many zeros causing problems here; based on what information are they being assigned values during correction?
In principle I thought that the method had worked well, as batches nicely overlayed where they should in PCA and UMAP dimension reductions of the corrected data.

Thanks.

Lisa
[Image: scanorama_question (blue - all batches, orange - one example batch)]

The number of downloaded 10x Jurkat cells does not match with the paper

Hi,

I downloaded the dataset from http://cb.csail.mit.edu/cb/scanorama/ and analyzed the 293T:Jurkat dataset, but I found that the number of Jurkat cells in the original 10x dataset (pure 100%, n=3258 cells) does not match the number (n=3257 cells) reported in your NBT paper. In addition, the number of Jurkat cells (pure 100%) is 3257 in the assigned cell labels file: "data/cell_labels/293t_jurkat_cluster.txt".
Could you please give me some instructions on how to reconcile the Jurkat cell number? Thanks in advance!

Best regards,
BP

memory and run time for large dataset

Hi, I am attempting to run scanorama on 150K cells from 40 samples on a c5.9xlarge EC2 instance with 32 cores and 64GB of RAM. Under this configuration scanorama fails with batch_size=1000 or 500. Now I am running it with batch_size=100 and RAM usage is at 40G. Processing each pair of datasets takes around 2 mins, and if there are 780 (40 choose 2) such pairs, I would expect this to finish in 26 hours. I saw the analysis of runtime and memory usage as a function of the number of cells in your publication. I was wondering if you have done an analysis of time and memory usage as a function of the number of datasets, each with around 3000 cells? Should I expect the run time to increase combinatorially with the number of datasets? Should I invest in more RAM or more CPUs for my EC2 instance?

Previously, I tested scanorama after downsampling this 40 samples dataset to 30K cells, and scanorama finished in an hour and produced a beautifully integrated umap. However my stdout only showed 570 lines of "Processing datasets (i, j)". Should I have expected 780 (40 choose 2) such lines?

AnnData import

I use two AnnData objects as input, but it doesn't work well. In the AnnData objects I use, obs is genes and var is cells.

After these codes:

adatas = [data1, data2]

integrated = scanorama.integrate_scanpy(adatas)

corrected = scanorama.correct_scanpy(adatas)

integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)

I got this error:
AttributeError                            Traceback (most recent call last)
in <module>
      3
      4 # Integration.
----> 5 integrated = scanorama.integrate_scanpy(adatas)
      6
      7 # Batch correction.

~/anaconda3/envs/cs/lib/python3.6/site-packages/scanorama/scanorama.py in integrate_scanpy(adatas, **kwargs)
    235     """
    236     datasets_dimred, genes = integrate(
--> 237         [adata.X for adata in adatas],
    238         [adata.var_names.values for adata in adatas],
    239         **kwargs

~/anaconda3/envs/cs/lib/python3.6/site-packages/scanorama/scanorama.py in <listcomp>(.0)
    235     """
    236     datasets_dimred, genes = integrate(
--> 237         [adata.X for adata in adatas],
    238         [adata.var_names.values for adata in adatas],
    239         **kwargs

AttributeError: 'function' object has no attribute 'X'
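
Two things stand out here, sketched below: Scanorama expects cells in .obs and genes in .var, so the objects need transposing; and the 'function' object in the traceback suggests that a function, rather than an AnnData object, ended up in the list.

# Sketch: transpose so cells are obs and genes are var, and make sure the list
# holds AnnData objects (not functions).
adatas = [data1.T, data2.T]
integrated, corrected = scanorama.correct_scanpy(adatas, return_dimred=True)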

'Error after 400 iterations' message

Hi,

I'm running the 293t data on my machine following the tutorial commands. When running python bin/293t_jurkat.py, the last output message is 'Error after 400 iterations: 1.044037'. It still generates the svg figure. I don't know whether this is just a progress message or whether something is wrong; I just want to make sure the result is right.

python bin/process.py conf/panorama.txt

Successfully processed data/293t_jurkat/293t
Successfully processed data/293t_jurkat/jurkat
Successfully processed data/293t_jurkat/jurkat_293t_50_50

python bin/293t_jurkat.py

Loaded data/293t_jurkat/293t with 32738 genes and 2885 cells
Loaded data/293t_jurkat/jurkat with 32738 genes and 3257 cells
Loaded data/293t_jurkat/jurkat_293t_50_50 with 32738 genes and 3388 cells
Found 9530 cells among all datasets
After data/293t_jurkat/293t: 32639 genes
After data/293t_jurkat/jurkat: 32639 genes
After data/293t_jurkat/jurkat_293t_50_50: 32639 genes
Found 32639 genes among all datasets
[[0.         0.00859687 0.71265165]
 [0.         0.         0.26251151]
 [0.         0.         0.        ]]
Processing datasets data/293t_jurkat/293t <=> data/293t_jurkat/jurkat_293t_50_50
Processing datasets data/293t_jurkat/jurkat <=> data/293t_jurkat/jurkat_293t_50_50
[t-SNE] Computing 1801 nearest neighbors...
[t-SNE] Indexed 9530 samples in 0.320s...
[t-SNE] Computed neighbors for 9530 samples in 33.926s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9530
[t-SNE] Computed conditional probabilities for sample 2000 / 9530
[t-SNE] Computed conditional probabilities for sample 3000 / 9530
[t-SNE] Computed conditional probabilities for sample 4000 / 9530
[t-SNE] Computed conditional probabilities for sample 5000 / 9530
[t-SNE] Computed conditional probabilities for sample 6000 / 9530
[t-SNE] Computed conditional probabilities for sample 7000 / 9530
[t-SNE] Computed conditional probabilities for sample 8000 / 9530
[t-SNE] Computed conditional probabilities for sample 9000 / 9530
[t-SNE] Computed conditional probabilities for sample 9530 / 9530
[t-SNE] Mean sigma: 0.094365
[t-SNE] Computed conditional probabilities in 8.172s
[t-SNE] Iteration 50: error = 53.7817993, gradient norm = 0.0253594 (50 iterations in 62.223s)
[t-SNE] Iteration 100: error = 49.3612633, gradient norm = 0.0013276 (50 iterations in 66.378s)
[t-SNE] Iteration 150: error = 49.1272125, gradient norm = 0.0007659 (50 iterations in 65.474s)
[t-SNE] Iteration 200: error = 49.0316925, gradient norm = 0.0004471 (50 iterations in 64.936s)
[t-SNE] Iteration 250: error = 48.9777985, gradient norm = 0.0003479 (50 iterations in 63.844s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.977798
[t-SNE] Iteration 300: error = 1.3533084, gradient norm = 0.0019510 (50 iterations in 64.437s)
[t-SNE] Iteration 350: error = 1.1094712, gradient norm = 0.0003535 (50 iterations in 64.903s)
[t-SNE] Iteration 400: error = 1.0440365, gradient norm = 0.0001733 (50 iterations in 65.079s)
[t-SNE] Error after 400 iterations: 1.044037

Loaded data/293t_jurkat/293t with 32738 genes and 2885 cells
...

Imputation after scanorama?

Hi!

I understand that scanorama should be used in the very last steps of scRNA-seq data analysis. However, Yuval Kluger, a Prof. of Applied Mathematics at Yale, suggested that I run imputation after merging my datasets (I was originally working with Seurat, but the integration yielded poor results).

Is it possible/advisable to run imputation on the output of scanorama? If so, do you have any recommendations regarding how to properly do it?
