scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.

Home Page: https://scanpy.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 99.80% R 0.20%
machine-learning data-science visualize-data transcriptomics bioinformatics scanpy anndata python scverse

scanpy's Introduction


Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
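
For orientation, a minimal sketch of such a workflow using the public scanpy API (dataset and parameter values are illustrative; sc.tl.leiden assumes the optional leidenalg dependency is installed):

import scanpy as sc

adata = sc.datasets.pbmc3k()                  # example dataset shipped with scanpy
sc.pp.filter_cells(adata, min_genes=200)      # basic quality filtering
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)                        # kNN graph used by clustering and trajectories
sc.tl.leiden(adata)                           # clustering (requires leidenalg)
sc.tl.umap(adata)
sc.pl.umap(adata, color='leiden')             # visualization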

Discuss usage on the scverse Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contribution guide.

scanpy is part of the scverse project (website, governance) and is fiscally sponsored by NumFOCUS. If you like scverse and want to support our mission, please consider making a donation to support our efforts.

Citation

If you use scanpy in your work, please cite the scanpy publication as follows:

SCANPY: large-scale single-cell gene expression data analysis

F. Alexander Wolf, Philipp Angerer, Fabian J. Theis

Genome Biology 2018 Feb 06. doi: 10.1186/s13059-017-1382-0.

You can cite the scverse publication as follows:

The scverse project provides a computational ecosystem for single-cell omics data analysis

Isaac Virshup, Danila Bredikhin, Lukas Heumos, Giovanni Palla, Gregor Sturm, Adam Gayoso, Ilia Kats, Mikaela Koutrouli, Scverse Community, Bonnie Berger, Dana Pe’er, Aviv Regev, Sarah A. Teichmann, Francesca Finotello, F. Alexander Wolf, Nir Yosef, Oliver Stegle & Fabian J. Theis

Nat Biotechnol. 2023 Apr 10. doi: 10.1038/s41587-023-01733-8.

scanpy's People

Contributors

a-munoz-rojas, adamgayoso, awnimo, chriscainx, dawe, eroell, falexwolf, fbnrst, fbrundu, fidelram, flying-sheep, giovp, gokceneraslan, intron7, ivirshup, jorvis, koncopd, ktpolanski, luckymd, marius1311, michalk8, pinin4fjords, pre-commit-ci[bot], rfechtner, scottgigante, simonwm, tcallies, tomwhite, volkerbergen, zethson


scanpy's Issues

tl.sim ignores additional parameters

I am trying to toy with the krumsiek11 model, but the sc.tl.sim call seems to ignore parameters and always uses the parameters from the krumsiek11_params.txt file. In particular, running:

adam_krumsiek11 = sc.tl.sim('krumsiek11')
adam_krumsiek11_2 = sc.tl.sim('krumsiek11', nrRealizations = 1, seed = 1665487)
sc.pl.sim(adam_krumsiek11)
sc.pl.sim(adam_krumsiek11_2)

produces two exactly identical figures with 4 realizations.

I also tried to set read_params_from_file = False (this is not documented at http://scanpy.readthedocs.io/en/latest/api/scanpy.api.tl.sim.html but seemed relevant). However, running

adam_krumsiek11 = sc.tl.sim('krumsiek11', nrRealizations = 1, read_params_from_file = False)
sc.pl.sim(adam_krumsiek11)

results in IndexError
and running

adam_krumsiek11 = sc.tl.sim('krumsiek11', nrRealizations = 1, tmax = 800, read_params_from_file = False)
sc.pl.sim(adam_krumsiek11)

avoids the error, but gives the exact same figure as the first code segment.

Maybe I am not understanding how the function should work (in which case this would be a documentation issue), or is there really something wrong?

Thanks for any hints.

indexing of AnnData

The task is to write the following preprocessing sequence using an AnnData instance adata.

meanFilter = 0.01
cvFilter = 2
nr_pcs = 50

ddata = adata.to_dict()
X = ddata['X']
# row normalize                                                                                                                                                                  
X = row_norm(X, max_fraction=0.05, mult_with_mean=True)
# filter out genes with mean expression < 0.1 and coefficient of variance <                                                                                                      
# cvFilter                                                                                                                                                                       
X, gene_filter = filter_genes_cv(X, meanFilter, cvFilter)
# compute zscore of filtered matrix                                                                                                                                              
Xz = zscore(X)
# PCA                                                                                                                                                                            
Xpca = pca(Xz, nr_comps=nr_pcs)
# update dictionary                                                                                                                                                              
ddata['X'] = X
ddata['Xpca'] = Xpca
ddata['var_names'] = ddata['var_names'][gene_filter]
sett.m(0, 'Xpca has shape',
    ddata['Xpca'].shape[0], 'x', ddata['Xpca'].shape[1])
from ..ann_data import AnnData
adata = AnnData(ddata)
print(adata.X)

While the previous snippet works just as expected, when I want to do the same without a ddata object, some uncontrolled behavior comes up. Indexing doesn't work as expected anymore. @flying-sheep: could you have a look at why adata['Xpca'] = Xpca in the following throws an

>>> adata['Xpca'] = Xpca
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

in the following snippet

X = adata.X
# row normalize                                                                                                                                                                  
X = row_norm(X, max_fraction=0.05, mult_with_mean=True)
# filter out genes with mean expression < 0.1 and coefficient of variance <                                                                                                      
# cvFilter                                                                                                                                                                       
X, gene_filter = filter_genes_cv(X, meanFilter, cvFilter)
# compute zscore of filtered matrix                                                                                                                                              
Xz = zscore(X)
# PCA                                                                                                                                                                            
Xpca = pca(Xz, nr_comps=nr_pcs)
# update adata                                                                                                                                                                   
adata.X = X
adata = adata.var_names[gene_filter] # filter genes                                                                                                                              
adata['Xpca'] = Xpca
sett.m(0, 'Xpca has shape',
    adata['Xpca'].shape[0], 'x', adata['Xpca'].shape[1])
print(adata.X)

I played around quite a bit, but the only solution I got running produced numerically incorrect results. It's quite hard to keep this sequence of steps nicely organized.

PS: the snippet appears in scanpy/preprocess/advanced.py and an example would be ./scanpy.py nestorowa16 diffmap -r pp.

New features

hi Alex
here my list :) thanks a lot! and I might expand it....

'needed'

  • concatenate multiple data sets
  • scale up sc.pp.regress_out, add some function for batch correction?
  • additional heatmap annotation for cells/genes (e.g. .smp)
  • aga-graph: labels in pie charts are misleading (sometimes switched), maybe better to have just a legend with the colors than labeling every node.

'nice to have'

  • In scatterplots showing gene expression: plot points ordered by expression, i.e. cells with higher expression on top of cells with lower expression (see the sketch after this list)
  • log transform: let the user choose the base in sc.pp.log1p (natural log, log2 or log10)
  • cell cycle scoring (and scoring for any other gene list)
  • function and plots for basic qc metrics (n_counts, n_genes, CV, %drop out)
  • sc.pl.scatter also for genes (adata.var), e.g. to plot mean expression vs dropout rate
  • table for high scoring genes, in addition to sc.pl.rank_genes_groups
  • additional differential expression test
  • Heatmap for genes per cluster/sample/condition (not just along a path in pseudotime order) including custom sample and gene annotation
  • Maybe also an option for simplified visualization with just mean expression per cluster.
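
A sketch of the scatter-ordering point above, with dummy data rather than scanpy code: sorting by expression before calling scatter draws high-expression cells last, i.e. on top.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.random(500), rng.random(500)  # embedding coordinates (dummy)
expr = rng.exponential(size=500)         # per-cell expression values (dummy)
order = np.argsort(expr)                 # low-expression cells first, high last
plt.scatter(x[order], y[order], c=expr[order], s=8)
plt.colorbar(label='expression')
plt.show()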

SetKeyError in "Graph abstraction for minimal examples"

Hello,

I'm trying out the Graph abstraction and I get this error:

SetKeyError                               Traceback (most recent call last)
<ipython-input-12-928a85d4478e> in <module>()
----> 1 sc.tl.tsne(adata)
      2 sc.tl.draw_graph(adata, random_state=5)  # random_state just makes a cosmetic change
      3 sc.write('krumsiek11_blobs', adata)

~/Downloads/scanpy/scanpy/tools/tsne.py in tsne(adata, n_pcs, perplexity, early_exaggeration, learning_rate, random_state, use_fast_tsne, recompute_pca, n_jobs, copy)
    108         X_tsne = tsne.fit_transform(X)
    109     # update AnnData instance
--> 110     adata.smp['X_tsne'] = X_tsne  # annotate samples with tSNE coordinates
    111     logg.info('    finished', t=True, end=' ')
    112     logg.info('and added\n'

~/Downloads/scanpy/scanpy/data_structs/ann_data.py in __setitem__(self, keys, values)
    382                     # TODO: need to reallocate memory
    383                     # or allow storing objects, or use pd.dataframes
--> 384                     raise SetKeyError(k, v.dtype, self.dtype[k])
    385                 super(BoundStructArray, self).__setitem__(k, v)
    386 

SetKeyError: Currently you cannot implicitly reallocate memory:
Setting the array for key X_tsne001of002 with dtype float64 requires too much memory, you should init AnnData with a large enough data type from the beginning.
Probably you try to assign a string of length 8 although the array can only store strings of length 4.

I'm using the latest git version of scanpy.
Any ideas?
Best wishes

AnnData.write() requires path specification

I have a short script which reads a tab file and writes h5 using scanpy. I've found that unless I provide a full path to the write() function or at least a relative one via "./foo.h5" it fails. Simplified version:

adata = sc.read(args.input_file, ext='txt', first_column_names=True).transpose()
adata.write('./test.h5')   # this works
adata.write('test2.h5')   # this fails

Here's the stack:

WARNING: This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
Traceback (most recent call last):
  File "./convert_gear_group_single_cell_to_hdf5.py", line 47, in <module>
    main()
  File "./convert_gear_group_single_cell_to_hdf5.py", line 43, in main
    adata.write('test2.h5')
  File "/usr/local/lib/python3.5/dist-packages/anndata/base.py", line 1471, in write
    compression=compression, compression_opts=compression_opts)
  File "/usr/local/lib/python3.5/dist-packages/anndata/base.py", line 1513, in _write_h5ad
    os.makedirs(os.path.dirname(filename))
  File "/usr/lib/python3.5/os.py", line 241, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
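
The traceback shows that anndata calls os.makedirs(os.path.dirname(filename)), which fails when the dirname is the empty string. A hedged workaround sketch until that is fixed (adata as in the snippet above):

import os

adata.write(os.path.abspath('test2.h5'))  # equivalent to prefixing with './'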

Documentation request, import formats

I've been able to successfully use scanpy for MEX-formatted datasets, and the documentation here was great in the Jupyter notebooks.

Many of our other datasets are in a simple tab format where the first column is gene symbol and the rest are in GROUP--CELLID format, like this:

Gene_symbol	lymph_1--cell_avg	lymph_1--Cell_1	lymph_1--Cell_10
A1BG.AS1	13.9085855833156	0	54.3778851449283
A2M.AS1	10.2185780428145	0	0
A2MP1	0	0	0
AADACL2	0	0	0
AAGAB	136.889472532613	0	0
AAR2	76.3090843598131	0	0
AATF	360.127068564485	0	0
AATK	2.93980819712579	0	0
AATK.AS1	0	0	0

These contain up to 30,000 rows and (so far) 400,000 columns.

I'd love to see more use cases for how to absorb different formats of data into scanpy. I'd be happy to help write it, but so far have only got MEX data to work.
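
Until such a recipe is documented, here is a hedged sketch (not existing scanpy API; the filename is hypothetical) for loading this tab format and splitting the GROUP--CELLID column names into a per-cell group annotation:

import pandas as pd
from anndata import AnnData

df = pd.read_csv('counts.tab', sep='\t', index_col='Gene_symbol')  # genes x cells
adata = AnnData(df.T)  # scanpy expects cells as rows, genes as columns
adata.obs['group'] = [name.split('--')[0] for name in adata.obs_names]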

Development branch

Hi, I was just trying to use the package but it seems that somebody is working on the master branch right now. Would it be possible to set up a development branch and maybe add a few tags for the working versions so that people could download a particular release instead of an in-progress master branch? I also noticed that the notebooks disappeared right after I cloned the repository. It seems like there are some big changes going on, so sorry if the timing for this issue is not right and you are just cleaning up the repository.

running tSNE

I have some issues running tSNE with sc.tsne(adata). It seems to work on the moignard15 data set but running the same code with my data set results in the following error.

compute tSNE
preprocess using PCA with 50 PCs
--> avoid this by setting n_pcs = 0
0:00:02.013 - compute PCA with n_comps = 50
0:00:00.162 - finished
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-5-ea03cbb426c5> in <module>()
----> 1 sc.tsne(adata)

/opt/conda/lib/python3.6/site-packages/scanpy/tools/tsne.py in tsne(adata, random_state, n_pcs, perplexity)
     59             sett.m(0, 'preprocess using PCA with', n_pcs, 'PCs')
     60             sett.m(0, '--> avoid this by setting n_pcs = 0')
---> 61             X = pca(adata.X, random_state=random_state, n_comps=n_pcs)
     62             adata['X_pca'] = X
     63         else:

/opt/conda/lib/python3.6/site-packages/scanpy/tools/pca.py in pca(adata_or_X, n_comps, zero_center, svd_solver, random_state)
     60                        zero_center, svd_solver,
     61                        random_state=random_state)
---> 62         adata['X_pca'] = X_pca
     63     if isadata:
     64         return adata

UnboundLocalError: local variable 'adata' referenced before assignment

Error installing scanpy through pip

I'm trying to install scanpy through pip install scanpy but I'm getting this weird error

$ pip install scanpy
Collecting scanpy
  Downloading scanpy-0.2.9.1.tar.gz (208kB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-lbk_t73k/scanpy/setup.py", line 39, in <module>
        readme = readme_f.read()
      File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 296: ordinal not in range(128)

That's with pip version 9.0.1 and python 3.6. I'm getting similar errors for older versions of scanpy, including 0.2.1. Perhaps this is a bug in pip but I'm not sure, I installed a bunch of other unrelated packages without any issues.

Error running example notebook

I tried running https://github.com/theislab/scanpy_usage/blob/master/170501_moignard15/moignard15.ipynb and got this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/networkx/classes/graph.py in neighbors(self, n)
   1058         try:
-> 1059             return list(self.adj[n])
   1060         except KeyError:

KeyError: None

During handling of the above exception, another exception occurred:

NetworkXError                             Traceback (most recent call last)
<ipython-input-11-f3d9663e2b3b> in <module>()
      1 adata.add['dpt_groups_names'] = ['undecided/endothelial', 'endothelial', 'erythrocytes', 'trunk'] # optional
----> 2 sc.pl.dpt(adata, color=['dpt_pseudotime', 'dpt_groups', 'exp_groups'], legendloc='upper left')

~/Documents/scanpy/scanpy/plotting/__init__.py in dpt(adata, basis, color, names, comps, cont, layout, legendloc, cmap, pal, right_margin, size, titles, show)
    385         if not isinstance(color, list): colors = color.split(',')
    386         else: colors = color
--> 387     if 'dpt_groups' in colors: dpt_tree(adata, show=False)
    388     dpt_timeseries(adata, cmap=cmap, show=show)
    389 

~/Documents/scanpy/scanpy/plotting/__init__.py in dpt_tree(adata, root, colors, names, show, fontsize)
    463         if name in sett._ignore_categories: colors[iname] = 'grey'
    464     G = nx.Graph(adata.add['dpt_groups_adjacency'])
--> 465     pos = utils.hierarchy_pos(G, root)
    466     fig = pl.figure(figsize=(5, 5))
    467     ax = pl.axes([0, 0, 1, 1], frameon=False)

~/Documents/scanpy/scanpy/plotting/utils.py in hierarchy_pos(G, root, levels, width, height)
    455 
    456     if levels is None:
--> 457         levels = make_levels({})
    458     else:
    459         levels = {l: {TOTAL: levels[l], CURRENT: 0} for l in levels}

~/Documents/scanpy/scanpy/plotting/utils.py in make_levels(levels, node, currentLevel, parent)
    434             levels[currentLevel] = {TOTAL: 0, CURRENT: 0}
    435         levels[currentLevel][TOTAL] += 1
--> 436         neighbors = G.neighbors(node)
    437         if parent is not None:
    438             neighbors.remove(parent)

/usr/local/lib/python3.6/site-packages/networkx/classes/graph.py in neighbors(self, n)
   1059             return list(self.adj[n])
   1060         except KeyError:
-> 1061             raise NetworkXError("The node %s is not in the graph." % (n,))
   1062 
   1063     def neighbors_iter(self, n):

NetworkXError: The node None is not in the graph.

dpt_timeseries with adata.n_var > 50

Hi,
when I use the dpt_timeseries function with adata.n_var > 50, there is no output or any hint; the plot is just empty.
I looked at the source code and there is a comment # only if number of genes is not too high
Does it mean this function cannot deal with n_var > 50?
I think there should be a hint when n_var > 50; otherwise it will confuse others. Thank you~

fix github links for functions in docs

i found pointers about the “Edit on GitHub” button here, and this is the relevant template code.

maybe we can set meta['github_url'] like this, but this approach does nothing:

def process_docstring(app, what, name, obj, options, lines):
    nm = name.replace('.', '/').replace('api/pp', 'preprocessing')

    lines[:0] = [f':github_url: https://github.com/theislab/scanpy/tree/master/{nm}.py', '']


def setup(app):
    app.connect('autodoc-process-docstring', process_docstring)

maybe we can get inspiration from the viewcode extension

Scanpy does not work with few permissions

I'm running scanpy in a docker container with few privileges, e.g. the guest user.
Unfortunately, scanpy tries to create a folder in the local working directory:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scanpy/__init__.py", line 14, in <module>
    from . import settings
  File "/usr/local/lib/python3.6/site-packages/scanpy/settings.py", line 13, in <module>
    os.makedirs('.scanpy/')
  File "/usr/local/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '.scanpy/'

Could you use a tempdir alternatively?
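
A minimal sketch of that suggestion (illustrative, not scanpy's actual code): fall back to a temporary directory when the working directory is not writable.

import os
import tempfile

writedir = '.scanpy/'
try:
    os.makedirs(writedir, exist_ok=True)
except PermissionError:
    writedir = tempfile.mkdtemp(prefix='scanpy_')  # per-user writable fallback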

sc.tl.dpt with error: detected group with only [] cells

Hello,
it's me again; thanks a lot for your kind reply before.
When I analyze my own data using sc.tl.dpt with the default n_branchings, it works well, but when I set n_branchings to more than 0, an error occurs:

--> To enable computation of pseudotime, pass the index or expression vector
    of a root cell. Either add
        adata.add['iroot'] = root_cell_index
    or (robust to subsampling)
        adata.var['xroot'] = adata.X[root_cell_index, :]
    where "root_cell_index" is the integer index of the root cell, or
        adata.var['xroot'] = adata[root_cell_name, :].X
    where "root_cell_name" is the name (a string) of the root cell.
perform Diffusion Pseudotime analysis
    using "X_pca" for building graph
    using stored data graph with n_neighbors = 30 and spectrum
    [ 1.            0.9944264293  0.9934666753  0.9925051928  0.9899699688
      0.9893597364  0.9855745435  0.9840251803  0.981688261   0.9806631804]
    detect 1 branching
    do not consider groups with less than 2742 points for splitting
    branching 1: split group 0
WARNING: detected group with only [] cells

ValueError                                Traceback (most recent call last)
<ipython-input-3-b1749d943ac4> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'sc.tl.dpt(adata_corrected,n_jobs=48,n_pcs=30,allow_kendall_tau_shift=False,n_branchings=1)\nsc.logging.print_memory_usage()')

/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2113             magic_arg_s = self.var_expand(line, stack_depth)
   2114             with self.builtin_trap:
-> 2115                 result = fn(magic_arg_s, cell)
   2116             return result
   2117 

<decorator-gen-59> in time(self, line, cell, local_ns)

/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1178         else:
   1179             st = clock2()
-> 1180             exec(code, glob, local_ns)
   1181             end = clock2()
   1182             out = None

<timed exec> in <module>()

/public/workspace/jiping/scanpy-master/scanpy/tools/dpt.py in dpt(adata, n_branchings, n_neighbors, knn, n_pcs, n_dcs, min_group_size, n_jobs, recompute_graph, recompute_pca, allow_kendall_tau_shift, flavor, copy)
    127         adata.smp['dpt_pseudotime'] = dpt.pseudotime
    128     # detect branchings and partition the data into segments
--> 129     dpt.branchings_segments()
    130     # vector of length n_groups
    131     adata.add['dpt_groups_order'] = [str(n) for n in dpt.segs_names_unique]

/public/workspace/jiping/scanpy-master/scanpy/tools/dpt.py in branchings_segments(self)
    188             for each segment.
    189         """
--> 190         self.detect_branchings()
    191         self.postprocess_segments()
    192         self.set_segs_names()

/public/workspace/jiping/scanpy-master/scanpy/tools/dpt.py in detect_branchings(self)
    258                                   segs_connects,
    259                                   segs_undecided,
--> 260                                   segs_adjacency, iseg, tips3)
    261         # store as class members
    262         self.segs = segs

/public/workspace/jiping/scanpy-master/scanpy/tools/dpt.py in detect_branching(self, segs, segs_tips, segs_connects, segs_undecided, segs_adjacency, iseg, tips3)
    464         # branching on the segment, return the list ssegs of segments that
    465         # are defined by splitting this segment
--> 466         result = self._detect_branching(Dseg, tips3, seg)
    467         ssegs, ssegs_tips, ssegs_adjacency, ssegs_connects, trunk = result
    468         # map back to global indices

/public/workspace/jiping/scanpy-master/scanpy/tools/dpt.py in _detect_branching(self, Dseg, tips, seg_reference)
    632             if len(np.flatnonzero(newseg)) <= 1:
    633                 logg.warn('detected group with only {} cells'.format(np.flatnonzero(newseg)))
--> 634             secondtip = newseg[np.argmax(Dseg[tips[inewseg]][newseg])]
    635             ssegs_tips.append([tips[inewseg], secondtip])
    636         undecided_cells = np.arange(Dseg.shape[0], dtype=int)[nonunique]

/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/numpy/core/fromnumeric.py in argmax(a, axis, out)
    971     except AttributeError:
    972         return _wrapit(a, 'argmax', axis, out)
--> 973     return argmax(axis, out)
    974 
    975 

ValueError: attempt to get argmax of an empty sequence

Does it mean that this data didn't contain any branches? However, I do see some small tips when I plot using the diffmap result or sc.tl.dpt with the default branching. Could you help me figure it out?

Thanks,
jiping

Error when filtering: AttributeError: 'Series' object has no attribute 'is_dtype_equal'

Hi, I hit this error when trying to filter genes.
A minimal working example is included below.

Any help appreciated.

def paul15_raw():
    filename = 'data/paul15/paul15.h5'
    backup_url = 'http://falexwolf.de/data/paul15.h5'
    adata = sc.read(filename, 'data.debatched', backup_url=backup_url)
    # each row has to correspond to a sample, therefore transpose                                                                                                               
    adata = adata.transpose()
    # cluster associations identified by Paul et al.
    clusters = sc.read(filename, 'cluster.id', return_dict=True)['X'].flatten()
    # names reflecting the cell type identifications from the paper
    cell_types = {i: 'Ery' for i in range(1, 7)}
    cell_types[7] = 'MEP'
    cell_types[8] = 'Mk'
    cell_types[9] = 'GMP'
    cell_types[10] = 'GMP'
    cell_types[11] = 'DC'
    cell_types[12] = 'Baso'
    cell_types[13] = 'Baso'
    cell_types[14] = 'Mo'
    cell_types[15] = 'Mo'
    cell_types[16] = 'Neu'
    cell_types[17] = 'Neu'
    cell_types[18] = 'Eos'
    cell_types[19] = 'Other'
    adata.smp['paul15_clusters'] = [str(i) + cell_types[i] for i in clusters.astype(int)]
    infogenes_names = sc.read(filename, 'info.genes_strings', return_dict=True)['X']
    # just keep the first of the two equivalent names per gene                                                                                                                       
    adata.var_names = np.array([gn.split(';')[0] for gn in adata.var_names])
    # remove 10 corrupted gene names                                                                                                                                                 
    infogenes_names = np.intersect1d(infogenes_names, adata.var_names)
    # restrict the data to the 3461 informative genes                                                                                                                              
    adata = adata[:, infogenes_names]
    adata.add['iroot'] = np.flatnonzero(adata.smp['paul15_clusters']  == '7MEP')[0]
    return adata
    
adata = paul15_raw()
afilter = sc.pp.recipe_zheng17(adata, n_top_genes=1000, zero_center=True, plot=True, copy=True)

or

afilter = sc.pp.filter_genes_dispersion(adata, n_top_genes=1000)

both fail with
AttributeError: 'Series' object has no attribute 'is_dtype_equal'
when computing the dispersion norm (line 207, simple.py)

    207         df['dispersion_norm'] = (df['dispersion'].values  # use values here as index differs
--> 208                                  - disp_mean_bin[df['mean_bin']].values) \
    209                                  / disp_std_bin[df['mean_bin']].values

Running Scanpy version 0.2.6 on 2017-08-23 14:35.

Setup on osx (both pip and from source)

Hello,

I've tried setting up your package via pip, as instructed. This crashed out very quickly:

mib111492i:~ kp9$ pip install scanpy
Collecting scanpy
  Downloading scanpy-0.2.9.1.tar.gz (208kB)
	100% |################################| 215kB 2.8MB/s 
	Complete output from command python setup.py egg_info:
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	  File "/private/tmp/pip-build-cx2i4lbu/scanpy/setup.py", line 39, in <module>
		readme = readme_f.read()
	  File "/Users/kp9/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
		return codecs.ascii_decode(input, self.errors)[0]
	UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 296: ordinal not in range(128)

	----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/tmp/pip-build-cx2i4lbu/scanpy/

The offender seems to be the stylised README file, so I downloaded the source code, got rid of it, and proceeded with the installation. I'm unsure how representative the following encountered issues are of an ideal pip installation, but I figured I'd bring them to your attention anyway just in case they're relevant:

  • h5py crashed out on account of not having hdf5 available. This was remedied via brew install hdf5, and it seems like the most likely of these issues to affect other users.
  • The installer ignored my 2.1.0 setup of matplotlib, tried to install 2.0.0 in some weird way and crashed out. Installing 2.0.0 via pip (absolutely painlessly, mind you - what was that weird installer that crashed the thing out?) allowed the setup to proceed past this point. This might be an isolated incident, but it seemed weird enough to alert you of just in case.
  • As is, louvain crashes immediately and uninformatively when attempts are made to pip it in. Installing from the GitHub source code still works fine though. I'll notify the louvain team of this situation, but this may be of relevance to you too.

Thanks!

How to find the optimal k?

Is there a way to systematically find the ideal k? There is a find_sigmas function in the R destiny package. Is the parameter k related to the parameter sigma? Thanks.

how to proceed

  1. make things work*
  2. find out how to store PCA and friends in/with the AnnData 1cec418#commitcomment-20744162
  3. determine how to read/write AnnData. maybe fields named var_* in the HDF5 will be var metadata and so on?

*apart from things still crashing, especially the group plotting should be fixed (should probably be transformed to one scatter call with a list of all groups)

sc.pp.scale(adata) generates NaN error

I encountered this error when using data with a relatively small number of cells (~2,600). I have not encountered this error with my previous data with more cells (>10,000).

(screenshot of the sc.pp.scale NaN error attached)

StructDict.__getitem__ broken

  • struct_dict['k'] should return a simple 1D ndarray, not a StructDict
  • struct_dict[['a', 'b']] should return a proper StructDict including all additional fields like _keys

both problems are visible in this test: https://travis-ci.org/theislab/scanpy/jobs/227216146#L224

but a test should be added for the second point once the first point is fixed


i would have fixed it, but i don’t have the slightest idea what all the “multicolumn” fields are for. also i don’t believe in too many caches and private fields, they get out of sync too easily. recomputing tiny things is fast, e.g. self._keys could simply be replaced with self.dtype.names

version problems

  • installing the master branch via pip3 install --user --upgrade git+https://github.com/theislab/scanpy.git fails:

    UPDATING build/lib.linux-x86_64-3.5/None
    error: [Errno 2] No such file or directory: 'build/lib.linux-x86_64-3.5/None'
    

    maybe _version.py returns None instead of a version string for the master branch.

  • installing the tag works: pip3 install --user --upgrade git+https://github.com/theislab/scanpy.git@0.0 but the installed version will be 0.1, not 0.0 as the tag says (i’m not sure 0.0 is a legal version anyway)

module 'scanpy.api.tools' has no attribute 'diffrank'

hi,
I installed scanpy from the master branch, but when I type scanpy --help in bash the error below occurs. I noticed that diffrank was replaced by rank_genes_groups, but I don't know how to fix it.

Traceback (most recent call last):
  File "/public/bioapps/ana/anaconda3/envs/python35/bin/scanpy", line 11, in <module>
    load_entry_point('scanpy==0+unknown', 'console_scripts', 'scanpy')()
  File "/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/scanpy-0+unknown-py3.5-linux-x86_64.egg/scanpy/__main__.py", line 278, in main
    init_main_parser().print_help()
  File "/public/bioapps/ana/anaconda3/envs/python35/lib/python3.5/site-packages/scanpy-0+unknown-py3.5-linux-x86_64.egg/scanpy/__main__.py", line 117, in init_main_parser
    descr = 78*'-' + '\n' + getattr(tools, key).__doc__
AttributeError: module 'scanpy.api.tools' has no attribute 'diffrank'

Feature request - DPT predict

Thanks for your work!

There is a feature that would be nice to have. Once the DPT has been calculated, it would be nice to be able to "predict" (like in the R package destiny) new data into the calculated DiffusionMap. It would be great to include a label that tells apart the new cells from the already existing ones.

Thanks

painless C extensions with CFFI

so idk about you, but i think CFFI is pretty painless for interacting with C/C++ code:

  • you simply compile a .so file, use ffi.dlopen('….so'), ffi.cdef('void myfunc(…)') and are able to call it from python.
  • it works with PyPy
  • it can be used with numpy (example)

we just have to think about whether we need to pass complex data structures, but i assume a few dense and sparse matrices is all you need. we just have to figure that out beforehand, and how to create a sparse matrix from raw memory (AFAIK everything under the sun supports at least column-compressed layout, so we might want to switch from csr_matrix to csc_matrix)
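
To make this concrete, a minimal ABI-mode sketch; the shared library libmylib.so and its scale function are hypothetical:

import numpy as np
from cffi import FFI

ffi = FFI()
ffi.cdef("void scale(double *x, int n, double factor);")  # declare the C signature
lib = ffi.dlopen("./libmylib.so")                         # load the compiled .so

x = np.ones(10)
ptr = ffi.cast("double *", ffi.from_buffer(x))  # pointer into the numpy buffer
lib.scale(ptr, x.size, 2.0)                     # modifies x in place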

interplay with jupyter

%matplotlib inline
import scanpy

when importing scanpy from a jupyter notebook I get this warning, because apparently scanpy calls matplotlib.use(). if at all, it should only do that after checking that no backend is already selected

/home/icb/philipp.angerer/.local/lib/python3.5/site-packages/matplotlib/__init__.py:1401: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)

and this, which isn’t actually a Warning, is printed to stdout (why?)

... WARNING: did not find DISPLAY variable needed for interactive plotting
--> try ssh with `-X` or `-Y`
    setting `sett.savefigs = True`

in an interactive notebook or other shell, sett.savefigs shouldn’t be automatically set to True
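
A minimal sketch of the suggested guard: only select a backend if none has been chosen yet, i.e. matplotlib.pyplot has not been imported.

import sys
import matplotlib

if 'matplotlib.pyplot' not in sys.modules:
    matplotlib.use('Agg')  # safe: no backend has been selected yet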

Order of pre-processing steps

Hi,

In the example notebook, seurat.ipynb, the function sc.pp.normalize_per_cell() is run before sc.pp.regress_out(). Is it not better to regress out the effect of n_counts before normalization? I do not completely understand this and it would be great if the authors could explain this order of pre-processing. Also, is there certain order(s) of steps which should always be avoided?

Thank you.

Best,
Parashar

Error pip installing

I got an error doing pip3 install -e .:

clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c scanpy/cython/utils_cy.c -o build/temp.macosx-10.11-x86_64-3.6/scanpy/cython/utils_cy.o
scanpy/cython/utils_cy.c:435:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^
1 error generated.
error: command 'clang' failed with exit status 1

----------------------------------------

Command "/usr/local/opt/python3/bin/python3.6 -c "import setuptools, tokenize;__file__='/Users/jyhung/Documents/scanpy/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1

It worked after I installed cython: pip3 install cython

sc.pp.normalize_per_cell

Hi,

Maybe I do not understand this clearly, but is the function sc.pp.normalize_per_cell not supposed to normalize the counts? However, when I run this function, I see that the values do not change at all. Am I doing something wrong here?

raw_data.X.toarray().sum(axis=1)

Out[18]:  array([ 23037.,  18883.,  20755., ...,  14785.,  20996.,   7604.], dtype=float32)

normed_data = sc.pp.normalize_per_cell(raw_data, copy=True)
normed_data.X.toarray().sum(axis=1)

Out[19]: array([ 23037.,  18883.,  20755., ...,  14785.,  20996.,   7604.], dtype=float32)
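
A hedged sanity-check sketch using the API of that era (raw_data as above): with an explicit counts_per_cell_after, every cell should sum to the given target; by default cells are normalized to the median of counts per cell.

normed = sc.pp.normalize_per_cell(raw_data, counts_per_cell_after=1e4, copy=True)
print(normed.X.sum(axis=1))  # expected: all values (close to) 10000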

Thanks

psutil error

I have scanpy 0.2.7 and am trying to reproduce the pbmc3k results, but right at the beginning (sc.read()) I get the following error. I would appreciate your help, thanks.

--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in ()
2 filename_genes = '/ifs/projects/proj077/backup/public_data/scanpy_tutorials_data/PBMC3K/filtered_gene_bc_matrices/hg19/genes.tsv'
3 filename_barcodes = '/ifs/projects/proj077/backup/public_data/scanpy_tutorials_data/PBMC3K/filtered_gene_bc_matrices/hg19/barcodes.tsv'
----> 4 adata = sc.read(filename_data, cache=True).transpose()
5 adata.var_names = np.genfromtxt(filename_genes, dtype=str)[:, 1]
6 adata.smp_names = np.genfromtxt(filename_barcodes, dtype=str)

/ifs/devel/hashem/sw-v1/conda/lib/python3.6/site-packages/scanpy/readwrite.py in read(filename_or_filekey, sheet, ext, delimiter, first_column_names, backup_url, return_dict, cache)
73 if is_filename(filename_or_filekey):
74 data = read_file(filename_or_filekey, sheet, ext, delimiter,
---> 75 first_column_names, backup_url, cache)
76 if isinstance(data, dict):
77 return data if return_dict else AnnData(data)

/ifs/devel/hashem/sw-v1/conda/lib/python3.6/site-packages/scanpy/readwrite.py in read_file(filename, sheet, ext, delimiter, first_column_names, backup_url, cache)
364 os.makedirs(os.path.dirname(filename_cache))
365 # write for faster reading when calling the next time
--> 366 write_dict_to_file(filename_cache, ddata, sett.file_format_data)
367 return ddata
368

/ifs/devel/hashem/sw-v1/conda/lib/python3.6/site-packages/scanpy/readwrite.py in write_dict_to_file(filename, d, ext)
771 d_write[key] = value
772 # now open the file
--> 773 wait_until_file_unused(filename) # thread-safe writing
774 if ext == 'h5':
775 with h5py.File(filename, 'w') as f:

/ifs/devel/hashem/sw-v1/conda/lib/python3.6/site-packages/scanpy/readwrite.py in wait_until_file_unused(filename)
935
936 def wait_until_file_unused(filename):
--> 937 while (filename in get_used_files()):
938 time.sleep(1)
939

/ifs/devel/hashem/sw-v1/conda/lib/python3.6/site-packages/scanpy/readwrite.py in get_used_files()
919 def get_used_files():
920 """Get files used by processes with name scanpy."""
--> 921 loop_over_scanpy_processes = (proc for proc in psutil.process_iter()
922 if proc.name() == 'scanpy')
923 filenames = []

AttributeError: module 'psutil' has no attribute 'process_iter'


error in dpt_scatter when 'groups' parameter set

Hello,
When I call 'dpt_scatter' with the groups parameter I get the following error:
NameError: name 'names' is not defined

It looks like this is from line 230 in scanpy/plotting/ann_data.py and the 'names' variable just doesn't exist.
I'm assuming it should just be 'groups'?

Thanks,
Sarah

Identify optional dependencies

matplotlib, for example, is only necessary when plotting; for Docker images and similar, it would be useful to have a slim scanpy core.

An idea would be to do it like Jupyter:

  • A scanpy-core PyPI package with just the essentials.
  • A scanpy metapackage, which depends on scanpy-core and most (or all) of the optional dependencies.

Users doing pip install scanpy will get the full package, with no annoying runtime errors, and packagers needing flexibility get scanpy-core and can slim everything down as needed.
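
A related, hedged packaging sketch (illustrative, not scanpy's actual setup.py): a similar effect could also be reached with setuptools extras on a single package.

from setuptools import setup

setup(
    name='scanpy-core',  # hypothetical core package
    install_requires=['numpy', 'scipy', 'pandas', 'anndata'],
    extras_require={
        'plotting': ['matplotlib', 'seaborn'],  # pip install 'scanpy-core[plotting]'
        'louvain': ['louvain'],
    },
)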

cc @hensing

Error in tl.rank_genes_groups

I found that running the function 'tl.rank_genes_groups' gives the following error message:
UnboundLocalError: local variable 'adata_comp' referenced before assignment.

(screenshot of the scanpy.api.tl.rank_genes_groups error attached)

Questions about dpt_order, branch and figure size.

Hi,
I have tried the Moignard early blood development example and your results can be reproduced. However, I have some questions as follows.

  1. What does the dpt_order in the final result mean?
    I thought it should be the order of cells according to their pseudotime. However, after checking the "smp.csv" in the generated data, it seems there is no correspondence between dpt_order and dpt_pseudotime. For instance, the orders of the 3 cells with the smallest dpt_pseudotime are 3308, 1813, 3301 respectively (instead of 0, 1, 2).

  2. How to adjust the figure size of sc.pl.dpt?
    I have tried to put plt.figure(figsize=(15,15)), but it seems that it doesn't work (see the sketch after this list). Maybe we have to plot each figure manually.

  3. Can we get information about branch assignment?
    In this example, there are two branches. How can we know which branch each cell belongs to if possible? Do we have to find this indirectly with the dpt_group information?
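
Regarding question 2, a hedged sketch: matplotlib's global rcParams determine the size of figures scanpy creates internally, so setting the default before plotting should work where a plain plt.figure(figsize=...) call does not (sc and adata as in the example above).

import matplotlib as mpl

mpl.rcParams['figure.figsize'] = (15, 15)  # default size for newly created figures
sc.pl.dpt(adata)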

Thanks.

colouring of highly variable genes on pl.filter_genes_dispersion

I was trying to reproduce the results in Example 1 on notebook
https://github.com/theislab/scanpy_usage/tree/master/170505_seurat

I'm getting two problems in the filtering steps in cell 9:

  1. although genes seem to be filtered (there are 1838 genes left versus 13714 before), the plot does not show a different colour for 'highly variable' and 'other' genes; both appear black (see attached figure). I've tried it both in a jupyter notebook and in ipython. I'm running python in a conda environment with matplotlib 4.3.2.25.py35_0 and seaborn 0.8_py35

  2. There's also the following warning message, that seems to complain of a divide by zero on the mean:
    /anaconda/lib/python3.5/site-packages/scanpy/preprocessing/simple.py:193: RuntimeWarning: invalid value encountered in true_divide
    dispersion = var / mean

Thanks!

conda installation?

Hi Guys,
Thanks for developing such a wonderful tool in python.
Do you mind pointing me to how to install scanpy in an anaconda environment? Currently "conda search scanpy" doesn't find it?!
Thanks
Hashem

Duplicated cell names after concatenation

In DropSeq experiments cell names are encoded by 12nt barcodes. It seems that no name check is performed when merging multiple datasets in ScanPy.

>>> from collections import Counter
>>> import scanpy.api  as sc

>>> f = sc.read("data1.txt").transpose()
>>> g = sc.read("data2.txt").transpose()
>>> c = f.concatenate(g)

>>> len(c.obs_names)
7932
>>> len(set(c.obs_names))
7890

>>> cc = Counter(c.obs_names)
>>> cc.most_common(10)
[('AAAAAAAAAAAA', 2), ('TCCTGTCTCTTA', 2), ('CGCAAGGGAAAG', 2), ('ACCCGTCTATGT', 2), ('CTCCTGTCTCTT', 2), ('TTCCTGTCTCTT', 2), ('CCCTGTCTCTTA', 2), ('CCGCTGTCTCTT', 2), ('GACAAACCTACC', 2), ('ACACTGTCTCTT', 2)]
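
A hedged sketch of a possible fix: AnnData.concatenate accepts batch_key and index_unique arguments (present in current anndata), and appending a per-batch suffix makes the names unique again.

c = f.concatenate(g, batch_key='batch', index_unique='-')
# 'AAAAAAAAAAAA' from the first file becomes 'AAAAAAAAAAAA-0',
# the one from the second file becomes 'AAAAAAAAAAAA-1'
assert len(c.obs_names) == len(set(c.obs_names))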

don’t return dicts

e.g. pp.filter_genes_dispersion returns a dict with known keys. if you know the keys, this means you should return an object where you can use attribute access instead. usually the pythonic thing to do in this case is to return a namedtuple.

in this case though, we have

>>> {k: v.shape for k, v in filter_result.items()}
{'dispersions': (173351,),
 'dispersions_norm': (173351,),
 'gene_filter': (173351,),
 'means': (173351,)}

which is a perfect case for a recarray.
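
A minimal sketch of that suggestion, bundling the four equal-length arrays into one record array with attribute access (sizes taken from the shapes above):

import numpy as np

n = 173351
dtype = [('gene_filter', bool), ('means', 'f8'),
         ('dispersions', 'f8'), ('dispersions_norm', 'f8')]
filter_result = np.zeros(n, dtype=dtype).view(np.recarray)
# attribute access instead of dict lookup:
# filter_result.means, filter_result.gene_filter, filter_result.dispersions_norm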

About data preprocessing for diffusion pseudotime

Hi, first thanks for sharing this analysis tool. I much prefer Python to R, though most bioinformatics tools are written in R. Here I want to ask a question about data processing before we feed the data as adata into dpt for pseudotime ordering.

As the DPT algorithm can accept multiple types of data, most commonly single-cell qPCR (Ct values) and RNA-Seq (FPKM/TPM) data, is the data processing procedure identical for both? I have also checked the Monocle 2 algorithm, and it seems much more complicated there. For instance, on the 4th page of its documentation (link), it asks you to specify a different expressionFamily, i.e., the proper distribution of the data, for different kinds of data. How about the dpt function in scanpy? Does it treat all kinds of data the same way?

According to my understanding,

  • For qPCR data, we should provide delta_Ct = LOD - Ct values to dpt (LOD: limit of detection; see the sketch after this list);
  • For RNA-Seq data, we should offer log2(FPKM+1) to dpt.
    Is it right?
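
In code, the two transformations sketched in this list would look like the following (dummy values; LOD is an assumed limit of detection):

import numpy as np

ct = np.array([22.0, 25.5, 28.0])    # example qPCR Ct values (dummy)
LOD = 30.0                           # assumed limit of detection
delta_ct = LOD - ct                  # qPCR input to dpt

fpkm = np.array([0.0, 12.3, 540.0])  # example RNA-Seq FPKM values (dummy)
log_expr = np.log2(fpkm + 1)         # RNA-Seq input to dpt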

Any help is appreciated.

AnnData cannot be row-sliced after tl.diffmap

I want to split AnnData after tl.diffmap according to each cell's library. But it appears that row-slicing AnnData after diffmap, dpt, or louvain gives the error message AttributeError: 'AnnData' object has no attribute '_n_obs'. But AnnData.X and AnnData.obs can be sliced. Could you please give me advice?

>>> adata = sc.read_10x_h5('filtered_gene_bc_matrices_h5.h5', 'mm10')
>>> scanpy.api.tl.diffmap(adata)
>>> adata[:, 0]
View of AnnData object with n_obs × n_vars = 5000 × 1
>>> adata[0, :]
AttributeError: 'AnnData' object has no attribute '_n_obs'

read_file_to_dict to csv or xlsx

don't think writing to csv is actually implemented
writing to xlsx raises an error: pandas.core.common.PandasError: DataFrame constructor not properly called!

eliminate jet

Rethink group IDs in rank_genes_groups

rank_genes_groups “returns” two recarrays, each with the shape #cells×#groups. one of them stores gene IDs, one the genes’ scores.

the problem with this is that recarrays store their column index (names) in the dtype, in a place where only strings are accepted. however users (and indeed both our wilcoxon example and the tests) may choose to use numeric group IDs.

genes with score 0 are unimportant anyway, so maybe we should return sparse data, in the form of a long-form recarray with something like this shape (with <group_by> being the rank_genes_groups parameter of the same name):

obs    var         <group_by>    score
0      ENSGXXXX    5             9.728

This way the three IDs can have user-defined types, and the data is easier to process via e.g. pd.DataFrame.from_records(adata.obs['gene_ranking'])

The data should probably be sorted by descending score within each group, i.e. if it was a DataFrame: return gene_ranking.sort_values([group_by, 'score'], ascending=[True, False])
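
A hedged sketch of the proposed long-form layout as a plain pandas DataFrame (names are illustrative, not existing scanpy API):

import pandas as pd

gene_ranking = pd.DataFrame({
    'obs': [0, 0, 1],
    'var': ['ENSGXXXX', 'ENSGYYYY', 'ENSGXXXX'],  # gene IDs, any dtype
    'group': [5, 5, 3],                           # user-defined group IDs, any dtype
    'score': [9.728, 4.1, 7.2],
})
# descending score within each group, as suggested above
ranked = gene_ranking.sort_values(['group', 'score'], ascending=[True, False])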

Docs Enhancements

  • It would be better to refer to the anndata docs for the AnnData class (http://anndata.readthedocs.io/en/latest/anndata.AnnData.html) whenever :class:`~scanpy.api.AnnData` appears. :class:`AnnData <http://anndata.readthedocs.io/en/latest/anndata.AnnData.html>` has the correct css style, but does not hyperlink. Probably a solution via http://www.sphinx-doc.org/en/1.5.1/ext/extlinks.html together with the definition of an :extclass: role that inherits the :class: properties would be the correct way to do it.
  • A few references, like "[Traag1723]" are not rendered correctly... Who knows what's going on there. I couldn't figure it out with a few tests... Let's see.
  • the Neighbors class docstring doesn't render properly
  • changing to the slim docstring style from the numpy docstring style messes up readability when calling the docstring lookup in jupyter or other IDEs, hence I'd advocate for maintaining this information

AnnData

  • __init__ method appears in AnnData
  • attributes appear after methods

Branching points detection

Is there any way to estimate the number of branching points? It seems that this number has to be explicitly given before running the algorithm, which might be difficult if it's not clear from simply looking at the dimensionally reduced data.
