perslab / cellex
CELLEX (CELL-type EXpression-specificity)
License: GNU General Public License v3.0
Hello!
There seems to be a syntax error issue in the 'det.py' file under cellex/metrics/. I am trying to install and run cellex, but whenever I try to install cellex it always gives me an 'invalid syntax' error at the following line:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/vidalarroyo/CELLECT/CELLEX/setup.py", line 5, in <module>
from cellex import __author__, __email__
File "cellex/__init__.py", line 17, in <module>
from . import metrics
File "cellex/metrics/__init__.py", line 1, in <module>
from .det import det
File "cellex/metrics/det.py", line 9
def _det(mean: pd.DataFrame, var: pd.DataFrame, n_cells: pd.DataFrame, verbose: bool=False):
^
SyntaxError: invalid syntax
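A SyntaxError on an annotated signature such as `mean: pd.DataFrame` is what a Python 2 interpreter reports, since function annotations are Python 3 syntax. A minimal sketch, assuming that is the cause here, of a check that could be run before installing. The `(3, 6)` floor and the helper name are assumptions for illustration, not a documented CELLEX requirement:

```python
import sys

# Assumed minimum version; hypothetical, not taken from the CELLEX docs.
ASSUMED_MIN_PYTHON = (3, 6)


def interpreter_is_supported(version_info=sys.version_info, minimum=ASSUMED_MIN_PYTHON):
    """Return True if the (major, minor) interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not interpreter_is_supported():
    # Reached only on an interpreter older than the assumed minimum.
    print("Python {}.{} is too old; try installing with python3.".format(*sys.version_info[:2]))
```

If the check fails, installing with an explicit Python 3 interpreter (for example `python3 -m pip install .` from the repository directory) would be the next thing to try.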
First time running CELLEX, I get the following error.
data.shape
(28621, 118345)
metadata.shape
(118345, 1)
I disabled ANOVA gene filtering (see #24).
Any advice?
eso.compute(verbose=True)
Computing DET ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 49 sec
Computing EP ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 0 sec
Computing GES ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 47 sec
Computing NSI ...
esw ...
Traceback (most recent call last):
File "", line 1, in
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/esobject.py", line 120, in compute
esm_result = getattr(metrics, m.lower())(self.summary_data, verbose, compute_meta)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/metrics/nsi.py", line 123, in nsi
esw = _nsi(df, verbose)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/metrics/nsi.py", line 69, in _nsi
fc_mean = fc_ranked_norm.mean(axis=1)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 138, in _mean
rcount = _count_reduce_items(arr, axis)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 57, in _count_reduce_items
items *= arr.shape[ax]
IndexError: tuple index out of range
Test if hdf5 files produced by CELLEX can be read by other implementations (R specifically). Numpy may have added elements that don't play nice with R.
cellex.utils.mapping.ens_mouse_to_ens_human(eso.results["esmu"], drop_unmapped=True, verbose=True)
Map gene names
Mapping: mouse ensembl gene id's --> human ensembl gene id's ...
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-11-75787d394ce1> in <module>
----> 1 cellex.utils.mapping.ens_mouse_to_ens_human(eso.results["esmu"], drop_unmapped=True, verbose=True)
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/cellex/utils/mapping/ens_mouse_to_ens_human.py in ens_mouse_to_ens_human(df_unmapped, drop_unmapped, verbose)
35 fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz"
36
---> 37 df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True)
38
39 # create dictionary for mapping mouse ensemble gene id's to human ensembl gene id's
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
455
456 # Create the parser.
--> 457 parser = TextFileReader(fp_or_buf, **kwds)
458
459 if chunksize or iterator:
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
893 self.options["has_index_names"] = kwds["has_index_names"]
894
--> 895 self._make_engine(self.engine)
896
897 def close(self):
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1133 def _make_engine(self, engine="c"):
1134 if engine == "c":
-> 1135 self._engine = CParserWrapper(self.f, **self.options)
1136 else:
1137 if engine == "python":
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1904 kwds["usecols"] = self.usecols
1905
-> 1906 self._reader = parsers.TextReader(src, **kwds)
1907 self.unnamed_cols = self._reader.unnamed_cols
1908
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
161 mode += 'b'
162 if fileobj is None:
--> 163 fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
164 if filename is None:
165 filename = getattr(fileobj, 'name', '')
FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz'
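The traceback shows the mapping file being opened through a path that is relative to the current working directory ("CELLEX/cellex/..."), so it only resolves when Python is started from the directory that contains the clone. A minimal sketch of building the path relative to the loading module instead. The helper name `resolve_map_path` is hypothetical, and the "maps" subdirectory simply mirrors the path in the traceback:

```python
from pathlib import Path


def resolve_map_path(module_file, filename):
    """Build the path to a bundled mapping file relative to the module that loads it.

    `module_file` is expected to be the loading module's __file__.
    """
    return Path(module_file).resolve().parent / "maps" / filename


# Hypothetical module location, for illustration only.
example = resolve_map_path(
    "/site-packages/cellex/utils/mapping/ens_mouse_to_ens_human.py",
    "hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz",
)
print(example)
```

Inside the package this would be called with `__file__`, which makes the lookup independent of where the interpreter was started.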
I’m using:
data = pd.read_csv("./data.csv", index_col=0)
to read the expression matrix of primary cells downloaded from https://cells.ucsc.edu/?ds=organoidreportcard . There are nearly 200,000 primary cells in this dataset (11 GB). Python is taking several hours to read it. I read that pd.read_csv is not recommended when there is a large number of columns in the file (I have 189,410). Do you have any suggestion or recommendation for reading this and similarly big CSV files in a format that would still make CELLEX work?
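One option that stays within pandas is to read the file in row chunks and downcast each chunk as it arrives, so that a full 64-bit copy never has to sit in memory. This is only a sketch: the helper name and the chunk size are assumptions, and whether it shortens the load time enough for a file this wide has not been tested here.

```python
import io

import numpy as np
import pandas as pd


def read_expression_csv(source, chunksize=1000, dtype=np.float32):
    """Read a genes-by-cells CSV in row chunks and return one DataFrame.

    The first column becomes the index; every other column is cast to
    `dtype` as each chunk is read.
    """
    chunks = []
    for chunk in pd.read_csv(source, index_col=0, chunksize=chunksize):
        chunks.append(chunk.astype(dtype))
    return pd.concat(chunks)


# Tiny in-memory stand-in for ./data.csv
demo_csv = io.StringIO("gene,c1,c2,c3\ng1,0,1,2\ng2,3,0,0\ng3,0,0,5\n")
data = read_expression_csv(demo_csv, chunksize=2)
print(data.shape, data.dtypes.unique())
```

The result is still an ordinary genes-by-cells DataFrame, which is the input shape used elsewhere in these reports.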
import numpy as np # needed for formatting data for this tutorial
import pandas as pd # needed for formatting data for this tutorial
import CELLEX.cellex as cellex # needed when importing directly from this repo
load modules
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-12-6f0745f9fe8a> in <module>
1 import numpy as np # needed for formatting data for this tutorial
2 import pandas as pd # needed for formatting data for this tutorial
----> 3 import CELLEX.cellex as cellex # needed when importing directly from this repo
/nfsdata/projects/jonatan/tools/sc-genetics/CELLEX/cellex/__init__.py in <module>
15 from . import preprocessing
16 from . import utils
---> 17 from .esobject import ESObject
18 from .summarydata import SummaryData
/nfsdata/projects/jonatan/tools/sc-genetics/CELLEX/cellex/esobject.py in <module>
8 from . import metrics
9 from . import utils
---> 10 from cellex import ES_METRICS
11
12
ModuleNotFoundError: No module named 'cellex'
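The traceback shows esobject.py running `from cellex import ES_METRICS`, an absolute import, so `cellex` has to be importable as a top-level package. Importing it as `CELLEX.cellex` does not satisfy that. A minimal sketch of one workaround, putting the clone directory itself on `sys.path`. The path "./CELLEX" and the helper name are assumptions for illustration:

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Prepend the cloned repository directory to sys.path (once) and return it."""
    repo_path = str(Path(repo_dir).resolve())
    if repo_path not in sys.path:
        sys.path.insert(0, repo_path)
    return repo_path


# Hypothetical clone location; replace with the directory that contains cellex/.
repo = add_repo_to_path("./CELLEX")
# After this, `import cellex` would resolve to ./CELLEX/cellex if that directory exists.
```

Installing the package into the environment would make the path manipulation unnecessary.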
Trying to generate a n_es_gene plot, I get the following error:
PlotnineError Traceback (most recent call last)
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
400 if cls is not object \
401 and callable(cls.__dict__.get('__repr__')):
--> 402 return _repr_pprint(obj, self, cycle)
403
404 return _default_pprint(obj, self, cycle)
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
695 """A pprint that just redirects to the normal repr function."""
696 # Find newlines and replace them with p.break_()
--> 697 output = repr(obj)
698 for idx,output_line in enumerate(output.splitlines()):
699 if idx:
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in __repr__(self)
93 # in the jupyter notebook.
94 if not self.figure:
---> 95 self.draw()
96 plt.show()
97 return '<ggplot: (%d)>' % self.__hash__()
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in draw(self, return_ggplot)
186 # new frames knowing that they are separate from the original.
187 with pd.option_context('mode.chained_assignment', None):
--> 188 return self._draw(return_ggplot)
189
190 def _draw(self, return_ggplot=False):
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in _draw(self, return_ggplot)
193 # assign a default theme
194 self = deepcopy(self)
--> 195 self._build()
196
197 # If no theme we use the default
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in _build(self)
312
313 # Apply position adjustments
--> 314 layers.compute_position(layout)
315
316 # Reset position scales, then re-train and map. This
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/layer.py in compute_position(self, layout)
90 def compute_position(self, layout):
91 for l in self:
---> 92 l.compute_position(layout)
93
94 def use_defaults(self):
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/layer.py in compute_position(self, layout)
427 in concert with the other objects in the panel
428 """
--> 429 params = self.position.setup_params(self.data)
430 data = self.position.setup_data(self.data, params)
431 data = self.position.compute_layer(data, params, layout)
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/positions/position_dodge.py in setup_params(self, data)
33 msg = ("Width not defined. "
34 "Set with `position_dodge(width = ?)`")
---> 35 raise PlotnineError(msg)
36
37 params = copy(self.params)
PlotnineError: 'Width not defined. Set with `position_dodge(width = ?)`'
Probably, this is due to the fact that I have many metadata classes (tried both 37 cell annotations and 10 main trajectories as metadata classes) and therefore the generated plot is going to be too wide.
Hi - many thanks for this great tool. I'm struggling to get things to run as I keep running into an error:
ValueError: operands could not be broadcast together with shapes (9217,9) (9217,9216,10)
for the eso.compute(verbose=True) step. This is happening both with my own dataset and with the tutorial dataset from demo_mousebrain_vascular_cells.ipynb (the shape varies with which dataset I use).
I am running this on a Linux desktop, with CELLEX version 1.2.2. Is there any reason why this might be happening? Please let me know what else I could provide that would be useful.
Many thanks in advance.
I'm using the dataset available on https://singlecell.broadinstitute.org/single_cell/study/SCP1376/a-single-cell-atlas-of-human-and-mouse-white-adipose-tissue (from https://www.nature.com/articles/s41586-022-04518-2).
I extracted the UMI counts using:
GetAssayData(object = adipocytes, slot = "counts")
which returned a large R S4 dgCMatrix. It has everything I need, genes as row names, cells as column names, UMI counts as values.
I'm trying to convert it to data frame to then use it as input on CELLEX, but because it's too big, I'm unable to convert it to data frame or matrix.
Is there a way to make CELLEX accept sparse matrix?
Or is there a way that you're aware of to convert this sparse matrix to data frame, keeping columns and row names?
Thank you,
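On the second question, pandas can wrap a SciPy sparse matrix in a DataFrame while keeping row and column names, via `pd.DataFrame.sparse.from_spmatrix`. The sketch below uses a tiny in-memory matrix as a stand-in for the exported dgCMatrix. Getting the matrix out of R (for example by writing Matrix Market format and reading it with `scipy.io.mmread`) is assumed and not shown, and whether CELLEX's preprocessing keeps the data sparse or converts it to dense internally is not something this example verifies.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for the exported UMI count matrix (genes x cells).
counts = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 0], [0, 0, 3]]))
genes = ["geneA", "geneB", "geneC"]   # row names exported from R
cells = ["cell1", "cell2", "cell3"]   # column names exported from R

# Sparse-backed DataFrame: only non-zero entries are stored per column.
data = pd.DataFrame.sparse.from_spmatrix(counts, index=genes, columns=cells)
print(data.shape)
```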
Make a tutorial that demonstrates how to import data from commonly used single cell data formats into CELLEX:
Implement 'geneset enrichment test'
Statistical test: Wilcoxon test on ESmu. Test if genes in the gene list have higher ESmu than genes not in the list.
Input: gene list(s)
Output: enrichment statistics (stat, pval, conf int, ...) for each cell-type and gene list.
See BMIbrain manuscript for method description.
(See Jon's code for inspiration: https://github.com/perslab/19-BMI-brain-wgcna)
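The proposed test could be sketched as follows, using `scipy.stats.mannwhitneyu` (the Wilcoxon rank-sum test) with a one-sided alternative. The function name, the output columns, and the toy ESmu values are assumptions for illustration. This is not the BMIbrain implementation, and it does not compute confidence intervals.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def geneset_enrichment(esmu, gene_list):
    """One-sided rank-sum test per cell type: ESmu of listed genes vs. all other genes."""
    in_list = esmu.index.isin(gene_list)
    rows = []
    for cell_type in esmu.columns:
        values = esmu[cell_type]
        stat, pval = mannwhitneyu(values[in_list], values[~in_list], alternative="greater")
        rows.append({"cell_type": cell_type, "stat": stat, "pval": pval})
    return pd.DataFrame(rows).set_index("cell_type")


# Toy ESmu matrix (genes x cell types): the first genes score high in A and low in B.
genes = ["g{}".format(i) for i in range(20)]
esmu = pd.DataFrame(
    {"A": [x / 20 for x in range(20, 0, -1)], "B": [x / 20 for x in range(1, 21)]},
    index=genes,
)
result = geneset_enrichment(esmu, genes[:5])
print(result)
```

For several gene lists, the function would be called once per list and the results concatenated.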
Add R scripts for plotting ES object using HDF5 files.
Hi there,
Thanks for the great package.
While running the eso.compute(), I encountered an error like this.
I did install all the requirements as in the requirement.txt file, which was also suggested in previous issues.
I was able to run CELLEX with the mousebrain_vascular_cells example, just not my dataset.
I'd greatly appreciate any thought on what is causing the error and how might one solve it.
Thanks!
Best regards,
Michelle
Problem
Many scRNA-seq data sets come with human gene symbols. We should make it easier for users to map to Ensembl IDs, since this is used in CELLECT.
Solution
Write a function to map from human gene symbols to Ensembl IDs.
Additionally, consider updating mapping function names for more consistency:
(NEW: ens_human_symbol_to_ens --> human_symbol_to_human_ens)
ens_human_to_symbol --> human_ens_to_human_symbol
ens_mouse_to_ens_human --> mouse_ens_to_human_ens
mgi_mouse_to_ens_mouse --> mouse_symbol_to_mouse_ens
The appropriate file to make the mapping is attached (which allows for mapping genes with 'version numbers' to the appropriate Ensembl ID):
GRCh38.ens_v90.gene_name_version2ensembl.txt.gz
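A minimal sketch of what such a mapping step could look like once the mapping file has been loaded into a dictionary. The helper name `map_index` is hypothetical, the two-entry dictionary stands in for the attached file, and the `drop_unmapped` parameter mirrors the name used by the existing mapping functions:

```python
import pandas as pd


def map_index(df, mapping, drop_unmapped=True):
    """Rename the index of `df` through `mapping`; optionally drop rows without a match."""
    mapped = df.copy()
    mapped.index = pd.Index(df.index.map(mapping))  # unmapped keys become NaN
    if drop_unmapped:
        mapped = mapped[mapped.index.notna()]
    return mapped


# Illustrative stand-in for the symbol-to-Ensembl mapping file.
symbol_to_ens = {"TP53": "ENSG00000141510", "GAPDH": "ENSG00000111640"}
esmu = pd.DataFrame({"cell_type_1": [0.9, 0.1, 0.5]}, index=["TP53", "GAPDH", "NOT_A_GENE"])

result = map_index(esmu, symbol_to_ens)
print(result)
```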
Hi there,
I'm trying to run CELLEX to create the specificity score for downstream CELLECT analysis.
Our current dataset has ~500K cells and exporting it into dense matrix looks to be prohibitive, plus we may not have a system that has enough memory to run the software.
Do you have any suggestion how we can do this, would bootstrapping work?
Or do you have a version that makes use of sparse matrices natively?
Many thanks
Brian
I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the preprocessing step, the ANOVA gene filtering removes all my genes. Of course, as you describe in the workflow, I could just omit this step (ANOVA=False), however, it seems relevant to include.
What would you recommend I do? Would some initial gene filtering for low-expressed genes or similar help?
Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 0 min 56 sec
Preprocessing - normalizing data ... data normalized in 2 min 18 sec
Preprocessing - running ANOVA ... excluded 28621 / 28621 genes in 2 min 14 sec
Problem: CELLEX crashes if data contains duplicate cell_ids.
Solution: check for duplicated cell_ids before running function
data = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDD'))
data.head()
metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
metadata.head()
eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
Preprocessing - running remove_non_expressed ... excluded 0 / 100 genes in 0 min 0 sec
Preprocessing - normalizing data ... data normalized in 0 min 0 sec
---------------------------------------------------------------------------
InvalidIndexError Traceback (most recent call last)
<ipython-input-17-537009d482a9> in <module>
3 metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
4 metadata.head()
----> 5 eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/cellex/esobject.py in __init__(self, data, annotation, remove_non_expressed, normalize, anova, verbose)
53
54 if type(annotation) is pd.Series:
---> 55 annotation = data.columns.map(annotation, na_action="ignore").values.astype(str)
56
57 if anova:
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in map(self, mapper, na_action)
4872 from .multi import MultiIndex
4873
-> 4874 new_values = super()._map_values(mapper, na_action=na_action)
4875
4876 attributes = self._get_attributes_dict()
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
1275 values = self.values
1276
-> 1277 indexer = mapper.index.get_indexer(values)
1278 new_values = algorithms.take_1d(mapper._values, indexer)
1279
/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
2976 if not self.is_unique:
2977 raise InvalidIndexError(
-> 2978 "Reindexing only valid with uniquely" " valued Index objects"
2979 )
2980
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
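The check proposed in the solution could be sketched like this, run on the input before the ESObject is created. The helper name is hypothetical and the toy data reuses the reproduction above:

```python
import numpy as np
import pandas as pd


def find_duplicate_cell_ids(data):
    """Return the column labels (cell IDs) that occur more than once."""
    return data.columns[data.columns.duplicated()].unique().tolist()


data = pd.DataFrame(np.random.randint(0, 100, size=(100, 5)), columns=list("ABCDD"))
duplicates = find_duplicate_cell_ids(data)
if duplicates:
    print("Duplicate cell IDs found: {}".format(duplicates))
```

Raising an error with this message instead of printing would stop the run before pandas reaches the reindexing step that fails above.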
CELLEX uses np.float in the anova.py file; this alias was deprecated in NumPy 1.20 and removed in NumPy 1.24. Changing np.float to just float seems to fix the issue.
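For reference, a short check that the builtin `float` and `np.float64` give the same dtype, so either can stand in for the removed alias. The array here is a toy example, not code from anova.py:

```python
import numpy as np

values = np.array([1, 2, 3])

# Both casts produce a float64 array; np.float was only an alias for the builtin float.
as_builtin = values.astype(float)
as_float64 = values.astype(np.float64)
print(as_builtin.dtype, as_float64.dtype)
```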
Hi,
Quick question: what is the purpose of metadata_class as used in the vignette? I assume it's needed to compute ESµ between conditions? It is not specified anywhere: not in the documentation, the publication itself, or the longer CELLECT tutorial.
The CELLEX workflow wiki page gives the following example of running ESObject:
cellex.ESObject(df=data, annotation=metadata, normalize=False, verbose=True)
However, the name of the first argument should be 'data'.
I get an error when I try to save summary data from an ESObject:
eso.summary_data.save()
I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_1017639/1709116986.py in <module>
1 # Save summary data
----> 2 eso.summary_data.save()
... lib/python3.8/site-packages/cellex/summarydata.py in save(self, dir_name, verbose)
190 att = getattr(self, s)
191 if isinstance(att, pd.DataFrame) and s != "data":
--> 192 fp = "{}/summarystat.{}.csv.gz".format(dir_name, self.name, s)
193 att.to_csv(fp, compression="gzip")
194 if verbose:
AttributeError: 'SummaryData' object has no attribute 'name'
It looks like there are too many parameters provided for the filename.
I am using CELLEX version 1.2.1
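The format string in the traceback has two placeholders but receives three arguments, and the AttributeError is raised while evaluating `self.name`, before any formatting happens. A sketch of the presumed fix, dropping the extra argument. The helper name is hypothetical, and this is a guess at the intended behaviour rather than the maintainers' patch:

```python
def summary_stat_path(dir_name, stat_name):
    """Build the output path for one summary statistic."""
    return "{}/summarystat.{}.csv.gz".format(dir_name, stat_name)


print(summary_stat_path("out", "mean"))
```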
A wrong parameter name in the ESObject constructor:
eso = cellex.ESObject(df=data, annotation=metadata, verbose=True)
Here we must have 'data' as the first parameter name instead of 'df':
eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
When the dataframe contains 20-40 cell-types, labels are offset to the left.
When the dataframe contains <= 20 cell-types, the labels are shifted one place to the left and the leftmost one disappears.
This may be an issue with plotnine.
Labels are offset, but match the plot.
Labels are offset and one label is missing (on the left)
Print warning if receiving negative gene expression values.
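A minimal sketch of such a check, assuming the input is a genes-by-cells DataFrame. The function name and the wording of the message are illustrative only:

```python
import warnings

import pandas as pd


def warn_if_negative(data):
    """Warn if `data` contains negative values; return how many were found."""
    n_negative = int((data < 0).to_numpy().sum())
    if n_negative > 0:
        warnings.warn("Input contains {} negative expression value(s).".format(n_negative))
    return n_negative


demo = pd.DataFrame({"cell1": [1, -2], "cell2": [0, 3]}, index=["geneA", "geneB"])
warn_if_negative(demo)
```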
Dear friend,
I came across a ValueError when I used cellex on my data.
I fixed this bug by modifying line 44 in cellex/metrics/det.py as follows:
Change
n_cells = np.array([n_cells.values] * mean.shape[0]) # faster than count
To
n_cells = np.array(n_cells.values * mean.shape[0]) # faster than count
A member of my lab uses the same tools but has never come across this problem. I am wondering if this bug is caused by a different version of numpy, because it seems to be an array calculation problem. My numpy version is 1.21.6, and my cellex version is 1.2.2.
I would be thankful if this problem could be fixed so that CELLEX tolerates a broader range of environments. I would also appreciate it if you could tell me the real reason for the different calculation behaviour on my computer.
Thanks a lot!
Amy
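The two versions of that line do different things, which may matter when judging the fix. The sketch below uses small stand-in arrays (assumptions, not the real CELLEX objects) to show that repeating a list stacks copies along a new leading axis, while multiplying the array scales its values. It does not establish why the error appears only in some environments.

```python
import numpy as np

n_rows = 4                          # stands in for mean.shape[0]
n_cells = np.array([10, 20, 30])    # stands in for n_cells.values (assumed 1-D here)

# Original line: list repetition, then stacking. One copy of the counts per row.
repeated = np.array([n_cells] * n_rows)      # shape (4, 3)

# Modified line: elementwise multiplication. The counts themselves are scaled.
scaled = np.array(n_cells * n_rows)          # shape (3,)

# With a 2-D input, list repetition yields a 3-D array instead.
repeated_2d = np.array([n_cells.reshape(1, -1)] * n_rows)   # shape (4, 1, 3)

# Shape-explicit alternative that gives one row of counts per gene for either input.
tiled = np.tile(n_cells.ravel(), (n_rows, 1))               # shape (4, 3)

print(repeated.shape, scaled.shape, repeated_2d.shape, tiled.shape)
```

If the downstream formula expects the cell counts unchanged, the modified line would silence the shape error while altering the numbers, so the results are worth comparing against a run in an environment where the original line works.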
eso.save(verbose=True)
save the results to disk
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
----> 1 eso.save(verbose=True)
AttributeError: 'ESObject' object has no attribute 'save'
User DaianeH has experienced the following error when attempting to use the cellex.utils.mapping.mgi_mouse_to_ens_mouse() function:
eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"])
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'
This indicates that Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz was not found. I replicated the error and also got the following at the end:
FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'
I tried the other mapping functions and they returned the same error for their corresponding mapping files.
cellex.utils.mapping.ens_mouse_to_ens_human():
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-8-74444136e9e6> in <module>
----> 1 cellex.utils.mapping.ens_mouse_to_ens_human(df.iloc[:,0])
~/miniconda3/envs/cellex/lib/python3.6/site-packages/cellex/utils/mapping/ens_mouse_to_ens_human.py in ens_mouse_to_ens_human(df_unmapped, drop_unmapped, verbose)
35 fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz"
36
---> 37 df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True)
38
39 # create dictionary for mapping mouse ensemble gene id's to human ensembl gene id's
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
455
456 # Create the parser.
--> 457 parser = TextFileReader(fp_or_buf, **kwds)
458
459 if chunksize or iterator:
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
893 self.options["has_index_names"] = kwds["has_index_names"]
894
--> 895 self._make_engine(self.engine)
896
897 def close(self):
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1133 def _make_engine(self, engine="c"):
1134 if engine == "c":
-> 1135 self._engine = CParserWrapper(self.f, **self.options)
1136 else:
1137 if engine == "python":
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1904 kwds["usecols"] = self.usecols
1905
-> 1906 self._reader = parsers.TextReader(src, **kwds)
1907 self.unnamed_cols = self._reader.unnamed_cols
1908
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
~/miniconda3/envs/cellex/lib/python3.6/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
161 mode += 'b'
162 if fileobj is None:
--> 163 fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
164 if filename is None:
165 filename = getattr(fileobj, 'name', '')
FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz'
cellex.utils.mapping.ens_human_to_symbol()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-9-6808aa31ce1a> in <module>
----> 1 cellex.utils.mapping.ens_human_to_symbol(df.iloc[:,0])
~/miniconda3/envs/cellex/lib/python3.6/site-packages/cellex/utils/mapping/ens_human_to_symbol.py in ens_human_to_symbol(df_unmapped, drop_unmapped, verbose)
36 fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/GRCh38.ens_v90.ensembl2gene_name_version.txt.gz"
37
---> 38 df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True, compression="gzip")
39
40 # create dictionary for mapping human ensemble gene id's to gene names
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
455
456 # Create the parser.
--> 457 parser = TextFileReader(fp_or_buf, **kwds)
458
459 if chunksize or iterator:
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
893 self.options["has_index_names"] = kwds["has_index_names"]
894
--> 895 self._make_engine(self.engine)
896
897 def close(self):
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1133 def _make_engine(self, engine="c"):
1134 if engine == "c":
-> 1135 self._engine = CParserWrapper(self.f, **self.options)
1136 else:
1137 if engine == "python":
~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1904 kwds["usecols"] = self.usecols
1905
-> 1906 self._reader = parsers.TextReader(src, **kwds)
1907 self.unnamed_cols = self._reader.unnamed_cols
1908
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
~/miniconda3/envs/cellex/lib/python3.6/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
161 mode += 'b'
162 if fileobj is None:
--> 163 fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
164 if filename is None:
165 filename = getattr(fileobj, 'name', '')
FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/GRCh38.ens_v90.ensembl2gene_name_version.txt.gz'
I have recently been running CELLEX on a huge dataset with ~1.3 million cells. To my surprise, I encountered the following error:
MemoryError: Unable to allocate array with shape (1331984, 26182) and data type float64
Thus, the server does not have enough memory to complete the task if the expression matrix is stored as float64 (the default). CELLEX consumes > 50% of RAM (more than 1 TB) and then the analysis stops.
To developers: is it really necessary to store the expression matrix as float64? Is this very high precision relevant? Are you sure that float32 is not sufficient?
To users: I was able to solve the problem by converting my gene expression matrix (the variable data in the tutorial) from the default data type float64 to float32 before creating the ESObject, as follows:
data_float32 = data.astype(np.float32)
Don’t forget to delete the variables after (we need to save the Yggdrasil’s RAM):
del data
del data_float32
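For a sense of scale, the size of a single dense copy of the matrix from the error message can be worked out from its shape and dtype. The helper name below is hypothetical. The arithmetic gives roughly 260 GiB per float64 copy and half that for float32, so usage above 1 TB would imply several copies or intermediates alive at once (an inference, not something measured here).

```python
import numpy as np


def dense_matrix_bytes(shape, dtype):
    """Bytes needed for one dense array of the given shape and dtype."""
    n_rows, n_cols = shape
    return n_rows * n_cols * np.dtype(dtype).itemsize


shape = (1331984, 26182)   # shape reported in the MemoryError
bytes_f64 = dense_matrix_bytes(shape, np.float64)
bytes_f32 = dense_matrix_bytes(shape, np.float32)

print("float64: {:.1f} GiB".format(bytes_f64 / 2**30))
print("float32: {:.1f} GiB".format(bytes_f32 / 2**30))
```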
Implement additional methods for gene filtering.
The ANOVA model is relatively slow and other simple 'QC methods' must exist.
The ANOVA model works best if data is normally distributed, which may not always be the case.
Seek inspiration from established software, e.g. scanpy filter_genes
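One simple filter of this kind keeps genes detected in at least a minimum number of cells, the same idea as the `min_cells` option of scanpy's filter_genes. The sketch below is a plain pandas version with a hypothetical function name and an arbitrary default threshold. It is not part of CELLEX.

```python
import pandas as pd


def filter_genes_by_min_cells(data, min_cells=3):
    """Keep genes (rows) with a non-zero count in at least `min_cells` cells (columns)."""
    n_expressing = (data > 0).sum(axis=1)
    return data.loc[n_expressing >= min_cells]


# Toy genes-by-cells count matrix.
counts = pd.DataFrame(
    [[1, 2, 0, 4], [0, 0, 0, 1], [5, 0, 3, 2]],
    index=["geneA", "geneB", "geneC"],
    columns=["c1", "c2", "c3", "c4"],
)
filtered = filter_genes_by_min_cells(counts, min_cells=3)
print(filtered.index.tolist())
```

Unlike the ANOVA step, this makes no distributional assumption and only needs one pass over the matrix.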
After eso.compute(), eso.results["esmu"] lacks an index column name (should be 'gene').