perslab / cellex

CELLEX (CELL-type EXpression-specificity)

License: GNU General Public License v3.0

Python 12.78% Jupyter Notebook 86.29% R 0.93%

cellex's Issues

Syntax Error in det.py file

Hello!

There seems to be a syntax error issue in the 'det.py' file under cellex/metrics/. I am trying to install and run cellex, but whenever I try to install cellex it always gives me an 'invalid syntax' error at the following line:

 Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/vidalarroyo/CELLECT/CELLEX/setup.py", line 5, in <module>
        from cellex import __author__, __email__
      File "cellex/__init__.py", line 17, in <module>
        from . import metrics
      File "cellex/metrics/__init__.py", line 1, in <module>
        from .det import det
      File "cellex/metrics/det.py", line 9
        def _det(mean: pd.DataFrame, var: pd.DataFrame, n_cells: pd.DataFrame, verbose: bool=False):
                     ^
    SyntaxError: invalid syntax
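
The caret points at the first annotated parameter (`mean: pd.DataFrame`). Function annotations are Python 3 syntax, so this error usually means the install command was run by a Python 2 interpreter. Below is a minimal sketch of a version check. The `(3, 6)` floor is an assumption for illustration only, not a documented CELLEX requirement.

```python
import sys

# Assumed minimum version for illustration only; check the project's own docs.
ASSUMED_MIN_VERSION = (3, 6)


def interpreter_is_supported(version_info=sys.version_info, minimum=ASSUMED_MIN_VERSION):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Running Python " + sys.version.split()[0])
    if not interpreter_is_supported():
        print("This interpreter cannot parse annotated function signatures; "
              "try installing with python3 / pip3 instead.")
```

If the check fails, re-running the installation with an explicit Python 3 interpreter is the first thing to try.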

IndexError: tuple index out of range

First time running CELLEX, I get the following error.

data.shape
(28621, 118345)
metadata.shape
(118345, 1)

I disabled ANOVA gene filtering (see #24).
Any advice?

eso.compute(verbose=True)
Computing DET ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 49 sec
Computing EP ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 0 sec
Computing GES ...
esw ...
empirical p-values ...
esw_s ...
finished in 0 min 47 sec
Computing NSI ...
esw ...
Traceback (most recent call last):
File "", line 1, in
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/esobject.py", line 120, in compute
esm_result = getattr(metrics, m.lower())(self.summary_data, verbose, compute_meta)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/metrics/nsi.py", line 123, in nsi
esw = _nsi(df, verbose)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/cellex/metrics/nsi.py", line 69, in _nsi
fc_mean = fc_ranked_norm.mean(axis=1)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 138, in _mean
rcount = _count_reduce_items(arr, axis)
File "/home/rasmusr/bin/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 57, in _count_reduce_items
items *= arr.shape[ax]
IndexError: tuple index out of range

[BUG]: MissingOutputException in line 200 of /home/wbone/CELLECT/cellect-ldsc.snakefile:

Are hdf5 files produced by CELLEX compatible with R?

Test if hdf5 files produced by CELLEX can be read by other implementations (R specifically). Numpy may have added elements that don't play nice with R.

mapping crashes due to FileNotFoundError

code entered

cellex.utils.mapping.ens_mouse_to_ens_human(eso.results["esmu"], drop_unmapped=True, verbose=True)

expected behavior

Map gene names

actual behavior

Mapping: mouse ensembl gene id's --> human ensembl gene id's ...

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-11-75787d394ce1> in <module>
----> 1 cellex.utils.mapping.ens_mouse_to_ens_human(eso.results["esmu"], drop_unmapped=True, verbose=True)

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/cellex/utils/mapping/ens_mouse_to_ens_human.py in ens_mouse_to_ens_human(df_unmapped, drop_unmapped, verbose)
     35     fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz"
     36 
---> 37     df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True)
     38 
     39     # create dictionary for mapping mouse ensemble gene id's to human ensembl gene id's

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133     def _make_engine(self, engine="c"):
   1134         if engine == "c":
-> 1135             self._engine = CParserWrapper(self.f, **self.options)
   1136         else:
   1137             if engine == "python":

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1904         kwds["usecols"] = self.usecols
   1905 
-> 1906         self._reader = parsers.TextReader(src, **kwds)
   1907         self.unnamed_cols = self._reader.unnamed_cols
   1908 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    161             mode += 'b'
    162         if fileobj is None:
--> 163             fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    164         if filename is None:
    165             filename = getattr(fileobj, 'name', '')

FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz'
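
The path in the traceback (`CELLEX/cellex/utils/mapping/maps/...`) is relative, so it only resolves when the working directory happens to be the parent of the repository folder. One common alternative is to build the path from the module's own location. The sketch below illustrates that idea; the helper name and the `maps` folder argument are hypothetical and not part of the CELLEX API.

```python
from pathlib import Path


def mapping_file_path(module_file, filename, maps_dirname="maps"):
    """Build an absolute path to `filename` inside a `maps` folder that
    sits next to the module located at `module_file`.

    `maps_dirname` and this helper are illustrative, not CELLEX API.
    """
    return Path(module_file).resolve().parent / maps_dirname / filename


if __name__ == "__main__":
    # Inside a real module one would pass __file__ as `module_file`.
    path = mapping_file_path(__file__, "example_map.txt.gz")
    print(path, "exists:", path.exists())
```

With a path built this way, the lookup no longer depends on the directory from which Python or Jupyter was started.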

Reading in data with pandas in a server setting is slow. How can I speed this up?

I’m using:

data = pd.read_csv("./data.csv", index_col=0)

to read the expression matrix of primary cells downloaded from https://cells.ucsc.edu/?ds=organoidreportcard . There are nearly 200,000 primary cells in this dataset (11GB). Python is taking several hours to read it. I read that pd.read_csv is not recommended when the file has a large number of columns (I have 189,410). Do you have any suggestions for reading this and similarly big CSV files in a format that would still work with CELLEX?
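
One option that uses only standard pandas features is to read the file in row chunks and downcast each chunk as it arrives. This is a sketch, not a benchmarked recommendation. It mainly bounds peak memory, and the time spent parsing text may still dominate. The `chunksize` and `dtype` defaults are arbitrary choices.

```python
import numpy as np
import pandas as pd


def read_expression_csv(path, chunksize=1000, dtype=np.float32):
    """Read a genes-by-cells CSV in row chunks and downcast each chunk.

    Only standard pandas options are used; the `chunksize` and `dtype`
    defaults are illustrative and should be tuned to the data.
    """
    chunks = []
    for chunk in pd.read_csv(path, index_col=0, chunksize=chunksize):
        chunks.append(chunk.astype(dtype))  # float32 uses half the memory of float64
    return pd.concat(chunks)
```

After the first successful read, saving the frame in a binary format (for example with `DataFrame.to_pickle`) avoids parsing the CSV again in later sessions.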

quick start tutorial: bug when importing cellex module

code entered

import numpy as np # needed for formatting data for this tutorial
import pandas as pd # needed for formatting data for this tutorial
import CELLEX.cellex as cellex # needed when importing directly from this repo

Expected behaviour

load modules

Actual behaviour

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-12-6f0745f9fe8a> in <module>
      1 import numpy as np # needed for formatting data for this tutorial
      2 import pandas as pd # needed for formatting data for this tutorial
----> 3 import CELLEX.cellex as cellex # needed when importing directly from this repo

/nfsdata/projects/jonatan/tools/sc-genetics/CELLEX/cellex/__init__.py in <module>
     15 from . import preprocessing
     16 from . import utils
---> 17 from .esobject import ESObject
     18 from .summarydata import SummaryData

/nfsdata/projects/jonatan/tools/sc-genetics/CELLEX/cellex/esobject.py in <module>
      8 from . import metrics
      9 from . import utils
---> 10 from cellex import ES_METRICS
     11 
     12 

ModuleNotFoundError: No module named 'cellex'
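
The traceback shows `esobject.py` running the absolute import `from cellex import ES_METRICS`. That needs `cellex` to be importable as a top-level name, which importing it as `CELLEX.cellex` does not provide. The sketch below reproduces the pattern with a throwaway package and shows that putting the repository directory itself on `sys.path` resolves it. The names `demo_repo` and `demo_pkg` are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_repo(root):
    """Create root/demo_repo/demo_pkg with an absolute self-import,
    mirroring the `from cellex import ES_METRICS` pattern."""
    pkg = Path(root) / "demo_repo" / "demo_pkg"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("VALUE = 1\nfrom . import sub\n")
    (pkg / "sub.py").write_text("from demo_pkg import VALUE\n")
    return pkg.parent  # the "repository" directory


def try_import(name):
    """Return None on success, or the exception class name on failure."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return None
    except ImportError as err:
        return type(err).__name__


def demo():
    """Import the package both ways and report the outcome of each."""
    root = tempfile.mkdtemp()
    repo_dir = build_demo_repo(root)
    added = []
    try:
        sys.path.insert(0, root)  # like starting Python in the repo's parent folder
        added.append(root)
        nested = try_import("demo_repo.demo_pkg")

        sys.path.insert(0, str(repo_dir))  # put the repository itself on the path
        added.append(str(repo_dir))
        top_level = try_import("demo_pkg")
    finally:
        # Undo the changes so the demo can be re-run in the same session.
        for entry in added:
            sys.path.remove(entry)
        for name in list(sys.modules):
            if name.split(".")[0] in ("demo_repo", "demo_pkg"):
                del sys.modules[name]
    return nested, top_level


if __name__ == "__main__":
    print(demo())
```

Applied to this issue, the equivalent would be adding the `CELLEX` directory to `sys.path` (or installing the package) and then using `import cellex`.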

Can not generate n_es_gene plot

Trying to generate a n_es_gene plot, I get the following error:

PlotnineError                             Traceback (most recent call last)
/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in __repr__(self)
     93         # in the jupyter notebook.
     94         if not self.figure:
---> 95             self.draw()
     96         plt.show()
     97         return '<ggplot: (%d)>' % self.__hash__()

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in draw(self, return_ggplot)
    186         # new frames knowing that they are separate from the original.
    187         with pd.option_context('mode.chained_assignment', None):
--> 188             return self._draw(return_ggplot)
    189 
    190     def _draw(self, return_ggplot=False):

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in _draw(self, return_ggplot)
    193         # assign a default theme
    194         self = deepcopy(self)
--> 195         self._build()
    196 
    197         # If no theme we use the default

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/ggplot.py in _build(self)
    312 
    313         # Apply position adjustments
--> 314         layers.compute_position(layout)
    315 
    316         # Reset position scales, then re-train and map.  This

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/layer.py in compute_position(self, layout)
     90     def compute_position(self, layout):
     91         for l in self:
---> 92             l.compute_position(layout)
     93 
     94     def use_defaults(self):

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/layer.py in compute_position(self, layout)
    427         in concert with the other objects in the panel
    428         """
--> 429         params = self.position.setup_params(self.data)
    430         data = self.position.setup_data(self.data, params)
    431         data = self.position.compute_layer(data, params, layout)

/tools/anaconda/envs/kfm338/jp/lib/python3.7/site-packages/plotnine/positions/position_dodge.py in setup_params(self, data)
     33             msg = ("Width not defined. "
     34                    "Set with `position_dodge(width = ?)`")
---> 35             raise PlotnineError(msg)
     36 
     37         params = copy(self.params)

PlotnineError: 'Width not defined. Set with `position_dodge(width = ?)`'

Probably, this is due to the fact that I have many metadata classes (tried both 37 cell annotations and 10 main trajectories as metadata classes) and therefore the generated plot is going to be too wide.

[BUG]:Error running eso.compute

Hi - many thanks for this great tool. I'm struggling to get things to run as I keep running into an error:

ValueError: operands could not be broadcast together with shapes (9217,9) (9217,9216,10) for the eso.compute(verbose=True) step. This is happening both with my own dataset, and with the tutorial dataset from demo_mousebrain_vascular_cells.ipynb (the shape varies with which dataset I use).

I am running this on a Linux desktop, with CELLEX version 1.2.2. Is there any reason why this might be happening? Please let me know what else I could provide that would be useful.

Many thanks in advance.

Load input data as sparse matrix

I'm using the dataset available on https://singlecell.broadinstitute.org/single_cell/study/SCP1376/a-single-cell-atlas-of-human-and-mouse-white-adipose-tissue (from https://www.nature.com/articles/s41586-022-04518-2).
I extracted the UMI counts using:
GetAssayData(object = adipocytes, slot = "counts")
which returned a large R S4 dgCMatrix. It has everything I need, genes as row names, cells as column names, UMI counts as values.
I'm trying to convert it to data frame to then use it as input on CELLEX, but because it's too big, I'm unable to convert it to data frame or matrix.
Is there a way to make CELLEX accept sparse matrix?
Or is there a way that you're aware of to convert this sparse matrix to data frame, keeping columns and row names?
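
For the conversion question, pandas can wrap a SciPy sparse matrix in a sparse-backed DataFrame while keeping row and column names. On the R side the matrix could be exported with `Matrix::writeMM` and read in Python with `scipy.io.mmread`. The sketch below uses a tiny in-memory matrix with made-up gene and cell names. Whether CELLEX's preprocessing works on sparse-backed columns is not established here and would need testing.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_to_dataframe(matrix, gene_names, cell_ids):
    """Wrap a SciPy sparse matrix (genes x cells) in a sparse-backed DataFrame,
    keeping gene names as the index and cell ids as the columns."""
    return pd.DataFrame.sparse.from_spmatrix(matrix, index=gene_names, columns=cell_ids)


if __name__ == "__main__":
    # Toy counts standing in for a matrix loaded with scipy.io.mmread.
    counts = sparse.csc_matrix(np.array([[0, 3, 0], [1, 0, 0]]))
    df = sparse_to_dataframe(counts, ["gene_1", "gene_2"], ["cell_a", "cell_b", "cell_c"])
    print(df)
    print("density:", df.sparse.density)
```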

Thank you,

Make tutorial for how to import data from H5ad/loom(/Seurat)

Make a tutorial that demonstrates how to import data from commonly used single cell data formats into CELLEX:

H5ad
loom
(Seurat)

New functionality: implement geneset enrichment test

Implement 'geneset enrichment test'
Statistical test: Wilcoxon test on ESmu. Test whether genes in the gene list have higher ESmu than genes not in the list.
Input: gene list(s)
Output: enrichment statistics (stat, pval, conf int, ...) for each cell-type and gene list.
See BMIbrain manuscript for method description.
(See Jon's code for inspiration: https://github.com/perslab/19-BMI-brain-wgcna)
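
A rough sketch of what such a test could look like, using SciPy's one-sided Mann-Whitney U (Wilcoxon rank-sum) test. This is not the method from the manuscript or the linked repository; the function name and output columns are assumptions, and confidence intervals are not computed.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def geneset_enrichment(esmu, gene_list):
    """One-sided rank-sum test per cell type: are ESmu values of genes in
    `gene_list` larger than those of the remaining genes?

    `esmu` is a genes x cell-types DataFrame; returns one row per cell type.
    """
    in_set = esmu.index.isin(gene_list)
    rows = []
    for cell_type in esmu.columns:
        values = esmu[cell_type]
        stat, pval = mannwhitneyu(values[in_set], values[~in_set], alternative="greater")
        rows.append({"cell_type": cell_type, "statistic": stat, "pvalue": pval})
    return pd.DataFrame(rows).set_index("cell_type")
```

For several gene lists, the function can be called once per list and the results concatenated, followed by a multiple-testing correction of the p-values.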

add R scripts for plotting ES object

Add R scripts for plotting ES object using HDF5 files.

[BUG]: eso.compute error.

Hi there,

Thanks for the great package.

While running the eso.compute(), I encountered an error like this.

I did install all the requirements listed in the requirements.txt file, as was also suggested in previous issues.

I was able to run CELLEX with the mousebrain_vascular_cells example, just not my dataset.

I'd greatly appreciate any thought on what is causing the error and how might one solve it.

Thanks!

Best regards,
Michelle

gene mapping function: from symbol to ensembl

Problem
Many scRNA-seq data sets come with human gene symbols. We should make it easier for users to map to Ensembl IDs, since this is used in CELLECT.

Solution
Write a function to map from human gene symbols to Ensembl IDs.

Additionally, consider updating mapping function names for more consistency:
(NEW: ens_human_symbol_to_ens --> human_symbol_to_human_ens)
ens_human_to_symbol --> human_ens_to_human_symbol
ens_mouse_to_ens_human --> mouse_ens_to_human_ens
mgi_mouse_to_ens_mouse --> mouse_symbol_to_mouse_ens

The appropriate file to make the mapping is attached (which allows for mapping genes with 'version numbers' to the appropriate Ensembl ID):
GRCh38.ens_v90.gene_name_version2ensembl.txt.gz
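
A minimal sketch of the mapping step itself, assuming the mapping file has already been loaded into a plain dictionary from old to new identifiers. The helper name is hypothetical, and the `drop_unmapped` argument only mirrors the parameter seen in the existing mapping functions.

```python
import pandas as pd


def map_gene_index(df, mapping, drop_unmapped=True):
    """Rename the gene index of `df` using `mapping` (old id -> new id).

    Unmapped genes are dropped, or kept under their old id if
    `drop_unmapped` is False. Illustrative helper, not the CELLEX API.
    """
    new_ids = [mapping.get(gene) for gene in df.index]
    if drop_unmapped:
        is_mapped = [new_id is not None for new_id in new_ids]
        result = df.loc[is_mapped].copy()
        result.index = [new_id for new_id in new_ids if new_id is not None]
    else:
        result = df.copy()
        result.index = [new_id if new_id is not None else gene
                        for new_id, gene in zip(new_ids, df.index)]
    result.index.name = df.index.name
    return result
```

Because every direction (symbol to Ensembl, mouse to human) reduces to a dictionary lookup, one shared helper like this would also make the proposed renaming of the public functions easier.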

[FEATURE REQUEST]: Work with large datasets

Hi there,

I'm trying to run CELLEX to create the specificity score for downstream CELLECT analysis.

Our current dataset has ~500K cells and exporting it into dense matrix looks to be prohibitive, plus we may not have a system that has enough memory to run the software.

Do you have any suggestion how we can do this, would bootstrapping work?

Or do you have a version that makes use of sparse matrices natively?

Many thanks
Brian

parse_input does not detect invalid metadata value types

I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the preprocessing step, the ANOVA gene filtering removes all my genes. Of course, as you describe in the workflow, I could just omit this step (ANOVA=False), however, it seems relevant to include.

What would you recommend I do? Would some initial gene filtering for low-expressed genes or similar help?

Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 0 min 56 sec
Preprocessing - normalizing data ... data normalized in 2 min 18 sec
Preprocessing - running ANOVA ... excluded 28621 / 28621 genes in 2 min 14 sec

CELLEX crashes if duplicated cell_ids

Problem: CELLEX crashes if data contains duplicate cell_ids.
Solution: check for duplicated cell_ids before running function

data = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDD'))
data.head()
metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
metadata.head()
eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
Preprocessing - running remove_non_expressed ... excluded 0 / 100 genes in 0 min 0 sec
Preprocessing - normalizing data ... data normalized in 0 min 0 sec
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-17-537009d482a9> in <module>
      3 metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
      4 metadata.head()
----> 5 eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/cellex/esobject.py in __init__(self, data, annotation, remove_non_expressed, normalize, anova, verbose)
     53 
     54         if type(annotation) is pd.Series:
---> 55             annotation = data.columns.map(annotation, na_action="ignore").values.astype(str)
     56 
     57         if anova:

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in map(self, mapper, na_action)
   4872         from .multi import MultiIndex
   4873 
-> 4874         new_values = super()._map_values(mapper, na_action=na_action)
   4875 
   4876         attributes = self._get_attributes_dict()

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
   1275                 values = self.values
   1276 
-> 1277             indexer = mapper.index.get_indexer(values)
   1278             new_values = algorithms.take_1d(mapper._values, indexer)
   1279 

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   2976         if not self.is_unique:
   2977             raise InvalidIndexError(
-> 2978                 "Reindexing only valid with uniquely" " valued Index objects"
   2979             )
   2980 

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
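
The proposed check could be as small as the sketch below, which looks for repeated column labels before any preprocessing starts. The function names are hypothetical.

```python
import pandas as pd


def find_duplicate_cell_ids(data):
    """Return the cell ids (column labels) that occur more than once."""
    columns = pd.Index(data.columns)
    return sorted(set(columns[columns.duplicated()]))


def check_unique_cell_ids(data):
    """Raise a readable error before any heavy preprocessing starts."""
    duplicates = find_duplicate_cell_ids(data)
    if duplicates:
        raise ValueError("Duplicate cell ids in data columns: {}".format(duplicates))
```

For the example above, `find_duplicate_cell_ids(data)` would report the repeated column `D`.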

Improve mapping documentation

Problem

The documentation is not clear on which genome build(s) are being used for mapping.
Furthermore, the resources used from mapping are from different builds.

Solution

Update documentation of mapping functions.
Update the mapping files to be consistent.

np.float needs to be changed to float due to deprecation in NumPy

CELLEX uses np.float in the anova.py file. The np.float alias was deprecated in NumPy 1.20 and removed in NumPy 1.24. Changing np.float to just float seems to fix the issue.
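
Since `np.float` was only an alias for the built-in `float`, the replacement is a drop-in change. A minimal illustration with a made-up array:

```python
import numpy as np

values = np.array([1, 2, 3])

# `np.float` was only an alias for the builtin `float`; both spellings below
# yield 64-bit floats and work on old and new NumPy versions alike.
as_builtin = values.astype(float)
as_explicit = values.astype(np.float64)

print(as_builtin.dtype, as_explicit.dtype)
```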

Metadata_class use

Hi,

quick question: what is the purpose of the metadata_class use in the vignette? I assume it's needed to compute ESµ between conditions? It is not specified anywhere, not in the documentation or in the publication itself or in the longer CELLECT tutorial.

Update parameter name in CELLEX wiki

The CELLEX workflow wiki page gives the following example of creating an ESObject:

cellex.ESObject(df=data, annotation=metadata, normalize=False, verbose=True)

However, the first argument should be named 'data'.

[BUG]: Error saving summary data

I get an error when I try to save summary data from an ESObject:

eso.summary_data.save()

I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_1017639/1709116986.py in <module>
      1 # Save summary data
----> 2 eso.summary_data.save()

... lib/python3.8/site-packages/cellex/summarydata.py in save(self, dir_name, verbose)
    190             att = getattr(self, s)
    191             if isinstance(att, pd.DataFrame) and s != "data":
--> 192                 fp = "{}/summarystat.{}.csv.gz".format(dir_name, self.name, s)
    193                 att.to_csv(fp, compression="gzip")
    194                 if verbose:

AttributeError: 'SummaryData' object has no attribute 'name'

It looks like there are too many parameters provided for the filename.

I am using CELLEX version 1.2.1
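
The template on line 192 has two placeholders but receives three values. The AttributeError appears because `self.name` is evaluated before formatting. A sketch of the line with only the values the template uses follows; whether the maintainers intended a name in the file name is not known, and the helper is hypothetical.

```python
def summary_filename(dir_name, stat_name):
    """Build the output path from exactly the values the template uses."""
    return "{}/summarystat.{}.csv.gz".format(dir_name, stat_name)


if __name__ == "__main__":
    # str.format silently ignores surplus positional arguments, so the
    # third value below never reaches the file name.
    print("{}/summarystat.{}.csv.gz".format("out", "name", "mean"))
    print(summary_filename("out", "mean"))
```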

List requirements.txt in README

Error in README

A wrong parameter name in the ESObject constructor:

Create ESObject and compute ESmu

eso = cellex.ESObject(df=data, annotation=metadata, verbose=True)

Here we must have 'data' as the first parameter name instead of 'df':

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

gene_profile plot x-axis labels are offset and sometimes missing

When the dataframe contains 20-40 cell-types, labels are offset to the left.
When the dataframe contains <= 20 cell-types, the labels are shifted one place to the left and the leftmost one disappears.

This may be an issue with plotnine.

Example: dataframe with 21 cell-types

Labels are offset, but match the plot.

Example: dataframe with 20 cell-types

Labels are offset and one label is missing (on the left)

Warn if negative gene expression values

Print warning if receiving negative gene expression values.
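
A possible shape for this check, using the standard `warnings` module. The function name and message are assumptions.

```python
import warnings

import pandas as pd


def warn_if_negative(data):
    """Emit a UserWarning if `data` holds negative values; return their count."""
    n_negative = int((data.values < 0).sum())
    if n_negative:
        warnings.warn("{} negative expression values found; was the input already "
                      "log-transformed or scaled?".format(n_negative))
    return n_negative
```

Returning the count lets the caller decide whether to continue, while the warning itself stays visible in notebooks and logs.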

[BUG]: Operands could not be broadcast together with shapes (15831,12) (15831,15830,13)

Dear friend,

I came across a ValueError when I ran CELLEX on my data like this:

I worked around the error by modifying line 44 of cellex/metrics/det.py as follows:
Change
n_cells = np.array([n_cells.values] * mean.shape[0]) # faster than count

To
n_cells = np.array(n_cells.values * mean.shape[0]) # faster than count

A member of my lab uses the same tools but has never come across this problem. I am wondering whether this bug is caused by a different version of numpy, because it seems to be an array calculation problem. My numpy version is 1.21.6, and my cellex version is 1.2.2.

I would be thankful if this problem could be fixed so that CELLEX works in a broader range of environments. I would also appreciate it if you could tell me the real reason for the different calculation behaviour on my computer.

Thanks a lot!

Amy
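
The two lines compared above are not equivalent. Repeating a list stacks copies of the array, whereas multiplying by an integer scales its values. The sketch below shows the resulting shapes, assuming (not confirmed by the report) that `n_cells.values` is a small array of per-cell-type counts. It does not by itself explain the exact shapes in the error message.

```python
import numpy as np

n_genes = 4
counts_1d = np.array([10, 20, 30])        # shape (3,): one count per cell type
counts_2d = counts_1d.reshape(1, -1)      # shape (1, 3): same data as a single row

# List repetition stacks copies, so the result gains a leading axis.
stacked_1d = np.array([counts_1d] * n_genes)   # shape (4, 3)
stacked_2d = np.array([counts_2d] * n_genes)   # shape (4, 1, 3): an extra axis appears

# Multiplying by an integer scales the values instead of repeating them.
scaled = counts_1d * n_genes                   # shape (3,), values 40, 80, 120

# np.tile repeats rows explicitly and returns 2-D output for both inputs.
tiled = np.tile(counts_2d, (n_genes, 1))       # shape (4, 3)

print(stacked_1d.shape, stacked_2d.shape, scaled.shape, tiled.shape)
```

Because the modified line changes the numbers rather than just the shape, it would be worth comparing its results against a run that does not raise the error.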

quick start tutorial: AttributeError: 'ESObject' object has no attribute 'save'

Tutorial

Code entered

eso.save(verbose=True)

expected behaviour

save the results to disk

actual behaviour

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in
----> 1 eso.save(verbose=True)

AttributeError: 'ESObject' object has no attribute 'save'

.txt.gz files for "mapping"-functions are not found

User DaianeH has experienced the following error when attempting to use the cellex.utils.mapping.mgi_mouse_to_ens_mouse() function:

eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"])
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

Indicating that Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz was not found. I replicated the error and also got a

FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

at the end.

I tried the other mapping functions and they returned the same error for their corresponding mapping files:
cellex.utils.mapping.ens_mouse_to_ens_human():

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-8-74444136e9e6> in <module>
----> 1 cellex.utils.mapping.ens_mouse_to_ens_human(df.iloc[:,0])

~/miniconda3/envs/cellex/lib/python3.6/site-packages/cellex/utils/mapping/ens_mouse_to_ens_human.py in ens_mouse_to_ens_human(df_unmapped, drop_unmapped, verbose)
     35     fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz"
     36 
---> 37     df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True)
     38 
     39     # create dictionary for mapping mouse ensemble gene id's to human ensembl gene id's

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133     def _make_engine(self, engine="c"):
   1134         if engine == "c":
-> 1135             self._engine = CParserWrapper(self.f, **self.options)
   1136         else:
   1137             if engine == "python":

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1904         kwds["usecols"] = self.usecols
   1905 
-> 1906         self._reader = parsers.TextReader(src, **kwds)
   1907         self.unnamed_cols = self._reader.unnamed_cols
   1908 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

~/miniconda3/envs/cellex/lib/python3.6/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    161             mode += 'b'
    162         if fileobj is None:
--> 163             fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    164         if filename is None:
    165             filename = getattr(fileobj, 'name', '')

FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/hsapiens_mmusculus_unique_orthologs.GRCh37.ens_v91.txt.gz'

cellex.utils.mapping.ens_human_to_symbol()

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-9-6808aa31ce1a> in <module>
----> 1 cellex.utils.mapping.ens_human_to_symbol(df.iloc[:,0])

~/miniconda3/envs/cellex/lib/python3.6/site-packages/cellex/utils/mapping/ens_human_to_symbol.py in ens_human_to_symbol(df_unmapped, drop_unmapped, verbose)
     36     fp_mapping_file = "CELLEX/cellex/utils/mapping/maps/GRCh38.ens_v90.ensembl2gene_name_version.txt.gz"
     37 
---> 38     df_map = pd.read_csv(fp_mapping_file, delim_whitespace=True, compression="gzip")
     39 
     40     # create dictionary for mapping human ensemble gene id's to gene names

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133     def _make_engine(self, engine="c"):
   1134         if engine == "c":
-> 1135             self._engine = CParserWrapper(self.f, **self.options)
   1136         else:
   1137             if engine == "python":

~/miniconda3/envs/cellex/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1904         kwds["usecols"] = self.usecols
   1905 
-> 1906         self._reader = parsers.TextReader(src, **kwds)
   1907         self.unnamed_cols = self._reader.unnamed_cols
   1908 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

~/miniconda3/envs/cellex/lib/python3.6/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    161             mode += 'b'
    162         if fileobj is None:
--> 163             fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    164         if filename is None:
    165             filename = getattr(fileobj, 'name', '')

FileNotFoundError: [Errno 2] No such file or directory: 'CELLEX/cellex/utils/mapping/maps/GRCh38.ens_v90.ensembl2gene_name_version.txt.gz'

Memory usage needs to be optimized

I have been running CELLEX on a huge dataset with ~1.3 million cells recently. To my surprise, I encountered the following error:

MemoryError: Unable to allocate array with shape (1331984, 26182) and data type float64

Thus, the server does not have enough memory to complete the task if the expression matrix is stored as float64 (the default). CELLEX consumes > 50% of RAM (more than 1 TB) and then the analysis stops.

To the developers: is it really necessary to store the expression matrix as float64? Is this very high precision relevant? Are you sure that float32 is not sufficient?

To users: I was able to solve the problem by converting my gene expression matrix (the variable data in the tutorial) from the default data type float64 to float32 before creating the ESObject, as follows:

data_float32 = data.astype(np.float32)

Don’t forget to delete the variables after (we need to save the Yggdrasil’s RAM):

del data
del data_float32
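
To see why the dtype matters, the size of a single dense copy can be estimated from the shape in the error message. The helper below is only a back-of-the-envelope sketch: a single float64 copy comes to roughly 279 GB and a float32 copy to half that, and any intermediate copies made during processing multiply those figures.

```python
import numpy as np


def dense_matrix_gb(n_rows, n_cols, dtype):
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols array of `dtype`."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


if __name__ == "__main__":
    shape = (1331984, 26182)  # shape taken from the MemoryError above
    for dtype in (np.float64, np.float32):
        print(np.dtype(dtype).name, round(dense_matrix_gb(*shape, dtype), 1), "GB")
```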

Additional methods for gene filtering

Implement additional methods for gene filtering.
The ANOVA model is relatively slow and other simple 'QC methods' must exist.
The ANOVA model works best if data is normally distributed, which may not always be the case.
Seek inspiration from established software, e.g. scanpy filter_genes
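
One simple filter in that spirit keeps only genes detected in a minimum number of cells. This is a sketch of the idea, not an existing CELLEX option; the function name and the `min_cells` default are assumptions.

```python
import pandas as pd


def filter_genes_min_cells(data, min_cells=3):
    """Keep genes (rows) with a non-zero count in at least `min_cells` cells."""
    n_expressing = (data > 0).sum(axis=1)
    return data.loc[n_expressing >= min_cells]
```

Such a filter runs in a single pass over the matrix and makes no distributional assumptions, which addresses both concerns raised about the ANOVA step.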

eso.results["esmu"] data.frame lacks index column name

Description:

the output of eso.compute(), eso.results["esmu"], lacks index column name (should be 'gene')
