Giter Site home page Giter Site logo

martinjzhang / scdrs Goto Github PK

View Code? Open in Web Editor NEW
94.0 3.0 11.0 200.34 MB

Single-cell disease relevance score (scDRS)

Home Page: https://martinjzhang.github.io/scDRS/

License: MIT License

Jupyter Notebook 97.92% Python 1.89% Shell 0.18% R 0.01% JavaScript 0.01%
cell-state gwas single-cell-rna-seq diseases-and-complex-traits within-cell-type-heterogeneity

scdrs's Introduction

DOI

scDRS (single-cell disease-relevance score) is a method for associating individual cells in single-cell RNA-seq data with disease GWASs, built on top of AnnData and Scanpy.

Read the documentation: installation, usage, command-line interface (CLI), file formats, etc.

Check out instructions for making customized gene sets using MAGMA.

Reference

Zhang*, Hou*, et al. "Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data", Nature Genetics, 2022.

Versions

  • v1.0.3: development version. Fixing a bug of negative values of ct_mean when --adj-prop and --cov are on and there are genes extremely low expression; print --adj-prop info in scdrs compute-score; check p-value and z-score files that the gene column should have header GENE; force index in df_cov and df_score to be str; add --min-genes and --min-cells in CLI for customized filtering; adjustable FDR threshold for plot_group_stats #75.
  • v1.0.2: latest stable version. Bug fixes on scdrs.util.plot_group_stats; input checks in scdrs munge-gs and scdrs.util.load_h5ad.

Older versions

  • v1.0.1: stable version used in publication. Identical to v1.0.0 except documentation.
  • v1.0.0: stable version used in revision 1. Results are identical to v0.1 for binary gene sets. Changes with respect to v0.1:
    • scDRS command-line interface (CLI) instead of .py scripts for calling scDRS in bash, including scdrs munge-gs, scdrs compute-score, and scdrs perform-downstream.
    • More efficient in memory use due to the use of sparse matrix throughout the computation.
    • Allow the use of quantitative weights.
    • New feature --adj-prop for adjusting for cell type-proportions.
  • v0.1: stable version used in the initial submission.

Code and data to reproduce results of the paper

See scDRS_paper for more details (experiments folder is deprecated). Data are at figshare.

  • Download GWAS gene sets (.gs files) for 74 diseases and complex traits.
  • Download scDRS results (.score.gz and .full_score.gz files) for TMS FACS + 74 diseases/trait.

Older versions

Explore scDRS results via CELLxGENE

cellxgene cellxgene
110,096 cells from 120 cell types in TMS FACS IBD-associated cells

scDRS scripts (deprecated)


NOTE: scDRS scripts are still maintained but deprecated. Consider using scDRS command-line interface instead.


scDRS script for score calculation

Input: scRNA-seq data (.h5ad file) and gene set file (.gs file)

Output: scDRS score file ({trait}.score.gz file) and full score file ({trait}.full_score.gz file) for each trait in the .gs file

h5ad_file=your_scrnaseq_data
cov_file=your_covariate_file
gs_file=your_gene_set_file
out_dir=your_output_folder

python compute_score.py \
    --h5ad_file ${h5ad_file}.h5ad\
    --h5ad_species mouse\
    --cov_file ${cov_file}.cov\
    --gs_file ${gs_file}.gs\
    --gs_species human\
    --flag_filter True\
    --flag_raw_count True\
    --n_ctrl 1000\
    --flag_return_ctrl_raw_score False\
    --flag_return_ctrl_norm_score True\
    --out_folder ${out_dir}
  • --h5ad_file (.h5ad file) : scRNA-seq data
  • --h5ad_species ("hsapiens"/"human"/"mmusculus"/"mouse") : species of the scRNA-seq data samples
  • --cov_file (.cov file) : covariate file (optional, .tsv file, see file format)
  • --gs_file (.gs file) : gene set file (see file format)
  • --gs_species ("hsapiens"/"human"/"mmusculus"/"mouse") : species for genes in the gene set file
  • --flag_filter ("True"/"False") : if to perform minimum filtering of cells and genes
  • --flag_raw_count ("True"/"False") : if to perform normalization (size-factor + log1p)
  • --n_ctrl (int) : number of control gene sets (default 1,000)
  • --flag_return_ctrl_raw_score ("True"/"False") : if to return raw control scores
  • --flag_return_ctrl_norm_score ("True"/"False") : if to return normalized control scores
  • --out_folder : output folder. Score files will be saved as {out_folder}/{trait}.score.gz (see file format)

scDRS script for downsteam applications

Input: scRNA-seq data (.h5ad file), gene set file (.gs file), and scDRS full score files (.full_score.gz files)

Output: {trait}.scdrs_ct.{cell_type} file (same as the new {trait}.scdrs_group.{cell_type} file) for cell type-level analyses (association and heterogeneity); {trait}.scdrs_var file (same as the new {trait}.scdrs_cell_corr file) for cell variable-disease association; {trait}.scdrs_gene file for disease gene prioritization.

h5ad_file=your_scrnaseq_data
out_dir=your_output_folder
python compute_downstream.py \
    --h5ad_file ${h5ad_file}.h5ad \
    --score_file @.full_score.gz \
    --cell_type cell_type \
    --cell_variable causal_variable,non_causal_variable,covariate\
    --flag_gene True\
    --flag_filter False\
    --flag_raw_count False\ # flag_raw_count is set to `False` because the toy data is already log-normalized, set to `True` if your data is not log-normalized
    --out_folder ${out_dir}
  • --h5ad_file (.h5ad file) : scRNA-seq data
  • --score_file (.full_score.gz files) : scDRS full score files; supporting use of "@" to match strings
  • --cell_type (str) : cell type column (supporting multiple columns separated by comma); must be present in adata.obs.columns; used for cell type-disease association analyses (5% quantile as test statistic) and detecting association heterogeneity within cell type (Geary's C as test statistic)
  • --cell_variable (str) : cell-level variable columns (supporting multiple columns separated by comma); must be present in adata.obs.columns; used for cell variable-disease association analyses (Pearson's correlation as test statistic)
  • --flag_gene ("True"/"False") : if to correlate scDRS disease scores with gene expression
  • --flag_filter ("True"/"False") : if to perform minimum filtering of cells and genes
  • --flag_raw_count ("True"/"False") : if to perform normalization (size-factor + log1p)
  • --out_folder : output folder. Score files will be saved as {out_folder}/{trait}.scdrs_ct.{cell_type} for cell type-level analyses (association and heterogeneity); {out_folder}/{trait}.scdrs_var file for cell variable-disease association; {out_folder}/{trait}.scdrs_var.{trait}.scdrs_gene file for disease gene prioritization. (see file format)

scdrs's People

Contributors

kangchenghou avatar martinjzhang avatar yyoshiaki avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

scdrs's Issues

turning off fdr correction

Hi, is there any option to turn off fdr correction while plotting scdrs.util.plot_group_stats. I would like to plot on the graph the same pvalues i get with scdrs.method.downstream_group_analysis. Right now i see they are different, way less significant pvals on the plot than in df:stats, i guess probably due to fdr correction. Thanks

Cell-type disease association vs proportion of significant cells

Thanks for this tool, has been fun and insightful to use. Sorry if this was covered in the paper but I was just wondering what are some plausible reasons if there is significant cell-type disease association but the proportion of significant cells are very low. I see in the extended example there are some traits with a box for cell-type disease association but minimal significant cells.

Thanks for your time,
Dan

error metaclass conflict

Hello,
I am trying to run scdrs but am getting the following error related to a metaclass conflict. Any input would be much appreciated.
Thank you,
Jason

import scdrs

TypeError Traceback (most recent call last)
/var/folders/ld/1195vjjx1dz0tz1sdrgq6pshdm454q/T/ipykernel_70082/1366053663.py in
----> 1 import scdrs

/opt/miniconda3/lib/python3.9/site-packages/scdrs/init.py in
----> 1 from . import method, data_loader, util, pp
2
3 # expose functions so that scdrs.score_cell, scdrs.preprocess can be called
4 from .method import score_cell
5 from .pp import preprocess

/opt/miniconda3/lib/python3.9/site-packages/scdrs/method.py in
----> 1 import scanpy as sc
2 import numpy as np
3 import scipy as sp
4 import pandas as pd
5 from tqdm import tqdm

/opt/miniconda3/lib/python3.9/site-packages/scanpy/init.py in
14 from . import tools as tl
15 from . import preprocessing as pp
---> 16 from . import plotting as pl
17 from . import datasets, logging, queries, external, get, metrics
18

/opt/miniconda3/lib/python3.9/site-packages/scanpy/plotting/init.py in
----> 1 from ._anndata import (
2 scatter,
3 violin,
4 ranking,
5 clustermap,

/opt/miniconda3/lib/python3.9/site-packages/scanpy/plotting/_anndata.py in
26 from .._utils import sanitize_anndata, _doc_params, _check_use_raw
27 from .._compat import Literal
---> 28 from . import _utils
29 from ._utils import scatter_base, scatter_group, setup_axes, check_colornorm
30 from ._utils import ColorLike, _FontWeight, _FontSize

/opt/miniconda3/lib/python3.9/site-packages/scanpy/plotting/_utils.py in
33
34
---> 35 class _AxesSubplot(Axes, axes.SubplotBase, ABC):
36 """Intersection between Axes and SubplotBase: Has methods of both"""
37

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Convert Seurat object to h5ad file

Hi!

Our scRNA-seq data was processed using the Seurat package. In order to run scDRS with our scRNA-seq data, I first converted the Seurat object saved as an RDS file to an h5ad file using the following script,


seurat_counts <- seurat_obj[["RNA"]]@counts
  seurat_metadata <- [email protected]
  rownames(seurat_metadata) <- colnames(seurat_counts)

  adata <- anndata$AnnData(
    X = as.matrix(t(as.matrix(seurat_counts))),
    obs = as.data.frame(seurat_metadata)
  )
 return(adata)
}

adata <- seurat_to_anndata(df)
write_h5ad(adata, "raw.h5ad")

However, when I tried to use this h5ad file for the compute-score step, I encountered the following error,

Traceback (most recent call last):
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 740, in <module>
    fire.Fire()
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 211, in compute_score
    scdrs.preprocess(
  File "~/softwares/scDRS/scdrs/pp.py", line 200, in preprocess
    v_resid = reg_out(np.ones(n_cell), df_cov.values)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/softwares/scDRS/scdrs/pp.py", line 453, in reg_out
    mat_coef = np.linalg.solve(mat_xtx, mat_xty)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in solve
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/numpy/linalg/linalg.py", line 386, in solve
    r = gufunc(a, b, signature=signature, extobj=extobj)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._UFuncInputCastingError: Cannot cast ufunc 'solve' input 0 from dtype('O') to dtype('float64') with casting rule 'same_kind'

I'm more experienced with R than Python, so I was hoping you could help me suggest a better way to convert the Seurat object to an h5ad file. Thanks so much!

gene weights for the control genes

Hi!

Thanks a lot for making this wonderful tool. I am currently studying how exactly scDRS works. When calculating for raw control score, it appears in the formula stated in the scDRS paper w_g (stands for MAGMA gene weight). However, where does this control gene weight come from? Because from what I understand is that the input gs file is the file containing the disease gene-set with its corresponding p-value or z-score, but we never provide the gene weight for the control genes. Then how can the control genes have MAGMA gene weights as well?

Thanks

Issues with generating gs files

Hello, when I used this software to generate xx.gs files with weights, I found that the xx.gs file contained many negative values. Is this normal ?Such as: ZSCAN31:-0.16077,CMC2:-0.17637,PRH1:-0.17909,FDXR:-0.18069,PROM2:-0.18672,IFI35:-0.19512,AP001029.1:-0.19552,AC087500.2:-0.19587,LINC00642:-0.20189,TLCD3A:-0.20189,AC011468.5:-0.21043.
image

KeyError: at plot_group_stats step

Hi! Thanks again for a great package. I was running the code in the Quick Start tutorial and hit the following error when plotting the group level statistics:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'fdr_prop'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [76], in <cell line: 18>()
      6 dict_df_stats
      8 dict_celltype_display_name = {
      9     "pyramidal_CA1": "Pyramidal CA1",
     10     "oligodendrocytes": "Oligodendrocyte",
   (...)
     15     "microglia": "Microglia",
     16 }
---> 18 scdrs.util.plot_group_stats(
     19     {
     20         trait: df_stats.rename(index=dict_celltype_display_name)
     21         for trait, df_stats in dict_df_stats.items()
     22     }
     23 )

File ~/opt/anaconda3/lib/python3.9/site-packages/scdrs/util.py:441, in plot_group_stats(dict_df_stats, df_fdr_prop, df_assoc_fdr, df_hetero_fdr)
    438 trait_list = list(dict_df_stats.keys())
    439 # compile df_fdr_prop, df_assoc_fdr, df_hetero_fdr from dict_df_stats
    440 df_fdr_prop = pd.concat(
--> 441     [dict_df_stats[trait]["fdr_prop"] for trait in trait_list], axis=1
    442 ).T
    443 df_assoc_fdr = pd.concat(
    444     [dict_df_stats[trait]["assoc_pval"] for trait in trait_list], axis=1
    445 ).T
    446 df_assoc_fdr = pd.DataFrame(
    447     multipletests(df_assoc_fdr.values.flatten(), method="fdr_bh")[1].reshape(
    448         df_assoc_fdr.shape
   (...)
    451     columns=df_assoc_fdr.columns,
    452 )

File ~/opt/anaconda3/lib/python3.9/site-packages/scdrs/util.py:441, in <listcomp>(.0)
    438 trait_list = list(dict_df_stats.keys())
    439 # compile df_fdr_prop, df_assoc_fdr, df_hetero_fdr from dict_df_stats
    440 df_fdr_prop = pd.concat(
--> 441     [dict_df_stats[trait]["fdr_prop"] for trait in trait_list], axis=1
    442 ).T
    443 df_assoc_fdr = pd.concat(
    444     [dict_df_stats[trait]["assoc_pval"] for trait in trait_list], axis=1
    445 ).T
    446 df_assoc_fdr = pd.DataFrame(
    447     multipletests(df_assoc_fdr.values.flatten(), method="fdr_bh")[1].reshape(
    448         df_assoc_fdr.shape
   (...)
    451     columns=df_assoc_fdr.columns,
    452 )

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'fdr_prop'`

I am able to generate df_fdr_prop using the following code snippet:

trait_list = list(dict_df_stats.keys())
df_fdr_prop = pd.concat(
            [
                dict_df_stats[trait]["n_fdr_0.1"] / dict_df_stats[trait]["n_cell"]
                for trait in trait_list
            ],
            axis=1,
        ).T

How should I proceed? Thank you!

scdrs compute-score output

Hello and thank you for this great tool. I am running in to an issue at the scdrs compute-score step.

Each of my traits are a separate .gs file at this point. Here is an example of the first few lines of one of the trait files:
Screenshot 2023-04-06 at 10 18 06 AM

Here is the code I am running:
image

But my output files are both not as expected:
Screenshot 2023-04-06 at 10 20 41 AM

I would really appreciate any insight that you have on this issue!

Thank you!

plot_group_stats() got an unexpected keyword argument 'plot_kws'

Hi,

I am getting this error when trying to plot. Using both scdrs 1.0.1 and 1.0.2


TypeError Traceback (most recent call last)
Input In [23], in <cell line: 23>()

---> 23 fig, ax = scdrs.util.plot_group_stats(
24 dict_df_stats={
25 trait: df_stats.rename(index=dict_celltype_display_name)
26 for trait, df_stats in dict_df_stats.items()
27 },
28 plot_kws={
29 "vmax": 0.2,
30 "cb_fraction":0.12
31 }
32 )

TypeError: plot_group_stats() got an unexpected keyword argument 'plot_kws'

munge-gs input format

--zscore-file loaded: n_gene=62787, n_trait=0 (sys_time=0.0s)
Print info for the first 3 traits and first 10 genes
Traits []
PSRC1,-9.289225811 []
AL592148.3,-7.992844093 []
MIA3,-7.120338886 []
TCF21,-6.520148163 []
SLC22A1,-6.220870419 []
NOS3,-6.197266755 []
NBEAL1,-6.112940111 []
ICA1L,6.017449523 []
LIPA,5.811769649 []
AL359922.2,5.799499614 []
Warning: zscore-file values are all between 0 and 1.

Hi, when I'm using munge-gs to process a gene set csv file for a trait, it only reads all the rows, but it doesn't seem to be able to distinguish the two columns well, is there a mistake in the formatting of my csv file somewhere?

Adjusting for donor effects

Hi! Thanks very much for the great work behind scDRS. It's great to use.

I was wondering if you have thoughts on how to adjust for potential donor effects?
Is it best to add one-hot encodings for each donor, or provide one column with a different value for each donor?.

Check inputs

Including NaN, expression values >= 0 for log-transformed matrix.

See #25

Broader control gene sets

Hey,

congratulations on your paper's acceptance!

You mention in the Discussion that it might make sense when using specialized data sets to select matched control genes based on a broad cell atlas in order to increase power. Could you explain how this can be done, e.g. based on the TMS data set? (Several of my colleagues planning to work with scDRS are using specialized data sets, so I imagine it might be quite useful to more people).

Also, do you have recommendations for the choice of this broad data set? I imagine that if I use a highly specialized data set and pair it with a broad data set that does not include the general type of tissue in my specialized data set at all, I might introduce a bias. What do you think? Would merging the data sets be an option, or would that be problematic / insufficient?

MC-z and MC-z-based p-value

  1. Change the column name of "zscore" to "mc_zscore", to be consistent with "mc_pval"
  2. Additionally output one-sided p-values based on the normal distribution and mc_zscore ("MC-z"), maybe with column name of "mcz_pval".

scRNA data input

Hi, thank you so much for the very nice software.
I have a pretty simple question about scRNA data input for the software. We have scRNA data with illness and normal conditions, and we could identify the disease-related cell subpopulation using GWAS data. Is it appropriate for this software to infer these interesting subpopulations from scRNA data from normal conditions? It is well understood that using scRNA-seq data from illness condition makes distinguishing the impact of the tissue's original genetic background and the impact of the examined GWAS signals challenging. Or this software allow us to using the scRNA data from normal and disease condition together? How about the power?
Thanks a lot.
Best,
Lee

perform-downstream outputs zero significant cells

Dear Author,

Thank you for such an informative tool.
When running scDRS compute-score, the command output and results seemed normal (see umaps below)
image
image

However, when running perform-downsteam, the number of significant cells is basically 0, resulting in all blank cells on the heat map (see below).
image
image

We are trying to locate the possible causes and what can be done to improve the results. It would be great if you could please provide any suggestions!
Thanks in advance!

scDRS breaks if using all scRNA-seq genes as disease genes

Some users may use all scRNA-seq genes as the trait gene set as some sort of negative control. scDRS produced miscalibrated results in this case due to numerical issues.

  • Raw control scores were slightly but systematically smaller than the disease score (1e-17).
  • Small SD of raw control of a given cell across control gene sets (1e-18).

Loess error in compute-score after subsampling

Hi scDRS team,

I am encountering a loess error when trying to run a subsampled, lognormalized h5ad file through scdrds compute-score:

Call: scdrs compute-score
--h5ad-file ordovas_immune_downsampled.h5ad
--h5ad-species human
--cov-file None
--gs-file all.gs
--gs-species human
--ctrl-match-opt mean_var
--weight-opt vs
--adj-prop None
--flag-filter-data False
--flag-raw-count False
--n-ctrl 1000
--flag-return-ctrl-raw-score False
--flag-return-ctrl-norm-score True
--out-folder scDRS/immune

Loading data:
--h5ad-file loaded: n_cell=10727, n_gene=5000 (sys_time=0.2s)
First 3 cells: ['N51.LPB.TCAGCAACATACGCTA-0', 'N111.LPA2.CTTCTCTGTGGCAAAC-0', 'N20.LPA.TGGATGTGCACTTT-0']
First 5 genes: ['A2M', 'A2M-AS1', 'A2ML1', 'AAED1', 'AAK1']
--gs-file loaded: n_trait=6 (sys_time=0.2s)
Print info for first 3 traits:
First 3 elements for 'ukbb_CRC': ['CLTCL1', 'DAB2IP', 'NBPF1'], [4.0019, 3.5014, 3.4717]
First 3 elements for 'ukbb_IBD': ['ZNF236', 'C3orf38', 'LAMB2'], [3.17, 3.1623, 3.1358]
First 3 elements for 'alkes_UC': ['FCGR2A', 'IL23R', 'DLD'], [7.964, 7.0309, 6.897]

Preprocessing:
/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py:323: RuntimeWarning: invalid value encountered in log10
x = np.log10(df_gene["ct_mean"].values[not_const])
Traceback (most recent call last):
File "/Users/user/opt/anaconda3/bin/scdrs", line 723, in
fire.Fire()
File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/user/opt/anaconda3/bin/scdrs", line 211, in compute_score
scdrs.preprocess(adata, cov=df_cov, adj_prop=ADJ_PROP, n_mean_bin=20, n_var_bin=20, copy=False)
File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py", line 205, in preprocess
df_gene, df_cell = compute_stats(
File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py", line 325, in compute_stats
model.fit()
File "_loess.pyx", line 899, in _loess.loess.fit
ValueError: b'Extrapolation not allowed with blending'

I used the following code to generate the subsampled dataset from the original data:

crc_immune_adata = sc.read_h5ad("ordovas_immune_processed.h5ad")
target_cells = 250

crc_immune_adata_ds = [crc_immune_adata[crc_immune_adata.obs.Cluster == clust] for clust in crc_immune_adata.obs.Cluster.cat.categories]

for dat in crc_immune_adata_ds:
    if dat.n_obs > target_cells:
         sc.pp.subsample(dat, n_obs=target_cells)

crc_immune_adata_downsampled = crc_immune_adata_ds[0].concatenate(*crc_immune_adata_ds[1:])
crc_immune_adata_downsampled.write("ordovas_immune_downsampled_250.h5ad")

The subsampled dataset is attached here:
ordovas_immune_downsampled_250.h5ad.zip

Thanks for all your time and help!

Considerations for single cell splicing data

Hello. Thank you for creating and maintaining this easy to use tool.

When scoring cells, have you considered how single-cell splicing data, stored as a cell-by-intron matrix of percent-spliced/PSI values, might be input? This data is generally more sparse than gene expression, with many values represented as NaN, where no underlying gene expression in the cell can be used to calculate PSI.

As is, score_cell returns NaN values as scores for every cell, likely due to the missing values in the input.

Segmentation fault when Running scDRS

Hi,

Sorry to bug you with this. I am trying to run scDRS on our scRNA-seq dataset.

The h5ad file I'm using has 12755 cells from both normal and tumor patients. I generated this file by converting from the corresponding Seurat object. I was able to visualize this h5ad object in cellxgene and confirmed that it has all the metadata correctly.

For the .gs file I just used the one included in the example dataset magma_10kb_top1000_zscore.74_traits.rv1.gs.

Here are the commands I used for running scDRS (I think I installed the packages correctly):

python compute_score.py
--h5ad_file ./Tumor_Seqwell.h5ad
--h5ad_species human
--gs_file magma_10kb_top1000_zscore.74_traits.rv1.gs
--gs_species human
--flag_filter True
--flag_raw_count True
--n_ctrl 1000
--flag_return_ctrl_raw_score False
--flag_return_ctrl_norm_score True
--out_folder ./Tumor_Seqwell

And here is the output I'm seeing:


  • Single-cell disease relevance score (scDRS)
  • Version beta
  • Martin Jinye Zhang and Kangcheng Hou
  • HSPH / Broad Institute / UCLA
  • MIT License

Call: ./compute_score.py
--h5ad_file ./Tumor_Seqwell.h5ad
--h5ad_species human
--cov_file None
--gs_file magma_10kb_top1000_zscore.74_traits.rv1.gs
--gs_species human
--ctrl_match_opt mean_var
--weight_opt vs
--flag_filter True
--flag_raw_count True
--n_ctrl 1000
--flag_return_ctrl_raw_score False
--flag_return_ctrl_norm_score True
--out_folder ./Tumor_Seqwell

Load data:
/Users/hsong/opt/anaconda3/envs/scDRS_env/lib/python3.9/site-packages/anndata/compat/init.py:232: FutureWarning: Moving element from .uns['neighbors']['distances'] to .obsp['distances'].

This is where adjacency matrices should go now.
warn(
/Users/hsong/opt/anaconda3/envs/scDRS_env/lib/python3.9/site-packages/scanpy/preprocessing/_simple.py:352: RuntimeWarning: invalid value encountered in log1p
np.log1p(X, out=X)
--h5ad_file loaded: n_cell=2760, n_gene=1258 (sys_time=6.1s)
--gs_file loaded: n_geneset=74 (sys_time=6.3s)
./scDRS_step1.sh: line 11: 57097 Segmentation fault: 11 python compute_score.py --h5ad_file ./Tumor_Seqwell.h5ad --h5ad_species human --gs_file magma_10kb_top1000_zscore.74_traits.rv1.gs --gs_species human --flag_filter True --flag_raw_count True --n_ctrl 1000 --flag_return_ctrl_raw_score False --flag_return_ctrl_norm_score True --out_folder ./Tumor_Seqwell

The first thing I noticed besides the error message is that the number of cells does not match what I have in the dataset, nor does the number of genes. I am wondering what is the cause of that, and of course the error.

Thank you very much for your time!

Angelus

Issue with test run

Hi! Thanks for providing this exciting new analysis package. I am having some trouble with the running the test case. When I run the command line test:
python -m pytest tests/test_scdrs.py -p no:warnings

I receive the following error. Any help would be much appreciated. Thanks!

============================================================================================== FAILURES ===============================================================================================
___________________________________________________________________________________________ test_score_cell __________________________________________________________________________________________$
                                                                                                                                                                                                       
    def test_score_cell():                                                                                                                                                                             
                                                                                                                                                                                                       
        # Load toy data                                                                                                                                                                                
        DATA_PATH=scdrs.__path__[0]                                                                                                                                                                    
        H5AD_FILE=os.path.join(DATA_PATH,'data/toydata_mouse.h5ad')                                                                                                                                    
        COV_FILE=os.path.join(DATA_PATH,'data/toydata_mouse.cov')                                                                                                                                      
        GS_FILE=os.path.join(DATA_PATH,'data/toydata_mouse.gs')                                                                                                                                        
        assert os.path.exists(H5AD_FILE), "built-in data toydata_mouse.h5ad missing"                                                                                                                   
        assert os.path.exists(COV_FILE), "built-in data toydata_mouse.cov missing"                                                                                                                     
        assert os.path.exists(GS_FILE), "built-in data toydata_mouse.gs missing"                                                                                                                       
                                                                                                                                                                                                       
        # Load built-in data                                                                                                                                                                           
        adata = read_h5ad(H5AD_FILE)                                                                                                                                                                   
                                                                                                                                                                                                       
        df_cov = pd.read_csv(COV_FILE, sep='\t', index_col=0)                                                                                                                                          
        cov_list = list(df_cov.columns)                                                                                                                                                                
        adata.obs.drop([x for x in cov_list if x in adata.obs.columns], axis=1, inplace=True)                                                                                                          
        adata.obs = adata.obs.join(df_cov)                                                                                                                                                             
        adata.obs.fillna(adata.obs[cov_list].mean(), inplace=True)
        adata.var['mean'] = adata.X.mean(axis=0).T
        if sp.sparse.issparse(adata.X):
            adata.X = adata.X.toarray()
        adata.X -= adata.var['mean'].values
        adata.X = scdrs.method.reg_out(adata.X, adata.obs[cov_list].values)
>       adata.X += adata.var['mean']

tests/test_scdrs.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../anaconda3/lib/python3.7/site-packages/pandas/core/ops.py:1071: in wrapper
    index=left.index, name=res_name, dtype=None)
../../anaconda3/lib/python3.7/site-packages/pandas/core/ops.py:980: in _construct_result
    out = left._constructor(result, index=index, dtype=dtype)

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'Series' object has no attribute '_name'") raised in repr()] Series object at 0x7fa40a633890>
data = array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., na...nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
index = Index([b'Pip4k2a',    b'Chd7', b'Atp6v0c',   b'Exoc3',    b'Pex5',     b'Wrn',
        b'Zfp236',   b'Asna1',    b'Pdh...', b'Pikfyve',
         b'Tram1',    b'Ei24',    b'Smc2',   b'Cops4'],
      dtype='object', name='index', length=2500)
dtype = None, name = None, copy = False, fastpath = False
                                                                                                                                                                                                       
    def __init__(self, data=None, index=None, dtype=None, name=None,                                                                                                                                   
                 copy=False, fastpath=False):                                                                                                                                                          
                                                                                                                                                                                                       
        # we are called internally, so short-circuit                                                                                                                                                   
        if fastpath:                                                                                                                                                                                   
                                                                                                                                                                                                       
            # data is an ndarray, index is defined                                                                                                                                                     
            if not isinstance(data, SingleBlockManager):                                                                                                                                               
                data = SingleBlockManager(data, index, fastpath=True)                                                                                                                                  
            if copy:                                                                                                                                                                                   
                data = data.copy()                                                                                                                                                                     
            if index is None:                                                                                                                                                                          
                index = data.index                                                                                                                                                                     
                                                                                                                                                                                                       
        else:                                                                                                                                                                                          
                                                                                                                                                                                                       
            if index is not None:                                                                                                                                                                      
                index = _ensure_index(index)                                                                                                                                                           
                                                                                                                                                                                                       
            if data is None:                                                                                                                                                                           
                data = {}                                                                                                                                                                              
            if dtype is not None:                                                                                                                                                                      
                dtype = self._validate_dtype(dtype)                                                                                                                                                    
                                                                                                                                                                                                       
            if isinstance(data, MultiIndex):                                                                                                                                                           
                raise NotImplementedError("initializing a Series from a "                                                                                                                              
                                          "MultiIndex is not supported")                                                                                                                               
            elif isinstance(data, Index):                                                                                                                                                              
                if name is None:                                                                                                                                                                       
                    name = data.name                                                                                                                                                                   
                                                                                                                                                                                                       
                if dtype is not None: 
                   # astype copies
                    data = data.astype(dtype)
                else:
                    # need to copy to avoid aliasing issues
                    data = data._values.copy()
                copy = False
    
            elif isinstance(data, np.ndarray):
                pass
            elif isinstance(data, Series):
                if name is None:
                    name = data.name
                if index is None:
                    index = data.index
                else:
                    data = data.reindex(index, copy=copy)
                data = data._data
            elif isinstance(data, dict):
                data, index = self._init_dict(data, index, dtype)
                dtype = None
                copy = False
            elif isinstance(data, SingleBlockManager):
                if index is None:
                    index = data.index
                elif not data.index.equals(index) or copy:
                    # GH#19275 SingleBlockManager input should only be called
                    # internally
                    raise AssertionError('Cannot pass both SingleBlockManager '
                                         '`data` argument and a different '
                                         '`index` argument.  `copy` must '
                                         'be False.')
    
            elif is_extension_array_dtype(data) and dtype is not None:
                if not data.dtype.is_dtype(dtype):
                    raise ValueError("Cannot specify a dtype '{}' with an "
                                   "extension array of a different "
                                     "dtype ('{}').".format(dtype,
                                                            data.dtype))
    
            elif (isinstance(data, types.GeneratorType) or
                  (compat.PY3 and isinstance(data, map))):
                data = list(data)
            elif isinstance(data, (set, frozenset)):
                raise TypeError("{0!r} type is unordered"
                                "".format(data.__class__.__name__))
            else:
    
                # handle sparse passed here (and force conversion)
                if isinstance(data, ABCSparseArray):
                    data = data.to_dense()
    
            if index is None:
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
            elif is_list_like(data):
    
                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(index) != len(data):
                        raise ValueError(
                            'Length of passed values is {val}, '
                            'index implies {ind}'
>                           .format(val=len(data), ind=len(index)))
E                           ValueError: Length of passed values is 30, index implies 2500

../../anaconda3/lib/python3.7/site-packages/pandas/core/series.py:262: ValueError

Dependency on background cells

Hey,

Playing around with reproducing your results, I noticed that results for a given set of cells depend heavily on what other cells are included in the analysis. For example in the TMS dataset, the medium spiny neurons came out highly significant for several traits when analysed in the context of the entire dataset (as you show), but insignificant when analysed on their own. I was surprised by this because, to my understanding, the background cells are only used to normalize the score via their control scores,which don't have any biological interpretation.

Is this the expected behavior? If so, do you have any recommendations regarding the number and diversity of cells to include in an analysis, above and beyond the cell population of interest?

Error when adata.obs_names can be interpreted as integers

I recently ran scDRS on a new data set and received this error:

Performing scDRS group-analysis
`connectivities` not found in `adata.obsp`; run `sc.pp.neighbors` first
Traceback (most recent call last):
  File "/usr/local/bin/scdrs", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/usr/tools/scDRS/bin/scdrs", line 740, in <module>
    fire.Fire()
  File "/home/usr/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/usr/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/usr/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/usr/tools/scDRS/bin/scdrs", line 683, in perform_downstream
    dict_df_res = scdrs.method.downstream_group_analysis(
  File "/home/usr/tools/scDRS/scdrs/method.py", line 775, in downstream_group_analysis
    {"fdr": multipletests(df_reg["pval"].values, method="fdr_bh")[1]},
  File "/home/usr/.local/lib/python3.10/site-packages/statsmodels/stats/multitest.py", line 147, in multipletests
    alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
ZeroDivisionError: float division by zero

I was puzzled by this as the same scDRS version ran fine on other data sets and I could see no issues with this data set.
After some digging, the issue seemed to be that the cell names in adata.obs_names were simply '0', '1', '2', etc., which caused the import into df_fullscore.index to interpret them as integers, causing mismatches during the cell alignment steps.

Error in scdrs perform-downstream --group-analysis

Hi Matin,

Thank you for you amazing tools. compute-score works good for me. But when I run perform-downstream --group-analysis, it will first use the whole cores of server, and then appear "Segmentation fault (core dumped)". Do you have any idea about this? My h5ad datset has ~60000 cells.

Thanks,
Jiawei

Problem with munging

Hi, Thank you for the great program.
I'm trying to run scDRS on one trait, but I seem to get no gene sets after munging with scdrs munge-gs

Call: scdrs munge-gs \
--zscore-file SLE.genes.out.zstat.tsv \
--weight zscore \
--n-min 10 \
--n-max 1000 \
--out-file munge_top1000Z.gs

--zscore-file loaded: n_gene=19025, n_trait=1 (sys_time=0.0s)
Print info for the first 3 traits and first 10 genes
Traits               ['SLE']
SAMD11               [1.037]
NOC2L                [1.1508]
KLHL17               [1.5457]
PLEKHN1              [2.8403]
PERM1                [3.0402]
HES4                 [2.3732]
ISG15                [2.2022]
AGRN                 [2.7473]
RNF223               [2.6844]
C1orf159             [3.9652]
--zscore-file have values above 1 or below 0. Seems fine.

Finish munging 1 gene sets (sys_time=0.0s)

The resulting file looks like the following

Screenshot 2022-10-03 at 19 52 16

Would you know what the problem is? Thanks

custom gene set?

Hi!

I'm trying to figure out how to generate a custom geneset file using MAGMA that could plug into scDRS. I found the following text within the manuscript

We use MAGMA20 502 to compute gene-level association p-values from disease
503 GWAS summary statistics (Box 1, step 1). We use a reference panel based on individuals of European ancestry in the 1000 Genomes Project97 504 . We use a 10-kb window around the gene body to map SNPs to genes. We select the top 1,000 genes based 505 on MAGMA p-values as putative disease genes.

Could you provide an example of the MAGMA inputs and commands that performs this?

munge-gs type error

Hello,
I am trying to generate .gs file with 'MAGMA' output and having a trouble.

I sort 'MAGMA' output by zstat with code
cut -f 1 prostate_GCST90011808_zscore_file.tsv > gene_symbol.txt paste gene_symbol.txt prostate_GCST90011808.genes.out > test.tsv awk '{print $1, $9}' test.tsv | sort -rn -k 2 | sed 's/ /\t/g' > prostate_GCST90011808_zscorefile.tsv
which gives me output of
image

With this 'prostate_GCST90011808_zscorefile.tsv',
munge-gs always give me the error
image

What would be the problem,,,,?

I have added my zscorefile converted to 'txt' .
Thank you!

prostate_GCST90011808_zscorefile.txt

Issues Plotting UMAP from quick start tutorial

Dear Dr. Zhang,
I worked through the tutorial and it worked well on the test data. However, when I try to plot a UMAP of my own data with the 'tabula-muris-senis-facs-official-raw-obj.h5ad' anndata, I am unable to get the UMAP enrichment plots to work. My thought is that

I get this error:
KeyError: "Could not find 'umap' or 'X_umap' in .obsm"

When I run this code:
sc.set_figure_params(figsize=[2.5, 2.5], dpi=150)
sc.pl.umap(
adata,
color="tissue",
ncols=1,
color_map="RdBu_r",
vmin=-5,
vmax=5,
)

My thought is that there is not UMAP data to plot, and perhaps the UMAP data was included in the 'expr.h5ad' file, but I haven't been able to figure out a solution. Could you please provide details about how I can get this to work for a new dataset?

Thank you in advance!

  • Naomi

scdrs CLI not working?

Sorry if this is a silly question. I followed the instructions and installed scdrs v1.0.2:

git clone https://github.com/martinjzhang/scDRS.git
cd scDRS
git checkout -b v102 v1.0.2
pip install -e .

But when I go to the folder with the binary executable and type ./scdrs, it gives me this error:

  File "./scdrs", line 26
    h5ad_file: str,
             ^
SyntaxError: invalid syntax

I feel like I'm missing something obvious here, but I can't figure out what.

Problems with input data

Hi,

I have two questions that I would like to ask for your guidance.

  1. The examples I saw in the article were all scRNA-seq datas from multiple tissues associated with disease. Because I only want to focus on one disease, and I only have scRNA-seq data from one tissue, I wanted to ask if I could use just one scRNA-seq data from one tissue.
  2. Is there a more detailed tutorial on how to get from public scRNA-seq data to .h5ad file required for scDRS?

Any advice would be greatly appreciated!

GWAS traits and a control traits

Hello! I have a problem and I hope to get your help. I was going to incorporate multiple GWAS traits and a control (height) for the study but couldn't figure out how to produce a gs-file. One trait and height are listed in the tutorial. If you add traits, do you need to list multiple traits and height at the same time, and take the intersection genes as the column? However, there is a problem with this. As the number of included traits increases, the number of shared genes decreases, which is not even enough to meet the analysis requirements. I wonder if there is any mistake in my understanding. I hope to get your reply.

scDRS CLI scdrs munge-gs

Hi,

  1. I only have one set of genes for one trait, when I create .gs file from pval_file/zscore_file using scDRS CLI scdrs munge-gs, I found that for a group of genes with both pvalue and score, the results obtained by inputting zscore-file (format1) and pval-file (format2) respectively were quite different. For format 1, only genes with a positive zscore are returned (weight = zscore) , while for format 2, the weights are recalculated and genes with smaller p-values were returned. I don't know how to choose?

format1:

scdrs munge-gs \
      --out-file gene1.gs \
      --zscore-file muco_z.tsv \
      --weight zscore \
      --n-max 1000

format2:

scdrs munge-gs \
      --out-file gene2.gs \
      --pval-file muco_p.tsv \
      --weight zscore \
      --n-max 1000
  1. If I obtain a set of genes by taking the intersection of multiple methods, can I directly make all genes weight 1, or do not add weight?

Seurat to h5ad

Hello,

Thank you for developing such a wonderful tool! I'm trying to use an existing scRNA dataset and converting from Seurat to h5ad. The dataset has raw data and normalized data through sctransform. When I convert the object to h5ad and run compute_score.py, I get an error:

python3 compute_score.py \
    --h5ad_file ${h5ad_file}\
    --h5ad_species human\
    --gs_file ${gs_file}\
    --gs_species human\
    --flag_filter True\
    --flag_raw_count True\
    --n_ctrl 1000\
    --flag_return_ctrl_raw_score False\
    --flag_return_ctrl_norm_score True\
    --out_folder ${out_dir}
 RuntimeWarning: invalid value encountered in log1p np.log1p(X, out=X)

I think this is because sctransform gives positive and negative values. My question is, should I just strip the Seurat object down to the raw RNA counts and make h5ad files from that? Is there a way to preserve sctransform normalized data in Seurat for downstream use? And lastly, should relevant information like cell type identity, umap embeddings, etc go into the covariates file separately?

Thanks for your help!

Data sparsity is a critical problem for the inference stability (ex. 10x data)

Hi, Thank you for the wonderful tool.
I noticed that the detection power for sparse data such as 10x is very low.
Imputation improved the stability, but I want to avoid using imputation if possible because it can distort the data.

I'm attaching an example using the pbmc3k dataset.
Regarding attaching examples, autoimmune diseases are not still significant...
Is there any way to deal with these?

Disease scores for raw expression value notebook

image

Disease scores for imputed expression value notebook

image

codes for the results

concerning beta or OR when choosing putative disease genes

Hello,
while I was customizing my putative disease genes,
I thought we might need to concern about 'beta value' or 'odds ratio'.

As your manuscript suggested, putative disease genes are expected to have higher expression levels in diseased-cell population. However, if we see the issue 2 (#2), we only use P-value to get 1000 MAGMA genes.
If low P-value indicates positive Beta value, it would be fine to assume 1000 MAGMA genes are highly expressed in diseased ones. However, I assume P-value and Beta value are two independent values, which means 'low p-val = positive beta' is not the case.
odds ratio > 1 can be used for above.

Therefore, I think it would be better to use only SNPs with positive beta value or OR > 1 to get 1000 MAGMA genes.

It would so nice to hear your thoughts.
Thank you!

"skipped due to small size" error

Hello. I am working on computing scDRS scores for several different traits, however I keep getting the error that the trait is being skipped due to small size for all the different traits that I try:
Screenshot 2023-03-20 at 10 42 30 AM

Here is the head of the file I am testing the code on:
Screenshot 2023-03-20 at 10 43 14 AM

Thank you!

Cleaning/prepping GWAS statistics

Hello, I couldn't find it in the paper. For these GWAS traits, do you do any pruning prior to magma? (i.e. exclude HLA region, limit to Hapmap3 SNPs)

Thanks

Ordinal covariates

For categorical covariates (i.e. batch), should you numerically encode them in the covariates file? Will they be handled as is if they are character strings?

Dependency on Background Cells (followup)

Hey again,

In the 7th discussion point of the manuscript you mention the possibility of choosing the control gene sets based on a different data set than the one you're analyzing. Is this implemented as an option in the software yet?

compute-score can not complete

Hi, thanks for developing such a helpful tool, but I have had some questions recently. When I run the compute-score process, the code can not finish. could you be so pleased to help me with the problem? Here are the code.


  • Single-cell disease relevance score (scDRS)
  • Version 1.0.2
  • Martin Jinye Zhang and Kangcheng Hou
  • HSPH / Broad Institute / UCLA
  • MIT License

Call: scdrs compute-score
--h5ad-file /data4/scDRS/data/cere/expr.h5ad
--h5ad-species human
--cov-file /data4/scDRS/data/cere/cov.tsv
--gs-file /data4/scDRS/data/cere/processed_geneset.gs
--gs-species human
--ctrl-match-opt mean_var
--weight-opt vs
--adj-prop None
--flag-filter-data True
--flag-raw-count True
--n-ctrl 1000
--flag-return-ctrl-raw-score False
--flag-return-ctrl-norm-score True
--out-folder /data4/scDRS/data/cere/out
Loading data:
--h5ad-file loaded: n_cell=62247, n_gene=23202 (sys_time=7.0s)
First 3 cells: ['E083_AAACCCAAGGGCTGAT-1', 'E083_AAACCCACAGGCAATG-1', 'E083_AAACCCACAGTATACC-1']
First 5 genes: ['AL627309.1', 'AL627309.5', 'LINC01409', 'FAM87B', 'LINC01128']
--cov-file loaded: covariates=['const', 'n_genes', 'timepoint'] (sys_time=7.0s)
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [3861, 4883, 5453, 2459, 5002]
First 5 values for 'timepoint': ['E083', 'E083', 'E083', 'E083', 'E083']
--gs-file loaded: n_trait=3 (sys_time=7.0s)
Print info for first 3 traits:
First 3 elements for 'SCZ': ['NRGN', 'DPYD', 'RBFOX1'], [7.6558, 7.6519, 7.3247]
First 3 elements for 'CEREV': ['RNF11', 'CDKN2C', 'TRRAP'], [6.4221, 6.1533, 6.1347]
First 3 elements for 'Height': ['WWOX', 'BNC2', 'GMDS'], [10.0, 10.0, 10.0]

Preprocessing:
scdrs.pp.category2dummy: Detected categorical columns: timepoint. Added dummy columns: timepoint_E093,timepoint_E101,timepoint_E102,timepoint_E108,timepoint_E117. Dropped columns: timepoint.

Computing scDRS score:
Trait=SCZ, n_gene=898: 165/62247 FDR<0.1 cells, 469/62247 FDR<0.2 cells (sys_time=839.1s)
Trait=CEREV, n_gene=819: 0/62247 FDR<0.1 cells, 0/62247 FDR<0.2 cells (sys_time=1529.8s)


And the computer keeps running even 2 days after. Could you please help with the problem? looking forward to your relay, thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.