msingerlab / cometsc

COMET Single-Cell Marker Detection tool

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

cometsc's Introduction

COMET: Identifying candidate marker panels from single-cell transcriptomic data.

About us

Actively maintained by Osmaan Shahid ([email protected]).

Developed by Conor Delaney ([email protected]).

Prototyped by Aaron Yao-Smith ([email protected]).

Uses the XL-mHG test, created by Florian Wagner.

cometsc's Issues

COMETSC running but not giving output

I'm excited to use this tool, but it's been a struggle to get it to work for me. I think the issue is my input files because I can run the example dataset.

I am working with a Seurat object. I've exported the markers, UMAP dimensions and cluster calls as tab-separated text. I call Comet from the command line and it looks like it's running (top confirms that it is). It runs for about 3 hours. An output folder is created, but it is empty. It doesn't throw any specific errors, but it does show the following at run time:

Creating discrete expression matrix...
Insufficient floating point precision for calculating or reporting the exact XL-mHG test statistic; the true value is too small. Using "0" instead.(The XL-mHG p-value will also be reported as "0".)
Insufficient floating point precision for calculating or reporting the exact XL-mHG test statistic; the true value is too small. Using "0" instead.(The XL-mHG p-value will also be reported as "0".)
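
The warning above says the exact value is too small to represent, so it is reported as 0. A minimal Python sketch of that kind of underflow follows. It is illustrative only: it uses a plain hypergeometric probability rather than the XL-mHG implementation, and the cell counts are made up.

```python
import math

def log_hypergeom_pmf(k, N, K, n):
    """Log of P(X = k) for a hypergeometric draw: population N, K successes, n draws."""
    def log_comb(a, b):
        # log of the binomial coefficient C(a, b), computed via log-gamma
        return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return log_comb(K, k) + log_comb(N - K, n - k) - log_comb(N, n)

# Hypothetical numbers: 20,000 cells, 2,000 of them in the cluster,
# and the 2,000 highest-expressing cells are all in the cluster.
log_p = log_hypergeom_pmf(k=2000, N=20000, K=2000, n=2000)

# The log is a perfectly ordinary finite number...
print(log_p)

# ...but converting it back to a probability underflows to 0.0,
# because doubles cannot represent positive values below about 5e-324.
p = math.exp(log_p)
print(p)
```

A p-value reported as 0 in this situation means "smaller than a double can hold", not that the computation failed.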

I am not sure what the problem is. Is there a stderr or log file to see what's going on?

Also, relatedly, the docs would benefit greatly from a tutorial showing how to get the input files out from a Seurat object, since that is such a common procedure.

Here is my code to get the input files out from Seurat and to the command line.

# matrix
matrix_cometsc <- GetAssayData(so) # so is Seurat Object

write.table(as.matrix(matrix_cometsc), file=here("data", "COMETSC", "markers.txt"), row.names=TRUE, col.names=TRUE, sep = "\t", quote = FALSE)

#UMAP embeddings
umap_cometsc <- Embeddings(so, reduction = "umap")
write.table(umap_cometsc, file=here("data", "COMETSC", "vis.txt"), row.names=TRUE, col.names=FALSE, sep = "\t", quote = FALSE)

#cluster IDs
cluster_cometsc <- noquote(as.matrix(Idents(so)))
write.table(cluster_cometsc, file=here("data", "COMETSC", "cluster.txt"), row.names=TRUE, col.names=FALSE, sep = "\t", quote = FALSE)

Part of the issue is with the marker (matrix) file: the header row needs a leading tab above the row names. I had to add it manually like this:

sed '1s/.*/\t&/' markers.txt > markers2.txt
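
For readers who prefer to do this step outside the shell, here is a rough Python equivalent of the sed command above. The helper name `add_leading_tab` is made up for this sketch, and unlike the sed one-liner it leaves the header alone if it already starts with a tab.

```python
def add_leading_tab(lines):
    """Return the lines with a tab prepended to the first (header) line only."""
    lines = list(lines)
    if lines and not lines[0].startswith("\t"):
        lines[0] = "\t" + lines[0]
    return lines

# Tiny in-memory example instead of real files; the cell and gene names are made up.
original = ["cell_A\tcell_B\n", "GENE1\t0\t1.5\n"]
fixed = add_leading_tab(original)

print(repr(fixed[0]))
print(repr(fixed[1]))
```

To apply it to a real file, read the lines of markers.txt, pass them through the helper, and write the result to a new file such as markers2.txt.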

Also, my command to Comet is the following:

#! /bin/bash
source ~/comet/bin/activate
Comet markers2.txt vis.txt cluster.txt -C 16 -K 4 -Count true output/

For reference, here is a sample of markers2.txt, with tabs shown as ^I:

^ID1_TTCAGGATCAAGCCAT^ID1_GTGGAGATCTGCTTAT^ID1_GCACGGTCACTCAGAT^ID1_TATACCTGTCTTACTT
MIR1302-2HG^I0^I0.0766241526725224^I0^I0
FAM138A^I0^I0^I0^I0
OR4F5^I0^I0^I0^I0
AL627309.1^I0.103146952196364^I0.0766241526725224^I0.0823802232731239^I0.0918193591402592
AL627309.3^I0^I0^I0^I0
AL627309.2^I0^I0^I0^I0
AL627309.4^I0^I0^I0^I0
AL732372.1^I0^I0^I0^I0
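
Since the suspicion in this issue is the input files, a quick consistency check before a multi-hour run may help. The sketch below assumes the layout described in this issue (cells as columns in the markers file with a leading empty header cell, and the cell name in the first column of the vis and cluster files). The function names are hypothetical and this is not part of COMET.

```python
import csv
import io

def read_tsv(text):
    """Parse tab-separated text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]

def check_inputs(markers_text, vis_text, cluster_text):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    header = read_tsv(markers_text)[0]
    if header[0] != "":
        problems.append("markers header has no leading empty cell")
        marker_cells = set(header)
    else:
        marker_cells = set(header[1:])
    vis_cells = {row[0] for row in read_tsv(vis_text)}
    cluster_cells = {row[0] for row in read_tsv(cluster_text)}
    if vis_cells != marker_cells:
        problems.append("cell names differ between markers and vis")
    if cluster_cells != marker_cells:
        problems.append("cell names differ between markers and cluster")
    return problems

# Made-up miniature inputs.
markers = "\tc1\tc2\nG1\t0\t1\n"
vis = "c1\t0.1\t0.2\nc2\t0.3\t0.4\n"
cluster = "c1\t1\nc2\t2\n"

ok = check_inputs(markers, vis, cluster)
bad = check_inputs("c1\tc2\nG1\t0\t1\n", vis, "c1\t1\n")
print(ok)
print(bad)
```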

Confusion about ranks

Hi there,

Great tool! Thank you for making this tool.

I have a basic question, and hope to get some clarification through this forum. I was going through the example output you have described here:

https://hgmd.readthedocs.io/en/latest/Output.html

In the TSNE plots on that page, you show Cd74 and Fcer1g_c. The rankings shown for them are very low, yet you say that they are among the top-ranked ones. How is that? What am I missing?
Do the ranks shown represent single-gene rankings? Is that why they are low? In other words, although the single-gene rankings are low, these genes are good as pairs, and that is why they are still shown as examples?

Would very much appreciate some explanation so that I can clearly understand what these rankings mean. Both in the context of singletons and pairs.

Thank you!

Error when using sample data

I am using COMETSC to process my single-cell data on Windows. When I run the examples you provide, either in a normal environment or in a virtual environment, I always get an error. I have been troubled by this issue for several days and hope to get your help.

XLMHG error
('get_xlmhg_stat() takes from 3 to 4 positional arguments but 6 were given', 'occurred at index AAMP')
q-val error
local variable 'xlmhg' referenced before assignment
error in sliding values
local variable 'xlmhg' referenced before assignment
Creating discrete expression matrix...
discrete matrix construction failed
local variable 'cutoff_value' referenced before assignment

Input matrix - raw or normalized?

Hello there! Thanks for your tool, and I'm excited to try it out!

I took a look at your preprint as well as the manual, but I'm still confused as to whether the raw or normalized data is better as the input matrix. Both are accepted, I know. But my question is, is any one input type more preferred (normalized, log-normalized, or raw)?

I have 10X data, processed by Seurat. So, can I give as input to COMET the log-normalized data as pre-processed by Seurat v2?

Any advice would be very helpful!

Thanks!

Unable to install COMETSC

Dear,
I failed to install COMETSC locally on Ubuntu 20.04 with Python 3.6.8. Running
pip3 install COMETSC
gives this error:

ERROR: Cannot install cometsc==0.1.10, cometsc==0.1.11, cometsc==0.1.12, cometsc==0.1.13, cometsc==0.1.5, cometsc==0.1.6, cometsc==0.1.7 and cometsc==0.1.9 because these package versions have conflicting dependencies.

The conflict is caused by:
    cometsc 0.1.13 depends on scikit-learn==0.21.0
    cometsc 0.1.12 depends on scikit-learn==0.21.0
    cometsc 0.1.11 depends on scikit-learn==0.21.0
    cometsc 0.1.10 depends on scikit-learn==0.21.0
    cometsc 0.1.9 depends on scikit-learn==0.21.0
    cometsc 0.1.7 depends on scikit-learn==0.21.0
    cometsc 0.1.6 depends on scikit-learn==0.21.0
    cometsc 0.1.5 depends on scikit-learn==0.21.0

On scikit-learn github, we can read this:
"Also we removed the 0.21.0 tarball from pypi because it lacked a metadata that made it explicit that it was not compatible with python 2."

So how do you recommend installing COMETSC?

Best,
Marc
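
The error above follows from exact pins: a requirement written with == can only be satisfied by that one version, so if it is not available to the installer, every cometsc release that pins it is rejected. A small sketch of that logic is shown below. The list of available versions is purely illustrative and is not a real query of PyPI.

```python
def parse_pin(requirement):
    """Split an exact pin such as 'name==1.2.3' into (name, version)."""
    name, _, version = requirement.partition("==")
    return name.strip(), version.strip()

def pin_is_installable(requirement, available_versions):
    """An exact pin is installable only if that very version is available."""
    _, version = parse_pin(requirement)
    return version in available_versions

# Illustrative index contents in which 0.21.0 is missing (an assumption for this sketch).
available = ["0.20.4", "0.21.1", "0.21.2", "0.21.3"]

print(pin_is_installable("scikit-learn==0.21.0", available))
print(pin_is_installable("scikit-learn==0.21.3", available))
```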

Error when running example data: "XLMHG error"

Hi,

Thank you for publishing your new algorithm. I am really excited about it! I think it is a common obstacle to find the right markers on the protein level after doing single cell RNA seq.

Could you help me troubleshoot the error message I get when running the test dataset you provided?

The first error is the missing mhg_cython C extension, which does not seem to be critical. The XLMHG error appears to be a problem with the markers file. I get the same error messages when running my own data.

(comet_env) COMPUTER1:example_ins USER1$ Comet tabmarker.txt tabvis.txt tabcluster.txt output/
Warning (xlmhg): Failed to import "mhg_cython" C extension.
Warning (xlmhg): Failed to import the "mhg_cython" C extension.Falling back to the pure Python implementation, which is very slow.
Started on 2019-06-03T22:14:39.260567
Reading data...
Generating complement data...
[1]
########
# Processing cluster 1...
########
2 gene combinations
Running t test on singletons...
Calculating fold change
Running XL-mHG on singletons...
X = 0
L = 10
Cluster size 5
XLMHG error
('get_xlmhg_stat() takes from 3 to 4 positional arguments but 6 were given', 'occurred at index AAMP')
q-val error
local variable 'xlmhg' referenced before assignment
error in sliding values
local variable 'xlmhg' referenced before assignment
Creating discrete expression matrix...
discrete matrix construction failed
local variable 'cutoff_value' referenced before assignment
Process Process-1:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda3/bin/comet_env/lib/python3.7/site-packages/Comet/__main__.py", line 329, in process
    discrete_exp_full = discrete_exp.copy()
UnboundLocalError: local variable 'discrete_exp' referenced before assignment
[2]

Then the same error messages are repeated for cluster 2. The script ends without creating output files. I'm running Python 3.7.2 on macOS 10.13.6 and installed via virtualenv / git clone.
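
The chain of messages in this log (one real error followed by several "referenced before assignment" errors) is the usual signature of an exception that is caught and printed, after which the code carries on using a variable that was never assigned. The sketch below reproduces that pattern with stand-in functions. It is not COMET's actual code, and `stat`, `fragile_process` and `strict_process` are hypothetical names.

```python
def stat(values, X=None, L=None):
    """Stand-in for a statistics routine; not the real XL-mHG function."""
    return sum(values)

def fragile_process(values):
    try:
        # Called with too many arguments, as a caller written for a
        # different version of the routine might do.
        result = stat(values, 1, 2, 3, 4, 5)
    except Exception as err:
        print("stat error")
        print(err)
    # If the call failed, 'result' was never assigned, so this line
    # raises UnboundLocalError and hides the original cause.
    return result * 2

def strict_process(values):
    try:
        result = stat(values, 1, 2, 3, 4, 5)
    except TypeError as err:
        # Failing fast keeps the first, informative error visible.
        raise RuntimeError("stat() was called with an unexpected signature") from err
    return result * 2

try:
    fragile_process([1, 2, 3])
except UnboundLocalError as err:
    print("follow-on error:", err)
```

Read this way, only the first message (the argument-count mismatch) needs explaining; the later ones are consequences of it.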

Thank you!

unable to install because: No matching distribution found for matplotlib==3.0.0 (from COMETSC)

Dear Sir,
I followed your tutorial for installation of COMETSC in a docker image:
apt-get update
apt-get install sudo
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
sudo apt install python-pip
sudo apt-get install python3-matplotlib
pip install COMETSC

The installation stops because it cannot find matplotlib==3.0.0:
Collecting kiwisolver==1.0.1 (from COMETSC)
Using cached https://files.pythonhosted.org/packages/3a/62/a8c9bef3059d55ab38e41fe9cba4fad773bfc04e47290bab84db1c18262e/kiwisolver-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting matplotlib==3.0.0 (from COMETSC)
Could not find a version that satisfies the requirement matplotlib==3.0.0 (from COMETSC) (from versions: 0.86, 0.86.1, 0.86.2, 0.91.0, 0.91.1, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1rc1, 1.4.1, 1.4.2, 1.4.3, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1, 2.0.2, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.2.0rc1, 2.2.0, 2.2.2, 2.2.3, 2.2.4)
No matching distribution found for matplotlib==3.0.0 (from COMETSC)
Could you please help me fix this issue?
Cheers
Raf
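
One clue in the log above is the wheel that pip downloaded for kiwisolver: its file name carries the tag cp27, which suggests the pip being run belonged to Python 2.7 rather than the python3.7 installed earlier (matplotlib 3.x supports Python 3 only, which would explain why the candidate list stops at 2.2.4). The sketch below extracts the Python tag from a wheel file name, following the wheel naming convention. The helper name is made up for this example.

```python
import sys

def wheel_python_tag(wheel_filename):
    """Return the Python tag of a wheel file name.

    Wheel names look like name-version[-build]-python-abi-platform.whl,
    so the Python tag is the third field from the end.
    """
    stem = wheel_filename
    if stem.endswith(".whl"):
        stem = stem[: -len(".whl")]
    return stem.split("-")[-3]

# File name copied from the log in this issue.
name = "kiwisolver-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl"
tag = wheel_python_tag(name)
print(tag)

# For comparison, the interpreter actually running this script:
print("running under Python %d.%d" % sys.version_info[:2])
```

If the two disagree, invoking pip through the intended interpreter (for example `python3.7 -m pip install ...`) is the usual way to make sure the right one is used.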

cell # limit?

Hello, I'm trying to run COMET on a large dataset (~500K cells), but I'm running into some errors that seem like they may be due to my dataset being too large? Is this the case or is there something else going on? Here's my stdout file:

Started on 2021-12-20T16:32:55.311333
Reading data...
Generating complement data...
########
# Processing cluster 1...
########
2 gene combinations
Running t test on singletons...
Calculating fold change
Running XL-mHG on singletons...
X = 75
L = 1000
Cluster size 500
XLMHG error
('List is too long. The maximum length supported is  65536.', 'occurred at index TNS1')
q-val error
local variable 'xlmhg' referenced before assignment
error in sliding values
local variable 'xlmhg' referenced before assignment
Creating discrete expression matrix...
discrete matrix construction failed
local variable 'cutoff_value' referenced before assignment

And here's my stderr file:

Traceback (most recent call last):
  File "/n/holylfs03/LABS/hoekstra_lab/Users/kelsey/comet/bin/Comet", line 8, in <module>
    sys.exit(main())
  File "/n/holylfs03/LABS/hoekstra_lab/Users/kelsey/comet/lib/python3.6/site-packages/Comet/__main__.py", line 866, in main
    process(cls,X,L,plot_pages,cls_ser,tsne,marker_exp,gene_file,csv_path,vis_path,pickle_path,cluster_number,K,Abbrev,cluster_overall,Trim,count_data,skipvis)
  File "/n/holylfs03/LABS/hoekstra_lab/Users/kelsey/comet/lib/python3.6/site-packages/Comet/__main__.py", line 361, in process
    discrete_exp_full = discrete_exp.copy()
UnboundLocalError: local variable 'discrete_exp' referenced before assignment
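
The first error in the stdout above reports a maximum supported list length of 65536. Assuming that limit applies to the number of cells being ranked, one possible workaround (a suggestion of this sketch, not something confirmed by the maintainers) is to subsample cells per cluster before running. The function name and the synthetic data below are hypothetical.

```python
import random
from collections import defaultdict

MAX_LIST_LENGTH = 65536  # value taken from the error message above

def subsample_cells(cluster_of, limit=MAX_LIST_LENGTH, seed=0):
    """Pick at most `limit` cells, keeping each cluster's share roughly proportional.

    `cluster_of` maps cell name -> cluster label.
    """
    cells = list(cluster_of)
    if len(cells) <= limit:
        return cells
    by_cluster = defaultdict(list)
    for cell, cluster in cluster_of.items():
        by_cluster[cluster].append(cell)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    fraction = limit / len(cells)
    kept = []
    for members in by_cluster.values():
        # Keep at least one cell per cluster so no cluster disappears.
        n_keep = max(1, int(len(members) * fraction))
        kept.extend(rng.sample(members, n_keep))
    # Guard against the "at least one" rule pushing the total over the limit.
    return kept[:limit]

# Synthetic example: 200,000 cells spread over 4 clusters.
cluster_of = {f"cell_{i}": i % 4 for i in range(200_000)}
kept = subsample_cells(cluster_of)
print(len(kept))
```

The markers, vis and cluster files would then all need to be restricted to the kept cells so that their cell names still match.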

Explanation of output

Hello there!

I am happy to say that this tool has been very useful to extract markers for our 10X data!

However, it would help us interpret the results better if we could get more detailed explanations of each column of a typical output table.

I would like to understand the meaning and interpretation of these columns:

HG_pval
TP
TN
q_value
HG_rank
CCS
MGD
CCS_rank
rank
Plot

Would be immensely helpful!

Many thanks :)

Quads option output not as expected?

First of all, thank you for this package; it is amazing what you can do with it.

I have already installed the package in a virtual environment, and it works fine with default parameters. However, I wanted to add parameters: a specific gene list with one gene per line ("-g"), and "-K 4" to go beyond gene pairs, since in my case I want to analyse 4-gene panels.

Command used:

Comet -C 20 -K 4 -g ./gene_list.txt ./tabmaker_RNA.txt ./tabvis.txt ./tabcluster.txt output_within_ManualSelectedCells_4PairGenes/

This runs through without any errors. However, when checking the CSV files for each cluster, I found that cluster_X_quads.csv contains repeated gene combinations (table attached). To my understanding this shouldn't be possible, since your paper states "Duplicates and gene-repeating combinations are once again filtered out, and the resulting entries contain unique 4 gene marker panels." in the Materials and Methods section "Computing and ranking 4-gene marker panels".

Is there something I am missing in the interpretation or in the parameters?
Thank you in advance.
cluster_4_quads_sample.csv
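
To make the report above easier to verify, here is a small sketch that flags two kinds of repetition in a list of 4-gene panels: the same set of genes appearing more than once in a different order, and a single panel that contains the same gene twice. How the genes are stored in the quads CSV is not shown in this issue, so the sketch takes panels as plain tuples. The gene names are made up.

```python
from collections import Counter

def repeated_panels(panels):
    """Return gene sets that occur more than once when gene order is ignored."""
    counts = Counter(frozenset(panel) for panel in panels)
    return [sorted(genes) for genes, n in counts.items() if n > 1]

def panels_with_repeated_gene(panels):
    """Return panels in which the same gene appears more than once."""
    return [panel for panel in panels if len(set(panel)) < len(panel)]

# Made-up example panels.
panels = [
    ("GeneA", "GeneB", "GeneC", "GeneD"),
    ("GeneD", "GeneC", "GeneB", "GeneA"),  # same genes, different order
    ("GeneA", "GeneB", "GeneC", "GeneE"),
    ("GeneA", "GeneA", "GeneB", "GeneC"),  # one gene used twice
]

dupes = repeated_panels(panels)
self_repeats = panels_with_repeated_gene(panels)
print(dupes)
print(self_repeats)
```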

Web version - 3 and 4 marker panels

Hello,
For the web version, is there an option to obtain more than "pairs" of markers? That is, the 3 and 4-marker panels?
I know this is possible in the installed version, but I would like to use this on the web version, if possible.
Thanks!

COMET with Integrated Data? Normalization Method suggestions?

Hello,

Thank you for developing a great package filling an important niche within the world of scRNAseq. I'd like to use COMET on an integrated dataset (implemented with Seurat) to identify markers for cell types, and I have two questions.

Do you have any recommendations for the normalization method to use for input to COMET? I am using 10X data normalized with Seurat's SCTransform, but I wasn't sure whether SCTransformed data would be compatible with COMET or whether I should instead use log-normalized data or another method.
Do you have any experience with COMET on integrated/batch-corrected data, or know of any users who have tried it, and do you know whether COMET is compatible with this kind of data? Would you recommend instead running COMET on single samples to circumvent this issue?

Thank you so much for your help!

Best,
Phil Cohen

installation error

Hi
I tried to install it in a virtual env using the latest version of Python on my Mac, but I get a dependency error. Here is the error:

code: pip install COMETSC

ERROR: Cannot install cometsc==0.1.10, cometsc==0.1.11, cometsc==0.1.12, cometsc==0.1.13, cometsc==0.1.5, cometsc==0.1.6, cometsc==0.1.7 and cometsc==0.1.9 because these package versions have conflicting dependencies.

The conflict is caused by:
cometsc 0.1.13 depends on scikit-learn==0.21.0
cometsc 0.1.12 depends on scikit-learn==0.21.0
cometsc 0.1.11 depends on scikit-learn==0.21.0
cometsc 0.1.10 depends on scikit-learn==0.21.0
cometsc 0.1.9 depends on scikit-learn==0.21.0
cometsc 0.1.7 depends on scikit-learn==0.21.0
cometsc 0.1.6 depends on scikit-learn==0.21.0
cometsc 0.1.5 depends on scikit-learn==0.21.0

To fix this you could try to:

loosen the range of package versions you've specified
remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

TypeError: ufunc 'isnan' not supported

Hi,

I get the following error when running comet (Running the example data files works fine):

(Virtualenv_3_6) bqdyn253_025:data stirier$ Comet tabmarker.txt tabvis.txt tabcluster.txt output/
Started on 2019-10-18T18:23:55.076413
Reading data...
Traceback (most recent call last):
  File "/Users/stirier/Virtualenv_3_6/bin/Comet", line 8, in <module>
    sys.exit(main())
  File "/Users/stirier/Virtualenv_3_6/lib/python3.6/site-packages/Comet/__main__.py", line 795, in main
    skipvis=skipvis)
  File "/Users/stirier/Virtualenv_3_6/lib/python3.6/site-packages/Comet/__main__.py", line 100, in read_data
    if np.isnan(cls_ser[0]):
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I extract the normalised count from Seurat:

input_data <- GetAssayData(object = BM1, slot = "data", assay = "SCT")
write.table(x = input_data, quote=F, sep="\t", row.names = T, col.names = NA, file = "/Volumes/ag-rippe/NGS_Stephan/HIPO2_K43R/General_Scripts/Comet/data/tabmarker.txt")

Do you have any idea what is wrong?

Thanks in advance!

Best

Stephan
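
The traceback above fails at `np.isnan(cls_ser[0])`, a call that raises this TypeError when its argument is not numeric, for example a string. One possible cause, offered here as an assumption rather than a confirmed diagnosis, is a cluster file whose labels are not numbers or that contains a header line. The sketch below lists such rows. The function name and the sample rows are hypothetical.

```python
def non_numeric_labels(rows):
    """Return the (cell, label) pairs whose label cannot be read as a number."""
    bad = []
    for cell, label in rows:
        try:
            float(label)
        except ValueError:
            bad.append((cell, label))
    return bad

# Made-up rows as they might be read from a tab-separated cluster file.
rows = [
    ("cell_1", "0"),
    ("cell_2", "1"),
    ("cell_3", "T cells"),  # a text label instead of a cluster number
]

bad = non_numeric_labels(rows)
print(bad)
```

An empty result would point away from the cluster labels and towards one of the other input files.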