su-informatics-lab / dstg

Deconvoluting Spatial Transcriptomics Data through Graph-based Artificial Intelligence

License: MIT License

Python 73.65% R 26.35%

dstg's Introduction

Deconvoluting Spatial Transcriptomics data through Graph-based convolutional networks (DSTG)

This is a TensorFlow implementation of DSTG for decomposing spatial transcriptomics data, as described in our paper (see the Cite section below).

Installation

python setup.py install

Requirements

tensorflow (>0.12)
networkx

Run the demo

Load the example data using the convert_data.R script. In the example data, we provide two synthetic spatial transcriptomics datasets generated from scRNA-seq data (GSE72056). Each synthetic dataset consists of 1,000 spots and can be found in the synthetic_data folder.

cd DSTG
Rscript convert_data.R # load example data 
python train.py # run DSTG

The predicted compositions within each spot are saved in the DSTG_Result folder.

The performance in terms of the JSD (Jensen-Shannon divergence) score is shown if you run

Rscript evaluation.R
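For readers who want to see what a JSD score measures, the following is a minimal NumPy sketch of the Jensen-Shannon divergence between a true and a predicted cell-type composition for one spot. It is not taken from evaluation.R; the function name `jsd` and the base-2 logarithm are assumptions made for illustration, and the script may use a different log base or the square-root (distance) form.

```python
import numpy as np


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two composition vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise so that each vector sums to 1, like a proportion vector.
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical compositions for two spots (rows) and three cell types (columns).
true_props = np.array([[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]])
pred_props = np.array([[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]])
per_spot = [jsd(t, p) for t, p in zip(true_props, pred_props)]
print(per_spot)
```

With a base-2 logarithm the score lies between 0 (identical compositions) and 1 (no overlap), so lower values indicate better deconvolution.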

If you want to use your own scRNA-seq data to deconvolute your spatial transcriptomics data, provide your data to the script as described below:

Run your own data

When using your own scRNA-seq data to deconvolute your spatial transcriptomics data, you have to provide

the raw scRNA-seq data matrix and label, saved in .RDS format (e.g. 'scRNAseq_data.RDS' & 'scRNAseq_label.RDS')
the raw spatial transcriptomics data matrix, saved in .RDS format (e.g. 'spatial_data.RDS')

cd DSTG
Rscript  convert_data.R  scRNAseq_data.RDS  spatial_data.RDS  scRNAseq_label.RDS
python train.py # run DSTG

Then you will get your results in the DSTG_Result folder.
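The README does not spell out the layout of the three inputs. One of the issues further down this page quotes it as genes in rows and cells (or spots) in columns, with the label being a one-column data frame whose row names are the cell names. Assuming that layout, a quick consistency check before running convert_data.R might look like the sketch below. It is written in plain Python and operates on name lists only; the function `check_inputs` and its arguments are hypothetical, and the names would first have to be exported from the RDS files.

```python
def check_inputs(sc_genes, sc_cells, st_genes, label_cells):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # The two matrices can only be compared on genes they have in common.
    shared_genes = set(sc_genes) & set(st_genes)
    if not shared_genes:
        problems.append("scRNA-seq and spatial matrices share no gene names")

    # Duplicate cell names make label lookups ambiguous.
    if len(set(sc_cells)) != len(sc_cells):
        problems.append("duplicate cell names in the scRNA-seq matrix")

    # Every cell (column of the scRNA-seq matrix) should have a label row.
    missing = [c for c in sc_cells if c not in set(label_cells)]
    if missing:
        problems.append("cells without a label: " + ", ".join(missing))

    return problems


# Hypothetical toy inputs.
print(check_inputs(["g1", "g2"], ["c1", "c2"], ["g2", "g3"], ["c1", "c2"]))
print(check_inputs(["g1", "g2"], ["c1", "c2"], ["g2", "g3"], ["c1"]))
```

Mismatched names between the count matrix and the label table are one plausible source of index errors later in the pipeline, which is why the check focuses on names rather than values.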

Cite

Please cite our paper if you use this code in your own work:

Qianqian Song, Jing Su, DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence, Briefings in Bioinformatics, 2021, bbaa414, https://doi.org/10.1093/bib/bbaa414

dstg's Issues

Clarification on the input matrices

Hello,

Quick question - can you provide more clarification on the format of the input files? Presumably "scRNAseq_data.RDS" and "spatial_data.RDS" need to be gene x cell or gene x spot count matrices but I am unclear as to what the "scRNAseq_label.RDS" needs to be. Looking at the example demo, this appears to be known proportions of cell-types in the spots, but if using my own data without a ground truth, I am unsure what this file should be.

Additionally, for the demo data, is "example_data.RDS" a list of the scRNA-seq gene count matrix and the ST gene count matrix? What are the different components of the list "example_labels.RDS"?

Thanks in advance

Datasets in paper

Hi,
DSTG is a great tool.
Could you give the link to download the datasets in .rds format for the ST data from complex tissues, including mouse cortex, hippocampus and human pancreatic tumor slices?
This will help me.

Data format of our own data

Could you explain in more detail the expected format of each input file when using our own data (scRNAseq_data.RDS, spatial_data.RDS and scRNAseq_label.RDS)?
What should they contain? What are the rows and columns of each file? Thanks!

Potential issues in data_process function in R_utils.R

Hello,

Thank you for providing a nice tool for decomposing cellular composition in spatial transcriptomic data. I was trying to utilize DSTG for my Visium spatial transcriptomics dataset and ran into an error.

The error occurred when I was running the below command in the Ubuntu terminal with R version 4.0.3.

Rscript convert_data.R scRNAseq_data.RDS spatial_data.RDS scRNAseq_label.RDS

I created a count matrix of single-cell data for scRNAseq_data.RDS (dim: total number of genes * total number of cells), a count matrix for spatial data for spatial_data.RDS (dim: total number of genes * total number of spatial spots) and a cell type label for each cell barcode (dim: total number of cells * 1) for scRNAseq_label.RDS.

The error message told me that in line 110 in R_utils.R, "st_labels" is not defined.

label.list2 <- do.call("rbind", rep(list(st_labels[[1]]), round(N2/N1)+1))[1:N2]

I presumed that since "st_labels" is defined in the next line (111), fixing st_labels[[1]] to st_label[[1]] may resolve the issue. Or is there any problem with my input for the convert_data.R?

Could you kindly check the data_process function in R_utils.R?

Thanks a lot for creating the nice tool!

Issue in data_process function in R_utils.R

I was using DSTG on my own Visium data and encountered the following error in the data_process function of the R_utils.R file:

Error: 'test_spot_fun' is not an exported object from 'namespace:SPOTlight'

The error was in line 101 of the file.

Can you help me to solve the issue?
Thanks for creating the nice tool.

Number of genes in the ST dataset

Hi, can I run DSTG with only a few dozen genes in my ST dataset? I tried to modify the default number of features in R_utils.R and I could run convert_data.R, but running train.py gave me NA results.

The demo fails to run

Thanks for the great job. I had some problems running the demo. With all packages in requirements.txt installed, convert_data.R still can't generate data. The errors are as follows:

Is there an implementation of pytorch version?

what filterEdge filters?

Hi Su,

Thanks to you and your team for developing this powerful tool. But I have a question about the "filterEdge" function:

def filterEdge(edges, neighbors, mats, features, k_filter):
    nn_spots1 = neighbors[4]
    nn_spots2 = neighbors[5]
    mat1 = mats.loc[features, nn_spots1].transpose()
    mat2 = mats.loc[features, nn_spots2].transpose()
    cn_data1 = l2norm(mat1)
    cn_data2 = l2norm(mat2)
    nn = kNN(data=cn_data2.loc[nn_spots2, ],
             query=cn_data1.loc[nn_spots1, ],
             k=k_filter)
    position = [
        np.where(
            edges.loc[:, "spot2"][x] == nn[1][edges.loc[:, 'spot1'][x], ])[0]
        for x in range(edges.shape[0])
    ]
    nps = np.concatenate(position, axis=0)
    fedge = edges.iloc[nps, ]
    return (fedge)

As I understand it, this function aims to ensure that the relationships between spots are also kept when genes are restricted to the top variable genes. However, I think np.where in line 12 returns the position within the nn[1] matrix when a match exists, and this position itself doesn't correspond to a position in edges, which makes this filter seem meaningless. Maybe what you want to return is the position of the row in edges if the two spots are also adjacent in nn.
I am not sure I got that right, so I am looking forward to your reply.

Bests,
Chang
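The concern raised in this question can be illustrated with a small NumPy sketch. It is not the DSTG code: the toy `edges` array (columns spot1, spot2), the neighbor table `nn_idx` (standing in for nn[1]) and both filter variants are assumptions made for illustration, and plain array indexing is used where the original uses `edges.iloc`. The first variant mirrors the indexing pattern in the snippet; the second keeps an edge only if spot2 appears among the neighbors of spot1, which is one possible reading of the intended behaviour.

```python
import numpy as np

# Toy neighbor table: row i lists the k=2 nearest neighbors of spot i.
nn_idx = np.array([[2, 0],
                   [1, 2],
                   [0, 1]])

# Toy edge list: each row is (spot1, spot2).
edges = np.array([[0, 1],
                  [1, 2],
                  [2, 1],
                  [2, 0]])

# Variant A: same pattern as the snippet. np.where returns the column
# position of the match inside the neighbor row, not the row number x.
position = [np.where(edges[x, 1] == nn_idx[edges[x, 0]])[0]
            for x in range(edges.shape[0])]
nps = np.concatenate(position, axis=0)
rows_by_position = nps          # these values index edges positionally
filtered_a = edges[rows_by_position]

# Variant B: boolean mask over edge rows, true when spot2 is a neighbor of spot1.
keep = np.array([edges[x, 1] in nn_idx[edges[x, 0]]
                 for x in range(edges.shape[0])])
rows_by_mask = np.flatnonzero(keep)
filtered_b = edges[keep]

print("variant A selects edge rows:", rows_by_position.tolist())
print("variant B selects edge rows:", rows_by_mask.tolist())
```

In this toy case variant A selects a row that failed the neighbor test and duplicates another, while variant B keeps exactly the edges confirmed by the neighbor table. Whether the repository relies on the positional behaviour for some other reason is not something this sketch can settle.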

PBMC data

Hi,

I see that you have provided the simulation details for the 13 PBMC scRNA-seq datasets, which is super helpful. Could you please share the annotations (cell types) of those scRNA-seq data?

Many Thanks!

Question regarding the data files.

Hi,

I am an MS data science student and am having difficulties dealing with the data.
I have downloaded the datasets from Gene Expression Omnibus and from the 10x Visium sites as described in the paper, but they are not in .RDS format. Specifically, I cannot find the labels in the downloaded datasets.
Please guide me in this regard.

Problem with Matrix Subsetting in convert_data.R

I am trying to run DSTG on my private transcriptomic dataset but the attempt is throwing an error.

Rscript convert_data.R "C:\Users\Milan Anand Raj\Desktop\sc_count.RDS" "C:\Users\Milan Anand Raj\Desktop\st_count.RDS" "C:\Users\Milan Anand Raj\Desktop\sc_label.RDS

The command line code.

Error in sc.count[intersect.genes, ] :
  invalid or not-yet-implemented 'Matrix' subsetting
Calls: [ -> [
Execution halted

The Error.

Any help is appreciated.
Thank You.

Question regarding the link-graph usage in the main training script

According to the link-graph construction described in the paper, the graph contains interactions (1) between pseudo-spots and real-spots, and (2) between real-spots.

If I understood correctly, graph1 corresponds to (1) and graph2 corresponds to (2).

DSTG/DSTG/graph.py

Lines 89 to 95 in 4a2f958

    
    graph1 = link1[0].iloc[:, 0:2].reset_index()
    graph1 = graph1.iloc[:,1:3]
    graph1.to_csv('./Datadir/Linked_graph1.csv')
    graph2 = link2[0].iloc[:, 0:2].reset_index()
    graph2 = graph2.iloc[:,1:3]
    graph2.to_csv('./Datadir/Linked_graph2.csv')

However, in the main training script, only graph1 is used.

DSTG/DSTG/train.py

Lines 34 to 37 in 4a2f958

    
    adj, features, labels_binary_train, labels_binary_val, labels_binary_test, train_mask, pred_mask, val_mask, test_mask, new_label, true_label = load_data(
        FLAGS.dataset)
    support = [preprocess_adj(adj)]

Because load_data only loads graph1

DSTG/DSTG/utils.py

Lines 73 to 99 in 4a2f958

    
    id_graph1 = pd.read_csv('{}/Linked_graph1.csv'.format(datadir),
                            index_col=0,
                            sep=',')
    #' map index
    fake1 = np.array([-1] * len(lab_data2.index))
    index1 = np.concatenate((data_train1.index, fake1, data_val1.index,
                             data_test1.index)).flatten()
    #' (feature_data.index==index1).all()
    fake2 = np.array([-1] * len(data_train1))
    fake3 = np.array([-1] * (len(data_val1) + len(data_test1)))
    find1 = np.concatenate((fake2, np.array(lab_data2.index), fake3)).flatten()
    row1 = [np.where(find1 == id_graph1.iloc[i, 1])[0][0]
            for i in range(len(id_graph1))
            ]
    col1 = [np.where(index1 == id_graph1.iloc[i, 0])[0][0]
            for i in range(len(id_graph1))
            ]
    adj = defaultdict(list)  # default value of int is 0
    for i in range(len(labels)):
        adj[i].append(i)
    for i in range(len(row1)):
        adj[row1[i]].append(col1[i])
        adj[col1[i]].append(row1[i])
    adj = nx.adjacency_matrix(nx.from_dict_of_lists(adj))

I'm wondering whether this was done intentionally or accidentally. I would appreciate it if you could clarify this apparent inconsistency between the code and the manuscript. Thank you in advance for any help!
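To make the question concrete, here is a minimal NumPy sketch of how an adjacency matrix changes when a second edge list is merged in. It is not the DSTG implementation: the helper `build_adjacency`, the toy node indices and the assumption that both edge lists already use a shared node numbering are hypothetical. It only mirrors the pattern quoted above, with a self-loop for every node and a symmetric entry for every edge.

```python
import numpy as np


def build_adjacency(n_nodes, *edge_lists):
    """Symmetric 0/1 adjacency with self-loops, built from any number of edge lists."""
    adj = np.eye(n_nodes, dtype=int)      # self-loops, like adj[i].append(i)
    for edges in edge_lists:
        for r, c in edges:
            adj[r, c] = 1                 # edge in one direction
            adj[c, r] = 1                 # and the reverse, to keep it symmetric
    return adj


# Hypothetical toy graph: nodes 0-1 are pseudo-spots, nodes 2-3 are real spots.
graph1_edges = [(0, 2), (1, 3)]           # pseudo-spot to real-spot links
graph2_edges = [(2, 3)]                   # real-spot to real-spot links

adj_graph1_only = build_adjacency(4, graph1_edges)
adj_both = build_adjacency(4, graph1_edges, graph2_edges)
print(adj_graph1_only)
print(adj_both)
```

With only the first edge list, the two real spots are not connected to each other; merging the second list adds that link. Whether omitting graph2 is intended is a question for the authors.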

PDAC data

Hi @QSong-github ,
Could you please share the data of all spots' coordinates for the tissue images in the PDAC-A and B? I would like to map the spots into images. Thanks a lot!

problem in running train.py

It is a great tool! I used my own data to test it, following the input format:

The 'scRNAseq_data.RDS' refers to the single-cell RNA-seq data that you use for deconvolution. It is a data matrix with rows as genes and columns as cells. The 'scRNAseq_label.RDS' is a data frame with rowname as the cell names and one column as cell type. The 'spatial_data.RDS' is the spatial transcriptomics data matrix with rows as genes and columns as spots.

But I get the following error when I run train.py:

raise KeyError(f"{not_found} not in index")

Can you give me some advice? Thanks.

'test_spot_fun' is no longer supported

Hi,

I was trying to run convert_data.R step by step and I encountered the error:
Error: 'test_spot_fun' is not an exported object from 'namespace:SPOTlight'
I checked SPOTlight and it seems that this function no longer exists in the latest version of the package. Can you help with that?

Edit: I found this archived function from their github, but I am not sure it is the correct version of this function.

Thank you

Predict_output.csv problem

Hi:
I ran the demo, and the predict_output results are all NaN.
How can I fix it?

error while running demo

Installed /Users/nameetashah/anaconda3/lib/python3.7/site-packages/DSTG-0.0.1-py3.7.egg
Processing dependencies for DSTG==0.0.1
Finished processing dependencies for DSTG==0.0.1
(base) Nameetas-MacBook-Pro:DSTG-main nameetashah$ cd DSTG
(base) Nameetas-MacBook-Pro:DSTG nameetashah$ Rscript convert_data.R
run synthetic data...
(base) Nameetas-MacBook-Pro:DSTG nameetashah$ python train.py
Traceback (most recent call last):
  File "train.py", line 8, in <module>
    from models import DSTG
  File "/Users/nameetashah/Documents/MSCTR/SGI/software/DSTG-main/DSTG/models.py", line 1, in <module>
    from layers import *
  File "/Users/nameetashah/Documents/MSCTR/SGI/software/DSTG-main/DSTG/layers.py", line 4, in <module>
    flags = tf.app.flags
AttributeError: module 'tensorflow' has no attribute 'app'
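This AttributeError is what TensorFlow 2.x raises for tf.app, because the 1.x flags module is only reachable there through tf.compat.v1. The sketch below shows one way to pick whichever location exists. The helper `resolve_flags` is hypothetical and not part of DSTG or TensorFlow, and the stand-in objects replace a real TensorFlow import so that the snippet runs without it. Other 1.x-only calls in the code base may need similar treatment, or an environment with TensorFlow 1.x.

```python
from types import SimpleNamespace


def resolve_flags(tf_module):
    """Return the flags module from a 1.x-style or 2.x-style TensorFlow module."""
    app = getattr(tf_module, "app", None)
    if app is not None and hasattr(app, "flags"):
        return app.flags                  # TensorFlow 1.x layout: tf.app.flags
    return tf_module.compat.v1.flags      # TensorFlow 2.x layout: tf.compat.v1.flags


# Stand-ins that mimic the two module layouts (no TensorFlow needed to run this).
tf1_like = SimpleNamespace(app=SimpleNamespace(flags="flags from tf.app"))
tf2_like = SimpleNamespace(
    compat=SimpleNamespace(v1=SimpleNamespace(flags="flags from tf.compat.v1")))

print(resolve_flags(tf1_like))
print(resolve_flags(tf2_like))
```

In layers.py, the failing line could then hypothetically become `flags = resolve_flags(tf)` after `import tensorflow as tf`.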

code bug

def filterEdge(edges, neighbors, mats, features, k_filter):
    nn_cells1 = neighbors[4]
    nn_cells2 = neighbors[5]
    mat1 = mats.loc[features, nn_cells1].transpose()
    mat2 = mats.loc[features, nn_cells2].transpose()
    cn_data1 = l2norm(mat1)
    cn_data2 = l2norm(mat2)
    nn = kNN(data=cn_data2.loc[nn_cells2, ],
             query=cn_data1.loc[nn_cells1, ],
             k=k_filter)
    position = [
        np.where(edges.loc[:, "cell2"][x] == nn[1][edges.loc[:, 'cell1'][x], ])[0]
        for x in range(edges.shape[0])
    ]
    nps = np.concatenate(position, axis=0) 
    fedge = edges.iloc[nps, ] 
    #print("\t Finally identified ", fedge.shape[0], " MNN edges")
    pdb.set_trace()
    return (fedge)

Well, I think this code has a bug.
nps here seems to hold the match positions of each cell2 cell within the 200 nearest neighbors of the corresponding cell1 cell, computed on the selected genes.
However, edges here refers to the CSV of cell links between cell1 and cell2.
I don't see what the connection between them is in the line: fedge = edges.iloc[nps, ]

Error on synthetic dataset

I am receiving the following error when I run the synthetic dataset:

C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_count\ST_count_1.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_count\ST_count_2.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_norm\ST_norm_1.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_norm\ST_norm_2.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_scale\ST_scale_1.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_scale\ST_scale_2.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_label\ST_label_1.csv
C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\Infor_Data/ST_label\ST_label_2.csv
Traceback (most recent call last):
  File "train.py", line 35, in <module>
    FLAGS.dataset)
  File "C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\utils.py", line 13, in load_data
    input_data(datadir)
  File "C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\data.py", line 17, in input_data
    Link_Graph(outputdir='Infor_Data')
  File "C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\graph.py", line 55, in Link_Graph
    combine=combine)
  File "C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\gutils.py", line 161, in Link_graph
    k=30)
  File "C:\Users\echan\Desktop\spatiogenomics\DSTG\DSTG\gutils.py", line 66, in KNN
    embedding_cells1 = cell_embedding.loc[cells1, ]
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\indexing.py", line 879, in _getitem_tuple
    return self._multi_take(tup)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\indexing.py", line 943, in _multi_take
    return o._reindex_with_indexers(d, copy=True, allow_dups=True)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\generic.py", line 3810, in _reindex_with_indexers
    copy=copy)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\internals.py", line 4429, in reindex_indexer
    return self.__class__(new_blocks, new_axes)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\internals.py", line 3282, in __init__
    self._verify_integrity()
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\ProgramData\Anaconda3\envs\tf4\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
    passed, implied))
ValueError: Shape of passed values is (30, 2000), indices imply (30, 1000)
