alleninstitute / scrattch.hicat Goto Github PK

View Code? Open in Web Editor NEW

103.0 15.0 31.0 157.93 MB

Hierarchical, iterative clustering for analysis of transcriptomics data in R

License: Other

R 42.04% HTML 57.96%

transcriptomics clustering wgcna jaccard r

scrattch.hicat's Introduction

scrattch.hicat: Hierarchical, Iterative Clustering for Analysis of Transcriptomics

Master:

Dev:

Installation

scrattch.hicat has several dependencies, including two from BioConductor and one from Github:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("limma")

devtools::install_github("JinmiaoChenLab/Rphenograph")

Once these dependencies are installed, scrattch.hicat can be installed with:

devtools::install_github("AllenInstitute/scrattch.hicat")

Vignettes

An overview of the main functions in scrattch.hicat

Tutorials

An interactive walkthrough of the major steps in clustering for scrattch.hicat.

Roadmap

The next few updates to scrattch.hicat will be aimed at getting code testing in place for major clustering functions:
0.0.22: Current version; Tests in place for de.genes.R functions.
0.0.23: Tests in place for cluster.R functions.
0.1.0: Vignette re-integrated; Adding pkgdown page; Update to Master branch.

Previous updates: 0.0.21: Added TravisCI and covr integration.

The `scrattch` suite

scrattch.hicat is one component of the scrattch suite of packages for Single Cell RNA-seq Analysis for Transcriptomic Type CHaracterization from the Allen Institute.

License

The license for this package is available on Github at: https://github.com/AllenInstitute/scrattch.hicat/blob/master/LICENSE

Level of Support

We are planning on occasional updating this tool with no fixed schedule. Community involvement is encouraged through both issues and pull requests.

Contribution Agreement

If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in full at: https://github.com/AllenInstitute/scrattch.hicat/blob/master/CONTRIBUTION

Image attribution:

By Internet Archive Book Images [No restrictions], via Wikimedia Commons

scrattch.hicat's People

Contributors

Stargazers

Watchers

scrattch.hicat's Issues

Multicore clustering should allow user defined .libPath

Running run_consensus_clust with mc.cores > 1 fails if scrattch.hicat is not installed in the default library location. It should be possible to specify a user .libPath for additional R instances, e.g. : https://stackoverflow.com/questions/6412459/how-to-specify-the-location-of-r-packages-in-foreach-packages-pkg-do

iter_clust() and iter_consensus_clust() have an if statement glitch

These functions warn "the condition has length > 1 and only the first element will be used" repeatedly. There's an if statement somewhere that's looking for a single value, but is getting multiple values.

This happens if dim.method = "PCA"

Display_cl function

Hello scrattch.hitcat team,

I am following the tutorial presented here: https://taxonomy.shinyapps.io/scrattch_tutorial with the tasic2016data. However, I am having problems when plotting the heatmaps (display.result = display_cl(onestep.result$cl, norm.dat, plot=TRUE, de.param=de.param), I got the following error:

Error in is.null(Rowv) || is.na(Rowv): 'length = 2' in coercion to 'logical(1)'
Traceback:

display_cl(onestep.result$cl, norm.dat, plot = TRUE, de.param = de.param)
plot_cl_heatmap(tmp.dat, cl, markers, ColSideColors = tmp.col,
. prefix = prefix, labels = NULL, by.cl = TRUE, min.sep = min.sep,
. main = main, height = height, width = width)
heatmap.3(tmp.dat[, ord], Rowv = as.dendrogram(gene.hc), Colv = NULL,
. col = col, trace = "none", dendrogram = "none", cexCol = cexCol,
. cexRow = cexRow, ColSideColors = ColSideColors[, ord], breaks = breaks,
. colsep = sep, sepcolor = "black", main = main, key = key,
. density.info = "none")

I was wondering if you could help me with this issue.

Thank you

How to cite scratch.hicat ?

I've used some of your code for my own research and I'm trying to find a way to reference your package. Could you please point me to the correct reference?

Thanks!

install requirements

I think the install instructions are missing installing WGCNA from bioconductor? They say "several dependencies, including two from BioConductor and one from Github", but only list limma from bioconductor, and I get an error if I haven't installed WGCNA already.

fast_tsne.R breaks installation of dev branch

Warning in file(filename, "r", encoding = encoding) :
cannot open file '/home/trygveb/src/FIt-SNE/fast_tsne.R': No such file or directory
Error in file(filename, "r", encoding = encoding) :
cannot open the connection
Error : unable to load R code in package ‘scrattch.hicat’
ERROR: lazy loading failed for package ‘scrattch.hicat’

Installation error

Hi,

I tried installing package, scrattch.hicat, using remotes::install_github("AllenInstitute/scrattch.hicat"), while the below error messages were obtained:

Downloading GitHub repo AllenInstitute/scrattch.hicat@HEAD
Error in utils::download.file(url, path, method = method, quiet = quiet, :
download from 'https://api.github.com/repos/AllenInstitute/scrattch.hicat/tarball/HEAD' failed

Any suggestions are appreciated!

Best regards,
Zhenyao

run_consensus_clust missing return.markers = TRUE setting

Add return.markers = TRUE as default setting to run_consensus_clust. Markers are needed for generation of consensus matrix so function breaks.

ERROR: dependency ‘qlcMatrix’ is not available for package ‘scrattch.hicat’

Hi,

When trying to install scrattch.hicat, I run into the following error:

ERROR: dependency ‘qlcMatrix’ is not available for package ‘scrattch.hicat’

It seems that qlcMatrix is no longer available on CRAN. Installing the package from its GitHub repo using devtools solved the problem.

Perhaps the DESCRIPTION file could be updated to retrieve the dependency automatically from GitHub upon installation?

Best,

Ángeles

Choosing the DE score threshold

Hi,
I am clustering snRNAseq data and I was wondering how you chose DE scores for different types of data (

scrattch.hicat/R/de.genes.R

Line 133 in 885b34e

#' # Recommended initial parameters for 10x Nuclei (> 1,000 genes per sample):

). Was there a set of statistic tests to determine these numbers, or were they chosen based on trial and error with the different datasets?
I am also curious about why DE score contribution is based on the p-values for chi squares from a binary express/doesn't express metric rather than a test for continuous data. I do also see that the genes are filtered by LFC before they are allowed to contribute to the DE score.
Thanks!

iter_clust keeps failing

Hi
Im trying to run the scrattch.hicat module on some 10x data. however the iter_clust function keeps failing with the error Error in unique.default(cl) : unique() applies only to vectors

I am not sure what is causing this. It works fine on the toy dataset on tasic2016 data but seems to fail on mine.
I am using log and not log2 normalised values which were first corrected for library size and then scaled to factor of 10,000 in seurat. Primarily because UMI counts are much lower so its common to use this than cpm or fpkm. can that be causing the error.

Merge clusters based on number of marker genes regardless of de.score

Count the number of up and down markers that meet the de gene criteria (default p < 0.01 and log2FC > 1). Allow merging of cluster pairs based on the number of up and/or down markers regardless of the de score (i.e. summed -log10P). This can be used as a final curation of clusters to require bidirectional markers between all clusters.

Verbosity needs to be adjustable

Running iterative clustering dumps a lot of text to the console. We should be able to toggle or adjust this behavior throughout scrattch.hicat.

Dependencies need to be updated

qlcMatrix has been deprecated on CRAN (see reason here). The DESCRIPTION file should be updated to use the archived version or some other workaround.

Split findVG to computation and plots

Currently, findVG both computes statistics for variable genes and can optionally output plots to a file as a side effect. Instead, we should split this up to 3 functions:
1 computes the statistics
1 generates the plots
1 saves the plots (or leave this up to the user to use ggsave() or cowplot).

Plot River includes source files not included

The riverplot function include two lines that source code not included in the repository, when you run the function it gives this error.

In file(filename, "r", encoding = encoding) :
cannot open file '/home/bharris/zizhen/My_R/map_river_plot.R': No such file or directory

when I view the function, it appears that both of these will throw an error

	source("~/zizhen/My_R/map_river_plot.R")
	source("~/zizhen/My_R/sankey_functions.R")

Broad types

Hello,

I was trying to use hicat on my own data. However, I can't seem to find the genes/markers that you use to classify in the broad types of GABAergic, glutamergic and non-neuronal (first levels of the dendrogram). Could you please indicate where I can find more information on that ?

Thank you so much in advance, I've been searching and searching

Best

Docs: Workflow of functions

Recommended by Fahimeh: We should build a workflow showing which functions are used for each step of analysis, analogous to our supplementary figure.

Details on the clustering method

Hello,

I used scrattch.hicat for my scRNA-seq analysis and I thank you for the developpement of this great tool.

However, In the pipeline that you used in the Tasic et al, 2018 paper, I would like clarification on some points.

You performed the bootstrapping and consensus clustering on each of the broad class that you identified beforehand. What it is not clear for me it is when you merged the co-clustering matrices. I understand that you merged the co-clustering matrices of PCA and WGCNA modes for each broad class. Is it the case? So steps until the merging module is applied for each broad class, right?
In the de_param function, to set the de.score.th you recommand to use for small datasets (#cells < 1000), a de.score.th = 40,
and for large datasets (#cells > 10000), a de.score.th = 150. But do we consider the whole dataset to set this de.score.th or the number of cells encompassed in each broad class?
For example if I have a dataset of 8000 cells with 3 classes (class1: 6000 cells/ class2: 1500 cells/ class3: 500 cells), I have to set:

for class1: de.score.th= 105
for class2: de.score.th= 50
for class3: de.score.th= 40

or, do I have to set:
de.score.th=130 for all?

Last question: when assigning core and intermediate cells, if you find that the best.cluster.score is not the original cluster of the cell, do you reassign this cell to its best cluster or do you keep the original one?

Thank you in advance.
Best regards

Check ref.cl data.frame annotation in compare_annotate

Add this to function as it is required for later steps:
row.names(ref.cl.df) <- ref.cl.df$cluster_label

Update column names for tutorial

Hello!

My name is Andrew Blair, a graduate student at UCSC in Josh Stuart's lab. First, congratulations and thank you for providing a clear and descriptive tutorial!

I ran into a few minor annotation 'bugs' during the tutorial though and wanted to let you guys know:

Update from tutorial 'primary_type' to 'primary_type_label' and 'sample_id' to 'sample_name'
select.cells <- tasic_2016_anno %>%
filter(primary_type_label != "unclassified") %>%
filter(grepl("Igtp|Ndnf|Vip|Sncg|Smad3",primary_type_label)) %>%
select(sample_name) %>%
unlist()
The column 'primary_type' was not present in the data frame
ref.cl.df <- as.data.frame(unique(anno[,c("primary_type_id", "primary_type_color", "broad_type")]))
Update from anno$sample_id to anno$sample_name
ref.cl <- setNames(factor(anno$primary_type_id), anno$sample_name)

Your tutorial worked after these updates.

Thanks Again!

iter_clust() throws an hclust warning in WGCNA mode

iter_clust() warns repeatedly with:
The "ward" method has been renamed to "ward.D"; note new "ward.D2"

This happens if dim.method = "WGCNA".

Creating the rd.dat for cl_merge - "too dense for [CR]sparseMatrix; would have more than 2^31-1 nonzero entries"

I am running a somewhat large dataset (17K x 110K) and wanted to use the cl_merge function but my transposed matrix has too many nonzero entities. I'm sure plenty of people have run larger datasets through this system so I'm unsure of what is wrong here. My norm.dat matrix has 458M non-zero entities and there are about 8K marker genes in my iterated clusters object.

Constellation plots full example

Hello,

Thank you so much for making this toolbox publically available! I am trying to make a constellation plot like the ones you have in your papers, but I am having a hard time generating the correct input for the get_knn_graph() function. You provide great example csv files for the plot_constellation() function but none for the knn_graph function. Would you mind providing me with example input to that function so that I can tailor my data accordingly?

Thanks,
Salwan

Clustering should be able to run without writing files

We need to have a mode for clustering that doesn't write files to the users' machine. This is not normal behavior for most functions. The outputs aren't very large, so we should be able to contain these results in a list object.

tutorial uses old column names

Just noticed that the tutorial at https://taxonomy.shinyapps.io/scrattch_tutorial/ can't be run since it uses the old column names for tasic2016data, before the change AllenInstitute/tasic2016data@17098ed

Constellation plots: calculating knn in reduced dimension space - pca or umap?

Hi scrattch.hicat team,

I've been trying make some constellation plots of my own using the code in your package.

In your methods section for the Yao 2020 preprint, you describe the process for making the constellation plots:

For each cell its 15 nearest neighbors in reduced dimension space were determined and summarized by cluster. For each cluster, we then calculated the fraction of nearest neighbors that were assigned to other clusters.

Does "reduced dimension space" here refer to PCA, or UMAP?
And if PCA - how many PCs did you use?

I understand that the cluster nodes are derived from the UMAP coordinates (centroids), but it's not clear from the explanation or the code if you are getting the knn table from PCA or UMAP coordinates. My hunch is that you use PCA for this, following the workflow used for clustering. Am I right about this?

Thanks a lot!
Carmen

after iter_clust, cl>49

Hi,

What does cl>49 mean after iter_clust? Because the annotation from Tasic 2016 only covers 0-49 clusters, cl>49 cannot be annotated.

My sample size is around 20k cells. This is the parameter that I used.
de.param <- de_param(padj.th = 0.05,
lfc.th = 1,
low.th = 1,
q1.th = 0.5,
q.diff.th = 0.7,
de.score.th = 150)

pca.clust.result <- iter_clust(norm.dat,
dim.method = "pca",
de.param = de.param)

Thank you.

Addition of functions for checking whether clusters are outliers or donor-specific

library() order matters

If library(WGCNA) is called later than library(Matrix), there are errors that crop up related to matrices. We may need to always call sparse matrix-related computations using Matrix::

alleninstitute / scrattch.hicat Goto Github PK

scrattch.hicat's Introduction

scrattch.hicat: Hierarchical, Iterative Clustering for Analysis of Transcriptomics

Installation

Vignettes

Tutorials

Roadmap

The scrattch suite

License

Level of Support

Contribution Agreement

Image attribution:

scrattch.hicat's People

Contributors

Stargazers

Watchers

Forkers

scrattch.hicat's Issues

Recommend Projects

Recommend Topics

Recommend Org

The `scrattch` suite