Ideally the clustering method should be able to: operate on a

Apply clustering method to distance matrix from #12 (ovrf-viz, CLOSED)

poonlab commented on August 19, 2024

Apply clustering method to distance matrix from #12

from ovrf-viz.

Comments (14)

ArtPoon commented on August 19, 2024

Trying hierarchical clustering with hclust. From trial and error, Ward's method seems to be effective.

The adenovirus data set has 72 unique accession numbers. We want most clusters to have roughly 72 entries, because we expect most genes to appear once in each genome. The actual cluster sizes, of course, vary with the clustering threshold:

clusters <- cutree(hc, h=2.)
hist(table(clusters), col='grey', border='white', breaks=20, main=NA)
abline(v=72, lty=2)

clusters <- cutree(hc, h=2.5)
hist(table(clusters), col='grey', border='white', breaks=20, main=NA)
abline(v=72, lty=2)

Here is the composition of one of the resulting clusters at a threshold of 2.5:

> table(temp[20])

               fiber              fiber 1              fiber 2 
                  39                    4                    4 
       fiber protein              fiber-1              fiber-2 
                  13                    8                   10 
               fibre        fibre protein hypothetical protein 
                   1                    1                    1 
                IV-1                 IV-2      protein fiber 2 
                   1                    1                    1 
          protein IV  short fiber protein 
                   1                    1

ArtPoon commented on August 19, 2024

t-SNE plot of hclust results using cutree to request 32 clusters:

Clusters 2 and 22 are problematic: neither has a clear composition with respect to gene names.
Clusters 16 and 25 should be merged (protease).

ArtPoon commented on August 19, 2024

Things to try:

apply clustering method to PCA or t-SNE embedded data
use clustering method on large clusters in recursive fashion
use semi-supervised clustering approach, using the composition of clusters with respect to protein labels (i.e., "fiber protein") to optimize the clustering method and cutoff criteria.
incorporate genome location (normalized to relative position?) into the distance (which becomes a linear combination of k-mer distance and position similarity)
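
The last idea in the list could be sketched roughly as follows, here in Python rather than R. Everything named in this block is hypothetical: the function names, the `weight` parameter, and the use of absolute difference in relative position as the position term are assumptions, not something this thread specifies.

```python
def relative_position(start, genome_length):
    """Normalize an ORF start coordinate to [0, 1] by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return start / genome_length


def combined_distance(kmer_dist, pos_a, pos_b, weight=0.8):
    """Linear combination of k-mer distance and positional distance.

    weight (hypothetical) is the share given to the k-mer distance; the
    remainder goes to |pos_a - pos_b|, where positions are relative
    (0 = genome start, 1 = genome end).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * kmer_dist + (1.0 - weight) * abs(pos_a - pos_b)


# Made-up example: two ORFs with a k-mer distance of 0.5, starting at
# 10% and 30% of the way along their respective genomes.
a = relative_position(3000, 30000)
b = relative_position(10500, 35000)
d = combined_distance(0.5, a, b, weight=0.75)
```

Since the position term is bounded by 1, the k-mer distances would probably need rescaling to a comparable range before the weight means much.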

ArtPoon commented on August 19, 2024

Distribution of cluster sizes and cluster numbers at varying height cutoffs for hierarchical clustering directly on the k-mer distance matrix:

Red lines indicate the number of genomes and average number of ORFs per genome, respectively.
Note that cluster sizes get very large at cutoffs past 2, where the number of clusters rapidly drops below the mean number of ORFs per genome.
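
For reference, the comparison against the two red lines can be reduced to a couple of ratios per cutoff. This is a minimal Python sketch, not the code that produced the plots; `summarize_cut` and its arguments are hypothetical names, and the labels are assumed to be one cluster ID per ORF (as cutree would return for a single height).

```python
from collections import Counter
from statistics import median


def summarize_cut(labels, n_genomes, mean_orfs_per_genome):
    """Summarize one cut of the tree against the two reference values."""
    sizes = Counter(labels)  # cluster ID -> number of ORFs in it
    n_clusters = len(sizes)
    med = median(sizes.values())
    return {
        "n_clusters": n_clusters,
        "median_size": med,
        "max_size": max(sizes.values()),
        # Ratios near 1 mean the cut matches the expectation.
        "size_ratio": med / n_genomes,
        "count_ratio": n_clusters / mean_orfs_per_genome,
    }


# Made-up toy data: 3 genomes with 2 ORFs each, clustered cleanly.
toy_labels = [1, 2, 1, 2, 1, 2]
summary = summarize_cut(toy_labels, n_genomes=3, mean_orfs_per_genome=2.0)
```

Running this over a grid of cutoffs would give the same picture as the histograms in table form.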

ArtPoon commented on August 19, 2024

Same analysis applied to hierarchical clustering on t-SNE projection of k-mer distance matrix:

Note that the scales of tree heights are radically different (the scale is somewhat meaningless for t-SNE). There is a nicer distribution of cluster sizes around the number of genomes over a stretch of cutoffs. We see a similar stability over a range of cutoffs (about 55-85) with respect to number of clusters, although this plateau is slightly lower than the mean number of ORFs per genome.

ArtPoon commented on August 19, 2024

t-SNE plot with clusters at h=65

ArtPoon commented on August 19, 2024

Trying another approach that chooses h to minimize the difference between:

the mean proportion of ORFs with unique cluster IDs per genome (singletons)
the mean proportion of genomes carrying a given cluster ID

> opt <- optimize(obj.func, c(0, 100))
> opt
$minimum
[1] 45.92852

$objective
[1] 0.0001028029
> clusters <- cutree(hc2, h=opt$minimum)
> pal <- gg2.cols(n=max(clusters))
> par(mfrow=c(1,1))
> plot(res$Y, type='n')
> text(res$Y, label=clusters, col=pal[clusters], cex=0.8)

ArtPoon commented on August 19, 2024

Put another way, this optimum is reached where the probability that an ORF has a unique cluster ID within its genome equals the probability that its cluster ID is found in a randomly selected genome.
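
The body of obj.func is not shown in this thread, so the following is only a guess at its shape, written in Python. It assumes "singleton" means an ORF whose cluster ID occurs exactly once within its own genome, and that the objective is the absolute difference between the two means; all function names are hypothetical.

```python
from collections import Counter, defaultdict


def singleton_proportion(genomes, clusters):
    """Mean over genomes of the fraction of ORFs whose cluster ID
    occurs exactly once within that genome."""
    by_genome = defaultdict(list)
    for g, c in zip(genomes, clusters):
        by_genome[g].append(c)
    fractions = []
    for ids in by_genome.values():
        counts = Counter(ids)
        unique = sum(1 for c in ids if counts[c] == 1)
        fractions.append(unique / len(ids))
    return sum(fractions) / len(fractions)


def genome_coverage(genomes, clusters):
    """Mean over cluster IDs of the fraction of genomes carrying that ID."""
    n_genomes = len(set(genomes))
    carriers = defaultdict(set)
    for g, c in zip(genomes, clusters):
        carriers[c].add(g)
    return sum(len(s) / n_genomes for s in carriers.values()) / len(carriers)


def objective(genomes, clusters):
    """Absolute difference between the two proportions (assumed form)."""
    return abs(singleton_proportion(genomes, clusters)
               - genome_coverage(genomes, clusters))


# Made-up toy data: two genomes, three ORFs each.
genomes = ["A", "A", "A", "B", "B", "B"]
over_merged = [1, 1, 2, 1, 1, 2]   # cutoff too high: paralogs lumped together
over_split = [1, 2, 3, 4, 5, 6]    # cutoff too low: every ORF on its own
ideal = [1, 2, 3, 1, 2, 3]         # each gene once per genome
```

On the toy data the objective is zero for `ideal` and positive for both the over-merged and over-split labelings, which is the behaviour the optimization over h relies on.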

ArtPoon commented on August 19, 2024

@horaciobam

ArtPoon commented on August 19, 2024

Useful bit of code to add to kpca.R:

foo <- lapply(split(headers$gene.name, clusters), function(x) {
  tab <- table(as.character(x))
  sort(tab, decreasing=TRUE)
  })
> foo[6]
$`6`

              DNA polymerase                          pol 
                          45                           18 
                         E3L                   polymerase 
                           3                            3 
DNA dependent DNA polymerase                       DNApol 
                           1                            1 
                     DNA pol               DNA polyermase 
                           1                            1 
        hypothetical protein                  pol protein 
                           1                            1

ArtPoon commented on August 19, 2024

@horaciobam ran this clustering method on Potyviridae. An attempted run on Luteoviridae crashed (check for differences in sample size, genome size?)

ArtPoon commented on August 19, 2024

Re-run the kpca.R script with a seed set for the Rtsne analysis, and check the composition of what we are currently viewing as cluster 3 (the node with high degree).

ArtPoon commented on August 19, 2024

For a more visually appealing layout:

    edge [len=10.0];
    node [fontname="Helvetica" style="filled" fillcolor="white"];
    graph [outputorder="edgesfirst"];

ArtPoon commented on August 19, 2024

Try porting the clustering method from R to Python so that we don't have to maintain two languages.
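
As a starting point for such a port, the cluster composition table from the kpca.R snippet earlier in this thread needs only the standard library. This is a sketch under assumptions: `cluster_composition` is a hypothetical name, and the inputs are assumed to be parallel sequences of gene names and cluster IDs. The clustering step itself would presumably map onto `scipy.cluster.hierarchy` (`linkage`, `fcluster`), but that is untested here, and conventions for Ward's method may differ between R and SciPy.

```python
from collections import Counter, defaultdict


def cluster_composition(gene_names, clusters):
    """Tabulate gene names within each cluster, most frequent first.

    Mirrors split() + table() + sort(decreasing=TRUE) from the R snippet.
    """
    grouped = defaultdict(Counter)
    for name, cluster_id in zip(gene_names, clusters):
        grouped[cluster_id][name] += 1
    return {cid: counts.most_common() for cid, counts in grouped.items()}


# Made-up toy data standing in for headers$gene.name and the cutree output.
names = ["DNA polymerase", "pol", "DNA polymerase",
         "fiber", "fiber protein", "fiber"]
labels = [6, 6, 6, 20, 20, 20]
composition = cluster_composition(names, labels)
```

Here `composition[6]` plays the role of `foo[6]` in the R session.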
