
Comments (14)

ArtPoon avatar ArtPoon commented on August 19, 2024

Trying hierarchical clustering with hclust. From trial and error, Ward's method seems to be effective.

The adenovirus data set has 72 unique accession numbers. We want most clusters to have roughly 72 entries, because we expect most genes to appear once in each genome. The actual cluster sizes, of course, vary with the clustering threshold:

clusters <- cutree(hc, h=2.)
hist(table(clusters), col='grey', border='white', breaks=20, main=NA)
abline(v=72, lty=2)

image

clusters <- cutree(hc, h=2.5)
hist(table(clusters), col='grey', border='white', breaks=20, main=NA)
abline(v=72, lty=2)

image

Here is the composition of one of the resulting clusters at a threshold of 2.5:

> table(temp[20])

               fiber              fiber 1              fiber 2 
                  39                    4                    4 
       fiber protein              fiber-1              fiber-2 
                  13                    8                   10 
               fibre        fibre protein hypothetical protein 
                   1                    1                    1 
                IV-1                 IV-2      protein fiber 2 
                   1                    1                    1 
          protein IV  short fiber protein 
                   1                    1 

from ovrf-viz.

ArtPoon avatar ArtPoon commented on August 19, 2024

t-SNE plot of hclust results using cutree to request 32 clusters:

image

  • clusters 2 and 22 are problematic and do not have a clear composition with respect to gene names
  • 16 and 25 should be merged (protease)


ArtPoon avatar ArtPoon commented on August 19, 2024

Things to try:

  • apply clustering method to PCA or t-SNE embedded data
  • use clustering method on large clusters in recursive fashion
  • use semi-supervised clustering approach, using the composition of clusters with respect to protein labels (e.g., "fiber protein") to optimize the clustering method and cutoff criteria
  • incorporate genome location (normalized to relative position?) into the distance (which becomes a linear combination of k-mer distance and position similarity)
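The last idea in the list above could look roughly like the following. This is a minimal sketch in Python for illustration (the thread's own code is R), not something taken from ovrf-viz. The function names, the use of the ORF midpoint, and the `weight` parameter are all assumptions, and it presumes a linear genome and a k-mer distance that has been rescaled to a range comparable with the positional term.

```python
def relative_position(start, end, genome_length):
    """Midpoint of an ORF, scaled to [0, 1] by genome length (hypothetical helper)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return (start + end) / 2.0 / genome_length


def combined_distance(kmer_dist, pos_a, pos_b, weight=0.5):
    """Linear combination of a k-mer distance and a positional distance.

    weight is the share given to position (an assumed tuning parameter);
    weight=0 reproduces the plain k-mer distance.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    positional = abs(pos_a - pos_b)  # both positions are already in [0, 1]
    return (1.0 - weight) * kmer_dist + weight * positional


# Two ORFs with a k-mer distance of 2.0, located at 20% and 70% of their genomes
pos_a = relative_position(100, 300, 1000)
pos_b = relative_position(600, 800, 1000)
d = combined_distance(2.0, pos_a, pos_b, weight=0.25)
```

How much weight position should receive is an open question here; it would presumably need tuning against the labelled cluster compositions.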


ArtPoon avatar ArtPoon commented on August 19, 2024

Distribution of cluster sizes and cluster numbers at varying height cutoffs for hierarchical clustering directly on the k-mer distance matrix:

image

Red lines indicate the number of genomes and average number of ORFs per genome, respectively.
Note that cluster sizes get very large at cutoffs past 2, where the number of clusters rapidly drops below the mean number of ORFs per genome.


ArtPoon avatar ArtPoon commented on August 19, 2024

Same analysis applied to hierarchical clustering on the t-SNE projection of the k-mer distance matrix:

image

Note that the scales of tree heights are radically different (the scale is somewhat meaningless for t-SNE). There is a nicer distribution of cluster sizes around the number of genomes over a stretch of cutoffs. We see a similar stability over a range of cutoffs (about 55-85) with respect to number of clusters, although this plateau is slightly lower than the mean number of ORFs per genome.


ArtPoon avatar ArtPoon commented on August 19, 2024

t-SNE plot with clusters at h=65
image


ArtPoon avatar ArtPoon commented on August 19, 2024

Trying another approach that chooses h to minimize the difference between:

  • the mean proportion of ORFs with unique cluster IDs per genome (singletons)
  • the mean proportion of genomes carrying a given cluster ID

> opt <- optimize(obj.func, c(0, 100))
> opt
$minimum
[1] 45.92852

$objective
[1] 0.0001028029
> clusters <- cutree(hc2, h=opt$minimum)
> pal <- gg2.cols(n=max(clusters))
> par(mfrow=c(1,1))
> plot(res$Y, type='n')
> text(res$Y, label=clusters, col=pal[clusters], cex=0.8)

image


ArtPoon avatar ArtPoon commented on August 19, 2024

Put another way, the optimum is the cutoff at which the probability that an ORF has a unique cluster ID within its genome equals the probability that a given cluster ID is found in a randomly selected genome.
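The body of obj.func is not shown in this thread, so the following is only a guess at what such an objective might compute for one fixed set of cluster assignments. It is a sketch in Python (not the R used above). The function name, the input layout of (genome, cluster) pairs, and the choice of an absolute rather than squared difference are all assumptions.

```python
from collections import Counter, defaultdict


def cluster_objective(orfs):
    """orfs: iterable of (genome_id, cluster_id) pairs, one pair per ORF.

    Returns |mean singleton proportion - mean genome coverage|
    (an assumed form of the objective, not the thread's obj.func).
    """
    per_genome = defaultdict(Counter)
    for genome, cluster in orfs:
        per_genome[genome][cluster] += 1
    if not per_genome:
        raise ValueError("no ORFs supplied")

    # For each genome: share of its ORFs whose cluster ID occurs only once there.
    singleton = [
        sum(1 for n in counts.values() if n == 1) / sum(counts.values())
        for counts in per_genome.values()
    ]

    # For each cluster ID: share of genomes carrying it at least once.
    carriers = Counter()
    for counts in per_genome.values():
        carriers.update(counts.keys())
    coverage = [n / len(per_genome) for n in carriers.values()]

    mean_singleton = sum(singleton) / len(singleton)
    mean_coverage = sum(coverage) / len(coverage)
    return abs(mean_singleton - mean_coverage)


# Toy example: three genomes, three cluster IDs
toy = [("A", 1), ("A", 2), ("A", 3),
       ("B", 1), ("B", 2), ("B", 2),
       ("C", 1), ("C", 3), ("C", 3)]
score = cluster_objective(toy)
```

To search over h, a function like this would be evaluated on the assignments produced at each candidate cutoff, which is the role optimize plays in the R session above.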


ArtPoon avatar ArtPoon commented on August 19, 2024

@horaciobam


ArtPoon avatar ArtPoon commented on August 19, 2024

Useful bit of code to add to kpca.R:

foo <- lapply(split(headers$gene.name, clusters), function(x) {
  tab <- table(as.character(x))
  sort(tab, decreasing=TRUE)
  })
> foo[6]
$`6`

              DNA polymerase                          pol 
                          45                           18 
                         E3L                   polymerase 
                           3                            3 
DNA dependent DNA polymerase                       DNApol 
                           1                            1 
                     DNA pol               DNA polyermase 
                           1                            1 
        hypothetical protein                  pol protein 
                           1                            1 


ArtPoon avatar ArtPoon commented on August 19, 2024

@horaciobam ran this clustering method on Potyviridae. An attempt to run it on Luteoviridae crashed (check for differences in sample size, genome size?)


ArtPoon avatar ArtPoon commented on August 19, 2024

Re-run the kpca.R script with a seed set for the Rtsne analysis, and check the composition of what we are currently viewing as cluster 3 (the node with high degree).


ArtPoon avatar ArtPoon commented on August 19, 2024

For a more visually appealing layout:

    edge [len=10.0];
    node [fontname="Helvetica" style="filled" fillcolor="white"];
    graph [outputorder="edgesfirst"];


ArtPoon avatar ArtPoon commented on August 19, 2024

Try porting clustering method from R to Python so that we don't have to maintain two languages
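As one small piece of such a port, the per-cluster tabulation of gene names (the `foo` snippet for kpca.R earlier in this thread) might translate to Python roughly as follows. This is a sketch using only the standard library. The function name and the plain-list inputs are assumptions, and the clustering step itself is not covered.

```python
from collections import Counter, defaultdict


def cluster_composition(gene_names, clusters):
    """Tabulate gene names within each cluster, most frequent first.

    gene_names and clusters are parallel sequences, one entry per ORF
    (an assumed stand-in for headers$gene.name and the cutree output).
    """
    if len(gene_names) != len(clusters):
        raise ValueError("gene_names and clusters must have the same length")
    by_cluster = defaultdict(Counter)
    for name, cluster in zip(gene_names, clusters):
        by_cluster[cluster][str(name)] += 1
    # most_common() sorts by decreasing count, like sort(tab, decreasing=TRUE)
    return {c: counts.most_common() for c, counts in by_cluster.items()}


names = ["fiber", "fiber", "fibre", "pol"]
ids = [1, 1, 1, 2]
composition = cluster_composition(names, ids)
```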
