Hi, I'm testing this tool and I find it very interesting; however, I'm having a little problem (I am not sure if this is a bug or if I am missing something).
I have a similarity matrix that I calculated by applying the Jaccard similarity to my data. In R, this matrix is stored in a data frame, where identical individuals have a similarity of 1 and completely distinct individuals have a similarity of 0. I am using the function plotSimilarityMatrix, and the result seems to be correct.
Nonetheless, I tried to recreate the clustering using hclust. This function needs a dist object, so I computed 1 - my similarity matrix, so that a similarity of 1 becomes a distance of 0 and a similarity of 0 becomes a distance of 1, and then called as.dist(myDistanceMatrix) to get a dist object to use with hclust. I used the default parameters for hclust (complete method) on this dist object; however, the resulting clustering is not as nice as the one I got before.
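A minimal sketch of the conversion described above, in base R. The 4 x 4 matrix `sim` holds made-up toy values (not the real Jaccard data), and all variable names other than myDistanceMatrix are placeholders chosen for illustration.

```r
# Toy symmetric similarity matrix (hypothetical values, for illustration only)
sim <- matrix(c(1.0, 0.8, 0.1, 0.2,
                0.8, 1.0, 0.3, 0.1,
                0.1, 0.3, 1.0, 0.7,
                0.2, 0.1, 0.7, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# If the similarities are stored in a data frame, convert to a numeric matrix first
sim <- as.matrix(as.data.frame(sim))

# Similarity 1 -> distance 0, similarity 0 -> distance 1
myDistanceMatrix <- 1 - sim

# as.dist() reinterprets the matrix entries as distances; nothing is recomputed
d <- as.dist(myDistanceMatrix)

# hclust() uses the dist object as given; "complete" is its default method
hc <- hclust(d)

# Cut the tree into two groups to inspect the result
groups <- cutree(hc, k = 2)
print(groups)
```

Note that hclust itself never computes distances; it only applies the linkage method to whatever dist object it receives.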
I do not know which clustering is the correct one, but I have checked the code of the function plotSimilarityMatrix and it uses the pheatmap library. If I am not mistaken, the similarity matrix received as input by plotSimilarityMatrix is passed to pheatmap. I looked into the pheatmap function and saw the following code, which is used for calculating the dendrogram:
cluster_mat = function(mat, distance, method){
    if(!(method %in% c("ward.D", "ward.D2", "ward", "single", "complete", "average", "mcquitty", "median", "centroid"))){
        stop("clustering method has to one form the list: 'ward', 'ward.D', 'ward.D2', 'single', 'complete', 'average', 'mcquitty', 'median' or 'centroid'.")
    }
    if(!(distance[1] %in% c("correlation", "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski")) & class(distance) != "dist"){
        stop("distance has to be a dissimilarity structure as produced by dist or one measure form the list: 'correlation', 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', 'minkowski'")
    }
    if(distance[1] == "correlation"){
        d = as.dist(1 - cor(t(mat)))
    }
    else{
        if(class(distance) == "dist"){
            d = distance
        }
        else{
            d = dist(mat, method = distance)
        }
    }
    return(hclust(d, method = method))
}
This code checks whether the distance argument is a dist object. I think that in this case it would never be a dist object, because the function plotSimilarityMatrix expects a similarity matrix, not a dissimilarity one. Thus, the above function from pheatmap assumes that the input matrix contains data, not distances, and it calculates a distance matrix through d = dist(mat, method = distance). The clustering shown in the plot from plotSimilarityMatrix would therefore result from calculating the distance between the rows of the input similarity matrix.
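To illustrate the distinction being asked about, the sketch below contrasts the two ways of obtaining a dist object from a similarity matrix. The matrix `sim` contains made-up toy values, and the names d_rows, d_direct, hc_rows and hc_direct are placeholders. Whether plotSimilarityMatrix really follows the first route is the assumption under question; this example does not confirm it.

```r
# Toy symmetric similarity matrix (hypothetical values, for illustration only)
sim <- matrix(c(1.0, 0.8, 0.1, 0.2,
                0.8, 1.0, 0.3, 0.1,
                0.1, 0.3, 1.0, 0.7,
                0.2, 0.1, 0.7, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# Route A: treat the similarity matrix as ordinary data, so dist() computes
# Euclidean distances between its rows (as dist(mat, method = "euclidean") would)
d_rows <- dist(sim, method = "euclidean")

# Route B: interpret 1 - similarity directly as the distance
d_direct <- as.dist(1 - sim)

# The two dist objects hold different values for the same pairs
print(round(as.matrix(d_rows), 3))
print(round(as.matrix(d_direct), 3))

# Same linkage method, different input distances
hc_rows   <- hclust(d_rows,   method = "complete")
hc_direct <- hclust(d_direct, method = "complete")

# Merge heights differ; on real data the tree structure may differ as well
print(hc_rows$height)
print(hc_direct$height)
```

On this toy input both routes happen to group the same pairs, but the distances and merge heights are not the same, which is consistent with two plots of the same data looking different.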
Am I correct? I hope I have misunderstood something, because I really like the first plot provided by your library, much more than the one I obtained afterwards by applying hclust.
Kind regards,
Francisco Abad.