Comments (4)
Thank you for the kind words.
Mostly what UMAP will buy you over using DBSCAN directly on the embedding vectors is a lot more of your data clustered while still having reasonably fine-grained clusters. Can I guarantee better results? I think there are no guarantees, especially in unsupervised learning. Would I expect better results if you use UMAP first and then DBSCAN or HDBSCAN? Yes, I definitely would.
Choosing parameters is always going to come down to the data you have, the kinds of results you want to get, and what you are going to use the clustering for from there. Some rules of thumb: n_components=5 is a good starting point for clustering. It is enough dimensions that UMAP has a much easier time resolving tangles etc. in the optimization, but still pretty low. I would not choose n_components larger than n_neighbors (or really larger than 20 even if you have a very large n_neighbors). The choice of n_neighbors is going to strongly influence the granularity of the clustering. The smaller the value, the more fine-grained the resolution of clusters you'll tend to get out (assuming DBSCAN or HDBSCAN for clustering the UMAP output). As for metric, the usual choice for sentence embeddings is "cosine"; if you want to try something a little different, then import pynndescent and use pynndescent.distances.alternative_cosine, which is a small tweak on cosine distance that may work better for your use case with UMAP.
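For reference, here is a minimal sketch of those rules of thumb, assuming umap-learn and scikit-learn >= 1.3 (for sklearn.cluster.HDBSCAN) are installed. The function names capped_n_components and umap_then_hdbscan are made up for illustration, and n_neighbors=15 and min_cluster_size=10 are arbitrary starting values rather than recommendations; tune both for your data.

```python
def capped_n_components(n_components, n_neighbors, hard_cap=20):
    """Rule of thumb: keep n_components at or below n_neighbors and below ~20."""
    return min(n_components, n_neighbors, hard_cap)


def umap_then_hdbscan(embeddings, n_neighbors=15, n_components=5, min_cluster_size=10):
    """Reduce sentence embeddings with UMAP, then cluster the reduced vectors."""
    # Imported lazily so the helper above is usable without these packages.
    import umap  # provided by the umap-learn package
    from sklearn.cluster import HDBSCAN  # assumption: scikit-learn >= 1.3

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,  # smaller values tend to give finer-grained clusters
        n_components=capped_n_components(n_components, n_neighbors),
        metric="cosine",  # the usual choice for sentence embeddings
    )
    reduced = reducer.fit_transform(embeddings)

    # HDBSCAN labels points it leaves unclustered as -1.
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return reduced, labels
```

Calling umap_then_hdbscan(embeddings) on an array of shape (n_samples, n_features) returns the reduced vectors and one cluster label per row. Swapping in DBSCAN for the last step should work the same way, but then you have to pick eps yourself.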
Thank you for your kind response. I'll start as you suggested!
Can UMAP be updated in batches? Is it possible to create a UMAP model for large images and further train it? It seems impossible due to UMAP's mechanics, but I wonder if implementing this feature would be difficult.
I think for that use case you might want to look into ParametricUMAP. UMAP does have an update method, but it is definitely not the same as training on the full dataset.
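To make the update route concrete, here is a rough sketch, assuming a umap-learn version that provides UMAP.update. The helpers iter_batches and fit_then_update are hypothetical names, and the first batch needs to hold more points than n_neighbors. Treat this as a way to experiment, not as equivalent to fitting on everything at once.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fit_then_update(X, batch_size, **umap_kwargs):
    """Fit UMAP on the first batch of X, then feed the remaining batches to update()."""
    import umap  # imported lazily; provided by the umap-learn package

    batches = iter_batches(X, batch_size)
    # The first batch must be non-empty and larger than n_neighbors.
    mapper = umap.UMAP(**umap_kwargs).fit(next(batches))
    for batch in batches:
        # update() adds the new points to the already fitted model.
        mapper.update(batch)
    return mapper
```

For example, fit_then_update(X, batch_size=10_000, n_components=5) would process a NumPy array X in chunks of 10,000 rows. If you need a model you can keep training, ParametricUMAP is probably the better fit, since it learns a neural network encoder.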