I wanted a project where I could replicate some of the things I learned during my internship at Schibsted, but I had to start a new one so as not to rely on Schibsted's data.
This project aims to find clusters of topics based on the titles of YouTube's trending videos, using a dataset from Kaggle.
The project uses SBERT to compute sentence embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and TF-IDF to find topics within clusters.
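
Until the real code is published, here is a rough sketch of how those four steps could fit together. It is not the project's actual implementation: the function names `embed_and_cluster` and `top_terms_per_cluster`, the model name `all-MiniLM-L6-v2`, and all hyperparameters are assumptions chosen for illustration. The TF-IDF step is written in plain Python and treats each cluster as a single document, which is one common way to do it but may differ from what the project ends up using.

```python
import math
import re
from collections import Counter, defaultdict


def embed_and_cluster(titles, model_name="all-MiniLM-L6-v2"):
    """Embed titles with SBERT, reduce with UMAP, cluster with HDBSCAN.

    Third-party imports are local so the TF-IDF helper below can be used
    without them. The model name and hyperparameters are assumptions.
    """
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # One dense vector per title.
    embeddings = SentenceTransformer(model_name).encode(titles)
    # Project to a few dimensions so density-based clustering behaves better.
    reduced = umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine"
    ).fit_transform(embeddings)
    # One integer label per title; -1 marks titles treated as noise.
    return hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)


def top_terms_per_cluster(titles, labels, n_terms=3):
    """Rank terms per cluster by TF-IDF, treating each cluster as one document."""
    counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        if label == -1:  # skip noise points
            continue
        counts[label].update(re.findall(r"[a-z0-9']+", title.lower()))

    n_clusters = len(counts)
    # Number of clusters in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)

    top = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {
            # Term frequency within the cluster times a smoothed IDF.
            term: (n / total)
            * (math.log((1 + n_clusters) / (1 + doc_freq[term])) + 1)
            for term, n in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        top[label] = sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]
    return top
```

With the real dataset, something like `top_terms_per_cluster(titles, embed_and_cluster(titles))` would give a short list of candidate topic words per cluster. The first function needs the `sentence-transformers`, `umap-learn` and `hdbscan` packages installed.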
More documentation and code coming soon!
PS: Plotly figures don't display properly on GitHub, so if you want to see the plots, check the figures directory.