This project forked from maartengr/bertopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

License: MIT License

BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!
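
To give a feel for the class-based TF-IDF (c-TF-IDF) idea mentioned above, here is a minimal pure-Python sketch. It treats all documents in a cluster as one bag of words and scores each word by its frequency within the cluster, weighted down when the word is common across all clusters. The weighting used here, the function name class_tf_idf, and the toy documents are assumptions chosen for illustration; the exact formula BERTopic uses is not stated in this README and may differ.

```python
import math
from collections import Counter


def class_tf_idf(docs_per_class):
    """Score words per class; docs_per_class maps a class id to a list of strings.

    Illustrative weighting (an assumption, not necessarily BERTopic's formula):
    score = (word count in class / words in class)
            * log(1 + average words per class / word count over all classes)
    """
    # Merge the documents of each class into a single bag of words
    class_counts = {
        cls: Counter(" ".join(docs).lower().split())
        for cls, docs in docs_per_class.items()
    }

    # Frequency of every word across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    avg_words_per_class = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for cls, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[cls] = {
            word: (freq / n_words)
            * math.log(1 + avg_words_per_class / total_counts[word])
            for word, freq in counts.items()
        }
    return scores


# Hypothetical toy clusters
clusters = {
    "space": ["the space launch", "the orbit space"],
    "pc": ["the windows drive", "the dos windows"],
}
scores = class_tf_idf(clusters)
top_word = {cls: max(s, key=s.get) for cls, s in scores.items()}
print(top_word)
```

In this toy example "the" occurs as often as "space" inside the first cluster, but it scores lower because it appears in every cluster.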

Corresponding Medium posts can be found here and here.

Installation

BERTopic, together with sentence-transformers, can be installed from PyPI:

pip install bertopic

You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:

pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]

To install all backends:

pip install bertopic[all]
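
After installing, it can be handy to check which optional backends are actually importable in the current environment. The sketch below uses only the standard library. The mapping from each extra to a top-level module name is an assumption (in particular, that the use extra provides tensorflow_hub), and the helper name available_backends is hypothetical.

```python
import importlib.util

# Assumption: top-level module that each optional extra makes importable
OPTIONAL_BACKENDS = {
    "flair": "flair",
    "gensim": "gensim",
    "spacy": "spacy",
    "use": "tensorflow_hub",
}


def available_backends():
    """Return {extra name: True/False} without importing the packages."""
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in OPTIONAL_BACKENDS.items()
    }


print(available_backends())
```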

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:

Name Link
Topic Modeling with BERTopic Open In Colab
(Custom) Embedding Models in BERTopic Open In Colab
Advanced Customization in BERTopic Open In Colab
(semi-)Supervised Topic Modeling with BERTopic Open In Colab
Dynamic Topic Modeling with Trump's Tweets Open In Colab

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

After generating topics, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
49	693	49_windows_drive_dos_file
32	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
22	381	22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
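
The two outputs above can be tied together with a small pure-Python sketch: drop the outlier topic, pick the largest remaining topic, and rebuild its name from its top words. The rows and scores are copied from the outputs shown above. That the Name column joins the topic id with the four highest-scoring words is an assumption inferred from those outputs, not something stated here.

```python
# Rows copied from the get_topic_info() output above: (Topic, Count, Name)
topic_info = [
    (-1, 4630, "-1_can_your_will_any"),
    (49, 693, "49_windows_drive_dos_file"),
    (32, 466, "32_jesus_bible_christian_faith"),
    (2, 441, "2_space_launch_orbit_lunar"),
    (22, 381, "22_key_encryption_keys_encrypted"),
]

# Drop the outlier topic (-1) before ranking topics by size
real_topics = [row for row in topic_info if row[0] != -1]
largest = max(real_topics, key=lambda row: row[1])

# First entries of the get_topic(49) output above, as (word, score) pairs
topic_49 = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
    ("disk", 0.004131678774810884),
]

# Assumption: the name is the topic id followed by the four top-scoring words
label = "_".join([str(largest[0])] + [word for word, _ in topic_49[:4]])
print(largest[0], largest[1], label)
```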

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Visualize Topics

After training our BERTopic model, we could go through perhaps a hundred topics one by one to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the generated topics in a way very similar to LDAvis:

topic_model.visualize_topics()

Embedding Models

BERTopic supports many embedding models that can be used to embed the documents and words:

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE

Click here for a full overview of all supported embedding models.

Sentence-Transformers

You can select any model from sentence-transformers here and pass it to BERTopic:

topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)

Flair

Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

You can select any 🤗 transformers model here.

Custom Embeddings

You can also use previously generated embeddings by passing them to fit_transform():

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented at different points in time. Here, we will use all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
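
To make the three cleaning steps above concrete, the following sketch applies the same regular expressions to a single string using only the standard library. The function name clean_tweet and the sample tweet are hypothetical.

```python
import re


def clean_tweet(text):
    """Apply the same three cleaning steps as the pandas code above."""
    # 1. Remove URLs and lowercase the text
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (mentions)
    text = " ".join(word for word in text.split() if word[0] != "@")
    # 3. Replace every run of non-letters with a space and collapse whitespace
    text = " ".join(re.sub("[^a-zA-Z]+", " ", text).split())
    return text


sample = "Thank you @FoxNews! Big rally tonight at 7PM https://t.co/abc123"
print(clean_tweet(sample))  # thank you big rally tonight at pm
```

Note that digits and punctuation are removed as well, so "7PM" becomes "pm", and a tweet consisting only of a link becomes an empty string, which is why empty texts are filtered out afterwards.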

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by calling topics_over_time and passing in his tweets, the related topics, and the corresponding timestamps:

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps)
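
Conceptually, a representation per topic and timestamp means grouping the documents by (timestamp, topic) and describing each group by its most characteristic words. The sketch below illustrates only that grouping step, using raw word counts on hypothetical data. It is a simplification for intuition, not BERTopic's implementation, and the function name words_per_topic_and_time is made up for this example.

```python
from collections import Counter, defaultdict


def words_per_topic_and_time(docs, topics, timestamps, top_n=3):
    """Group documents by (timestamp, topic) and return the top words per group."""
    grouped = defaultdict(Counter)
    for doc, topic, timestamp in zip(docs, topics, timestamps):
        grouped[(timestamp, topic)].update(doc.split())
    return {
        key: [word for word, _ in counts.most_common(top_n)]
        for key, counts in grouped.items()
    }


# Hypothetical, already cleaned documents with their topics and timestamps
docs = ["build the wall", "wall border wall", "jobs jobs economy", "border border security"]
topics = [0, 0, 1, 0]
timestamps = ["2016", "2016", "2016", "2017"]

representations = words_per_topic_and_time(docs, topics, timestamps, top_n=1)
print(representations)
```

In this toy data, topic 0 is described by "wall" in 2016 and by "border" in 2017, which is the kind of shift over time that dynamic topic modeling is meant to surface.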

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

topic_model.visualize_topics_over_time(topics_over_time, top_n=6)

Overview

For quick access to common functions, here is an overview of BERTopic's main methods:

Method	Code
Fit the model	BERTopic().fit(docs)
Fit the model and predict documents	BERTopic().fit_transform(docs)
Predict new documents	BERTopic().transform([new_doc])
Access single topic	BERTopic().get_topic(topic=12)
Access all topics	BERTopic().get_topics()
Get topic freq	BERTopic().get_topic_freq()
Get all topic information	BERTopic().get_topic_info()
Get topics per class	BERTopic().topics_per_class(docs, topics, classes)
Dynamic Topic Modeling	BERTopic().topics_over_time(docs, topics, timestamps)
Visualize Topics	BERTopic().visualize_topics()
Visualize Topic Probability Distribution	BERTopic().visualize_distribution(probs[0])
Visualize Topics over Time	BERTopic().visualize_topics_over_time(topics_over_time)
Visualize Topics per Class	BERTopic().visualize_topics_per_class(topics_per_class)
Update topic representation	BERTopic().update_topics(docs, topics, n_gram_range=(1, 3))
Reduce nr of topics	BERTopic().reduce_topics(docs, topics, nr_topics=30)
Find topics	BERTopic().find_topics("vehicle")
Save model	BERTopic().save("my_model")
Load model	BERTopic.load("my_model")
Get parameters	BERTopic().get_params()

Citation

To cite BERTopic in your work, please use the following bibtex reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.7.0},
  doi          = {10.5281/zenodo.4381785},
  url          = {https://doi.org/10.5281/zenodo.4381785}
}
