This project explores topic modeling in natural language processing (NLP) using Python. Topic modeling is a method for discovering groups of words (i.e., topics) that best represent the information in a collection of text documents. The goal is to extract topics from a set of documents using TF-IDF feature extraction and three clustering algorithms: K-means, DBSCAN, and Hierarchical Clustering.
The input data consists of a set of text documents in a CSV file. The file contains 9 columns: `id`, `title`, `publication`, `author`, `date`, `year`, `month`, `url`, and `content`.
Column | Description |
---|---|
id | unique identifier for each document |
title | title of each document |
publication | publication source name for each document |
author | authors' names for each document |
date | full date of publication for each document |
year | year of publication for each document |
month | month of publication for each document |
url | link to each document |
content | raw text of each document |
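A minimal sketch of loading and inspecting the data with pandas; the inline sample here is a hypothetical stand-in for the real CSV file, using the column names described above:

```python
import io
import pandas as pd

# A tiny in-memory sample standing in for the real articles CSV
# (column names follow the dataset description above).
sample_csv = io.StringIO(
    "id,title,publication,author,date,year,month,url,content\n"
    "1,Sample headline,Example Post,Jane Doe,2017-01-05,2017,1,,Body text here\n"
)

data_frame = pd.read_csv(sample_csv)
print(data_frame.columns.tolist())          # the 9 expected columns
print(data_frame["url"].isnull().all())     # url is entirely empty
```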
- Handle missing values:
    - Handle columns with missing values: drop the `url` column because it is empty (100% null values).
    - Handle rows with missing values: drop all rows that contain null values.
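The missing-value handling above can be sketched with pandas; the inline sample data is a hypothetical stand-in for the real CSV:

```python
import io
import pandas as pd

# Small sample: url is entirely empty, and the second row is missing an author.
sample_csv = io.StringIO(
    "id,title,publication,author,date,year,month,url,content\n"
    "1,Title A,Pub,Ann,2017-01-05,2017,1,,Content A\n"
    "2,Title B,Pub,,2017-02-10,2017,2,,Content B\n"
)
data_frame = pd.read_csv(sample_csv)

# Drop the url column: it is 100% null.
data_frame = data_frame.drop(columns=["url"])

# Drop all remaining rows that still contain null values.
data_frame = data_frame.dropna()
```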
- Handle column data types:

```python
data_frame = data_frame.astype({
    "title": "string", "publication": "string", "author": "string",
    "date": "datetime64[ns]", "year": "int64", "month": "int64",
    "content": "string",
})
```
- Deal with unnecessary columns:
    - Drop the `id` column because it contains only unique identifier values.
    - Extract the day from the `date` column and rename the column to `day`.
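A minimal sketch of these column operations with pandas, using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample with the relevant columns.
data_frame = pd.DataFrame({
    "id": [101, 102],
    "date": pd.to_datetime(["2017-01-05", "2017-02-10"]),
    "year": [2017, 2017],
    "month": [1, 2],
})

# Drop the id column: unique identifiers carry no modeling signal.
data_frame = data_frame.drop(columns=["id"])

# Extract the day of month from the date column into a "day" column.
data_frame["day"] = data_frame["date"].dt.day
```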
The text data is preprocessed using the following steps:
- Convert string columns to lowercase.
- Remove HTML tags using regular expressions.
- Remove URLs using regular expressions.
- Remove unnecessary words, such as the publication name, from the title using regular expressions.
- Apply tokenization using the nltk library:
    - Apply sentence tokenization to split text into sentences.
    - Apply word tokenization to split sentences into individual words.
- Apply stop-word removal: common stop words (such as "a", "an", "the", etc.) are removed from the text using the nltk library.
- Remove punctuation using regular expressions and the `string` library.
- Remove special characters using regular expressions.
- Remove invalid entries from the token lists.
- Remove empty entries.
- Remove cells that contain only one character.
- Remove the 10 most frequent words.
- Remove the 10 rarest words.
- Apply stemming: the remaining words are stemmed using the Porter stemming algorithm from the nltk library.
- Apply POS tagging and lemmatization: the remaining words are lemmatized using the WordNet lemmatizer from the nltk library.
- Preprocess `publication` by applying one-hot encoding.
- Preprocess `author` by splitting the authors into a list and applying a mapping to them.
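The text-cleaning steps above can be sketched as a single function. This is a simplified stand-in: it uses regular expressions and a tiny hand-written stop-word set instead of the nltk tokenizers and corpora the project actually uses, and the `clean_text` name is hypothetical:

```python
import re
import string

# Tiny stop-word set standing in for nltk's English stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "at"}

def clean_text(text: str) -> list:
    text = text.lower()                            # lowercase
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = text.translate(                          # remove punctuation
        str.maketrans("", "", string.punctuation))
    tokens = text.split()                          # naive word tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    tokens = [t for t in tokens if len(t) > 1]     # drop 1-character tokens
    return tokens

tokens = clean_text("<p>The New York Times: read more at https://example.com!</p>")
# tokens -> ['new', 'york', 'times', 'read', 'more']
```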
TF-IDF vectorization: The preprocessed text is converted into a matrix of TF-IDF feature vectors using the sklearn library.
The TF-IDF feature vectors are clustered using three different clustering algorithms:
Algorithm | Details |
---|---|
K-means | K-means is a centroid-based clustering algorithm that partitions the data into K clusters based on the Euclidean distance between the data points and the cluster centroids. |
DBSCAN | DBSCAN is a density-based clustering algorithm that groups together data points that are close together in a high-density region, and separates out data points in low-density regions. |
Hierarchical Clustering | Hierarchical Clustering is a clustering algorithm that groups together similar data points based on their distance from each other in a hierarchical structure. |
The resulting clusters are then analyzed to identify the most common topics in the data.
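A sketch of running the three algorithms on TF-IDF vectors; the documents are hypothetical, and the `eps` and cluster-count values are illustrative, not tuned:

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Two clearly separated hypothetical topics: finance and politics.
docs = [
    "economy markets trade growth",
    "markets stocks trade economy",
    "election vote senate campaign",
    "senate campaign vote election",
]
X = TfidfVectorizer().fit_transform(docs)

# K-means works directly on the sparse TF-IDF matrix.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN and hierarchical clustering are run here on dense vectors.
X_dense = X.toarray()
dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X_dense)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_dense)
```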
The results of the topic modeling analysis are presented in a set of visualizations, including:
- A scatter plot of the DBSCAN Clusters.
- A scatter plot of the Hierarchical Clustering Clusters.
To run the project:
- Install the required Python libraries listed in the `requirements.txt` file.
- Run this code to download the stopwords and WordNet data:

```python
nltk.download('stopwords')
nltk.download('wordnet')
```

- You will also need to download the CSV file from here.
- Word cloud for title column
- Word cloud for content column
- DBSCAN Clustering
- Hierarchical Clustering
This project was created by: