This project explores topic modeling in natural language processing (NLP) using Python. Topic modeling is a method for discovering groups of words (i.e., topics) that best represent the information in a collection of text documents. The goal is to extract topics from a set of documents using TF-IDF feature extraction and three clustering algorithms: K-means, DBSCAN, and Hierarchical Clustering.
The input data consists of a set of text documents in a CSV file. The file contains 9 columns: `id`, `title`, `publication`, `author`, `date`, `year`, `month`, `url`, and `content`.
Column | Description |
---|---|
id | unique identifier for each document |
title | title of each document |
publication | publication source name for each document |
author | authors' names for each document |
date | full date of publication for each document |
year | year of publication for each document |
month | month of publication for each document |
url | link to each document |
content | raw text of each document |
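A minimal sketch of loading and inspecting the data with pandas; the inline sample here is a hypothetical stand-in for the real CSV file, using the column names described above:

```python
import io
import pandas as pd

# A tiny in-memory sample standing in for the real articles CSV
# (column names follow the dataset description above).
sample_csv = io.StringIO(
    "id,title,publication,author,date,year,month,url,content\n"
    "1,Sample headline,Example Post,Jane Doe,2017-01-05,2017,1,,Body text here\n"
)

data_frame = pd.read_csv(sample_csv)
print(data_frame.columns.tolist())          # the 9 expected columns
print(data_frame["url"].isnull().all())     # url is entirely empty
```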
- Handle missing values:
    - Handle columns with missing values: drop the `url` column because it is empty (100% null values).
    - Handle rows with missing values: drop all rows that contain null values.
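The missing-value handling above can be sketched with pandas; the inline sample data is a hypothetical stand-in for the real CSV:

```python
import io
import pandas as pd

# Small sample: url is entirely empty, and the second row is missing an author.
sample_csv = io.StringIO(
    "id,title,publication,author,date,year,month,url,content\n"
    "1,Title A,Pub,Ann,2017-01-05,2017,1,,Content A\n"
    "2,Title B,Pub,,2017-02-10,2017,2,,Content B\n"
)
data_frame = pd.read_csv(sample_csv)

# Drop the url column: it is 100% null.
data_frame = data_frame.drop(columns=["url"])

# Drop all remaining rows that still contain null values.
data_frame = data_frame.dropna()
```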
- Handle column data types:

```python
data_frame = data_frame.astype({
    "title": "string", "publication": "string", "author": "string",
    "date": "datetime64[ns]", "year": "int64", "month": "int64",
    "content": "string",
})
```
- Deal with unnecessary columns:
    - Drop the `id` column because it contains only unique identifier values.
    - Extract the day from the `date` column and rename the column to `day`.
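A minimal sketch of these column operations with pandas, using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample with the relevant columns.
data_frame = pd.DataFrame({
    "id": [101, 102],
    "date": pd.to_datetime(["2017-01-05", "2017-02-10"]),
    "year": [2017, 2017],
    "month": [1, 2],
})

# Drop the id column: unique identifiers carry no modeling signal.
data_frame = data_frame.drop(columns=["id"])

# Extract the day of month from the date column into a "day" column.
data_frame["day"] = data_frame["date"].dt.day
```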
The text data is preprocessed using the following steps:
- Convert string columns to lowercase.
- Remove HTML tags using regular expressions.
- Remove URLs using regular expressions.
- Remove unnecessary words, such as the publication name, from the title using regular expressions.
- Apply tokenization using the nltk library:
    - Apply sentence tokenization to split text into sentences.
    - Apply word tokenization to split sentences into individual words.
- Apply stop-word removal: common stop words (such as "a", "an", "the", etc.) are removed from the text using the nltk library.
- Remove punctuation using regular expressions and the `string` library.
- Remove special characters using regular expressions.
- Remove invalid entries from the token lists.
- Remove empty entries.
- Remove cells that contain only one character.
- Remove the 10 most frequent words.
- Remove the 10 rarest words.
- Apply stemming: the remaining words are stemmed using the Porter stemming algorithm from the nltk library.
- Apply POS tagging and lemmatization: the remaining words are lemmatized using the WordNet lemmatizer from the nltk library.
- Preprocess `publication` by applying one-hot encoding.
- Preprocess `author` by splitting the authors into a list and applying a mapping to them.
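The text-cleaning steps above can be sketched as a single function. This is a simplified stand-in: it uses regular expressions and a tiny hand-written stop-word set instead of the nltk tokenizers and corpora the project actually uses, and the `clean_text` name is hypothetical:

```python
import re
import string

# Tiny stop-word set standing in for nltk's English stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "at"}

def clean_text(text: str) -> list:
    text = text.lower()                            # lowercase
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = text.translate(                          # remove punctuation
        str.maketrans("", "", string.punctuation))
    tokens = text.split()                          # naive word tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    tokens = [t for t in tokens if len(t) > 1]     # drop 1-character tokens
    return tokens

tokens = clean_text("<p>The New York Times: read more at https://example.com!</p>")
# tokens -> ['new', 'york', 'times', 'read', 'more']
```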
TF-IDF vectorization: The preprocessed text is converted into a matrix of TF-IDF feature vectors using the sklearn library.
The TF-IDF feature vectors are clustered using three different clustering algorithms:
Algorithm | Details |
---|---|
K-means | K-means is a centroid-based clustering algorithm that partitions the data into K clusters based on the Euclidean distance between the data points and the cluster centroids. |
DBSCAN | DBSCAN is a density-based clustering algorithm that groups together data points that are close together in a high-density region, and separates out data points in low-density regions. |
Hierarchical Clustering | Hierarchical Clustering is a clustering algorithm that groups together similar data points based on their distance from each other in a hierarchical structure. |
The resulting clusters are then analyzed to identify the most common topics in the data.
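A sketch of running the three algorithms on TF-IDF vectors; the documents are hypothetical, and the `eps` and cluster-count values are illustrative, not tuned:

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Two clearly separated hypothetical topics: finance and politics.
docs = [
    "economy markets trade growth",
    "markets stocks trade economy",
    "election vote senate campaign",
    "senate campaign vote election",
]
X = TfidfVectorizer().fit_transform(docs)

# K-means works directly on the sparse TF-IDF matrix.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN and hierarchical clustering are run here on dense vectors.
X_dense = X.toarray()
dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X_dense)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_dense)
```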
The results of the topic modeling analysis are presented in a set of visualizations, including:
- A scatter plot of the DBSCAN Clusters.
- A scatter plot of the Hierarchical Clustering Clusters.
To run the project:
- Install the required Python libraries listed in the `requirements.txt` file.
- Run this code to download the stopwords and WordNet data:

```python
nltk.download('stopwords')
nltk.download('wordnet')
```

- You will also need to download the CSV file from here.
- Word cloud for title column
- Word cloud for content column
- DBSCAN Clustering
- Hierarchical Clustering
This project was created by: