haneentronic / nlp-topic-modeling-classification

This project forked from nourayman10/nlp-topic-modeling-classification

Topic Modeling Classification

This project explores topic modeling in natural language processing (NLP) using Python. Topic modeling is a method for finding groups of words (i.e. topics) that best represent the information in a collection of text documents. The goal is to extract topics from a set of documents using TF-IDF for feature extraction and three clustering algorithms: K-means, DBSCAN, and hierarchical clustering.

Data

The input data consists of a set of text documents in a CSV file. The file contains 9 columns: id, title, publication, author, date, year, month, url and content.

Column        Description
id            Unique identifier for each document
title         Title of each document
publication   Name of the publication source for each document
author        Authors' names for each document
date          Full date of publication for each document
year          Year of publication for each document
month         Month of publication for each document
url           Link to each document
content       Raw text of each document

Data Cleaning

  1. Handle missing values
    1. Handle columns with missing values: remove the url column because it is empty (100% null values).
    2. Handle rows with missing values: drop all rows that contain null values.
  2. Handle column data types
    data_frame = data_frame.astype({
        "title": "string",
        "publication": "string",
        "author": "string",
        "date": "datetime64[ns]",
        "year": "int64",
        "month": "int64",
        "content": "string",
    })
    
  3. Deal with unnecessary columns
    1. Drop the id column because it contains only unique values that are not needed.
    2. Extract the day from the date column and rename the column to day.
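
The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the sample rows and the clean helper are hypothetical, and it assumes that "rename the column to day" means the date column is replaced by a day column.

```python
import pandas as pd

# Tiny hypothetical sample with the same nine columns as the CSV described above.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["First story", "Second story", None],
    "publication": ["Pub A", "Pub B", "Pub A"],
    "author": ["Ann Lee", "Bob Ray", "Cy Dow"],
    "date": ["2017-01-05", "2017-02-10", "2017-03-15"],
    "year": [2017, 2017, 2017],
    "month": [1, 2, 3],
    "url": [None, None, None],
    "content": ["Some text.", "More text.", "Even more text."],
})


def clean(data_frame: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop the fully empty url column, then any row that still has nulls.
    data_frame = data_frame.drop(columns=["url"]).dropna()
    # 2. Parse the date strings so the day can be extracted.
    data_frame["date"] = pd.to_datetime(data_frame["date"])
    # 3. Drop id, then replace the date column with a day column (assumption).
    data_frame = data_frame.drop(columns=["id"])
    data_frame["day"] = data_frame["date"].dt.day
    return data_frame.drop(columns=["date"])


cleaned = clean(raw)
print(cleaned)
```

The third sample row has no title, so it is dropped along with the url and id columns.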

Preprocessing for text data

The text data is preprocessed using the following steps:

  1. Convert string columns to lowercase.
  2. Remove HTML tags using regular expressions.
  3. Remove URLs using regular expressions.
  4. Remove unnecessary words, such as the publication name, from the title using regular expressions.
  5. Apply tokenization using the nltk library:
    1. Apply sentence tokenization to split text into sentences.
    2. Apply word tokenization to split sentences into individual words.
  6. Apply stop word removal: common stop words (such as "a", "an", "the", etc.) are removed from the text using the nltk library.
  7. Remove punctuation using regular expressions and the string library.
  8. Remove special characters using regular expressions.
  9. Remove invalid entries from the token lists.
    1. Remove empty entries.
    2. Remove entries that contain a single character.
  10. Remove the 10 most frequent words.
  11. Remove the 10 rarest words.
  12. Apply stemming: the remaining words are stemmed using the Porter stemming algorithm from the nltk library.
  13. Apply POS tagging and lemmatization: the remaining words are lemmatized using the WordNet lemmatizer from the nltk library.
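
Several of the steps above (lowercasing, HTML and URL removal, punctuation and special-character removal, stop word removal, and frequency-based filtering) can be illustrated with the standard library alone. This is a simplified sketch under stated assumptions: the regular expressions, the function names clean_text and drop_extremes, and the small STOP_WORDS set are hypothetical stand-ins, whitespace splitting replaces nltk tokenization, and stemming and lemmatization are left out.

```python
import re
import string
from collections import Counter

# Stand-in stop-word list (assumption); the project uses nltk's English list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def clean_text(text: str) -> list[str]:
    """Lowercase, strip HTML tags, URLs and punctuation, then tokenize."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remaining special characters
    tokens = text.split()                                # whitespace tokenization (simplification)
    # Drop stop words and single-character tokens.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def drop_extremes(docs: list[list[str]], n: int = 10) -> list[list[str]]:
    """Remove the n most frequent and the n rarest words across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    ranked = [word for word, _ in counts.most_common()]
    banned = set(ranked[:n]) | set(ranked[-n:])
    return [[t for t in doc if t not in banned] for doc in docs]


sample = "<p>The QUICK brown fox!</p> Visit https://example.com now."
print(clean_text(sample))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the URL pattern and leave fragments such as "httpsexamplecom" in the text.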

Dealing with categorical data (Nominal Data).

  1. Preprocess the publication column by applying one-hot encoding.
  2. Preprocess the author column by splitting the authors into a list and mapping each author to a value.
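
A rough sketch of these two steps with pandas is shown below. The sample data is hypothetical, and two details are assumptions rather than facts from the project: that multiple authors are separated by commas, and that the mapping assigns each distinct author an integer id.

```python
import pandas as pd

# Hypothetical sample rows.
frame = pd.DataFrame({
    "publication": ["Pub A", "Pub B", "Pub A"],
    "author": ["Ann Lee, Bob Ray", "Bob Ray", "Cy Dow"],
})

# One-hot encode the publication column.
encoded = pd.get_dummies(frame, columns=["publication"], prefix="publication")

# Split each author string into a list of names (comma separator is an assumption).
author_lists = frame["author"].apply(
    lambda cell: [name.strip() for name in cell.split(",")]
)

# Map every distinct author name to an integer id, in order of first appearance.
author_ids: dict[str, int] = {}
for names in author_lists:
    for name in names:
        author_ids.setdefault(name, len(author_ids))

# Replace the author strings with lists of author ids.
encoded["author"] = author_lists.apply(lambda names: [author_ids[n] for n in names])
print(encoded)
```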

Feature Extraction

TF-IDF vectorization: The preprocessed text is converted into a matrix of TF-IDF feature vectors using the sklearn library.

Clustering

The TF-IDF feature vectors are clustered using three different clustering algorithms:

Algorithm: Details
K-means: a centroid-based clustering algorithm that partitions the data into K clusters based on the Euclidean distance between the data points and the cluster centroids.
DBSCAN: a density-based clustering algorithm that groups together data points that are close together in a high-density region, and separates out data points in low-density regions.
Hierarchical Clustering: a clustering algorithm that groups together similar data points based on their distance from each other in a hierarchical structure.

The resulting clusters are then analyzed to identify the most common topics in the data.

Results

The results of the topic modeling analysis are presented in a set of visualizations, including:

  • A scatter plot of the DBSCAN clusters.
  • A scatter plot of the hierarchical clustering results.

Usage

To run the project:

  • Install the required Python libraries listed in the requirements.txt file.
  • Run this code to download the nltk stopwords and wordnet data:
    import nltk
    nltk.download('stopwords')
    nltk.download('wordnet')
    
  • Download the input CSV file from here.

Visualization

  • Word cloud for the title column

    [Image: title]

  • Word cloud for the content column

    [Image: content]

  • DBSCAN clustering

    [Image: DBSCAN]

  • Hierarchical clustering

    [Image: HierarchicalClustering]

Credits

This project was created by:

Contributors

abdelrhman-sayed70, haneentronic, nourayman10
