manuelemacchia / hotel-sentiment-analysis

Implementation of a data science process for predicting the sentiment contained in hotel reviews by building a binary classification model

Home Page: https://manuelemacchia.com/hotel-sentiment-analysis/

License: MIT License

Jupyter Notebook 100.00%

classification data-science jupyter-notebook natural-language-processing nlp nltk pandas python scikit-learn sentiment-analysis supervised-learning

hotel-sentiment-analysis's Introduction

Ciao, I'm Manuele 👋

Machine learning engineer based in Bari, Italy.
Working at Connect Reply.
MSc in Data Science and Engineering at Politecnico di Torino.
Interested in natural language processing, computer vision, generative AI, MLOps, web development, and anything to do with data.

manuelemacchia.com

👨‍💻

hotel-sentiment-analysis's Issues

Create GitHub page

Once we close #4, we should create a GitHub page for this project.

Handle repeating characters from the beginning and the end of a token

The spell checker currently collapses any run of three or more identical consecutive characters down to two (e.g., pproovaaaa becomes pproovaa).

In addition, it should collapse any run of repeated characters at the beginning or the end of a token down to a single character (e.g., pproovaaaa becomes proova).

This should help the stemmer stem tokens correctly.
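The two rules above can be sketched with regular expressions. This is a minimal illustration, not the project's actual spell checker: the function name `squeeze_repeats` and the order in which the rules are applied are assumptions.

```python
import re

# Runs of three or more identical characters anywhere in the token.
_LONG_RUN = re.compile(r"(.)\1{2,}")
# Runs of two or more identical characters at the start or end of the token.
_LEADING_RUN = re.compile(r"^(.)\1+")
_TRAILING_RUN = re.compile(r"(.)\1+$")


def squeeze_repeats(token: str) -> str:
    """Collapse long runs to two characters, then edge runs to one (hypothetical helper)."""
    token = _LONG_RUN.sub(r"\1\1", token)    # existing behaviour: aaaa -> aa
    token = _LEADING_RUN.sub(r"\1", token)   # proposed: leading pp -> p
    token = _TRAILING_RUN.sub(r"\1", token)  # proposed: trailing aa -> a
    return token


print(squeeze_repeats("pproovaaaa"))  # proova
```

One caveat with this approach: a token that legitimately starts or ends with a doubled letter is shortened as well, so the rule may need an exception list or a dictionary check.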

Handling text emoticons

Right now, the classifier can handle Unicode emojis such as 🥰 or 😡.
We should extend the emoji handler to support text emoticons, such as :) or :(.
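One possible approach is to replace each text emoticon with a placeholder token before tokenization, so that the emoticon survives punctuation removal. The sketch below is only an illustration: the emoticon list, the placeholder names, and the function `replace_emoticons` are all assumptions, not part of the existing emoji handler.

```python
import re

# Hypothetical mapping from text emoticons to placeholder tokens.
EMOTICONS = {
    ":)": "emoticon_positive",
    ":-)": "emoticon_positive",
    ":D": "emoticon_positive",
    ":(": "emoticon_negative",
    ":-(": "emoticon_negative",
}

# Try longer emoticons first so that no emoticon is matched only partially.
_EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text: str) -> str:
    """Replace known emoticons with space-delimited placeholder tokens."""
    return _EMOTICON_PATTERN.sub(lambda m: f" {EMOTICONS[m.group(0)]} ", text)


print(replace_emoticons("Great stay :)").split())
# ['Great', 'stay', 'emoticon_positive']
```

Plain substring matching like this can produce false positives, for example a colon followed by a closing parenthesis in ordinary punctuation, so the pattern may need word-boundary conditions in practice.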

Solve test set unavailability with a train-test split of the development set

The test set is unavailable. Therefore, we should use the development set for training, testing and validation.

To do this, we should split the development set into training and test sets (80-20?). To validate the model, we should perform k-fold cross-validation on the training set. Testing should be done on the test set.
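The proposed protocol can be sketched on sample indices using only the standard library. The function names, the fixed seed, and the choice of five folds are assumptions made for illustration; the 80-20 ratio is the one suggested above.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and hold out a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


def k_fold(train_indices, k=5):
    """Yield one (fit, validation) pair of index lists per fold."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Hypothetical development set of 100 reviews.
train, test = train_test_split_indices(100)
for fit, validation in k_fold(train, k=5):
    # Fit on `fit`, score on `validation`; `test` stays untouched until the end.
    pass
```

This sketch does not stratify by class. Since the project already uses scikit-learn, its `train_test_split` and `StratifiedKFold` utilities would likely be the more natural choice there, especially if the two sentiment classes are imbalanced.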

Modularize the classifier

We could modularize the classifier code. This requires further exploration: Which modules do we need? Should we use Python packages? How should we document them?

Write more extensive documentation in markdown format

Delete the PDF report once the markdown documentation is finished.

Review length distribution graph

We could plot the distribution of review lengths in the data exploration phase, with one distribution per class (negative and positive reviews). Is this useful?
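Before plotting anything, a quick per-class summary of review lengths could show whether the two distributions differ enough to be worth graphing. The sketch below assumes that length is measured in whitespace-separated words and that the labels are the strings `"pos"` and `"neg"`; both are illustrative choices, and the sample reviews are made up.

```python
from collections import defaultdict
from statistics import mean, median


def length_summary(reviews, labels):
    """Summarise review length (in whitespace-separated words) per class."""
    lengths = defaultdict(list)
    for text, label in zip(reviews, labels):
        lengths[label].append(len(text.split()))
    return {
        label: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
        }
        for label, values in lengths.items()
    }


# Made-up sample data, for illustration only.
reviews = ["great hotel", "awful room and rude staff", "nice"]
labels = ["pos", "neg", "pos"]
summary = length_summary(reviews, labels)
```

If the summaries differ noticeably between classes, the per-class length lists can then be passed to a histogram function to produce the two distributions.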