
final-project-data-science's Introduction


Google maps + Yelp! 🗺️ 🚀

Context 🌍

User opinion has become invaluable data for planning commercial strategies. Review platforms like Yelp and Google Maps provide a wealth of information about how users perceive various businesses, including restaurants, hotels, beauty services and other related services. This feedback is essential for companies, since it allows them to evaluate their performance, identify areas for improvement and understand how they are perceived by users. As a data consultancy, we have been hired to perform a detailed analysis of user opinion on Yelp and Google Maps for businesses related to personal care and aesthetics in the US market. The beauty category covers a wide range of such services and establishments. Some examples of businesses within this category are beauty salons, spas, hairdressers, barbershops, nail salons, massage parlors, and beauty supply stores.

Content

Description + Objective 🏆

Our project consists of collecting, cleaning and analyzing data from Yelp and Google Maps reviews, using sentiment analysis techniques and machine learning to determine the most suitable locations for new business premises and to discover investment opportunities. To do so, we investigate aspects such as market growth, the demand for beauty services, existing competition and emerging trends. Based on this analysis, we will generate clear and well-founded recommendations for the investor. These recommendations will showcase the most compelling investment opportunities in the beauty industry, highlighting the key aspects that support the viability and growth potential of each opportunity. Although we focus mainly on the aesthetics sector, the methodology can be applied to other types of businesses.

The main objective of the project is to give our client, an investor from the Latin American aesthetics industry, an overview of the US market so that they can make well-informed decisions about becoming a competitor in that market. Through an exhaustive analysis of user opinion on Yelp and Google Maps, we will be able to identify trends, predict the growth or decline of business lines, and support strategic decisions that improve business management and investment.

Demo 🔌

[Figure: demo]

KPIs 📈📉

Our dashboard displays 5 KPIs of different kinds:

  • Average Review Score is the mean rating for the selected filters, with a target of at least 4.2 stars.

  • Number of Reviews is the average number of reviews for the selected filters. It must exceed a minimum of 20 reviews: below that, even if the average review score is high, the sample is too small to be reliable.

  • Variability measures the volatility of the review scores. The less variation the better (since the scores are more predictable), with a threshold of 0.5.

  • Reliability is calculated as the product of the standardized first and second KPIs, giving a more accurate picture of how statistically reliable the selected market filter is.

  • Average Change in Review Score is measured across two time divisions, months and quarters, with a target of at least a 0.4% increase compared to the previous period.

It should be noted that these targets were set not only from a judgment of the market context, but also based on the distribution of the data.
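The KPI definitions above can be sketched in pandas. This is only an illustration, not the dashboard's actual formulas: the column names (`business_id`, `rating`, `date`) are hypothetical, and the "standardization" used for Reliability is assumed here to be a simple rescaling of each KPI to the range 0 to 1.

```python
import pandas as pd

MIN_REVIEWS = 20  # minimum sample size quoted in the KPI list


def compute_kpis(reviews: pd.DataFrame, period: str = "M") -> dict:
    """Compute the five KPIs for an already-filtered set of reviews.

    Hypothetical columns: business_id, rating (1 to 5 stars), date (datetime).
    Use period="M" for months or period="Q" for quarters.
    """
    avg_score = reviews["rating"].mean()
    avg_reviews = reviews.groupby("business_id").size().mean()
    variability = reviews["rating"].std()  # sample standard deviation

    # Assumption: "standardization" means rescaling each KPI to [0, 1].
    score_scaled = (avg_score - 1) / 4
    reviews_scaled = min(avg_reviews / MIN_REVIEWS, 1.0)
    reliability = score_scaled * reviews_scaled

    # Mean rating per period, then the average period-over-period change in %.
    by_period = reviews.groupby(reviews["date"].dt.to_period(period))["rating"].mean()
    avg_change_pct = by_period.pct_change().mean() * 100

    return {
        "avg_score": avg_score,
        "avg_reviews": avg_reviews,
        "variability": variability,
        "reliability": reliability,
        "avg_change_pct": avg_change_pct,
    }


# Tiny made-up sample: two businesses, two months.
sample = pd.DataFrame({
    "business_id": ["a", "a", "b", "b"],
    "rating": [4, 4, 5, 5],
    "date": pd.to_datetime(["2023-01-10", "2023-01-20", "2023-02-05", "2023-02-15"]),
})
kpis = compute_kpis(sample)
print(kpis)
```

Capping the review count at the minimum of 20 means that a market filter with few reviews scores low on Reliability even when its average rating is high, which matches the intent described for the second KPI.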

Tech Stack 💻

Python HTML5 Matplotlib NumPy Pandas scikit-learn SciPy Postgres MySQL Power Bi Apache Spark Jupyter Notebook Visual Studio Code Linux macOS Windows Git GitHub

Microsoft Excel ChatGPT Google Chrome Google Drive Microsoft Word Stack Overflow Adobe Adobe Illustrator Adobe Photoshop Adobe Premiere Pro

Google Meet Discord Slack

[Figure: tech stack]

Methodology + Schedule 📆

[Figure: methodology]

We worked according to the schedule below:

[Figure: project schedule]

ER Model

[Figure: ER model diagram]

Visualizations

POWER BI

[Figure: screenshot of a dashboard]

Machine Learning 🤖

Sentiment Analysis

Natural language processing (NLP) techniques are applied to analyze the sentiment of reviews and classify them as positive, negative or neutral. We use the SentimentIntensityAnalyzer class from the nltk.sentiment module to generate a new column in which each review is classified, thus translating the review text into its representative category.

This lets us sort and filter the data to reveal in which states customers are happiest with the service and in which they are not.
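A minimal sketch of this classification step follows. The cutoff of ±0.05 on VADER's compound score is a commonly used convention and an assumption here, not a value stated by the project; the helper names (`build_vader_analyzer`, `label_sentiment`, `top_states`) and the `(state, text)` input format are hypothetical.

```python
from collections import Counter


def build_vader_analyzer():
    """Return NLTK's VADER analyzer (needs nltk and its 'vader_lexicon' data)."""
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def label_sentiment(text, analyzer, threshold=0.05):
    """Map a review text to 'Positive', 'Negative' or 'Neutral'."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def top_states(reviews, analyzer, sentiment, n=5):
    """Rank states by the number of reviews carrying the given sentiment.

    `reviews` is an iterable of (state, review_text) pairs.
    """
    counts = Counter(
        state for state, text in reviews
        if label_sentiment(text, analyzer) == sentiment
    )
    return counts.most_common(n)
```

With a list of `(state, text)` pairs in `rows`, a call such as `top_states(rows, build_vader_analyzer(), "Positive")` would return the five states with the most positive reviews. Ranking by share of positive reviews instead of raw counts is an equally reasonable choice when states differ greatly in review volume.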

Google Maps

Resulting Table:

[Figure: resulting table for Google Maps]

This fact table holds the characteristics and results of each review, filtered to the beauty and aesthetics category, together with a new column called Sentiment that states whether the review was Positive, Neutral or Negative.

It should be noted that the following graphs do not take "Neutral" reviews into account, since they represented less than 1%.

Top 5 States with the most positive reviews

[Figure: top 5 states with the most positive reviews, Google Maps]

Top 5 States with the most negative reviews

[Figure: top 5 states with the most negative reviews, Google Maps]

Yelp

Resulting Table:

[Figure: resulting table for Yelp]

The description of this table is identical to the previous one for Google Maps; the only difference is the data source, in this case the Yelp dataset.

Top 5 States with the most positive reviews

[Figure: top 5 states with the most positive reviews, Yelp]

Top 5 States with the most negative reviews

[Figure: top 5 states with the most negative reviews, Yelp]

Clustering

Businesses are examined and grouped using a three-dimensional clustering model (latitude, longitude and average rating). The model is geared towards their specific geographic locations (state and county) along with their rating trends, and it complements the view of current competition in each location, ordered from the highest rating to the lowest. The sklearn library was used: StandardScaler to standardize the data and KMeans for the clustering process. In addition, geodesic from the geopy package was used to identify the counties and states of the investigated locations.
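The clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function name `cluster_businesses`, the sample coordinates and the choice of `n_init=10` are all hypothetical, and the county lookup is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_businesses(features, n_clusters, seed=0):
    """Cluster rows of (latitude, longitude, average rating).

    Returns the cluster label of each row and the cluster centers
    expressed back in the original units.
    """
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)  # zero mean, unit variance
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    centers = scaler.inverse_transform(model.cluster_centers_)
    return labels, centers


# Made-up sample: three businesses near New York, three near Los Angeles.
sample = np.array([
    [40.70, -74.00, 4.6],
    [40.72, -74.01, 4.4],
    [40.68, -73.98, 4.5],
    [34.05, -118.25, 3.1],
    [34.00, -118.20, 2.9],
    [34.10, -118.30, 3.0],
])
labels, centers = cluster_businesses(sample, n_clusters=2)

# Order clusters from the highest average rating to the lowest.
order = np.argsort(-centers[:, 2])
for idx in order:
    lat, lon, rating = centers[idx]
    print(f"cluster {idx}: center=({lat:.2f}, {lon:.2f}), avg rating={rating:.2f}")
```

Standardizing first matters because longitude spans a much wider numeric range than a 1 to 5 rating, so without it the rating would barely influence the distances K-Means minimizes.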

Resulting Table:

By extracting the central point of each cluster, we obtain an exact geographical location for that group. This location is translated into the "State" and "County" columns, together with the average review score for the area. The table also compares the Number of Businesses present with the number of Competitor Businesses in the same line of business; the last column expresses this relationship as a percentage: the higher the percentage, the more competition is present in that location. Using these figures as decision parameters, we can indicate in which states and counties it is advisable to invest, depending on the average rating and on the percentage of competition present in each area. If the client wants a business environment with little competition, it is advisable to look for an area with a low percentage compared to the others. Otherwise, if the client prefers to join the competitors already present in an area, that is also feasible, since strong competition indicates that the business is profitable there.

[Figure: resulting table for clustering]
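As a rough sketch of how such a table could be assembled, the snippet below aggregates businesses per cluster. It assumes the competition percentage is the number of competitor businesses divided by the number of businesses in the cluster, which the source does not state explicitly; the column names (`cluster`, `rating`, `category`) and the function `competition_table` are hypothetical.

```python
import pandas as pd


def competition_table(businesses: pd.DataFrame, target_category: str) -> pd.DataFrame:
    """Summarize each cluster: average rating, business count, competition.

    Hypothetical columns: cluster, rating, category.
    """
    table = businesses.groupby("cluster").agg(
        avg_rating=("rating", "mean"),
        n_businesses=("rating", "size"),
    )
    competitors = (
        businesses[businesses["category"] == target_category]
        .groupby("cluster")
        .size()
    )
    # Clusters without any competitor get a count of zero.
    table["n_competitors"] = competitors.reindex(table.index, fill_value=0)
    # Assumed definition: share of businesses that are direct competitors.
    table["competition_pct"] = 100 * table["n_competitors"] / table["n_businesses"]
    return table.sort_values("avg_rating", ascending=False)


# Made-up sample with two clusters.
sample_businesses = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "category": ["nail salon", "spa", "nail salon", "nail salon"],
})
summary = competition_table(sample_businesses, "nail salon")
print(summary)
```

Sorting by average rating mirrors the ordering described for the table, so the locations with the best-rated businesses appear first and the competition percentage can be read alongside them.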

Elbow Graph

[Figure: elbow graph]

The graph indicates that the optimal number of clusters is approximately 5. However, the business context of this project requires a finer partition of locations. For this reason, we decided to use 50 clusters, which gives us 50 different locations.
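For reference, the values behind an elbow graph can be computed as in the sketch below: K-Means is fitted for several values of k and the inertia (within-cluster sum of squared distances) is recorded for each. The data here is synthetic and the function name `elbow_inertias` is hypothetical; it only illustrates the technique.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_inertias(features, k_values, seed=0):
    """Return {k: inertia} for K-Means fitted on standardized features."""
    scaled = StandardScaler().fit_transform(features)
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic stand-in data: three tight groups of (latitude, longitude, rating).
rng = np.random.default_rng(0)
group_centers = np.array([[40.0, -74.0, 4.5], [34.0, -118.0, 3.0], [41.8, -87.6, 4.0]])
points = np.vstack([c + rng.normal(scale=0.1, size=(30, 3)) for c in group_centers])

inertias = elbow_inertias(points, range(1, 6))
for k, value in inertias.items():
    print(k, round(value, 2))
```

The "elbow" is the k after which inertia stops dropping sharply. Choosing a much larger k than the elbow suggests, as done here with 50, trades statistical compactness for finer geographic granularity.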

3D graph of clustering

This graph shows the three-dimensional shape of the United States, where the colors represent the clusters and the points are their members.

[Figure: 3D graph of the clusters]

Finally, this visualization shows the geographic distribution of the data, colored by the rating of each review. Reviews above 4 (green) are the most abundant.

[Figure: geographic distribution of reviews by rating]

User App

STREAMLIT APP

[Figure: Streamlit app]

The application can be accessed through: https://databrick-app-ro1106uif3t.streamlit.app/Proyecto

Team 🫂

Name LinkedIn ↘️ GitHub Role
Paula Pallares linkedin.com/in/paupallares/ paupallares Functional Analyst
Benjamín Zambelli linkedin.com/in/benjamin-zambelli/ BenJokek Data Engineer
Beder Rivera linkedin.com/in/beder-rivera/ cullanco-huaman Data Engineer
Claritzo Pérez Marcano linkedin.com/in/claritzoperez/ Claritzo Data Analyst
Gonzalo Schwerdt linkedin.com/in/gonzalo-schwerdt/ GonzaloSchwerdt ML Engineer
