User opinion has become an invaluable input for planning commercial strategy. Review platforms such as Yelp and Google Maps provide a great deal of information about how users perceive businesses, including restaurants, hotels, and beauty and personal care services. This feedback is essential for companies because it lets them evaluate their performance, identify areas for improvement and understand how users perceive them. As a data consultancy, we have been hired to perform a detailed analysis of user opinion on Yelp and Google Maps for businesses related to personal care and aesthetics in the US market. The beauty category covers a wide range of such services and establishments, for example beauty salons, spas, hairdressers, barbershops, nail salons, massage parlors and beauty supply stores.
- Description + Objective
- Demo
- KPIs
- Tech Stack
- Methodology + Schedule
- ER Model
- Visualizations
- Machine Learning
- User App
- Conclusions
- Team
Our project consists of collecting, cleaning and analyzing data from Yelp and Google Maps reviews, using sentiment analysis and machine learning to determine the most suitable locations for new business premises and to discover investment opportunities. To do so, we investigate aspects such as market growth, the demand for beauty services, existing competition and emerging trends. Based on this analysis, we will generate clear and well-founded recommendations for the investor. These recommendations will showcase the most compelling investment opportunities in the beauty industry, highlighting the key aspects that support the viability and growth potential of each one. Although we focus mainly on the aesthetics sector, the methodology can be applied to other types of businesses.
The main objective of the project is to give our client, an investor from the Latin American aesthetics industry, an overview of the US market so that they can make well-informed decisions when entering that market as a competitor. Through an exhaustive analysis of user opinion on Yelp and Google Maps, we will be able to identify trends, predict the growth or decline of business lines, and support strategic decisions that improve business management and investment.
In our dashboard we can visualize 5 KPIs of different kinds:
- Average Review Score: the average rating for the selected filters, with a target of at least 4.2 stars.
- Number of Reviews: the average number of reviews for the selected filters. It must reach a minimum of 20, because below that the sample is too small to be reliable, even when the average score is high.
- Variability: the volatility of the review scores. The less variation the better, since the scores are then more predictable; the reference value is 0.5.
- Reliability: the product of the standardized values of the first and second KPIs, which gives a more accurate indication of how statistically reliable the selected market filter is.
- Average Change in Review Score: the change measured over two time divisions, months and quarters, with a target of at least a 0.4% increase compared with the previous period.
Note that these targets were set not only from a judgment of the market context, but also from the distribution of the data.
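The KPIs above can be illustrated with a small standard-library sketch. It is not the dashboard's actual implementation: the source does not say which standardization is used for the Reliability KPI, so min-max scaling is assumed here, and the population standard deviation is assumed as the measure of variability. The function names and the demo segments are hypothetical.

```python
from statistics import mean, pstdev

MIN_AVG_RATING = 4.2  # target stated in the KPI list
MIN_REVIEWS = 20      # minimum sample size stated in the KPI list


def min_max(values):
    """Scale values to [0, 1]; ASSUMED stand-in for the 'standardization' step."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def kpi_table(ratings_by_segment):
    """Return one KPI row per segment from a {segment: [ratings]} mapping."""
    names = list(ratings_by_segment)
    avgs = [mean(ratings_by_segment[n]) for n in names]
    counts = [len(ratings_by_segment[n]) for n in names]
    # Reliability: product of the scaled average score and the scaled review count.
    reliability = [a * c for a, c in zip(min_max(avgs), min_max(counts))]
    table = {}
    for name, avg, count, rel in zip(names, avgs, counts, reliability):
        table[name] = {
            "avg_score": avg,
            "n_reviews": count,
            # Assumed measure of volatility: population standard deviation.
            "variability": pstdev(ratings_by_segment[name]),
            "reliability": rel,
            "meets_score_target": avg >= MIN_AVG_RATING,
            "meets_volume_target": count >= MIN_REVIEWS,
        }
    return table


def pct_change(previous, current):
    """Period-over-period change in average score, in percent."""
    return (current - previous) / previous * 100.0


# Hypothetical segments, for illustration only.
demo = {"A": [5] * 30, "B": [3] * 10, "C": [4] * 25}
kpis = kpi_table(demo)
```

With min-max scaling, a segment scores high on reliability only when both its average rating and its review count are high relative to the other segments in the current filter.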
We worked according to the schedule below:
POWER BI
- dashboard screenshot -
We applied natural language processing (NLP) techniques to analyze the sentiment of the reviews and classify them as positive, negative or neutral. We used the "SentimentIntensityAnalyzer" class from the "nltk.sentiment" module to generate a new column in which each review is classified, so that the review text is translated into its representative category.
This lets us sort and filter the data to reveal in which States customers are happier with the service and in which they are not.
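The classification step could look roughly like the sketch below. The source does not state how the analyzer's scores were mapped to categories, so the cut-off of ±0.05 on the compound score (a commonly used VADER convention) is an assumption, as are the function names. NLTK is imported inside the function so that the thresholding helper can be used on its own.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a category (ASSUMED cut-offs)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def classify_reviews(texts):
    """Return one sentiment label per review text using NLTK's VADER analyzer."""
    # Imported lazily; requires the nltk package and its VADER lexicon.
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    analyzer = SentimentIntensityAnalyzer()
    return [
        label_from_compound(analyzer.polarity_scores(text)["compound"])
        for text in texts
    ]
```

Calling `classify_reviews(df["text"])` on a review column (the column name is hypothetical) would produce the values for a new Sentiment column.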
This fact table holds the characteristics and results of each review, filtered to the beauty and aesthetics category, together with a new column called Sentiment that states whether the review was Positive, Neutral or Negative.
Note that in the following graph we decided not to include the "Neutral" reviews, since they represented less than 1% of the total.
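As an illustration of how such a table can be ranked by State with neutral reviews left out, here is a minimal standard-library sketch. The function name, the row layout and the demo rows are hypothetical and do not come from the project's data.

```python
from collections import Counter, defaultdict


def positive_share_by_state(rows):
    """rows: iterable of (state, sentiment) pairs.

    Returns [(state, positive_share)] sorted from happiest to least happy,
    counting only Positive and Negative reviews.
    """
    counts = defaultdict(Counter)
    for state, sentiment in rows:
        if sentiment not in ("Positive", "Negative"):
            continue  # neutral (or unknown) labels are left out
        counts[state][sentiment] += 1
    shares = {
        state: c["Positive"] / (c["Positive"] + c["Negative"])
        for state, c in counts.items()
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows, for illustration only.
demo_rows = [
    ("CA", "Positive"), ("CA", "Positive"), ("CA", "Negative"),
    ("TX", "Positive"), ("TX", "Negative"), ("TX", "Neutral"),
    ("FL", "Positive"),
]
ranking = positive_share_by_state(demo_rows)
```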
This table has the same description as the previous one for Google Maps; the only difference is the data source, which in this case is the Yelp dataset.
Using a three-dimensional clustering model (latitude, longitude and average rating), businesses are analyzed and grouped according to their geographic location (State and County) and their rating trends. This adds a picture of the current competition in each location, ordered from the highest rating to the lowest. The model uses the sklearn library: StandardScaler to standardize the data and KMeans for the clustering itself. In addition, geodesic from the geopy library was used to identify the counties and states of the locations under study.
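A minimal sketch of this kind of pipeline, assuming scikit-learn and NumPy are installed, is shown below. It standardizes the three features, runs KMeans, and maps the cluster centers back to original units. The function name, the seed, `n_init=10` and the synthetic points are illustrative choices, not the project's actual settings or data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_businesses(features, n_clusters, seed=0):
    """Cluster rows of (latitude, longitude, avg_rating).

    Returns (labels, centers), with centers expressed in the original units.
    """
    X = np.asarray(features, dtype=float)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # each feature gets zero mean, unit variance
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X_scaled)
    # Undo the scaling so each center reads as latitude, longitude, rating.
    centers = scaler.inverse_transform(model.cluster_centers_)
    return labels, centers


# Synthetic points around three made-up locations, for illustration only.
rng = np.random.default_rng(0)
true_centers = np.array([[34.0, -118.0, 4.5], [40.7, -74.0, 3.8], [29.8, -95.4, 4.1]])
demo = np.vstack([c + rng.normal(scale=0.05, size=(20, 3)) for c in true_centers])
labels, centers = cluster_businesses(demo, n_clusters=3)
```

Standardizing first matters because longitude spans a much wider numeric range than the rating; without it, the rating would have almost no influence on the clusters.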
Extracting the central point of each cluster gives us an exact geographic location for that group. This location is translated into the "State" and "County" columns, together with the average review score for the area. The table also compares the Number of Businesses present with the number of Competitor Businesses in the same line of business; the last column expresses this relationship as a percentage, so the higher the percentage, the more competition there is in that location. With these decision parameters we can indicate in which States and Counties it is advisable to invest, based on the average rating and the percentage of competition in each area. If the client wants a business environment with little competition, it makes sense to look for an area where competition is low compared with the others. If, instead, the client prefers to join the existing competitors in an area, that is also feasible, since strong competition indicates that the business is profitable there.
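The competition column and the resulting shortlist can be sketched as follows. The source does not give the exact formula, so this assumes the percentage is the number of competitor businesses divided by the number of businesses present; the area names and figures are placeholders, not project results.

```python
def competition_pct(n_businesses, n_competitors):
    """Competitors as a percentage of all businesses in an area (ASSUMED formula)."""
    if n_businesses == 0:
        return 0.0
    return n_competitors / n_businesses * 100.0


def rank_low_competition(areas):
    """Sort areas by lowest competition first, then by highest average rating."""
    return sorted(
        areas,
        key=lambda a: (competition_pct(a["businesses"], a["competitors"]), -a["avg_rating"]),
    )


# Placeholder areas, for illustration only.
areas = [
    {"area": "State A / County 1", "avg_rating": 4.3, "businesses": 200, "competitors": 50},
    {"area": "State B / County 2", "avg_rating": 4.5, "businesses": 80, "competitors": 60},
    {"area": "State C / County 3", "avg_rating": 4.6, "businesses": 120, "competitors": 30},
]
shortlist = [a["area"] for a in rank_low_competition(areas)]
```

A client who prefers to enter an already proven market could simply reverse the first sort key.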
The graph indicates that the optimal number of clusters is approximately 5. However, the business context of this project requires a much finer partition of locations. For this reason we decided to use 50 clusters, which gives us 50 different locations.
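A graph of this kind is typically produced with the elbow method: fit KMeans for a range of cluster counts and plot the inertia (within-cluster sum of squares) against k. The sketch below assumes that approach and uses scikit-learn and NumPy; the function name, the range of k and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def inertia_curve(features, k_values, seed=0):
    """Return {k: inertia} for KMeans fitted on standardized features."""
    X = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    curve = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        curve[k] = float(model.inertia_)
    return curve


# Synthetic points around three made-up locations, for illustration only.
rng = np.random.default_rng(0)
true_centers = np.array([[34.0, -118.0, 4.5], [40.7, -74.0, 3.8], [29.8, -95.4, 4.1]])
demo = np.vstack([c + rng.normal(scale=0.05, size=(20, 3)) for c in true_centers])
curve = inertia_curve(demo, range(1, 7))
```

The elbow is the value of k after which the curve flattens. Choosing a larger k than the elbow, as done here with 50, trades statistical compactness for finer geographic resolution.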
In this graph you can make out the outline of the United States in three dimensions, where the colors represent the clusters and the points they contain.
Finally, this visualization shows the geographic distribution of the data, colored according to the rating of each review. Reviews rated above 4 (green) are the most abundant.
The application can be accessed through: https://databrick-app-ro1106uif3t.streamlit.app/Proyecto
Name | LinkedIn | GitHub | Role |
---|---|---|---|
Paula Pallares | linkedin.com/in/paupallares/ | paupallares | Functional Analyst |
Benjamín Zambelli | linkedin.com/in/benjamin-zambelli/ | BenJokek | Data Engineer |
Beder Rivera | linkedin.com/in/beder-rivera/ | cullanco-huaman | Data Engineer |
Claritzo Pérez Marcano | linkedin.com/in/claritzoperez/ | Claritzo | Data Analyst |
Gonzalo Schwerdt | linkedin.com/in/gonzalo-schwerdt/ | GonzaloSchwerdt | ML Engineer |