Giter Site home page Giter Site logo

mawada-sweis / clustering-analysis Goto Github PK

View Code? Open in Web Editor NEW
3.0 1.0 3.0 50.66 MB

Clustering Analysis is used to analyze the travel behaviors, preferences, and attitudes of New York City citizens.

Python 0.29% Jupyter Notebook 99.71%
cluster-analysis

clustering-analysis's Introduction

The Citywide Mobility Clustering

Data Description

The Citywide Mobility Survey (CMS) is a survey conducted by the New York City Department of Transportation (DOT) to gather information about the travel behavior, preferences, and attitudes of New York City residents. The survey is conducted periodically, and the data collected is used to inform transportation planning and policy decisions.

Table of Contents

Data Objectives

The primary objectives of the CMS data collection are:

  • Identify the factors and experiences that drive transportation choices for New York City residents.
  • Understand current views on the state of transportation within the City.
  • Measure attitudes toward current transportation issues and topics in New York City.

Project Objective

The objective of this project is to analyze the CMS dataset to gain insights into the travel behavior, preferences, and attitudes of New York City residents. Specifically, we aim to:

  • Understand the relationship between residents' behaviors, preferences, attitudes, and traveling methods.
  • Identify any trends or patterns in the data that may be relevant to transportation planning and policy decisions.

By achieving these objectives, we hope to contribute to the ongoing efforts to improve transportation in New York City and enhance the mobility of its residents.

Project Process

We maintained just the data of the individuals who completed the survey, leaving out the remainder of the participating family members. As indicated in Table (1), the data comprises two categories of missing data: natural missing data (NaN) and missing data due to particular reasons, with each category having a label that we processed differently. We maintained all columns except those that were eliminated following discussion: Columns with more than 75% missing data Columns with 65% missing data for a specified cause.

Label Description
Not required under the circumstances 995
Don't know 998
Prefer not to answer 999
No response for required field -9998
A technological error occurred -9999

Table (1): Missing Data Labels

The data consists of naturally encoded columns, label data, numeric data, category data, and dates, all of which we encoded. We scaled the data using the standard and minmax scalers. then for the final clustering, we chose the minmax scaler result since the models deal with it more efficiently than the standard scaler result. Clustering was done using three models: k-mean, k-medoid, DBSCAN, and hierarchical. After selecting the best model result, we examined each cluster and its important characteristics.

Hierarchical Clustering

We examine the Ward linkage method's use in hierarchical clustering. The dendrogram, shown in Graph (1), determines the number of clusters to test in our analysis.

Graph(1): Dendrogram Presentation using Ward Linkage

Graph(1): Dendrogram Presentation using Ward Linkage.

To determine the appropriate number of clusters to test, we examine the dendrogram and look for the longest distance where there is a clear separation of data points. Based on this analysis, we chose to test clustering scenarios with three and four clusters. This decision is reflected in Graph (2), which shows the dendrogram and the resulting clusters for our analysis. By selecting three and four clusters, we can compare the effectiveness of different clustering scenarios and identify the optimal number of clusters for our data.

Threshold at 60 then we have 3 clusters Threshold at 55.5 then we have 4 clusters
Dendrogram at 60 Dendrogram at 55.5

Graph (2): Dendrogram Presentation with Cutting Threshold

To evaluate the effectiveness of hierarchical clustering with Ward linkage, we performed clustering on different scenarios, including clustering before and after PCA with three and four clusters. Our analysis shows that the best results were obtained using hierarchical clustering with Ward linkage after applying PCA, as shown in Graph (3), and we received the maximum Silhouette score when utilizing four clusters.

Hierarchical Clustering

Graph (3): Hierarchical Clustering with Ward Linkage after PCA

Cluster 1 Insights - Unemployed Travelers

Individuals in cluster 1 travel for 5 days on average, indicating a steady travel schedule. Interestingly, the majority of people in this cluster do not have a work, which might explain their travel habits. They use trip planning applications 2-3 to 6-7 days per week, indicating that they rely substantially on these apps for transportation. Also, they rarely use smartphone-app transportation services, with less than three days per month. It is worth mentioning that the majority of people in this group do not have a driver's license, which may explain their reliance on trip planning applications. When it comes to ride services, the majority of people in this cluster prefer Uber, with only a handful choosing Lyft or none at all. Additionally, because they are jobless, the majority of people in this cluster skipped logic relating to job type, work mode, and industry.

Cluster 2 insights - Retired Errand Runners

Individuals in cluster 2 tend to travel only once during the week, likely for errands or appointments. This cluster includes individuals aged between 55 to 74 who do not have a job. Interestingly, individuals in this cluster rarely use trip-planning apps, with less than 3 days of usage per month. Moreover, information about their frequency of smartphone-app ride services usage and teleworking is missing, as many individuals skipped these questions. None of the individuals in this cluster are employed, and the majority do not have a driver's license. Additionally, they do not use either Uber or Lyft for their ride-sharing needs. Already many people in this cluster ignored questions regarding their job type, work style, industry, the purpose of their travels, and method of transportation for using smartphone-app ride services, indicating that their data was incomplete.

Cluster 3 insights - Inconsistent Commuters

People in cluster 3 tend to travel for one to five full weekdays, showing some variation in their travel routine. Individuals in this cluster range in age from 35 to 64 and work only one job. Surprisingly, the frequency of smartphone-app transportation service usage is lacking in this cluster, since many people missed this question. Furthermore, none of the people in this cluster works from home. The majority of people in this cluster are working full-time and have a driver's license. They do not, however, use Uber or Lyft for their ride-sharing needs. The majority of them work in their actual workplace, but the reason for their trips is also missing, as many people skipped this question. Furthermore, the method of transportation for their smartphone-app ride services usage is absent, indicating that the data is incomplete.

Cluster 4 insights - Working Commuters

People in cluster 4 travel for 5 complete weekdays, indicating a steady travel pattern. Individuals in this cluster range in age from 25 to 54 and work only one job. They use trip planning applications 6-7 days each week, showing that they rely on them for transportation. Surprisingly, most people in this group rarely use smartphone-app transportation services, with less than three days per month. Furthermore, none of the people in this cluster work from home. The majority of people in this cluster are working full-time and have a driver's license. When it comes to ride-sharing services, the vast majority of people in this group prefer Uber. Additionally, the majority of them work in their actual workplace.

The analyzed data reveals interesting insights about each cluster's travel behaviour. The size of each cluster is shown in Graph (4).

size of each cluster with hierarchical clustering

Graph (4): The Hierarchical Clustering Size of each Cluster

K-Means Clustering

In order to find the optimal number of clusters for our data, we experimented with various values of k and different dimensions of data both before and after performing PCA. After careful consideration, we selected a model with k=4 on 2 dimensions as the best fit, as shown in Graph (5). We evaluated the models using the silhouette score as our metric of choice.

Screenshot 2023-03-06 153503

Graph (5): K-Means Clustering on 2-Dimensional Data.

Cluster 1 insights - Unemployed Travelers

The insights about cluster 1 remain largely consistent with the previous model. Individuals in cluster 1 still tend to travel 5 complete days, and most do not have a job. They use planning apps 2-3 and 6-7 days a week, and rarely use smartphone app ride services, with less than 3 days of usage per month. Similar to the previous model, individuals in this cluster do not have a driver's license, and many skipped questions related to job type, work mode, and industry. However, there is no information about telework frequency in this model.

Cluster 2 insights - Full-time Working Commuters

The insights about cluster 2 suggest that individuals in this cluster tend to travel 5 complete weekdays, and are typically between the ages of 25 and 54. They have one job and use trip planning apps 6-7 days a week. Like individuals in cluster 1, they rarely use smartphone app ride services, with less than 3 days of usage per month. Most individuals in this cluster are employed full-time and have a driver's license.

Cluster 3 insights - Flexible Commuters

Individuals in cluster 3 tend to travel either for one or five complete weekdays, indicating a flexible travel pattern. This cluster includes individuals aged between 35 to 64 who have one job. The frequency of their smartphone-app ride services usage is missing due to skipped logic, and most of them do not engage in teleworking. Most of the individuals in this cluster are employed full-time and have a driver's license. Interestingly, none of the individuals in this cluster uses either Uber or Lyft for their ride-sharing needs. Most of them work at their physical work location, but the purpose of their trips is missing due to skipped logic. Additionally, the mode of transportation for their smartphone-app ride services usage is missing, indicating incomplete data.

Cluster 4 insights - Weekday Errand Runners

Individuals in cluster 4 tend to travel for one complete weekday, indicating a specific travel pattern. This cluster is characterized by individuals aged between 55 to 74 who are without a job. They rarely use trip-planning apps (less than 3 days a month) and have missing data on the frequency of their smartphone-app ride services usage and teleworking behaviour. Most of the individuals in this cluster are not employed and do not have a driver's license. Interestingly, none of the individuals in this cluster uses either Uber or Lyft for their ride-sharing needs. Data on job type, work mode, the purpose of their trips, industry, and the mode of transportation for smartphone-app ride services usage is missing due to skipped logic.

The data analysis uncovers intriguing insights about each cluster's travel pattern. Graph represents the size of each cluster (6).

k-mean clustering

Graph (6): Number of samples in each cluster after performing k means.

DBSCAN Model

In order to conduct the DBSCAN algorithm, we performed hyperparameter tuning by testing various values of eps and minPts. We also tested the model on different datasets with varying dimensions, both before and after applying PCA. Our evaluation metric for determining the effectiveness of the model was the silhouette score, and based on this metric, we arrived at 6 clusters, as shown in Graph (7).

Screenshot 2023-03-06 153922

Graph(7): DBSCAN Clustering on 2-Dimension

Cluster 1 insight - Limited Mobility Commuters

Cluster 1 consists of individuals who are not currently employed and do not have a job. They are aged between 55 to 74 years old and tend to travel only one complete day per week. Most of them do not use smartphone app ride services (less than 3 days a month) and do not have a license. It seems that many individuals in this cluster skipped the logic in the job type and work mode questions since they are not currently employed. Additionally, they do not use Uber or Lyft services and the most common purpose of trips made using smartphone-app ride services is not applicable, given that they don't use these services.

Cluster 2 insights

This cluster seems to represent employed individuals in the age range of 25-54 who work full-time on work locations for five complete weekdays. They use TNC services (Uber and Lyft) occasionally (less than three days a month). Additionally, some individuals in this cluster also use rail or taxi services for transportation.

Cluster 3 insights

The cluster's age range is from 25 to 64, and they tend to work either one or five complete weekdays. This suggests that they may have a full-time job, which could be why they primarily work on their work location and do not use ride-hailing services frequently.

Cluster 4 insights

Cluster 4 seems to be made up of individuals who are not employed or currently looking for work. They have no specified job type or work mode. They also do not frequently use telework, indicating that they are not engaged in remote work or freelancing. Interestingly, despite not being employed, many of the individuals in this cluster still use ride-hailing services such as Uber and Lyft, and some also use any rail and taxi services. This may suggest that they use these services for personal transportation needs rather than for work-related purposes.

Cluster 5 insights

This cluster consists of people who are not currently employed and have not specified their job type or work mode. They rarely use TNCs (less than 3 days a month) but do use Uber and Lyft and usually travel on a weekday. The age range of this cluster is 55-64 years old, indicating that they are likely to be retirees or individuals who have left the workforce. A significant portion of this cluster has a walking disability, which could impact their mobility options. Some individuals in this cluster use any rail, indicating that public transportation is still a mode of transportation for them despite their TNC usage. Cluster 6 insights People in this cluster aged between 55 to 64, they’re employees with one job and have 1 complete weekday, and they don’t use Uber or Lyft.

The data analysis provides valuable insights about each cluster's travel behavior. The size of each cluster is displayed in Graph (8).

image

Graph (8): Number of samples in each cluster after performing DBSCAN model.

clustering-analysis's People

Contributors

eleenkmail avatar mawada-sweis avatar ramahasiba avatar zubaidasader avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

clustering-analysis's Issues

Grouping Data

Group the data based on the household feature to better understand the data.

Readme file

The Citywide Mobility Survey (CMS)

The Citywide Mobility Survey (CMS) is a survey conducted by the New York City Department of Transportation (DOT) to gather information about the travel behavior, preferences, and attitudes of New York City residents. The survey is conducted periodically, and the data collected is used to inform transportation planning and policy decisions.

Table of Contents

Data Objectives

The primary objectives of the CMS data collection are:

  • Identify the factors and experiences that drive transportation choices for New York City residents.
  • Understand current views on the state of transportation within the City.
  • Measure attitudes toward current transportation issues and topics in New York City.

Project Objective

The objective of this project is to analyze the CMS dataset to gain insights into the travel behavior, preferences, and attitudes of New York City residents. Specifically, we aim to:

  • Understand the relationship between residents' behaviors, preferences, attitudes, and traveling methods.
  • Identify any trends or patterns in the data that may be relevant to transportation planning and policy decisions.

By achieving these objectives, we hope to contribute to the ongoing efforts to improve transportation in New York City and enhance the mobility of its residents.

skeleton folder

Having a well-organized project structure makes it easy to understand and make changes.
In general, these are the basic folders we will work on:

data, models, notebooks, utils and src folders.

Scaling dataset

  • share final two data sets scaled with:
    • standard scaler
    • min max scaler

Basic Dataset information

Provide the basic information about the dataset in the notebook file. It must contain:

  • Dataset Statistical Describing.
  • Dataset information.
  • Display the percentage of missing/noisy/inconsistent data for each column.
  • Suggest method handling based on the data.

Transformation

Do Feature Engineering includes:

  • One hot encoding on categorical features (Nominal).
  • Extract the day, month and year features from each date feature.

EDA

  • The plot displays the percentage of missing data (before and after Cleaning).
  • Plotting displays the column value counts.
  • Correlation graph.
  • Numeric column boxplot.
  • Plots of distribution.
  • Write Insights and conclusions

Hierarchal model

  • Choose the best threshold value.
  • Display the dendrogram graph.

Discuss result

Meet to discuss the clustering results for various algorithms and our progress.

Share final dataset

Based on the dataset and our sumption and understanding:

  • What columns we will have to keep?
  • How to deal with samples of other persons in the same household?
  • How to deal with several type of collected data method in the same dataset?

Output expected:
Dataset with specific collected data method and with only persons who fill out the servay.

Cleaning dataset

Cleaning the data by removing columns that are unnecessary, or contain a large amount of missing data. in addition to dealing with noisy or inconsistent data.

K-mean model

  • Choose the best K value.
  • Display the Elbow graph.
  • How to Display the clustering?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.