This is a group project on data visualisation and exploratory data analysis (EDA).
This dataset comes from the 2020 annual CDC survey of roughly 400,000 US adults on their health status, with a focus on key indicators of heart disease. Heart disease is a leading cause of death in the US, affecting people of most races. The dataset includes variables such as high blood pressure, high cholesterol, smoking, diabetic status, obesity, physical activity, and alcohol consumption.
The dataset is part of the Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention (CDC). BRFSS conducts annual telephone surveys to gather data on the health status of US residents. The dataset has been cleaned to keep only the most relevant variables, reducing the original set of nearly 300 variables to about 20.
Before cleaning, the survey data consists of 401,958 rows and 279 columns. The vast majority of columns are questions put to respondents about their health status. The relevant variables are the key indicators of heart disease: high blood pressure, high cholesterol, smoking, diabetic status, obesity, physical activity, and alcohol consumption.
This dataset can be used for exploratory data analysis (EDA) and visualisation, as well as for machine learning methods such as logistic regression, SVM, and random forest to predict the likelihood of heart disease. The variable "HeartDisease" should be treated as binary ("Yes": the respondent had heart disease; "No": the respondent did not). Note, however, that the classes are not balanced, so adjusting class weights or undersampling the majority class is advisable for better results.
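As a minimal sketch of the two imbalance remedies mentioned above, the snippet below computes inverse-frequency class weights and randomly undersamples the larger class. The weight formula n / (k * count) is the common "balanced" heuristic. The helper names and the 10/90 toy records are hypothetical illustrations, not values from the dataset; only the column name "HeartDisease" comes from the description.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, smallest))
    rng.shuffle(sampled)
    return sampled


# Hypothetical toy records (NOT the real class ratio of the dataset).
rows = [{"HeartDisease": "Yes"}] * 10 + [{"HeartDisease": "No"}] * 90
labels = [row["HeartDisease"] for row in rows]

weights = balanced_class_weights(labels)
balanced_rows = undersample(rows, "HeartDisease")
```

The weights would typically be passed to a classifier that supports per-class weighting, while the undersampled rows can be used directly as a smaller, balanced training set. Which option works better depends on the model and should be checked on held-out data.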
The dataset can be used to investigate which variables have a significant effect on the likelihood of heart disease. The key indicators (high blood pressure, high cholesterol, smoking, diabetic status, obesity, physical activity, and alcohol consumption) can be analyzed to identify patterns and predictors of heart disease.
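One simple first step for this kind of investigation is to compare the share of respondents with heart disease across the levels of a single indicator. The sketch below does this with the standard library only. The function name, the "Smoking" column name, and the toy records are assumptions made for illustration; the real column names and values should be checked against the data file.

```python
from collections import defaultdict


def prevalence_by(rows, factor, outcome="HeartDisease", positive="Yes"):
    """Return {factor level: share of rows whose outcome equals `positive`}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        level = row[factor]
        totals[level] += 1
        if row[outcome] == positive:
            positives[level] += 1
    return {level: positives[level] / totals[level] for level in totals}


# Hypothetical toy records, invented for this example only.
rows = (
    [{"Smoking": "Yes", "HeartDisease": "Yes"}] * 3
    + [{"Smoking": "Yes", "HeartDisease": "No"}] * 7
    + [{"Smoking": "No", "HeartDisease": "Yes"}] * 1
    + [{"Smoking": "No", "HeartDisease": "No"}] * 9
)

rates = prevalence_by(rows, "Smoking")
for level, rate in sorted(rates.items()):
    print(f"Smoking={level}: {rate:.0%} with heart disease")
```

A gap between groups in such a table is a pattern worth plotting or testing further, not evidence that the indicator causes heart disease.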
The data in this dataset is observational and should not be used to draw causal conclusions.