This project analyzes the distribution of deaths in the United States and investigates trends at the state level.
The deployed app can be viewed here: https://death-machine.herokuapp.com/
According to the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC), heart disease is one of the leading causes of death in the United States. The CDC provides data on each recorded death that occurred on US territory, including the cause of death, the age-adjusted death rate, and the location. For this project we selected, among other datasets, the age-adjusted death rates for the 10 leading causes of death in the United States. The objective is to analyze how causes of death are distributed across the leading states in the US and to point out any trends that emerge within a particular demographic. We also want to predict the likely cause of death for an individual by training a machine learning model on this data. Finally, the analysis examines which variables significantly affect life expectancy in the United States. Verified reporting of this data begins in 1999.
The leading causes of death account for a large portion of mortality in the United States each year. This project provides a visual representation of the leading causes of death for Americans, which states have the highest number of deaths, and what the causes are in those states. The analysis also looks at factors such as age and population size for each state. The main purpose of the Heroku app is to present data on the leading causes of death in an informative and straightforward way, both to educate people and to spark their interest.
- Pandas
- Flask
- SQLAlchemy
- Postgres
- HTML
- CSS
- Bootstrap
- Heroku
- Google Colab
By informing the common American about the leading causes of death in the United States, the hope is that more people will realize that there remains a high demand for research and support for preventative measures. For those who want more statistics and general information on American health, these are useful resources:
The primary source of data for this objective comes from the NCHS. This dataset contains information for the top 10 leading causes of death, and was used to identify heart disease as the leading cause.
- NCHS: https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu
- Kaggle: https://www.kaggle.com/cdc/mortality
- Kaggle: https://www.kaggle.com/cdc/nchs-death-rates-and-causes-of-death
- Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci
- Please see: https://wonder.cdc.gov/wonder/help/mcd-expanded.html
- https://www.cdc.gov/nchs/data/dvs/Multiple-Cause-Record-Layout-2019-508.pdf
- PySpark was used to clean the data and create DataFrames
- The DataFrames were loaded into Postgres, where a Death Database was created
- Flask was used to create a connection to the Postgres database
- Routes were used to query the database and create a dictionary
- Deployment:
- Bootstrap was used to create a theme for the page
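
The "query the database and create a dictionary" step in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: an in-memory SQLite database stands in for Postgres, and the `deaths` table, its columns, the `deaths_by_state` function, and the sample rows are all hypothetical placeholders.

```python
import sqlite3


def deaths_by_state(conn, cause):
    """Return {state: total deaths} for one cause of death."""
    rows = conn.execute(
        "SELECT state, SUM(deaths) FROM deaths WHERE cause = ? GROUP BY state",
        (cause,),
    ).fetchall()
    # Each row is a (state, total) tuple; turn the rows into a dictionary.
    return {state: total for state, total in rows}


# Assumption: SQLite replaces Postgres here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deaths (state TEXT, cause TEXT, year INTEGER, deaths INTEGER)"
)
# Placeholder rows, not real NCHS figures.
conn.executemany(
    "INSERT INTO deaths VALUES (?, ?, ?, ?)",
    [
        ("State A", "Heart disease", 1999, 100),
        ("State A", "Heart disease", 2000, 110),
        ("State B", "Heart disease", 1999, 50),
        ("State B", "Cancer", 1999, 40),
    ],
)

result = deaths_by_state(conn, "Heart disease")
print(result)
```

In a Flask app, a route function would typically call a helper like this and return the dictionary as JSON.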
Ultimately, the objective is to identify the leading cause of death in the United States since 1999, and then identify the states with the highest number of recorded deaths. We used Tableau to analyze the data and plot the visualizations shown below. The analysis is broken down as follows.
Northern states like Minnesota and Dakota have the lowest age-adjusted death rates. Southern states (Louisiana, Kentucky) have had the highest heart disease death rates over the years, and Wyoming has one of the highest suicide rates nationally.
As can be seen below, the leading cause of death in the United States is heart disease, followed closely by cancer and then unintentional injuries.
- Heart Disease
- Cancer
- Unintentional Injuries
- Chronic Lower Respiratory Disease (CLRD)
- Stroke
- Alzheimer’s Disease
- Diabetes
- Influenza & Pneumonia
- Kidney Disease
- Suicide
Heart disease is the leading cause of death in the United States and has remained in first position throughout the years. The other causes shift position slightly, but in general these are the top 10 in rank order: Heart Disease, Cancer, Unintentional Injuries, Chronic Lower Respiratory Disease (CLRD), Stroke, Alzheimer’s Disease, Diabetes, Influenza & Pneumonia, Kidney Disease, and Suicide.
The goals are to predict whether a patient should be diagnosed with heart disease, examine trends and correlations within our data, and determine which features are most important to a heart disease diagnosis. To do this, we used a machine learning algorithm, which learns and improves as it is trained on the data.
The machine learning model for cause of death used 2015 cause-of-death data from the CDC. The data initially had over two million rows covering more than 3,000 causes of death. Some causes had only one entry, which would not be very helpful for training. As a starting point, only causes with over 10,000 entries were kept, which left 60 causes of death. Those were further grouped by kind; for example, lung cancer and breast cancer, originally listed as separate entries, were grouped together. Eventually, the data was narrowed down to ten causes of death for this project. More causes could be added later, once the model has been refined enough to handle that much information.
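
The frequency filter described above can be sketched in a few lines. This is only an illustration of the idea, not the project's actual cleaning code: the record layout, the `icd_code` field name, and the `RARE1` code are hypothetical, and a threshold of 2 stands in for the 10,000 used on the real data.

```python
from collections import Counter


def frequent_causes(codes, min_count):
    """Return {code: count} for codes that occur more than min_count times."""
    counts = Counter(codes)
    return {code: n for code, n in counts.items() if n > min_count}


def keep_rows(records, allowed):
    """Keep only the records whose cause code is in the allowed set."""
    return [r for r in records if r["icd_code"] in allowed]


# Toy records; "RARE1" is a made-up placeholder for a cause with one entry.
records = (
    [{"icd_code": "I251"}] * 3
    + [{"icd_code": "C349"}] * 3
    + [{"icd_code": "RARE1"}]
)

frequent = frequent_causes((r["icd_code"] for r in records), min_count=2)
filtered = keep_rows(records, frequent)
print(frequent, len(filtered))
```

The same filter could be written with a group-by and count in pandas or PySpark; plain Python is used here only to keep the sketch dependency-free.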
The following ten ICD codes were used (in order of frequency, with most frequent at the top and decreasing).
- I251: Heart Failure, Heart Attack, Heart Disease
- C349: Cancer
- J449: COPD
- F03: Dementia
- G309: Alzheimer's Disease
- J189: Pneumonia or other Lung Disease
- A419: Sepsis
- E149: Diabetes
- G20: Parkinson's Disease
- X44: Accidental Poisoning By and Exposure to Drugs and Other Biological Substances
Keras was used to build the machine learning model, which was then converted to TensorFlow.js format so it could be loaded in JavaScript on the website. After cleaning, 962,411 rows of data were ready for the model: roughly 721,808 were used for training and 240,603 for testing. The model used 7 layers, and its performance improved steadily as layers were added.
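
The training and testing counts are consistent with a 75/25 split in which the test count is rounded up. The split fraction is not stated in this README, so the 0.25 below is an assumption; the arithmetic is shown only as a sanity check.

```python
import math

total_rows = 962_411
test_fraction = 0.25  # assumption: the split ratio is not stated in the text

# Round the test set size up, then give the remaining rows to training.
test_rows = math.ceil(total_rows * test_fraction)
train_rows = total_rows - test_rows

print(train_rows, test_rows)  # 721808 240603
```

This rounding matches how scikit-learn's `train_test_split` behaves with its default test size of 0.25; whether that function was used here is also an assumption.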
However, the model deployed with the project was not the Keras model. Because of issues deploying it to the Flask app, a random forest model was used instead so that the app would at least function.
The accuracy of that model is fairly low, so its predictions should not be considered reliable. With more time, the project will be updated to return to the original model, refine it, and get it working in the app.
The prediction model takes education, gender, age, marital status, race, and whether a person is Hispanic as inputs. The original data also included other medical conditions of the person who died. A future goal is to break down that information and add it to the model to make it more precise.
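
Before a model can use these inputs, the categorical fields generally have to be turned into numbers. One common approach is one-hot encoding, sketched below. The README does not say how the project encoded its inputs, so this is an assumption, and the category labels (`edu_1`, `race_1`, and so on) are hypothetical placeholders rather than the CDC's actual codes.

```python
def encode_person(person, categories):
    """Build a flat numeric feature vector: age first, then one-hot fields."""
    features = [float(person["age"])]
    for field, values in categories.items():
        # One slot per allowed value; 1.0 marks the value this person has.
        features.extend(1.0 if person[field] == v else 0.0 for v in values)
    return features


# Hypothetical category labels, for illustration only.
categories = {
    "education": ["edu_1", "edu_2"],
    "gender": ["F", "M"],
    "marital_status": ["single", "married"],
    "race": ["race_1", "race_2"],
    "hispanic": ["no", "yes"],
}

person = {
    "age": 67,
    "education": "edu_2",
    "gender": "F",
    "marital_status": "married",
    "race": "race_1",
    "hispanic": "no",
}

vector = encode_person(person, categories)
print(vector)
```

Keeping the category lists in one dictionary means the same ordering can be reused when encoding training data and when encoding a single user's input in the app.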