
ca-health-perceptions

Investigating the Role of Social Determinants in Shaping Perceptions of Health Status Across California Counties: A Machine Learning Approach

Author: Kat L., University of California, Los Angeles (UCLA)

Keywords: Multidimensional Population Health, Machine Learning in Public Health, Health Determinants, Predictive Modeling

Interactive Website

Screenshots: 1st page of site, 2nd page of site, 3rd page of site

To run the website, run the following command in a terminal: python src/app.py

Abstract

This research investigates the role of social determinants in shaping perceptions of health status across California counties, focusing on a myriad of factors including socioeconomic conditions, demographic indicators, environmental conditions, and health status variables. Utilizing the County Health Rankings (CHR) California dataset from 2020 to 2022, the study employs advanced machine learning techniques, both supervised and unsupervised, to explore and categorize health outcomes. The findings reveal fifteen significant predictors of health perceptions, with physical distress emerging as a central determinant. K-Means clustering uncovers three distinct health profiles among the counties, demonstrating enduring patterns in health outcomes and behaviors. The study underscores the multidimensional nature of health perceptions and highlights the potential of machine learning in public health research, offering insights for targeted health policies and interventions.

Introduction

Health is a multifaceted construct, influenced by individual characteristics, environmental conditions, and a range of social determinants. The World Health Organization emphasizes the impact of social determinants such as socioeconomic status, education, employment, and social support networks on health outcomes and disparities. A comprehensive understanding of these determinants is crucial for effective public health planning and intervention. The advent of machine learning offers a promising avenue for exploring health outcomes, providing advanced computational methods for tasks like disease prediction, health care service optimization, and health behavior analysis. However, the potential of these techniques in categorizing regions based on health status similarities remains largely unexplored, necessitating a detailed investigation into the specific factors predicting self-perceived poor health status in California counties.

Objectives and Scope of Study

  1. Multidimensional Analysis:

    • To conduct a comprehensive analysis considering multiple dimensions of population health, such as health-related quality of life and length of life, to provide a more holistic view [1].
  2. Advanced Machine Learning Techniques:

    • To leverage advanced, data-driven multivariate statistical learning approaches, including linear models and non-linear ensemble tree-based models, to capture the complex, nonlinear relationships in population health [1].
  3. Comprehensive Insight Generation:

    • To interpret the results from the machine learning models to understand the significant predictors and provide insights for more targeted and effective public health strategies and interventions [1].

Data

The primary dataset is the CHR California dataset, spanning the years 2020 to 2022 [2]. This dataset is a comprehensive collection of health-related indicators, including health behaviors, clinical care, social and economic factors, and physical environment, providing a holistic view of the health landscape in California counties [2].

Methodology

  1. Exploratory Data Analysis (EDA):

    • Conducting descriptive statistics, visualizations, and correlation analysis to understand the characteristics and relationships within the dataset [1].
  2. Advanced Machine Learning Modeling:

    • Implementing a suite of statistical learning models (Linear Regression, Support Vector Regression, Decision Tree, Random Forest, and XGBoost) to evaluate population health [1].
  3. Result Interpretation:

    • Presenting visualization tools, including a variable importance heatmap and partial dependence plots of the key predictors, to explain how the important variables relate to the population health outcomes [1].
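
The modeling and evaluation steps above could be sketched as follows. This is a minimal illustration, not the project's actual training script: it assumes scikit-learn, uses synthetic data in place of the CHR dataset, compares only two of the model families, and treats the cross-validation score as the mean R^2 over five folds, which the text does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the county-level feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean R^2 across 5 folds of the training split (assumed CV metric).
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "cv_r2": cv_r2,
        "train_mse": mean_squared_error(y_train, pred_train),
        "test_mse": mean_squared_error(y_test, pred_test),
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Reporting training and testing metrics side by side, as the Results section does, makes it easy to spot models that fit the training split much better than held-out data.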

Results

Model Performance

Linear Regression

  • Cross-Validation Score: 0.944
  • Mean Squared Errors: Training: 0.564, Testing: 1.324
  • R^2 Scores: Training: 0.969, Testing: 0.930
  • Most Important Feature: 'pct_freq_phys_distress' (2.798)

Support Vector Regression

  • Cross-Validation Score: 0.940
  • Mean Squared Errors: Training: 1.515, Testing: 4.131
  • R^2 Scores: Training: 0.916, Testing: 0.780
  • Most Important Feature: 'pct_freq_phys_distress' (2.806)

Decision Tree

  • Cross-Validation Score: 0.806
  • Mean Squared Errors: Training: 0.620, Testing: 3.645
  • R^2 Scores: Training: 0.965, Testing: 0.806
  • Most Important Feature: 'pct_freq_phys_distress' (0.701)

Random Forest

  • Cross-Validation Score: 0.890
  • Mean Squared Errors: Training: 0.255, Testing: 2.284
  • R^2 Scores: Training: 0.986, Testing: 0.879
  • Most Important Feature: 'pct_freq_phys_distress' (0.669)

XGBoost

  • Cross-Validation Score: 0.935
  • Mean Squared Errors: Training: 0.001, Testing: 1.674
  • R^2 Scores: Training: 1.000, Testing: 0.911
  • Most Important Feature: 'pct_freq_phys_distress' (0.652)
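
For readers comparing the figures above, the sketch below shows how MSE and R^2 are conventionally computed. It assumes the standard definitions; the text does not state which implementation produced the reported numbers, and the sample values are hypothetical rather than taken from the CHR dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (not from the dataset).
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}, R^2={r2:.3f}")
```

A lower MSE and an R^2 closer to 1 both indicate a better fit, so a model whose testing MSE is far above its training MSE is fitting the training data more closely than it generalizes.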

Insights and Implications

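All models performed reasonably well. Linear Regression, Support Vector Regression, and XGBoost had the highest cross-validation scores (0.944, 0.940, and 0.935), and Linear Regression also had the best testing results (R^2 of 0.930, MSE of 1.324). Support Vector Regression and the tree-based models showed larger gaps between training and testing error, which suggests some overfitting; XGBoost, for example, reached a training MSE of 0.001 against a testing MSE of 1.674. The feature pct_freq_phys_distress was the most important feature in every model, highlighting the significance of addressing physical distress in public health interventions. The importance scores themselves differ across models (from 0.652 for XGBoost to 2.806 for Support Vector Regression), which underscores the value of employing multiple machine learning models to gain a comprehensive understanding of the data.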
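
How each model's importance score was derived is not stated here. One common choice for a linear model, shown below as an assumption rather than the author's method, is the absolute value of the coefficients after standardizing the features. The data is synthetic, and every feature name other than pct_freq_phys_distress is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Only the first name comes from the results; the others are placeholders.
feature_names = ["pct_freq_phys_distress", "feature_b", "feature_c"]
X = rng.normal(size=(300, 3))
y = 2.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Standardize so coefficient magnitudes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(len(X_std)), X_std])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Importance = absolute standardized coefficient (intercept excluded).
importance = dict(zip(feature_names, np.abs(coef[1:])))
top_feature = max(importance, key=importance.get)
print(top_feature, round(float(importance[top_feature]), 3))
```

Tree-based models typically report a different kind of score, such as impurity-based importances that sum to 1, so magnitudes are not directly comparable between model families even when the ranking agrees.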

Cluster Analysis

2020-2022 Clusters

The clusters identified for 2020-2022 consistently grouped counties with distinct health outcomes, demographic characteristics, and health behaviors. That these patterns endure across all three years has significant implications for designing and implementing health policies and interventions in California.
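
As an illustration of the clustering step, here is a minimal K-Means sketch with three clusters, written as plain Lloyd's algorithm in NumPy. It is not the project's train_unsupervised.py: the data is synthetic, the z-score standardization and random initialization are assumptions, and a library implementation would normally be preferred in practice.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct rows of X (assumed strategy).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in for county-level indicators (not CHR data).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Standardize so no single indicator dominates the distance metric.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

labels, centroids = kmeans(X_std, k=3)
print(np.bincount(labels, minlength=3))
```

Because K-Means depends on its starting centroids, results can vary between runs; fixing the seed or running several initializations and keeping the best is a common way to make cluster assignments reproducible.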

Recommendations for Future Research

  1. Addressing Ethical Considerations:

    • Future studies should consider the ethical implications of feature selection to avoid unintentionally and permanently embedding biases in algorithms [3].
  2. In-depth Analysis of Socioeconomic Factors:

    • Future work could examine the impact of individual socioeconomic factors on health outcomes in more depth to understand the underlying mechanisms and identify potential intervention points [1].
  3. Incorporation of Additional Datasets:

    • Integrating the CHR dataset with other relevant datasets, such as data on healthcare utilization and access, could enrich the analysis and provide a more comprehensive view of the health landscape in California [1].

Repository Structure

├── data # data used in the project
│   ├── processed # processed data for modeling
│   │   ├── ca-counties.geojson # geojson file for California counties
│   │   ├── county-health-data-processed.csv # processed dataset for modeling
│   │   ├── county-health-data-summary.csv # summary statistics for dataset
│   │   ├── county-health-final.csv # final dataset for clustering
│   │   └── county-health-rank.csv # county rankings for health outcomes
│   └── raw # raw data for project  
│       └── county-health-data.csv # raw dataset from County Health Rankings
├── requirements.txt # required packages, modules, etc 
├── notebooks # notebooks for data analysis and modeling
│   └── health-perceptions.ipynb 
└── src # source code for project
    ├── __init__.py
    ├── app.py # dash app for visualizations
    ├── supervised # scripts for supervised learning models
    │   ├── __init__.py
    │   └── train_supervised.py
    └── unsupervised # scripts for unsupervised learning models
        ├── __init__.py
        └── train_unsupervised.py

References


  1. Wei, Z., Narin, A. B., & Mukherjee, S. (2022). Multidimensional population health modeling: A data-driven multivariate statistical learning approach. IEEE Access, 10, 22737-22755.

  2. County Health Rankings & Roadmaps. (n.d.). Methods. Retrieved from County Health Rankings Website.

  3. Chen, I. Y., Pierson, E., Rose, S., Joshi, S., Ferryman, K., & Ghassemi, M. (2021). Ethical machine learning in healthcare. Annual Review of Biomedical Data Science, 4, 123-144.
