Project Happiness

What determines happiness? Can we predict how happy a country is?

Data

  • Data was collected from Knoema's World Happiness Index for 2018, which sources its data from the World Happiness Report published by the Sustainable Development Solutions Network
  • The report covers 156 countries and ranks them by polled happiness score, alongside other life-evaluation questions
  • Rankings are taken from nationally representative samples of ~1,000 respondents per country
  • The happiness score is measured on a scale of 1 to 10. Each variable is a population-weighted average score on a scale from 0 to 10

The model aims to predict the Happiness Score for a given country. The independent variables, each measured per country, are:

  • Positive Emotions
  • Negative Emotions
  • GDP per Capita
  • Health/Life Expectancy
  • Freedom
  • Social Support
  • Government Confidence
  • Generosity
  • Perception of Corruption

Data Cleaning

  • Reorganized the schema by pivoting the dataframe so that each country is the index and each independent variable has its own column
  • Dropped all columns other than the selected independent variables
  • Dropped all rows containing NaN values, which left the dataframe with 123 countries that have workable data
  • Snippet of resulting dataframe:

Screen Shot 2019-04-18 at 11 34 46 AM
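The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the long-format layout, the column names `Country`, `Indicator`, and `Value`, and all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical long-format extract: one row per (country, indicator) pair.
long_df = pd.DataFrame({
    "Country": ["Finland", "Finland", "Finland",
                "Norway", "Norway", "Norway",
                "Chad", "Chad"],
    "Indicator": ["Happiness Score", "GDP per Capita", "Freedom"] * 2
                 + ["Happiness Score", "Freedom"],
    "Value": [7.63, 1.31, 0.68, 7.59, 1.46, 0.69, 4.30, 0.25],
})

# Columns to keep (a shortened, illustrative list).
selected = ["Happiness Score", "GDP per Capita", "Freedom"]

wide_df = (
    long_df
    .pivot(index="Country", columns="Indicator", values="Value")  # country as index
    [selected]                                                     # drop other columns
    .dropna()                                                      # drop incomplete countries
)
print(wide_df)
```

In this toy input, Chad has no GDP per Capita value, so it is removed by `dropna()` in the same way that incomplete countries were removed from the real dataset.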

Initial Observations

Screen Shot 2019-04-18 at 11 38 14 AM

  • Strong positive correlations were observed between Happiness Score (the dependent variable) and GDP per Capita, Health/Life Expectancy, and Social Support. Negative Emotions and Perception of Corruption were negatively correlated with the dependent variable
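A correlation check of this kind could look like the sketch below. The data is synthetic and the coefficients used to generate it are assumptions chosen only to produce one positive and one negative relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 123  # same number of rows as the cleaned dataset; the values are synthetic

gdp = rng.normal(size=n)
negative_emotions = rng.normal(size=n)
happiness = 0.8 * gdp - 0.5 * negative_emotions + rng.normal(scale=0.3, size=n)

df = pd.DataFrame({
    "Happiness Score": happiness,
    "GDP per Capita": gdp,
    "Negative Emotions": negative_emotions,
})

# Pearson correlation of every predictor with the target, strongest first.
corr_with_target = (
    df.corr()["Happiness Score"]
    .drop("Happiness Score")
    .sort_values(ascending=False)
)
print(corr_with_target)
```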

Screen Shot 2019-04-18 at 11 38 22 AM

Some outliers were detected for the variables Freedom, Social Support, Generosity, and Perception of Corruption. They are kept in the dataset, since eliminating them would mean excluding countries with geopolitical instability.

Where is Happiness?

I created an interactive world heatmap on Plotly using Happiness Score (access via link below):

https://plot.ly/~feiqi9047/3/

Feel free to drag/pull and hover!

The top 10 happiest and unhappiest countries are shown below based on Happiness Score:

Screen Shot 2019-04-18 at 1 34 22 PM

Screen Shot 2019-04-18 at 1 34 29 PM

Understanding the Target Variable

I wanted to see what the R² looks like with all of the original independent variables in the dataset:

  • scikit-learn gave me a baseline R² of 0.77
  • OLS gave me an R² of 0.81

Screen Shot 2019-04-18 at 2 23 45 PM

Checking for Multicollinearity

Multicollinearity was detected for GDP per Capita. However, given its high correlation with the dependent variable, I decided to keep it in the model.

Screen Shot 2019-04-18 at 2 29 06 PM

Checking for Interactions

Given all possible combinations of independent variables, I found that Social Support is a major confounding variable. I decided to split up Social Support into 3 categories: High Social Support Environment, Medium Social Support Environment, and Low Social Support Environment.

Screen Shot 2019-04-18 at 2 31 46 PM

Of the three independent variables that had the strongest interactions with Social Support, Health/Life Expectancy and Perception of Corruption exhibit clear interactions, while Negative Emotions does not. To interpret this:

  • Happiness Score is more strongly affected by Health/Life Expectancy in high Social Support environments than in low Social Support environments
  • Happiness Score is more strongly affected by Perception of Corruption in high Social Support environments than in low Social Support environments

Given the above, I decided to include the two interactions in my model.

Checking for Non-Linearity

On closer examination of the relationship between Happiness Score and each independent variable, Health/Life Expectancy and Perception of Corruption each appeared to have a non-linear relationship with Happiness Score:

Screen Shot 2019-04-18 at 1 40 56 PM

I proceeded to account for these non-linear relationships in the model using polynomial transformations.

For Health/Life Expectancy, I tested its transformation to the powers of 2, 3, and 4:

Screen Shot 2019-04-19 at 5 20 02 PM

Screen Shot 2019-04-18 at 2 42 46 PM

The best-fitting line is the degree 2 transformation. Checking the residual distribution suggested that this transformation is usable.

Repeating the same steps for Perception of Corruption:

Screen Shot 2019-04-18 at 2 46 43 PM

Screen Shot 2019-04-18 at 2 47 15 PM

Although degree 3 seemed to give the best-fitting line, the residuals for degree 2 appeared more randomly distributed. I decided to include the degree 2 transformation in my model.
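The degree comparison can be sketched with NumPy as below, on synthetic data whose true relationship is quadratic (an assumption made for illustration). In-sample R² can never decrease as the degree increases, which is why the residuals are worth inspecting before choosing a degree.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=123)
# Synthetic target with a genuinely quadratic relationship to x.
y = 4.0 + 1.0 * x + 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_by_degree = {}
residuals_by_degree = {}
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x, y, deg=degree)  # least-squares polynomial fit
    fitted = np.polyval(coeffs, x)
    r2_by_degree[degree] = r_squared(y, fitted)
    residuals_by_degree[degree] = y - fitted  # inspect these for leftover structure

for degree, r2 in r2_by_degree.items():
    print(degree, round(r2, 4))
```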

What Does my Model Look Like?

With all of the interaction terms and variable transformations included, my OLS output showed an R² of 0.86:

Screen Shot 2019-04-18 at 2 50 54 PM

This is a slight improvement over my baseline model. However, several variables in this model had insignificant p-values or counterintuitive coefficients.

Training my Model

After many iterations of including and dropping different variables, I had to make a trade-off between the highest Adj. R² and reasonable coefficients. I therefore dropped all variables with p-values > 0.05, variables whose coefficients were not explanatory, and variables already included in the model as part of an interaction term or transformed variable. The dropped variables were Generosity, Negative Emotions, Social Support, Perception of Corruption, Positive Emotions, Health/Life Expectancy to the second degree, and Perception of Corruption to the second degree.

The final model yielded an R² of 0.84 and an Adj. R² of 0.83:

Screen Shot 2019-04-18 at 3 01 09 PM
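Adjusted R² penalizes R² for the number of predictors. The sketch below applies the standard formula to the reported R² of 0.84 with n = 123 countries. The predictor count p = 6 is an assumption (it counts each retained main effect and each interaction as a single term) and is not a figure taken from the OLS output.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# n = 123 comes from the cleaned dataset; p = 6 is an assumed term count.
adj = adjusted_r2(0.84, n_obs=123, n_predictors=6)
print(round(adj, 2))
```

Under that assumption the result rounds to 0.83, which is consistent with the reported Adj. R².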

Checking Residuals

The residual vs. fitted plot shows no non-linear pattern in the residuals of my regression model.

Screen Shot 2019-04-19 at 5 22 17 PM

Although the QQ-plot of my residuals shows some deviation from normality, the deviations are not severe enough to be alarming.

Screen Shot 2019-04-19 at 5 22 40 PM

The Scale-Location / Spread-Location plot shows that my residuals are spread equally along the range of fitted values. The roughly horizontal line with randomly spread points supports the assumption of equal variance (homoscedasticity).

Screen Shot 2019-04-19 at 5 23 13 PM

The Leverage Plot shows that no values lie beyond the Cook's distance lines (all points have low Cook's distance scores), so the outliers are not influential to the regression results.

Screen Shot 2019-04-19 at 5 23 41 PM

Interpreting the Model

In my multiple-regression model, ~83.7% of the variability in the Happiness Score can be explained by the following variables:

  • For every 1 point increase in the score of GDP per Capita, Happiness Score goes up by 0.65
  • For every 1 point increase in the score of Health and Life Expectancy, Happiness Score goes down by 4.36 (This inverse relationship could be due to increased financial pressure for retirement given a longer expected life. Longer life expectancy is typically observed in developed countries with stronger economies, where, on average, employment is more competitive.)
  • For every 1 point increase in the score of Freedom, Happiness Score goes up by 2.52
  • For every 1 point increase in the score of Government Confidence, Happiness Score goes down by 1.20 (This seems counterintuitive and needs further research)
  • In high Social Support environments, Health and Life Expectancy increases Happiness Score by 5.01 more than it does in low Social Support environments
  • In high Social Support environments, Perception of Corruption lowers Happiness Score by 3.64 more than it does in low Social Support environments (A possible explanation is expectation management. People in low Social Support countries tend to perceive their governments as more corrupt, and vice versa. Their low expectations and general distrust of the government may have prevented as dramatic a decrease in overall happiness as that seen in high Social Support environments)

Validating my Model

Since I do not have 2019 data, I cannot make predictions to approximate future happiness. However, I still wanted to test if my model is robust.

Using the same variables from 2017 and 2016, I applied my model to compare the Actual Happiness Score with the "Predicted Happiness Score" for each country in those years. Then I plotted the Margin of Error (MOE) between the Actual Happiness Scores and what my model "predicted". (Note that a negative MOE indicates that I "underpredicted" the Happiness Score, whereas a positive MOE indicates that I "overpredicted" it.)
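With the sign convention described above, the MOE works out to predicted minus actual. A minimal sketch, using placeholder country names and made-up scores:

```python
import pandas as pd

# Hypothetical actual and predicted scores for one validation year.
validation = pd.DataFrame(
    {
        "Actual Happiness Score": [5.5, 3.5],
        "Predicted Happiness Score": [5.0, 4.0],
    },
    index=["Country A", "Country B"],
)

# Negative MOE: the model underpredicted. Positive MOE: it overpredicted.
validation["MOE"] = (
    validation["Predicted Happiness Score"] - validation["Actual Happiness Score"]
)

# Sort so the largest overpredictions appear first.
print(validation.sort_values("MOE", ascending=False))
```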

For 2017:

Screen Shot 2019-04-22 at 4 24 44 PM

Screen Shot 2019-04-18 at 3 23 16 PM

For 2016:

Screen Shot 2019-04-22 at 4 25 03 PM

Screen Shot 2019-04-18 at 3 23 26 PM

An interesting observation here is that countries with the least political stability, weakest economies, and lowest levels of development tend to have the highest positive MOE, whereas more developed countries tend to exhibit less variability in their Happiness Scores over 2016-2018.
