Giter Site home page Giter Site logo

dsmordasov / drink_and_study Goto Github PK

View Code? Open in Web Editor NEW
0.0 2.0 0.0 2.33 MB

An investigation into the influence of various socio-economic factors on high-school students' performance, for an 'Introduction to Machine Learning & Data Mining' course.

Python 99.28% R 0.72%

drink_and_study's Introduction

A project in which a set of socio-economic factors of high-school students in Portugal are investigated with their relation to the student's performance, potentially giving some advice to future parents who would like their children to perform well academically. Done as an assignment for the 'Introduction to Machine Learning & Data Mining' MSc course under DTU, with my classmate Mario Cesar Rodriguez.

Tl;dr: Want your child to (have a higher likelihood to) perform well in school?

  • Make sure both you and your spouse are educated (the more the better). Bonus points if you are both teachers.
  • Ensure that your child drinks alcohol as infrequently as possible (especially on workdays).
  • Situate yourself in an urban setting (i.e. close to school in this case).
  • Provide your child with access to internet.

Two reports were made, the first one focusing on exploratory data analysis (data mining), and the second one on applying machine learning algorithms to the data. The main results of these are presented below, with a more thorough write-up featured in the .pdf files in this repo.

1 - Feature extraction and visualisation

We were free to choose our own dataset for this course. I chose this one, as my previous experience with machine learning allowed me to see that the future tasks (such as linear regression) would be possible with it. Also, the subject matter does not require much field-specific subject knowledge that would prevent us from extracting insights from the data - especially as we are students ourselves and can thus make some scientific observations (it does not take a pedagogy PhD to deduce that having an alcoholic child who has to commute hours to school and can't access internet at home might, on average, have a harder time performing academically).

Taken from the investigation by Cortez and Silva [1], this investigation focuses on only 13 of the 33 available attributes of the students in the dataset [2], as only these 13 are considered by us to be within the parent's control. Categorical labels (e.g. type of parent's work) were one-hot encoded into nominal boolean variables. For PCA and ML, attributes were standardized so as to have zero mean and unit variance (μ = 0, δ = 1).

All of the attributes were discrete, and were either left as a boolean value (e.g. nominal - availability of internet at home), one-hot encoded (type of parent's education) (and based on their type (e.g. nominal - availability of internet at home, ordinal - level of parent's education, ratio - final student grade)

Correlation heatmap

Limiting ourselves to the bottom row in the heatmap above, we can see the correlations ρ between a student's final (Portuguese) grade and a certain attribute. It is clearly discernible from the magnitudes of ρ that the level of education of mother and father (Medu, Fedu), daily and weekly alcohol consumption (Dalc, Walc), proximity to school by living in the city (urban) and access to internet (online). Mathematics dataset showed similar insights, except for one...

Funny excerpt from Report 1

The above excerpt is a classic example of my technical writing. I like to think I keep my reports human, whilst maintaining professionality. You judge yourself.

Parents job boxplots

The boxplots above also show that students whose parents were teachers tended to perform better. An example of principal components discerned are 1:[workday alcohol consumption, weekday alcohol consumption, if parents are together] and 2:[if father works in healthcare, if mother works in healthcare, internet avaiability], decreasing the dimensionality of the (transformed) dataset from 27 to 14 components based on a 90% variance explained threshold.

PCA variance explained.

2 - Regression and classification

In the regression problem, an artificial neural network (ANN) with one-to-five nodes in its one hidden layer, a regularised linear regression model and a baseline model (featureless linear regression model that predicts the mean) were used. The classification task featured a similar three-model setup.

For classification purposes, the grade (out of 20) was transformed into a boolean pass/fail label based on a boundary of 10 points, a method employed in [1].

Grades distribution

The ANN with one node ended up outperforming more complex ANN models, and was itself outperformed by a linear regression model. The results overall helped hammer in the often-stated fact of machine learning that not always is a neural network necessary for solving a problem and that a simpler, more conventional machine learning algorithm may be often sufficient. The ANN for both regression and classification problems seems to have trained itself to just pick the mean of the dataset, such as the baseline model. Of course, these conclusion are affected by the fact that both my classmate and I lack experience training and optimising neural networks.

The conclusions made in '1 - Feature extraction and visualisation' have been largely confirmed by the magnitudes of the weights of the trained linear regression model.

If you got to reading it this far, you should now possess some knowledge to help make you a better future parent of the next super-student! If you already have kids, I hope it went (or is going) well.

Files

  • student-por.csv and student-mat.csv contain the dataset used in this investigation, holding various attributes and grades for either Portuguese or Maths subjects respectively.
  • 1 - Feature extraction and visualisation.pdf contains the first report for the course, which focused on feature extraction and visualisation of the dataset.
  • 2 - Classification and regression.pdf contains the final report for the course, in which classification and regression techniques are applied to the dataset.
  • report1.py and report2.py contain the 🐍Python code of these reports respectively.
  • student-merge.R contains an R script that should merge the two datasets if one would wish to, however it was found to do so erronously by Mario and I, and hence was not utilised.

Bibliography

[1] P. Cortez and A. Silva, "Using data mining to predict secondary school student perofrmance," 2008. Proceedings of 5th FUture Business TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal.

[2] https://www.kaggle.com/datasets/uciml/student-alcohol-consumption [Accessed on 05/09/2021]

drink_and_study's People

Contributors

dsmordasov avatar

Watchers

 avatar Kostas Georgiou avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.