waseemsalami / project-big-data-in-behavioral-science-

An exciting Big Data project done during a course I took at the Technion university

HTML 68.60% Jupyter Notebook 31.40%
behavioral-science big-data-analytics machine-learning nlp praw-api reddit python

project-Big-Data-in-behavioral-science-

  • To simply view the project, either open the ipynb file here on GitHub or download the HTML file and open it in any browser. To explore the code, download the ipynb file and open it in a notebook environment such as Jupyter Notebook.

About: intro

A Python data analysis and modeling project done during a Big Data course I took at the Technion as an Information Systems Engineering student. I teamed up, as the data student, with a psychology master's degree student, and together we researched a subject in behavioral science: we extracted large, raw Reddit data, processed and modeled it, and finally evaluated and visualized the results using statistical tools.

About: Theory

In the last decade, two phenomena have become very popular among various populations in the Western world. Veganism and minimalism, disseminated through movies, documentary series, and social networks, are two behavioral acts that influence not only the individual's personal behavior but also their behavior in society.

Veganism, which means that a person does not consume any animal products, has been extensively researched (Pendergrast, 2016, pp.106-122) and has a significant positive impact on the environment.

Minimalism, expressed through a preference for a life with reduced consumption and possessions (Palafox, 2021), also constitutes pro-environmental behavior.

Research focusing on pro-environmental and pro-social behavior categorizes individuals' motives for these actions into four categories (Snelgar, 2006):

  1. Altruism
  2. Biophilia - Animals
  3. Biophilia - Plants
  4. Egocentrism

In this project we want to examine what motivates people to choose a vegan diet or to minimize their consumption as much as possible. Do most people choose this lifestyle out of concern for others, concern for themselves, or concern for the environment (plants and/or animals)?

About: Practical

The project consists of 4 main phases (tasks). Each task has its own ipynb and HTML file showing the full modeling process for that task.

Task 1: Extracting the raw text of Reddit posts by querying relevant subreddits with praw

I extracted the text of Reddit posts from various subreddits (e.g. https://www.reddit.com/r/minimalism/ and https://www.reddit.com/r/vegan/) and finally chose two relevant datasets for each group (minimalism/veganism). I then deployed a first small batch of them to Amazon MTurk workers, asking them to read the posts in each dataset and rate the extent to which each post tends to be biospheric or egocentric (more about how this is measured in the ipynb files).
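The extraction step could look roughly like the sketch below. It is not the notebook's code: the function name `fetch_posts`, the choice of the `new` listing, and the returned fields are assumptions made for illustration. The function accepts any object that behaves like a `praw.Reddit` client, so it can be tried without credentials.

```python
from typing import Any, Dict, List


def fetch_posts(reddit: Any, subreddit_name: str, limit: int = 100) -> List[Dict[str, Any]]:
    """Collect text posts from one subreddit as plain dictionaries.

    `reddit` is expected to behave like a praw.Reddit instance
    (hypothetical helper, not taken from the project notebooks).
    """
    posts = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        text = (submission.selftext or "").strip()
        if not text:
            # Link or image posts have no body text, so they are skipped.
            continue
        posts.append(
            {
                "id": submission.id,
                "subreddit": subreddit_name,
                "title": submission.title,
                "text": text,
                "created_utc": submission.created_utc,
            }
        )
    return posts


# Usage with real credentials (placeholders, not real values):
# import praw
# reddit = praw.Reddit(client_id="YOUR_ID", client_secret="YOUR_SECRET",
#                      user_agent="behavioral-science-project")
# vegan_posts = fetch_posts(reddit, "vegan", limit=500)
```

Keeping the result as a list of dictionaries makes it easy to load into a pandas DataFrame or write to CSV before sending a batch to MTurk.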

  • We used the MTurk ratings to decide each post's label: 2 for biospheric, 1 for egocentric and 0 for undecided (later converted to binary: 1 for Bio, 0 for Ego). In other words, the MTurk ratings served as our labels, which gave us our supervised data. The steps were:
  1. Searched for relevant questions to ask the MTurk workers

  2. Built a new MTurk project and designed the survey's format

  3. Found and fixed a problem in our project plan: we first chose to build 4 models for each label, but then found a way to generalize our modeling, which took us from 4 models down to 2

  4. Created a big set of questions and chose the most unbiased and straightforward ones

  5. Finally deployed our chosen datasets for each of the groups (minimalism/veganism) with the set of questions, letting MTurk workers label the posts in each dataset, indicating whether they think a post reflects a vegan or minimalist person, by reading 5 behavioral questions and rating each from 1 to 5

Task 2(batch_analysis):

Retrieved the labeled data from the MTurk workers and analyzed the batch and its labeling using different statistical and inter-judge agreement approaches. Finally, after adjusting the questions we found less relevant, we sent the bigger batch to MTurk with an updated list of questions, hence part b of task 2: Big Batch Analysis.
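The text does not say which agreement measure was used, so the following is only an illustration of one common choice, Fleiss' kappa, written with the standard library. It assumes every post was rated by the same number of workers; the function name and input layout are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings.

    `ratings[i]` holds the ratings given to item i, one per judge.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Number of ordered judge pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Kappa is undefined when only one category is ever used;
        # this sketch reports full agreement in that case.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement between workers, values near 0 indicate agreement no better than chance. Questions with low agreement would be natural candidates for the adjustments mentioned above.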

Task 3(Big_batch_Analysis): here I conduct a thorough analysis of the received labeled data:

On the batch I performed statistical analysis, labeling (deciding on a labeling method for the posts based on the scores of the MTurk answers), sentiment analysis, text analysis, statistical tests on the labeling results, and more (found in 3-Big_batch_analysis_mixed(task3).ipynb).
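The actual labeling rule lives in the notebook. As a rough illustration of how 1-to-5 ratings could be turned into the 2/1/0 labels and then into binary labels, here is a minimal sketch. The comparison of mean scores, the `margin` threshold, and the decision to leave undecided posts without a binary label are all assumptions, not the project's method.

```python
from statistics import mean
from typing import Optional, Sequence

BIOSPHERIC, EGOCENTRIC, UNDECIDED = 2, 1, 0


def label_post(bio_ratings: Sequence[float], ego_ratings: Sequence[float],
               margin: float = 0.5) -> int:
    """Label a post from its 1-5 ratings (hypothetical rule).

    A post is biospheric or egocentric only if one mean score beats
    the other by at least `margin`; otherwise it stays undecided.
    """
    bio, ego = mean(bio_ratings), mean(ego_ratings)
    if bio - ego >= margin:
        return BIOSPHERIC
    if ego - bio >= margin:
        return EGOCENTRIC
    return UNDECIDED


def to_binary(label: int) -> Optional[int]:
    """Map 2 -> 1 (Bio) and 1 -> 0 (Ego); undecided posts map to None (assumption)."""
    return {BIOSPHERIC: 1, EGOCENTRIC: 0}.get(label)
```

For example, `to_binary(label_post([5, 4, 5], [1, 2, 1]))` yields the Bio label, while a post with equal mean scores stays undecided and gets no binary label.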

Task 4-A (4-extracing_unlabled_data(task4).ipynb): extracting unlabeled data - time to dive into the world of the unknown:

Extracted a big batch of posts for each of the groups (veganism and minimalism).

Task 4-B (5-modeling(task4).ipynb) : Modeling

Converted the labeled batch to binary labels and split the labeled data into 2 datasets for further modeling: minimalism data and veganism data. Each is then split into train/test sets for training our models.

The main steps (shown in 5-modeling(task4).ipynb) are:

  1. Feature Creation: the top 90%+ most common words of each dataset, plus the compound score obtained from the earlier sentiment analysis.
  2. Feature Selection: applied a filter method, a wrapper method and an embedded method to the feature list from the previous step and selected 4 features; 4 seemed reasonable given that our datasets are not that big.
  3. Model Selection: performed with the LOO (Leave One Out) method on the following models: Nearest Neighbors, Logistic Regression, Decision Tree and Random Forest. The models with the best AUC-ROC value on the subset of the train set are:

for veganism: RandomForestClassifier with features chosen by the embedded method: ['anim', 'motiv', 'want', 'compound']

for minimalism: RandomForestClassifier with features chosen by the wrapper method using an RFE model: ['earth', 'environment', 'wast', 'compound']

  4. Evaluating the chosen model on the test set: predict_proba, Leave One Out cross validation.
  5. Reading the unlabeled data
  6. Creating features for the unlabeled data based on the features selected from the labeled data
  7. Predicting the labels of the unlabeled datasets
  8. Statistical analysis
  • Results are shown in the final ipynb file (modeling), in the 8th step (statistical analysis).
  • A summary of the results can also be found in the pptx presentation.
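To make the Leave One Out model selection step concrete, here is a small standard-library sketch. It swaps in a tiny k-nearest-neighbours scorer for the notebook's scikit-learn models so that it runs without extra dependencies; the function names and the toy data are hypothetical. Each sample is scored by a model fitted on all the other samples, and the collected scores are summarized with AUC-ROC.

```python
from math import dist
from typing import Callable, List, Sequence

Vector = Sequence[float]


def knn_score(train_X: List[Vector], train_y: List[int], x: Vector, k: int = 3) -> float:
    """Fraction of the k nearest training points that are labeled 1."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def leave_one_out_scores(X: List[Vector], y: List[int],
                         scorer: Callable[..., float] = knn_score) -> List[float]:
    """Score every sample using only the remaining samples as training data."""
    scores = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        scores.append(scorer(train_X, train_y, X[i]))
    return scores


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a positive sample outranks a negative one (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: two well-separated groups with two made-up features each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [1.0, 0.9], [0.8, 1.1], [0.9, 0.7], [1.1, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
auc = roc_auc(y, leave_one_out_scores(X, y))
```

Repeating this for each candidate model and feature subset, then keeping the combination with the highest AUC, mirrors the selection described in step 3. With scikit-learn, `LeaveOneOut`, `cross_val_predict` and `roc_auc_score` cover the same ground.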
