adaepfl-final's Introduction

ADA Final Exam

Deadline

Thursday February 02, 2017 at 11:15AM

Important Notes

  • Make sure you upload your IPython Notebook with this form at the end of the exam, with all the cells already evaluated.
  • Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
  • Please write all your comments in English, and use meaningful variable names in your code.
  • As we have seen during the semester, data science is all about multiple iterations on the same dataset. Do not obsess over small details at the beginning, and try to complete as many tasks as possible during the first 2 hours. Then go back to the results you obtained, write meaningful comments, and debug your code if you have found any glaring mistakes.
  • Remember, this is not a homework assignment -- no teamwork allowed!

Goal

Today you will wear the hat of a data scientist who studies the social media presence of the two major Swiss universities: EPFL and ETHZ. You will be given multiple tasks, ranging from data analysis to machine learning, all aimed at spotting key differences between the two universities.

Data Description

In this repository you can find two .json files containing the full Twitter history of the EPFL_en and ETH_en accounts. On the Twitter developers site you can read a full description of the Tweet objects contained in the files. We recommend that you read the documentation carefully to understand how useful each attribute could be for the assigned tasks. Load the two files into Pandas dataframes, and then generate two additional dataframes filtered by id % 10 == LAST_DIGIT_OF_YOUR_SCIPER_NUMBER. Whenever asked, perform the task on both the full dataframes and the downsampled ones, and discuss the impact of the downsampling (if any).
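The loading and downsampling step could be sketched as follows. This is only an illustration under stated assumptions: the helper names load_tweets and downsample are hypothetical, the file name in the comment is made up, and the files are assumed to hold a single JSON array of Tweet objects. The small in-memory dataframe stands in for the real data so that the sketch runs on its own.

```python
import pandas as pd


def load_tweets(path: str) -> pd.DataFrame:
    """Load a file of Tweet objects into a dataframe.

    Assumption: the file holds one JSON array. If it stores one object
    per line instead, pd.read_json(path, lines=True) is the variant to use.
    """
    return pd.read_json(path)


def downsample(tweets: pd.DataFrame, last_digit: int) -> pd.DataFrame:
    """Keep only the rows whose Tweet id ends with the given digit."""
    if not 0 <= last_digit <= 9:
        raise ValueError("last_digit must be between 0 and 9")
    return tweets[tweets["id"] % 10 == last_digit].copy()


# With the real data this would look like (hypothetical file name):
# epfl = load_tweets("epfl_en.json")

# Tiny made-up stand-in for the real files.
sample = pd.DataFrame(
    {"id": [101, 112, 123, 133, 203], "retweet_count": [0, 4, 1, 7, 2]}
)
sample_small = downsample(sample, last_digit=3)
print(sample_small["id"].tolist())  # ids ending in 3
```

Because the filter depends only on the last digit of the id, the downsampled dataframe should contain roughly one tenth of the tweets, which is worth keeping in mind when comparing results between the two versions.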

Tasks

  1. Perform data wrangling as you see fit on both the full and downsampled dataframes, justifying your choices.

  2. By means of descriptive statistics and plots, show the differences in the volume of engagement (e.g., number of favorites and retweets) that the accounts generate. Compute the results per year (to highlight the growth trends), per month (to figure out if the accounts follow the academic year), and per hour of the day (to find out if tweets posted at a certain hour get more attention). Similarly, break down the results per hashtag (e.g., #EPFLisAwesome) -- are there hashtags that are used more often than others, and that obtain more engagement than others?

  3. Train a regressor (both on the full and downsampled dataframes) to predict how many retweets a certain Tweet will get. You are allowed to use as features only the attributes in the JSON objects (and any derivative that you can build locally) -- you are not allowed to download additional data from the Internet to boost your model. Discuss the obtained results, explain the performance on the downsampled dataframes, and briefly describe what additional features you would have used if you had access to the full Twitter API.

HINT: for a more powerful model, consider time (and how the audience of the accounts grew throughout the years...)
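One possible way to act on this hint is sketched below. It is not a reference solution: the feature set, the build_features helper, and the choice of a random forest are assumptions made for illustration, and the data is synthetic (retweet counts that grow over time plus noise). The column names created_at, text, and retweet_count follow the Tweet object attributes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def build_features(tweets: pd.DataFrame) -> pd.DataFrame:
    """Derive numeric features from attributes present in the Tweet objects."""
    created = pd.to_datetime(tweets["created_at"], utc=True)
    return pd.DataFrame(
        {
            # Fractional year: a rough proxy for audience growth over time.
            "year_fraction": created.dt.year + (created.dt.dayofyear - 1) / 366.0,
            "month": created.dt.month,
            "hour": created.dt.hour,
            "text_length": tweets["text"].str.len(),
            "n_hashtags": tweets["text"].str.count("#"),
        },
        index=tweets.index,
    )


# Synthetic stand-in for the real dataframe (all values are made up).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame(
    {
        "created_at": pd.date_range("2012-01-01", periods=n, freq="3D"),
        "text": ["#EPFL news " + "x" * int(k) for k in rng.integers(0, 100, n)],
    }
)
demo["retweet_count"] = rng.poisson(np.linspace(1, 20, n))

X = build_features(demo)
y = demo["retweet_count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
# Naive baseline: always predict the mean retweet count of the training set.
baseline_mae = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"MAE: {mae:.2f} (mean baseline: {baseline_mae:.2f})")
```

The split here is random for brevity. A chronological split (train on older tweets, test on newer ones) is arguably closer to how such a model would be used, and comparing against a naive baseline helps judge whether the time features add anything.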

  4. Find the topics that are covered most of the time by the two Twitter accounts (both on the full and downsampled dataframes). You can run topic modeling and/or implement your own NLP pipeline. Do the topics change significantly over time? Is there an overlap with the hashtags used in the tweets?

HINT: clustering Tweets by some features (e.g., hashtags) will give you better results with topic modeling.
