- February 2022 Exam Session: Final Grades
Final grades are available at this link.
- February 2022 Exam Session: Project Presentation Schedule
Presentations of the projects that have been accepted for oral discussion will take place remotely via Google Meet on February 8, 2022, at 10:00 a.m. CET, using the link indicated in the message sent on the Moodle forum. Everyone is welcome to join!
- February 2022 Exam Session
Registrations for the February 2022 exam session are open on Infostud (id 793404) and will remain open until February 4, 2022. The project submission week opens on January 29, 2022 at 12:00 a.m. CET (Central European Time) and closes on February 4, 2022 at 11:59 p.m. CET.
(Please see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2021-22.)
- Students who are planning to submit their projects after the January 2022 session should refer to the Big Data Computing 2021-22 Moodle page, rather than the current one (i.e., Big Data Computing 2020-21). This is to align exam sessions with the correct academic year, since academic year 2020-21 formally ends on January 31, 2022. As such, from February 2022 until January 2023, all the exam sessions will be displayed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding Project Submission Week, which will be opened in due course, as usual. For example, the upcoming February Submission Week is available at the following link.
(NOTE: Only students who expect to complete the exam in one of the upcoming 2021-22 sessions must subscribe to the Big Data Computing 2021-22 Moodle page!)
Welcome to the Big Data Computing class!
This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.
This repository contains class material along with any useful information for the 2021-2022 academic year.
- Tuesday from 5:00 p.m. to 7:00 p.m.
- Wednesday from 8:00 a.m. to 11:00 a.m.
According to the guidelines provided by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.
Students who wish to attend classes in person must submit their request through the Infostud Lab App or the Prodigit Sapienza online booking system, according to the established rules (please see here). Once the booking is confirmed, and according to the class schedule above, students must go to Room 1L, located in Via del Castro Laurenziano 7a.
Students who wish to attend classes remotely must register for the dedicated Zoom conference, using the following link: https://uniroma1.zoom.us/meeting/register/tZAkdOysqjkiG9SU5I1rG-oENGV-RIfCxLwv
Students must subscribe to the Moodle web page at the following link, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services: https://elearning.uniroma1.it/course/view.php?id=14454
- Email: [email protected]
- Website: https://www.di.uniroma1.it/~tolomei
- Bacheca Sapienza: https://corsidilaurea.uniroma1.it/it/users/gabrieletolomeiuniroma1it
Please drop me a message at [email protected] if you would like to schedule a meeting, either online (i.e., via Google Meet or Zoom) or in person (i.e., in Room 106, located on the 1st floor of Building E in Viale Regina Elena 295).
The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges on how to deal with such data, as traditional computing paradigms were not designed to operate at this scale.
"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.
This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry-standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.
The course assumes that students are familiar with the basics of data analysis and machine learning, properly supported by a strong knowledge of foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably in the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.
Students must demonstrate their understanding of the subject by developing a software project that leverages the methodologies and tools introduced during classes. Projects must address typical Big Data tasks (e.g., clustering, prediction, or recommendation, just to name a few) using very large datasets in any application domain of interest.
In any case, the topic of the project must first be agreed upon with the teacher through a proposal, which must be sent at least one month before the targeted project submission deadline. NOTE: Only projects that have been approved will be considered for grading!
Sources of interesting project ideas (e.g., Kaggle) will be suggested throughout the course. However, I strongly encourage you to come up with your own original ideas, as creativity will be very much appreciated.
Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, but those can be answered either in English or in Italian, as the student prefers.
A document containing the main guidelines for the final project will be made available soon. Please, stay tuned!
No textbooks are mandatory to successfully follow this course. However, there is a large set of references worth mentioning, especially for those who want to dig deeper into specific topics. Among those, some readings I would like to suggest are as follows:
- Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
- Big Data Analysis with Python [Marin, Shukla, VK]
- Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
- Spark: The Definitive Guide [Chambers, Zaharia]
- Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
- Hadoop: The Definitive Guide [White]
- Python for Data Analysis [McKinney]
Introduction
- The Big Data Phenomenon
- The Big Data Infrastructure
- Distributed File Systems (HDFS)
- MapReduce (Hadoop)
- Spark
- PySpark + Databricks
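As a rough illustration of the MapReduce programming model listed above, here is a minimal word-count sketch in pure Python. It does not use Hadoop or Spark, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example; it only mimics the three conceptual phases on a single machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data is big", "data beats opinions"]
    print(word_count(docs))
```

In a real cluster, the map and reduce functions would run in parallel on different machines, and the shuffle step would move data across the network; here everything runs sequentially for clarity.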
Unsupervised Learning: Clustering
- Similarity Measures
- Algorithms: K-means
- Example: Document Clustering
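To give a flavor of the K-means algorithm listed above, the following is a minimal single-machine sketch in pure Python, assuming points are tuples of numbers and squared Euclidean distance is the (dis)similarity measure. The function names and the toy data are hypothetical, and this is not the distributed implementation used with Spark.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Alternate assignment and update steps until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)
    print(clusters)
```

For document clustering, each point would typically be a vector representation of a document (e.g., TF-IDF weights), and a different similarity measure such as cosine similarity may be more appropriate.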
Dimensionality Reduction
- Feature Extraction
- Algorithms: Principal Component Analysis (PCA)
- Example: PCA + Handwritten Digit Recognition
Supervised Learning
- Basics of Machine Learning
- Regression/Classification
- Algorithms: Linear Regression/Logistic Regression/Random Forest
- Examples:
- Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
- Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)
Recommender Systems
- Content-based vs. Collaborative filtering
- Algorithms: k-NN, Matrix Factorization (MF)
- Example: Movie Recommender System (MovieLens)
Graph Analysis
- Link Analysis
- Algorithms: PageRank
- Example: Ranking (a sample of) the Google Web Graph
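As a small-scale illustration of the PageRank algorithm listed above, here is a minimal power-iteration sketch in pure Python. It assumes the graph is given as a dictionary mapping each node to the list of nodes it links to; the damping factor of 0.85, the tolerance, and the toy graph are assumptions chosen for this example, not values taken from the course material.

```python
def pagerank(graph, damping=0.85, tol=1e-9, max_iter=200):
    """Compute PageRank scores by power iteration.

    graph: dict mapping each node to the list of nodes it links to.
    Nodes without out-links (dangling nodes) spread their rank uniformly.
    """
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Total rank currently held by dangling nodes.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        # Teleportation term plus the uniformly redistributed dangling rank.
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes an equal share of its rank along its out-links.
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        converged = sum(abs(new_rank[node] - rank[node]) for node in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


if __name__ == "__main__":
    toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy_graph).items()):
        print(node, round(score, 4))
```

On a graph the size of the Web, the same iteration is expressed as repeated joins and aggregations over distributed collections of links and ranks, rather than as loops over an in-memory dictionary.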
In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Google Colaboratory (or "Colab" for short). This will allow you to write and execute PySpark code (as well as pure Python code, for that matter) in your browser, with:
- Zero configuration required;
- Free access to Google's powerful cloud infrastructure (including GPUs);
- Easy sharing.
Of course, the same can also be achieved on your own local machine, but that would require: (i) dealing with clumsy installation issues that are very specific to your platform, and (ii) sticking to "small" rather than real "big" data, as your machine cannot compare with Google's infrastructure!
Optionally, you may also want to install PySpark on your own local machine.
(NOTE: This step is not required for passing this class)
If you would like to install and configure PySpark on your local machine as well, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please see the official PySpark documentation for the most up-to-date installation instructions.
Lecture # | Date | Topic | Material |
---|---|---|---|
Lecture 1 | 02/22/2022 | Introduction to Big Data: Motivations and Challenges | [slides: PDF] |
Lecture 2 | 02/23/2022 | MapReduce Programming Model | [slides: PDF] |
Lecture 3 | 03/01/2022 | Apache Spark | [slides: PDF] |
Lecture 4 | 03/02/2022 | PySpark Tutorial | [notebook: ipynb] |
In the following, you can quickly navigate through Big Data Computing class information and material from previous years.
NOTE: There is a single folder containing the class material, and it is subject to changes and/or updates; as such, there may be differences between the content displayed on this website and what has been shown in class in the past.