cmsc33581's Introduction

CMSC33581: Topics in Big Data

"The Role of Approximation, Randomization, Distributions in Data Analytics"

We are collecting and storing more data than ever, and in many domains data is growing faster than the available computational resources. This course starts from the premise that expecting exact computation is unsustainable and that approximation is increasingly necessary. We will dive deep into both the theory and the practice of approximate, randomized, and encoded data structures.

CMSC Graduate Course Designation: Data Science Elective

Grading: Students will be graded on participation (25% of the grade) and a final course project (75% of the grade).

Prerequisites: An undergraduate-level course in Statistics/Probability and an undergraduate-level course in Database Systems/Data Engineering.

Course Topics

Unit 1. Information Theory for Data Analytics

  • Probability distributions, entropy, conditional entropy, joint entropy, Shannon inequalities
  • Lossless data compression: dictionary, Huffman, and arithmetic coding (and how they relate to entropy)
  • Data corruption: error coding, noisy channel model, channel capacity
  • Worst-case optimal joins as information theory: what Shearer's inequality tells us about join performance (and why!)
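
As a small illustration of how Huffman coding relates to entropy, the sketch below computes the Shannon entropy of a discrete distribution and the codeword lengths a Huffman code would assign to it. The function names and the example distribution are illustrative assumptions, not course material. For a distribution whose probabilities are all powers of 1/2, the average Huffman code length equals the entropy exactly; in general it lies within one bit above it.

```python
import heapq
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_code_lengths(probs):
    """Return the Huffman codeword length for each symbol index."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, tie breaker, symbol indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths


probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical example distribution
lengths = huffman_code_lengths(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(shannon_entropy(probs), lengths, avg_len)  # 1.75 [1, 2, 3, 3] 1.75
```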

Unit 2. Estimation

  • Sampling statistics, independence, exchangeability, limits of distributions, and concentration inequalities
  • Stratified sampling, optimal sample allocation, and other advanced sampling ideas
  • Histograms and discretization
  • Sketches, distinct count, frequency moments, and heavy hitters
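
To make the heavy-hitters topic concrete, here is a minimal sketch of the Misra-Gries summary, one standard technique for this problem (the syllabus does not say which algorithms the course covers, so treat the choice as an assumption). With at most k - 1 counters, every item occurring more than n/k times in a stream of length n is guaranteed to survive in the summary, and each reported count underestimates the true count by at most n/k.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream in which "a" occurs 6 times out of 10.
stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"]
summary = misra_gries(stream, k=2)
print(summary)
```

The counts in the summary are lower bounds, so a second pass over the data is needed when exact frequencies of the candidates matter.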

Unit 3. Natural Data Distributions

  • Species estimation and rare items
  • Benford's law and multi-scale data
  • Preferential attachment, power laws, and urn processes
  • Rogue waves and other models for outliers
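
Benford's law predicts that the leading digit d of multi-scale data occurs with probability log10(1 + 1/d). The sketch below compares that prediction with the leading digits of the first 1000 powers of 2, a sequence known to follow the law closely. The helper names and the choice of dataset are assumptions made for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Benford probability of each leading digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a nonzero integer."""
    return int(str(abs(x))[0])


def leading_digit_frequencies(values):
    """Empirical leading-digit frequencies, ignoring zeros."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


expected = benford_expected()
observed = leading_digit_frequencies(2 ** n for n in range(1, 1001))
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```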

Unit 4. Randomized Data Structures

  • Johnson-Lindenstrauss and Locality Sensitive Hashing
  • Understanding metric spaces and their embeddings
  • MinHash algorithm and its analysis
  • Skip Lists and their analysis
  • Bloom Filters
  • Differential privacy
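
As one example of a randomized data structure from this unit, the following is a minimal Bloom filter sketch. It assumes hash positions derived by salting SHA-256 with the hash index and stores one byte per bit for readability; neither choice is prescribed by the course, and a production filter would use a packed bit array and faster hashes. The structure never reports a false negative, but it may report false positives.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["entropy", "sketch", "minhash"]:
    bf.add(word)
print("sketch" in bf)  # True: inserted items are always found
```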

Project Ideas

  • Approximate fault tolerance under communication budget
  • Optimal data placement to avoid correlated failures
  • Deterministic non-uniform sampling with hashing
  • Evaluate Benford's law on data in GitHub
  • Learned sketches and learned species estimation
  • Optimal promotion probability for skip lists

cmsc33581's People

Contributors

barbarioli, sjyk
