STA 380: Predictive Modeling

Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Office hours

I will hold office hours on Tuesdays and Thursdays, 3:20 to 4:30 PM, in CBA 6.478.

Exercises

The first set of exercises is available here. These are due Friday, August 10th at 5 PM.

The second set of exercises is available here. These are due Monday, August 20th at 5 PM.

Outline of topics

(0) The data scientist's toolbox

Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

(1) Exploratory analysis

Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation)
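
As a quick illustration of two of the association measures listed above, here is a minimal sketch that computes relative risk and the odds ratio from a 2x2 contingency table. It is written in Python rather than R, and the counts are hypothetical numbers chosen only for illustration; they do not come from any course data set.

```python
def relative_risk(a, b, c, d):
    """Risk of the event in group 1 divided by the risk in group 2.

    2x2 table layout (an assumed convention for this sketch):
        group 1: a events, b non-events
        group 2: c events, d non-events
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of the event in group 1 divided by the odds in group 2."""
    return (a * d) / (b * c)


# Hypothetical counts: 30 of 100 in group 1 and 10 of 100 in group 2 have the event.
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
oratio = odds_ratio(a, b, c, d)
print(f"relative risk = {rr:.3f}, odds ratio = {oratio:.3f}")
```

The two measures agree closely only when the event is rare in both groups; here the event is fairly common, so the odds ratio is noticeably larger than the relative risk.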

Some (optional) software walkthroughs:

Survival on the Titanic: summarizing variation in categorical variables.
City temperatures: measuring and visualizing dispersion in one numerical variable.
Test scores and GPA for UT grads: association between numerical and categorical variables.

Readings:

Excerpts from my course notes on data science. We'll look at some example graphics in Chapter 1.
Another interesting (if aesthetically dated) reference is the NIST Handbook, Chapter 1.
Bad graphics
Good graphics: scan through some of the New York Times' best data visualizations. There is lots of good stuff here, but for our purposes the best things to look at are those in the "Data Visualizations" section, about 60% of the way down the page. Control-F for "Data Visualization" and you'll find it. Here are three examples:

(2) Foundations of probability

Basic probability, and some fun examples. Joint, marginal, and conditional probability. Law of total probability. Bayes' rule. Independence.
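
To make the law of total probability and Bayes' rule concrete, here is a small Python sketch for a diagnostic-test setting. The prior, sensitivity, and false-positive rate are hypothetical values picked for illustration, not figures from the readings.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule.

    The denominator is the law of total probability:
    P(+) = P(+ | condition) P(condition) + P(+ | no condition) P(no condition)
    """
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false-positive rate.
post = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {post:.3f}")
```

Even with a fairly accurate test, the posterior stays well below one half, because the low prior means most positive results come from the much larger group without the condition.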

Readings:

Chapter 1 of these course notes. There's a lot more technical stuff in here, but Chapter 1 really covers the basics.
In class, we will look at some pictures and tables from this packet of course notes.

Optional but interesting:

Bayes and the search for Air France 447.
YouTube video on Bayes and the USS Scorpion.

(3) Resampling methods

The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).
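
The two bootstrap ideas above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the course's R scripts: the data values, the daily returns, the 20-day horizon, and the 5% level are all hypothetical choices, and the function names are made up for this example.

```python
import random
import statistics


def bootstrap_means(data, n_boot=2000, seed=0):
    """Resample the data with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_var(returns, horizon=20, wealth=10000.0, alpha=0.05,
                  n_boot=2000, seed=0):
    """Approximate value at risk by resampling daily returns.

    Each simulated path draws `horizon` daily returns with replacement and
    compounds them. VaR at level alpha is taken as the (1 - alpha) quantile
    of the simulated losses.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_boot):
        w = wealth
        for r in rng.choices(returns, k=horizon):
            w *= 1.0 + r
        losses.append(wealth - w)
    losses.sort()
    return losses[int((1.0 - alpha) * n_boot) - 1]


# Hypothetical sample: bootstrap standard error and a 95% percentile interval.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]
boot = sorted(bootstrap_means(data))
se = statistics.stdev(boot)
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot)) - 1]
print(f"bootstrap SE = {se:.3f}, 95% interval = ({lo:.2f}, {hi:.2f})")

# Hypothetical daily returns for a single-asset portfolio.
returns = [0.012, -0.008, 0.004, -0.021, 0.015, 0.002, -0.011, 0.009, -0.003, 0.007]
var_5 = bootstrap_var(returns)
print(f"20-day 5% VaR on $10,000: {var_5:.2f}")
```

Resampling single days independently ignores any dependence between days; for a portfolio of several assets, one would resample whole days (rows) at a time so that the joint distribution across assets is preserved.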

Scripts:

Readings:

ISL Section 5.2 for a basic overview.
These notes on bootstrapping and the permutation test.
Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.

Shalizi (Chapter 6) also has a much lengthier treatment of the bootstrap, should you wish to consult it.

If time:

An R walkthrough on an introduction to hypothesis testing.
Another R walkthrough on the permutation test in a simple 2x2 table.

(4) Clustering

Basics of clustering; K-means clustering; hierarchical clustering.

Scripts and data:

Readings:

ISL Sections 10.1 and 10.3 or Elements Chapter 14.3 (more advanced)
K-means examples: a few stylized examples to build your intuition for how k-means behaves.
Hierarchical clustering notes: some slides on hierarchical clustering.
K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
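
For readers who want to see the k-means++ initialization idea in code, here is a minimal Python sketch (standard library only). It picks the first center uniformly at random and each later center with probability proportional to its squared distance from the nearest center chosen so far. The function names and the toy points are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, seed=0):
    """Choose k initial centers from `points` using k-means++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest existing center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Far-away points are more likely to be picked; already-chosen
        # points have weight zero and cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Toy data: three loose groups in the plane.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (10.0, 0.0), (10.2, 0.1), (9.9, -0.2)]
centers = kmeanspp_init(points, k=3)
print(centers)
```

The returned centers would then be handed to the usual k-means iterations in place of a purely random start.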

(5) Latent features and structure

Principal component analysis (PCA).

Scripts and data:

pca_intro.R
congress109.R, congress109.csv, and congress109members.csv
FXmonthly.R, FXmonthly.csv, and currency_codes.txt

If time:

gasoline.R and gasoline.csv
cca_intro.R, mmreg.csv, and mouse_nutrition.csv

Readings:

ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.

(6) Networks and Association Rules

Networks and association rule mining.

Scripts and data:

medici.R and medici.txt
playlists.R and playlists.csv

Readings:

Miscellaneous:

Gephi, a great piece of software for exploring graphs.
The Gephi quick-start tutorial.
A little Python utility for scraping Spotify playlists.

(7) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
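
As a small illustration of TF-IDF weighting, here is a Python sketch using only the standard library. TF-IDF has several common variants; this one assumes term frequency is the within-document relative frequency and inverse document frequency is log(N / df), and the three tiny tokenized documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of documents, each already tokenized into a list of terms.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus.
docs = [["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "purred"]]
w = tf_idf(docs)
for doc_weights in w:
    print({t: round(x, 3) for t, x in doc_weights.items()})
```

A term that appears in every document ("the" here) gets weight zero, while terms that are concentrated in a few documents are weighted up.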

Scripts and data:

Readings:

Intro slides on text
Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
Great blog post about word vectors.
Using the tm package for text mining in R.
Dave Blei's survey of topic models.
A pretty long blog post on naive-Bayes classification.
