This repo is my study resource, where I attempt to code everything in the lectures to understand it. It has worked for me in the past, so I'm going to try again.
This section will go over all of the ML algorithms coded from scratch.
The decision tree algorithm is located in the /dtrees folder.
The above visualization demonstrates how the algorithm works at a high level.
Each node that is not a leaf node has a split condition.
These split conditions split the dataset on a single feature based on a single bound.
These splits are determined using the information gain ratio (IGR).
To define IGR we first need to define entropy which is,
Where,
So what this is saying intuitively is that information gain (IG) is the difference between the entropy of the event over the dataset and the conditional entropy of the event given that another event has occurred.
Let's work through the example below,
First we calculate,
Next we calculate,
I wrote code in utilities/entropy that calculates these results.
I just want to make a few notes: the notation does not account for conditional probability situations.
This is annoying because, as we can see from the equations above, conditional entropy is defined differently than regular entropy.
To get the result you have to apply Bayes' rule to rewrite the conditional inside the log as the joint probability divided by the probability of the conditioning event.
We can now finally define the IGR we discussed earlier.
Finally we have a formal definition of IGR.
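To make this concrete, here is a minimal sketch (separate from the utilities/entropy code, with made-up data) that computes entropy, the weighted conditional entropy of a candidate split, the information gain, and the gain ratio. The normalization of IG by the entropy of the split itself follows the standard gain-ratio definition, which I believe matches the lectures.

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum_y P(y) * log2 P(y)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain_ratio(labels, feature, threshold):
    # Split the data on a single feature and a single bound.
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    p_left = len(left) / len(labels)
    p_right = 1.0 - p_left
    # Conditional entropy of the labels given the split is the weighted entropy of each side.
    cond_entropy = p_left * entropy(left) + p_right * entropy(right)
    info_gain = entropy(labels) - cond_entropy
    # The gain ratio normalizes IG by the entropy of the split itself.
    split_entropy = entropy(feature <= threshold)
    return info_gain / split_entropy if split_entropy > 0 else 0.0

# Toy example: labels and a single numeric feature.
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array([1.0, 2.0, 7.0, 8.0, 9.0, 1.5])
print(information_gain_ratio(y, x, threshold=5.0))
```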
Now that we have IGR we can walk through the decision tree algorithm.
Candidate splits are scored by the IGR obtained when conditioning on that split point, and the highest-scoring split is chosen.
The stopping criterion is when the IGR is zero or when the set of training instances is empty.
A leaf node assigns a label once a test sample is run through the dtree.
Finding the best split just means taking the candidate split with the maximal IGR we determined above; see dtrees/decision_trees.py.
For the final note on dtrees, let us describe the inductive bias. The hypothesis space bias is that the decision tree algorithm favors trees with single-feature, axis-parallel splits. What this means is that the algorithm tends toward models whose splits each look at a single dimension: we can think of dtrees as dividing up the space with straight lines, where each line is defined along only a single dimension. The preference bias is toward small trees identified by greedy search. This means that the dtree algorithm will tend to build smaller trees using a greedy search.
One of the simpler algorithms, the nearest neighbor algorithm classifies a point based on the point it is closest to in the dataset.
Below depicts a
As we can see, the regions around each training point determine which new points will share that training point's label.
A little confusing, but the idea here is that a point is classified based on the closest point (when
When
Excuse the lack of an indicator function; I do not know how to write it in markup. However, the equation above formalizes the idea of prediction.
Standardization is another topic that applies to multiple algorithms here, but the lecture slides introduce the concept when discussing kNN. To discuss the process, we first need to formalize the mean and standard deviation. Mean
Standard Deviation
Above is explicitly not in the notes, however I do believe the
The standardization formula can now be written as:
$$
\hat{x}_a^j = \frac{x_a^j - \mu_a}{\sigma_a}
$$
I implemented standardization in the file utilities/standardization.py.
I show that the sklearn standardization gets the same result as my simple implementation.
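As a quick illustration (separate from utilities/standardization.py), here is a minimal sketch of the formula above, checked against sklearn's StandardScaler; the toy matrix is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: rows are samples, columns are features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Standardize each feature: subtract its mean, divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_hat = (X - mu) / sigma

# sklearn's StandardScaler uses the same (population) standard deviation by default.
print(np.allclose(X_hat, StandardScaler().fit_transform(X)))  # True
```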
I implemented a basic kNN algorithm in the file nearest_neighbors/nn_alg.py.
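For reference, here is a minimal 1-NN sketch (not the nn_alg.py implementation) that classifies a query point by the label of its closest training point under Euclidean distance; the arrays are made up.

```python
import numpy as np

def nn_predict(X_train, y_train, x_query):
    # Distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # The prediction is the label of the single closest training point.
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nn_predict(X_train, y_train, np.array([4.0, 4.5])))  # closest to [5, 5] -> 1
```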
I have not implemented fancy nearest neighbors yet.
I think I eventually will.
There still remains the question: how do you deal with irrelevant features?
Above is an example of how irrelevant features can affect the distances.
This is just a problem with kNN models in general.
They are weak to irrelevant features.
Strengths
- Easy to explain predictions
- Simple to implement
- No training
- Often a good solution in practice
Weaknesses
- Sensitive to irrelevant features (see image above)
- Prediction stage can be expensive
- No "model" to interpret
The Inductive Bias is defined as the assumptions a learner uses to predict a label for instances it has not seen.
Linear regression is the concept of finding a function
For the setup, I am going to be referencing the slide below.
We are given the usual setup, real-numbered training data with
Where
Notice how we can actually represent
The first regression type on our list is ridge regression. Ridge regression has a loss function defined below.
When we solve for
Notice the
The natural question then becomes, how should one pick
Ridge regression's main goal, as stated in the slides, is to prevent large weights. Making the matrix invertible may be an afterthought for these lectures, but always having an invertible matrix is also a very important property of ridge regression.
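As an illustration of the closed-form solution (under the usual formulation $w = (X^{T}X + \lambda I)^{-1}X^{T}y$, which I believe matches the slides), here is a minimal sketch; the data and $\lambda$ values are made up.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lambda * I)^(-1) X^T y.
    # The lambda * I term keeps the matrix invertible and shrinks the weights.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Made-up data: y is roughly 2*x1 - 1*x2 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1.0))    # weights shrunk slightly toward zero
print(ridge_fit(X, y, lam=100.0))  # heavier shrinkage
```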
LASSO looks very similar to ridge at first glance. The equation below defines LASSO as,
The only difference is that we are now using the
The goal of LASSO is to encourage sparse solutions, meaning
There are several metrics we can look to when evaluating linear regression models. We will start with R-squared which is defined as,
Where
The next is MSE, which is mean squared error. There is also RMSE, which is the same thing as MSE except you take the square root of the MSE. These tell us how far, in Euclidean terms, the predicted value is from the actual value.
There is also MAE, which is the mean absolute error. This tells us the average absolute error over all predictions. MAE will likely tell us the same thing as MSE, but without squaring the error first.
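A minimal sketch of these metrics (using the standard definitions, with made-up prediction arrays) is below.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)      # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))     # mean absolute error
# R-squared: 1 minus (residual sum of squares / total sum of squares).
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
```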
We are finally discussing gradient descent. Boy am I excited. Below summarizes gradient descent quite well.
The goal is to optimize
Gradient descent is actually faster than computing the closed-form solution on large problems.
Inverting a giant matrix is a very difficult task for a computer to do.
For example, inverting a square
So what are the drawbacks of gradient descent (GD)? There are a couple. To guarantee GD reaches the global minimum, you need a convex function. How do we define a convex function? The figure below is a good starting point.
The above figure is a fancy way of saying that the line segment drawn between any two points on the function must lie on or above the function itself.
The figure straight up says that you must be able to draw a line from
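To make the update rule concrete, here is a minimal sketch of gradient descent on a mean-squared-error loss for linear regression (my own toy example, not the lecture's exact setup); the learning rate and data are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

w = np.zeros(2)
lr = 0.1  # learning rate (step size)
for _ in range(500):
    # Gradient of the MSE loss (1/n) * ||Xw - y||^2 with respect to w.
    grad = (2 / len(y)) * X.T @ (X @ w - y)
    # Step in the direction of steepest descent.
    w -= lr * grad

print(w)  # should be close to [3, -2]
```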
After a grueling time with the likelihood estimator, we can start discussing logistic regression. If you are unfamiliar with the likelihood estimator, I would highly encourage you to read that section under concepts. We define the logistic distribution as,
Which has a conditional distribution defined as,
We define the loss of the log likelihood estimation as
Thus, the optimization problem becomes,
So, with all of the above, what's the big deal with logistic regression?
- It is bounded on $(0,1)$
- It's symmetric: $1 - \sigma(z) = \sigma(-z)$
- Its gradient is super easy: $\sigma'(z) = \sigma(z)(1-\sigma(z))$
I included an example logistic regression file in logistic_regression/lr.py.
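Below is a minimal sketch (separate from lr.py) that numerically checks the three properties listed above; the input values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Bounded on (0, 1).
print(np.all((sigmoid(z) > 0) & (sigmoid(z) < 1)))  # True
# Symmetric: 1 - sigma(z) = sigma(-z).
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))     # True
# Gradient: sigma'(z) = sigma(z) * (1 - sigma(z)), checked with finite differences.
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric_grad, sigmoid(z) * (1 - sigmoid(z)), atol=1e-6))  # True
```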
Naive Bayes assumes that features are conditionally independent.
Let us define conditional independence before continuing.
Formally, let us say there are three probability distributions,
Below is a good visualization of Naive Bayes' overall model.
This one took a while; there are a lot of details we need to discuss before continuing.
If you don't care about the details and just want some code, I wrote a naive bayes model from scratch in the file naive_bayes/nb.py.
Check that out and if you are still confused come back here.
Let me break down the picture above in a simpler way.
There is a huge difference between
We will walk through the image above using an example and explaining each component of the image.
Example Problem
We have an entire set of athletes
The actual formulation of the model requires us to understand one quick math rule
Training is easier than it sounds; all you have to do is find the likelihood of a value for each class for each feature. We will list each step below:
- Separate points by their label.
- Calculate $P(Y) = \frac{|X_Y|}{|X|}$, where $X_Y$ are the sets of points divided up by label and $|X_Y|$ represents how many points are within that set. This value is the prior of $Y$.
- Calculate the mean and variance for each feature within the class. See code for how to do this.
Great, we just trained the model! The confusing part is the prediction methodology. If you are unfamiliar with probability, this part will be confusing; it was for me, and I took an advanced probability course. You have been warned. For our prediction algorithm we need to pick a distribution for the per-feature likelihoods. If you don't know which to use, use a Gaussian (normal) distribution. You can also do likelihood estimates over multiple distributions to see which works best, but for simplicity I will be using a Gaussian distribution for the rest of the explanation. The Gaussian PDF is defined below,
Where
With the PDF defined, predictions are as follows:
- Find the likelihood of a label for a point by evaluating the likelihood of each feature, using the per-class means and variances we computed earlier. Likelihood here refers to evaluating the normal PDF at each feature value.
- Take the log of all the results and sum them together with the log of the prior.
- After doing that for each class, find the class with the highest sum.
- Return that class.
Please look at the python example for any questions.
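For a compact picture of the whole train/predict loop, here is a minimal Gaussian naive Bayes sketch (a simplified stand-in for nb.py, with made-up athlete data).

```python
import numpy as np

def train(X, y):
    # For each class store the prior, plus the per-feature mean and variance.
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small constant to avoid division by zero
        }
    return model

def log_gaussian_pdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict(model, x):
    # Sum the per-feature log likelihoods with the log prior; pick the largest.
    scores = {c: np.log(p["prior"]) + log_gaussian_pdf(x, p["mean"], p["var"]).sum()
              for c, p in model.items()}
    return max(scores, key=scores.get)

X = np.array([[180.0, 80.0], [175.0, 77.0], [160.0, 55.0], [158.0, 50.0]])
y = np.array(["basketball", "basketball", "gymnastics", "gymnastics"])
model = train(X, y)
print(predict(model, np.array([178.0, 79.0])))  # likely "basketball"
```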
There is one last concept we need to discuss, and that is smoothing. If you're smarter than me, you may have noticed that this algorithm fails if the training data is missing a class from the class list. Referring back to our sports example, if we do not have any data for professional bowlers, then our model will not be able to classify athletes that bowl. However, our model can account for what it doesn't know with smoothing. If we add a constant to all of the class priors, we can still account for things we have not seen if we know they will show up.
We should start by just showing a picture of the first example.
There is a great image in the slides I am going to use here.
Notice how with every point the line gets redrawn depending on the new point. So how does this work formally? An example algorithm is as follows:
```python
import numpy as np

def training(x, y):
    # Make w have as many zeros as x has features.
    w = np.zeros(x.shape[1])
    # Iterate through each row of the dataset.
    for i in range(x.shape[0]):
        # If a classification is incorrect (the margin is less than 1),
        # shift the line based on the product of the label and the point.
        if y[i] * (w @ x[i, :]) < 1:
            w = w + y[i] * x[i, :]
    return w
```
Assuming all of the matrices are the correct size, this algorithm will work.
The above algorithm's loss function is called hinge loss.
As you can see, only incorrectly classified points will have an effect on the weights.
This one is odd, or at least the explanation can be hard to understand.
Let us walk through it.
First let us define
The mistake bound is defined as
So what is the intuitive understanding of the equation? This equation says that the total number of mistakes a perceptron can make is proportional to the size of the data ($D(S)$) and inversely proportional to the margin, the distance from the separating line to the closest point. In other words, the bound compares a circle that wraps all of the points against how close the nearest point sits to the line.
Neural nets are built off of perceptrons in a nice way. In a perceptron, the activation function was fixed. Whereas in neural nets, the activation function is variable and will change. More on that later. The main idea behind neural nets is a more robust and dense perceptron.
Neural nets have three components: an Input Layer, Hidden Layers, and an Output Layer. Notice that "Hidden Layers" is plural, meaning we can have multiple hidden layers doing whatever we want. Again, we will go more into that later.
Input layers give each feature a singular node. These feature nodes are then passed to the hidden layers and weights are applied to them. The input layers just define what the input feature vector should look like.
Feature Encodings is the idea of using multiple dimensions of a vector to encode a single feature.
There are three types, nominal, ordinal, and real-valued features.
Real-valued features is something we are already sort of familiar with.
It just means that the features are real numbers.
Nominal features are pictured below,
The image above indicates 4 possible values for our input: A, C, G, and T.
For DNA sequences, this is quite a nice way to model our data, as it makes it simple to input into the model.
Thus, this nominal encoding is considered a one-hot encoding.
"One hot" refers to the fact that only one of the entries is nonzero.
Ordinal features are pictured below.
Now all of the elements could be 1 or 0, and they still indicate different categories.
This type of encoding is also referred to as the thermometer encoding, as more ones indicate more of something.
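Here is a minimal sketch of both encodings (my own toy example using the DNA letters and three ordered sizes, not code from the lectures).

```python
import numpy as np

# One-hot (nominal): exactly one entry is 1.
alphabet = ["A", "C", "G", "T"]
def one_hot(letter):
    vec = np.zeros(len(alphabet))
    vec[alphabet.index(letter)] = 1
    return vec

# Thermometer (ordinal): more ones means "more" of the quantity.
sizes = ["small", "medium", "large"]
def thermometer(size):
    vec = np.zeros(len(sizes))
    vec[: sizes.index(size) + 1] = 1
    return vec

print(one_hot("G"))           # [0. 0. 1. 0.]
print(thermometer("medium"))  # [1. 1. 0.]
```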
Output layers predict the result based on the results of the hidden layers. If an output layer only has a single node, it is used for binary classification. If a network has multiple output nodes, then it is used for multi-class classification. Simple enough. The reason for this distinction is that the output layer determines the prediction for the input. In the binary case, you usually decide if that final node is a one or a zero based on a threshold or some activation function. In the multi-class case, you would usually pick the highest-valued output node; the index of that node is then the prediction.
Hidden Layers
Hidden Layers are where things get interesting.
Hidden layers define activation functions that manipulate the input in some way; we will refer to the activation functions as
From the figure above, we can see that each input is multiplied by some weight and each neuron applies
- sigmoid: $$ \sigma(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}} $$
- tanh
- ReLU
I do not know how to do piecewise functions in markdown, so I just took a picture of ReLU (it is simply $\max(0, z)$).
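A minimal numpy sketch of the three activations (just for illustration) is below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    # ReLU passes positive values through and zeroes out negatives.
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```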
There is a great image in the slides I will use here that generalizes the process well.
We have a dataset
I think the code I have will work.
Something may be slightly off, but the back propagation itself looks correct.
So for how to implement back propagation, check out the file neural_nets/neural_net.py.
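For a very small, self-contained picture of the idea (not the neural_net.py implementation), here is a one-hidden-layer network trained with backpropagation on a made-up regression task; the sigmoid hidden activation, squared-error loss, and learning rate are my own choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.tanh(1.5 * X[:, 0] - X[:, 1]).reshape(-1, 1)  # made-up target to learn

# One hidden layer with 8 sigmoid neurons and a linear output neuron.
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(2000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)           # hidden activations, shape (200, 8)
    y_hat = h @ W2 + b2                # predictions, shape (200, 1)
    # Backward pass: push the squared-error gradient back through each layer.
    d_out = 2 * (y_hat - y) / len(y)   # dLoss/dy_hat for loss = mean((y_hat - y)^2)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h) # chain rule through the sigmoid
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # Gradient descent update on every weight and bias.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.mean((y_hat - y) ** 2))  # training loss; it should have dropped substantially
```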
Regularization is fairly straightforward. It is an idea/algorithm whose goal is to prevent overfitting. There are several types of regularization methods, each with their own pros and cons.
This method is pretty direct. We set a constraint for the loss that isn't zero, and when the loss reaches the constraint we finish training.
Soft constraint is a bit different.
We allow the function to go to zero but we add a term to the optimization problem.
This way the added term changes the location of the local min, but the model won't overfit.
We can tune
The other two I have seen before in 532; this one is new.
In this case, our regularizer is dependent on the probability of
The typical route one should take if they don't know is to just use a soft constraint. Use a hard constraint if the bound of the data is known. Use a Bayesian if the domain knowledge is easy to represent as a prior.
Because we are adding a new term, we have to update our GD algorithm.
I know the notation is not consistent, but bear with me.
Think of
Convolutional neural nets (CNNs) get their own section because of how they differ from traditional neural nets. CNNs revolutionized computer vision when they first showed up. The main difference comes from a signals concept called convolution. You can think of convolution as combining two signals together equally. An intuitive way to grasp it is by starting with multiplication: when we multiply two numbers together we get a third number that is equally composed of the two numbers we started with. So how do we "multiply" signals? If you multiply two pure sine waves together there really isn't too much of a change, but an arbitrary signal is, via its Fourier transform, a whole collection of sine waves, so we cannot just multiply two signals pointwise as one would assume; we need an operation that covers all of the possibilities, which is what convolution does. A network only has to apply convolution in a single layer to be considered a CNN.
Convolution is defined, formally, as:
Where
Finally, we can discuss CNNs in more detail.
I will define CNN convolution with a slide from the lectures.
Above we see a more intuitive visual.
Discussing the problem intuitively, we see that the kernel moves over the input and shrinks it down to a smaller output.
However, the output retains some information from the input.
Notice how the larger values in the output correspond to different regions of the image.
The astute will have noticed that edges of the image could cause problems.
For example, the edge of an image could skew inferences or confuse the model.
The way around this is padding.
Padding surrounds the input with zeros to preserve the edges.
With padding comes a new way to reduce the dimensions.
If we let the padding be
The final component to our dimensionality formalization is the stride.
Stride is when you skip over rows or columns.
If we say that
There is also pooling. Pooling is the idea of applying an operation over the selected kernel values rather than multiplying and summing them up. Pooling is great for targeted searches; for example, if you wanted to filter by eye color, pooling is the best approach.
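Here is a minimal sketch of a 2D convolution (with padding and stride) and a max-pooling pass, written naively with loops just to show the mechanics; the kernel and image values are made up.

```python
import numpy as np

def conv2d(image, kernel, padding=0, stride=1):
    # Pad the input with zeros so edge pixels are not lost.
    image = np.pad(image, padding)
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)  # multiply and sum the window
    return out

def max_pool(image, size=2):
    # Pooling applies an operation (here max) over each window instead of a weighted sum.
    out_size = image.shape[0] // size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = image[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
kern = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(img, kern, padding=1, stride=1).shape)  # (5, 5) = (4 + 2*1 - 2)/1 + 1
print(max_pool(img, size=2))
```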
CNNs solve 3 main issues: translation invariance, locality, and reducing the number of parameters. The third reason is pretty self-explanatory for images. Translation invariance is the idea that where an object appears in an image should not change what you are seeing, meaning the space between objects is arbitrary and two things within the same image can be uniquely identified regardless of position. Locality is actually the opposite: it says that a collection of nearby pixels can uniquely make up an object. CNNs address both of these concerns.
This is the idea that a dataset does not have any labels for its training data. There are fewer formalities with this type of machine learning. I don't know if that holds throughout the entire field of study, but this lecture does not take an equal amount of time to define this type of learning.
Given: $$ \{x^{1}, x^{2}, \dots, x^{n}\} $$ The goal is to discover interesting regularities/structures/patterns that characterize the instances. For example, clustering, anomaly detection, and dimensionality reduction all fall under unsupervised learning. To paraphrase a bit, unsupervised learning is designed for pattern identification with a lack of labels.
The goal for clustering is to model h such that it divides the training set into clusters with intra-cluster similarity and inter-cluster dissimilarity. To paraphrase again, the goal is to group close points and separate far points. You may be asking yourself, how do you define distance for points where there is no clear definition of distance? Well, there are many ways; one can project points with no sense of distance into a space where distance is defined. Or, as my professor put it, just don't use the clustering algorithms.
Above is a visualization of the concepts discussed.
The goal of anomaly detection is to detect anomalies. Yes I know riveting. But this problem is actually harder than it sounds. Although simple to define what the goal is, defining what an anomaly is depends completely on the space you are working in. An anomaly for a signal processing problem will look completely different than a computer vision anomaly. Defining the anomaly to a computer is the most difficult component.
The goal for dimensionality reduction is to find patterns in lower-dimensional feature vectors, in other words to detect correlations between lower dimensions within the dataset. This concept again is harder than it sounds. An easy example for those who know about it is PCA, or Principal Component Analysis. PCA identifies which dimensions carry the most information about the data. I do not want to get into the definition here, as this is a high-level overview, but the point is that it accomplishes the goal of dimensionality reduction, which is to find lower-dimensional patterns within a dataset.
This is the idea that a machine learning model is trained on data that has labels.
This is the primary focus of the first lecture.
We will formalize this concept below.
Let us say there is a set of all possible instances
Classification extracts labels from a finite set of possibilities. Regression, on the other hand, can have an infinite number of possible outputs. Classification's goal is to determine a point's class where as the goal of regression is to map a set of points to a continuous function.
We discussed earlier that
Let
There are two types of learning defined in the lectures.
- Batch Learning
- Online Learning
Batch Learning trains over all instances at once, whereas Online Learning trains the model on one group of instances at a time and updates after each group.
The general idea behind these two concepts says that both methods for training work. I am a bit confused as to why the separation is important but I digress.
Goal: Find a model
Why is this a machine learning concept? Well, the translated phrase is "Entities should not be multiplied beyond necessity." Further translation: "When you have two competing theories that make the same prediction, the simpler one is better." This concept is applied to decision tree learning because the central hypothesis is, "the simplest tree that classifies the training instances accurately will generalize." So how does Occam's razor apply? Well, there are fewer small trees than large ones, so a short tree is unlikely to fit the data well just by chance, whereas a large tree is much more likely to fit it. How hard is it to find the shortest decision tree? It is NP-hard to find the shortest possible decision tree. Therefore, the general solution is to greedily choose splits based on entropy. For more on how that is done, see the decision trees section.
Model overfitting is one of the most common problems in machine learning. A model overfits under two conditions:
- A low error on training data.
- A high error over the whole distribution.
Again, there is a great visualization in the slides below.
Another good visualization of overfitting comes again from the slides.
The figure above has an x-axis of tree size and a y-axis of accuracy.
As the tree grows in size, the accuracy on the training data increases whereas the accuracy on the test set decreases.
Thus it is a great example of overfitting.
To describe the general phenomenon further, there is another figure from the slides.
The figure above shows that the optimal capacity lies at the boundary of the overfitting zone.
It is the moment that the model begins to overfit to the training data.
This section will be very important and I will try to have code examples for a lot of these.
A tuning set is a set that is not used for primary training purposes but is used to select among models. That's a terrible description, so let me try again: a tuning set is a subset of the training data which is used to observe how the data will "tune" the model. When I say "tune" I mean make slight adjustments to the hyperparameters, then see how those adjustments affect accuracy on data the model has not seen yet.
We first should discuss why using a single training/testing set is not the best option for model evaluation. A single training set does not tell us how sensitive accuracy is to a particular training sample, and a single partition of the data into training/testing misses out on several details about the dataset. If the testing data is larger, we get a more reliable (lower variance) estimate of accuracy, but at the cost of a smaller training set. If the training data is larger, it will be a better representation of the whole distribution. As you can see, there is a trade-off in making one set larger than the other. Thus, we must partition the training data
To address the first problem, the lack of variation, one should randomly resample the training and testing data.
The process is visualized below.
As you can see, the training and testing data have a fixed size and then we randomly partition the data into the sets.
However, this process could lead to issues where the distribution is not properly modeled.
For example, a bad random partition based on the figure above would be one where all of the + values are in the test set but none are in the training set, leading to a terrible model.
To address that concern you can apply stratified sampling.
With stratified sampling, the class proportions are maintained when performing the partition.
This preserves the distribution while maintaining the random value selection.
Cross validation is considered to be the industry standard.
I was doing cross validation in COMP 532, implying that this concept is widely used.
Let us first visualize it below.
What this does is partition the training data into
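Here is a minimal k-fold cross-validation sketch (plain numpy, with my own toy model and made-up data) showing the partition/train/evaluate loop.

```python
import numpy as np

def k_fold_accuracy(X, y, k, train_fn, predict_fn):
    # Shuffle the indices, then split them into k roughly equal folds.
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        preds = predict_fn(model, X[test_idx])
        scores.append(np.mean(preds == y[test_idx]))
    # Average accuracy over the k held-out folds.
    return np.mean(scores)

# Toy "model": predict whichever class mean the point is closest to.
def train_fn(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_fn(model, X):
    classes = list(model)
    centers = np.stack([model[c] for c in classes])
    dists = np.linalg.norm(X[:, None] - centers, axis=2)
    return np.array(classes)[np.argmin(dists, axis=1)]

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(k_fold_accuracy(X, y, k=5, train_fn=train_fn, predict_fn=predict_fn))
```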
A learning curve is the accuracy of a method as a function of training set size.
For the learning curve, I made a python script in utilities/learning_curve_example.py.
The script goes over the algorithm and compares a Naive Bayes model and an SVM model.
The result is below.
Confusion matrices are a great way to examine model performance per class.
Below is an example of a multi-class confusion matrix.
The majority of this lecture discusses a 2-class confusion matrix which looks like this.
From the above we can now formally define accuracy, recall, precision, error, and false positive rate.
Accuracy
Error
Recall
False Positive Rate
Precision
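Using the standard definitions of these quantities in terms of true/false positives and negatives (which I believe matches the slides), here is a small sketch; the counts are made up.

```python
# Made-up counts from a 2-class confusion matrix.
TP, FP, TN, FN = 40, 10, 45, 5
total = TP + FP + TN + FN

accuracy = (TP + TN) / total    # fraction of all predictions that are correct
error = (FP + FN) / total       # fraction that are wrong (1 - accuracy)
recall = TP / (TP + FN)         # true positive rate
false_positive_rate = FP / (FP + TN)
precision = TP / (TP + FP)      # fraction of positive predictions that are correct

print(accuracy, error, recall, false_positive_rate, precision)
```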
Receiver Operating Characteristic(ROC) curves plot the TP rate(recall) vs the FP-rate.
The area under the ROC curve is sometimes used to evaluate the model.
ROC curves can be confusing; what they show is how sensitive the models are.
Below is a visual of what the curves can look like.
The ideal curve shows what great performance should look like; you can think of the relationship as
A more curved algorithm, like alg1, implies that the locations of the false positives are fairly close to one another.
Whereas a straighter curve, like alg2, demonstrates that the TPs and FPs are spread out.
This means that alg2 has several false positives after seeing the first one.
Therefore, what really matters is the AUC as the table above suggests.
The closer the AUC is to 1, the better the model's performance.
Below is a visual of what the algorithm looks like.
The final metric we are going to be discussing is the PR curve.
PR curves show the fraction of predictions that are false positives.
As the recall increases, the false positives should decrease.
This is due to the denominators of the two having a decreasing relationship.
All false predictions are either
A likelihood function captures the probability of seeing some data as a function of model parameters. This is weird and took me a ChatGPT conversation to really understand what it means. A likelihood function determines how well some parameters model a given dataset. When you maximize over a set of parameters, the likelihood function's peaks tell you which parameters best model the distribution of the dataset. If you are able to find parameters that model the dataset well, then those parameters become your predictor function. There are two examples that I will use to attempt to explain this: a uniform distribution and a normal distribution.
The uniform distribution will have a likelihood function that looks like, $$ L(\theta;X) = \prod_j{p_{\theta}(x_j)} $$
Where $p_{\theta}$ is the density of the chosen distribution. I wrote an example calculation for the uniform case in utilities/likelihood_function_example_uniform.py.
The normal distribution example can be found in the utilities/likelihood_function_example.py script.
When given a set of points over a normal distribution, the likelihood function will find the values of
Now with this understanding, we can finally define the maximum likelihood as,
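As a tiny illustration of the idea (separate from the utilities scripts), the sketch below evaluates the Gaussian log-likelihood of some made-up data over a grid of candidate means and picks the maximizer, which should land near the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=4.0, scale=1.0, size=500)  # true mean is 4

def log_likelihood(mu, x, sigma=1.0):
    # Sum of log N(x_j; mu, sigma^2) over the dataset.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

candidates = np.linspace(0, 8, 801)
best_mu = candidates[np.argmax([log_likelihood(m, data) for m in candidates])]
print(best_mu)  # close to 4, and essentially equal to data.mean()
```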
Cross-Entropy is a common loss function that is seen frequently in the ML scene(I think). It is a measure of how different the predicted probability distribution of a model is from the true probability distribution of the data. Thus, minimizing the difference in distribution would best represent the training data.
Let
How do we minimize cross entropy?
Well, the slides say KL divergence.
I do not know what that is, lemme ask the oracle ChatGPT.
It has spoken: the KL divergence is a measure of how different two distributions are from one another.
This makes sense as the equation for KL divergence of two distributions
The first term in the sum is the cross entropy; the second term is the fixed entropy of the data we have.
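A minimal sketch (with made-up distributions) checking the relationship between cross-entropy, entropy, and KL divergence:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution of the data
q = np.array([0.5, 0.3, 0.2])  # model's predicted distribution

entropy = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl_divergence = np.sum(p * np.log(p / q))

# KL(p || q) = cross-entropy(p, q) - entropy(p), so minimizing cross-entropy
# (with p fixed) is the same as minimizing the KL divergence.
print(np.isclose(kl_divergence, cross_entropy - entropy))  # True
```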
The softmax converts a vector into a probability vector. The equation is below,
I wrote some code as an example calculation in the file utilities/softmax_calc.py.
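For reference, here is a minimal softmax sketch (separate from softmax_calc.py) that uses the usual max-subtraction trick for numerical stability; the input vector is made up.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z), softmax(z).sum())  # probabilities that sum to 1
```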
Discriminative models are what people are usually first exposed to when learning about machine learning. These are the predictive models. When given features, these bad boys will produce a label. Examples of this model are linear and logistic regression.
All my NLP homies love generative models.
These models can be either supervised or unsupervised.
The goal of these models is to return how the data was generated.
These model
Below is a good table discussing the differences between the two.
Notice how both generative and discriminative models utilize MLE and MAP to estimate model parameters.
This will be more of a probability concept that we will apply in Naive Bayes models. Therefore, this section is important for a particular model. Let us first establish some terminology,
Where
MAP estimation is a statistical method used to estimate the most likely values of model parameters given some observed data and a prior distribution. In machine learning, we treat the parameters of a model as random variables with a prior distribution. Learning can then be defined in terms of Bayesian inference,
From the above equation, we can define MAP estimation formally as,
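To ground this, here is a tiny made-up example: the MAP estimate of a coin's heads probability with a Beta prior, compared against the MLE. The Beta prior and the closed-form posterior mode are standard results, not something taken from the lectures.

```python
# Observed data: 7 heads out of 10 flips.
heads, flips = 7, 10

# MLE just maximizes the likelihood: heads / flips.
mle = heads / flips

# MAP with a Beta(alpha, beta) prior maximizes likelihood * prior.
# The posterior mode has the closed form (heads + alpha - 1) / (flips + alpha + beta - 2).
alpha, beta = 5, 5  # prior belief that the coin is roughly fair
map_estimate = (heads + alpha - 1) / (flips + alpha + beta - 2)

print(mle, map_estimate)  # 0.7 vs ~0.61: the prior pulls the estimate toward 0.5
```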