Suppose you are the product manager of the factory and you have the results for some microchips on two different tests, $x_1$ and $x_2$. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model. The scatter plot of the training data is shown below. Note that the features have been normalized.
Logistic regression models the probability that a sample belongs to a certain class (in this case, pass/fail). Each sample is assigned a probability between 0 and 1, and a discriminant function assigns it to the appropriate class. The basic model is displayed below:

$$
P(Y = 1 \mid \mathbf{x}) = \sigma(w_0 + w_1 x_1 + w_2 x_2) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}}
$$

where $Y \in \{0, 1\}$ denotes the class and $\mathbf{x} = [x_1, x_2]$ is the feature vector of attributes. A total of three weights were trained with batch gradient descent and fed into the sigmoid activation function, with the discriminant function placing samples with $P \geq 0.5$ into class 1. Training accuracy was less than 50% because the data is not linearly separable.

To better fit the data, more features were created for each data point by adding polynomial basis functions of $x_1$ and $x_2$, with degrees up to the 6th power, to the weighted sum. As a result, the input data was transformed into a matrix $\boldsymbol{\phi}$ with 28 columns, one per basis function including the constant bias term.
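The feature expansion and the 0.5 decision threshold described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function names `map_features`, `sigmoid`, and `predict` are hypothetical, and the ordering of the monomials is an assumption.

```python
import numpy as np

def map_features(x1, x2, degree=6):
    """Expand two features into all monomials x1^(i-j) * x2^j of total degree <= `degree`."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    columns = [np.ones_like(x1)]  # constant column for the bias weight
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(columns)

def sigmoid(z):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(Phi, w, threshold=0.5):
    """Assign class 1 when the modeled probability is at least `threshold`."""
    return (sigmoid(Phi @ w) >= threshold).astype(int)

# Two made-up samples, only to show the resulting shape.
Phi = map_features([0.5, -0.25], [0.1, 0.7])
print(Phi.shape)  # (2, 28)
```

With degree 6 there are 1 + 2 + 3 + ... + 7 = 28 monomials, which matches the 28 dimensions mentioned above.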
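While a higher-dimensional $\boldsymbol{\phi}$ fits the training data more accurately, it is susceptible to overfitting and can yield low testing accuracy. Therefore, a regularized model is required, along with a weight penalty in the gradient descent algorithm. The regularization of choice is L2 (ridge) regularization, which adds the squared magnitude of the coefficients as a penalty term to the loss function. The model is displayed below:

$$
J(\mathbf{w}) = -\frac{1}{N}\sum_{n=1}^{N}\Big[\, y_n \log h_n + (1 - y_n)\log(1 - h_n) \Big] + \frac{\lambda}{2N}\sum_{j=1}^{27} w_j^2,
\qquad h_n = \sigma\big(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\big)
$$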
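where $\lambda \geq 0$ is a hyperparameter that controls the strength of the penalty on the weights: larger values shrink the weights more aggressively. Note the bias term $w_0$ is not regularized in the weight decay process. With $h_n = \sigma\big(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\big)$ and $\phi_0 = 1$, the gradient then simplifies to:

$$
\frac{\partial J}{\partial w_0} = \frac{1}{N}\sum_{n=1}^{N}(h_n - y_n),
\qquad
\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{n=1}^{N}(h_n - y_n)\,\phi_j(\mathbf{x}_n) + \frac{\lambda}{N}\, w_j \quad (j \geq 1)
$$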
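and our weight update can be expressed as:

$$
\begin{aligned}
w_0 &\leftarrow w_0 - \frac{\alpha}{N}\sum_{n=1}^{N}(h_n - y_n) \\
w_j &\leftarrow w_j\left(1 - \frac{\alpha\lambda}{N}\right) - \frac{\alpha}{N}\sum_{n=1}^{N}(h_n - y_n)\,\phi_j(\mathbf{x}_n) \quad (j \geq 1)
\end{aligned}
$$

where $\alpha$ is the learning rate and $h_n = \sigma\big(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\big)$ is the predicted probability for sample $n$.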
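The regularized loss and the batch gradient descent update can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the notebook's code: the names `regularized_loss` and `fit` are hypothetical, the penalty is assumed to be scaled by `lam / (2 * n)`, the values of `lam` are arbitrary, and the toy data is randomly generated rather than the microchip dataset. The default learning rate of 0.08 and the cap of 10,000 iterations follow the values quoted for Figure 1, assuming `pmax` is the iteration limit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_loss(w, Phi, y, lam):
    """Mean cross-entropy plus an L2 penalty that skips the bias weight w[0]."""
    n = len(y)
    h = sigmoid(Phi @ w)
    eps = 1e-12  # guards against log(0)
    cross_entropy = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    penalty = lam / (2 * n) * np.sum(w[1:] ** 2)
    return cross_entropy + penalty

def fit(Phi, y, lam=1.0, alpha=0.08, max_iter=10_000):
    """Batch gradient descent; returns the weights and the loss after each step."""
    n, d = Phi.shape
    w = np.zeros(d)
    losses = []
    for _ in range(max_iter):
        h = sigmoid(Phi @ w)
        grad = Phi.T @ (h - y) / n      # gradient of the cross-entropy term
        grad[1:] += lam / n * w[1:]     # weight decay on every weight except the bias
        w -= alpha * grad
        losses.append(regularized_loss(w, Phi, y, lam))
    return w, np.array(losses)

# Toy data (made up): points inside a circle are labeled 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)
Phi = np.column_stack([np.ones(100), X, X ** 2])  # bias, linear, and squared terms

w, losses = fit(Phi, y, lam=0.1, max_iter=2000)
print(losses[0], losses[-1])
```

Plotting `losses` against the iteration index gives the kind of loss curve described in Figure 1.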
Figure 1: Graph of the training data cross-entropy loss vs. number of iterations. Learning rate was set to 0.08 with pmax = 10,000.

- Jupyter Notebook - Web environment
- NumPy - Third-party library for linear algebra/vector calculations
- Matplotlib - Python library for graphing and visualizations

- Walter Nam - Initial work