
Tirth Patel's Projects

expectation-maximization

Expectation Maximization (EM) is an iterative algorithm for computing maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates of parameters. It estimates the missing values in a dataset, given the general form of the probability distribution associated with the latent variables, and then uses those estimates to update the parameter values in the Maximization step. In this task, we do not know whether Coin A or Coin B was flipped for each set of 30 flips; the identity of the coin is therefore unobserved and can be treated as a hidden (latent) variable.

Initialization Step: We initialize random biases for the two coins, which gives an initial estimate of which coin was chosen in each trial.

Expectation Step: Given the current bias estimates, we determine the probability that each coin produced the observed heads and tails using conditional probability (Bayes' theorem). One approach would be a hard assignment: see which coin's bias better matches the flips and assign all of the trial's flips to that coin. For example, if we see 11110011001110001111 and our current assumed biases for Coin A and Coin B are 0.4 and 0.7 respectively, we simply assume it was Coin B with 13 H and 7 T. But this breaks down in less obvious cases, such as when a coin has roughly equal chances of H and T. Hence, we instead estimate the probability that each coin is the true coin given the flips we see in the trial, and use those probabilities to assign fractional H and T counts to each coin. The result is an expected count of H and T for Coin A and Coin B in each trial.

Maximization Step: Given the expected counts computed in the Expectation step, we estimate new theta values for both coins that maximize the expected complete-data log-likelihood. Each theta is the expected number of heads attributed to that coin divided by its expected total number of flips (heads plus tails).

Convergence Step: After updating the thetas, we iterate until they converge; EM is guaranteed to reach a stationary point of the likelihood, though not necessarily the global optimum. We check whether the thetas have converged: if yes, we stop; otherwise we repeat the Expectation and Maximization steps. For this, I use a dynamic number of iterations, comparing each theta with its values from the previous 2 iterations. If it remains constant, I break the loop and store the last theta values as the optimal values.
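A minimal numpy sketch of this two-coin EM loop, assuming a uniform prior over which coin is picked for each trial (so the prior cancels in the E-step); the data, starting biases, and variable names below are illustrative, not the project's actual values:

```python
import numpy as np

def em_two_coins(trials, theta_a=0.4, theta_b=0.7, tol=1e-6, max_iter=500):
    """Estimate the heads-probability of two coins from trials of flips
    (1 = heads) when the coin used in each trial is unobserved."""
    n_flips = trials.shape[1]
    heads = trials.sum(axis=1)
    tails = n_flips - heads
    for _ in range(max_iter):
        # E-step: posterior responsibility of each coin for each trial,
        # assuming a uniform 0.5/0.5 prior over coin choice (it cancels).
        like_a = theta_a ** heads * (1 - theta_a) ** tails
        like_b = theta_b ** heads * (1 - theta_b) ** tails
        w_a = like_a / (like_a + like_b)
        w_b = 1.0 - w_a
        # M-step: new theta = expected heads / expected total flips per coin.
        new_a = (w_a * heads).sum() / (w_a * n_flips).sum()
        new_b = (w_b * heads).sum() / (w_b * n_flips).sum()
        # Convergence check: stop once both thetas stop moving.
        if abs(new_a - theta_a) < tol and abs(new_b - theta_b) < tol:
            theta_a, theta_b = new_a, new_b
            break
        theta_a, theta_b = new_a, new_b
    return theta_a, theta_b

# Illustrative data: 5 trials of 20 flips, generated from two made-up biases.
rng = np.random.default_rng(0)
biases = np.array([0.8, 0.8, 0.8, 0.35, 0.35])[:, None]
trials = (rng.random((5, 20)) < biases).astype(int)
print(em_two_coins(trials))  # thetas drift toward the generating biases
```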

multiclass-image-classification-

The main aim of the project is to scan X-rays of human lungs and classify them into three categories (healthy patients, patients with pre-existing conditions, and serious patients who need immediate attention) using a Convolutional Neural Network. The provided dataset of grayscale lung X-rays is a numpy array of dimensions (13260, 64, 64, 1). The corresponding labels are of size (13260, 2), with class (0) if the patient is healthy, (1) if the patient has pre-existing conditions, or (2) if the patient has effusion/mass in the lungs.

During data exploration, I found that the class labels are highly imbalanced. To handle this, I used data augmentation techniques such as horizontal and vertical flips, rotation, brightness changes, and height/width shifts to increase the number of training images and mitigate overfitting. After preprocessing, the dataset has dimensions (31574, 64, 64, 1).

For model selection, I built 4 CNN architectures similar to LeNet-5, VGGNet, and AlexNet, with various Conv2D layers followed by MaxPooling2D layers, and fitted them with different epochs, batch sizes, and optimizer learning rates. I also built a custom architecture with a comparatively less complex structure than the previous models. To further reduce overfitting, I tried an absolute-weight (L1) kernel regularizer on the convolutional and Dense layers, and a bias regularizer on the Dense layer to restrict bias in the classification. In addition, I tried Dropout with a 20% dropout rate during training and the Early Stopping method, and found that Early Stopping gave better results than Dropout.

For evaluation, I split the dataset into training, testing, and validation sets with a (60, 20, 20) ratio and computed the macro F1 score and AUC score on the test data; from the confusion matrix, I calculated accuracy as the sum of the diagonal elements divided by the sum of all elements. I also plotted training vs. validation loss and accuracy graphs to visualize model performance. Interestingly, the VGGNet-like CNN with 5 Conv2D layers, 3 MaxPooling layers, and 2 Dense layers outperformed the other architectures, with a macro F1 score of 0.773, an AUC score of 0.911, and an accuracy of 0.777.
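As a rough illustration, here is a Keras sketch of a model with the shape described above (5 Conv2D, 3 MaxPooling2D, 2 Dense layers); the filter counts, kernel sizes, loss, and optimizer settings are assumptions, not the project's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg_like(input_shape=(64, 64, 1), num_classes=3):
    # 5 Conv2D + 3 MaxPooling2D + 2 Dense, as in the write-up;
    # filter counts and kernel sizes here are illustrative guesses.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Assumes integer class labels (0, 1, 2).
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping, which the write-up found more effective than Dropout here:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, batch_size=64, callbacks=[early_stop])
```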

simpson-paradox

Simpson's paradox states that a trend or result that appears in several different groups of data can reverse or disappear when the groups are combined.
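A small numeric illustration of the reversal, using the well-known kidney-stone treatment figures often cited for this paradox:

```python
# Classic kidney-stone figures: (successes, patients) per treatment and group.
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

# Within each group, treatment A has the higher success rate...
for name, g in groups.items():
    for t, (s, n) in g.items():
        print(f"{name}, treatment {t}: {s / n:.1%}")

# ...but pooling the groups reverses the ordering: B looks better overall.
for t in ("A", "B"):
    s = sum(g[t][0] for g in groups.values())
    n = sum(g[t][1] for g in groups.values())
    print(f"pooled, treatment {t}: {s / n:.1%}")
```

Here A wins in both groups (93.1% vs 86.7%, 73.0% vs 68.8%) yet loses after pooling (78.0% vs 82.6%), because most of A's cases fall in the harder "large stones" group.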

tweets-sentiment-classification

The main aim of the project is to analyze Twitter data describing the COVID-19 situation and to build a text classification model that distinguishes tweets into 5 categories: Extremely Negative (0), Negative (1), Neutral (2), Positive (3), and Extremely Positive (4). The provided dataset contains tweets with dimensions (37041, 2) and, separately, numerical labels with dimensions (37041, 2) for the above categories.

The tweets need to be cleaned, as they contain irrelevant elements such as mentions (@), URLs, HTML tags, and punctuation marks. Using regex functions, I removed those elements as well as stopwords. To normalize the terms, I applied the Porter Stemmer and the WordNet Lemmatizer to reduce each term to its base form. Then, to convert the words into vectors of equal length, I tokenized the tweets, converted them to integer sequences, and post-padded the sequences with zeros, using the length of the longest tweet as the maximum length. After preprocessing, the tweet dataset has dimensions (37041, 286).

For model selection, I built 3 models: a baseline Multinomial Naive Bayes classifier and 2 advanced Recurrent Neural Network models, namely a GRU architecture (a single Embedding layer, 1 Bidirectional layer followed by GlobalAveragePooling1D, and 2 Dense layers) and an LSTM architecture (a single Embedding layer followed by 2 Bidirectional layers and 2 Dense layers). I also tried Dropout with a 40% dropout rate during RNN training and the Early Stopping method, and found that Early Stopping gave better results than Dropout.

For evaluation, I split the dataset into training, testing, and validation sets with an (80, 10, 10) ratio and computed the macro F1 and AUC scores on the test data; from the confusion matrix, I calculated accuracy as the sum of the diagonal elements divided by the sum of all elements. I also plotted training vs. validation loss and accuracy graphs to visualize model performance. Interestingly, by skipping preprocessing steps such as stopword removal, Porter stemming, and WordNet lemmatization, and using just a basic text-cleaning function, the accuracy of the LSTM model increased from 73.87% to 77.1%, with an AUC score of 0.95.
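A rough sketch of the cleaning and padding pipeline, using the standard Keras text utilities; the regexes and sample tweet are illustrative, and 286 is the maximum sequence length reported above:

```python
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_tweet(text):
    """Basic cleaning: drop mentions, URLs, HTML tags, and punctuation."""
    text = re.sub(r"@\w+", " ", text)           # mentions
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # punctuation / digits
    return text.lower()

tweets = ["Stay safe! https://example.com @user <b>lockdown</b> day 10"]
cleaned = [clean_tweet(t) for t in tweets]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
seqs = tokenizer.texts_to_sequences(cleaned)
# Post-pad with zeros up to the longest tweet length (286 in the project).
padded = pad_sequences(seqs, maxlen=286, padding="post")
print(padded.shape)  # (1, 286)
```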

wordembedding

Word embeddings provide a dense representation of words and their relative meanings. They are an improvement over the sparse representations used in simpler bag-of-words models. Word embeddings can be learned from text data and reused across projects, and they can also be learned as part of fitting a neural network on text data.
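For instance, a minimal Keras sketch of learning an embedding as part of fitting a network; the vocabulary size, embedding dimension, and downstream task here are arbitrary assumptions:

```python
from tensorflow.keras import layers, models

# Learn 8-dimensional word vectors jointly with a small classifier.
model = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=8),  # dense word vectors
    layers.GlobalAveragePooling1D(),                   # average over tokens
    layers.Dense(1, activation="sigmoid"),             # e.g. binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the first layer's weights hold the learned (10000, 8)
# embedding matrix, which can be saved and reused in other projects.
embedding_matrix = model.layers[0].get_weights()[0]
```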
