This repository contains a step-by-step guide to using the BERT model for text classification. I have chosen a dataset of tweets labelled with positive and negative sentiment, as sentiment classification is one of the areas newcomers to machine learning are most interested in.
This guide is intended to make machine learning a little more accessible to beginners, though the complexity of the subject inevitably makes it better suited to readers already familiar with statistics and Python. Still, I have kept the guide simple enough that most beginner programmers should be able to make good headway in understanding and using the BERT model. My recommendation is to read through it slowly: try to understand the code in each cell, along with the explanations given, before moving forward.
It is very difficult to avoid touching on advanced topics in statistics, or taking some other concepts for granted. In particular, I have avoided going into detail on logistic regression, a full explanation of which would be complex even for an advanced user.
The Google Colaboratory notebook - 'A_Step_by_Step_Guide_To_BERT.ipynb' - gives a complete step-by-step walkthrough of:
- loading the data
- preprocessing the text
- initializing the BERT model
- generating the embeddings
- performing text classification using logistic regression
The dataset used for the walkthrough is contained in the file 'data.csv'.
If you wish to explore machine learning in more depth, you should consult the following resources:
- Getting Started with Google BERT - by Sudharsan Ravichandiran
- Transformers for Natural Language Processing - by Denis Rothman
- What Is ChatGPT Doing... and Why Does It Work? - by Stephen Wolfram
If you are a more advanced user and you want to understand logistic regression itself, you should consult:
- Logistic Regression: A Self-Learning Text - by David Kleinbaum & Mitchell Klein