
Breast Cancer Detection w/ K-Nearest Neighbours (2019)

This is my approach to implementing K-Nearest Neighbours from scratch and applying it to the Breast Cancer Wisconsin (Diagnostic) dataset, provided by the UCI Machine Learning Repository.

How do I use this Repo?

Simple. Just clone the repo into a folder, open a terminal in that directory, and run python cancer.py

By default the following code will run...

### MAIN ###
dataset = loadData()                                     # {"B": [...], "M": [...]}
trainingData, testingData = crossValidate(dataset, trainSize=0.65)

k = 7
knn = knn.KNearestNeighbour()                            # note: rebinds the imported knn module's name (harmless, it's only used once)
score = 100*knn.getScore(trainingData, testingData, k)   # getScore returns a fraction in [0, 1]

print("average accuracy: {0:.5f}%".format(score))

Alternatively, you can open a terminal in the directory, start Python, and import the relevant modules to run the methods yourself. For example:

python
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import knn
>>> import cancer

>>> dataset = cancer.loadData()
>>> trainingData, testingData = cancer.crossValidate(dataset, trainSize=0.65)

>>> k = 7
>>> knn = knn.KNearestNeighbour()
>>> score = 100*knn.getScore(trainingData, testingData, k)

>>> print("average accuracy: {0:.5f}%".format(score))

"accuracy for M diagnosis: 91.33%"
"accuracy for B diagnosis: 96.80%"
"average accuracy: 94.06500%"

This will:

  • load the dataset from data.csv
  • cross validate the data, splitting into a training and testing set
  • initialise a new KNN model
  • test KNN on the testing set given the training set
  • output the average accuracy (for both benign/malignant)

Feel free to remove this code and mess around with the methods!

cancer.py

Loading the Data from .csv

First, I load the data from the .csv using the loadData() function. What is returned will look something like this:

{"B": [b1, b2, ..., bn], "M": [m1, m2, ..., mn]} where bn or mn for any n, is a 30 dimensional list (vector).

Note: it is 30-D instead of 32-D as I have removed the "id" and "diagnosis" columns within loadData().

For example, loading the data into the format shown above:

dataset = loadData()
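
To illustrate, here is a minimal sketch of what loadData() might look like, assuming data.csv follows the usual WDBC layout (a header row, then id in column 0, diagnosis in column 1, and the 30 features in columns 2-31). This is an illustration of the idea, not necessarily the exact code in cancer.py:

import csv

def loadData(filename="data.csv"):
    """Load the WDBC data into {"B": [...], "M": [...]} of 30-D vectors."""
    dataset = {"B": [], "M": []}
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)                                  # skip the header row (assumed present)
        for row in reader:
            diagnosis = row[1]                        # "B" or "M"
            features = [float(x) for x in row[2:32]]  # drop "id" and "diagnosis"
            dataset[diagnosis].append(features)
    return dataset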

Cross Validating the Data

As I have a finite dataset to work with, I cannot use all of it to "train" KNN; instead I will cross-validate. This is essentially shuffling the data and splitting it into two groups, one for training purposes and another for testing.

The reason for this is that if I used the whole dataset to both train and test my model, it would score higher than it would on unseen data points (as there would always be a training point exactly on each testing point)... You wouldn't give your students a final exam that was identical to a past paper.

crossValidate(data, trainSize=0.8) is used to do this cross-validation, which returns trainingData and testingData (in that order), where the arguments...

data is the dataset to be split (as returned by loadData()) and

trainSize is the percentage of data which is set aside for training (default being 0.8)

For example, creating two datasets for training and testing:

trainingData, testingData = crossValidate(dataset, trainSize=0.6)

Let's assume dataset had 100 vectors; then trainingData and testingData would have 60 and 40 vectors, respectively.

Note: trainingData and testingData are disjoint (share no common elements)
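
As a rough sketch of the idea (assuming the per-class dict format returned by loadData(); the real crossValidate() in cancer.py may differ):

import random

def crossValidate(data, trainSize=0.8):
    """Shuffle each class and split it into training and testing portions."""
    trainingData, testingData = {}, {}
    for diagnosis, vectors in data.items():
        shuffled = vectors[:]        # copy, so the original dataset is untouched
        random.shuffle(shuffled)
        cut = int(len(shuffled) * trainSize)
        trainingData[diagnosis] = shuffled[:cut]
        testingData[diagnosis] = shuffled[cut:]
    return trainingData, testingData

Splitting each class separately like this keeps the benign/malignant proportions roughly the same in both sets, and guarantees the two sets are disjoint.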

Finding the best hyper-parameters (K and trainSize)

bruteForceBestHyperParams(diagnosis), where diagnosis is "M" or "B", is an admittedly brute-force (and completely inefficient) way to find the best hyper-parameters for the model.

The method runs through values of trainSize from 0.4 to 0.9 for cross-validation and, for each, through values of k from 1 to 15, randomising the dataset and running KNN each time to build a list of accuracies and hyper-parameters.

The basic pseudocode, tidied into a runnable Python sketch (dataset and knn come from the surrounding module, as in the main code above):

def bruteForceBestHyperParams(diagnosis):
    bestParameters = []
    for trainSize in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        for k in range(1, 16):
            # crossValidate() reshuffles the dataset on every call
            trainingData, testingData = crossValidate(dataset, trainSize)
            accuracy = knn.getScore(trainingData, testingData, k)
            bestParameters.append((accuracy, trainSize, k))
    # sort by accuracy, best first
    return sorted(bestParameters, reverse=True)

From doing this and collating the results, I have concluded that the best results from my KNN are achieved with trainSize ≈ 0.63 and k ≈ 7.

knn.py

Predicting a new sample given data

knn.predict(data, sample, k=n) is used to predict which class (benign/malignant) sample belongs to, based on data, the data points around it, where the arguments...

data is the training data, a matrix of row vectors which are the datapoints (n by m),

sample is a single row vector (1 by m) and

k is the number of nearest neighbours the algorithm will consider before classifying sample.

For example, predicting the class of the first vector in testingData["M"] (the malignant class):

>>> knn.predict(trainingData, testingData["M"][0], k=7)
'M'

Finding the Accuracy

getScore(trainingData, testingData, k) is used to calculate the accuracy of the model, given trainingData, testingData and a value for k. It is similar to predict, the differences being that testingData is a matrix of row vectors rather than a single row vector, and that a fraction (between 0 and 1) is returned.

Ideally trainingData and testingData should be cross validated from the original dataset.

Note: by "accuracy" I mean (number of correct diagnoses)/(total number of cases)
