Breast cancer classification using machine learning techniques has become an essential area of research for improving early detection and diagnosis. In this project, a Support Vector Machine (SVM) was trained to build a reliable model that accurately distinguishes between malignant and benign breast tumors.
The dataset comprises 569 cases: 212 labeled malignant and 357 labeled benign. Each case has 30 features, including mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, and mean concave points, among others.
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0.0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0.0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0.0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0.0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0.0 |
5 rows × 31 columns
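The table above is the head of a pandas DataFrame. A minimal sketch of how such a frame can be built from scikit-learn's bundled copy of the Wisconsin breast cancer dataset (the variable name `df` is an assumption about the project's setup):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the Wisconsin breast cancer dataset
data = load_breast_cancer()

# 30 numeric features plus the binary target (0 = malignant, 1 = benign)
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print(df.shape)                      # (569, 31)
print(df["target"].value_counts())  # 357 benign (1), 212 malignant (0)
```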
Data Visualization
0 - indicates Malignant (the life-threatening case)
1 - indicates Benign
Some observations:

- Looking at the distributions of mean radius, mean area, and mean perimeter, we see that malignant cases tend to be larger than benign ones.
- Looking at mean texture, we see that malignant cases have a higher mean texture than benign ones.
- Malignant cases also show a higher mean smoothness than benign ones, though the difference is smaller than for the size-related features.
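The observations above can be checked numerically with per-class feature means, rather than only from the distribution plots. A sketch, assuming `df` is the DataFrame built from scikit-learn's copy of the dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the project's DataFrame (an assumption about its setup)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Per-class means for the features discussed above (0 = malignant, 1 = benign)
group_means = df.groupby("target")[
    ["mean radius", "mean area", "mean texture", "mean smoothness"]
].mean()
print(group_means)
```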
Case count
Some observation:
- More Benign cases than Malignent in the dataset.
Correlation
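This section presumably showed a correlation heatmap; a minimal sketch of computing the underlying correlation matrix (assuming the same `df` as above):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Pairwise Pearson correlations between all 31 columns
corr = df.corr()

# Size-related features are almost perfectly correlated with one another,
# and negatively correlated with the target (larger tumors -> malignant, 0)
print(corr.loc["mean radius", "mean perimeter"])  # close to 1
print(corr.loc["mean radius", "target"])          # negative
```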
- Define matrix of features X and target y
- Split into training and testing data
- Fit the SVM model
# Matrix of features X and target y
X = df.drop(["target"], axis=1)
y = df["target"]

# 80/20 train/test split with a fixed random state for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

# Fit a default SVC (RBF kernel)
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
model1 = SVC()
model1.fit(X_train, y_train)
Confusion Matrix
Looking at these results:
- We have 0 type II errors, i.e. the model did not give any false negatives.
- We have 7 type I errors: the model gave 7 false positives.
- When a cell was malignant, the model correctly identified it in 41 cases; when the cell was benign, the model correctly identified it in 66 cases.
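The counts above come from a confusion matrix on the held-out test set. A sketch reproducing the evaluation with the same split and a default `SVC` (the exact counts can vary with the scikit-learn version, since the default `gamma` has changed over releases):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Same data and 80/20 split as in the project (random_state=5)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

# Default SVC on unscaled features
model1 = SVC()
model1.fit(X_train, y_train)

# Rows are true classes (0 = malignant, 1 = benign), columns are predictions
cm = confusion_matrix(y_test, model1.predict(X_test))
print(cm)
```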
- Feature scaling (unity-based normalization).
- Grid search for SVM parameter optimization:
  - C parameter: controls the trade-off between classifying training points correctly and having a smooth decision boundary.
    - Small C (loose): makes the cost of misclassification low (soft margin).
    - Large C (strict): makes the cost of misclassification high, forcing the model to fit the training data more strictly and potentially overfit.
  - Gamma parameter: controls how far the influence of a single training example reaches.
    - Large gamma: close reach (closer data points carry more weight).
    - Small gamma: far reach (a more generalized solution).
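The two steps above can be sketched with `MinMaxScaler` (unity-based normalization to [0, 1]) and `GridSearchCV`; the grid values for `C` and `gamma` here are illustrative assumptions, not necessarily the ones used in the project:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

# MinMaxScaler rescales each feature to [0, 1]; wrapping it in a pipeline
# refits the scaler inside each CV fold, so test data never leaks into training
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Illustrative grid over C (margin hardness) and gamma (kernel reach)
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```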
1. Results After Feature Scaling
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0.0 | 1.00 | 0.92 | 0.96 | 48 |
| 1.0 | 0.94 | 1.00 | 0.97 | 66 |
| accuracy | | | 0.96 | 114 |
| macro avg | 0.97 | 0.96 | 0.96 | 114 |
| weighted avg | 0.97 | 0.96 | 0.96 | 114 |
2. Results After Grid Search
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0.0 | 1.00 | 0.92 | 0.96 | 48 |
| 1.0 | 0.94 | 1.00 | 0.97 | 66 |
| accuracy | | | 0.96 | 114 |
| macro avg | 0.97 | 0.96 | 0.96 | 114 |
| weighted avg | 0.97 | 0.96 | 0.96 | 114 |
In this case, the grid search parameter optimization did not change the results obtained with feature scaling alone.
We now have only 4 type I errors and 0 type II errors.
- Built a model that can classify between benign and malignant tumors.
- The model achieved a precision of 97%, with only 4 type I errors and 0 type II errors. There is still room for improvement.
What I have learned:

- How to implement a Support Vector Machine classifier
- Feature scaling
- Grid search for parameter optimization