Breast cancer classification using machine learning techniques has become an essential area of research for improving early detection and diagnosis. In this project, a Support Vector Machine (SVM) was trained to build a reliable model that accurately distinguishes between malignant and benign breast tumors.
The dataset comprises 569 cases: 212 labeled malignant and 357 labeled benign. Each case has 30 features, including mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, and mean concave points, among others.
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0.0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0.0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0.0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0.0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0.0 |
5 rows × 31 columns
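The table above is the head of a pandas DataFrame. A minimal sketch of how such a frame can be built from scikit-learn's bundled copy of the Wisconsin breast cancer dataset (the variable name `df` is an assumption about the project's setup):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the Wisconsin breast cancer dataset
data = load_breast_cancer()

# 30 numeric features plus the binary target (0 = malignant, 1 = benign)
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print(df.shape)                      # (569, 31)
print(df["target"].value_counts())  # 357 benign (1), 212 malignant (0)
```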
Data Visualization
0 - indicates Malignant (the life-threatening case)
1 - indicates Benign
Some observations:

- Looking at the distributions of mean radius, mean area, and mean perimeter, we see that malignant cases tend to be larger than benign ones.
- Looking at mean texture, we see that malignant cases have a higher mean texture than benign ones.
- Malignant cases also show a higher mean smoothness than benign ones, though the difference is smaller than for the size-related features.
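The observations above can be checked numerically with per-class feature means, rather than only from the distribution plots. A sketch, assuming `df` is the DataFrame built from scikit-learn's copy of the dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the project's DataFrame (an assumption about its setup)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Per-class means for the features discussed above (0 = malignant, 1 = benign)
group_means = df.groupby("target")[
    ["mean radius", "mean area", "mean texture", "mean smoothness"]
].mean()
print(group_means)
```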
Case count
Some observation:
- More Benign cases than Malignent in the dataset.
Correlation
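This section presumably showed a correlation heatmap; a minimal sketch of computing the underlying correlation matrix (assuming the same `df` as above):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Pairwise Pearson correlations between all 31 columns
corr = df.corr()

# Size-related features are almost perfectly correlated with one another,
# and negatively correlated with the target (larger tumors -> malignant, 0)
print(corr.loc["mean radius", "mean perimeter"])  # close to 1
print(corr.loc["mean radius", "target"])          # negative
```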
- Define matrix of features X and target y
- Split into training and testing data
- Fit the SVM model
# Matrix of features X and target y
X = df.drop(["target"], axis=1)
y = df["target"]

# 80/20 train/test split with a fixed random state for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

# Fit a default SVC (RBF kernel)
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
model1 = SVC()
model1.fit(X_train, y_train)
Confusion Matrix
Looking at these results:
- We have 0 type II errors, i.e. the model did not give any false negatives.
- We have 7 type I errors: the model gave 7 false positives.
- When a cell was malignant, the model correctly identified it in 41 cases; when the cell was benign, the model correctly identified it in 66 cases.
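The counts above come from a confusion matrix on the held-out test set. A sketch reproducing the evaluation with the same split and a default `SVC` (the exact counts can vary with the scikit-learn version, since the default `gamma` has changed over releases):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Same data and 80/20 split as in the project (random_state=5)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

# Default SVC on unscaled features
model1 = SVC()
model1.fit(X_train, y_train)

# Rows are true classes (0 = malignant, 1 = benign), columns are predictions
cm = confusion_matrix(y_test, model1.predict(X_test))
print(cm)
```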
- Feature scaling (unity-based normalization).
- Grid search for SVM parameter optimization:
  - C parameter: controls the trade-off between classifying training points correctly and having a smooth decision boundary.
    - Small C (loose): makes the cost of misclassification low (soft margin).
    - Large C (strict): makes the cost of misclassification high, forcing the model to fit the training data more strictly and potentially overfit.
  - Gamma parameter: controls how far the influence of a single training example reaches.
    - Large gamma: close reach (closer data points carry more weight).
    - Small gamma: far reach (a more generalized solution).
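The two steps above can be sketched with `MinMaxScaler` (unity-based normalization to [0, 1]) and `GridSearchCV`; the grid values for `C` and `gamma` here are illustrative assumptions, not necessarily the ones used in the project:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

# MinMaxScaler rescales each feature to [0, 1]; wrapping it in a pipeline
# refits the scaler inside each CV fold, so test data never leaks into training
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Illustrative grid over C (margin hardness) and gamma (kernel reach)
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```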
1. Results After Feature Scaling
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0.0 | 1.00 | 0.92 | 0.96 | 48 |
| 1.0 | 0.94 | 1.00 | 0.97 | 66 |
| accuracy | | | 0.96 | 114 |
| macro avg | 0.97 | 0.96 | 0.96 | 114 |
| weighted avg | 0.97 | 0.96 | 0.96 | 114 |
2. Results After Grid Search
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0.0 | 1.00 | 0.92 | 0.96 | 48 |
| 1.0 | 0.94 | 1.00 | 0.97 | 66 |
| accuracy | | | 0.96 | 114 |
| macro avg | 0.97 | 0.96 | 0.96 | 114 |
| weighted avg | 0.97 | 0.96 | 0.96 | 114 |
In this case, the grid search parameter optimization did not change the results obtained with feature scaling alone.
We now have only 4 type I errors and 0 type II errors.
- Built a model that can classify between benign and malignant tumors.
- The model achieved a precision of 97%, with only 4 type I errors and 0 type II errors. There is still room for improvement.
What I have learned:

- How to implement a Support Vector Machine classifier
- Feature scaling
- Grid search for parameter optimization