Application of different methods to predict mines on the following dataset:
https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
Requirements:
The program is written and tested with Python 3.6.8 and needs the following libraries to work:
Dataset Information:
The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.
Summary of the dataset:
The dataset contains 208 rows. The mean of the Class column is about 46.6%, so the dataset is roughly balanced, with slightly more rocks (0) than mines (1).
In detail, the counts are:
- 97 mines
- 111 rocks
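The class balance quoted above can be checked with a few lines of Python. This is a minimal sketch that rebuilds the `Class` column from the counts given in this README (97 mines coded as 1, 111 rocks coded as 0) instead of reading the actual CSV file; the variable names are illustrative only.

```python
from collections import Counter

# Rebuild the Class column from the counts reported above
# (assumption: 1 = mine, 0 = rock, as stated in this README).
labels = [1] * 97 + [0] * 111

counts = Counter(labels)
# The mean of a 0/1 column is simply the share of ones (mines).
mine_fraction = sum(labels) / len(labels)

print(f"rows: {len(labels)}")
print(f"mines: {counts[1]}, rocks: {counts[0]}")
print(f"mean of Class: {mine_fraction:.1%}")
```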
We load the dataset from the CSV file with Pandas, separate the Class column (the target) from the 60 feature columns, and split the data as follows: 70% for training and 30% for testing.
We split randomly because, in the CSV file, all the mines come first, followed by all the rocks.
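import pandas as pd
from sklearn.model_selection import train_test_split

# We get the data from the csv file
def prepare_dataset(path_csv):
    messages = pd.read_csv(path_csv, sep=';', encoding='latin-1')
    return messages

# Split dataset
def split_dataset(dataset):
    # Features are every column except the target column 'Class'
    dataset_features = dataset.drop(['Class'], axis=1)
    print(dataset_features)
    # Returns features_train, features_test, mines_train, mines_test
    return train_test_split(dataset_features, dataset['Class'], test_size=0.3, random_state=20)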
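The reason for shuffling before the split can be illustrated without the real data. The sketch below assumes only what this README states (97 mines listed first, then 111 rocks) and uses the standard library rather than scikit-learn, so the shuffled order will not match what `train_test_split` produces; the seed is reused purely for illustration.

```python
import math
import random

# Labels in file order (assumption taken from this README):
# all 97 mines (1) first, then all 111 rocks (0).
labels = [1] * 97 + [0] * 111
test_size = math.ceil(0.3 * len(labels))  # 30% of 208, rounded up

# Without shuffling, the test set is simply the tail of the file.
ordered_test = labels[-test_size:]

# With shuffling, both classes end up in the test set.
shuffled = labels[:]
random.Random(20).shuffle(shuffled)
shuffled_test = shuffled[-test_size:]

print("test size:", test_size)
print("mines in ordered test set:", sum(ordered_test))
print("mines in shuffled test set:", sum(shuffled_test))
```

An unshuffled split would evaluate the models on rocks only, which makes precision and recall for the mine class meaningless.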
Precision: true positives / (true positives + false positives). It answers "What proportion of positive identifications was actually correct?"
Recall: true positives / (true positives + false negatives). It answers "What proportion of actual positives was identified correctly?"
Accuracy: (true positives + true negatives) / Total. It is the fraction of predictions our model got right.
Receiver Operating Characteristic (ROC): a graphical plot that illustrates the diagnostic ability of a binary classifier. The single percentage reported for each model is the area under this curve (AUC).
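The definitions above can be turned into a small helper that works directly on the four cells of a confusion matrix. This is a sketch, not the code used to produce the results in this README: the function name `classification_metrics` is hypothetical, and the ROC value assumes the score is computed from hard 0/1 predictions, in which case the area under the curve equals the mean of the true positive rate and the true negative rate. The example call uses the Naive Bayes counts reported further down.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return precision, recall, accuracy and ROC AUC for hard 0/1 predictions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # With hard labels the ROC curve has a single interior point, so the
    # area under it reduces to the mean of the two rates.
    roc_auc = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "roc_auc": roc_auc,
    }


# Counts taken from the Naive Bayes confusion matrix in this README.
scores = classification_metrics(tn=29, fp=9, fn=10, tp=15)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```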
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes classifier
def naiveBayesMethod(features_train, features_test, mines_train):
    classifier = MultinomialNB()
    classifier.fit(features_train, mines_train)
    return classifier.predict(features_test)
Confusion matrix:
| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 29 | 9 |
| Actual positive | 10 | 15 |
Precision: 15/(15+9) = 62.5%
Recall: 15/(15+10) = 60%
Accuracy: (29+15)/(29+9+10+15) = 69.8%
ROC: 68.1%
Logistic Regression is well suited to problems where the outcome is a binary class, as in our case.
from sklearn.linear_model import LogisticRegression

# Logistic Regression Method
def logistic_regression(features_train, features_test, mines_train):
    mines_model = LogisticRegression(solver='lbfgs', max_iter=200)
    mines_model.fit(features_train, mines_train)
    return mines_model.predict(features_test)
Confusion matrix:
| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 32 | 6 |
| Actual positive | 7 | 18 |
Precision: 18/(18+6) = 75%
Recall: 18/(18+7) = 72%
Accuracy: (32+18)/(32+6+7+18) = 79.4%
ROC: 78.1%
To build this neural network I used Keras, a neural network API written in Python.
from keras.models import Sequential
from keras.layers import Dense

# Neural network
def neuralNetwork(features_train, features_test, mines_train):
    classifier = Sequential()
    classifier.add(Dense(300, activation='relu', kernel_initializer='random_normal', input_dim=60))  # First hidden layer, 60 input features
    classifier.add(Dense(300, activation='relu', kernel_initializer='random_normal'))  # Hidden layer
    classifier.add(Dense(300, activation='relu', kernel_initializer='random_normal'))  # Hidden layer
    classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))  # Output layer
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    classifier.fit(features_train, mines_train, batch_size=32, epochs=300)
    y_pred = classifier.predict(features_test)
    # Threshold the sigmoid output at 0.5 to get a boolean class prediction
    y_pred = (y_pred > 0.5)
    return y_pred
Confusion matrix:
| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 34 | 4 |
| Actual positive | 7 | 18 |
Precision: 18/(18+4) = 81.8%
Recall: 18/(18+7) = 72%
Accuracy: (34+18)/(34+4+7+18) = 82.5%
ROC: 84.7%
Looking at the results, the Neural Network performs best, followed by the Logistic Regression and finally the Naive Bayes classifier.
I could probably obtain better results by tuning the parameters of each method in detail, but the results of the Neural Network and the Logistic Regression are already reasonable.
I could also try other methods such as Random Forest, which could be a good fit for this problem.
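As a starting point for that last idea, a Random Forest could be plugged in with the same function signature as the other methods. This is an untested sketch and not part of the project: the function name `random_forest` and the values of `n_estimators` and `random_state` are illustrative assumptions, and the data below is random stand-in data, not the sonar dataset, so the predictions carry no meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


# Random Forest method (hypothetical addition, same signature as the others)
def random_forest(features_train, features_test, mines_train):
    # n_estimators and random_state are illustrative, untuned choices
    classifier = RandomForestClassifier(n_estimators=100, random_state=20)
    classifier.fit(features_train, mines_train)
    return classifier.predict(features_test)


# Synthetic stand-in data (NOT the sonar dataset): 60 features in [0, 1)
rng = np.random.default_rng(20)
features_train = rng.random((40, 60))
mines_train = np.array([0, 1] * 20)
features_test = rng.random((10, 60))

predictions = random_forest(features_train, features_test, mines_train)
print(predictions)
```

With the real data, the arrays would be replaced by the output of `split_dataset`, and the predictions scored with the same metrics as the other three methods.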