Introduction
The data relates to the direct marketing campaigns of a banking institution. The goal of this mini project is to train a neural network model to predict whether a client will respond positively or negatively to the campaign, that is, whether the client will subscribe to a bank term deposit. Note: the project was done in Google Colab, so before running the code you need to upload the .csv file to the Google Colab file explorer.
Data processing
The data was checked for the following, which were then removed:
- Duplicates
- “unknown” values
- Null values
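The three cleaning steps listed above can be sketched with pandas roughly as follows. This is only a minimal illustration on a toy table, not the notebook's actual code: the column names (age, job, deposit) and the values are assumptions, and it assumes that rows containing “unknown” or null values are dropped rather than imputed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the campaign data; column names and values are assumptions.
raw = pd.DataFrame({
    "age": [30, 30, 45, 52, np.nan],
    "job": ["admin.", "admin.", "unknown", "retired", "services"],
    "deposit": ["no", "no", "no", "yes", "no"],
})

# 1. Drop exact duplicate rows.
deduped = raw.drop_duplicates()

# 2. Treat the literal string "unknown" as a missing value.
marked = deduped.replace("unknown", np.nan)

# 3. Drop every row that still has a missing value.
clean = marked.dropna().reset_index(drop=True)

print(clean)
```

On this toy table only the "admin." and "retired" rows survive, since the others are a duplicate, an “unknown” job and a missing age.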
The deposit data was found to be imbalanced: far more clients do not subscribe to a deposit than do.
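A quick way to quantify such an imbalance is to look at the share of each class. The snippet below is a small sketch on made-up labels; the name "deposit" for the target column is an assumption.

```python
import pandas as pd

# Toy labels; "deposit" as the target column name is an assumption.
deposit = pd.Series(["no", "no", "no", "no", "yes"], name="deposit")

# Share of each class; a large gap between the shares signals imbalance.
class_share = deposit.value_counts(normalize=True)
print(class_share)
```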
Fig. 1. Deposit distribution overall.
Furthermore, each feature was examined against the deposit outcome (fig. 2, 3, 4, 5). For the full analysis of the categories, please see the Google Colab notebook.
Fig. 2. Deposit distribution in every feature.
Fig. 3. Deposit distribution in every feature.
Fig. 4. Heat map
Fig. 5. Box plot visualization for age.
Conclusion: overall, we can see that among all clients, older people are more responsive to deposit marketing. We also note that emp.var.rate, cons.price.idx, euribor3m and nr.employed are highly correlated with deposit status.
Next, the data has to be processed before it can be used to train the model. The columns duration, campaign, month, day_of_week and contact were dropped because they were judged not relevant to predicting a client's deposit status.
Because the data contains categorical features, One Hot Encoding was used to convert each category into its own numeric 0/1 column. This makes the data usable by the model and easy to rescale, and numeric inputs make it easier to estimate a probability for each outcome. An example of One Hot Encoded data is shown in fig. 6.
Fig. 6. One Hot Encoded data.
The data is now ready for building the model.
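The column dropping and One Hot Encoding described above could look roughly like this in pandas. It is a sketch on toy rows: only the names of the dropped columns come from the text, while the remaining columns (age, marital) and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows; only the dropped column names come from the text,
# the other columns and all values are assumptions.
df = pd.DataFrame({
    "age": [30, 52, 41],
    "marital": ["married", "single", "married"],
    "duration": [120, 300, 45],
    "campaign": [1, 2, 1],
    "month": ["may", "jun", "may"],
    "day_of_week": ["mon", "tue", "fri"],
    "contact": ["cellular", "telephone", "cellular"],
})

# Drop the columns excluded from the model.
dropped = ["duration", "campaign", "month", "day_of_week", "contact"]
features = df.drop(columns=dropped)

# One-hot encode the remaining categorical column; numeric columns pass through.
encoded = pd.get_dummies(features, columns=["marital"], dtype=int)
print(encoded)
```

Each category of marital becomes its own 0/1 column (marital_married, marital_single), while age is left unchanged.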
Modelling
For network modelling I used Keras, an open-source software library that provides a Python interface for artificial neural networks. The performance of the model was improved by changing the selected features and deleting irrelevant ones (for example, ‘contact’). The model was then evaluated: accuracy on the test set reached 0.8569, and the difference between train set and test set accuracy was small, which suggests the model was not overfitted.
Fig. 7. Accuracy scores of the model.
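To illustrate the train/test accuracy comparison described above, here is a minimal Keras sketch. It is not the architecture used in the notebook: the layer sizes, the number of epochs, the 160/40 split and the synthetic data are all assumptions made so that the example is self-contained.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data: 200 rows, 10 already-encoded features (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Small feed-forward binary classifier; layer sizes are illustrative only.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)

# Compare train and test accuracy; a small gap suggests little overfitting.
_, train_acc = model.evaluate(X_train, y_train, verbose=0)
_, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"train={train_acc:.4f} test={test_acc:.4f} gap={train_acc - test_acc:.4f}")
```

The single sigmoid output gives the predicted probability of a subscription, which is why binary cross-entropy is used as the loss here.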
The model was also evaluated using the ROC curve and AUC. The higher the AUC value, the better the model separates the two classes. The AUC for the train set is 0.7219. Note that this is not the accuracy of the model.
Fig. 7. AUC scores of the model.
The ROC curve was also visualized:
Fig. 8. ROC plot.
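To make the difference between AUC and accuracy concrete, the sketch below builds the ROC points and the AUC by hand in plain Python. The helper names roc_points and auc, as well as the labels and scores, are made up for illustration; the notebook itself presumably relies on library functions for this.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.45]

curve = roc_points(y_true, scores)
accuracy = sum((s >= 0.5) == bool(t) for t, s in zip(y_true, scores)) / len(y_true)
print(curve)
print("AUC:", auc(curve), "accuracy at 0.5:", accuracy)
```

On this toy data the AUC is 0.75 while the accuracy at a 0.5 threshold is only 0.5, because AUC measures how well the scores rank positives above negatives across all thresholds, not how many predictions are correct at one threshold.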
Furthermore, the network architecture was also visualized with the help of keras.utils:
Fig. 9. Network architecture.
Conclusion
While working on the project, I ran into several bottlenecks.
• The data type
Most variables in the dataset were categorical, but machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. For that reason, I used One Hot Encoding to convert the data into numeric variables. An alternative is to use another encoding method, such as learned embeddings for categorical data.
• Network visualization
Unfortunately, the attractive visualizer ANN_viz is no longer supported by Keras, so I had to find another way to visualize the network. I used the visualization built into keras.utils instead.
In conclusion, a neural network model was built and used to predict the deposit status of clients.