Abstract from my September 2020 master's dissertation:
An investigation was conducted into a 40 GB, 326 million record event dataset. This dataset contained anonymised event information representing performance, availability and security issues of 172,000 network devices from approximately 150 different customers. It was hypothesised that network device event data gathered from one customer environment could be used to predict events in another customer environment. After analysis of the dataset, a binary model was developed to predict when a process might request too much compute resources on a device. The model was developed on one set of customer data and tested on another unseen set of customer data. The Matthews correlation coefficient for the model on the unseen test data was 0.66, the F1 score was 0.72, and the False Negative rate was 27%. This was a substantial improvement over a model with no skill.
If you need something to read before you go to sleep, the full dissertation is at dissertation.pdf
-
Data
- /data/data1k.csv
- /data/data1m.csv
- /data/long_cpu_hog_prod126.csv
-
Graphs
- /code/graphs.R - some of this won't work because of MySQL dependency
-
Data Preparation - needs MySQL DB
- /code/script1.sql
- /code/script2.sql
-
Data Manipulation
- /code/data_prep_cpu_hog-exp1.r
- /code/data_prep_cpu_hog-exp2.r
- /code/data_prep-exp3.r
-
Train and Test
- /code/multivariate_cpu_hog_labels.ipynb
- /code/multivariate_cpu_hog_module.ipynb
- /code/xgboost_exp3.ipynb
Does my code really work? Try it here:
- Download and unzip the data files (You'll need an app that handles split zipped files; I used PeaZIP)
- Run graphs.R files (some parts won't work because of the RStudio MySQL DB connector dependency)
- Run Data Manipulation Code
- Run Train & Test Code (you may need to make some edits if you don't have NVIDIA CUDA installed)
Some example graphs from the paper.
Conslusions from the dissertation:
To summarise, a machine learning classifier was developed for predicting a CPU hogging issue using a network event dataset. This data was generated by the Connected TAC service provided by Cisco Systems. The classifier was trained on one set of customer data and tested on an unseen set of data from other customer’s environments. Even though that dataset was not developed specifically for event prediction, the classifier was found to have some efficacy in predicting CPU hogging events.
The current classifier would need to be refined and developed further prior to production. However, if implemented in real-time, a crowdsourced prediction classifier could potentially be used to complement the existing knowledge-based Connected TAC service.
In addition, it is hypothesised that the methodology could be extended to other devices and other external performance-related issues, such as memory. However, it is unknown if it could be applied to internal issues like configuration errors. Perhaps approaches like process mining, which attempts to discover dependencies between events, might be more successful in exposing those dependencies with configuration errors.