
rimtouny / dynamic-dns-traffic-analysis-for-data-exfiltration-detection-with-kafka


Crafting static and dynamic models for data exfiltration detection via DNS traffic analysis. Static model trained on batch data, while dynamic model simulates a continuous stream. Rigorous analysis, feature engineering, and model training conducted. Implementation part of AI for Cyber Security Master's assignment at the University of Ottawa, 2023.

License: MIT License

Jupyter Notebook 100.00%
anova chi-square dns dynamic-model f1-score hyperparameter-tuning kafka mutual-information pipeline rfe

dynamic-dns-traffic-analysis-for-data-exfiltration-detection-with-kafka's Introduction

Enhanced Data Exfiltration Detection via Dynamic DNS Traffic Analysis using Kafka

Crafting static and dynamic models for data exfiltration detection via DNS traffic analysis using Kafka. The static model is trained on batch data (Static_dataset.csv), while the dynamic model is evaluated on Kafka_dataset.csv, which is treated as a continuous data stream served by a local Kafka server. Rigorous analysis, feature engineering, and model training were conducted. The implementation is part of an AI for Cyber Security Master's assignment at the University of Ottawa, 2023.

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Binary classification problem

The task is to detect data exfiltration through DNS traffic analysis, labeling each record as attack (1) or no attack (0).

Independent Variables:

  • 'timestamp': The time at which the data was recorded.
  • 'FQDN_count': The count of fully qualified domain names.
  • 'subdomain_length': The length of the subdomain.
  • 'upper': The count of uppercase characters.
  • 'lower': The count of lowercase characters.
  • 'numeric': The count of numeric characters.
  • 'entropy': Entropy value.
  • 'special': The count of special characters.
  • 'labels': The count of labels.
  • 'labels_max': Maximum count of labels.
  • 'labels_average': Average count of labels.
  • 'longest_word': The longest word in the subdomain.
  • 'sld': Second-level domain.
  • 'len': Length of the subdomain.
  • 'subdomain': The subdomain.

Target variable:

  • 'Target Attack' : Target Attack label, where 1 indicates an attack and 0 indicates no attack
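As a quick illustration of the schema above, the following is a minimal sketch of separating the features from the target and checking class balance. The two rows are made-up placeholder values standing in for `pd.read_csv("Static_dataset.csv")`, and the helper name `split_features_target` is hypothetical; only the column names come from the lists above.

```python
import pandas as pd

# Column names as listed above; the target name is taken from this README.
FEATURES = [
    "timestamp", "FQDN_count", "subdomain_length", "upper", "lower",
    "numeric", "entropy", "special", "labels", "labels_max",
    "labels_average", "longest_word", "sld", "len", "subdomain",
]
TARGET = "Target Attack"


def split_features_target(df: pd.DataFrame):
    """Return (X, y) and fail early if an expected column is missing."""
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[FEATURES], df[TARGET].astype(int)


# Made-up rows standing in for pd.read_csv("Static_dataset.csv").
demo = pd.DataFrame(
    [
        ["00:00.1", 27, 10, 0, 10, 11, 2.57, 6, 6, 7, 3.67, "2", "192", 14, 1, 1],
        ["00:00.2", 15, 0, 0, 12, 0, 2.20, 2, 3, 6, 4.33, "local", "example", 11, 0, 0],
    ],
    columns=FEATURES + [TARGET],
)

X, y = split_features_target(demo)
class_counts = y.value_counts().to_dict()  # attack vs. non-attack counts
```

Checking `class_counts` before training is a cheap way to spot class imbalance, which matters here because the models are compared by F1-score.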

Key Tasks Undertaken

  • Static Model

    1. Data Analysis:

      • Loaded and explored the "Static_dataset.csv."

      • Utilized various statistical tools and visualizations to understand feature distributions, identify imbalances, and assess the characteristics of numerical and categorical variables.

      • Employed histograms, QQ plots, and boxplots for a comprehensive analysis of numerical features.

      • Examined the count of attack and non-attack cases for categorical features through count plots.

    2. Feature Engineering and Data Cleaning:

      • Analyzed the dataset for string variables and performed necessary transformations.
      • Handled missing values, removed duplicate rows, and dropped unnecessary features.
      • Applied embedding techniques to encode categorical variables, maintaining interpretability.
    3. Feature Filtering/Selection:

      • Employed different statistical techniques, including Mutual Information, ANOVA F-values, Chi-squared scores, and RandomForest-based Recursive Feature Elimination (RFE).
      • Selected relevant features based on the results of the feature selection techniques.
    4. Model Selection:

      • Split the data into training and test sets.
      • Applied normalization using StandardScaler.
      • Chose three machine learning models for evaluation: Random Forest, Logistic Regression, and XGBoost.
      • Configured each model with default parameters.

    5. Performance Evaluation:

      • Used the F1-score to pick the best combination of feature selection technique and model.

      Number of best features:

      • The best F1-score was obtained with Mutual Information feature selection on the Random Forest model.
        selected_features=['FQDN_count','entropy','labels','labels_average','longest_word','lower','sld','special']
    6. Hyperparameter Tuning & Model Evaluation: using selected_features from Mutual Information.

      Best hyperparameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
      Best hyperparameters for Logistic Regression: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
      Best hyperparameters for XGBoost (Extreme Gradient Boosting): {'colsample_bytree': 0.8, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}

    7. Champion Static Model:

    8. Save the Champion Model for the Dynamic phase.
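The static phase described above (scaling, Mutual Information feature selection, F1-based comparison, grid search, and saving the champion) could be sketched roughly as follows. This is not the notebook's code: it uses synthetic data from `make_classification` in place of the cleaned Static_dataset.csv, small illustrative parameter grids, and it omits XGBoost to stay within scikit-learn. It also places feature selection inside the pipeline, whereas the README tunes on a feature list chosen beforehand.

```python
import pickle
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned static dataset (assumption: 14 numeric features).
X, y = make_classification(n_samples=600, n_features=14, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Mutual Information scorer with a fixed seed so runs are repeatable.
mi_score = partial(mutual_info_classif, random_state=0)

# Illustrative grids only; the README reports different best values.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"randomforestclassifier__n_estimators": [50, 100]},
    ),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"logisticregression__C": [0.1, 1.0]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Scale, keep the 8 highest-MI features, then fit the classifier.
    pipe = make_pipeline(StandardScaler(), SelectKBest(mi_score, k=8), model)
    search = GridSearchCV(pipe, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)
    test_f1 = f1_score(y_test, search.predict(X_test))
    results[name] = (test_f1, search.best_estimator_)

# The champion is the candidate with the highest held-out F1-score.
champion_name = max(results, key=lambda n: results[n][0])
champion_f1, champion = results[champion_name]

# Serialize the champion; pickle.dump(champion, file) would write it to disk instead.
champion_bytes = pickle.dumps(champion)
restored = pickle.loads(champion_bytes)
```

Keeping the scaler and selector inside the pipeline means the pickled object carries its preprocessing with it, which simplifies reuse in the dynamic phase.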

  • Dynamic Model

    1. Kafka Consumer Setup:

      • Created a Kafka consumer instance for 'ml-raw-dns' topic, connecting to a Kafka broker on 'localhost:9092'.
      • Configured the consumer to start from the earliest offset and use manual offset committing.
    2. Data Retrieval and Adjustment:

      • Implemented a function to retrieve 1000 records from the Kafka consumer.
      • Utilized the retrieved data to create a DataFrame with predefined columns.
    3. Data Cleaning: as done in Static Model.

      • Defined functions for adjusting and cleaning data, including converting categorical values to numerical indices.
      • Dropped unnecessary columns and converted the DataFrame to a consistent data type.
    4. Model Loading and Retraining:

      • Loaded a pre-trained Random Forest model from a pickle file.
      • Initialized both static and dynamic models with the loaded model.
    5. Dynamic Model Evaluation and Retraining:

      • Simulated continuous data processing over 199 iterations.
      • Evaluated the dynamic model's F1 score without retraining for each iteration.
      • Retrained the dynamic model if its F1 score fell below 0.80 and updated it with new training data.
    6. Static Model Evaluation:

      • Evaluated the F1 score of the static model for each iteration without retraining.
    7. Performance Comparison Visualization:

      • Plotted F1 scores of the dynamic model across iterations to observe its performance over time.

      • Plotted F1 scores of the static model across iterations for comparison.

      • Plotted F1 scores of both models on the same plot for a comprehensive comparison.
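The evaluate-then-retrain loop from the dynamic phase can be illustrated with a small simulation. This is a sketch under stated assumptions: `make_batch` is a hypothetical generator of drifting synthetic windows that replaces the 1000-record Kafka batches, the loop runs 10 iterations instead of 199, and retraining refits on the current window only (the README does not say how the new training data is assembled).

```python
import copy

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

F1_THRESHOLD = 0.80  # retraining threshold stated in the README
rng = np.random.default_rng(0)


def make_batch(n, drift):
    """Hypothetical stand-in for one window of cleaned Kafka records."""
    X = rng.normal(size=(n, 4))
    # The decision boundary rotates as `drift` grows, simulating concept drift.
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    return X, y


# Both models start from the same pre-trained Random Forest.
X0, y0 = make_batch(1000, drift=0.0)
static_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X0, y0)
dynamic_model = copy.deepcopy(static_model)

static_f1, dynamic_f1, retrain_count = [], [], 0
for i in range(10):
    X, y = make_batch(1000, drift=0.4 * i)

    # The static model is only ever evaluated, never updated.
    static_f1.append(f1_score(y, static_model.predict(X)))

    # The dynamic model is scored before any retraining on this window.
    score = f1_score(y, dynamic_model.predict(X))
    dynamic_f1.append(score)

    # Retrain only when performance drops below the threshold.
    if score < F1_THRESHOLD:
        dynamic_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
        retrain_count += 1
```

Plotting `static_f1` and `dynamic_f1` against the iteration index with matplotlib gives the kind of comparison described in the visualization step.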
