Naive Bayes is derived from Bayes' theorem. It assumes that the attributes are independent of one another. This assumption greatly simplifies the calculations compared with the full Bayesian treatment, which makes it computationally efficient.
Over any IoT network, packet traces are highly likely to repeat. Rule generation here is therefore a probabilistic task, since any recurrent event has a probability associated with it. There is also no need to model correlations between attributes, because the attributes of a rule are independent of each other. For this reason, Naive Bayes is chosen over the full Bayes model.
Naive Bayes specializes in probabilistic prediction and is widely used for such tasks. It is known to predict recurrent patterns with high probabilistic accuracy. Although Naive Bayes is mostly used for classification, its underlying principle can be adapted to any pattern that can be described by probabilities. Rule generation is one such task.
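For reference, the independence assumption described above can be written out as follows. The symbols are notation introduced here, not taken from the project: C stands for the outcome being predicted and x_1, ..., x_n for the observed attributes.

```latex
\begin{align}
P(C \mid x_1, \dots, x_n) &= \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \\
P(x_1, \dots, x_n \mid C) &= \prod_{i=1}^{n} P(x_i \mid C) \\
P(C \mid x_1, \dots, x_n) &\propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
\end{align}
```

The first line is Bayes' theorem, the second is the naive independence assumption, and the third is what remains once the denominator is dropped: a prior multiplied by one simple factor per attribute, which is why the computation is cheap.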
Rule generation is broken down into five parts:
- JSON Walk
- Segregation
- Generate Bayesian Frequency Table
- Compute Naive Bayes for Each Rule Attribute
- Generate the Rule
The JSON walk goes through the JSON file and picks up the desired attributes from the entire Static Header Context. It then calls segregate_attributes for each field and each of its attributes.
# Walk the json file and collect the fieldname and the corresponding attributes
for rule in data["rules"]:
    for flow in rule["flows"]:
        for entry in flow["ruleEntries"]:
            if not entry["fieldName"].lower().startswith("coap"):
                field = entry["fieldName"]
                META_DATA[f'{field}_NOS'] += 1
                for attribute, value in entry.items():
                    if attribute != "fieldName":
                        segregate_attributes(field, attribute, value)
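To make the shape of the input concrete, here is a minimal, self-contained sketch of the same walk. The sample document is hypothetical: it only mirrors the keys that the walk reads (`rules`, `flows`, `ruleEntries`, `fieldName`), and the field names and values in it are invented for illustration. The `walk` function is also an assumption; it collects triples instead of calling `segregate_attributes`.

```python
from collections import defaultdict

# Hypothetical input, shaped only after the keys the walk reads.
data = {
    "rules": [
        {
            "flows": [
                {
                    "ruleEntries": [
                        {"fieldName": "IP4_VERSION", "targetValue": [4], "fieldPosition": 1},
                        {"fieldName": "COAP_TYPE", "targetValue": [0], "fieldPosition": 1},
                    ]
                }
            ]
        }
    ]
}


def walk(data):
    """Return per-field entry counts and the (field, attribute, value) triples seen."""
    counts = defaultdict(int)
    triples = []
    for rule in data["rules"]:
        for flow in rule["flows"]:
            for entry in flow["ruleEntries"]:
                field = entry["fieldName"]
                if field.lower().startswith("coap"):
                    continue  # fields starting with "coap" are skipped
                counts[f"{field}_NOS"] += 1
                for attribute, value in entry.items():
                    if attribute != "fieldName":
                        triples.append((field, attribute, value))
    return dict(counts), triples


counts, triples = walk(data)
```

Using a `defaultdict(int)` for the counters means a field does not need to be registered before its first entry is counted.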
Each attribute is segregated by field. For example, the attributes of IP4_ADDRESS are targetValue, cdactionFunction, ..., fieldPosition.
# Segregate each attribute by field, then update the corresponding Bayesian frequency table
def segregate_attributes(field, attribute, value):
    row = assign_index_based_on_field(field)
    if attribute == "targetValue":
        # targetValue is stored as a list; only its first element is used
        value = value[0]
        targetValue[row].append(value)
        generate_bayesian_frequency_table(field, row, value, b_f_targetValue)
    elif attribute == "cdactionFunction":
        # .
        # .
        # .
For each attribute, segregate_attributes then calls generate_bayesian_frequency_table. This function builds the frequency table by keeping track of every occurrence of each attribute value. These tables are needed later to compute the Naive Bayes probabilities.
# Generate the Bayesian frequency table
def generate_bayesian_frequency_table(field, row, value, b_f_table):
    if value not in b_f_table[row]:
        # first occurrence of this value for the field
        b_f_table[row][value] = 1
    else:
        b_f_table[row][value] += 1
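The same bookkeeping can be sketched with `collections.Counter` from the standard library, which starts every unseen key at zero. This is an alternative illustration, not the project's code: `count_value` is a hypothetical name, and the row indices and values are made up.

```python
from collections import Counter

# One Counter per field row; two rows are enough for the illustration.
b_f_targetValue = [Counter(), Counter()]


def count_value(row, value, b_f_table):
    """Record one more occurrence of `value` for the field at index `row`."""
    b_f_table[row][value] += 1


for row, value in [(0, 4), (0, 4), (0, 6), (1, 64)]:
    count_value(row, value, b_f_targetValue)
```

After the loop, row 0 has seen the value 4 twice and the value 6 once, and row 1 has seen 64 once.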
The function compute_naive_bayes_probability computes the Naive Bayes probability from the tables produced by generate_bayesian_frequency_table and from the META_DATA collected about the rules during the JSON walk.
# Compute the Naive Bayes probability per attribute value from the Bayesian frequency table
# P(IP4_vERSION | targetValue=4) .... P(IP4_vERSION | fieldPosition=1)
def compute_naive_bayes_probability(b_f_table):
    for index, field_wise_meta in enumerate(b_f_table):
        for key, value in field_wise_meta.items():
            # relative frequency of this value among the entries seen for the field
            prob_of_attribute = field_wise_meta[key] / META_DATA[f'{fieldName[index]}_NOS']
            # weight by a uniform prior over the field names
            prob_of_choosing_attribute_given_field_name = prob_of_attribute * (1 / len(fieldName))
            # overwrite the count with the probability (keys are unchanged, so this is safe while iterating)
            field_wise_meta[key] = prob_of_choosing_attribute_given_field_name
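A small worked example may help here. The sketch below applies the same arithmetic (count divided by the number of entries for the field, times one over the number of fields) to a single frequency table. The function name `score_table` and all the numbers are assumptions chosen for illustration.

```python
def score_table(freq_table, n_entries, n_fields):
    """Turn occurrence counts into (count / n_entries) * (1 / n_fields)."""
    return {
        value: (count / n_entries) * (1 / n_fields)
        for value, count in freq_table.items()
    }


# Hypothetical field seen in 4 rule entries, with 2 fields overall.
freq_table = {4: 3, 6: 1}
scores = score_table(freq_table, n_entries=4, n_fields=2)
best_value = max(scores, key=scores.get)
```

Because the factor `1 / n_fields` is the same for every value, it scales all scores equally and does not change which value ranks highest within a field.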
The rule is finally generated by picking the most probable attribute values per fieldName. This is done with the helper function sort_b_f_table_by_probability.
# Generate the best rule from the Naive Bayes probabilities.
# After sorting the b_f_table, pick up the first element
# from each attribute. This is the element with the highest probability.
def generate_rule_by_naive_bayesian_combinations(B_F_TABLES):
    for b_f_table in B_F_TABLES:
        sort_b_f_table_by_probability(b_f_table)
    COMBINATION_MATRIX = [[] for _ in range(len(B_F_TABLES))]
    # TODO: CREATE A COMBINATION OF BAYESIAN PROBABILITIES
    for i in range(len(B_F_TABLES)):
        for element in B_F_TABLES[i]:
            for key, value in element.items():
                # keep only the first (most probable) key of each sorted table
                COMBINATION_MATRIX[i].append(key)
                break
    # print the rules
    print(COMBINATION_MATRIX)
The helper function used:
def sort_b_f_table_by_probability(b_f_table):
    for i, table in enumerate(b_f_table):
        # Order each field's entries by descending probability.
        # Entries with equal probability are all kept, in their original order.
        b_f_table[i] = dict(sorted(table.items(), key=lambda item: item[1], reverse=True))
There is scope to generate multiple rules in order of decreasing probability, which would lead to more diverse rule generation. As is evident, all that COMBINATION_MATRIX does right now is hold the single most probable rule derived from the rules it learnt.
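One possible way to approach that extension is sketched below. It is not part of the current program: `top_k_combinations` is a hypothetical helper, each "slot" stands for one sorted probability table, and the values and probabilities are invented. It ranks every combination of one value per slot by the product of the individual probabilities, in line with the independence assumption.

```python
from itertools import product
from math import prod


def top_k_combinations(slots, k):
    """Rank every choice of one value per slot by the product of its probabilities."""
    candidates = []
    for choice in product(*(slot.items() for slot in slots)):
        values = [value for value, _ in choice]
        score = prod(p for _, p in choice)
        candidates.append((score, values))
    # highest combined probability first; ties keep their enumeration order
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates[:k]


# Hypothetical tables: value -> probability.
slots = [{4: 0.75, 6: 0.25}, {"action-A": 0.5, "action-B": 0.25}]
top = top_k_combinations(slots, 2)
```

This enumerates every combination, so it only suits small tables; the number of candidates grows multiplicatively with the number of slots and values.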
All this program does is output the best-suited values for the rule. To make sense of these values, the rule has to be parsed.
Rule parsing is a fairly convoluted process of identifying the correct corresponding attribute and inserting the raw values. The function parse_predicted_rule does this. It uses the helper function parse_matching_operator to fill in the values of MatchingOperator, because MatchingOperator takes a dictionary as its value while our values are discrete. The code below shows this in more detail:
# Parse the obtained Naive Bayes values into rules
def parse_predicted_rule(best_rule):
    # one entry per field, filled in attribute by attribute
    RULE = [{"fieldName": name} for name in fieldName]
    for row in range(len(best_rule)):
        for i in range(len(fieldName)):
            if row == 0:
                # targetValue is stored as a list
                RULE[i]['targetValue'] = [best_rule[row][i]]
            elif row == 1:
                RULE[i]['cdactionFunction'] = best_rule[row][i]
            elif row == 2:
                RULE[i]['matchingOperator'] = parse_matching_operator(best_rule, row, i)
            elif row == 3:
                RULE[i]['fieldLength'] = best_rule[row][i]
            elif row == 4:
                RULE[i]['direction'] = best_rule[row][i]
            elif row == 5:
                RULE[i]['fieldPosition'] = best_rule[row][i]
            else:
                raise IndexError(f'unexpected attribute row {row} in best_rule')
    return RULE
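The row-to-attribute mapping above can also be expressed as a table, which keeps the attribute order in one place. The sketch below is an illustrative variant, not the project's implementation: `build_rule` and `to_matching_operator` are hypothetical names, the dictionary shape returned by the stand-in lambda is invented (the real shape of MatchingOperator is not shown in this document), and the sample values are made up.

```python
# Attribute order taken from the row numbers used in parse_predicted_rule.
ATTRIBUTES = [
    "targetValue",
    "cdactionFunction",
    "matchingOperator",
    "fieldLength",
    "direction",
    "fieldPosition",
]


def build_rule(best_rule, field_names, to_matching_operator):
    """Transpose best_rule (one row per attribute) into one dict per field."""
    if len(best_rule) > len(ATTRIBUTES):
        raise IndexError("best_rule has more rows than known attributes")
    rule = [{"fieldName": name} for name in field_names]
    for attribute, row in zip(ATTRIBUTES, best_rule):
        for entry, value in zip(rule, row):
            if attribute == "targetValue":
                value = [value]  # targetValue is stored as a list
            elif attribute == "matchingOperator":
                value = to_matching_operator(value)  # discrete value -> dictionary
            entry[attribute] = value
    return rule


# Hypothetical predicted values: one row per attribute, one column per field.
best_rule = [
    [4, 64],
    ["action-A", "action-A"],
    ["equal", "ignore"],
    [4, 8],
    ["bi", "up"],
    [1, 1],
]
rule = build_rule(best_rule, ["FIELD_A", "FIELD_B"], lambda name: {"name": name})
```

Passing the matching-operator conversion in as a callable keeps the sketch independent of how parse_matching_operator actually builds its dictionary.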