echowei / deeptraffic


Deep Learning models for network traffic classification

License: Mozilla Public License 2.0

PowerShell 15.07% Python 77.76% Shell 7.17%

deep-learning cnn-model lstm-model malware-analysis encrypted-traffic traffic-analysis traffic-classification

deeptraffic's Issues

Few bugs in the code

I ran the whole process, and found (and fixed) some potential bugs.

1_Pcap2Session.ps1: foreach($f in gci 1_Pcap *.pcap)
This loop only handles .pcap files, but the dataset contains two file types: .pcap and .pcapng.
Fix: first convert the .pcapng files to .pcap using splitcap, then run this script.

2_ProcessSession.ps1 : $test = $files | get-random -count ([int]($count/10))
When $count is less than 10, this causes an error and $test keeps the value assigned in the previous loop iteration, so some data ends up wrongly classified.
Fix: ignore any .pcap file that has fewer than 10 packets.
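
The guard described in this fix can be sketched in Python rather than PowerShell. This is a minimal illustration, not the repository's script: `split_train_test` and `test_divisor` are hypothetical names, and the divisor of 10 only mirrors `[int]($count/10)` from the line quoted above.

```python
import random

def split_train_test(files, test_divisor=10, seed=None):
    """Return (train, test); return two empty lists when the class is too small."""
    n_test = len(files) // test_divisor
    if n_test == 0:
        # Too few files to draw a test sample: skip this class explicitly
        # instead of silently reusing the test set of a previous iteration.
        return [], []
    rng = random.Random(seed)
    test = rng.sample(files, n_test)
    test_set = set(test)
    train = [f for f in files if f not in test_set]
    return train, test
```

Returning a fresh pair on every call means a small class can never inherit the previous class's test files, which is the failure mode reported above.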

CNN.py : y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, w_fc2) + b_fc2)
This can produce NaN or zero gradients if "tf.matmul(h_fc1_drop, w_fc2) + b_fc2" is all zero or NaN, so after enough training iterations all weights can suddenly become 0.
Fix: use tf.nn.softmax_cross_entropy_with_logits instead. It handles the extreme case safely.
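
The reason a fused "softmax plus cross-entropy on logits" is safer can be illustrated without TensorFlow. The sketch below is not TensorFlow's implementation. It is a plain-Python illustration of the log-sum-exp idea, and both function names are made up for this example.

```python
import math

def softmax_cross_entropy(logits, labels):
    """Cross-entropy of one-hot `labels` against raw `logits`, via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # log_softmax(z_i) = z_i - logsumexp(z), so log(0) is never evaluated.
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))

def naive_cross_entropy(logits, labels):
    """Softmax first, then log: breaks when a probability underflows to 0."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(y * math.log(e / total) for y, e in zip(labels, exps))
```

For moderate logits the two functions agree. For logits such as `[-1000.0, 0.0]` with the label on the first class, the naive version tries to take `log(0.0)` and fails, while the log-sum-exp version returns a finite loss.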

Folder Structure for Training Data in Encrypted Traffic Classification Task

For the encrypted traffic classification task, the Dataset.txt and Png2Mnist.py files seem to imply that each class (label) should have its own folder with the associated pcap files inside (in other words, the label information is determined by the folder structure). However, the Pcap2Session and ProcessSession files seem to assume that all pcap files sit together in a single folder (for example, gci just looks within the single folder).

Maybe I am missing something about these assumptions?
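
To make the "one folder per class" reading concrete, here is a minimal sketch of deriving labels from folder names. It assumes a layout of `root/<class name>/<file>.pcap`, which is this issue's interpretation and not something the repository confirms. `label_pcaps_by_folder` is a hypothetical helper.

```python
from pathlib import Path

def label_pcaps_by_folder(root):
    """Map each .pcap under `root` to an integer label taken from its parent folder."""
    root = Path(root)
    samples = []
    # Only look one level down: root/<class name>/<capture>.pcap
    for pcap in sorted(root.glob("*/*.pcap")):
        samples.append((pcap, pcap.parent.name))
    classes = sorted({name for _, name in samples})
    index = {name: i for i, name in enumerate(classes)}
    return [(path, index[name]) for path, name in samples], classes
```

With a flat folder (all pcap files directly under `root`) this returns no samples at all, which is the mismatch between the two assumptions described above.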

Extremely low accuracy after training

The output of training and testing with 20 classes is as follows:

step 0, train accuracy 0
step 2000, train accuracy 0.9
step 4000, train accuracy 0.96
step 6000, train accuracy 0.94
step 8000, train accuracy 0.92
step 10000, train accuracy 0.98

2021-01-05 01:03:39
DATA_DIR: /PUBLIC/sakura/self_secuity/echowei/reproduction2/USTC-TK2016-ubuntu/5_Mnist
0, aimchat, 0.07194244604316546, 0.01020408163265306
1, AIM_Chat, 0.0, 0.0
2, browsing, 0.5, 0.003875968992248062
3, browsing2-1, 0.020446096654275093, 0.01089108910891089
4, browsing2-2, 0.12269129287598944, 0.09470468431771895
5, browsing2, 0.03184713375796178, 0.07286995515695067
6, browsing_ara, 0.0, 0.0
7, browsing_ara2, 0.0, 0.0
8, browsing_ger, 0.07142857142857142, 0.001026694045174538
9, Email_IMAP_filetransfer, 0.0, 0.0
10, AUDIO_spotifygateway, 0.0, 0.0
11, AUDIO_tor_spotify, 0.0, 0.0
12, AUDIO_tor_spotify2, 0.0, 0.0
13, BROWSING_gate_SSL_Browsing, 0.0, 0.0
14, BROWSING_ssl_browsing_gateway, 0.0, 0.0
15, BROWSING_tor_browsing_ara, 0.0, 0.0
16, BROWSING_tor_browsing_ger, 0.0, 0.0
17, BROWSING_tor_browsing_mam, -1, 0.0
18, BROWSING_tor_browsing_mam2, 0.0, 0.0
19, CHAT_aimchatgateway, 0.0, 0.0
Total accuracy: 0.0184

My environment is Ubuntu 18.04, so I used the Ubuntu branch for preprocessing.
My data processing steps were:

1.	pwsh 1_Pcap2Session.ps1 -f
2.	pwsh 2_ProcessSession.ps1 -a -s
3.	python 3_Session2Png.py 
4.	python 4_Png2Mnist.py

Did I get one of these steps wrong?
Thank you for open-sourcing this work. I look forward to your reply!

Pre-processing for HAST-IDS

Sir,
Can you provide the preprocessing tool and code to split the raw pcap into flows, just as you did for the malware and encrypted traffic classification?

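About the preprocessed data you provided

Hello author. After reading your paper I downloaded your code and the preprocessed data you provide, because I want to measure the classification accuracy of your proposed model. When I run "encrypt_traffic_cnn_1d.py", it reports that the number of classes is wrong. Do 6class.zip and 12class.zip in the preprocessed data really contain 6 and 12 classes respectively? I look forward to your reply and wish you all the best in life and work.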

The preprocessing files are missing; could you add the preprocessing code?

Pickles used to train HAST-IDS

Can you please share the sessions and labels pickles, or the code you used to generate them from the pcaps?

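Test results are very low when the training and test sets are produced with the preprocessing tool

The training and test sets I obtain from your PCAP data with the preprocessing tool give completely different results from the sets you provide directly. I do not know what went wrong. I suspect a problem with the preprocessing or the data, or that I missed a step, or that the preprocessing was done differently. My training and test sets differ from yours, and this has been troubling me for a long time. Please forgive the trouble.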

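The input_data.py file

Hello author, and thank you for open-sourcing this. The code package seems to be missing the input_data.py module; could you please check? I look forward to your reply.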

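A question about the malware paper

For the malware paper: after preprocessing, are the first 784 bytes of each packet in a flow used as one training sample, or is a whole flow used as one training sample? If a whole flow or session is one training sample with a single label, are the first three packets taken, or how many?
In train_cnn.py the data appears to follow the MNIST format, that is, one 784-byte record per label, hence my question.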
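
One possible reading of the MNIST-style format, assumed here for illustration and not confirmed by the author in this thread, is that the bytes of a whole session are trimmed or zero-padded to 784 bytes and reshaped to 28 x 28. `session_to_image` is a hypothetical helper, not a function from the repository.

```python
def session_to_image(payload: bytes, size: int = 784, width: int = 28):
    """Trim or zero-pad `payload` to `size` bytes and reshape it into rows of `width`."""
    trimmed = payload[:size].ljust(size, b"\x00")  # short sessions are padded with 0x00
    return [list(trimmed[i:i + width]) for i in range(0, size, width)]
```

Under this reading, each session yields exactly one 784-value sample and one label, regardless of how many packets contributed bytes to it.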

The PreprocessedISCX2012_5class_pkl file in HAST-IDS

As the title says: hello, could you provide the PreprocessedISCX2012_5class_pkl file used in HAST-IDS?
Sorry for the trouble, and thank you very much!

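About the "evaluate the model" section

Hello author, a question: I ran the code on Windows. The "#evaluate the model" part runs through for 10-class classification, but for 2-class and 20-class it only trains the model and does not output accuracy or the other metrics. I am puzzled about the cause and hope you can explain.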

Only call `softmax_cross_entropy_with_logits` with named arguments (labels=..., logits=..., ...)

The arguments to softmax_cross_entropy_with_logits are wrong. Which version of TensorFlow is everyone running this with?
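
The error text in the title indicates that the function rejects positional arguments. The stand-in below is not TensorFlow code. It is a hypothetical plain-Python function that mimics such a named-argument check, to show which call style triggers the message and which does not.

```python
def softmax_cross_entropy_with_logits(_sentinel=None, labels=None, logits=None):
    """Stand-in that mimics a named-argument check; NOT the TensorFlow function."""
    if _sentinel is not None:
        # Any positional argument lands in `_sentinel` and is rejected.
        raise ValueError(
            "Only call `softmax_cross_entropy_with_logits` with named "
            "arguments (labels=..., logits=..., ...)"
        )
    if labels is None or logits is None:
        raise ValueError("Both labels and logits must be provided.")
    return labels, logits

# Positional call: softmax_cross_entropy_with_logits(logits, labels) raises ValueError.
# Named call: softmax_cross_entropy_with_logits(labels=labels, logits=logits) is accepted.
```

If the real function behaves the same way, passing `labels=` and `logits=` by name in the training script should avoid the error regardless of argument order.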

Several problems when running the code

I am not sure whether these are version issues.
First error:
File "/DeepTraffic/2.encrypted_traffic_classification/4.TrainAndTest/1d_cnn_25+3/encrypt_traffic_cnn_1d.py", line 133, in
ValueError: Cannot feed value of shape (50, 10) for Tensor u'Placeholder_1:0', which has shape '(?, 2)'

Second error:
File "DeepTraffic/2.encrypted_traffic_classification/4.TrainAndTest/1d_cnn_25+3/encrypt_traffic_cnn_1d.py", line 107, in
ValueError: Only call softmax_cross_entropy_with_logits with named arguments (labels=..., logits=..., ...)

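Third question:
Why did the author comment out y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, w_fc2) + b_fc2)?
There are still many references to y_conv further down.
Thank you for your answers.

Looking forward to your reply. Best wishes.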
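
The first error above reports labels of shape (50, 10) being fed into a placeholder of shape (?, 2). One plausible cause, offered as an assumption and not confirmed in this thread, is that the data was one-hot encoded for 10 classes while the script was configured for 2. The sketch below uses hypothetical helpers `one_hot` and `check_feed` to show the width check in plain Python.

```python
def one_hot(label_ids, num_classes):
    """Encode integer labels as one-hot rows of width `num_classes`."""
    rows = []
    for i in label_ids:
        if not 0 <= i < num_classes:
            raise ValueError(f"label {i} is outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[i] = 1.0
        rows.append(row)
    return rows

def check_feed(batch, placeholder_width):
    """Raise early if the label width does not match the model's output width."""
    width = len(batch[0])
    if width != placeholder_width:
        raise ValueError(
            f"labels have width {width}, but the placeholder expects {placeholder_width}"
        )
    return True
```

Running such a check before training makes the mismatch visible at data-loading time, so the class count in the script and the class count in the dataset can be aligned before the feed fails.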