Comments (4)
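Hello author, I found the figure above that you posted and would like to confirm: is it the 15 GB public dataset? Also, was the 4.3 GB encrypted_traffic_burst.txt file generated from the 30 GB of data? Thanks.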
from et-bert.
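1. There are no particular constraints on which data is selected for the pre-training dataset, so you can substitute traffic that covers as wide a variety of protocols as possible.
2. encrypted_traffic_burst.txt is generated from the pre-training data.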
from et-bert.
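Sorry to bother you again. The README on the main page says to use vocab_process/main.py to generate the corpora, while the README in data_process says the pre-training stage uses data_generation to generate bursts. Which of the two actually produces encrypted_traffic_burst.txt? I ask because I found that both of them can generate txt files. Could you explain this in more detail? Thanks.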
from et-bert.
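Hello. data_process generates the traffic burst data needed for the pre-training corpora, and vocab_process then generates the corresponding corpora from it. You can think of the two parts as traffic data preprocessing and pre-training data generation, respectively.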
from et-bert.
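The two-stage split described in the reply (traffic preprocessing into bursts, then corpus generation from those bursts) can be illustrated with a minimal sketch. This is not the ET-BERT code: the function names, the rule that a burst is a run of consecutive same-direction packets, and the overlapping 2-byte hex tokenization are all assumptions made for illustration.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

# (direction, payload): a simplified, hypothetical stand-in for a parsed packet
Packet = Tuple[str, bytes]


def split_into_bursts(packets: Iterable[Packet]) -> List[List[bytes]]:
    """Stage 1 (preprocessing): group consecutive same-direction packets.

    Assumption: a burst is a run of adjacent packets sharing one direction.
    """
    return [
        [payload for _, payload in group]
        for _, group in groupby(packets, key=lambda p: p[0])
    ]


def burst_to_tokens(burst: List[bytes]) -> List[str]:
    """Turn a burst's bytes into overlapping 2-byte hex tokens (assumed scheme)."""
    data = b"".join(burst)
    return [data[i:i + 2].hex() for i in range(len(data) - 1)]


def build_corpus(bursts: List[List[bytes]]) -> str:
    """Stage 2 (corpus generation): one line of space-separated tokens per burst."""
    return "\n".join(" ".join(burst_to_tokens(b)) for b in bursts)


if __name__ == "__main__":
    flow = [("out", b"\x16\x03\x01"), ("out", b"\x02"), ("in", b"\x16\x03\x03")]
    print(build_corpus(split_into_bursts(flow)))
```

In this sketch both stages could write a txt file, which matches the observation in the question; only the output of the second stage would be the corpus text used for pre-training.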