ATLAS: A Sequence-based Learning Approach for Attack Investigation

License: Apache License 2.0

ATLAS

This repository contains artifacts for the paper: "ATLAS: A Sequence-based Learning Approach for Attack Investigation" accepted at the 30th USENIX Security Symposium.

Note

The artifacts in this repository include the ATLAS source code and the audit logs of the APT attacks we detailed in the paper. If you use any of the artifacts published in this repository, please acknowledge the use by citing our paper.

@inproceedings{alsaheel2021atlas,
  title={$\{$ATLAS$\}$: A sequence-based learning approach for attack investigation},
  author={Alsaheel, Abdulellah and Nan, Yuhong and Ma, Shiqing and Yu, Le and Walkup, Gregory and Celik, Z Berkay and Zhang, Xiangyu and Xu, Dongyan},
  booktitle={30th USENIX Security Symposium (USENIX Security 21)},
  pages={3005--3022},
  year={2021}
}

Dependencies

Python 3 (tested on Python 3.7.7)
TensorFlow 2.3.0
keras 2.4.3
fuzzywuzzy 0.18.0
matplotlib 2.2.5
numpy 1.16.6
networkx 2.2

How to use

The "paper_experiments" folder includes an individual folder for each experiment presented in the paper. Each folder contains a copy of ATLAS so that the experiment results can be easily reproduced. Each experiment folder also contains the preprocessed log files, so you can skip steps (A) through (C) listed below. The raw audit logs can be found in the "raw_logs" folder.

(A) preprocess.py usage:

Execute the command "python3 preprocess.py" to preprocess the "logs" folders located in the training_logs and testing_logs folders. For each "logs" folder, it generates one preprocessed log file in the "output" folder.

(B) graph_generator.py usage:

Execute the command "python3 graph_generator.py" to take each preprocessed log file from the "output" folder and generate a corresponding graph file in the "output" folder.

(C) graph_reader.py usage:

Execute the command "python3 graph_reader.py" to take each graph file from the "output" folder and generate a corresponding sequence (text) file in the "output" folder.

(D) atlas.py usage:

Edit atlas.py and set the variable "DO_TRAINING" to "True" to run training, or to "False" to run testing instead.
Execute the command "python3 atlas.py" to run ATLAS.
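
Steps (A) through (D) can be chained from a small driver script. The following is a minimal sketch, not part of the repository: the names `PIPELINE_SCRIPTS`, `build_commands`, `run_pipeline` and the `workdir` parameter are hypothetical, and it assumes the four scripts sit in the same experiment folder. "DO_TRAINING" still has to be set by hand in atlas.py before running.

```python
import subprocess

# Script names come from the README; the order follows steps (A) to (D).
PIPELINE_SCRIPTS = [
    "preprocess.py",
    "graph_generator.py",
    "graph_reader.py",
    "atlas.py",
]


def build_commands(python="python3"):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(workdir=".", python="python3"):
    """Run every step inside workdir and stop at the first failure."""
    for command in build_commands(python):
        print("running:", " ".join(command))
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `run_pipeline("paper_experiments/M1/h1")` would run the four steps in that folder (the path is only an example). Because the experiment folders already ship preprocessed logs, running atlas.py alone is usually enough.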

ATLAS "training" phase output:

model.h5 is written to the "output" folder. You can then proceed to the ATLAS "testing" phase.

ATLAS "testing" phase output:

ATLAS predicts the attack entities and prints each attack entity with its prediction probability score, similar to this: [(["0xalsaheel.com", "c:/users/aalsahee/index.html"], 0.9724874496459961), (["0xalsaheel.com", "192.168.223.3"], 0.9721188545227051), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe"], 0.9706782698631287), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_892"], 0.8397794365882874), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_1520"], 0.6693234443664551)]

Then do some manual cleaning: remove redundant attack entities, such as the file "payload.exe" and its redundant attack process entity "payload.exe_892" (both entities refer to the same file). You can also add "obviously" related attack entities if needed. For example, if ATLAS reports that "0xalsaheel.com" is an attack entity, then its resolved IP address "192.168.223.3" is also an attack entity. After this cleaning, the result shown above should look similar to this: ["0xalsaheel.com", "aalsahee/index.html", "192.168.223.3", "payload.exe"]
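
Part of this cleaning can be scripted. The sketch below is an illustration under assumptions, not the authors' procedure: it assumes each prediction is a pair of (list of entity names, score) as in the sample output, and that a trailing "_<digits>" marks a process ID that can be dropped. The names `PID_SUFFIX` and `clean_predictions` are hypothetical. It keeps full paths, so shortening them (as in the README example) and adding "obviously" related entities remain manual steps.

```python
import re

# Assumption: a process entity is the file name plus "_<pid>".
PID_SUFFIX = re.compile(r"_\d+$")


def clean_predictions(predictions):
    """Flatten predicted entity names, drop pid suffixes, and deduplicate."""
    cleaned = []
    seen = set()
    for entities, _score in predictions:
        for entity in entities:
            base = PID_SUFFIX.sub("", entity)
            if base not in seen:  # keep first occurrence, preserve order
                seen.add(base)
                cleaned.append(base)
    return cleaned


predictions = [
    (["0xalsaheel.com", "c:/users/aalsahee/index.html"], 0.9724874496459961),
    (["0xalsaheel.com", "192.168.223.3"], 0.9721188545227051),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe"], 0.9706782698631287),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_892"], 0.8397794365882874),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_1520"], 0.6693234443664551),
]
print(clean_predictions(predictions))
```

A file whose real name ends in "_<digits>" would be shortened by mistake, so the output should still be reviewed by hand.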

(E) evaluate.py usage:

After you finish the ATLAS testing phase, a JSON file whose name starts with "eval_**" is generated in the "output" folder. Open that file in a text editor, replace the first "[]" with your cleaned result (e.g., ["0xalsaheel.com", "aalsahee/index.html", "192.168.223.3", "payload.exe"]), then save the file.
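
The same edit can be done programmatically. This is a minimal sketch that works on the file text only, because the layout of the eval_** file is not described here. The function name `insert_cleaned_entities` and the sample keys are hypothetical.

```python
import json


def insert_cleaned_entities(eval_text, cleaned):
    """Replace the first empty list "[]" in the file text with the cleaned result."""
    if "[]" not in eval_text:
        raise ValueError("no empty list found; the file may already be filled in")
    # count=1 so that only the first "[]" is replaced, as the README instructs.
    return eval_text.replace("[]", json.dumps(cleaned), 1)


# Hypothetical file content, used only to show the effect of the replacement.
sample = '{"first": [], "second": []}'
print(insert_cleaned_entities(sample, ["0xalsaheel.com", "payload.exe"]))
```

To apply it to a real file, read the file into a string, pass it through the function, and write the returned string back.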

Notes

If this result is for a host (e.g., h1) in a multi-host attack scenario (e.g., M1), copy the JSON file to the "output" folder of the second host folder (e.g., h2). This way, when evaluate.py is run in the h2 folder, it considers all involved hosts. Execute the command "python3 evaluate.py" and the final result will be printed based on all the eval_** JSON files stored in the "output" folder.
To find the precision, recall and F1-score for each experiment, we take the number of false positives and false negatives reported by ATLAS and enter them in the Excel sheet paper_experiments/docs/atlas.xlsx to get the result.
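
The spreadsheet formulas are not reproduced here. Assuming the standard definitions of these metrics, they can also be computed directly from the true positive, false positive and false negative counts. The function name `precision_recall_f1` and the counts in the example are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts, not taken from any experiment.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```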

atlas's Issues

Running evaluate.py ERROR.

Script Path: ATLAS/paper_experiment/M1/h1.
If I run evaluate.py directly, there is no error. But after running atlas.py, which overwrites the file eval_seq_graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot.txt.json, I get an error when running evaluate.py:
ERROR: Please add the cleaned predicted entities for abstracted and raw logs.

[*] Questions about sysmon-config

I'm now preparing for data collection, but I ran into some problems while installing Sysmon. I couldn't find a sysmon-config that produces the same system-log format as yours (there are so many sysmon-configs available). Is that important for the whole experiment or not?

Have a good day. :D

Why go through "all malicious labels + one possible subject"?

Hi Alsaheel,

Sorry for bothering you again. May I ask why you need to append every subject to the original malicious labels in this for loop?

This step takes a lot of time and computation. I wonder whether checking only the original malicious labels would work here. Or is there something I missed in your "result_list" logic?

Thanks!

graph_generator.py ValueError

Hey! When I run graph_generator.py, there is an error: ValueError: Node names and attributes should not contain ":" unless they are quoted with "". For example the string 'attribute:data1' should be written as '"attribute:data1"'. Please refer pydot/pydot#258. I cannot find which value contains ":".
How can I solve this problem? Thanks for your help!
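
The error message itself names the remedy: wrap any name or attribute value that contains ":" in double quotes. The helper below is only a sketch of that quoting rule, not a confirmed fix for graph_generator.py. The function name `quote_for_dot` is hypothetical, and where to apply it in the script is left open.

```python
def quote_for_dot(value):
    """Wrap a value in double quotes if it contains ':' and is not quoted yet."""
    already_quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
    if ":" in value and not already_quoted:
        # Escape embedded double quotes so the result stays one quoted string.
        return '"' + value.replace('"', '\\"') + '"'
    return value


print(quote_for_dot("fe80::fd1b:d78f:dab1:8114"))
print(quote_for_dot("192.168.223.3"))
```

IPv6 addresses and Windows paths such as "c:/windows/..." are the likely sources of ":" in these logs.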

atlas.py: result list is null

Hi, I encountered a problem.
I resampled the data and used the resampled data for model training, but when I ran testing, the entity prediction list produced by atlas.py was empty. Why does this happen and how can I solve it?
Thanks for your help!

newline on user artifact

When processing the user artifact or any of the input files, the strings need to be stripped of newline characters.

It is common for tools such as Vi to automatically append this character. It would be best practice to remove it.
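
A minimal sketch of the suggested stripping, assuming the input is read line by line. The function name `clean_lines` is hypothetical and the repository's actual reading code is not shown here.

```python
def clean_lines(lines):
    """Strip surrounding whitespace (including newlines) and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


# A file saved by an editor that appends a trailing newline.
print(clean_lines(["aalsahee\n", "\n"]))
```

An open file object can be passed in directly, since iterating over it yields lines with their trailing newline.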

What does "No skill and Lines" mean in evaluate.py?

Hi Alsaheel,

Sorry for coming back again. After running an evaluation, I am a bit confused about what the following metrics mean:

No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.850
Lines No Skill: ROC AUC=0.500
Lines Logistic: ROC AUC=0.951

I think the Logistic is the algorithm in NLP. Then what about the No skill and Lines? Could you give me an example of what they represent?

Thanks!

The results of evaluate.py about entity

After running evaluate.py, I got the following entity results. They seem different from the results in your paper.
Could you tell me what's wrong? Thanks!

Info (entity)
Number of unique entities: 652
Number of malicious entities: 11
Result (entity)
TP: 11
TN: 641
FP: 0
FN: 0

[*] Run graph_reader err

When I run graph_reader, I got that:

python graph_reader.py

============
processing the graph: output/graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot
"connection_fe80::fd1b:d78f:dab1:8114_ff02::1:2" -> "c:/windows/system32/svchost.exe_836"  [capacity="1.0", dip=ff02::1:2, dport=547, key=0, label=connect_26519780, sip=fe80::fd1b:d78f:dab1:8114, sport=546, timestamp=26519780, type=connect];
                                                                                           ^
Expected "}", found '['  (at char 513267), (line:3595, col:92)
Traceback (most recent call last):
  File "graph_reader.py", line 29, in <module>
    G = read_dot(path)
  File "D:\Anaconda\lib\site-packages\networkx\utils\decorators.py", line 795, in func
    return argmap._lazy_compile(__wrapper)(*args, **kwargs)
  File "<class 'networkx.utils.decorators.argmap'> compilation 5", line 5, in argmap_read_dot_1
  File "D:\Anaconda\lib\site-packages\networkx\drawing\nx_pydot.py", line 78, in read_dot
    return from_pydot(P_list[0])
TypeError: 'NoneType' object is not subscriptable

By the way, I also found that in the file "graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot", at line 2535,

"dt.adsafeprotected.com/dt?adventityid=151939&asid=e516bb60-b043-4777-aae3-ed30f4d357c0&tv={c" [timestamp=30731585, type=web_object];

this URL ends with '{c', but in "testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1" the full URI that was requested looks like this:

30731585,,,,,,,,,,request,dt.adsafeprotected.com/dt?adventityid=151939&asid=e516bb60-b043-4777-aae3-ed30f4d357c0&tv={c:xrw4abpingtime:-2time:160type:aim:{prf:{bea:10833bez:10835mfa:10838cma:10840ina:10840inz:10851pra:10851prz:10898si:10915poa:10918poz:10931cmz:10931mfz:10931loa:10979loz:10985lta:10991ltz:10991}}sca:{dfp:{df:4sz:728.90dom:body}}env:{gca:1sf:0gcd:{appl:0cnst:0glbl:namtdt:bozjzy6ozjzy6ahabbaab6aaaaaaaa}pom:1}clog:[{piv:0vs:or:lw:728h:90t:77},,,,,,,,-LB-

Is there a problem in the data preprocessing? Or did I miss some steps while reproducing this experiment?
:D

missing files from the experiments

I was just reviewing the contents and trying to run your experiments. I unpacked S1 and tried to run the preprocess step. The files in the logs directories of the training and testing folders were missing.

graph_reader.py read_dot() encounter error

When the function read_dot() parses data like the line below, it raises an error: pyparsing.exceptions.ParseException: Expected '}' , found ':' (at char 613806) (line:3344, col:16)
connection_fe80::fd1b:d78f:dab1:8114_ff02::1:3 -> "c:/windows/system32/svchost.exe_1180" [key=0, capacity="1.0", label=connect_26519816, type=connect, timestamp=26519816, sip=fe80::fd1b:d78f:dab1:8114, sport=61605, dip=ff02::1:3,
How can I solve this problem? Thank you for your help!

Use STIX 2.1 for objects

The representation of the data objects should conform to a standard such as STIX 2.1 for cyber-observables. This will provide interoperability and compatibility along with semantic meaning.

About FireFox audit logs

Hey! I'm confused about the audit logs of Firefox. Can you tell me how to get the audit log of Firefox? Thanks!

different number of nodes

Hi Alsaheel,
After running the graph_generator.py, I found that the number of nodes in the result was not exactly the same as the number of entities in the ground truth.

The results I obtained were 7467, 33977, 8996 and 13019, respectively. What is the reason and how can I solve it?
The same was true for events.
Thanks for your help!

Confused about manual cleaning step

Hey! In your manual cleanup step, how do we decide whether an entity is redundant? And does it affect the final experimental result (F1) whether I clean up all the redundant entities or only some of them, such as the ones with the highest predicted probability?