atlas's Introduction

ATLAS

This repository contains artifacts for the paper: "ATLAS: A Sequence-based Learning Approach for Attack Investigation" accepted at the 30th USENIX Security Symposium.

Note

The artifacts in this repository include the ATLAS source code and the audit logs containing the APT attacks detailed in the paper. If you use any of the artifacts published in this repository, please acknowledge it by citing our paper.

@inproceedings{alsaheel2021atlas,
  title={$\{$ATLAS$\}$: A sequence-based learning approach for attack investigation},
  author={Alsaheel, Abdulellah and Nan, Yuhong and Ma, Shiqing and Yu, Le and Walkup, Gregory and Celik, Z Berkay and Zhang, Xiangyu and Xu, Dongyan},
  booktitle={30th USENIX Security Symposium (USENIX Security 21)},
  pages={3005--3022},
  year={2021}
}

Dependencies

  • Python 3 (tested on Python 3.7.7)
  • TensorFlow 2.3.0
  • keras 2.4.3
  • fuzzywuzzy 0.18.0
  • matplotlib 2.2.5
  • numpy 1.16.6
  • networkx 2.2

How to use

The "paper_experiments" folder contains one folder for each experiment presented in the paper. Each experiment folder includes a copy of ATLAS so that the results can be reproduced easily. Each experiment folder also contains the preprocessed log files, so you can skip steps (A) through (C) listed below. The raw audit logs are in the "raw_logs" folder.

(A) preprocess.py usage:

  • Execute the command "python3 preprocess.py" to preprocess the "logs" folders located in the training_logs and testing_logs folders. For each "logs" folder, it generates one preprocessed log file in the "output" folder.

(B) graph_generator.py usage:

  • Execute the command "python3 graph_generator.py" to take each preprocessed log file from the "output" folder and generate a corresponding graph file in the "output" folder.

(C) graph_reader.py usage:

  • Execute the command "python3 graph_reader.py" to take each graph file from the "output" folder and generate a corresponding sequence (text) file in the "output" folder.
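Steps (A) through (C) can also be chained from one script. The following is a minimal sketch, assuming it is run from an experiment folder that contains the three scripts; the helper names pipeline_commands and run_pipeline are hypothetical and not part of ATLAS.

```python
import subprocess
import sys

# Script names come from steps (A) through (C); the order matters because
# each step reads what the previous one wrote to the "output" folder.
PIPELINE_SCRIPTS = ["preprocess.py", "graph_generator.py", "graph_reader.py"]


def pipeline_commands(python=sys.executable):
    """Return one command per step, in the order the steps are listed."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(workdir="."):
    """Run each step inside workdir and stop at the first failing step."""
    for command in pipeline_commands():
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(command, cwd=workdir, check=True)
```

Calling run_pipeline() from inside an experiment folder runs the three scripts in sequence; a failure in an earlier step prevents the later ones from reading stale files.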

(D) atlas.py usage:

  • Edit atlas.py and set the variable "DO_TRAINING" to "True" to run training, or to "False" to run testing instead.
  • Execute the command "python3 atlas.py" to run ATLAS.

ATLAS "training" phase output:

  • model.h5 is written to the "output" folder. You can now proceed to the ATLAS "testing" phase.

ATLAS "testing" phase output:

  • ATLAS predicts the attack entities and prints each one with its prediction probability score, similar to this: [(["0xalsaheel.com", "c:/users/aalsahee/index.html"], 0.9724874496459961), (["0xalsaheel.com", "192.168.223.3"], 0.9721188545227051), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe"], 0.9706782698631287), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_892"], 0.8397794365882874), (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_1520"], 0.6693234443664551)]

Next, clean the result manually. Remove redundant attack entities, such as the file "payload.exe" and its process entity "payload.exe_892" (both refer to the same file). You can also add "obviously" related attack entities if needed. For example, if ATLAS reports that "0xalsaheel.com" is an attack entity, then its resolved IP address "192.168.223.3" is also an attack entity. After cleaning, the result shown above should look similar to this: ["0xalsaheel.com", "aalsahee/index.html", "192.168.223.3", "payload.exe"]
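Part of this cleaning can be approximated in code. The sketch below is an assumption-laden helper, not part of ATLAS: it treats a trailing "_<digits>" as a process ID suffix (as in "payload.exe_892"), strips it, and removes duplicates while keeping the original order. The name clean_entities and the threshold parameter are hypothetical. It keeps full paths, so it does not shorten them as in the example above, and it cannot add "obviously" related entities; those still need manual review.

```python
import re

# Assumption: a trailing underscore followed by digits is a process ID suffix.
PID_SUFFIX = re.compile(r"_\d+$")


def clean_entities(predictions, threshold=0.0):
    """Flatten (entities, score) pairs into a de-duplicated entity list."""
    cleaned = []
    seen = set()
    for entities, score in predictions:
        if score < threshold:
            continue  # optionally drop low-confidence predictions
        for entity in entities:
            name = PID_SUFFIX.sub("", entity)
            if name not in seen:
                seen.add(name)
                cleaned.append(name)
    return cleaned


predictions = [
    (["0xalsaheel.com", "c:/users/aalsahee/index.html"], 0.9724874496459961),
    (["0xalsaheel.com", "192.168.223.3"], 0.9721188545227051),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe"], 0.9706782698631287),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_892"], 0.8397794365882874),
    (["0xalsaheel.com", "c:/users/aalsahee/payload.exe_1520"], 0.6693234443664551),
]
print(clean_entities(predictions))
```

A file whose real name ends in an underscore and digits would be shortened incorrectly by this heuristic, so the output should be checked by hand.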

(E) evaluate.py usage:

  • After you finish the ATLAS testing phase, a JSON file whose name starts with "eval_**" is generated in the "output" folder. Open that file in a text editor, replace the first "[]" with your cleaned result (e.g., ["0xalsaheel.com", "aalsahee/index.html", "192.168.223.3", "payload.exe"]), then save the file.
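The same edit can be done as a plain text substitution. This is a minimal sketch that assumes the generated file contains the empty list written exactly as "[]"; the function name insert_cleaned_result is hypothetical, and the JSON text used in the example is made up for illustration because the real layout of the eval file is not described here.

```python
import json


def insert_cleaned_result(json_text, cleaned):
    """Replace only the first "[]" in json_text with the cleaned entity list."""
    if "[]" not in json_text:
        raise ValueError('no empty list "[]" found in the file content')
    return json_text.replace("[]", json.dumps(cleaned), 1)


# Hypothetical file content, used only to show the substitution.
example = '{"first": [], "second": []}'
print(insert_cleaned_result(example, ["0xalsaheel.com", "payload.exe"]))
```

To apply it to a real file, read the file content into a string, pass it through the function, and write the returned string back to the same path.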

Notes

  • If this result is for one host (e.g., h1) in a multi-host attack scenario (e.g., M1), copy the JSON file to the "output" folder of the second host folder (e.g., h2). That way, when evaluate.py is run from the h2 folder, it considers all involved hosts.
  • Execute the command "python3 evaluate.py". The final result is printed based on all the eval_** JSON files stored in the "output" folder.
  • To find the precision, recall and F1-score for each experiment, we take the number of false positives and false negatives reported by ATLAS and enter them in the Excel sheet paper_experiments/docs/atlas.xlsx to get the result.
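For readers who prefer not to use the spreadsheet, the three metrics follow from the standard definitions. The sketch below uses those textbook formulas; the function name precision_recall_f1 is hypothetical, and the spreadsheet itself may round or present the values differently.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example counts: 11 true positives, no false positives, no false negatives.
print(precision_recall_f1(11, 0, 0))
```

The zero checks avoid a division by zero when nothing was predicted or nothing was malicious.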

atlas's Issues

Running evaluate.py ERROR.

Script Path: ATLAS/paper_experiment/M1/h1.
If I run evaluate.py directly, there is no error. But after running atlas.py, which overwrites the file eval_seq_graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot.txt.json, I get an error when running evaluate.py:
ERROR: Please add the cleaned predicted entities for abstracted and raw logs.

[*] Questions about sysmon-config

I'm now preparing for data collection, but I ran into problems while installing Sysmon. I couldn't find a sysmon-config that produces the same system-log format as yours (there are many sysmon-config files available). Is that important for the whole experiment or not?

Have a good day. :D

Why go through "all malicious labels + one possible subject"?

Hi Alsaheel,

Sorry for bothering you again. May I ask why you need to append every subject to the original malicious labels in this for loop?

This step takes a lot of time and computation. I wonder whether checking only the original malicious labels would work here. Or is there something I missed in your "result_list" logic?

Thanks!

graph_generator.py ValueError

Hey! When I run graph_generator.py, there is an error: ValueError: Node names and attributes should not contain ":" unless they are quoted with "". For example the string 'attribute:data1' should be written as '"attribute:data1"'. Please refer pydot/pydot#258. I cannot find what contains ":".
How can I solve the problem? Thanks for your help!
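The error message itself points at one possible workaround: wrap any value that contains ":" in double quotes before it is written to the DOT file. The helper below is a sketch of that idea only; quote_dot_id is a hypothetical name, it is not taken from graph_generator.py, and the pattern for unquoted IDs is a simplification of the DOT grammar.

```python
import re

# Simplified rule: identifiers and plain numerals may stay unquoted in DOT.
UNQUOTED_ID = re.compile(r"^(?:[A-Za-z_][A-Za-z0-9_]*|-?(?:\.\d+|\d+(?:\.\d*)?))$")


def quote_dot_id(value):
    """Return value as a DOT ID, adding double quotes when they are needed."""
    text = str(value)
    if UNQUOTED_ID.match(text):
        return text
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text  # already quoted, leave unchanged
    # Inside a quoted DOT string, embedded double quotes must be escaped.
    return '"' + text.replace('"', '\\"') + '"'


# IPv6 addresses such as the ones in the audit logs contain ":".
print(quote_dot_id("fe80::fd1b:d78f:dab1:8114"))
print(quote_dot_id(547))
```

Applying such a helper to node names and attribute values such as sip and dip before they reach pydot would keep IPv6 addresses from being read as port separators.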

atlas.py: result list is null

Hi, I encountered a problem.
I resampled the data and used the resampled data for model training, but during testing the entity prediction list from atlas.py is empty. Why is this, and how can I solve it?
Thanks for your help!

newline on user artifact

When processing the user artifact or any of the input files, the strings need to be stripped of newline characters.

Tools such as Vi commonly append this character automatically, so it would be best practice to remove it.
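A minimal sketch of the suggested stripping, assuming the input file holds one value per line; strip_lines is a hypothetical helper and does not exist in the repository.

```python
def strip_lines(lines):
    """Strip surrounding whitespace (including newlines) and drop empty lines."""
    cleaned = []
    for line in lines:
        value = line.strip()
        if value:
            cleaned.append(value)
    return cleaned


# Works on any iterable of strings, including an open file handle.
print(strip_lines(["aalsahee\n", "\n", " payload.exe \r\n"]))
```

Because an open text file iterates line by line, the function can be called directly on a file handle inside a with block.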

What does "No skill and Lines" mean in evaluate.py?

Hi Alsaheel,

Sorry for coming back again. After running an evaluation, I am a bit confused about what the following metrics mean:

No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.850
Lines No Skill: ROC AUC=0.500
Lines Logistic: ROC AUC=0.951

I think Logistic is the algorithm in NLP. But what about No Skill and Lines? Could you give me an example of what they represent?

Thanks!

The results of evaluate.py about entity

After running evaluate.py, I got the following result for entities. It seems different from the result in your paper.
Could you tell me what's wrong with it? Thanks!

Info (entity)
Number of unique entities: 652
Number of malicious entities: 11
Result (entity)
TP: 11
TN: 641
FP: 0
FN: 0

[*] Run graph_reader err

When I run graph_reader, I got that:

python graph_reader.py

============
processing the graph: output/graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot
"connection_fe80::fd1b:d78f:dab1:8114_ff02::1:2" -> "c:/windows/system32/svchost.exe_836"  [capacity="1.0", dip=ff02::1:2, dport=547, key=0, label=connect_26519780, sip=fe80::fd1b:d78f:dab1:8114, sport=546, timestamp=26519780, type=connect];
                                                                                           ^
Expected "}", found '['  (at char 513267), (line:3595, col:92)
Traceback (most recent call last):
  File "graph_reader.py", line 29, in <module>
    G = read_dot(path)
  File "D:\Anaconda\lib\site-packages\networkx\utils\decorators.py", line 795, in func
    return argmap._lazy_compile(__wrapper)(*args, **kwargs)
  File "<class 'networkx.utils.decorators.argmap'> compilation 5", line 5, in argmap_read_dot_1
  File "D:\Anaconda\lib\site-packages\networkx\drawing\nx_pydot.py", line 78, in read_dot
    return from_pydot(P_list[0])
TypeError: 'NoneType' object is not subscriptable

By the way, I also found that in the file "graph_testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1.dot", line 2535 reads:

"dt.adsafeprotected.com/dt?adventityid=151939&asid=e516bb60-b043-4777-aae3-ed30f4d357c0&tv={c" [timestamp=30731585, type=web_object];

This URL ends with '{c', but in "testing_preprocessed_logs_M1-CVE-2015-5122_windows_h1" the full requested URI looks like this:

30731585,,,,,,,,,,request,dt.adsafeprotected.com/dt?adventityid=151939&asid=e516bb60-b043-4777-aae3-ed30f4d357c0&tv={c:xrw4abpingtime:-2time:160type:aim:{prf:{bea:10833bez:10835mfa:10838cma:10840ina:10840inz:10851pra:10851prz:10898si:10915poa:10918poz:10931cmz:10931mfz:10931loa:10979loz:10985lta:10991ltz:10991}}sca:{dfp:{df:4sz:728.90dom:body}}env:{gca:1sf:0gcd:{appl:0cnst:0glbl:namtdt:bozjzy6ozjzy6ahabbaab6aaaaaaaa}pom:1}clog:[{piv:0vs:or:lw:728h:90t:77},,,,,,,,-LB-

Is there a problem in the data preprocessing? Or did I miss some steps while reproducing this experiment?
:D

missing files from the experiments

I was just reviewing the contents and trying to run your experiments. I unpacked S1 and tried to execute the preprocess function. The files in the logs directories for training and testing were missing.

graph_reader.py read_dot() encounter error

When the function read_dot() parses data like the line below, it raises an error: pyparsing.exceptions.ParseException: Expected '}' , found ':' (at char 613806) (line:3344, col:16)
connection_fe80::fd1b:d78f:dab1:8114_ff02::1:3 -> "c:/windows/system32/svchost.exe_1180" [key=0, capacity="1.0", label=connect_26519816, type=connect, timestamp=26519816, sip=fe80::fd1b:d78f:dab1:8114, sport=61605, dip=ff02::1:3,
How to solve this problem? Thank you for your help!

Use STIX 2.1 for objects

The representation of the data objects should conform to a standard such as STIX 2.1 for cyber-observables. This will provide interoperability and compatibility along with semantic meaning.

About FireFox audit logs

Hey! I'm confused about the Firefox audit logs. Can you tell me how to get them? Thanks!

different number of nodes

Hi Alsaheel,
After running the graph_generator.py, I found that the number of nodes in the result was not exactly the same as the number of entities in the ground truth.

The results I obtained were 7467, 33977, 8996 and 13019, respectively. What is the reason, and how can I solve it?
The same situation occurs for events.
Thanks for your help!

Confused about manual cleaning step

Hey! In your manual cleanup step, how do we decide whether an entity is redundant? And does it affect the final experimental result (F1) whether I clean up all the redundant entities or only some of them, such as the ones with the highest predicted probability?

How to get "malicious_labels.txt"

Hi Alsaheel,

May I ask how you got the "malicious_labels.txt" (inside the training/testing logs) from the raw logs?
Does "preprocess.py" assume that "malicious_labels.txt" already exists?
