oalieno / asm2vec-pytorch

Unofficial implementation of asm2vec using PyTorch (with GPU acceleration)

License: MIT License

Python 100.00%
asm2vec pytorch unofficial python machine-learning neural-language-processing gpu-acceleration

asm2vec-pytorch's Introduction

Hi there 👋

asm2vec-pytorch's People

Contributors

captwake, markgllin, oalieno


asm2vec-pytorch's Issues

Error in the process of training the model

Hello oalieno, I'm reproducing a paper that used Asm2Vec. I used Ghidra to extract function features and followed the approach of bin2asm.py to normalize the assembly code.

The original code

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Before normalization (just one of the results)

.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[0x102004]
MOV RDI,RAX
CALL 0x00101030
MOV EAX,0x0
POP RBP
RET

After normalization

.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[CONST]
MOV RDI,RAX
CALL CONST
MOV EAX,CONST
POP RBP
RET
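The normalization shown above (immediates and addresses collapsed to CONST) can be sketched in a few lines of Python. This is an illustration of the rule visible in the two listings, not the repository's exact bin2asm.py logic:

```python
import re

def normalize(line):
    """Replace numeric immediates and memory-operand addresses with CONST,
    mirroring the before/after listings above (a sketch, not the exact
    rules used by bin2asm.py)."""
    mnemonic, _, operands = line.partition(' ')
    if not operands:
        return line  # e.g. RET has no operands
    # hex (0x...) or bare decimal literals become CONST
    operands = re.sub(r'0x[0-9a-fA-F]+|\b\d+\b', 'CONST', operands)
    return f'{mnemonic} {operands}'

for line in ['PUSH RBP', 'LEA RAX,[0x102004]', 'CALL 0x00101030', 'RET']:
    print(normalize(line))
```

Running this over the raw listing reproduces the normalized one line by line.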

Then, when I run python scripts/train.py -i asm/ -o model.pt, the following error occurs:

Traceback (most recent call last):
  File "scripts/train.py", line 52, in <module>
    cli()
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "scripts/train.py", line 48, in cli
    learning_rate=lr
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 74, in train
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 43, in preprocess
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in random_walk
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in <listcomp>
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 117, in _random_walk
IndexError: list index out of range

Is this only usable with function features extracted by radare2? How should I use this model with other decompilers? Thanks very much.
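One plausible trigger for an IndexError in the random walk is a function with too few instructions or basic blocks to walk over; filtering such functions out before training is a workaround worth trying. Below is a sketch: the metadata convention (lines beginning with ".") is taken from the listings in this thread, and the threshold is an assumption, not a value from the library:

```python
def has_enough_instructions(lines, minimum=2):
    """Count instruction lines (everything that is not a '.name'/'.offset'/
    '.file' metadata line) in a normalized asm listing, and report whether
    the function is long enough to keep. The `minimum` threshold is an
    assumed value; tune it against your own data."""
    count = sum(1 for line in lines
                if line.strip() and not line.lstrip().startswith('.'))
    return count >= minimum

func = """.name main
.offset 00101139
.file a.out
PUSH RBP
RET""".splitlines()
print(has_enough_instructions(func))
```

Dropping the asm files that fail this check before calling train.py would at least isolate whether short functions are what the random walk is choking on.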

Obtaining function vectors

from the readme:

python scripts/test.py -i asm/123456 -m model.pt

After you train your model, try grabbing an assembly function and looking at the result.
This script shows you how the model performs.
Once you are satisfied, you can take the function's embedding vector and do whatever you want with it.

From my testing, test.py appears to return predictions based on the instruction embeddings of the function(s) passed in, not the function embedding itself. How do we obtain function vectors for further use?

The loss is very high and the accuracy is very low!

I think there must be some problem in your code...
Whatever dataset I use, the model performs badly...
In fact, I only get about 0.3 accuracy, and the loss is still around 0.2 even after 128 epochs.

Or maybe I did something wrong, if you can actually get a high accuracy.

Missing Functions?

I am trying to use the tool to get embeddings for all functions in a binary file. The file has about 10,000 functions, but the tool only extracts a few of them, missing roughly 9,000. I found that many functions present in the symbol table are missing, and many offsets differ from IDA Pro, so the function results from bin2asm.py don't seem to be real functions.

The difference is as follows: [screenshots omitted]

The missed functions appear in the symbol table in IDA as follows: [screenshots omitted]

    functions, tokens_new = asm2vec.utils.load_data(data)
    for f in functions:
        print(f.meta)

Why does this happen? And is there a solution?
Besides, my goal is to get embeddings for all functions in a file as one tensor or NumPy array, even when the number of functions is not known in advance, but at the moment I can only index specific positions. I don't know much about torch, so what can I do?
For example, I can get the first three, but I don't know how to index from 0 up to the total number of functions.

v1 = model.to('cpu').embeddings_f(torch.tensor([0,2,3]))
print(v1)
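To pull out every function vector at once, index the embedding table with the full range of ids. Here is a toy sketch using a stand-in nn.Embedding; in this thread the real table appears as model.embeddings_f, and the number of functions can be taken from len(functions) as returned by load_data (both assumptions about this codebase, based only on the snippets above):

```python
import torch
import torch.nn as nn

# Stand-in for the trained model's function-embedding table;
# the sizes here are toy values for illustration only.
num_functions, dim = 5, 8
embeddings_f = nn.Embedding(num_functions, dim)

# Index with the full id range instead of a hand-written [0, 2, 3].
indices = torch.arange(num_functions)
vectors = embeddings_f(indices).detach()   # shape: (num_functions, dim)

numpy_vectors = vectors.numpy()            # NumPy array for further use
print(vectors.shape)
```

With the real model, replacing `embeddings_f` and `num_functions` with the trained table and `len(functions)` should give one row per function.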

How to apply the implementation to PE files on Windows OS?

I'm not familiar with radare2 and r2pipe, but I installed them and already set up the environment path. I ran into some errors when running the bin2asm.py script, and my attempts to fix them failed. Can you help me solve this problem?

I have already forced validEXE() to return True, but the error still occurs:
File "bin2asm.py", line 68, in bin2asm
for fn in r.cmdj('aflj'):
TypeError: 'NoneType' object is not iterable
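The TypeError arises because r2pipe's cmdj('aflj') returns None when radare2 has found no functions (for instance when analysis has not been run; running r.cmd('aaa') first usually populates the function list). A defensive wrapper, sketched here without r2pipe itself, treats None as an empty list:

```python
def iter_functions(aflj_result):
    """r2.cmdj('aflj') returns None when no functions are known;
    fall back to an empty list so iteration is always safe."""
    return aflj_result or []

# With r2pipe this would be used as:
#   for fn in iter_functions(r.cmdj('aflj')): ...
print(list(iter_functions(None)))
```

This makes bin2asm.py skip empty binaries cleanly instead of crashing, though it does not explain why radare2 found no functions in the PE file in the first place.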

I tried to reproduce the experiments in the Asm2Vec paper using this implementation, but my results are really bad. What is the problem?

I attempted to use this library to compute the cosine similarity between the O0- and O3-optimized functions in coreutils version 8.30 (taken from https://github.com/yueduan/DeepBinDiff). To replicate the results, I used the same training options as the paper: embedding dimension 200, learning rate 0.025, and 10 random walks.

I found the average cosine similarity to be 0.128. Given that the paper reports Asm2Vec correctly matching around 80% of O0/O3 function pairs, this score is very poor. Do you have an explanation?

Is there any dataset?

I want to test this model, but I cannot find any dataset in this repository.

Which binaries did you use as a dataset? Could you please add them to the repository?

About parsing assembly & normalization

I found a difference between radare2 and xed-interface while doing some experiments:

  • radare2 will output mov byte [rax], 0xb2
  • xed-interface will output mov byte ptr [rax], 0xb2

This might be a problem: if you do not use bin2asm.py to generate the data, assembly code obtained elsewhere may not be normalized the same way and may contain tiny differences.

Parsing and normalizing the assembly inside the asm2vec library itself may be a better solution, perhaps using keystone and capstone to assemble and then disassemble, yielding a unified representation.

Model reports low cosine similarity for identical functions

Using a single function as the training dataset, I am able to generate a model with train.py. But with that same function passed as both target function 1 and target function 2, compare.py reports cosine similarity values close to 0, when the expected value is close to 1 (i.e., almost identical).

# asm/ contains a singular file with one function
python scripts/train.py -i asm/ -o model.pt --epochs 100

# asm/function is used for both training + comparison
python scripts/compare.py -i1 asm/function -i2 asm/function -m model.pt
=> cosine similarity : 0.019504

Am I misunderstanding the usage/purpose of asm2vec-pytorch?

Attached is an example function used for both training and comparison in the model (although I found this to be true of every function I've tested):
function.txt

If it's relevant, this is a function extracted from a statically linked busybox binary.

About classification

The original paper seems to treat this as a multi-label classification problem while learning the embedding. For example, mov rbp, rsp is split into 3 tokens mov, rbp, rsp, and we try to push the classifier's output for these 3 tokens higher. The problem is that all 3 tokens share a single classifier. But we already know that an instruction splits into at most 3 parts: push rbp becomes push, rbp, <empty>, and ret becomes ret, <empty>, <empty>. We could use 3 classifiers, one per slot, and treat it as an ordinary multi-class classification problem. The network might learn better. Just a thought.
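The three-slot idea above can be sketched with three independent linear heads over a shared context vector. This is a toy illustration of the proposal, not the repository's actual architecture, and all sizes are made up:

```python
import torch
import torch.nn as nn

vocab = 16   # toy token vocabulary (mnemonics, operands, <empty>)
dim = 8      # toy embedding dimension

class ThreeSlotHead(nn.Module):
    """One independent classifier per instruction slot
    (mnemonic, operand 1, operand 2) instead of a single shared one."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(3))

    def forward(self, context):
        # context: (batch, dim) -> one logit vector per slot: (batch, 3, vocab)
        return torch.stack([h(context) for h in self.heads], dim=1)

logits = ThreeSlotHead(dim, vocab)(torch.randn(2, dim))
print(logits.shape)
```

Each slot then gets its own softmax over the vocabulary, so ret can place <empty> in slots 2 and 3 without competing against operand tokens in a shared output layer.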
