oalieno / asm2vec-pytorch

Unofficial implementation of asm2vec using PyTorch (with GPU acceleration)

License: MIT License

Python 100.00%
asm2vec pytorch unofficial python machine-learning neural-language-processing gpu-acceleration

asm2vec-pytorch's Introduction

Hi there 👋

asm2vec-pytorch's People

Contributors

captwake, markgllin, oalieno


asm2vec-pytorch's Issues

Error in the process of training the model

Hello oalieno, I'm reproducing a paper that used Asm2Vec. I used Ghidra to extract function features and followed the approach of bin2asm.py to normalize the assembly code.

The original code

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Before normalization (just one of the results)

.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[0x102004]
MOV RDI,RAX
CALL 0x00101030
MOV EAX,0x0
POP RBP
RET

After normalization

.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[CONST]
MOV RDI,RAX
CALL CONST
MOV EAX,CONST
POP RBP
RET
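The normalization shown above (immediates and addresses collapsed to CONST) can be sketched in a few lines of Python. This is an illustration of the rule visible in the two listings, not the repository's exact bin2asm.py logic:

```python
import re

def normalize(line):
    """Replace numeric immediates and memory-operand addresses with CONST,
    mirroring the before/after listings above (a sketch, not the exact
    rules used by bin2asm.py)."""
    mnemonic, _, operands = line.partition(' ')
    if not operands:
        return line  # e.g. RET has no operands
    # hex (0x...) or bare decimal literals become CONST
    operands = re.sub(r'0x[0-9a-fA-F]+|\b\d+\b', 'CONST', operands)
    return f'{mnemonic} {operands}'

for line in ['PUSH RBP', 'LEA RAX,[0x102004]', 'CALL 0x00101030', 'RET']:
    print(normalize(line))
```

Running this over the raw listing reproduces the normalized one line by line.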

Then, when I run python scripts/train.py -i asm/ -o model.pt, the following error occurs:

Traceback (most recent call last):
  File "scripts/train.py", line 52, in <module>
    cli()
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "scripts/train.py", line 48, in cli
    learning_rate=lr
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 74, in train
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 43, in preprocess
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in random_walk
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in <listcomp>
  File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 117, in _random_walk
IndexError: list index out of range

Is this only usable with function features extracted by radare2? How should I use this model with other decompilers? Thanks very much.
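One plausible trigger for an IndexError in the random walk is a function with too few instructions or basic blocks to walk over; filtering such functions out before training is a workaround worth trying. Below is a sketch: the metadata convention (lines beginning with ".") is taken from the listings in this thread, and the threshold is an assumption, not a value from the library:

```python
def has_enough_instructions(lines, minimum=2):
    """Count instruction lines (everything that is not a '.name'/'.offset'/
    '.file' metadata line) in a normalized asm listing, and report whether
    the function is long enough to keep. The `minimum` threshold is an
    assumed value; tune it against your own data."""
    count = sum(1 for line in lines
                if line.strip() and not line.lstrip().startswith('.'))
    return count >= minimum

func = """.name main
.offset 00101139
.file a.out
PUSH RBP
RET""".splitlines()
print(has_enough_instructions(func))
```

Dropping the asm files that fail this check before calling train.py would at least isolate whether short functions are what the random walk is choking on.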

Obtaining function vectors

from the readme:

python scripts/test.py -i asm/123456 -m model.pt

After you train your model, try grabbing an assembly function and looking at the result.
This script shows you how the model performs.
Once you are satisfied, you can take the function's embedding vector and do whatever you want with it.

From my testing, test.py appears to return predictions based on the instruction embeddings of the function(s) passed in, not the function embedding itself. How do we obtain function vectors for further use?

The loss is very high and the accuracy is very low!

I think there must be some problem in your code...
Whatever dataset I use, the model performs badly...
In fact, I only get about 0.3 accuracy, and the loss is still around 0.2 even after 128 epochs.

Or maybe I did something wrong, if you can actually get a high accuracy.

Missing Functions?

I am trying to use the tool to get embeddings for all functions in a binary file. The file has about 10,000 functions, but the tool only extracts a few of them, missing roughly 9,000. I found that many functions present in the symbol table are missing, and many offsets differ from IDA Pro, so the function results from bin2asm.py don't seem to be real functions.

The difference is as follows: [screenshots omitted]

The missed functions appear in the symbol table in IDA as follows: [screenshots omitted]

    functions, tokens_new = asm2vec.utils.load_data(data)
    for f in functions:
        print(f.meta)

Why does this happen? And is there a solution?
Besides, my goal is to get embeddings for all functions in a file as one tensor or NumPy array, even when the number of functions is not known in advance, but at the moment I can only index specific positions. I don't know much about torch, so what can I do?
For example, I can get the first three, but I don't know how to index from 0 up to the total number of functions.

v1 = model.to('cpu').embeddings_f(torch.tensor([0,2,3]))
print(v1)
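To pull out every function vector at once, index the embedding table with the full range of ids. Here is a toy sketch using a stand-in nn.Embedding; in this thread the real table appears as model.embeddings_f, and the number of functions can be taken from len(functions) as returned by load_data (both assumptions about this codebase, based only on the snippets above):

```python
import torch
import torch.nn as nn

# Stand-in for the trained model's function-embedding table;
# the sizes here are toy values for illustration only.
num_functions, dim = 5, 8
embeddings_f = nn.Embedding(num_functions, dim)

# Index with the full id range instead of a hand-written [0, 2, 3].
indices = torch.arange(num_functions)
vectors = embeddings_f(indices).detach()   # shape: (num_functions, dim)

numpy_vectors = vectors.numpy()            # NumPy array for further use
print(vectors.shape)
```

With the real model, replacing `embeddings_f` and `num_functions` with the trained table and `len(functions)` should give one row per function.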

How to apply the implementation to PE files on Windows OS?

I'm not familiar with radare2 and r2pipe, but I installed them and already set up the environment path. I ran into some errors when running the bin2asm.py script, and my attempts to fix them failed. Can you help me solve this problem?

I have already forced validEXE() to return True, but the error still occurs:
File "bin2asm.py", line 68, in bin2asm
for fn in r.cmdj('aflj'):
TypeError: 'NoneType' object is not iterable
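The TypeError arises because r2pipe's cmdj('aflj') returns None when radare2 has found no functions (for instance when analysis has not been run; running r.cmd('aaa') first usually populates the function list). A defensive wrapper, sketched here without r2pipe itself, treats None as an empty list:

```python
def iter_functions(aflj_result):
    """r2.cmdj('aflj') returns None when no functions are known;
    fall back to an empty list so iteration is always safe."""
    return aflj_result or []

# With r2pipe this would be used as:
#   for fn in iter_functions(r.cmdj('aflj')): ...
print(list(iter_functions(None)))
```

This makes bin2asm.py skip empty binaries cleanly instead of crashing, though it does not explain why radare2 found no functions in the PE file in the first place.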

I tried to reproduce the experiments in the Asm2Vec paper using this implementation, but my results are really bad. What is the problem?

I attempted to use this library to compute the cosine similarity between the O0- and O3-optimized functions in coreutils version 8.30 (taken from https://github.com/yueduan/DeepBinDiff). To replicate the results, I used the same training options as the paper: embedding dimension 200, learning rate 0.025, and 10 random walks.

I found the average cosine similarity to be 0.128. Given that the paper reports Asm2Vec correctly matching around 80% of O0/O3 function pairs, this score is very poor. Do you have an explanation?

Is there any dataset?

I want to test this model, but I cannot find any dataset in this repository.

Which binaries did you use as a dataset? Could you please add them to the repository?

About parsing assembly & normalization

I found a difference between radare2 and xed-interface while doing some experiments:

  • radare2 will output mov byte [rax], 0xb2
  • xed-interface will output mov byte ptr [rax], 0xb2

This might be a problem: if you do not use bin2asm.py to generate the data, assembly code obtained elsewhere may not be normalized the same way and may contain tiny differences.

Parsing and normalizing the assembly inside the asm2vec library itself may be a better solution, perhaps using keystone and capstone to assemble and then disassemble, yielding a unified representation.

Model reports low cosine similarity for identical functions

Using a single function as the training dataset, I am able to generate a model with train.py. But with that same function passed as both target function 1 and target function 2, compare.py reports cosine similarity values close to 0, when the expected value is close to 1 (i.e., almost identical).

# asm/ contains a singular file with one function
python scripts/train.py -i asm/ -o model.pt --epochs 100

# asm/function is used for both training + comparison
python scripts/compare.py -i1 asm/function -i2 asm/function -m model.pt
=> cosine similarity : 0.019504

Am I misunderstanding the usage/purpose of asm2vec-pytorch?

Attached is an example function used for both training and comparison in the model (although I found this to be true of every function I've tested):
function.txt

If it's relevant, this is a function extracted from a statically linked busybox binary.

About classification

The original paper seems to treat this as a multi-label classification problem while learning the embedding. For example, mov rbp, rsp is split into 3 tokens mov, rbp, rsp, and we try to push the classifier's output for these 3 tokens higher. The problem is that all 3 tokens share a single classifier. But we already know that an instruction splits into at most 3 parts: push rbp becomes push, rbp, <empty>, and ret becomes ret, <empty>, <empty>. We could use 3 classifiers, one per slot, and treat it as an ordinary multi-class classification problem. The network might learn better. Just a thought.
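The three-slot idea above can be sketched with three independent linear heads over a shared context vector. This is a toy illustration of the proposal, not the repository's actual architecture, and all sizes are made up:

```python
import torch
import torch.nn as nn

vocab = 16   # toy token vocabulary (mnemonics, operands, <empty>)
dim = 8      # toy embedding dimension

class ThreeSlotHead(nn.Module):
    """One independent classifier per instruction slot
    (mnemonic, operand 1, operand 2) instead of a single shared one."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(3))

    def forward(self, context):
        # context: (batch, dim) -> one logit vector per slot: (batch, 3, vocab)
        return torch.stack([h(context) for h in self.heads], dim=1)

logits = ThreeSlotHead(dim, vocab)(torch.randn(2, dim))
print(logits.shape)
```

Each slot then gets its own softmax over the vocabulary, so ret can place <empty> in slots 2 and 3 without competing against operand tokens in a shared output layer.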
