oalieno / asm2vec-pytorch
Unofficial implementation of asm2vec using pytorch (with GPU acceleration)
License: MIT License
Hello oalieno, I'm reproducing a paper that used Asm2Vec. I used Ghidra to extract function features and the method from bin2asm.py to normalize the assembly code.
The original code:
#include <stdio.h>
int main() {
printf("Hello World!\n");
return 0;
}
Before normalization (just one of the results)
.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[0x102004]
MOV RDI,RAX
CALL 0x00101030
MOV EAX,0x0
POP RBP
RET
After normalization
.name main
.offset 00101139
.file a.out
PUSH RBP
MOV RBP,RSP
LEA RAX,[CONST]
MOV RDI,RAX
CALL CONST
MOV EAX,CONST
POP RBP
RET
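For reference, the transformation shown above can be sketched in a few lines of Python (an illustrative sketch of the CONST substitution, not the exact logic of bin2asm.py):

```python
import re

def normalize(line: str) -> str:
    # Replace hex immediates/addresses with the CONST token,
    # mirroring the before/after listings above.
    return re.sub(r'0x[0-9a-fA-F]+', 'CONST', line)

print(normalize('LEA RAX,[0x102004]'))  # LEA RAX,[CONST]
print(normalize('MOV EAX,0x0'))         # MOV EAX,CONST
```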
Then I tried to run python scripts/train.py -i asm/ -o model.pt, but an error occurred:
Traceback (most recent call last):
File "scripts/train.py", line 52, in <module>
cli()
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "scripts/train.py", line 48, in cli
learning_rate=lr
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 74, in train
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/utils.py", line 43, in preprocess
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in random_walk
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 115, in <listcomp>
File "/home/bronya/.conda/envs/py3.7/lib/python3.7/site-packages/asm2vec-1.0.0-py3.7.egg/asm2vec/datatype.py", line 117, in _random_walk
IndexError: list index out of range
Does this only work when radare2 is used to extract the function features? How can I use this model with other disassemblers/decompilers? Thanks very much.
from the readme:
python scripts/test.py -i asm/123456 -m model.pt
After you train your model, grab an assembly function and look at the result.
This script shows you how the model performs.
Once you are satisfied, you can take out the function's embedding vector and do whatever you want with it.
From my testing, test.py appears to return predictions from the instruction embeddings for the function(s) passed in, not a prediction for the function embedding itself. How do we obtain function vectors for further use?
I am trying to use the tool to get embeddings for all functions in a binary file. The file has about 10,000 functions, but the tool only extracts a few of them and misses about 9,000. Many functions present in the symbol table are missed, and many offsets differ from IDA Pro; it seems the functions produced by bin2asm.py are not real functions.
The differences are as follows. The missed symbol-table functions are shown in IDA as follows:
functions, tokens_new = asm2vec.utils.load_data(data)
for f in functions:
print(f.meta)
Why does this happen? And is there a solution?
Besides, my goal is to get the embeddings of all functions in a file into one tensor or NumPy array, even when the number of functions is not known in advance, but it seems I can only fetch specific indices at a time. I don't know much about torch, so what can I do?
For example, I can get the first three, but not the full range from 0 to the number of functions:
v1 = model.to('cpu').embeddings_f(torch.tensor([0,2,3]))
print(v1)
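One way to fetch every row at once is to index the embedding table with torch.arange. A minimal sketch, assuming the trained model exposes its per-function vectors via an nn.Embedding called embeddings_f (as in the snippet above); the stand-in model and its sizes (10 functions, dim 200) are made up for illustration:

```python
import torch
import torch.nn as nn

# Stand-in for the trained asm2vec model: only the embeddings_f
# attribute matters here.
class Model(nn.Module):
    def __init__(self, n_functions=10, dim=200):
        super().__init__()
        self.embeddings_f = nn.Embedding(n_functions, dim)

model = Model().to('cpu')

# Index every row at once instead of a hard-coded [0, 2, 3]:
n = model.embeddings_f.num_embeddings
all_vecs = model.embeddings_f(torch.arange(n))   # tensor of shape (n, dim)
all_np = all_vecs.detach().numpy()               # NumPy array for further use
```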
I'm not familiar with radare2 and r2pipe, but I installed them and already set up the environment path. I get some errors when I run the bin2asm.py script, and I tried to modify it but failed. Can you help me solve this problem?
I have already set the return value of validEXE() to True, but errors still occur:
File "bin2asm.py", line 68, in bin2asm
for fn in r.cmdj('aflj'):
TypeError: 'NoneType' object is not iterable
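For what it's worth, r.cmdj('aflj') returns None when radare2 has not yet analyzed the binary (e.g. no 'aaa'/'aaaa' pass was run) or when it found no functions, so the loop can be guarded against that. The cmdj below is a hypothetical stand-in for the r2pipe call, just to show the pattern:

```python
# Hypothetical stand-in for r2pipe's cmdj('aflj'), which returns None
# when radare2 has not analyzed the binary or found no functions.
def cmdj(cmd):
    return None  # simulates radare2 before analysis

# Defensive version of the failing loop: treat None as an empty list
# instead of iterating it directly.
functions = cmdj('aflj') or []
for fn in functions:
    print(fn['name'])

print(len(functions))  # 0
```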
I attempted to use this library to figure out the cosine similarity between the O0 and O3 optimized functions in coreutils version 8.30 (taken from https://github.com/yueduan/DeepBinDiff). In order to try to replicate the results, I used the same options for training as in the paper - embedding dimension 200, learning rate 0.025 - and changed the number of random walks to be 10 as in the paper.
I found the average cosine similarity to be 0.128. Given that the results in the paper show that Asm2Vec should correctly match around 80% of O0 and O3 functions, the score is very poor. Do you have an explanation?
I want to test this model, but I cannot find any dataset in this repository.
Which binaries did you use as a dataset? Could you please add them to the repository?
I found a difference between radare2 and xed-interface while doing some experiments:
radare2 will output mov byte [rax], 0xb2
xed-interface will output mov byte ptr [rax], 0xb2
This might be a problem. If you do not use bin2asm.py to generate the data, the assembly code you get elsewhere may not be normalized and may have tiny differences. Parsing and normalizing the assembly code inside the asm2vec library may be a better solution.
Maybe use keystone and capstone to assemble and then disassemble to obtain a unified representation.
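Alternatively, for the single ptr difference noted above, a small textual normalization would already make the two disassemblers agree. A sketch under the assumption that the ptr size-specifier keyword is the only divergence:

```python
import re

def canonicalize(insn: str) -> str:
    # Drop the optional 'ptr' size-specifier keyword so that
    # radare2-style and xed-style output converge on one form.
    return re.sub(r'\bptr\b\s*', '', insn).strip()

assert canonicalize('mov byte ptr [rax], 0xb2') == canonicalize('mov byte [rax], 0xb2')
```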
Using a single function as the training dataset, I'm able to generate a model with train.py. With that same function as both target function 1 and target function 2, compare.py reports cosine similarity values close to 0 when the expected value is close to 1 (i.e. almost identical).
# asm/ contains a singular file with one function
python scripts/train.py -i asm/ -o model.pt --epochs 100
# asm/function is used for both training + comparison
python scripts/compare.py -i1 asm/function -i2 asm/function -m model.pt
=> cosine similarity : 0.019504
Am I misunderstanding the usage/purpose of asm2vec-pytorch?
Attached is an example function used for both training + comparison in the model (although I found this to be true of every function I've tested)
function.txt
If it's relevant, this is a function extracted from a statically linked busybox binary.
The original paper seems to treat this as a multi-label classification problem while learning the embeddings. For example, mov rbp, rsp is split into 3 tokens: mov, rbp, and rsp, and we try to raise the classifier's output for these 3 tokens. The problem is that all 3 tokens share a single classifier. But we already know an instruction splits into at most 3 parts: push rbp splits into push, rbp, <empty>, and ret splits into ret, <empty>, <empty>. We could instead use 3 classifiers, one per slot, and treat it as a normal multi-class classification problem. The network may learn better. Just a thought.
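A minimal PyTorch sketch of that idea, with one independent classifier head per token slot (all sizes here are illustrative assumptions, not values from this repo):

```python
import torch
import torch.nn as nn

class ThreeSlotClassifier(nn.Module):
    # One independent linear head per token slot:
    # slot 0 = mnemonic, slot 1 = first operand, slot 2 = second operand
    # (missing operands map to an <empty> token in the vocabulary).
    def __init__(self, dim=200, vocab_size=1000):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab_size) for _ in range(3))

    def forward(self, ctx):
        # ctx: (batch, dim) context embedding -> 3 x (batch, vocab) logits
        return [head(ctx) for head in self.heads]

logits = ThreeSlotClassifier()(torch.randn(4, 200))
```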