Introduction

XDA is a tool that disassembles instructions and recovers function boundaries of stripped binaries. It is based on transfer learning, using a Transformer encoder with a masked language modeling objective [1, 2, 3]. It outperforms state-of-the-art tools (e.g., IDA Pro, Ghidra, and a bidirectional RNN [4]). Please find the details in our paper: XDA: Accurate, Robust Disassembly with Transfer Learning.

@inproceedings{pei2021xda,
    title={XDA: Accurate, Robust Disassembly with Transfer Learning},
    author={Pei, Kexin and Guan, Jonas and King, David Williams and Yang, Junfeng and Jana, Suman},
    year={2021},
    booktitle={Proceedings of the 2021 Network and Distributed System Security Symposium (NDSS)}
}

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n xda python=3.7 numpy scipy scikit-learn colorama

and activate the conda environment:

conda activate xda

Then, install the latest PyTorch (this assumes you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
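Before moving on, it can help to confirm that PyTorch was installed and can see the GPU. The following is a minimal sketch, not part of XDA; the function name describe_torch_install is hypothetical, and it only relies on standard PyTorch calls (torch.cuda.is_available and torch.cuda.get_device_name).

```python
import importlib.util


def describe_torch_install() -> str:
    """Return a one-line summary of the local PyTorch installation."""
    # find_spec avoids an ImportError when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        # Index 0 is the first CUDA device visible to this process.
        name = torch.cuda.get_device_name(0)
        return f"PyTorch {torch.__version__} with GPU: {name}"
    return f"PyTorch {torch.__version__} without a visible GPU (CPU only)"


if __name__ == "__main__":
    print(describe_torch_install())
```

If the summary reports CPU only, the cudatoolkit version above may not match the installed driver.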

Finally, enter the xda root directory (e.g., path/to/xda) and install XDA:

pip install --editable .

Preparation

Pretrained models:

Create the checkpoints and checkpoints/pretrain_all subdirectories in path/to/xda:

mkdir -p checkpoints/pretrain_all

Download our pretrained weight parameters and put them in checkpoints/pretrain_all.

Finetuned models:

We also provide a finetuned model so that you can try function boundary recovery directly. The finetuned model is trained on binaries compiled by MSVC x64. Create the checkpoints/finetune_msvs_funcbound_64 subdirectory in path/to/xda:

mkdir -p checkpoints/finetune_msvs_funcbound_64

Download our finetuned weight parameters and put them in checkpoints/finetune_msvs_funcbound_64.

Play with the finetuned model

We have put some sample data from the BAP corpus, compiled by MSVC x64, in data-raw/msvs_funcbound_64_bap_test. Each data file has two columns. The first column holds the raw bytes of the binary, one byte per row, and the second column holds the label of that byte: function start (F), function end (R), or neither.
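The two-column layout can be read with a few lines of Python. This is a minimal sketch under assumptions that the sample files should be checked against: each row is taken to be a hexadecimal byte and a label separated by whitespace, and "-" is used here as a placeholder for the "neither" label. The helper names parse_labeled_bytes and boundary_offsets are hypothetical and not part of XDA.

```python
from typing import Iterable, List, Tuple

FUNC_START = "F"  # label named in the README
FUNC_END = "R"    # label named in the README


def parse_labeled_bytes(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse '<hex byte> <label>' rows into (byte value, label) pairs."""
    parsed = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        byte_hex, label = fields
        parsed.append((int(byte_hex, 16), label))
    return parsed


def boundary_offsets(pairs: List[Tuple[int, str]], label: str) -> List[int]:
    """Return the byte offsets whose label equals the given label."""
    return [offset for offset, (_, lab) in enumerate(pairs) if lab == label]


# Assumed row format; "-" stands in for "neither" in this sketch.
sample = ["55 F", "8b -", "ec -", "c3 R", "cc -"]
pairs = parse_labeled_bytes(sample)
print(boundary_offsets(pairs, FUNC_START))  # [0]
print(boundary_offsets(pairs, FUNC_END))    # [3]
```

To read a real file, pass an open file handle instead of the sample list, since iterating over a file yields its lines.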

To predict the function boundary in these files, run:

python scripts/play/play_func_bound.py

This script loads the finetuned weights you put in checkpoints/finetune_msvs_funcbound_64 and predicts the function boundaries. It also compares the predictions with the ground truth and with the results from IDA.
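One common way to score such a comparison is per-label precision, recall, and F1 over the byte labels. The sketch below illustrates that idea only; it is not taken from play_func_bound.py, the function name precision_recall_f1 is hypothetical, and "-" is an assumed placeholder for bytes that are neither a function start nor a function end.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    truth: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Score one label (e.g. "F") over two aligned per-byte label sequences."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    tp = sum(t == label and p == label for t, p in pairs)  # correct hits
    fp = sum(t != label and p == label for t, p in pairs)  # spurious hits
    fn = sum(t == label and p != label for t, p in pairs)  # missed boundaries

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: one function start is missed and one is spurious.
truth = ["F", "-", "-", "R", "F", "-", "R"]
predicted = ["F", "-", "F", "R", "-", "-", "R"]
print(precision_recall_f1(truth, predicted, "F"))  # (0.5, 0.5, 0.5)
print(precision_recall_f1(truth, predicted, "R"))  # (1.0, 1.0, 1.0)
```

The same function can score the IDA output by passing its labels as the predicted sequence.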

Sample data with function boundaries

We provide sample training/testing files for pretraining and finetuning in data-src/:

  • data-src/pretrain_all contains the sample raw bytes from stripped binaries for pretraining
  • data-src/funcbound contains the sample raw bytes with function boundaries

We already provide models pretrained on a large number of binaries. If you want to pretrain on your own data, prepare files in the same format as those in data-src/pretrain_all: concatenate the bytes of all binaries and insert a newline (\n) often enough that no line exceeds the maximum length the model accepts. Similarly, if you want to prepare the finetuning data yourself, follow the format shown in data-src/funcbound.
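The concatenate-and-split step can be sketched as follows. This is an illustration under assumptions, not the authors' preprocessing code: the output is assumed to be space-separated two-digit hex tokens (check data-src/pretrain_all for the real token format), the default of 512 tokens per line is a placeholder for the model's actual limit, and bytes_to_lines is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List


def bytes_to_lines(blobs: Iterable[bytes], max_tokens: int = 512) -> Iterator[str]:
    """Concatenate all binaries and yield lines of at most max_tokens bytes."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    buffer: List[str] = []
    for blob in blobs:
        # The buffer is not reset between binaries, so bytes are concatenated.
        for value in blob:
            buffer.append(f"{value:02x}")  # assumed token format: two hex digits
            if len(buffer) == max_tokens:
                yield " ".join(buffer)
                buffer = []
    if buffer:
        yield " ".join(buffer)  # final, possibly shorter, line


# Two tiny stand-ins for stripped binaries.
sample_binaries = [bytes([0x55, 0x8B, 0xEC]), bytes([0xC3, 0xCC])]
lines = list(bytes_to_lines(sample_binaries, max_tokens=4))
for line in lines:
    print(line)
# 55 8b ec c3
# cc
```

For real binaries, read each file with open(path, "rb").read() and write the yielded lines to a text file, one per line.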

The data must be binarized before it can be used for training. To binarize the training data for pretraining, run:

./scripts/pretrain/preprocess-pretrain-all.sh

The binarized training data for pretraining is stored in data-bin/pretrain_all.

To binarize the training data for finetuning, run:

./scripts/finetune/preprocess.sh

The binarized training data for finetuning (function boundary recovery) is stored in data-bin/funcbound.

Training

If you are using your own parsed binaries for pretraining, and you have already binarized them in data-bin/pretrain_all, run:

./scripts/pretrain/pretrain-all.sh

To finetune the model, run:

./scripts/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain_all/ and finetunes the model.

RNN baseline

  • The bi-RNN implementation is released under ./bi-RNN/
  • To run it, download our sample processed SPEC 2017 O1 dataset (training, testing) and put the files in birnn/

References

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[3] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).

[4] Shin, Eui Chul Richard, Dawn Song, and Reza Moazzezi. "Recognizing functions in binaries with neural networks." 24th USENIX Security Symposium. 2015.
