quarkslab / qbindiff

Quarkslab Bindiffer, but not only!

Home Page: https://diffing.quarkslab.com

License: Apache License 2.0

Python 91.88% Cython 5.78% Roff 0.94% Meson 1.40%
binary-diffing network-alignment program-analysis reverse-engineering vulnerability-research

qbindiff's Introduction

QBinDiff

QBinDiff is an experimental binary diffing tool that treats diffing as a Network Alignment Quadratic Problem.

But why develop yet another differ when BinDiff works well?

BinDiff is great, no doubt about it, but it gives us no control over the diffing process. It also works great on standard binaries but lacks flexibility in some corner cases (embedded firmware, diffing two portions of the same binary, etc.).

A key idea of QBinDiff is to enable tuning the diffing programmatically by:

  • writing your own features
  • enforcing some matches
  • emphasizing either the content of functions (similarity) or the links between them (call graph)

In essence, the idea is to be able to diff using your own criteria, which are sometimes not the control flow and instructions but could, for instance, be data-oriented.

Last, QBinDiff has primarily been designed with the binary-diffing use case in mind, but it can be applied to various other use cases such as social networks. Indeed, diffing two programs boils down to determining the best alignment of their call graphs following some similarity criterion.

Solving this problem is APX-hard, which is why QBinDiff uses a machine learning approach (more precisely, optimization) to approximate the best match.

Like BinDiff, QBinDiff works on an exported disassembly of the program obtained from IDA. Originally based on BinExport, it now also supports Quokka as a backend, whose exported files are more exhaustive and more compact on disk (good for large binary datasets).

Note

QBinDiff is an experimental tool for power users, in which many parameters, features, thresholds and weights can be adjusted. Obtaining good results usually requires tuning these parameters.

(Please note that QBinDiff does not intend to be faster than other differs, but rather to be more flexible.)

Warning

QBinDiff graph alignment is very memory intensive (it computes large matrices) and can fill the RAM if used without care. Avoid diffing binaries with more than about 10k functions. For large programs, use a very high sparsity ratio (e.g. 0.99).
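
To get a feel for the memory cost, the following back-of-the-envelope sketch estimates the size of a dense function-to-function similarity matrix and of what remains after applying a sparsity ratio. The helper name and the 4-byte entry size are assumptions made for illustration; the real footprint depends on QBinDiff's internal data structures and on the overhead of sparse storage.

```python
def similarity_matrix_bytes(n_primary, n_secondary, sparsity_ratio=0.0, bytes_per_entry=4):
    """Rough size of an n_primary x n_secondary similarity matrix.

    sparsity_ratio is the fraction of entries that are dropped (0.0 keeps all).
    bytes_per_entry=4 assumes 32-bit floats (an assumption); the index overhead
    of sparse formats is ignored.
    """
    if not 0.0 <= sparsity_ratio <= 1.0:
        raise ValueError("sparsity_ratio must be between 0.0 and 1.0")
    total_entries = n_primary * n_secondary
    kept_entries = round(total_entries * (1.0 - sparsity_ratio))
    return kept_entries * bytes_per_entry


dense = similarity_matrix_bytes(10_000, 10_000)
sparse = similarity_matrix_bytes(10_000, 10_000, sparsity_ratio=0.99)
print(f"dense:  {dense / 1e6:.0f} MB")
print(f"sparse: {sparse / 1e6:.0f} MB")
```

Under these assumptions a single dense matrix for two 10k-function binaries already takes hundreds of megabytes, and the algorithm may need several such structures at once.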

Documentation

The documentation can be found on the diffing portal or can be manually built with:

pip install .[doc]
cd doc
make html

Below you will find some sections extracted from the documentation. Please refer to the full documentation in case of issues.

Installation

QBinDiff can be installed through pip with:

pip install qbindiff

As some parts of the algorithm are very CPU intensive, the installation will compile some components written in native C/C++.

QBinDiff relies on several other projects (also developed at Quarkslab):

  • python-binexport, a wrapper on the BinExport protobuf format.
  • python-bindiff, a wrapper around BinDiff (used to write results as BinDiff databases).
  • Quokka, another binary exporter based on IDA. It is faster than BinExport and more exhaustive (thus making the diffing more relevant).

Usage (command line)

After installation, the qbindiff binary is available in the path. It takes two exported files as input and starts the diffing analysis. The result can then be exported in the BinDiff file format. The default format for input files is BinExport; for a complete list of backend loaders, look at the -l1, --loader1 option in the help. The complete command line options are:

Usage: qbindiff [OPTIONS] <primary file> <secondary file>

  QBinDiff is an experimental binary diffing tool based on machine learning technics, namely Belief propagation.

Options:
  -l1, --loader1 <loader>       Loader type to be used for the primary. Must be one of these ['binexport', 'quokka',
                                'ida']  [default: binexport]
  -l2, --loader2 <loader>       Loader type to be used for the secondary. Must be one of these ['binexport', 'quokka',
                                'ida']  [default: binexport]
  -f, --feature <feature>       Features to use for the binary analysis, it can be specified multiple times.
                                Features may be weighted by a positive value (default 1.0) and/or compared with a
                                specific distance (by default the option -d is used) like this <feature>:<weight>:<distance>.
                                For a list of all the features available see --list-features.
  -n, --normalize               Normalize the Call Graph (can potentially lead to a partial matching). [default
                                disabled]
  -d, --distance <function>     The following distances are available ['canberra', 'euclidean', 'cosine',
                                'jaccard_strong']  [default: canberra]
  -s, --sparsity-ratio FLOAT    Ratio of least probable matches to ignore. Between 0.0 (nothing is ignored) to 1.0
                                (only perfect matches are considered)  [default: 0.75]
  -sr, --sparse-row             Whether to build the sparse similarity matrix considering its entirety or processing
                                it row per row
  -t, --tradeoff FLOAT          Tradeoff between function content (near 1.0) and call-graph information (near 0.0)
                                [default: 0.75]
  -e, --epsilon FLOAT           Relaxation parameter to enforce convergence  [default: 0.5]
  -i, --maxiter INTEGER         Maximum number of iteration for belief propagation  [default: 1000]
  -e1, --executable1 PATH       Path to the primary raw executable. Must be provided if using quokka loader
  -e2, --executable2 PATH       Path to the secondary raw executable. Must be provided if using quokka loader
  -o, --output PATH             Write output to PATH
  -ff, --file-format [bindiff|csv]
                                The file format of the output file  [default: csv]
  -v, --verbose                 Activate debugging messages. Can be supplied multiple times to increase verbosity
  --version                     Show the version and exit.
  --arch-primary TEXT           Force the architecture when disassembling for the primary. Format is
                                'CS_ARCH_X:CS_MODE_Ya,CS_MODE_Yb,...'
  --arch-secondary TEXT         Force the architecture when disassembling for the secondary. Format is
                                'CS_ARCH_X:CS_MODE_Ya,CS_MODE_Yb,...'
  --list-features               List all the available features
  -h, --help                    Show this message and exit.
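
The -f option above accepts values of the form <feature>:<weight>:<distance>. As a minimal sketch of how such a value could be split, the hypothetical helper below parses it into its three parts. It is not QBinDiff's own parser: the helper name, the placeholder feature name and the rule that a non-numeric optional field is read as a distance are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class FeatureSpec(NamedTuple):
    name: str
    weight: float            # defaults to 1.0, as stated in the help text
    distance: Optional[str]  # None means "fall back to the -d option"


def parse_feature_spec(spec: str) -> FeatureSpec:
    """Parse '<feature>[:<weight>][:<distance>]' (hypothetical helper)."""
    name, *optional = spec.split(":")
    if not name or len(optional) > 2:
        raise ValueError(f"invalid feature spec: {spec!r}")
    weight, distance = 1.0, None
    for field in optional:
        try:
            weight = float(field)   # numeric field: the weight
        except ValueError:
            distance = field        # anything else: the distance name
    if weight <= 0:
        raise ValueError("weight must be positive")
    return FeatureSpec(name, weight, distance)


# "some_feature" is a placeholder, not a real QBinDiff feature name.
print(parse_feature_spec("some_feature"))
print(parse_feature_spec("some_feature:2.5:cosine"))
```

The real feature names are the ones reported by --list-features.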

Library usage

The strength of QBinDiff is that it can be used as a Python library. The following snippet shows an example of loading two BinExport files and comparing them using the WeisfeilerLehman feature with the cosine distance.

from qbindiff import QBinDiff, Program
from qbindiff.features import WeisfeilerLehman
from pathlib import Path

p1 = Program(Path("primary.BinExport"))
p2 = Program(Path("secondary.BinExport"))

differ = QBinDiff(p1, p2)
differ.register_feature_extractor(WeisfeilerLehman, 1.0, distance='cosine')

differ.process()

mapping = differ.compute_matching()
output = {(match.primary.addr, match.secondary.addr) for match in mapping}
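
Once the `output` set of address pairs has been built, it can be post-processed with ordinary Python. As a minimal sketch that does not depend on QBinDiff itself, the helper below writes such pairs to CSV; the function name, the column names and the sample addresses are assumptions made for illustration.

```python
import csv
import io


def write_matches_csv(pairs, stream):
    """Write (primary_addr, secondary_addr) pairs as hexadecimal CSV rows."""
    writer = csv.writer(stream)
    writer.writerow(["primary_addr", "secondary_addr"])  # hypothetical column names
    for primary_addr, secondary_addr in sorted(pairs):
        writer.writerow([hex(primary_addr), hex(secondary_addr)])


# Stand-in for the `output` set produced by a real diff (made-up addresses).
sample_output = {(0x401000, 0x401010), (0x401200, 0x4011F0)}
buffer = io.StringIO()
write_matches_csv(sample_output, buffer)
print(buffer.getvalue())
```

Passing an open file object instead of the in-memory buffer would write the same rows to disk.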

Contributing & Contributors

Any help or feedback is greatly appreciated, via GitHub issues or pull requests.

Current:

  • Robin David
  • Riccardo Mori
  • Roxane Cohen

Past:

  • Alexis Challande
  • Elie Mengin

qbindiff's People

Contributors

fenrisfulsur, patacca, rc-94, robindavid

qbindiff's Issues

Separate the data types between simple and composite

Right now the data types are divided into "simple" (word, dword, ascii, etc.) and "structure" (structure, union, enum).
Imho this is not a good choice of words, as the term "structure" could mean either a proper structure (like a C struct) or the whole collection of data types (struct, union and enum).

I propose to refactor the data types to add enum as a "simple" type (after all it's not a composite type, its fields' types are homogeneous) and rename the "structure" type to "composite". A "composite" data type would be either a struct or a union, making it in fact a heterogeneous type.

More tests need to be done to prove that an enum can effectively be moved into the "simple" types.
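
The proposed split could look roughly like the sketch below. The `DataType` enumeration and its members are hypothetical and only illustrate the idea; they are not the actual qbindiff definitions.

```python
from enum import Enum, auto


class DataType(Enum):
    """Hypothetical data type enumeration following the proposed split."""

    WORD = auto()
    DWORD = auto()
    ASCII = auto()
    ENUM = auto()    # proposed: treated as a simple type
    STRUCT = auto()
    UNION = auto()

    @property
    def is_composite(self) -> bool:
        # Only heterogeneous aggregates (struct, union) count as composite.
        return self in (DataType.STRUCT, DataType.UNION)


print([t.name for t in DataType if t.is_composite])
```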

BinExport files result in many keyerror issues

Following the documentation I've tried secondary = Program(LoaderType.binexport, "./objcopy.BinExport")
and p1 = Program(Path("objcopy.BinExport")) both of which crash in unexpected ways.

...
     91     dst = cg.vertex[edge.target_vertex_index].address
     92     self.callgraph.add_edge(src, dst)
---> 93     self[src].children.add(self[dst])
     94     self[dst].parents.add(self[src])
     96 # Create a map of function names for quick lookup later on
...
  self.be_prog = binexport.ProgramBinExport(file)
  File ".../python3.10/site-packages/binexport/program.py", line 93, in __init__
    self[src].children.add(self[dst])
KeyError: 4751856

When using the cli via qbindiff -l 'binexport' file1.BinExport file2.BinExport I get the same sort of key errors as well despite these files working fine with BinDiff as-is.

Note: If I do not explicitly ask for binexport I get this issue

     59     self._backend = ProgramBackendQuokka(*args, **kwargs)
     61 else:
---> 62     raise NotImplementedError("Loader: %s not implemented" % loader)
     64 self._filter = lambda x: True
     65 self._load_functions()

Any idea what could be causing the issue? I am using BinDiff 8 and Ghidra 10.3

Problem with type of instruction (related to capstone)

Hi, I'm getting the following error when using differ.compute_matching():

Traceback (most recent call last):
  File "test.py", line 70, in <module>
    main()
  File "test.py", line 64, in main
    matches = differ.compute_matching()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/differ.py", line 313, in compute_matching
    for _ in tqdm.tqdm(self._matching_iterator(), total=self.maxiter, disable=not is_debug()):
  File "/lib/python3.11/site-packages/tqdm/std.py", line 1170, in __iter__
    for obj in iterable:
  File "/lib/python3.11/site-packages/qbindiff/differ.py", line 325, in _matching_iterator
    self.process()
  File "/lib/python3.11/site-packages/qbindiff/differ.py", line 303, in process
    self.run_passes()  # User registered passes
    ^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/differ.py", line 274, in run_passes
    self.p_features, self.s_features = pass_func(
                                       ^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/passes/base.py", line 224, in __call__
    primary_features = self._visitor.visit(primary, key_fun=key_fun)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 79, in visit
    self.visit_item(graph, node, collector)
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 150, in visit_item
    self.visit_function(program, item, collector)
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 226, in visit_function
    self.visit_basic_block(program, bb, collector)
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 245, in visit_basic_block
    self.visit_instruction(program, inst, collector)
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 263, in visit_instruction
    self.visit_operand(program, op, collector)
  File "/lib/python3.11/site-packages/qbindiff/visitor.py", line 277, in visit_operand
    callback(program, operand, collector)
  File "/lib/python3.11/site-packages/qbindiff/features/graph.py", line 225, in visit_operand
    if operand.type in (OperandType.memory, OperandType.displacement, OperandType.phrase):
       ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/loader/operand.py", line 46, in type
    return self._backend.type
           ^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/qbindiff/loader/backend/binexport.py", line 144, in type
    raise NotImplementedError(f"Unrecognized capstone type {cs_op_type}")
NotImplementedError: Unrecognized capstone type 3

The code in which the exception occurs is the following:

@property
def type(self) -> OperandType:
    """Returns the capstone operand type"""
    op = self.cs_operand
    typ = OperandType.unknown
    cs_op_type = self.cs_operand.type

    if cs_op_type == capstone.CS_OP_REG:
        return OperandType.register
    elif cs_op_type == capstone.CS_OP_IMM:
        return OperandType.immediate
    elif cs_op_type == capstone.CS_OP_MEM:
        # A displacement is represented as [reg+hex] (example : [rdi+0x1234])
        # Then, base (reg) and disp (hex) should be different of 0
        if op.mem.base != 0 and op.mem.disp != 0:
            typ = OperandType.displacement
        # A phrase is represented as [reg1 + reg2] (example : [rdi + eax])
        # Then, base (reg1) and index (reg2) should be different of 0
        if op.mem.base != 0 and op.mem.index != 0:
            typ = OperandType.phrase
        if op.mem.disp != 0:
            typ = OperandType.displacement
    else:
        raise NotImplementedError(f"Unrecognized capstone type {cs_op_type}")
    return typ

However, type 3 corresponds to capstone.CS_OP_MEM, as you can see in the interpreter:

>>> import capstone
>>> capstone.CS_OP
{0: 'CS_OP_INVALID', 1: 'CS_OP_REG', 2: 'CS_OP_IMM', 3: 'CS_OP_MEM', 4: 'CS_OP_FP'}
>>> capstone.CS_OP_MEM
3

So there must be something tricky going on...

Update to sphinx >= 7.2

This is just a reminder to update to sphinx >= 7.2, as it has better support for PEP 585 builtin generics (see sphinx-doc/sphinx#11570).

Currently there are some blockers to update to sphinx >= 7.2:

  • sphinx_rtd_theme
  • enum_tools

Clarify the difference between FunctionType.extern and FunctionType.imported

It seems that there shouldn't be any difference between an imported and an extern function; however, there are currently two different types.
This difference originates from the Quokka Python bindings, where the external type is introduced.

If I guessed correctly, the problem arises because IDA sometimes behaves very differently with imported functions in a PE binary (treating them like data instead of code).
More testing is needed.

Qbindiff can't guess the arch

While trying to use qbindiff on an ARM32 Thumb program, I got the following exception:

Cannot guess the instruction set of the instruction at 0x....

I fixed the issue by hard-coding the mode and arch inside the file qbindiff/loader/backend/binexport.py, but it would be nice to let users define the arch and mode when they know them, something like:

differ = qbindiff.QBinDiff(
    p, q,
    distance=Distance.canberra,
    ...,
    arch="ARM-32",
    mode="THUMB"
)
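
For comparison, the --arch-primary and --arch-secondary command line options listed earlier on this page take the architecture as a string of the form 'CS_ARCH_X:CS_MODE_Ya,CS_MODE_Yb,...'. The sketch below shows one way such a string could be split and validated. The helper is hypothetical and is not how QBinDiff parses the option internally.

```python
def parse_arch_spec(spec: str):
    """Split 'CS_ARCH_X:CS_MODE_Ya,CS_MODE_Yb,...' into (arch_name, [mode_names]).

    Hypothetical helper: it only checks the textual format and does not
    verify that the names are real capstone constants.
    """
    arch, sep, modes = spec.partition(":")
    if not sep or not arch.startswith("CS_ARCH_"):
        raise ValueError(f"invalid architecture spec: {spec!r}")
    mode_names = [m.strip() for m in modes.split(",") if m.strip()]
    if not mode_names or not all(m.startswith("CS_MODE_") for m in mode_names):
        raise ValueError(f"invalid mode list in: {spec!r}")
    return arch, mode_names


arch, modes = parse_arch_spec("CS_ARCH_ARM:CS_MODE_THUMB")
print(arch, modes)
```

Since capstone exposes constants such as CS_ARCH_ARM and CS_MODE_THUMB as module attributes, the returned names could then be resolved with getattr(capstone, name).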

Belief propagation raised a RuntimeWarning

Hello,

Each time I use qbindiff, the following warning is raised. It appears with the pip package and also when I build qbindiff from sources.

As I do not know whether this warning is important or not, I cannot propose a patch. But I suppose it should be handled by qbindiff.

[HOME]/.virtualenvs/qbindiff-dev/lib/python3.11/site-packages/qbindiff/matcher/belief_propagation.py:273: RuntimeWarning: overflow encountered in power
  x / (1 + x) for x in np.clip(np.power(math.e, curr_marginals.data), 0, 1e6)

Thanks!
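
As a possible direction (a sketch, not a confirmed fix), note that the expression in the warning computes x / (1 + x) with x = e^marginal clipped to [0, 1e6]. With floating-point arrays an overflowing power evaluates to inf, which the subsequent clip brings back to 1e6, so the overflow only affects values that are capped anyway. Clamping the exponent before exponentiating should give the same result without ever overflowing. The scalar version below uses only the standard library and the function name is hypothetical; in NumPy the same clamp could be applied with np.minimum before exponentiating.

```python
import math

CAP = 1e6  # same upper bound as the clip in the warning above


def capped_odds_ratio(marginal: float) -> float:
    """Compute x / (1 + x) with x = min(e**marginal, CAP), without overflow."""
    # Clamping the exponent first yields the same x as clipping afterwards
    # (up to floating-point rounding), but exp() never sees a huge argument.
    x = math.exp(min(marginal, math.log(CAP)))
    return x / (1.0 + x)


for m in (-5.0, 0.0, 5.0, 1000.0):
    print(m, capped_odds_ratio(m))
```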

The Linear Assignment Problem doesn't always make good matches

The LAP (Linear Assignment Problem) can sometimes make very poor choices because the marginals have already been flattened (because of the epsilon relaxation), and it is then impossible to extract valuable information from them. We need either a better algorithm or a better source of information than the flattened marginals.

It might be possible to use the score matrix directly, or to run the NAP again on only a subset of the nodes.
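
To illustrate why flattened scores are a poor input for the assignment step, the toy sketch below brute-forces a tiny assignment problem and reports the gap between the best and the second-best total score. The matrices are made up and the brute-force search is only for illustration; it is not the solver used by qbindiff.

```python
from itertools import permutations


def assignment_gap(scores):
    """Return (best_assignment, gap between best and second-best total score).

    Brute force over all permutations: only suitable for tiny square matrices.
    """
    n = len(scores)
    totals = sorted(
        ((sum(scores[i][p[i]] for i in range(n)), p) for p in permutations(range(n))),
        reverse=True,
    )
    (best_total, best), (second_total, _) = totals[0], totals[1]
    return best, best_total - second_total


# Made-up score matrices: rows are primary functions, columns secondary ones.
sharp = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.2], [0.2, 0.1, 0.7]]
flat = [[0.51, 0.50, 0.50], [0.50, 0.51, 0.50], [0.50, 0.50, 0.51]]

best_sharp, gap_sharp = assignment_gap(sharp)
best_flat, gap_flat = assignment_gap(flat)
print(best_sharp, gap_sharp)
print(best_flat, gap_flat)
```

With the flat matrix the winning assignment beats the runner-up by a tiny margin, so a small perturbation of the scores can flip the result.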

Use a better feature for the function names

Right now the only feature that leverages function names is FuncName, and it works as a yes/no: it gives a similarity of 1 if two functions have the exact same name, 0 otherwise.

It might be nice to have a better similarity score, for example using the Levenshtein string distance or some LSH function.
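
A minimal sketch of the Levenshtein variant is given below. It is a plain dynamic-programming implementation normalized to [0, 1]; the function names are hypothetical and it is not wired into the qbindiff feature API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical names, 0.0 for fully different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(name_similarity("parse_header", "parse_header"))
print(name_similarity("parse_header", "parse_headers"))
print(name_similarity("parse_header", "memcpy"))
```

Unlike the yes/no comparison, near-identical names such as a function and its renamed variant would still receive a high score.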

Parallelize the feature extractions

We could write cython++ code to make use of multiple cores to parallelize some cpu intensive features like the Weisfeiler Lehman Graph Kernel

Capstone version compatibility

After merging #36 we will break compatibility with older versions of capstone. We should identify which version is the first one that is compatible with qbindiff and enforce it in pyproject.toml

Move from setuptools to a different build system

NumPy is incompatible with recent versions of setuptools, starting from version 65.5.2, and it will be impossible to use the two together at all starting from Python 3.12 because of the removal of the deprecated distutils module. See this for reference.
It is recommended to use setuptools < 60.0 for Python 3.12 and later, even though this won't be enough to guarantee it will work forever.
On top of that, setuptools is not very stable in its configuration API, already relying on 3 config files, none of which can replace all the others: pyproject.toml, setup.cfg, setup.py.

It seems we cannot expect a fix for this to ever happen, as neither the setuptools nor the numpy developers intend to keep the two compatible.

The best solution would be to drop setuptools entirely and transition to meson-python, which is also the solution used by numpy.

Optionally add the raw binary path with binexport

BinExport doesn't export the raw binary path in the protobuf message so either we have to guess it or we can accept it as a parameter in qbindiff. The latter would be better of course but then we lose the option to diff just two .BinExport files, without the raw binaries.

In any case without the raw binary path we cannot recreate the .BinDiff sqlite database.

Add a CFG/CG normalization pass

Possible ideas:

  • Better handling of thunk functions (can safely be removed in the normalization pass)
  • Sometimes there might be some code duplication. When function A jumps to function B, sometimes the whole code of function B is copied within function A instead of treating the jump as a function call
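
The first idea (bypassing thunk functions) can be sketched on a toy call graph as follows. The dictionary-based graph, the function names and the assumption that every thunk has exactly one callee are all hypothetical; this is not the qbindiff data model.

```python
def remove_thunks(callgraph, thunks):
    """Return a copy of the call graph with thunk functions bypassed.

    callgraph: dict mapping a function name to the set of functions it calls.
    thunks: set of function names considered thunks; each one is assumed to
    have exactly one callee (a ValueError is raised otherwise).
    """
    def resolve(func):
        # Follow chains of thunks until a real function is reached.
        seen = set()
        while func in thunks and func not in seen:
            seen.add(func)
            (func,) = callgraph[func]
        return func

    return {
        caller: {resolve(callee) for callee in callees}
        for caller, callees in callgraph.items()
        if caller not in thunks
    }


graph = {
    "main": {"j_memcpy", "helper"},
    "j_memcpy": {"memcpy"},
    "helper": {"j_memcpy"},
    "memcpy": set(),
}
normalized = remove_thunks(graph, thunks={"j_memcpy"})
print(normalized)
```

After the pass, callers point directly at the real target, so two binaries that differ only in their use of thunks end up with the same call graph shape.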
