matt-kempster / m2c Goto Github PK

View Code? Open in Web Editor NEW

379.0 19.0 45.0 2.41 MB

A MIPS and PowerPC decompiler.

License: GNU General Public License v3.0

Python 40.39% Assembly 49.46% C 10.11% C++ 0.04%

decompiler mips python powerpc

m2c's People

Contributors

Stargazers

Watchers

Forkers

danebou simonlindholm mewbak mrcheeze donfanning xeeynamo crackercat emill dragorn421 m000z0rz onlymx13 ethteck jon1scr zbanks stjordanis blue3rd kelebek1 riidefi foxfire22 krimtonz ellipticellipsis tmcg2 barracuda156 kassra-k abaresk angheloalf macweese lollipopman121 mkst tailscreatesstuff32 bates64 r-burns roniskitee uepuepuep gillou68310 ribbanya ayaz345 seekyct juneman cakehonolulu triledev drgn-drc happy-ferret-entertainment nickoairi

m2c's Issues

Same-register variable detection heuristic can incorrectly unify types

With the following asm:

glabel test
addiu   $sp, $sp, -0xC8
sw      $ra, 0x34($sp)
sw      $zero, 0xC4($sp)
lw      $t4, 0xC4($sp)
addiu   $t4, $t4, 0x1234
sw      $t4, 0x40($sp)
addiu   $r6, $a0, 0x10
sw      $r6, 0x40($sp)
lw      $ra, 0x34($sp)
addiu   $sp, $sp, 0xC8
jr      $ra
nop

and context C:

struct Dummy {
    char pad[0x10];
    long member;
};
void test(struct Dummy* arg);

we output

void test(struct Dummy *arg) {
    long *spC4;
    long *sp40;

    spC4 = NULL;
    sp40 = spC4 + 0x1234;
    sp40 = &arg->member;
}

spC4 gets incorrectly detected as long. This is because $t4 is used for both spC4 and sp40, and it hits the overwrite_reg codepath which unifies types.

Warn when blocks don't get emitted

I've seen blocks silently dropped e.g. when they were detected as being part of &&/||, or before I implemented a591103. Would be good to warn in these cases (assuming we don't get too many false positives).

(We should only do this for blocks that have statements to be written.)

Detect zero extension

E.g. (x << 24) >> 24 is a cast to u8/s8, sign depending on whether we have an arithmetical vs. logical right shift.

Identify signed division by constant

    phi_s3 = arg1 >> 3;
    if (arg1 < 0)
    {
        phi_s3 = (s32) (arg1 + 7) >> 3;
    }

should simply become phi_s3 = arg1 / 8;

u32 -> f32 pattern not always detected

We detect

bgez  $a3, .L00400100
 cvt.s.w $reg, whatever
lui   $at, 0x4f80
mtc1  $at, $temp
nop
add.s $reg, $reg, $temp
.L00400100:

but I once saw

cvt.s.w $reg, whatever
bgez  $a3, .L00400100
 something unrelated
lui   $at, 0x4f80
mtc1  $at, $temp
nop
add.s $reg, $reg, $temp
.L00400100:

in some piece of -O2-compiled code (IIRC). Can we work around this with the current framework? Or are these patterns arbitrarily reordered together with other code, making a more complex approach necessary? Needs to be investigated.

Integration with Ghidra/IDA?

Pattern-match function body for matching macros

In if_statements.py, when we have a Body, we should look for sequences of statements that match input macros (akin to how we match types given an input preprocessed C file).

For unrecognized instructions, show input register values

Detect bitfields

if ((a->unk0 << 2) < 0) really checks for (unk0b20 != 0).

The pattern (x << a) < 0 should be replaced by x & 0x20000000 and if x is a struct access to a known struct with a bitfield at the relevant position then a use of that bitfield.

Would be nice to detect (a->b & c) >> d as well.

Investigate removed assertion about emit_goto return nodes

5e4e537

Hit in practice in a case related to jump tables. I think the assertion was safe to remove but I'm not certain.

Also needs a test.

Don't move loads to after conflicting writes

E.g. return x++; currently emits something like

x = x + 1;
return x;

which is wrong. Maybe we could iterate over all registers contents recursively on writes and for loads overlapping the write mark corresponding EvalOnceExpr's as "emit if any new use appears".

I imagine post-increments are the most common case of this (?), so might be worth special-casing.

Edit: function calls are roughly the same as reads in this regard, except they also shouldn't be moved past other function calls, writes, or branches.

Early returns happening in the wrong order

the misc1.c fails in two ways, both for -g and -O2:

it outputs if (x) { return y; } return z; rather than if (!x) { return z; } return y (not super important, but could be worth fixing)
~~it returns something else than it's supposed to (local variable vs constant). I thought I fixed this with the phi node work... need to look into this.~~ Update: fixed.

Don't require jump table targets to follow specific naming pattern

Currently they must be named L<hex addr>, or they will be detected as functions, which breaks everything (you generally get a cryptic KeyError on before_target = label_prev_instr[old_label] due to non-existent labels).

The simplest way to solve this would be to not mark a new function unless all branch labels are found (if they are found in the file at all). It might have false positives if the default case comes first and does a jr ra, however, and the switch statement is not within an if.

Another option would be to look at all jr <non-ra>, look backwards to what table they read from, read that table and parse all entries as non-functions. However, that's more complex and requires rodata to be available.

I think the first option + our current heuristic + a good error message when jump table targets don't exist would be good. Or maybe #36 will make the second option simpler.

Array indexing

Currently outputs *(pointer arithmetic)

Emit a goto instead of emitting nodes multiple times

A few times I've seen nodes get emitted multiple times, e.g. with continue, early returns, and nested &&/||. (Nightmare example that I just encountered: if (blah || (blah && (blah ? (var = func_1(blah)) : (var = func_2(blah))) != 0)).) It's usually not a big deal to manually fix up the control flow in that case, but it's not often clear that it's going on in the first place.

Solving control flow in the general case is impossible, but what we can do is emit gotos that jump into the original node if we encounter a duplicate one, similar to how we handle loops right now. I think the phi node work should make sure that this in fact preserves correctness.

As a smaller starting point, it would also be helpful to just print a warning when duplicate nodes are encountered.

Allow switch targets to be defined without glabel

We should do switch target detection before we prune away unreferenced labels.

Default to decompiling all functions in a .s file

and move the non-verbose ERROR printouts to behind a flag

test

payload

sp-relative arrays are broken

Currently the addiu sp, sp, -stack_size instruction modifies the recorded sp register to contain sp - stack_size, but then we pretend that never happened when using sp-relative addressing in most contexts ("sp is always sp"). In a few contexts, however, mostly/only involving arrays, we actually do fetch the sp register though, ending up with references to local variables like &sp-20 (intended to be &sp).

IIRC, I also saw some switch statements move the stack pointer before, within a small block. So it might be worth explicitly storing the current sp offset, and then make every sp-relative computation take that into account. (Edit: I think this is wrong and I was just confused by static functions.)

Phi elimination based on equal values is too greedy

In the following case:


glabel test
    li    $s0, 0
.loop1:
    move  $v1, $s0
.loop2:
    sw    $v1, ($zero)
    move  $v1, $s0
    bltz  $zero, .loop2
     nop
    sw    $v1, ($zero)
    li   $s0, 1
    b  .loop1
     nop

we emit

void test(void)
{
    ?32 temp_v1;
    ?32 phi_s0;

    phi_s0 = 0;
loop_1:
loop_2:
    *NULL = (?32) temp_v1;
    temp_v1 = phi_s0;
    if (0 < 0)
    {
        goto loop_2;
    }
    *NULL = temp_v1;
    phi_s0 = 1;
    goto loop_1;
}

with a read from temp_v1 which gets assigned only later on. This is because we figure that on entry to loop2, v1 will equal phi_s0 regardless of where it comes from, so we pick one of the sources arbitrarily, and we end up taking temp_v1 which isn't set yet. Part of the problem here is that we do a deep equality check, unwrapping EvalOnceExpr's, and then use one of the outermost wrappers. Using a standard equality check would make us lose some cases of register restoration, I guess?

Output correctly nested switch blocks

Including when switches have multiple jump tables and non-jtbl branches.

Handle indirect jumps from switch statements

The laziest option here is to assume that every function has at most one switch in it, so that if there is an indirect jump, it can jump to any jumptable label. Also needs a lot more work to emit something sane, of course, both in if_statements.py and translate.py.

&& detection can drop statements

I was a case of this where it dropped an important statement of the form sp830 = strchr(sp830, 0x2C). We should never skip emitting blocks with statements in them, better just to fail && detection in that case.

(See also #33.)

Name temporaries after the fields they access

a temp v0 that gets assigned the value of a->b could be named b_v0

Local variables declared backwards

Really easy to fix, but I figured I'd write it down so I don't forget to fix it later. It appears that local variables are declared in reverse order from how they actually should be declared. Also, if we don't already combine declarations with assignment statements when appropriate, we should - it's a nice QoL thing. That is,

s32 sp00;
sp00 = 1;

should be emitted as

s32 sp00 = 1;

Again, we might already do this second part, but since the local variables are declared in reversed order, it never happens anyway.

Handle conditionally set registers

Here's the plan I'm thinking of:

for each block, mark which registers get written in that block
for each register reg, and for every block node n: we need to construct a value for n.regs[reg], either as a direct copy from another RegInfo, or as a phi node, in SSA parlance. Obviously, if this node has just one predecessor this is simple, but in the general case it's more tricky. So we do the following:
- compute whether there exists a node m such that all paths backwards from n, stopping at m, never reach any nodes that set reg
- for these purposes, treat the entry node as setting all registers
- there is bound to be some DFS-based algorithm for computing this in O(n), but the cheating way is to loop over all m and try them one by one. O(n^3 * regs) is hopefully fine... (one could get rid of the regs factor using bitsets, I believe)
- if there is such an m, pick one arbitrarily and copy m.regs[reg] from there
- given that we add an initial pass that throws away all unreachable nodes, these copies can never become cyclic. If they did, that would mean that all paths backwards from this cycle (not stopping anywhere) never set reg, but from some part of the cycle we must be able to reach the entry node, which does set reg, contradiction. This argument works even if we consider different registers at the same time.
- this means that we can compute an order in which to emit the nodes, acyclically, so that we can do direct m.regs[reg] copies when we want to
- if there is no such m, set n.regs[reg] to contain a PhiExpr, and emit a set to that phi variable at the end of all the predecessors. This PhiExpr should work like a EvalOnceExpr, only emitting the phi variable and associated sets, and only marking the associated sets as a read, once it's mark_used.

The if_statements code might already be doing some of these steps, I'm not sure.

Support JALR and BREAK opcodes

Add support for parsing JALR and BREAK opcodes. JALR can be interpreted as:

jalr $ra, $v0 becomes (*v0)(arglist...);
BREAK can either be interpreted as dropping the node it's in (assuming that it's a compiler check that was inlined, such as checking for integer overflow), or as calling a function called throw(break_value) that simply acts as a placeholder for the break logic.

Allow jumping to an epilogue in another function

Use float literals from rodata

loop.c crashing

I'm surprised noone's made this issue to at least discuss whats wrong on-record.

Import function signatures, structs and types from .c file

It would be neat if you could run main.py file.s index file.c, where file.c is the file we're trying to decompile into, and the decompiler would use that file (and its includes) to construct a function signature for the function we're decompiling, convert struct member accesses into reasonably named things, determine how many parameters to pass to functions, etc.

Improve casts at stores/loads

sb, lb, lbu etc. need to be marked somehow, but frequently aren't. We have some code for this already, but I think it's buggy and thinks it's fine to omit the cast when it isn't...

One thing we could potentially do to avoid ugly casts is to output declarations for all global variables referred to.

Use strings literals from rodata

Support functions that return structs

This happens using a different calling convention: a pointer is passed to a part of the stack that lies above subroutine args and below saved registers/ra.

Right now we emit the struct variables as subroutine_argX; sp{X*4} would be an improvement, or we could try to support it fully somehow.

(div is an example of a function that returns a struct)

Support `j` instruction

It can be either a tail call (replace by jalr, jr $ra, nop) or a branch within the same function (replace by b). Currently we fail with a cryptic assertion error (assert jump.is_branch_instruction() in the middle of flow_graph.py).

Infinite loop causes infinite loop

The following C

int x;
void test(void) {
    for (;;) {
        if (x == 2) break;
        x = 1;
    }
}

compiles to

.set noat      # allow manual use of $at
.set noreorder # don't insert nops after branches


glabel test
/* 0000B0 004000B0 3C030041 */  lui   $v1, %hi(D_4100E0)
/* 0000B4 004000B4 246300E0 */  addiu $v1, $v1, %lo(D_4100E0)
/* 0000B8 004000B8 24040001 */  addiu $a0, $zero, 1
/* 0000BC 004000BC 24020002 */  addiu $v0, $zero, 2
.L004000C0:
/* 0000C0 004000C0 8C6E0000 */  lw    $t6, ($v1)
/* 0000C4 004000C4 104E0003 */  beq   $v0, $t6, .L004000D4
/* 0000C8 004000C8 00000000 */   nop
/* 0000CC 004000CC 1000FFFC */  b     .L004000C0
/* 0000D0 004000D0 AC640000 */   sw    $a0, ($v1)
.L004000D4:
/* 0000D4 004000D4 03E00008 */  jr    $ra
/* 0000D8 004000D8 00000000 */   nop

/* 0000DC 004000DC 00000000 */  nop

with IRIX 5.3 -O2, on which we loop forever within immediate_postdominator, and to

.set noat      # allow manual use of $at
.set noreorder # don't insert nops after branches


glabel test
.L004000B0:
/* 0000B0 004000B0 3C0E0041 */  lui   $t6, %hi(D_4100F0)
/* 0000B4 004000B4 8DCE00F0 */  lw    $t6, %lo(D_4100F0)($t6)
/* 0000B8 004000B8 24010002 */  addiu $at, $zero, 2
/* 0000BC 004000BC 15C10003 */  bne   $t6, $at, .L004000CC
/* 0000C0 004000C0 00000000 */   nop
/* 0000C4 004000C4 10000006 */  b     .L004000E0
/* 0000C8 004000C8 00000000 */   nop
.L004000CC:
/* 0000CC 004000CC 240F0001 */  addiu $t7, $zero, 1
/* 0000D0 004000D0 3C010041 */  lui   $at, %hi(D_4100F0)
/* 0000D4 004000D4 AC2F00F0 */  sw    $t7, %lo(D_4100F0)($at)
/* 0000D8 004000D8 1000FFF5 */  b     .L004000B0
/* 0000DC 004000DC 00000000 */   nop
.L004000E0:
/* 0000E0 004000E0 03E00008 */  jr    $ra
/* 0000E4 004000E4 00000000 */   nop

/* 0000E8 004000E8 03E00008 */  jr    $ra
/* 0000EC 004000EC 00000000 */   nop

on -g, on which we die due to unbounded recursion, alternating between reach and end_reachable_without.

Combine declarations with assignments

From #26:

Also, if we don't already combine declarations with assignment statements when appropriate, we should - it's a nice QoL thing. That is,
s32 sp00;
sp00 = 1;
should be emitted as
s32 sp00 = 1;

Remove unnecessary parentheses at assignments and function calls

We should have a stringify_without_parenthesis function that removes the outermost layer of parentheses from the str() output if the value is a BinaryExpr. (We already do this for returns, so why not do it in other places as well.)

Support decompiling binaries

A bit tricky, since we'd have no %lo/%hi markers or function names... But maybe we could have a mode for "decompile the byte range A..B, and auto-generate global variable names"?

Alternatively, we should add documentation on how to use n64split to get a .s file in the right format for mips_to_c to use.

Limit output lines to 80 characters

I have two unrelated complaints about the current state of output:

Lines in the final C code should be limited to 80 characters wherever possible. Python's textwrap library will likely be useful here, although we may have to be intelligent about wrapping things like function calls and assignment statements.
There is too much noise in stdout when you just want to decompile one function. All of the registers, node information, etc should be output only when desired (e.g. for debugging) or when an error is detected. Also, node info should probably be output not in DFS-order, but in linear order (with the original MIPS).

More tests

arrays, both global and on stack
more complex control flow
|| and &&
all instructions
different data types for arguments/return types/subroutine calls
larger functions (real-world ones from some existing C project?)
...more ideas?

It would also interesting to try to fuzz-test the control flow code, to make sure it never crashes.

Handle `b` as a branch-likely when possible

b .label
x
...
x
.label:
y

should be converted to

b .label_prev
nop
...
.label_prev:
x
y

b only sometimes behaves this way, not always. (Presumably the compiler is about to emit bl but realizes that that's a silly thing to do and changes that to b.)

Store/loads from overlapping addresses should work

E.g. if you store 4 bytes to to sp + 20, and then lbu 1 byte from sp + 23, that's a u8 cast. This is important especially for arguments of size 8/16.

Currently we fail this miserably and emit loads of stack variables/arguments without any stores to them (we often eliminate stores to the stack).

Handle struct accesses to offsets that don't fit within s16

i.e. detect (var + (lui constant)) + offset and (var + (lui constant))->offset

Document how to best use the decompiler

How to install/use it, workflow, what the current bugs are and how to work around them, common patterns, ...

Don't trust %lo/%hi so much

Right now we assume that %hi's and %lo's are perfectly paired, so we can set %lo's to zero and expand lui %hi(x) to x. This often results in subtly incorrect output when the %lo/%hi's are wrong. We should ensure pairings are accurate, and if they aren't, emit some weird marker that calls attention to them being wrong (HI(x), LO(x)?).

Improve return detection

Mark returns as definite if v0/f0 is manually set and not read from, ~~and definitely void if both are unset at some return point.~~
~~Call mark_used on return values (on the return node's return register, really, which might be a phi expr!). Maybe move the logic for handling return values into translate.py, for that purpose.~~