matt-kempster / m2c Goto Github PK
View Code? Open in Web Editor NEWA MIPS and PowerPC decompiler.
License: GNU General Public License v3.0
A MIPS and PowerPC decompiler.
License: GNU General Public License v3.0
With the following asm:
glabel test
addiu $sp, $sp, -0xC8
sw $ra, 0x34($sp)
sw $zero, 0xC4($sp)
lw $t4, 0xC4($sp)
addiu $t4, $t4, 0x1234
sw $t4, 0x40($sp)
addiu $r6, $a0, 0x10
sw $r6, 0x40($sp)
lw $ra, 0x34($sp)
addiu $sp, $sp, 0xC8
jr $ra
nop
and context C:
struct Dummy {
char pad[0x10];
long member;
};
void test(struct Dummy* arg);
we output
void test(struct Dummy *arg) {
long *spC4;
long *sp40;
spC4 = NULL;
sp40 = spC4 + 0x1234;
sp40 = &arg->member;
}
spC4 gets incorrectly detected as long
. This is because $t4 is used for both spC4 and sp40, and it hits the overwrite_reg codepath which unifies types.
I've seen blocks silently dropped e.g. when they were detected as being part of &&/||, or before I implemented a591103. Would be good to warn in these cases (assuming we don't get too many false positives).
(We should only do this for blocks that have statements to be written.)
E.g. (x << 24) >> 24
is a cast to u8/s8, sign depending on whether we have an arithmetical vs. logical right shift.
phi_s3 = arg1 >> 3;
if (arg1 < 0)
{
phi_s3 = (s32) (arg1 + 7) >> 3;
}
should simply become phi_s3 = arg1 / 8;
We detect
bgez $a3, .L00400100
cvt.s.w $reg, whatever
lui $at, 0x4f80
mtc1 $at, $temp
nop
add.s $reg, $reg, $temp
.L00400100:
but I once saw
cvt.s.w $reg, whatever
bgez $a3, .L00400100
something unrelated
lui $at, 0x4f80
mtc1 $at, $temp
nop
add.s $reg, $reg, $temp
.L00400100:
in some piece of -O2-compiled code (IIRC). Can we work around this with the current framework? Or are these patterns arbitrarily reordered together with other code, making a more complex approach necessary? Needs to be investigated.
In if_statements.py
, when we have a Body
, we should look for sequences of statements that match input macros (akin to how we match types given an input preprocessed C file).
if ((a->unk0 << 2) < 0)
really checks for (unk0b20 != 0)
.
The pattern (x << a) < 0
should be replaced by x & 0x20000000
and if x
is a struct access to a known struct with a bitfield at the relevant position then a use of that bitfield.
Would be nice to detect (a->b & c) >> d
as well.
Hit in practice in a case related to jump tables. I think the assertion was safe to remove but I'm not certain.
Also needs a test.
E.g. return x++;
currently emits something like
x = x + 1;
return x;
which is wrong. Maybe we could iterate over all registers contents recursively on writes and for loads overlapping the write mark corresponding EvalOnceExpr's as "emit if any new use appears".
I imagine post-increments are the most common case of this (?), so might be worth special-casing.
Edit: function calls are roughly the same as reads in this regard, except they also shouldn't be moved past other function calls, writes, or branches.
the misc1.c
fails in two ways, both for -g
and -O2
:
it outputs if (x) { return y; } return z;
rather than if (!x) { return z; } return y
(not super important, but could be worth fixing)
it returns something else than it's supposed to (local variable vs constant). I thought I fixed this with the phi node work... need to look into this. Update: fixed.
Currently they must be named L<hex addr>
, or they will be detected as functions, which breaks everything (you generally get a cryptic KeyError on before_target = label_prev_instr[old_label]
due to non-existent labels).
The simplest way to solve this would be to not mark a new function unless all branch labels are found (if they are found in the file at all). It might have false positives if the default case comes first and does a jr ra
, however, and the switch statement is not within an if
.
Another option would be to look at all jr <non-ra>
, look backwards to what table they read from, read that table and parse all entries as non-functions. However, that's more complex and requires rodata to be available.
I think the first option + our current heuristic + a good error message when jump table targets don't exist would be good. Or maybe #36 will make the second option simpler.
Currently outputs *(pointer arithmetic)
A few times I've seen nodes get emitted multiple times, e.g. with continue
, early returns, and nested &&/||. (Nightmare example that I just encountered: if (blah || (blah && (blah ? (var = func_1(blah)) : (var = func_2(blah))) != 0))
.) It's usually not a big deal to manually fix up the control flow in that case, but it's not often clear that it's going on in the first place.
Solving control flow in the general case is impossible, but what we can do is emit gotos that jump into the original node if we encounter a duplicate one, similar to how we handle loops right now. I think the phi node work should make sure that this in fact preserves correctness.
As a smaller starting point, it would also be helpful to just print a warning when duplicate nodes are encountered.
We should do switch target detection before we prune away unreferenced labels.
and move the non-verbose ERROR
printouts to behind a flag
payload
Currently the addiu sp, sp, -stack_size
instruction modifies the recorded sp
register to contain sp - stack_size
, but then we pretend that never happened when using sp-relative addressing in most contexts ("sp is always sp"). In a few contexts, however, mostly/only involving arrays, we actually do fetch the sp register though, ending up with references to local variables like &sp-20
(intended to be &sp
).
IIRC, I also saw some switch statements move the stack pointer before, within a small block. So it might be worth explicitly storing the current sp offset, and then make every sp-relative computation take that into account. (Edit: I think this is wrong and I was just confused by static functions.)
In the following case:
glabel test
li $s0, 0
.loop1:
move $v1, $s0
.loop2:
sw $v1, ($zero)
move $v1, $s0
bltz $zero, .loop2
nop
sw $v1, ($zero)
li $s0, 1
b .loop1
nop
we emit
void test(void)
{
?32 temp_v1;
?32 phi_s0;
phi_s0 = 0;
loop_1:
loop_2:
*NULL = (?32) temp_v1;
temp_v1 = phi_s0;
if (0 < 0)
{
goto loop_2;
}
*NULL = temp_v1;
phi_s0 = 1;
goto loop_1;
}
with a read from temp_v1 which gets assigned only later on. This is because we figure that on entry to loop2, v1 will equal phi_s0 regardless of where it comes from, so we pick one of the sources arbitrarily, and we end up taking temp_v1 which isn't set yet. Part of the problem here is that we do a deep equality check, unwrapping EvalOnceExpr's, and then use one of the outermost wrappers. Using a standard equality check would make us lose some cases of register restoration, I guess?
Including when switches have multiple jump tables and non-jtbl branches.
The laziest option here is to assume that every function has at most one switch in it, so that if there is an indirect jump, it can jump to any jumptable label. Also needs a lot more work to emit something sane, of course, both in if_statements.py and translate.py.
I was a case of this where it dropped an important statement of the form sp830 = strchr(sp830, 0x2C)
. We should never skip emitting blocks with statements in them, better just to fail && detection in that case.
(See also #33.)
a temp v0
that gets assigned the value of a->b
could be named b_v0
Really easy to fix, but I figured I'd write it down so I don't forget to fix it later. It appears that local variables are declared in reverse order from how they actually should be declared. Also, if we don't already combine declarations with assignment statements when appropriate, we should - it's a nice QoL thing. That is,
s32 sp00;
sp00 = 1;
should be emitted as
s32 sp00 = 1;
Again, we might already do this second part, but since the local variables are declared in reversed order, it never happens anyway.
Here's the plan I'm thinking of:
reg
, and for every block node n
: we need to construct a value for n.regs[reg]
, either as a direct copy from another RegInfo
, or as a phi node, in SSA parlance. Obviously, if this node has just one predecessor this is simple, but in the general case it's more tricky. So we do the following:
m
such that all paths backwards from n
, stopping at m
, never reach any nodes that set reg
m
and try them one by one. O(n^3 * regs) is hopefully fine... (one could get rid of the regs
factor using bitsets, I believe)m
, pick one arbitrarily and copy m.regs[reg]
from therereg
, but from some part of the cycle we must be able to reach the entry node, which does set reg
, contradiction. This argument works even if we consider different registers at the same time.m.regs[reg]
copies when we want tom
, set n.regs[reg]
to contain a PhiExpr, and emit a set to that phi variable at the end of all the predecessors. This PhiExpr should work like a EvalOnceExpr, only emitting the phi variable and associated sets, and only marking the associated sets as a read, once it's mark_used
.The if_statements
code might already be doing some of these steps, I'm not sure.
Add support for parsing JALR and BREAK opcodes. JALR can be interpreted as:
jalr $ra, $v0
becomes (*v0)(arglist...);
BREAK can either be interpreted as dropping the node it's in (assuming that it's a compiler check that was inlined, such as checking for integer overflow), or as calling a function called throw(break_value) that simply acts as a placeholder for the break logic.
I'm surprised noone's made this issue to at least discuss whats wrong on-record.
It would be neat if you could run main.py file.s index file.c
, where file.c
is the file we're trying to decompile into, and the decompiler would use that file (and its includes) to construct a function signature for the function we're decompiling, convert struct member accesses into reasonably named things, determine how many parameters to pass to functions, etc.
sb
, lb
, lbu
etc. need to be marked somehow, but frequently aren't. We have some code for this already, but I think it's buggy and thinks it's fine to omit the cast when it isn't...
One thing we could potentially do to avoid ugly casts is to output declarations for all global variables referred to.
This happens using a different calling convention: a pointer is passed to a part of the stack that lies above subroutine args and below saved registers/ra.
Right now we emit the struct variables as subroutine_argX
; sp{X*4}
would be an improvement, or we could try to support it fully somehow.
(div
is an example of a function that returns a struct)
It can be either a tail call (replace by jalr
, jr $ra
, nop
) or a branch within the same function (replace by b
). Currently we fail with a cryptic assertion error (assert jump.is_branch_instruction()
in the middle of flow_graph.py).
The following C
int x;
void test(void) {
for (;;) {
if (x == 2) break;
x = 1;
}
}
compiles to
.set noat # allow manual use of $at
.set noreorder # don't insert nops after branches
glabel test
/* 0000B0 004000B0 3C030041 */ lui $v1, %hi(D_4100E0)
/* 0000B4 004000B4 246300E0 */ addiu $v1, $v1, %lo(D_4100E0)
/* 0000B8 004000B8 24040001 */ addiu $a0, $zero, 1
/* 0000BC 004000BC 24020002 */ addiu $v0, $zero, 2
.L004000C0:
/* 0000C0 004000C0 8C6E0000 */ lw $t6, ($v1)
/* 0000C4 004000C4 104E0003 */ beq $v0, $t6, .L004000D4
/* 0000C8 004000C8 00000000 */ nop
/* 0000CC 004000CC 1000FFFC */ b .L004000C0
/* 0000D0 004000D0 AC640000 */ sw $a0, ($v1)
.L004000D4:
/* 0000D4 004000D4 03E00008 */ jr $ra
/* 0000D8 004000D8 00000000 */ nop
/* 0000DC 004000DC 00000000 */ nop
with IRIX 5.3 -O2, on which we loop forever within immediate_postdominator
, and to
.set noat # allow manual use of $at
.set noreorder # don't insert nops after branches
glabel test
.L004000B0:
/* 0000B0 004000B0 3C0E0041 */ lui $t6, %hi(D_4100F0)
/* 0000B4 004000B4 8DCE00F0 */ lw $t6, %lo(D_4100F0)($t6)
/* 0000B8 004000B8 24010002 */ addiu $at, $zero, 2
/* 0000BC 004000BC 15C10003 */ bne $t6, $at, .L004000CC
/* 0000C0 004000C0 00000000 */ nop
/* 0000C4 004000C4 10000006 */ b .L004000E0
/* 0000C8 004000C8 00000000 */ nop
.L004000CC:
/* 0000CC 004000CC 240F0001 */ addiu $t7, $zero, 1
/* 0000D0 004000D0 3C010041 */ lui $at, %hi(D_4100F0)
/* 0000D4 004000D4 AC2F00F0 */ sw $t7, %lo(D_4100F0)($at)
/* 0000D8 004000D8 1000FFF5 */ b .L004000B0
/* 0000DC 004000DC 00000000 */ nop
.L004000E0:
/* 0000E0 004000E0 03E00008 */ jr $ra
/* 0000E4 004000E4 00000000 */ nop
/* 0000E8 004000E8 03E00008 */ jr $ra
/* 0000EC 004000EC 00000000 */ nop
on -g, on which we die due to unbounded recursion, alternating between reach
and end_reachable_without
.
From #26:
Also, if we don't already combine declarations with assignment statements when appropriate, we should - it's a nice QoL thing. That is,
s32 sp00; sp00 = 1;should be emitted as
s32 sp00 = 1;
We should have a stringify_without_parenthesis
function that removes the outermost layer of parentheses from the str()
output if the value is a BinaryExpr. (We already do this for returns, so why not do it in other places as well.)
A bit tricky, since we'd have no %lo
/%hi
markers or function names... But maybe we could have a mode for "decompile the byte range A..B, and auto-generate global variable names"?
Alternatively, we should add documentation on how to use n64split to get a .s file in the right format for mips_to_c to use.
I have two unrelated complaints about the current state of output:
Lines in the final C code should be limited to 80 characters wherever possible. Python's textwrap
library will likely be useful here, although we may have to be intelligent about wrapping things like function calls and assignment statements.
There is too much noise in stdout when you just want to decompile one function. All of the registers, node information, etc should be output only when desired (e.g. for debugging) or when an error is detected. Also, node info should probably be output not in DFS-order, but in linear order (with the original MIPS).
It would also interesting to try to fuzz-test the control flow code, to make sure it never crashes.
b .label
x
...
x
.label:
y
should be converted to
b .label_prev
nop
...
.label_prev:
x
y
b
only sometimes behaves this way, not always. (Presumably the compiler is about to emit bl
but realizes that that's a silly thing to do and changes that to b
.)
E.g. if you store 4 bytes to to sp + 20
, and then lbu
1 byte from sp + 23
, that's a u8 cast. This is important especially for arguments of size 8/16.
Currently we fail this miserably and emit loads of stack variables/arguments without any stores to them (we often eliminate stores to the stack).
i.e. detect (var + (lui constant)) + offset and (var + (lui constant))->offset
How to install/use it, workflow, what the current bugs are and how to work around them, common patterns, ...
Right now we assume that %hi's and %lo's are perfectly paired, so we can set %lo's to zero and expand lui %hi(x) to x. This often results in subtly incorrect output when the %lo/%hi's are wrong. We should ensure pairings are accurate, and if they aren't, emit some weird marker that calls attention to them being wrong (HI(x), LO(x)
?).
v0
/f0
is manually set and not read from, void
if both are unset at some return point.mark_used
on return values (on the return node's return register, really, which might be a phi expr!). Maybe move the logic for handling return values into translate.py, for that purpose.Depends on #37, realistically.
See e.g. tests/irix-g/output/multiple-assigns.c, which reads from sp4, but that variable is never declared or written to.
So that teq $t1, $t2, 0
doesn't overwrite $t1
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.