Giter Site home page Giter Site logo

akkartik / mu Goto Github PK

View Code? Open in Web Editor NEW
1.3K 39.0 46.0 90.09 MB

Soul of a tiny new machine. More thorough tests → More comprehensible and rewrite-friendly software → More resilient society.

Home Page: http://akkartik.name/akkartik-convivial-20200607.pdf

License: Other

C++ 6.30% Shell 0.54% C 0.31% Emacs Lisp 0.02% Assembly 70.45% Forth 22.14% Less 0.01% Python 0.05% HTML 0.01% Vim Script 0.18%
programming-literacy white-box-testing linux x86 unit-tested machine-code

mu's Introduction

Mu: a human-scale computer

Mu is a minimal-dependency hobbyist computing stack (everything above the processor).

Mu is not designed to operate in large clusters providing services for millions of people. Mu is designed for you, to run one computer. (Or a few.) Running the code you want to run, and nothing else.

Here's the Mu computer running Conway's Game of Life.

git clone https://github.com/akkartik/mu
cd mu
./translate apps/life.mu  # emit a bootable code.img
qemu-system-i386 code.img

screenshot of Game of Life running on the Mu computer

(Colorized sources. This is memory-safe code, and most statements map to a single instruction of machine code.)

Rather than start from some syntax and introduce layers of translation to implement it, Mu starts from the processor's instruction set and tries to get to some safe and clear syntax with as few layers of translation as possible. The emphasis is on internal consistency at any point in time rather than compatibility with the past. (More details.)

Tests are a key mechanism here for creating a computer that others can make their own. I want to encourage a style of active and interactive reading with Mu. If something doesn't make sense, try changing it and see what tests break. Any breaking change should cause a failure in some well-named test somewhere.

Mu requires a 32-bit x86 processor. It supports a short list of generic hardware. There's no networking support yet. Development has slowed, but I still care about it. Feedback, bug reports and other forms of contribution continue to be appreciated.

Mu in the press

Goals

In priority order:

  • Reward curiosity.
  • Safe.
    • Thorough test coverage. If you break something you should immediately see an error message.
    • Memory leaks over memory corruption.
  • Teach the computer bottom-up.

Thorough test coverage in particular deserves some elaboration. It implies that any manual test should be easy to turn into a reproducible automated test. Mu has some unconventional methods for providing this guarantee. It exposes testable interfaces for hardware using dependency injection so that tests can run on -- and make assertions against -- fake hardware. It also performs automated white-box testing which enables robust tests for performance, concurrency, fault-tolerance, etc.

Non-goals

  • Speed. Staying close to machine code should naturally keep Mu fast enough.
  • Efficiency. Controlling the number of abstractions should naturally keep Mu using far less than the gigabytes of memory modern computers have.
  • Portability. Mu will run on any computer as long as it's x86. I will enthusiastically contribute to support for other processors -- in separate forks. Readers shouldn't have to think about processors they don't have.
  • Compatibility. The goal is to get off mainstream stacks, not to perpetuate them. Sometimes the right long-term solution is to bump the major version number.
  • Syntax. Mu code is meant to be comprehended by running, not just reading. It will always be just a thin memory-safe veneer over machine code. I don't know how to make higher-level notations both fast and comprehensible, so they are likely to remain slow and comprehensible, useful for prototyping but invariably needing to be rewritten in statements that map 1:1 with machine code. The goal of a prototype should be a risk-free rewrite, thanks to tests that capture all the details of lessons learned.

Toolchain

The Mu stack consists of:

  • the Mu type-safe and memory-safe language;
  • SubX, an unsafe notation for a subset of x86 machine code; and
  • bare SubX, a more rudimentary form of SubX without certain syntax sugar.

All Mu programs get translated through these layers into tiny zero-dependency binaries that run natively. The translators for most levels are built out of lower levels. The translator from Mu to SubX is written in SubX, and the translator from SubX to bare SubX is built in bare SubX. There is also an emulator for Mu's supported subset of x86, that's useful for debugging SubX programs.

Mu programs build natively either on Linux or on Windows using WSL 2. For Macs and other Unix-like systems, use the (much slower) emulator:

./translate_emulated apps/ex2.mu  # 2-5 minutes to emit code.img

Mu programs can be written for two very different environments:

  • At the top-level, Mu programs emit a bootable image that runs without an OS (under emulation; I haven't tested on native hardware yet). There's rudimentary support for some core peripherals: a 1024x768 screen, a keyboard with some key-combinations, a PS/2 mouse that must be polled, a slow ATA disk drive. No hardware acceleration, no virtual memory, no process separation, no multi-tasking, no network. Boot always runs all tests, and only gets to main if all tests pass.

  • The top-level is built using tools created under the linux/ sub-directory. This sub-directory contains an entirely separate set of libraries intended for building programs that run with just a Linux kernel, reading from stdin and writing to stdout. The Mu compiler is such a program, at linux/mu.subx. Individual programs typically run tests if given a command-line argument called test.

The largest program built in Mu today is its prototyping environment for writing slow, interpreted programs in a Lisp-based high-level language.

screenshot of the Mu shell

(For more details, see the shell/ directory.)

While I currently focus on programs without an OS, the linux/ sub-directory is fairly ergonomic. There's a couple of dozen example programs to try out there. It is likely to be the option for a network stack in the foreseeable future; I have no idea how to interact on the network without Linux.

Syntax

The entire stack shares certain properties and conventions. Programs consist of functions and functions consist of statements, each performing a single operation. Operands to statements are always variables or constants. You can't perform a + b*c in a single statement; you have to break it up into two. Variables can live in memory or in registers. Registers must be explicitly specified. There are some shared lexical rules. Comments always start with '#'. Numbers are always written in hex. Many terms can have context-dependent metadata attached after '/'.

Here's an example program in Mu:

ex2.mu

More resources on Mu:

Here's an example program in SubX:

== code
Entry:
  # ebx = 1
  bb/copy-to-ebx  1/imm32
  # increment ebx
  43/increment-ebx
  # exit(ebx)
  e8/call  syscall_exit/disp32

More resources on SubX:

Mirrors and Forks

Updates to Mu can be downloaded from the following mirrors:

Forks of Mu are encouraged. If you don't like something about this repo, feel free to make a fork. If you show it to me, I'll link to it here. I might even pull features upstream!

  • uCISC: a 16-bit processor being designed from scratch by Robert Butler and programmed with a SubX-like syntax.
  • subv: experimental SubX-like syntax by s-ol bekic for the RISC-V instruction set.
  • mu-x86_64: experimental fork for 64-bit x86 in collaboration with Max Bernstein. It's brought up a few concrete open problems that I don't have good solutions for yet.
  • mu-normie: with a more standard build system for the linux/bootstrap/ directory that organizes the repo by header files and compilation units. Stays in sync with this repo.

Desiderata

If you're still reading, here are some more things to check out:

Credits

Mu builds on many ideas that have come before, especially:

  • Peter Naur for articulating the paramount problem of programming: communicating a codebase to others;
  • Christopher Alexander and Richard Gabriel for the intellectual tools for reasoning about the higher order design of a codebase;
  • David Parnas and others for highlighting the value of separating concerns and stepwise refinement;
  • The folklore of debugging by print and the trace facility in many Lisp systems;
  • Automated tests for showing the value of developing programs inside an elaborate harness;

On a more tactical level, this project has made progress in a series of bursts as I discovered the following resources. In autobiographical order, with no claims of completeness:

mu's People

Contributors

akkartik avatar an1lam avatar charles-l avatar jimmyhmiller avatar kyleamathews avatar stich11 avatar sumeet avatar tekknolagi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mu's Issues

Highish-level syntax for dependency-injected abort

SubX currently allows one to test the exit() syscall. It does so using a dependency-injected wrapper called stop that takes an exit-descriptor as an argument. If the exit-descriptor is null the program really exits. If it is created using tailor-exit-descriptor stop unwinds the stack until the frame that called tailor-exit-descriptor.

But tailor-exit-descriptor is klunky. The way it currently works is, you pass in the number of args of the function that's going to get passed in the exit-descriptor, and it computes where on the stack the return address for the current stack frame is going to be, saving it to the exit-descriptor. That way all stop has to do is set ESP to the return address in the exit-descriptor and then call c3/return.

If we moved to a more HLL syntax where function calls were all in a single line, we'd like to make tailor-exit-descriptor cleaner or maybe do away with it altogether.

One easier way to explain tailor-exit-descriptor is that it is equivalent to a special function call. Now for background, regular calls in SubX look like this:

  • push all args
  • call
  • discard args (say by adding a constant to ESP)

The special call to a function that may want to call stop looks like this:

  • push all the args (which must include the address to the exit-descriptor)
  • save ESP-4 to the exit-descriptor
  • call
  • discard args

Only the second step is new. But we now get to replace all of tailor-exit-descriptor with a single instruction (assuming the address of the exit-descriptor is always in a register).

Now my question becomes: what does a syntax for this special call look like?

If regular calls look like f(arg1, arg2, ... argn), then some possibilities:

ed = tailor-exit-descriptor(...total size of args to f...)
f(arg1, arg2, ...ed, ...argn)
w/exit-descriptor(ed) f(arg1, arg2, ...ed, ...argn)

I don't like either, but they do have some benefits. The first allows for a single exit-descriptor to be reused across function calls in the same stack frame (totally safe). The second is all on one line to indicate that conceptually it's all a single call.

On the other hand, the second now looks like two operations on a single line, which is confusing and potentially sets a bad precedent. So I wonder what sort of approach we may take that makes it look less like two calls in a single line.

  1. Baroque:
w/exit-descriptor{|ed| f(arg1, arg2, ...ed, ...argn) }

That's a lot of grammar for just one construct.

try f(arg1, arg2, ...ed, ...argn)

We'd need to somehow figure out where ed is.

protect[ed] f(arg1, arg2, ...ed, ...argn)
f<ed>(arg1, arg2, ...ed, ...argn)
  1. Some other random ideas: perhaps we should try to generalize the syntax with any other operations that involve munging the stack. Maybe closures or more general exceptions or first-class continuations.

Anyways, that's my brain dump.

cc @charles-l

Opcode mnemonics for SubX

Giving mnemonics to x86 opcodes might make SubX easier for newcomers to read. This issue is intended to track the pros and cons.

Previous discussion: https://news.ycombinator.com/item?id=21268252#21293301

Initial list of pros and cons:

  • Pro: names give some indication (albeit imprecise) of the operation being performed.
  • Pro: names can be checked by tooling. Typoing 8d for 8f seems easier than typoing copy for pop.
  • Con: per-opcode mnemonics are an additional set of things for the new reader to understand and navigate. And cross-correlate with the Intel manual, since it only gives names to sets of opcodes.
  • Con: tooling may have a hard time giving a good error message when a user typos one variant of say add for another.

Redo how SubX lays out code/data segments in memory

Background

Right now SubX has two ways to determine the address of a code/data segment.

  • Programs can provide a starting address for segments:
== 0x0a0001000
...
  • Programs can name segments, in which case segments get hard-coded addresses assigned by SubX:
== code
...

Either way, the first segment defined must be code and contain instructions.

This approach has long seemed a mess. It's hard to explain, and there's a certain amount of historical evolution that led to this mess. I started out expecting programs to specify addresses because that made writing tests easier, and later found use in naming segments so that I could keep code and related data close together. However I couldn't easily retire the old approach because so many tests relied on specific instructions being at specific addresses.

The named segments approach 'works' for a simple ELF binary with two code/data segments, but hasn't really been used with more than two segments even though it seems to support it. As I revise my mental model of how the heap works in the presence of ASLR, this mess has been slowing me down.

New plan

Programs always specify the starting address of segments. But they only have to do so once, the first time the segment is encountered.

== code 0x0a0001000
...
== code
...

This is a single uniform syntax that should help write small tests and also deal with ASLR. Code segments no longer have to come first. The name code is special.

Migration

I've been reluctant to make changes here because we have to change the C++ version and also change how we bootstrap SubX in SubX. But I think it's doable if we first start with SubX in SubX. Currently assort.subx and dquotes.subx fail tests if I add starting addresses to segment headers. First task is to make them pass. In particular, assort.subx will require some changes to the table data structure.

Now that I've made this plan, it looks like it's not needed for building allocate out of mmap. So I'm going to set it aside and hopefully deal with that issue first.

Mu: Can't build due to linker error

willow% CFLAGS="-O0" ./mu
updating mu_bin
-- /home/max/Documents/Dev/mu/.build
-- /home/max/Documents/Dev/mu/termbox
-- /home/max/Documents/Dev/mu
/usr/bin/ld: termbox/libtermbox.a(utf8.o): relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
willow% uname -a
Linux willow 4.15.0-12-generic #13-Ubuntu SMP Thu Mar 8 06:24:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
willow% lsb_release -irc
Distributor ID:	Ubuntu
Release:	18.04
Codename:	bionic
willow% 

Typo in mu/opcode

Hello,

I think there is a typo in mu/opcode :

opcode:

0f 86: jump disp8 bytes away if lesser or equal (unsigned), if ZF is set or CF is set (jcc/jbe/jna)

should be:

0f 86: jump disp32 bytes away if lesser or equal (unsigned), if ZF is set or CF is set (jcc/jbe/jna)

as every opcode around it. Maybe a missed modification from a copy/paste of opcodes 72 to 7f?

running on arm

I'm just curious if mu can be potentially be targeted for arm processors and how much work that might take.

Support the x86 carry flag `CF`

When I selected the x86 instructions SubX would support I excluded the carry flag on the basis that higher-level languages didn't really provide access to it. However, it turns out it is needed for comparing unsigned integers. The current opcodes for jumping on lesser/greater rely on the sign flag which is set purely by treating comparison operands as signed integers. When comparing unsigned integers (say two addresses), we need to check the carry flag.

http://unixwiz.net/techtips/x86-jumps.html makes this obvious, whereas the doc I've been consulting so far (https://c9x.me/x86/html/file_module_x86_id_146.html) refers to the distinction in only a single sentence:

The terms "less" and "greater" are used for comparisons of signed integers and the terms "above" and "below" are used for unsigned integers.

SubX: Tip in `subx help opcodes` doesn't work

willow% ./subx help opcodes     
Opcodes currently supported by SubX:
  01: add r32 to rm32
  03: add rm32 to r32
... SNIP ...
  0f 8e: jump disp16 bytes away if lesser or equal (ZF is set or SF != OF)
  0f 8f: jump disp16 bytes away if greater (ZF is unset, SF == OF)
Run `subx help instructions` for details on words like 'r32' and 'disp8'.
willow% ./subx help instructions
No help found for 'instructions'
Available top-level topics:
  usage
  registers
  opcodes
  syntax
Please check your command for typos.
willow%

Reduce idle cpu consumtion (using the hlt instruction)

This is a feature request: Currently, when waiting for keyboard input, Mu uses a busy loop (which you can tell by the high CPU usage in qemu, for example in the shell app).

On x86, idling is pretty simple, by using the HLT instruction.

Static checks for function outputs

Background

Mu (this repo) contains a compiler built in machine code, which converts a memory-safe high-level language into machine code. Since it's built in machine code, the compiler is intended to be as simple as possible. In particular, it performs no optimizations.

Most Mu statements translate to a single instruction of machine code. One consequence of this 1:1 constraint is that the language exposes registers, and requires programmers to manage them manually. However, the compiler will check registers, and raise errors if you read a variable after a different variable has clobbered its register.

Current design

Function headers currently specify the names, types and registers for their outputs. For example:

# example 1
fn foo -> result/eax: int {
  result <- copy 0x34
}

There is no return instruction.

Problems with the current design (Plan 0)

Failing to write outputs

We need some way to statically ensure that outputs are always written. Otherwise a caller may end up with access to random temporaries of any possible type, which violates the memory-safety guarantees I want:

# example 2
fn foo -> result/eax: int {
  {
    break-if-=
    result <- copy 0x34
  }
}

Detecting errors like this will require generating the control-flow graph for functions, and checking every possible path through them. I've built such analysis passes before, but never in machine code.

Outputs getting over-written

There's an even bigger problem: tracking variable lifetimes. Mu is designed so variable lifetimes nest perfectly like Russian dolls:

fn foo {
  var x/eax: int <- copy 0
  ...
  {
    var x2/eax: int <- copy 1
    ...
    {
      var x3/eax: int <- copy 2
      ...
    }  # x3 gets reclaimed here
    ...  # eax guaranteed to contain x2 here
    {
      var x4/eax: int <- copy 2
      ...
    }  # x4 gets reclaimed here
    ...  # eax guaranteed to contain x2 here
  }  # x2 gets reclaimed here
  ...  # eax guaranteed to contain x here
}

It's been relatively easy to do this in machine code. I just maintain a stack of variables and blocks, and I clean up all variables for a block when I encounter its }.

However, this nice property currently has a loophole due to function outputs. Consider this example:

# example 3
fn foo -> out/eax: int {
  var x/eax: int <- 0
  ...
  {
    var x2/eax: int <- 0  # should we push `x` to the stack here?
    ...
    break-if-=
    out <- copy 0x34
  }
  # Should `x` be accessible here?
}

Currently we save the state of registers when defining a new variable -- only if it wouldn't be written to by a function output later in the block. In example 3 above, we currently don't save x. However this strategy has two problems:

  1. Blocks can conditionally write function outputs. In this situation we lose access to x even in some situations where we could preserve it.
  2. Errors become hard to explain. "x isn't available here, because it may have been overwritten by out."

Back in March, my best attempt at clean error messages was to speak of live ranges in terms of static ranges of lines, ignoring control flow. Once an output is defined, you can't use other existing variables in that register in lower lines. (You can still shadow the output in new variables in nested blocks.) However, I've lately found myself breaking this rule as I write code. Fixed ranges of lines are easy in themselves, but they're also a very different rule from regular variables with shadowing, so it's been easy to forget the rule for outputs. So I need to add the static check to remind myself when I violate the rule. And I've been loath to get into that just because tracking possible control-flow paths seems so hard in machine code. Perhaps it's time to go back to the drawing board on this plan.

undefined reference to...

First of all, neat project. Second of all, I get the following list of errors:

ld: common.o: in function sprintf_va': common.c:(.text+0x836): undefined reference to __stack_chk_fail_local'
ld: common.o: in function sprintf': common.c:(.text+0x89b): undefined reference to __stack_chk_fail_local'
ld: common.o: in function printkf': common.c:(.text+0x96e): undefined reference to __stack_chk_fail_local'
ld: common.o: in function getCpuFlags': common.c:(.text+0xa4d): undefined reference to __stack_chk_fail_local'
ld: fatfilesystem.o: in function initializeFatFileSystem': fatfilesystem.c:(.text+0x96): undefined reference to __stack_chk_fail_local'
ld: fatfilesystem.o:fatfilesystem.c:(.text+0x2ef): more undefined references to __stack_chk_fail_local' follow ld: kernel.bin: hidden symbol __stack_chk_fail_local' isn't defined
ld: final link failed: bad value
make: *** [Makefile:10: kernel.bin] Error 1

In response to running the command: sudo ./gen_soso_iso init.soso examples/ex6.subx

Operating system is arch derivative.

Keyboard input stops working after moving the mouse

Reproduced as follows:

./translate shell/*.mu
qemu-system-i386 code.img -m 256 -snapshot -display gtk,zoom-to-fit=on

Initially the keyboard works. After clicking on the screen to lock the mouse and moving it, keyboard input stops working.

I've narrowed it down (while debugging another issue in v86): Mu turns off the mouse interrupts, but doesn't poll from the IO port. When a mouse event happens, the PS/2 controller sets the output buffer status bit, which prevents further interrupts (including keyboard) until port 60h is read.

translate_emulated produces no image on MacOS

Following the directions in the README on a MacOS machine, I get:

./translate_emulated life.mu

cat $* [0-9]*.mu  |linux/bootstrap/bootstrap run linux/mu                 > a.subx

./translate_subx_emulated boot.subx mu-init.subx [0-9]*.subx a.subx

cat $*            |linux/bootstrap/bootstrap run linux/braces                         > a.braces

cat a.braces      |linux/bootstrap/bootstrap run linux/calls                          > a.calls

cat a.calls       |linux/bootstrap/bootstrap run linux/sigils                         > a.sigils

cat a.sigils      |linux/bootstrap/bootstrap run linux/tests                          > a.tests

# no assort since baremetal SubX doesn't have segments yet

cat a.tests       |linux/bootstrap/bootstrap run linux/dquotes                        > a.dquotes

cat a.dquotes     |linux/bootstrap/bootstrap run linux/pack                           > a.pack

cat a.pack        |linux/bootstrap/bootstrap linux/survey_baremetal   > labels
nothing to do

... and no code.img is produced.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.