iu-parfunc / gibbon Goto Github PK

A compiler for functional programs on serialized data

Home Page: http://iu-parfunc.github.io/gibbon/

Haskell 22.24% Makefile 0.13% C 50.20% Shell 0.17% Racket 1.14% C++ 1.30% Python 0.09% Common Lisp 23.20% Nix 0.02% Dockerfile 0.03% Gnuplot 0.02% Rust 1.47%

gibbon's Introduction

The Gibbon Compiler

Gibbon is an experimental compiler that transforms high-level functional programs to operate on serialized data.

Typically, programs that process tree-like data represent trees using pointer-based data structures in memory (one heap object per-leaf and per-node) because such a layout is convenient to manipulate in a high-level programming language. This is also generally distinct from the representation of the data in serialized form on disk, which means that a program must perform some sort or marshaling when working with serialized data. Gibbon unifies the in-memory and serialized formats, transforming recursive functions to operate directly on serialized data.

Additionally, while the pointer-based structure is efficient for random access and shape-changing modifications, it can be inefficient for traversals that process most or all of a tree in bulk. The Gibbon project aims to explore optimizations of recursive tree transforms by changing how trees are stored in memory.

Currently, the Gibbon compiler has multiple front-ends: an s-expression synax similar to Typed Racket, and a small subset of Haskell.

Building Gibbon

Getting Dependencies

Gibbon is implemented in Haskell, and is set up to be built with Cabal, but it has a number of native dependencies. Follow the instructions below to get all dependencies or enter the Nix shell with nix-shell to get them via Nix.

Ubuntu 22.04: (Parallelism support temporarily not available with ubuntu 22.04 as Cilk support is not avaiable with newer gcc)

 $ sudo apt-get update 
 $ sudo apt-get install software-properties-common 
 $ sudo apt-get install libgc-dev 
 $ sudo apt-get install libgmp-dev 
 $ sudo apt-get install build-essential 
 $ sudo apt-get install uthash-dev 
 $ sudo apt-get install vim wget curl

Install Racket

 $ wget --no-check-certificate https://mirror.racket-lang.org/installers/7.5/racket-7.5-x86_64-linux.sh
 $ chmod +x racket-7.5-x86_64-linux.sh
 $ ./racket-7.5-x86_64-linux.sh

Install haskell, cabal, stack, hls using ghcup

curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | BOOTSTRAP_HASKELL_NONINTERACTIVE=1 BOOTSTRAP_HASKELL_GHC_VERSION=9.4.6 BOOTSTRAP_HASKELL_CABAL_VERSION=3.8.1.0 BOOTSTRAP_HASKELL_INSTALL_STACK=1 BOOTSTRAP_HASKELL_INSTALL_HLS=1 BOOTSTRAP_HASKELL_ADJUST_BASHRC=P sh

Install rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain=1.71.0

Add paths for cabal, ghcup, rust to bashrc
For Building on OSX:

You can install some of the dependencies using Homebrew:

$ brew install libgc gmp gcc ghc@9

Others require a few extra steps:

Racket: Follow the instructions on it's website
uthash: Clone the repository and copy all the .h files in src to /usr/local/include

Actually Building Gibbon

After you have both Cabal and all the dependencies installed, you can build Gibbon from source:

$ git clone https://github.com/iu-parfunc/gibbon
$ cd gibbon && source set_env.sh
$ cd gibbon-compiler && cabal v2-build

At this point you can run the Gibbon executable:

$ cabal v2-run gibbon -- -h

If you'd like to run the testsuite, you can do so with:

$ ./run_all_tests.sh

Building a Developement docker container for Gibbon

To build the Dockerfile for dev purposes run the command below from the gibbon directory.

DOCKER_BUILDKIT=1 docker image build -t gibbon -f .devcontainer/Dockerfile .

Run the docker image using the following command.

docker run -t -i gibbon

This image does not pre-populate the gibbon folder. Use git clone to clone gibbon into a folder. Use instructions from before to build gibbon.

Building an Artifact version of Gibbon with the gibbon source code pre-populated

To build an image with the gibbon source code already in the image run

DOCKER_BUILDKIT=1 docker image build -t gibbon -f .artifact/Dockerfile .

Run the container with

docker run -t -i gibbon

This has the gibbon source code avaiable in /gibbon

Using Gibbon

A valid Gibbon program can be written using Haskell syntax or using Racket-like s-expression syntax. Gibbon doesn't support every Haskell feature supported by GHC, but informally, many simple Haskell-98 programs (sans monads) are valid Gibbon programs. One thing to note is that the main point of entry for a Gibbon program is a function named gibbon_main, as opposed to the usual main. Here's a simple Gibbon program that builds a binary tree and sums up its leaves in parallel using a parallel tuple (par):

module Main where

data Tree = Leaf Int
          | Node Int Tree Tree

mkTree :: Int -> Tree
mkTree i =
  if i <= 0
  then Leaf 1
  else
      let x = (mkTree (i-1))
          y = (mkTree (i-1))
      in Node i x y

sumTree :: Tree -> Int
sumTree foo =
  case foo of
    Leaf i     -> i
    Node i a b ->
      let tup = par (sumTree a) (sumTree b)
          x = fst tup
          y = snd tup
      in x + y

gibbon_main = sumTree (mkTree 10)

The Gibbon compiler is able to run in several modes, which are configured via command line flags. Most important are the flags --packed which means "packed mode" (use serialized data structures), --run which means "compile then run", and --parallel which means "enable parallel execution". You can use these to run the above program as follows:

$ gibbon --run --packed --parallel Bintree.hs

This creates a file Bintree.c which contains the C-code, and a Bintree.exe which is the executable for this program. Running ./Bintree.exe prints 1024, the value of sumTree (mkTree 10). There are many other Gibbon features which can be learned by looking at the programs under ./examples/parallel/, and more flags which can be printed with gibbon --help. To view a complete set of primitives supported by Gibbon, you can look at the Gibbon.Prim module located at gibbon/gibbon-stdlib/Gibbon/Prim.hs.

About this repository

This primarily stores the Gibbon compiler, an implementation of a high-performance functional language.

This repository also contains a collection of sub-projects related to benchmarking tree traversals and performing tree traversals on packed representations. Here is a guide to the subdirectories:

gibbon-compiler - the prototype compiler for the Gibbon language of packed tree traversals.
gibbon - a Racket #lang for Gibbon.
ASTBenchmarks - benchmark of treewalks (compiler passes) on ASTs written with Gibbon. Also includes scripts to fetch input datasets.
BintreeBench - a submodule containing the tiniest binary tree microbenchmark, implemented several different languages and compilers.
core-harvest - tools to harvest realistic, large ASTs (mainly Racket) from the wild.
DEVLOG.md - detailed documentation for those hacking on this repository.

gibbon's People

Contributors

Stargazers

Watchers

Forkers

csabahruska graydon adithyaselv trajafri adityagupta1089 a8e5 dan-zheng vidsinghal wavewave kirshanthans 0xtimmy pl-666 ulysses4ever

gibbon's Issues

Parallel add1Tree comparisons

We've got:

A Haskell sparks (futures) version (about 6X speedup and then flat)
A parallel/packed C version. (11X speedup on 2^20)
A parallel C malloc based version (almost no speedup, even with jemalloc)

To complete this and talk about it in a subsection, we should grab a couple more points of comparison:

Parallel C version with per-thread arenas. This should use a __thread variable to store an index into a table of arenas alloc pointers, and then each thread bump-allocates in its own arena. Padding is needed in the table to avoid false sharing. This version should measure how well the raw pointer chasing parallelizes, without a complicated memory manager in the way.
Since Java/hotspot was the best of our GC'd language tests, we should try a fork-join library version of this benchmark and see how it does.

Use actual symbols, not strings for variables in the compiler's ASTs

This is a no-brainer for any compiler. But we started off with expediency as paramount here.

Of the ~3 libraries I'd tested in the past, it looks like symbol is the only one in stackage.

Debug tracking: Make the substitution compiler pass go through

Test:
test27b_subst.gib

Run the test with;

$ cd $TREELANGDIR/gibbon-compiler/examples
$ DEBUG=2 stack exec -- gibbon --run test27b_subst.gib

Warnings:
All the warnings are generated by the inferEffects pass:

Warning: FINISHME: process these constraints: [Eql (Var "mkSetBang239") (Var "mkWithContinuationMark249"),Eql (Var "mkApp235") (Var "mkSetBang239"),Eql (Var "mkBegin0228") (Var "mkApp235"),Eql (Var "mkBegin221") (Var "mkBegin0228"),Eql (Var "mkIf217") (Var "mkBegin221"),Eql (Var "mkLetrecValues202") (Var "mkIf217"),Eql (Var "mkLetValues189") (Var "mkLetrecValues202"),Eql (Var "mkCaseLambda181") (Var "mkLetValues189"),Eql (Var "mkLambda177") (Var "mkCaseLambda181"),Eql (Var "mkQuoteSyntaxLocal167") (Var "mkLambda177"),Eql (Var "mkQuoteSyntax163") (Var "mkQuoteSyntaxLocal167"),Eql (Var "mkQuote159") (Var "mkQuoteSyntax163"),Eql (Var "mkVariableReferenceNull155") (Var "mkQuote159"),Eql (Var "mkVariableReferenceTop153") (Var "mkVariableReferenceNull155"),Eql (Var "mkVariableReference151") (Var "mkVariableReferenceTop153"),Eql (Var "mkTop149") (Var "mkVariableReference151"),Eql (Var "mkVARREF147") (Var "mkTop149")]
Warning: FINISHME: process these constraints: [Eql (Var "mkBeginTop268") (Var "mkExpression272"),Eql (Var "mkDefineSyntaxes264") (Var "mkBeginTop268"),Eql (Var "mkDefineValues257") (Var "mkDefineSyntaxes264")]
Warning: FINISHME: process these constraints: [Eql (Var "mkF2295") (Var "mkF3296"),Eql (Var "mkF1291") (Var "mkF2295")]
Warning: FINISHME: process these constraints: [Eql (Var "mkCONSSYM301") (Var "mkNULLSYM302")]

Error:

L2_Traverse.hs#L139

gibbon: join: locations have inconsistent shapes: (TupLoc [Bottom,Bottom,Fixed "old_new_lc7553"],
 Fresh "mkCONSLAMBDACASE319")

Fix Switch to generate a switch!!

Unfortunately, this really needs to be fixed, because the perf results are unnecessarily undermined otherwise.

Write CanonicalizeCase pass and allow fall-through default cases

Right now the #lang gibbon case expressions are very fragile. They require exhaustive pattern matches and I think they even require the same order as in the data definition.

This can be relaxed by adding a pass early in the compiler pipeline which:

puts the cases in a canonical order (if that's really needed... but in any case it doesn't hurt)
replaces all missing cases with errors

Plot gpkd bytesize vs time for AST passes

We did the slowdown plots, but we didn't report on the absolute time of the AST traversals. One way of showing that is to plot against tree size (bytes) on the X axis.

This would be per-pass:

countnodes
treewalk
subst
copy-prop

Derive printing functions automatically

This is set up to work for Int and Bool. It needs to be expanded to generate a print_Foo function for each (data Foo ...) definition, just as we derive an unpack_Foo function.

CC @chamibuddhika

Simple bintree param study (vary Int leaves)

This is a simple parameter study we can do without adding any new compiler/codegen functionality. It will make a single nice plot.

The idea is to vary the number of Ints in each Leaf. This will shift the ratio of "payload bytes" (Ints) to tag bytes.

Do this for each of {buildtree,sumtree,add1tree}. In another card on the project kanban board, there's the task of plotting the results.

Namely, the plot can be a factor plot (speedup of gibbon-c-packed over gibbon-c-pointer). It can have one line for each of build/sum/add1. X axis is ratio of payload-bytes/spine-bytes for the pointer version. Y axis is speedup of packed vs pointer.

Change the file extension to .gib

Instead of the inconsistent mix of .sexp and .rkt that is currently in the repo.

minor: rename Common.Constr => Common.DataCon

Discussion: best papers for global compilers that optimize data represenatation?

That is, if a given datatype T provides representation variants T1, T2, ..., how can a whole program compiler select which representation to use based on static information and/or profiling data?

One traditional example is array-of-struct vs struct-of-array. Also, in the past, I became interested in this problem for extensible record systems. For instance, while it might be good to represent most records in a "flat" form (single heap object), if you are frequently extending an {A,B,C} record with a D field, then it might be good to pick an indirection based representation in that particular case (i.e. a cons_D heap object that points to an ABC heap object).

But I still don't know of good examples of this kind of thing. MLton, for instance, is a whole program compiler that uses some cool representation tricks for data, function calls, etc, but I'm not aware of it making interesting choices between multiple representations based on global information.

Here's some old work on data structure selection for SETL:

An Automatic Technique for Selection of Data Representations in SETL Programs

And more recently there's work on data structure synthesis. Related, but not exactly what I'm asking for here.

Concurrent data representation synthesis

@laithsakka @vollmerm @cskksc - can you guys dig up anything else? Or @samth and @milindkulkarni, maybe you're already aware of some work in this area?

Git rid of the current C main argument handling, switch benchmark methodology

There's a mess of old configuration/setup stuff in rts.c and in the codegen.

It should move over to a system where there's an in-language primitive for mmap'ing a file. That, together with the iterate form and size parameter, makes in-language benchmarking straightforward. Then we can delete some of these already-broken kluges in C.

Design: data type factoring system (and apply to sizes)

TBD - this is a placeholder. What's our initial idea on this front? Please fill me in.

Quick and dirty proposal to get Target generating Pointer-based traversals

I initially thought that a codegen for pointer-based traversals would be quite different than our current codegen. But upon further reflection I think that some hacks to the current codegen might do the trick.

Here are the hacks:

allow case expressions through to Target.hs. Assume that any such cases are on pointer-based, non-packed data.
treat every constructor K[Int,Float]as layed out the same way as a tuple [Tag,Int,Float]. Use existing codegen to produce these struct types.
Type every pointer as an IntTy or a new, untyped PtrTy. The Target lang would not be strongly typed. Rather a case expression turns into a switch on the tag, followed by a block of code that projects the relevant fields from the struct, and casts them before storing in local variables of the expected names.
insert a macro call ALLOC(sz) every place there is a constructor, and then generate code that populates the field of the tuple-plus-tag at that newly allocated location.

Then, to compile in pointer-based mode, we would call lowering directly after flatten, and then go straight into codegen.

New Racket-AST pass in gibbon: implement constant folding

(Original discussion: #3)

This pass will stress deep pattern matching but will NOT require dictionaries. It just looks for expressions like (+ 3 4) and squishes them.

It's also a simple example that changes the size/layout of the output tree... this will be interesting when we later add parallelism.

Weird -O2/-O3 discrepancy to look into later.

test20e started segfaulting with -O3. Pretty sure it did not used to do that. It still runs with -O2.

Implement list primitives

This will be substantial work, but is still a good idea for all the originally cited reasons.

Here are some notes on it:

error "FINISHLISTS" is inserted in many (but not all) necessary places in the codebase. Grepping for that is a good place to start.

Establish uniform warmup policy for iterate form

The Packed benchmarks will essentially do lazy IO, reading mmap'd files during the first iteration of the benchmark.

We need the (iterate e) form to implicitly run additional warm up iterations. Possibilities include:

1 warmup iteration
N warmup iterations, thus half the time will be spent in warmup.
min(10,N) warmup iterations, or similar

Discuss.

Design the packed binary file format properly.

Add a magic number to the front of the files so we can tell what they are.
Well, it shouldn't actually be a (short) number per se, it should be a string that we prepend to each binary file, such as "Gibbon packed data v001".

Whatever we do should be lightweight... because we should probably attach it to ALL packed values, so that we don't have to fuss around with prepending it when we're going to mmap a tree out to disk.

Add command line argument passthrough to --run

For example:

gibbon --run foo.sexp -- <args_here>

Simple experiment: time the unpack_T function

@chamibuddhika, since you've already generated this code, timing it would offer a simple way to see the "deserialization cost". I.e., if we were sending trees over the network, deserialization cost for packed would be zero (usual CNF argument), whereas the cost of unpack_T would be the cost of deserializing a regular tree before operating on it.

(This is actually quite generous, because it assumes you are using near-optimal arena allocation instead of malloc or GC.)

minor: fix error when a required file is not there

uh, it does something ugly now if you (require "foo") for nonexistent "foo".

This would be bad for any external users of the tool.. and us too!

New forms for baby language

Add these to baby language:
list
car
cdr
cons
empty?
Listof
and
eq?
(error string)

Write canonicalizer for Racket core dataset

This can be a simple racket program that reads in the current batch of expaded data and produces something in a reduced grammar. This reduced grammar will have a direct interpretation as an algebraic type in our language.

Proposal from slack:

(1) I propose that we emphasize "expr" and squish away all top-level / module-related forms to the following:

(data prog [prog (list toplvl)])
(data toplvl
      [define-values (list Sym) expr]
      [define-syntaxes (list Sym) expr]
      [expression expr])
(data expr [VARREF Sym] [plain-lambda ...]...)
(data formals
      [F1 (list Sym)]
      [F2 (list Sym) Sym]
      [F3 Sym])

(2) we keep all the constructors of "expr" in the same order as the Racket docs, and that determines their tags, starting with 0

(3) for our own purposes we introduce the tags VARREF, F1, F2, F3 above, which are not part of the original grammar

(4) we build-in a notion of list fields in our ASTs, e.g. (list Sym). Worst-case they could be new, packed cons-list datatypes, but we can do better if we bake them in.

(5) by convention we strip all occurrences of #% prefixes in tags

Small thing: Add Strings to ErrT in Target.hs

Benchmark data collection check list

Data collection can be checked off when the results are checked in to the benchdata repo. The tentative plan is for all results to come from cutter.crest.iu.edu. If you don't know how to log in and get onto a dedicated worker node with srun, just ask.

For the AST benchmark checklists below, here are VARIANT (#14) settings in order of priority. Issue #14 also describes the CSV data format and other details.

treelang-c-packed -- our technique
treelang-c-pointer -- baseline technique
treelang-racket -- gives a comparison against an existing functional compiler
handwritten-racket -- gives a comparison against idiomatic functional code

And these have been cut:

handwritten-c-pointer -- cut, just too hard to do.

Bintree microbenchmark

We have done many variants of this one because it's so simple. But we need to gather a full dataset not just one number for each implementation. Here's the requirement for a full dataset:

Schema is the same as CSV for AST benchmarks (NAME, VARIANT, ARGS, ITERS, MEANTIME)
Use the same methodology of doubling ITERS until the BATCHTIME is over 1.0s.
ARGS = depth of tree, vary this from 1-25. 2^25 takes 335MB packed.
stop gathering data if a single iteration takes over 10s (i.e. don't test larger sizes, just let that line end)

Given those criteria, here is the checklist of data to collect for our "shootout" subsection. These are in decreasing importance, if we don't get to them all:

"treebench" - i.e. the add1-to-leaves benchmark

Here we only time the tree traversal itself, not allocation or initialization. For variants that need to allocate/delete each iteration, they must include that in the timing loop, however.

The above strings should also be the exact contents of "VARIANT".

"buildtree"

This one is just building the tree that is input to "treebench". For this benchmark we time only the tree building phase. For simplicity we initialize all the leaves to 1. Some of the handwritten benchmarks above do some arithmetic to initialize to 1..N, that needs to be changed to a constant 1.

"sumtree"

This one is just a fold. Add all the leaves of the binary tree.

AST benchmarks

This checklist is for the newer no-list AST benchmarks.

Non-time based statistics collection

We need to do these for building other figures:

establish canonical key space for bench inputs, i.e. the ARGS column. (Done: this is cleaned_list.txt.)
collect results of countnodes over the racket dataset (number tags/objects in the tree)
count bytes in the packed racket dataset (trivial)
count bytes in the C-struct/pointer-based representation of the racket dataset.

substitution

treelang-c-packed
treelang-c-pointer
treelang-racket
handwritten-racket

copy-prop

treelang-c-packed
treelang-c-pointer
treelang-racket
handwritten-racket

KD-tree benchmark

The following are the final benchmarks that will be evaluated (just some renaming to remove any confusion )

STLC typecheck

Implement dictionary ops -- somehow

The only value types here are

Int
SymT - also Int
Packed data

I don't even know if we need Ints. Basically we want to be able to write passes like inline that keep an environment of tree values (pointers into existing buffers) as they walk down the tree. If we just assume void* that covers all three of the above: a single word.

Intern symbols when packing parsed sexp AST

I propose that we assume symbol interning has happened before we time compiler passes. So for all intents and purposes, variables will be Ints in the packed representations.

If we were using Packed trees as a communication format (as we do with general Compact Normal Forms) then this would cause problems, because we'd need to send thy symbol table. But that is way outside the scope of this paper.

Including symbol interning in each benchmark will mean extra complexity for the hand-written benchmark baselines.

Fix relative path handling for require'

It must match Racket and be relative to the file with the (require _) form, not the current working directory.

Related to #17

Partial tree traversal detection in RACKET backend

This was an idea discussed briefly with @spall in our last call. @samth may have an opinion on whether this is a good idea.

The idea is that we could get gibbon-racket to run in a debug mode that closely tracks the functionality available in the gibbon-c-* backends. Namely, it could enforce "full tree traversals". A candidate algorithm is as follows:

every data constructor implicitly gets an extra mark field.
every function call gensyms a fresh mark, and data constructed during its dynamic extent gets that mark
every function is wrapped with a contract that traverses all output trees, ensuring that they ONLY contain the mark of that function, and not any "leftover" marks from input trees

I believe this contract would ensure full traversals.

Even as we shore up these limitations and perform copy insertion, it may help to have the ability to do these kinds of assertions for sanity checking and debugging.

Add a tuple flattening pass

This would go before unarise, it would:

ensure that MkProdE takes only arguments of non-tuple type. Some arguments will be of the form ProjE ix e, however.
after this pass, no function return, Case/If return, or let binding has a type with a nested tuple.

Decisions: Language to compile and benchmark input exprs

Goal: have large, realistic programs as benchmark inputs, while keeping the language very simple & tractable.

Goal: have N json files (or some simple format) with large, standalone programs in them.

Challenge: module systems.
Challenge: large lists of primitives and corner cases. Complexity.

Candidates:

Racket core: as described by this grammar
GHC core:
?

Proposal AST lobotomy:

take realistic programs
mash the primop set down to <10 primops
preserve realistic structure.

Alternative: make synthetic programs. Hard to convince of realism.

Experiments in notation

Consider this program:

share t =
 case t of
   Node x _ ->
       let x' = share x in
       Node x' x'
   Leaf _ -> t

Share fails to traverse to the rightmost point, thus it CANNOT produce end pointers.
It does write output data though, so it needs to return the end of what it outputs.

Let's think about how we compile this example. Assume that our solver/synthensizer
has these functions available to it:

copy     :: forall (a :: Packed) (p,o :: Region) . a [p,o]->[-p,-o] a
traverse :: forall (a :: Packed) (p :: Region)   . a [p]->[-p] a

We may have to perform an iterative transform that reaches a fixed
point. First, we know we need input and output regions based ONLY on
the type:

share ::  Tree [p,o]->[-o] Tree

We then can insert the implicit args, generate region names, and
leave holes for new arguments and return vals:

  share t [p,o] =              -- t at p
   case t of
     Node x[p1] _ ->           -- x at p1, p1 in p
         let x'[_] = share x [_,_] in
         do o2 <- Node[_]
            o3 <- yield x' [o2]
            o4 <- yield x' [o4]
            return [o4]                -- o4 = -o
     Leaf _ -> t[_]

There are a lot of possibilities here for how we transform data
constructors. The above represents one possibility.

We can solve part of this, but cannot finish it without changing the
program. We can propgate some region constraints, and we can also
explicitly choose copying as our strategy for getting the subtrees
into the right output region:

  share t [p,o] =                      -- t at p
   case t of
     Node x _ ->                       -- x at p1, p1 in p         
         let x'[p2] = share x [p1,?] in 
         do o2 <- Node[o]
            o3 <- copy x' [p2,o2]
            o4 <- copy x' [p2,o3]
            return [o4]                -- o4 = -o
     Leaf _ -> copy t [p,o]

Finally we need a new buffer, and then we have solved all the holes:

  share ::  Tree [p,o]->[-o] Tree
  share t [p,o] =                      -- t at p
   case t of
     Node x _ ->                       -- x at p1, p1 in p         
         let o' = newbuf in
         let x'[p2] = share x [p1,o'] in 
         do o2 <- Node[o]
            o3 <- copy x' [p2,o2]
            o4 <- copy x' [p2,o3]
            return [o4]                -- o4 = -o                  
     Leaf _ -> copy t [p,o]

Discussion: closed vs. open tree traversal language

Closed:

First-order language. No unpacked sums or products?
Standalone or embedded. Say, a little Racket macro.

Open:

Host language (say, Racket, Haskell, OCaml, whatever, best if it supports algebraic datatypes)
Could implemented as a complier pass / pragma
Or could be implemented as a preprocessor / macro

Formal model:

applies whether

current L1->L3
assumes higher order
assumes boxed pruducts and sums (BUT NOT Mu types)
- e.g. how to do a Int -> List PacketTree

Which of these possible passes to tackle?

See also #4 about what language we should compile.
And see #5 about non-AST tree benchmarks.

Does not keep environment

Basic desugaring and other simple passes: e.g. (let->((lambda
Substitution -- needs only fixed old/new arguments.

Keeps environment:

I.e. needs to maintain a dictionary/finite-map structure as it walks down the tree.

Constant prop / constant folding
Beta reduction (representative of simplification/inlining)
Type checking (Sarah is working on this)
Interpreter

Needs state:

Renaming / gensym
Hygenic macro expansion (stateful and more interesting)

Other:

Program analysis, abstract interpreters, or basic dataflow analyses

Eventual: Self-hosting gibbon

This is a long time off, but I thought we should have a discussion about planning so that our eventual trajectory bends towards this goal.

@milindkulkarni, @kunalwagrawal, @samth and myself have plotted a course (and submitted a grant proposal) on parallel compilation. This overall plan requires a number of pieces and I think there are some tricky questions of how to make our work actually fit together, rather than just on paper. For example, when we work on pass-fusion, where should that go?

If Gibbon were its own stand-alone, self-hosted implementation, then the defacto place for where new gibbon work would go would be in gibbon itself.

By being a Racket #lang, Gibbon could lean on Racket to make a gradual transition to self hosting.

But before we seriously consider writing compiler passes for Gibbon, in gibbon, we need to sort out other parts of the "compiler DSL" story. Like the nano-pass / generic programming story. Here's a rough set of issues:

generic programming (typed nanopass)
meta-programming / allowing some higher-order code
pass fusion (may relate to nanopass encoding)
pass vectorization
low level ELF / JIT details. Can we make the best JIT that's also the best AOT simultaneously?
runtime + scheduling issues

Please help brainstorm on this.

Push list fields and map through the compiler

First it needs to be added to L1, then traversal inference, and then cursorize.

Cleanup, debugging, and robustness tasks for compiler prototype

That pre-PLDI hacking sprint left quite a bit of technical debt behind. Here is the plan for getting things on solid footing:

Stop using a pattern synonym hack to encode CursorsTy's as PackedTy. Make it its own type in L1. The WriteInt pattern synonyms are much less of a problem. They can probably stay.
Add an interpreter and run it after (almost) every pass.
Run the type checker after (almost) every pass and debug any remaining problems.
Add linear/session types to the type system and type checker (#41)

Make linear+session types real

This will be a major step up in assurance for the compiler internals. The core cursorizing transform is really so complicated and error prone that it warrants it.

I won't describe the technique here. Read the paper for more details..

P.S. Note that if this were implemented in a type-indexed, Accelerate style, we could consider doing it at the host type-system level, and thus verifying that our compiler passes correctly handle cursors for all inputs. That's a lot of work, though. It would probably happen in an inverted version of the technique described in the ghostbuster paper. You would need to go to an indexed representation, do the verified transform, and then drop back down. Either that or refactor the whole compiler. But we have generally felt that that provides too high a cost/benefit ratio. Perhaps there are lighter weight (static) verification techniques... In the meantime, the more dynamic testing approach in this issue should do quite well.

Change unpack_* return values to PtrTy rather than CursorTy

Make the treelang compiler properly usable from Racket

Right now it's a standalone compiler.

From accelerack, we already have experience with FFI and multi-language projects playing nice with Racket. When it is ready, we can make it possible to:

compile treelang modules through the packing compiler (as now)
but instead of producing an executable, produce a library
during macro expansion, generate stubs that call into this library and do the usual FFI things

The end result is that a Racket module can call treelang code, i.e. just like a racket module can call typed-racket code.

Draft sec 1-3 (@samth)

@samth will finish this.

Eventual: ELF binaries in memory as JIT target

This was briefly discussed with @chamibuddhika in Slack. (CC @samth)

The simple question is -- can ELF binary format be a good JIT target? If it were, then optional AOT compilation == msync.

We've been asking "what barriers can be removed from the compiler", i.e. pass/pass barriers. But another big one is the global compile/run barrirer. Essentially the compiler is a global synchronization barrier, you don't run the binary until the entire compiler finishes.

"JIT mode" may be a way of running the code before all of it is compiled. JITs create a kind of streaming situation, where the start of the main function is the thing that should be compiled first, and then everything else is prioritized based on its first occurrence in the execution. (Or, in the hotspot style, 1st execution is always interpreted.)

Interpreting would just be "speculation" in this sense. If you MUST produce the compiled binary on disk, then all the work of compilation is scheduled to happen eventually. The only reason to interpret is if a parallel worker is so starved for work that it makes sense to do this "low quality" (less efficient) work. (correct? @kunalwagrawal).

Make gibbon a proper #lang

So that we only have to write #lang treelang.

Also, we may want to make it the idiom to do #lang sweet-exp treelang.

Include at least one non-AST benchmark

To satisfy reviewers, include one or more non-AST tree benchmarks:

DOM passes (hard!?)
kd-tree? and other nearest-neighbor tree variants. (Why haven't people done manual unpacking for this domain?)

Our very simple microbenchmarks fall into this category:

add1Leaves -- our first becnhmark we implemented for packed representations. Add 1 to all the leaves of a binary tree.
treeSize -- traverse a tree and count all its leaf and interior nodes.

Rejected:

unbalanced tree search (UTS) - let's not do this one; the tree doesn't exist apriori.

CC @laithsakka @milindkulkarni

Design: Next-gen Cursorize, based on data-flow analysis

@vollmerm and I are both eager to refactor and reduce the complexity of the current Cursorize. (Even though I think it should be correct and complete with a little more debugging.)

The next-generation idea was to do a dataflow analysis first, and determine a-priori the destinations to which any packed-allocating function calls (or primitive constructors) flow.

Here we introduce a symbolic notion of destinations which may be related to, but not necessarily identical to, the lattice of locations used in the traversal-inference pass. Two examples that we can use when discussing this are the add1 recursions (let bound or inline) as well as a sharing version:

let x' = add1 x 
    y' = add1 y
in Node x' y'

versus

let x' = add1 x 
in Node x' x'

In both cases the Node data constructor flows to the "output of function" destination. Whereas in the first example x' flows to "output of function + sizeOfTag", in the second example we have a conflict, where x' attempts to flow to two different, non-equal destinations. The data-flow analysis must produce a list of conflicts as well as an annotated program. It is the job of a subsequent pass to repair these conflicts: i.e. introducing copies, duplicating work, or changing the representation.

Either this pass or a subsequent pass we should introduce an explicit let region binding (which in our case could just be a normal let binding to a NewScopedRegion) for any lexically scoped regions. I.e. ones that do not flow to the output of a function.

P.S. Remember to check out the new ICFP submission from SPJ and others on "destination passing style".

Reference: Benchmark and CSV output conventions

I'm copying this over from Slack because it needs to be referenced and possible discussed in a thread.

We're currently producing output to this directory in NSF space on [email protected]:

~/results_backup/tree-velocity/cutter/345

Where:

cutter is the hostname, stripped of digits characters that indicate worker node (e.g. cutter03)
345 is the "git depth" i.e. number of commits on master, defined as git log --pretty=oneline | wc -l

There's no separation between the benchmark harness and the subst pass code now (benchmark-er and benchmark-ee)... it's just one racket script that loops through all the files. It also bakes in a number of conventions that we must follow in other benchmarks:

(1) output looks like this:

NAME, VARIANT, ARGS, ITERS, MEANTIME
substitution, treelang-racket, ./collects/acks/acks.rkt.out.sexp, 524288, 3.757476806640625e-06
substitution, treelang-racket, ./collects/compiler/cm-accomplice.rkt.out.sexp, 131072, 1.32598876953125e-05
substitution, treelang-racket, ./collects/compiler/cm.rkt.out.sexp, 4096, 0.000304443359375
substitution, treelang-racket, ./collects/compiler/compilation-path.rkt.out.sexp, 32768, 3.57666015625e-05
substitution, treelang-racket, ./collects/compiler/compile-file.rkt.out.sexp, 65536, 1.82647705078125e-05
substitution, treelang-racket, ./collects/compiler/compiler.rkt.out.sexp, 8192, 0.0001400146484375
substitution, treelang-racket, ./collects/compiler/distribute.rkt.out.sexp, 8192, 0.000162353515625
....

(2) the policy for running is to start at "1" iteration and double the number of iterations until it takes >= 1 second

(3) the value reported is the mean time/iter for the final batch, ignoring all the previous batches

Note we are NOT using Criterion and NOT using linear regression here. We'd have to do a bit more work to hook it up and there's no time.

What do we need to build a log/log scatterplot?

We need one or more data files of the above format.
We need another file that maps these canonical paths ./collects/acks/acks.rkt.out.sexp onto NumNodes

Then we can plot log(size) vs log(time) and have a scatterplot with a different color point per VARIANT

We also need a standard namespace for variants. I propose:

treelang-c-packed - C backend for treelang, packed representation
treelang-racket - treelang executed through Sarah's racket embedding
treelang-c-pointer - we wont get to it for this paper, but the compiler-generated pointer-chasing code
handwritten-c-pointer
handwritten-c-packed

And if we need them, we can also have handwritten-racket, handwritten-haskell etc. Also, we could probably print treelang trivially to other functional languages to benchmark their compilers if we wanted... eg treelang-ghc and treelang-mlton. But that would only really work for self-contained benchmarks (add1tree), because we're not going to replicate our hacky SExp-parsing for our compiler passes any more than we already have.

TODO list meta issue

Here are some TODO items for a paper. Normally we keep implementation separate from paper, but it should be fine to put a paper in this repo, since it's just microbenchmarks and sketches at the moment, not a reusable library.

Please update the following when someone claims an issue, and split out into separate issues for discussion as needed.

(started by @rrnewton and @chamibuddhika) Manually parallelize the existing bintree microbenchmark. (Mainly using C.)
- Introduce offset entries within the packed trees, thus enabling mixed representations and parallelism.
- Introduce block-chains for a more realistic allocation story.
- Use Cilk for a simple fork-join C implementation.
(@rrnewton) finish L1->L2->L3 prototype compiler. This is the first model of the formal language, and hopefully what will get translated into greek in the paper.
(?) Populate a paper skeleton and start filling in the core language for the compiler transforms.
(?) Harvest input program(s) for beta-reduction benchmark. Now moved to issue #4.
(?) Manually implement the packed version of a compiler pass in C/C++/Rust. List of possible passes in #3.
(@milindkulkarni & @laithsakka ) Manually implement a packed version of kd-tree for a non-AST example. See #5.
(?) Select other compiler passes and repeat for those.
(?) Look into doing a front-end that meshes with a host language.
- Could be a subset of Haskell + Template Haskell or a preprocessor (e.g. haskell-src-exts). Could be a core-to-core pass.
- Could be a very simple DSL in Racket, implemented in a macro but not allowing arbitrary Racket code.
- Could be something standalone if we make the closed world assumption as described in #2

Week-by-week plan for PLDI17 deadline

Wk1 (10/10): get at least one benchmark AST and a couple manual compiler pass implementations. Finish translator for minimal core language (2). Finish (1) above and hopefully (6).
Wk2 (10/17):
Wk3 (10/24):
Wk4 (10/31):
Wk5 (11/7):
Wk6 (11/14): paper due at start of week Tues, 11/15.

PLDI17 Implementation checklist: Compiler passes and runtime components

I'm working on the middle of the compiler. Here's the basic pipeline. Check these off when "add1" goes through, and then we'll come back and do a second pass to finish the rest

Parsing (from Haskell) (@osa1)
Parsing (from SExp) - we need one or the other of these parsers at 100% functionality
Effect inference (traversal effects) (@rrnewton)
Missing traversal insertion
Copy insertion
Copy/traversal codegen. (After this we can actually rerun effect inference and confirm there are no missing traversals.)
Lowering to target language and cursor insertion.
C Codegen

I'm in the middle of hacking on these middle passes. It's a nano-pass style but most of these are very simple.

And then here are the runtime pieces:

implementation of SExp parsing for ASTs (@chamibuddhika)
converting parsed SExps to packed rep, including interning symbols (#6) (@chamibuddhika)
implementation of dictionary ops (#10)

And other pieces of the implementation:

Direct implementation of the treelang in Racket (@spall) -- one baseline for benchmarks
By-hand benchmarks, other baselines
- Write data cleaner (@samth #12)
Since we're settled on AST format for benchmarks, complete data cleaning (@samth #12)

Minor: allow ShakeTree to drop ScopedBuffer that is not used

This is useless allocation.