MiniJava

MiniJava is a subset of Java. It's a simple language for educational purposes.
Here is the grammar that I used to build the compiler. The Compiler in this repo is MiniJava to X64_86.
THIS COMPILER IS FULLY HAND WRITTEN. NO TOOLS WERE USED.

Development Status

The compiler is still under development here is the stages:

Lexer: finished.
Parser: finished.
AST Building: finished.
Type Checker has to passes:
  Symbol Table building: finished.
  Checking phase by consulting symbol table: finished.
Code Generator: finished.

Build and Run

$ make to build MiniJava from the source you will get MiniJava executable
run ./MiniJava with -h flag to see the help list.
You can Run with the following flags:

-h to show the help list.
-S to see the program from the Lexer prospective (Tokens).
-P to see the program from the Parser prospective for example:
if we have x + y * z it becomes (x + (y * z)).
-T to run the type checker.
-C to run the Code Generator, you will get a file called demo.s
under runtime/ this has the assembly code for your program.
After Generatin the assembly code you can get the executable of your program
by executing make run, the executable name is out
run the executable by typing ./out

Motivation

This is a side-project I totally built myself. The project
is for educational purposes. I studied compiler theories, and
I wanted to implement the theories I learned.

Acknowledgement

I realy thank University of Washington for making CSE P 501 Open source. This course is a great help and helped me a lot.

What you should know

To understand the code, you don't have to know Compilers' theories.
But here are some highlights, that might help:

Regular expressions
Compiler components: Lexer, Parser, Type checking, IR, and Code generation.
Know the order of compiler phases, and why we use this order?
What the AST? Why We use it?
What is the recursive-descent parser?
Hash tables
Graph theory

Lexical Analysis

The lexer does the Lexical Analysis. Lexer implementaion is in Lexer.h and Lexer.cpp
Here are most important functions in the Lexer:

getToken(): Reads the Token stream, and put each group of
charachters in the suitable group e.g, Key word, operation, ...etc.
getNextToken(): Gets the next token in the input stream.
lookahead(): Retuerns the next token in the input stream
used in Parser as it's LL(1) type
eat(TOK, string): Takes two arguments the token, and the string form of it.
Consumes the token if it's the expexted token, or emits an error.

Parsing

The Parser is hand-crafted LL(1). I didn't use any tools to build it,
as I wanted to learn how Parsers are built internally.
We needed to use lookahead() to check parse variable declarations as
they have the same start tokens as assignment statements see grammar
NOTE: The grammar is not LL but I converted it to be LL.
Files Parser.cpp, and Parser.h

Error Recovery:

The compiler Prints Colored error messages. it continues parsing always, but
It exits if the source has 10 errors, couldn't parse the main class,
If there's an error in a class, or method declaration, the compiler skips
the class, or the method, but it emits a message of the error and a note
that it's skipping the operation of the current class or method.
If there is an error in a variable declaration, it skips till the next
semi-colon.
You can see the source program from the parser view by passing the -P flag.

Error Locations

The Error message has a correct location for the parse error, but
if the error is the last charchter in the line it expects the correct token to
be in the next line for example:

1 int x
2 int z;
output:
expected `;` at line 2

it expectes ; instead of int.
The implementation is in Error.h and Error.cpp

AST

The Parser has two important jobs:

Check the syntax and make sure it's error free.
Build the AST. The AST is very important we use it in type-checking, and code generation.
AST implementaion is in AST.h, and AST.cpp.
NOTE: For Any AST traversing in the compiler I used visitor pattern:
Print the source from the parser view.
Building The symbol table.
Type-checking.
Code generation.

Type-Checking / Semantic Analysis

After done with parsing we should check the semantics of the source code.
We do this by two passes:

Build a symbol table of every symbol in the program, and store its scope.
Perform the type-checking consulting the symbol table.

Building Symbol Table

We use the builtin hashmap std::map for the symbol table enteries. This phase is seriously challenging assume the following problems:

Variables and methods' Scopes, and overriding. Assume the following case:
```
class A {
  int a;
  int fii() {int a; return a;}
  int tii() {retuen 0;}
}
class B extends A {
  int a;
  int foo(int a) {return a;}
  int fee() {return a;}
  int tii() {return 6;}
  int a() {return a;}
}
```
The compiler must handle the above case perfectly. in SymbolTable/SymbolTable.h and SymbolTable/SymbolTable.cpp, you see
the interface, and implementation of symbol table.
but the functions that interests you are enterScope(string), and
exitScope(), these functions store the information of every scope it
hits. before entering a new scope, enterScope creates a Symbol table
entry, store the parent scope for the current scope and start storing it's
information.
After done with the current scope exitScope return the control to the
parent scope.
Cyclic inheritance. Assume the following case:
```
class A extends B {}
class B extends C {}
class C extends A {}
```
How can we know this? the solution is simple, just build a graph using the class relations. every node is a class, and every directed edge represents extends, and check for the cycles using simple dfs.
- Class appears before it's used. This problem doesn't exist in C / C++, for example:
```
class A extends B {}
class B {}
```
or
```
Class A {B b;}
Class B {A a;}
```
How can we know this? answer:
- For the first case, class B must be in the symbol table before class
  A, so we use topological sort to know the order.
- For the second case, If class B doesn't use class A object,
  we could use topological sort, but there's a cycle in declaration.
  how could we solve this? answer, just use a map e.g, map<string, bool>
  and before building the symbol table just store all classes.
  If we find the class in map then we don't emit an error. For example:
  the map has two values: {{A, true}, {B, true}}, the we start building the
  symbol table.
  Merging the above two solutions, sort, and store, we can build the symbol table
  successfully.

Type-Checking

After building the Symbol table this phase becomes easy. Just use
getScope, and exiteScope for every scope you hit. The type-checker
consults the symbol table about variables and classes, e.g:

Return type mismatch.
Use of undefined variable.
Check if the object has the method e.g, a.foo().
Declare variable of undefined types e.g:
```
class A {B b;}
```
non-integer array index.
non-boolean if expression.
and many more...

You can find The Symentic Phase code in: SymbolTable/, TypeCheckerVisitor.h, TypeCheckerVisitor.cpp,
and TypeVisitor.h

Code Generation

This phase was the most difficult although it seems easy, but the generated
code had many Segmentaion fault errors most of them was because of stack alignment.
The technique used here is so simple traverse the AST in execution order
and emit the code in visitor methods.
One strategy that made the implementation easy is to treat X86 as a 1-register
language with a stack for intermediate values all operations writes %rax register
The code however uses other register for some operations but not much.
function calls was a little bit trickt to implement but the code uses O(1) dynamic dispatch.
you can find the code in CodeGeneratorVistor.h and CodeGeneratorVisitor.cpp

Language Runtime

The MiniJava program should have runtime environment e.g space for heap, and stack,
%rsp and the other registers should have correct initial values, and a way
to allocate memory for objects, and arrays.

Bootstrapping

We use the existing C runtime, under directory runtime/ you will find
a C program boot.c. This program calls the MiniJava main function
as if it were a C function. By using this technique, it's possible
to call standard C functions, and the program will now use the C ennvironment.

Copyrights

I wrote All files in this project except runtime, and SamplePrograms
they are from CSE P 501 starter code, however you can use this code freely.

elhewaty / minijava Goto Github PK

minijava's Introduction