
mrlsd / semantic-analyzer-rs

33.0 4.0 1.0 1.55 MB

Semantic analyzer library for compilers written in Rust for semantic analysis of programming languages AST

License: MIT License

Rust 100.00%
compiler programming-language abstract-syntax-tree semantic-analysis semantic-analyzer compiler-construction compiler-design

semantic-analyzer-rs's Introduction


mrLSD/semantic-analyzer-rs

semantic-analyzer-rs is an open-source semantic analyzer for programming languages that makes it easy to build your own efficient compilers, with extensibility in mind.

🌀 What the library is for and what tasks it solves

Creating a compiler for a programming language is a process that involves several key stages. Most commonly they are:

โ–ถ๏ธ Lexical Analysis (Lexer): This stage involves breaking down the input stream of characters into a series of tokens. Tokens are the atomic elements of the programming language, such as identifiers, keywords, operators, etc.

โ–ถ๏ธ Syntax Analysis (Parsing): At this stage, the tokens obtained in the previous stage are grouped according to the grammar rules of the programming language. The result of this process is an Abstract Syntax Tree (AST), which represents a hierarchical structure of the code.

โฉ Semantic Analysis: This stage involves checking the semantic correctness of the code. This can include type checking, scope verification of variables, etc.

โ–ถ๏ธ Intermediate Code Optimization: At this stage, the compiler tries to improve the intermediate representation of the code to make it more efficient. This can include dead code elimination, expression simplification, etc.

โ–ถ๏ธ Code Generation: This is the final stage where the compiler transforms the optimized intermediate representation (IR) into machine code specific to the target architecture.

This library implements the Semantic Analysis stage.
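To make the stages concrete, here is a toy end-to-end sketch (lexer → parser → semantic check) for a tiny `num (+ num)*` language. All names here are illustrative and are not the semantic-analyzer-rs API.

```rust
// Toy pipeline sketch: characters -> tokens -> AST -> type check.
#[derive(Debug, PartialEq)]
enum Token { Num(i64), Plus }

#[derive(Debug, PartialEq)]
enum Ast { Num(i64), Add(Box<Ast>, Box<Ast>) }

// Lexical analysis: split the source into atomic tokens.
fn lex(src: &str) -> Vec<Token> {
    src.split_whitespace()
        .map(|w| if w == "+" { Token::Plus } else { Token::Num(w.parse().expect("number")) })
        .collect()
}

// Syntax analysis: build a left-associative AST from the tokens.
fn parse(tokens: &[Token]) -> Ast {
    let mut iter = tokens.iter();
    let mut node = match iter.next() {
        Some(Token::Num(n)) => Ast::Num(*n),
        _ => panic!("expected a number"),
    };
    while let Some(Token::Plus) = iter.next() {
        match iter.next() {
            Some(Token::Num(n)) => node = Ast::Add(Box::new(node), Box::new(Ast::Num(*n))),
            _ => panic!("expected a number after '+'"),
        }
    }
    node
}

// Semantic analysis: verify every addition operates on integers.
fn infer_type(ast: &Ast) -> &'static str {
    match ast {
        Ast::Num(_) => "i64",
        Ast::Add(l, r) => {
            assert_eq!(infer_type(l), infer_type(r), "type mismatch in '+'");
            "i64"
        }
    }
}

fn main() {
    let ast = parse(&lex("1 + 2 + 3"));
    assert_eq!(infer_type(&ast), "i64");
}
```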

🌻 Features

✅ Name Binding and Scope Checking: The analyzer verifies that all variables, constants, and functions are declared before they're used, and that they're used within their scope. It also checks for name collisions, where variables, constants, functions, or types in the same scope share the same name.
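The name-binding and collision rules can be sketched with a stack of lexical scopes (a minimal illustration, not the library's implementation):

```rust
use std::collections::HashSet;

// A stack of scopes: lookup walks outward; redeclaring a name in the
// same scope is a collision, while shadowing in a child scope is allowed.
struct Scopes { stack: Vec<HashSet<String>> }

impl Scopes {
    fn new() -> Self { Scopes { stack: vec![HashSet::new()] } }
    fn enter(&mut self) { self.stack.push(HashSet::new()); }
    fn exit(&mut self) { self.stack.pop(); }
    /// Declare in the innermost scope; `false` signals a name collision.
    fn declare(&mut self, name: &str) -> bool {
        self.stack.last_mut().unwrap().insert(name.to_string())
    }
    /// A name is visible if any enclosing scope declares it.
    fn is_visible(&self, name: &str) -> bool {
        self.stack.iter().rev().any(|s| s.contains(name))
    }
}

fn main() {
    let mut scopes = Scopes::new();
    assert!(scopes.declare("x"));
    scopes.enter();
    assert!(scopes.declare("x"));  // shadowing in a child scope is fine
    assert!(!scopes.declare("x")); // same-scope redeclaration: collision
    scopes.exit();
    assert!(scopes.is_visible("x")); // outer `x` still visible
    assert!(!scopes.is_visible("y"));
}
```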

✅ Checking Function Calls: The analyzer verifies that functions are called with the correct number of arguments, and that the types of the arguments match the types expected by the function.
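A call-site check of this kind boils down to an arity test plus a per-argument type comparison. A minimal sketch (illustrative types, not the crate's API):

```rust
// Verify a call site against a function signature.
#[derive(Debug, PartialEq, Clone)]
enum Ty { I64, Bool }

struct FnSig { params: Vec<Ty> }

fn check_call(sig: &FnSig, args: &[Ty]) -> Result<(), String> {
    // Arity check: number of arguments must match the signature.
    if sig.params.len() != args.len() {
        return Err(format!("expected {} arguments, got {}", sig.params.len(), args.len()));
    }
    // Per-argument type check.
    for (i, (p, a)) in sig.params.iter().zip(args).enumerate() {
        if p != a {
            return Err(format!("argument {}: expected {:?}, got {:?}", i, p, a));
        }
    }
    Ok(())
}

fn main() {
    let sig = FnSig { params: vec![Ty::I64, Ty::Bool] };
    assert!(check_call(&sig, &[Ty::I64, Ty::Bool]).is_ok());
    assert!(check_call(&sig, &[Ty::I64]).is_err());            // wrong arity
    assert!(check_call(&sig, &[Ty::Bool, Ty::Bool]).is_err()); // wrong type
}
```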

✅ Scope Rules: Checks that variables, functions, constants, and types are used within their scope and are available in the visibility scope.

✅ Type Checking: The analyzer verifies that operations are performed on compatible types in expressions, function calls, constants, and bindings. Type checking is the process of verifying that the types of expressions are consistent with their usage in the context.
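Type checking of expressions is usually a recursive inference over the AST: each operator constrains its operand types and yields a result type. A minimal sketch (the enum and rules here are illustrative):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Ty { I64, Bool }

enum Expr {
    Lit(Ty),
    Add(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

// `+` demands two i64 operands and yields i64; `==` demands operands of
// the same type and yields bool.
fn infer(e: &Expr) -> Result<Ty, String> {
    match e {
        Expr::Lit(t) => Ok(*t),
        Expr::Add(l, r) => match (infer(l)?, infer(r)?) {
            (Ty::I64, Ty::I64) => Ok(Ty::I64),
            (lt, rt) => Err(format!("cannot add {:?} and {:?}", lt, rt)),
        },
        Expr::Eq(l, r) => {
            let (lt, rt) = (infer(l)?, infer(r)?);
            if lt == rt { Ok(Ty::Bool) } else { Err(format!("cannot compare {:?} and {:?}", lt, rt)) }
        }
    }
}

fn main() {
    let ok = Expr::Add(Box::new(Expr::Lit(Ty::I64)), Box::new(Expr::Lit(Ty::I64)));
    assert_eq!(infer(&ok), Ok(Ty::I64));
    let bad = Expr::Add(Box::new(Expr::Lit(Ty::I64)), Box::new(Expr::Lit(Ty::Bool)));
    assert!(infer(&bad).is_err()); // incompatible operand types
}
```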

✅ Flow Control Checking: The analyzer verifies that control flow statements (if-else, loop, return, break, continue) are used correctly, including support for condition expressions and checks of their correctness.

✅ Building the Symbol Table: The analyzer maintains a symbol table, the data structure used to keep track of symbols (variables, functions, constants) in the source code. Each entry in the symbol table contains the symbol's name, type, scope (relative to the block state), and other relevant information.
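The shape of such an entry can be sketched as follows (field names here are illustrative, not the crate's actual types):

```rust
use std::collections::HashMap;

// A symbol-table entry: type, owning scope, and other properties.
#[derive(Debug, Clone, PartialEq)]
struct Symbol {
    ty: String,
    scope_id: u32, // identifies the block state that owns the symbol
    mutable: bool,
}

fn main() {
    // Map from symbol name to its entry.
    let mut table: HashMap<String, Symbol> = HashMap::new();
    table.insert("x".into(), Symbol { ty: "i64".into(), scope_id: 0, mutable: true });
    assert_eq!(table.get("x").map(|s| s.ty.as_str()), Some("i64"));
    assert!(table.get("y").is_none()); // undeclared symbol
}
```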

✅ Generic expression value: The ability to extend the AST with custom expressions, according to compiler requirements, and to implement custom instructions for these custom expressions in the Semantic Stack Context.

🌳 Semantic State Tree

The result of executing the semantic analyzer's stages is the Semantic State Tree.

It can be used for Intermediate Code Generation, for further semantic-tree optimization passes, for linting, and for backend codegen (such as LLVM) targeting a machine.

🌲 Structure of Semantic State Tree

  • Block state and related child block states. This is the basic entity for scopes: variables and blocks (function, if, loop). It is especially relevant for expressions, as it allows you to granularly separate visibility scopes and their limits. In particular, all child elements can access parent elements, but parent elements cannot access child elements, which effectively limits the visibility scope and entity usage.

    • Variables state: a block-state entity that contains the properties of a variable in the current state, such as name, type, mutability, allocation, and malloc status.

    • Inner variables state: a block-state entity that contains inner variable names. It's useful for Intermediate Representation for codegen backends such as LLVM, where shadowed variables must have different inner names; inner variable names are therefore always unique.

    • Labels state: a block-state entity that contains all information about control flow labels.

  • Global state: contains the global state of constants, declared functions, and types.

  • State entity: contains:

    • Global State
    • Errors results
    • Semantic tree results
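The inner-variables idea above (unique names for shadowed variables) can be sketched with a per-name counter; the `name.N` scheme here is illustrative, not the library's actual naming:

```rust
use std::collections::HashMap;

// Each (re)declaration of a source-level name gets a unique inner name
// ("x.0", "x.1", ...), so shadowed variables stay distinct in the IR
// handed to a backend such as LLVM.
struct InnerNames { counters: HashMap<String, u32> }

impl InnerNames {
    fn new() -> Self { InnerNames { counters: HashMap::new() } }
    fn declare(&mut self, name: &str) -> String {
        let n = self.counters.entry(name.to_string()).or_insert(0);
        let inner = format!("{}.{}", name, n);
        *n += 1;
        inner
    }
}

fn main() {
    let mut names = InnerNames::new();
    assert_eq!(names.declare("x"), "x.0");
    assert_eq!(names.declare("x"), "x.1"); // shadowed `x`: fresh inner name
    assert_eq!(names.declare("y"), "y.0");
}
```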

All of this is source data that can be used for Intermediate Representation, subsequent optimizations, and compiler codegen.

🧺 Subset of programming languages

The input to the analyzer is a predefined AST (abstract syntax tree). For building the AST, the only dependency used is nom_locate, which provides all the necessary information about the source code for semantic analysis and for generating relevant, informative error messages. The AST is currently a fixed structure, because it is a fundamental element that defines the lexical representation of a programming language.

On the other hand, this allows you to implement any subset of a programming language that matches the syntax tree. It also implies a subset of lexical representations from which an AST can be generated that meets the initial requirements of the semantic analyzer. For lexical analysis and source-code parsing, the recommended library is nom, a parser-combinator library.

The AST describes a Turing-complete programming language and contains all the elements necessary for this.

🔋 🔌 Extensibility

The AST is predefined, but real-world compilers may need to extend it for their specific needs. For this reason, AST extensibility has been added, together with an additional generated set of Instructions for the Semantic Stack Context.

  • 🚨 Generic expression value: the ability to extend the AST with custom expressions, according to compiler requirements, and to implement custom instructions for these custom expressions in the Semantic Stack Context.

๐Ÿ›‹๏ธ Examples

  • 🔎 There is an example implementation as a separate project: 💾 Toy Codegen. The project takes the SemanticStack results and converts them into code-generation logic, which clearly shows the possibilities of using the semantic-analyzer-rs SemanticStackContext results. LLVM is used as the backend, with inkwell as the LLVM codegen library, and the result is compiled into an executable program. The data source is the AST structure itself.

📶 Features

Available Rust library features:

  • codec - 💮 enables serialization and deserialization with Serde. This is especially convenient for forming the AST, for Codegen, and for a serialized representation of the SemanticState. Another important nuance is that any library that implements Serde can act as the serializer codec, for example formats such as json, toml, yaml, binary, and many others supported by the serde library. The main entities to which the codec feature applies are:
    • AST ↪️ the AST data source can be provided in serialized form. This is especially useful for designing and testing Codegen and the AST data-transfer pipeline, and also for using any programming language that can generate serialized AST data as a data source.
    • State ↪️ the SemanticState can be obtained as serialized data. This is especially convenient for storing state before code generation with different parameters, and for post-analysis and optimizations working with already-analyzed data.
    • SemanticStack ↪️ contains a set of instructions for Codegen. A serialized representation is convenient for code generation without repeated semantic analysis, based only on the instructions produced by the semantic analyzer for the code generator. Serialized SemanticStack data opens up wide possibilities for using third-party code generators and compilers implemented in any programming language.

semantic-analyzer-rs's People

Contributors

mrlsd


Forkers

hbcbh1999

semantic-analyzer-rs's Issues

[Feature]: Add example for AST

Description

The semantic-analyzer accepts AST as a data source. It is necessary to add examples of AST implementations as a data source.

💜 Purpose

  • ➡️ AST implemented with the nom library.
  • ➡️ AST through JSON input (depends on #21)

Feature: flatten root `BlockState` for `SemanticStackContext`

Description

The current BlockState is represented as a tree with one parent and many children; it represents blocks as a tree with branches and leaves. Each block state has a context that contains a SemanticStack, which means all contexts are also represented as a tree.

The entry point to a branch or leaf of the context is unknown, and it is a bit tricky to flatten the root context for the endpoint Codegen. To simplify the fetching process for Codegen backends, there is a proposal:

  • for the parent BlockState, "flatten" all children's contexts: the parent context will also include the children's contexts, in a linear representation. At the same time, the child BlockStates keep their own contexts.
  • the root BlockState will contain the flattened, linear context.
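The proposed flattening can be sketched as a depth-first walk that concatenates each block's instructions into one linear context (a minimal illustration; the exact interleaving order in the real proposal may differ):

```rust
// A block-state tree where each node owns a list of instructions; the
// root's flattened context is a depth-first, linear concatenation of its
// own and all children's instructions.
struct Block {
    instructions: Vec<String>,
    children: Vec<Block>,
}

fn flatten(block: &Block, out: &mut Vec<String>) {
    out.extend(block.instructions.iter().cloned());
    for child in &block.children {
        flatten(child, out);
    }
}

fn main() {
    let tree = Block {
        instructions: vec!["fn-begin".into()],
        children: vec![Block { instructions: vec!["if-body".into()], children: vec![] }],
    };
    let mut linear = Vec::new();
    flatten(&tree, &mut linear);
    // Codegen can now consume one linear instruction sequence.
    assert_eq!(linear, vec!["fn-begin".to_string(), "if-body".to_string()]);
}
```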

Add registers number to `SemanticStackContext` definitions

Description

To efficiently implement Codegen as the next step after semantic analysis, it will be really helpful to implement register numbers for all kinds of expressions and assignment operations.

The main point is to track correct expressions, values, and function parameters.

Earlier, this functionality was present but was removed.

Improve return logic for `function-body` and `if-body` state

Description

If-body block state

For the if-body block state, when return is called, the JumpTo if-end instruction is still emitted afterward. This is redundant: an additional check should be added for whether the JumpTo label instruction is a manual return for the current block state. If manual-return is set, do not emit JumpTo.

function-body block state

For the function-body block state, an additional check for manual-return should be added. If it is set to true, add the return label before the return instruction, and also add the return instruction itself.

Other cases

โ†ช๏ธ Check the same cases (if any) for the if-else statement
โ†ช๏ธ Check same cases (if any) for the if-else-if statement
โ†ช๏ธ Check same cases (if any) for the loop statement

[Documentation] Add additional design documentation

Description

Add a separate section with documentation, which should contain:

  • Detailed design overview
  • Overview of the library structure
  • Structural overview of the basic components
  • Documentation on the use of existing types, traits, and functions - in the form of a general overview
  • Description of ways to interact with a data source in the form of AST
  • Description of interaction with Codogen - as the main consumer of the results of the semantic analyzer

Feature: Function parameters init as variables

Description

โฒ๏ธ Currently, function parameters are not interpreted as variables, and it means that Codegen should operate with that to the own choice.

mrLSD/toy-codegen explicitly shows that it is a bit complicated to operate with non-pre-initialized variables, and that this approach has hidden pitfalls for Codegen implementations.

🕺 Proposal

Add function parameters as variables to BlockState, and add instructions to allocate and bind them explicitly. As a solution: just use SemanticStackContext::LetBinding to populate the SemanticStack result.

Fix `loop-begin` label for the `loop-statement`

Description

➿ For the loop-statement, a JumpTo loop-begin instruction should always be present; it is the "loop" logic itself. Currently, it does not exist.

Expected behavior

➿ The loop-statement should "loop" itself, repeating all instructions in the loop body until a break or return instruction terminates the repetition.
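The expected lowering can be sketched with an illustrative instruction enum (not the crate's SemanticStackContext type):

```rust
// Expected instruction sequence for a loop statement: a `loop-begin`
// label, the body, then an unconditional JumpTo back to `loop-begin`;
// `break`/`return` inside the body are the only exits.
#[derive(Debug, PartialEq, Clone)]
enum Instr { Label(String), Body(&'static str), JumpTo(String) }

fn lower_loop(body: Vec<Instr>) -> Vec<Instr> {
    let mut out = vec![Instr::Label("loop-begin".into())];
    out.extend(body);
    out.push(Instr::JumpTo("loop-begin".into())); // the back-edge the issue reports as missing
    out
}

fn main() {
    let lowered = lower_loop(vec![Instr::Body("step")]);
    assert_eq!(lowered.last(), Some(&Instr::JumpTo("loop-begin".into())));
}
```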

[Feature]: Implement `Serde` serializer

Description

Implement Serde serializer, which can be enabled via feature flag.

💠 It's most useful for:

  • โžก๏ธ AST - AST can be presented with (se/dese)rialization.
  • โžก๏ธ State - whole state as serialization result
  • โžก๏ธ SematicStack - serialized instructions set

For semantic function body state return can call multiple times

Description

For the semantic analyzer's function_body function, it is possible to call return multiple times, in this case:

ast::BodyStatement::Expression(expression)  | ast::BodyStatement::Return(expression) => {...}

How to reproduce

  1. Set AST entity: ast::BodyStatement::Return
  2. Set AST entity: ast::BodyStatement::Expression

Expected behavior

🛑 An error message should be added to the error state

Actual behavior

No errors 🙆. Multiple returns are not checked.

[Feature] Add extensibility to Expression Value

Description

Add extensibility to Expression Value.

โญ Motivation

To expand the possibilities of operating on Expressions, a flexible representation of Expression Values is necessary.
To achieve this, a general approach must be implemented that accounts for describing the AST for the Expression Value and for generating the corresponding instructions.

โžก๏ธ To do this, the following must be implemented:

  • Generic trait binding for AST custom Expression Value
  • Generic instruction for custom Expression Value instruction generation for SemanticStack
