Giter Site home page Giter Site logo

ficus's People

Contributors

4ekmah avatar andrey-golubev avatar vpisarev avatar zihaomu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

ficus's Issues

Regexp in ficus

Regular Expressions in Ficus

right now the standard library, module Re, offers 2 functions:

  1. Re.compile(regexp_str: string) : Re.regex_t and
  2. Re.match_(regexp_obj: Re.regex_t, matched_str: string): bool

The implementation uses Rob Pike's VM approach (O(N) regexp matching algorithm, where N is the matched string length), described and implemented at https://swtch.com/~rsc/regexp/; in particular, it's a wrapper on top of re1 library.

The implementation is just at "technology preview" level of functionality and stability. Here is the list of things to make this module really useful:

    • change the implementation to directly process UTF32 strings instead of UTF8 strings. This is one of the first things to make, because it will let us to avoid conversion of matched strings to UTF8 (extra step that requires O(N) time and extra O(N) space) and will also let us to return the positions of matched sub-expressions as-is, without UTF8 to UTF32 conversion.
    • avoid the use of global variables, in particular, the pool of "fx_re_Sub" buffers (freesub). It should be possible to invoke the regexp engine simultaneously from multiple threads (using different compiled regexp objects or even the same regexp).
    • change the error handling. Now in the case of error exit() function is called, which is inappropriate. We should return negative error code to the user (i.e. throw exception using Ficus runtime facilities) with proper deallocation of temporary allocated buffers. Probably, it will require to make the step 4 first.
    • rewrite regexp parsing code without using yacc. The regexp syntax is quite simple and it can be parsed using a simple recursive approach. Currently, we use statically compiled into .c Yacc/BISON-based parser, and with any further extensions to regexp syntax (which are expected) we'd need to regenerate .c and update it (add fx_re prefixes etc.).
    • optimize match; skip the initial ".*?" processing (the first 2-3 instructions in regexp VM). Now we just run "find" and then check if sub[0] == 0 && sub[1] == str_length.
    • return sub-matches indices to the user: Re.match_(r: Re.regex_t, s: string): (bool, (int, int)[]). That is, in addition to the success flag, return positions of submatches as 1D array of 2-element tuples (start_j, end_j). (start_0, end_0) โ€” position of the whole regexp within the searched string. Remember that int in Ficus corresponds to int_ ~ ptrdiff_t in C.
    • add syntax r"regexp_str" to Ficus to automatically mirror back slashes.
    • (optional) add s ~ regexp syntax to Ficus (syntactic sugar): if input_filename ~ "{basename:.+}\.fx" { println(f"base name of the ficus source file is {basename}") }. That is, regexp matching, as in Perl, can be an operator with optional value capture. If matching is successful, we can get the named captured values and use them.
    • implement Re.find(regexp_obj: Re.regex_t, str: string): (bool, (int, int) []). The semantics matches Re.match_(), but the match can be a substring of str, instead of the whole str. Maybe there should be overloaded version that will take extra offset: int parameter โ€” from where to start the search.
    • implement Re.findall(regexp_obj: Re.regex_t, str: string): (int, int) [,]. Returns matrix. i-th row corresponds to i-th match and contains the position of the whole match and positions of sub-matches.
    • implement Re.replace(regexp_obj: Re.regexp_t, str: str: string, replace_with: string): string. replace_with string can use \<n> syntax, e.g. \0 (the whole matched string) \1 (the first matched sub-expression), \2(the second matched sub-expression) etc.
    • (optional) We can also extend Re.replace() functionality. \U1 will mean convert the first submatch to uppercase, \L2 will mean convert the second submatch to lowercase, \C0 will mean capitalize the whole submatch (convert the first letter of it to uppercase).
    • extend regexp syntax with the following constructions:
    • [[^]c1-c2c3-c4c5-c6] syntax for specifying positive and negative character ranges. There should be new instructions in VM to implement this. Maybe it should be a variable-length instruction FX_RE_OP_POSRANGE and FX_RE_OP_NEGRANGE. Single characters can be encoded (internally) by duplicating the beginning and end of the range, e.g. [+-0-9] can be encoded as [+-+\--\-0-9].
    • \b - word boundary. There should probably be a new VM instruction (like FX_RE_OP_WORDBOUND) to implement it.
    • support \t, \r, \n, \x.., \u...., \U........ syntax to specify various characters.
    • built-in class characters: \s (space), \d (decimal digit), \x (hexadecimal digit), \a (letter), \w (letter, decimal digit or underscore), and maybe some others. fx_isdigit() and other standard functions from ficus runtime can be used to implement this part, and probably a new VM instruction (e.g. FX_RE_OP_CHARCLASS <class_id>) should take care of it.
    • (partially emulated by passing multiline=true) Add support for:
    • ^ (beginning-of-line) anchor (FX_RE_OP_BEGIN_OF_LINE)
    • $ (end-of-line) anchor (FX_RE_END_OF_LINE).
    • (apparently, it's not needed, several alternatives, separated with '|', can be used to emulate the same behaviour) (optional). A very useful functionality for data parsing. Not sure about API, but it can be something like (not optimal) Re.matchsome(rl: Re.regex_t list, str: string[, offset: int]): (int, (int, int) []). That is, we check if str or str[offset:] matches with any of regexp from rl. If there are several matches, the longest is chosen. If there are several matches of the same length, the earliest expression in the list is chosen. The index of matched expression is returned. Such functionality can be useful to implement parsers (but using lex should be much more efficient in general for such tasks).
    • (optional). Add optional flags to Re.match*(), Re.find*() and Re.replace(). The most important flags are ignore-case and multi-line matches. Need to double-check if it's consistent with other regexp engines. In the case of multi-line matches . or \s can match \r or \n or \r\n, ^ can match beginning of each line, $ - the end of each line.

=======

IMPORTANT NOTE:

It's a lot of work to bring the current very early draft to a solid state, even though many of the items are relatively easy to implement. Probably, instead of trying to make re1-based engine more mature, it makes sense to write a ficus wrapper on top of re2, implemented in C++, where all of the above features (and probably more) are already implemented.

adding 'deci' type

this is not a high-priority item at all, but maybe it would be nice to add a true 'decimal' floating-point number type to ficus, based on https://github.com/vpisarev/DEC64/tree/alt.

The rationale:

  1. since it's purely software implementation that does not use floating-point operations internally, the operations on deci are guaranteed to give absolutely the same 'bit-exact' results on every platform, which can be useful for some accuracy-critical algorithms, reference implementations in tests etc.
  2. for quite a broad range of values scanning/printing operations are 'bit-exact', i.e. the textual decimal representation of floating-point numbers perfectly matches the internal representation, unlike with IEEE754 binary-2 numbers.
  3. the number of bits in deci's mantissa is greater than in double-precision IEEE754 format, hence the better accuracy can be achieved out-of-the-box. Besides, some extra primitives, such a super-accurate dot-product or Horner scheme (polynomial evaluation) can be added to the runtime to obtain even better accuracy (for IEEE754 formats such functions are possible and useful to implement as well though)

The details:

  1. the type may be called 'deci', short for 'decimal'.
  2. there should be 'd'/'D' suffix to specify literals of 'deci' type.
  3. basic and not so basic operations on deci() (+, -, *, /, fmod, pow, sqrt, sin, cos, exp, log, etc.) will be added to ficus runtime and the standard library. Note: the current implementation partly uses __int128 type, which is GCC/clang extension. Need to add a branch for MSVC (the necessary multiply/divide intrinsics exist in MSVC).
  4. operations on the type are converted to runtime calls at the C code generation phase.
  5. some compile-time optimizations (e.g. constant folding, a+0, a*1 etc.) maybe applied to 'deci' operations

ficus stdlib TODO

(in bold there are higher-priority items)

    • Basic ops on strings, lists, arrays, numbers, tuples etc. ...
    • Real complex type & ops (need support at compiler level, but mostly can be done in the standard library, just like "option")
    • Introspection, compiler as a library
    • JSON parser & emitter
    • XML, YAML parser & emitter
    • Async, parallel programming (Mutex class/module or equivalent functionality is necessary to implement first for map-reduce algorithms)
    • OS services (files, directories, processes, ...)
    • Time & Date (arithmetic operations on time and date, calendar etc.)
    • Unit test framework
    • Logging
    • Pretty printing
    • Regexp (RE2? or something like that)
    • Parsing module for context-free grammars or some more constrained grammars
    • Functional & imperative data structures (balanced trees, hash tables, ...)
    • Basic operations on graphs (?)
    • Matrix operations, including popular decompositions
    • Bindings to BLAS & LAPACK
    • Math functions
    • Random number generation
    • Solvers (Linear programming, convex programming etc.)
    • Geometry (rotations, projections, multiple-view geometry etc.)
    • BigInt a.k.a. long or bigint or decimal (bindings to GMP?) (libBF by Fabrice Bellard? or big.js?). See also #12
    • GPGPU OpenCL (bindings & runtime loader of OpenCL; memory management)
    • GPGPU CUDA (???)
    • OpenGL API (or maybe something more modern?)
    • Vector graphics (AGG, NanoVG, ...)
    • Bindings to OpenCV
    • Tiny image processing library (filters, resize & warping, color conversion etc.)
    • Tiny DSP library (DFT, DCT, FIR/IIR filter, windows, ...)
    • (work in progress) Deep learning inference library (importers + execution)
    • Machine learning library (SVM, kNN, ...)
    • Image I/O
    • Bindings to zlib, ... (lossless data compression)
    • Bindings to sqlite, convenient SQL queries
    • Video encoding/decoding
    • Audio I/O (including mp3 decoder & microphone & playback)
    • Camera & other sensor access
    • Small network library (sockets, file downloading from url, ...)
    • Simple UI framework (?, LCUI? GLFW? tiny ui)

a few issues/annoyances

... found during conversion of the compiler to ficus:

  • <some variant constructor>(_) pattern should work for any number of formal parameters.
  • when match case (or maybe any block) finishes with declaration or consists of just one declaration, there is very strange diagnostic
  • also there is strange diagnostic when '::' is used with lists of unproper type
  • ocaml2fx tool should handle if's without then-branches or without else-branches in more intelligent way to reduce further refactoring
  • walk_kexp, KMkArray() case (with nested list comprehensions initially) required some workaround to compile (wrong C code is generated)
  • comparison with empty list should be fast
  • comparison of variant value/variable with parameter-less variant case should be fast (just tag comparison)
  • when inside a recursive function that uses free variables (i.e. needs a closure), e.g. typ2ktyp in KNormalize.fx we pass this function itself into some generic function, e.g. List.map, we cannot pass its standard function structure (FP), because it's created after the function declaration. 2 solutions: 1) create another function structure with the input fx_fv (do not forget to increment the reference counter before the call and decrement it after), 2) create the standard closure earlier (may not always work)
  • KForm.init_all() was found, but it belongs to Ast module! After type checking the module environment needs to be cleaned.
  • would be nice to extend 'fold' syntax to set initial values separately, not via tuple packing/unpacking
  • sometimes variable tuple arguments (...) do not work properly, see for example the commented off tuple hash() function in Builtins.fx
  • the generated .c code shows some instabilities after even very minor changes. For example, for some modules of the compiler a slightly different .c code is generated even though they are not edited themselves. So far the only observed instabilities are about the ordering of locally defined variables, which is a not big problem and we cannot say that compiler shows non-deterministic behavior, but it's still an issue
  • formatting of the output .c code is generally fine, though a bit ugly in some places. Need to polish it.

Project status

What is the project status?

Do you want contributions? What kind of contributions do you accept?
Should one file issues for the project?

Should #5 and #16 be treated as roadmaps?

Thanks!

ficus 1.0 TODO

NOTE: the previous big ticket was renamed and closed.

Now we have some more things to be done to finally release Ficus 1.0 (or at least Ficus 1.0 beta):

    • proper stack trace reporting in the case of exceptions, asserts etc. The stack trace should report the original filenames & lines in the original ficus code, not in the generated .c code
    • option to generate shared lib (or static lib), not just executable (the only available option for now). In the case of shared lib probably it should be option to compile ficus runtime as a separate shared library, in order to use several ficus-generated modules together in the same final app.
    • as a part of the previous item, provide some method to export clean C and/or C++ API instead of ugly-looking mangled names.
    • (lower priority, but still important) option to generate binary python module
    • (lower priority) dynamic loading of ficus-generated shared libraries
    • automatically define platform-specific symbols that can be checked with preprocessor, like OS, CPU_ARCH etc.
    • proper handling of complex modules. To avoid ambiguity, probably there should be __init__.fx in each ficus module directory (including subdirectories)
    • add pragma pragma "framework:<framework>" to link Apple frameworks easily without using -clibs flag.
    • add pragma pragma "pkg:<pkg>" to link packages via pkg-config more easily without using -cflags and -clibs
    • partial implementation of Language Service Protocol: syntax highlighting, including inline C/C++ code, folding, jump to symbol definition, intellisense.
    • (quite difficult to do) some debugger support, probably we should try to keep the original local variables' names as much as possible and provide helper scripts for GDB to visualize arrays, lists, variants etc.
    • better support for records and variants with cases with record attributes; maybe if we have a variant case SomeCase: {...}, there should be automatically introduced SomeCase.t as type for the attribute record. Or some other way to pack and move data for the particular case.
    • prepare proper string interpolation and the corresponding string() and print() with complete support for Python 3 format.
    • automatic printing of variants
    • fix bug when some type declarations are thrown away.
    • better support for exceptions at C/C++ level, especially exceptions with some attributes, at least simple attributes (integers, strings).
    • check once again if return is properly implemented, maybe not
    • (much) more unit tests
    • some threading without openmp? we will definitely need parallel for & @sync. What about tasks and asynchronous execution?
    • round-up the standard library #5
    • full-scale "tensor" data type for nd dense arrays, which type is unknown at compile time. That will help to wrap OpenCV better, implement something like numpy and also improve implementation of deep learning inference engine. Also, tensor could be stored on an acceleration device (GPU, NPU), which is a critical feature.
    • implement various scalar intrinsics (partially done): round(), floor(), sin(), cos(), popcount() ...
    • maybe not for 1.0, but it would be nice to provide some introspection and meta-compiler features usign something like Python decorators. Probably for that we would need a full-scale JIT

ficus 1.0 alpha TODO

This is the list of features to be implemented before ficus 1.0. Some of the features are optional and marked as such.

    • for a@index <- A, i.e. the notation to get indices of the array element elements, not just their values. index is a tuple in the case of multi-dimensional arrays.
    • (...) notation in generic function argument types to denote that the particular argument should be a tuple, but its size can be an arbitrary (for each specific size and each specific set of tuple element types a separate instance of generic function is created). There is a similar notation for records: {...}
    • 'typ [+] notation to denote array arguments of varying dimensionality. For each specific dimensionality a separate instance is created.
    • [\A; \B], i.. use of \array notation for compound array initialization. [\A1; \A2; ... ; \An] denotes "vertical" concatenation of arrays, where rows of \Ai are put next to each other. [\A1, \A2, ..., \An] denotes horizontal concatenation, where columns of \Ai are put next to each other. There can also be a mixed combination, and also it's possible to put scalar elements.
    • support for inline functions; automatic inline expansion of small functions
    • elimination of index range checks when arrays are accessed sequentially and unconditionally inside the loops.
    • optimized concatenation of multiple strings. Instead of iterative construction of "a" + "b" + "c" + "d" + ... as (("a" + "b") + "c") + "d" ..., which results in multiple memory allocations and super-linear complexity, the compiler computes the final concatenation with a single call of intrinsic function.
    • when keyword in the patten matching to add custom predicates.
    • possibility to process multiple cases in match expression with a single action, e.g. match s { | "a" | "b" => foo() | _ => bar() }. Currently, when using multiple cases, it's not possible to capture values, but this is a minor inconvenience.
    • object types that will let users write lst.rev() instead of List.rev(lst) etc. This is very light-weight method to introduce object-oriented notation, even for very basic, intrinsic types.
    • str[.-n] syntax, i.e. unary minus with preceding dot. It helps to conveniently access the last string character, the last element of array etc.
    • [very desirable, but optional] fun foo (a: t1, b: t2, ~param3:t3=defval3) notation for named/optional parameters.
    • [unzip for ...] notation to produce a tuple of arrays as the result of comprehension instead of a single array of tuples.
    • [optional] automatic detection of "nothrow" functions. If such functions are called directly, we can eliminate the check of return value
    • automatic detection of stack overflow. Ficus exception is thrown in such a case.
    • detection of division by zero. Ficus exception is thrown in such a case.
    • (mostly done except for variants) automatically generate print() and string() for any type
    • automatically generate comparison function for any type.
    • OOP (object types (see above) + interfaces)
    • (postponed till after 1.0; hand-written ficus lexer has been implemented) port the lexer generator to ficus (fxlex)
    • (postponed till after 1.0; hand-written ficus parser has been implemented); port the parser generator to ficus (fxyacc)
    • port ficus compiler to ficus
    • implement nested modules
    • (postponed till after 1.0) [optional] module interfaces (separate files with .fxi extension, similar to .mli files in Ocaml)
    • separated compilation when each ficus module is compiled into a dedicated .c file
    • (there is never enough tests, but at least we have some) add extensive set of unit tests
    • write ficus tutorial
    • parallel keyword before for loops (the current implementation uses OpenMP)
    • [optional] implement full-scale Python3-like format f"{:}"
    • [optional] implement image/matrix/tensor border extrapolation using img.clip[y,x], img.zero[y,x] or similar notation.
    • implement loop fusion
    • add immutable RRB vector data type and the associated operations including concatenation, comprehension etc.
    • implement a small but versatile standard library (see #5) (for 1.0 release only a small part of the planned library will be implemented)

==== some extra items ====

    • add optional finally clause into try-catch to close files & sockets, unlock mutexes etc. safely. Shall we also add Python-like with expression?
    • check for invalid type casts, implement more type cast modes (literal to tuple, tuple to tuple, tuple to scalar type (in the latter case it means that each tuple element is casted to this scalar type and this rule is applied recursively)
    • [optional] optimize tuple packing/unpacking
    • [optional] add back quote syntax `expr` that expands to (expr, expr-captured-as-string, modulename, line). The feature will help to make assert's much more informative.
    • more intelligent type inference in the case of overloaded/template functions and type constructors (in particular, Math.... functions, None constructor etc.)
    • insert more try-catch blocks into the compiler to catch compile errors
    • (maybe later) [optional] add some syntax sugar for containers. Convert for (k, d) <- mymap {...} into Map.app(), fold ... for x <- myset {...} into Set.foldl(...) etc. In general, if the object we iterate over is a module type, we will convert the loop body into a lambda function and depending on whether it's a list comprehension, folding or a simple loop, we convert it to map, foldl or app.
    • [optional] add syntax to embed data files into source code, val myfont = @data "myfont.ttf"
    • more smooth support for the records and record patterns in functions.
    • (optional) not only literals, but also id's as default initializers in the record fields or default function parameters, e.g. None etc.
    • (optional) string() and print() for function types. Maybe it should also have the __name__ attribute.
    • report warnings about unused variables. preferably also report about unused or unmatched variant tags in match {} expressions.
    • it looks like immutable RRB vector (see item 32, now postponed) can be super-efficient in ficus' basic operations - comprehensions and iterations. And it will do many other things in O(1) or O(logN) time. Therefore, it should become the main collection type. Therefore, we need to change array comprehensions and literals from [ ... ] to [| ... |], and leave [...] to vectors. cons-based lists will stay [: ... :].
    • maybe refactor compiler to speedup incremental compilation mode when just a few lines in one of the module are modified. This should include caching of k-forms, a bit different handling of template functions/types, and completely separable compilation and maybe even some changes in the fundamental id_t type. The goal is to avoid influence of one module's k-form to another, i.e. if the module itself or the modules it depends on are not edited, it's initial k-form should be exactly the same (and thus we can eliminate the stage of k-form optimization, c code generation and c code compilation alltogether, not just the latter one).
    • implement hash table, use it whenever possible instead of Map or Set, e.g. inside K_remove_unused.
    • maybe add non-type template arguments to make Map and Set more efficient.
    • add "-rebuild" option to rebuild an app from scratch.
    • refactor backtrace dump. Use the system call backtrace() to retrieve backtrace (just the stack). Do not use "update backtrace" inside FX_CALL() macro
    • test everything on Linux
    • need to extend pattern matching to allow nested alternating patterns, e.g. match tokens { | (PLUS | MINUS) as op :: rest => ... }. This will also solves the problem with when, which is now not clear where to attach in the case of several alternatives.
    • stylistic change. need to normalize naming where possible. For example, function formal parameters should be called "parameters", not "arguments". Whereas the actual arguments passed to function at runtime should be called "arguments". Now arguments are used almost everywhere.
    • (optimization idea: just put it here for now not to forget). When internal functions access non-local values from outer scope, and those internal functions are not passed around, just called, they can be augmented to take all those extra values as arguments. Probably, the best place to do it is inside k_lift, because mutable values from outer scope still need to be converted to references before being passed as arguments.
    • (update: mimalloc was replaced with a tiny yet mighty rpmalloc, which is now included into ficus runtime) download and build (or include inside or make a git submodule) mimalloc as a dependency. mimalloc significantly boosts performance of the compiler and, in general, ficus-generated code. So, it would be nice to have it enabled by default.
    • (maybe later) maybe build everything (mimalloc, other useful 3rd-party libraries, runtime, reference ficus compiler, ficus compiler, tests, examples etc.) using cmake
    • (openmp runtime for mac is included, on linux it's always available, on windows need to test, but it should work fine there too) implement @parallel for based on the task framework, so that it's always available, not only when the produced binaries are linked with OpenMP runtime. Getting rid of OpenMP dependency will also make ficus binaries more compact. Note that OpenMP-based parallel for backend is also possible.
    • (ficus compiler is built from scratch in 21 seconds with -O1 flag on M1 macbook pro. On Linux it takes <15 seconds. That is, it's fast enough) try to make ficus compiler as parallel as possible. At minimum make the last phase parallel - generation of the C code, emitting C code, invoking C compiler on the updated files. At maximum, K-form optimization should also become mostly parallel with some synchronization points.
    • test everything on Windows
    • @sync-blocks for map-reduce style algorithms.
    • mutable record fields
    • Some form of conditional compilation, e.g. -pp flag that will run .fx code through cpp preprocessor. In addition, there should also -Dparam[=value] command-line option to define certain macros that are passed to the preprocessor

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.