riot-ml / riot Goto Github PK

View Code? Open in Web Editor NEW

435.0 8.0 30.0 610 KB

An actor-model multi-core scheduler for OCaml 5 🐫

Home Page: https://riot.ml

License: Other

OCaml 93.50% Erlang 0.24% Standard ML 0.11% C 2.56% Perl 3.59%

actor-model elixir erlang fault-tolerance multicore multicore-ocaml ocaml otp

riot's Introduction

An actor-model multi-core scheduler for OCaml 5.

Quick Start | Tutorial | Reference

Riot is an actor-model multi-core scheduler for OCaml 5. It brings Erlang-style concurrency to the language, where lightweight processes communicate via message-passing.

open Riot

type Message.t += Hello_world

let () =
  Riot.run @@ fun () ->
  let pid =
    spawn (fun () ->
        match receive () with
        | Hello_world ->
            Logger.info (fun f -> f "hello world from %a!" Pid.pp (self ()));
            shutdown ())
  in
  send pid Hello_world

At its core Riot aims to offer:

Automatic multi-core scheduling – when you spawn a new Riot process, it will automatically get allocated on a random scheduler.
Lightweight processes – spawn 10 or 10,000 processes as you see fit.
Fast, type-safe message passing
Selective receive expressions – when receiving messages, you can skim through a process mailbox to consume them in arbitrary order.
Process links and monitors to keep track of the lifecycle of processes

Riot also includes:

Supervisors to build process hierarchies
Logging and Telemetry designed to be multicore friendly
an Application interface to orchestrate startup/shutdown of systems
Generic Servers for designing encapsulated services like with Elixir's GenServer

Non-goals

At the same time, there's a few things that Riot is not, and does not aim to be.

Primarily, Riot is not a full port of the Erlang VM and it won't support several of its use-cases, like:

supporting Erlang or Elixir bytecode
hot-code reloading in live applications
function-call level tracing in live applications
ad-hoc distribution

Quick Start

opam install riot

After that, you can use any of the examples as a base for your app, and run them:

dune exec ./my_app.exe

Acknowledgments

Riot is the continuation of the work I started with Caramel, an Erlang-backend for the OCaml compiler.

It was heavily inspired by eio by the OCaml Multicore team and miou by Calascibetta Romain and the Robur team, as I learned more about Algebraic Effects. In particular the Proc_state is based on the State module in Miou.

And a thousand thanks to Calascibetta Romain and Antonio Monteiro for the discussions and feedback.

riot's People

Contributors

Stargazers

Watchers

riot's Issues

Implement timers

Most importantly: interval and send_after

Preemptive Scheduling via a PPX

On stream we talked about preemptive scheduling and how Riot could support this, and one idea was to use a preprocessor to inject reduction count bumps on any code, to make it easier to make existing code play nicer with the cooperative scheduling that Riot supports via algebraic effects.

Rougly speaking, this code would get injected calls to a new function Riot.reduction_count_bump () like this:

let () = 
  while true do
    print_endline "Hello, World!"
  done

so that it becomes:

let () = 
  while true do <- 1000, yield, 1000, yield
    Riot.reduction_count_bump ();
    print_endline "Hello, World!"
  done

And this new function could be implemented internally like shown below, so that every N reductions (or iterations or whatever we call it) it will emit a yield (), essentially giving back control to the scheduler.

 module Riot = struct
    let reduction_bump () = 
      let proc = _get_proc () in
      proc.reductions <- proc.reductions + 1;
      if proc.reductions > reduction_limit then yield () else ()
  end

Thanks @linduxed for bringing this up! 🫰

Implement Process Stealing

As of right now, when processes are scheduled they are put on a random scheduler. This is okay because in the long run we trust the random numbers to evenly distribute across the cores.

However while running a program, most of the time there'll be an unbalance of processes per scheduler. This means that there are processes that could be running that aren't running. And in the cases of apps like Scarab and Minttea sometimes a process is so intense that it takes too much time off a scheduler.

We'd like to fix this by allowing schedulers that have no work to steal work from the schedulers that are too busy, and that way seamlessly move the load across cores.

Implementation Notes

In short, we should make a few small changes in the scheduler.ml (module Scheduler):

when a scheduler runs out of processes to run, it will call a function to steal a process from the other schedulers, like steal_processes pool sch max
this function iterates over the pool.schedulers and try to grab the next process process queue: (Proc_queue.next sch.proc_queue) and move it into its own queue
this function should also lock the scheduler into a "stealing" mode so that other schedulers don't steal from it while it is stealing

This is a rather simple strategy but should get us started, and we can refine and add more heuristics as we go.

Testing Ideas

Most practical think I can think of here is to make a test that runs a few processes and uses the Riot.processes () function and examines the Process.sid, which we'd have to update when we migrate a process from one scheduler to the next.

Handle graceful shutdowns

Some of the different ways to shutdown the application and processes.

*.stop
on sigterm
call and wait for terminate callbacks (if trapping exits?)
...

Build failure on gluon's kqueue stub with Big Sur (Apple SDK 11)

Problem

It seems like we're implicitly casting struct kevent * to const char * and that's a problem in older versions of the Apple SDK.

riot/gluon/sys/unix/gluon_unix_kqueue.c

Line 73 in fb6910b

event_array = caml_alloc_array(kqueue_event_to_record, event_ptrs);

Environment Info

Note

I'm not on Big Sur, I'm using Nix to build Riot which provides it's own Apple SDK, and it lacks Apple SDKs above 11.0 (can share flake if needed).
& Big Sur is only ~4 years old so I believe compatibility should be ensured.

SDK: 11,
Darwin: xnu-7195 sys/event.h

Make multiple calls to Riot.run throw

Since it's illegal to call Riot.run or Riot.start more than once, we should make it throw if it happens.

Implementation Notes

One easy way to achieve this would be to have a global boolean ref that marks if the runtime has been started once, and we can check its false when we call Riot.run.

If its false, we flip it to true.

If it already is true, we can throw an exception.

Add an example for Refs

At the moment refs are used in a few places like Mint Tea but there's no good explanation of what they are.

We should add a little example and document them, and maybe consider renaming them to something like Rune or some word that isn't taken in OCaml (like "ref") is, so we can introduce it as new concept.

Create scheduler friendly FS module

Would be useful to have a little File module to read/write files in a scheduler-friendly way.

In short, we would need a module like Net.Socket with functions like:

File.open file that returns a Fd.t
File.read fil that return an (Bigstringaf.t, _) result
File.write file ~data

Internally these functions can call the I/O module to do the actual reading/writing, they just have to beware that if the I/O module says `Retry then you have to call syscall "debugging-name" fd @@ fun fd -> (* do work here *) like this:

let rec read fd ~buf =
  match IO.read fd buf with
  | `Retry -> syscall "file.read" `r fd @@ read ~buf
  (* ... *)

This way we would tell the scheduler we're waiting, and some other process can run.

Implement process hibernation

Currently the scheduler hot loop is burning CPU cycles even while many processes are just idle-looping. For example examples/http_server/main.ml includes a tiny function that is just meant to keep the process alive without doing any work. However, this isn't free, because we have to context switch into and out of that process constantly.

let rec loop () =
  yield ();
  loop ()
in
loop()

Ideally we'd be able to hibernate a process, so that it won't be marked as dead, but also won't be picked up and scheduled for work until a new signal has been delivered to it. A signal here would be either an exit signal, or a message signal, etc.

This would mean that possibly, eventually, all processes would be hibernating until some work needs to happen. This frees up the scheduler to do other things, or even better: to do nothing.

A Condition variable could be used to signal when a process should be woken up from hibernation, and then kickstart the scheduler loop again.

That way we don't have to burn through my laptop batter in 2 hours.

Make scheduler IO aware

Current example HTTP server builds a Socket library that uses non-blocking UNIX calls to make the process become a "poller". However, this has 2 big issues:

if not careful, the process can block a scheduler completely – say if we didn't mark the socket as non-blocking, or if hard looped without yielding
we can't poll on multiple fds at the same time, with something like Unix.select

This may also be a reason why #6 happens, as eventually most schedulers just stop running.

To address this, and to make the runtime more efficient, I think it makes sense to make the scheduler IO aware:

add a table of (fd, pid) pairs to keep track of which processes are currently waiting on what descriptors
add functions like Riot.Unix.accept that register the (fd,pid)
make the scheduler loop do an immediate-timeout select call on all registered pids at the end of every loop
ready fds will be used to find their associated pids and send out messages like Unix.Message.Socket_accept (fd, conn_addr)

This means we can

listen on all open sockets in a scheduler simultaneously
hibernate/await socket accept connections instead of idle-looping on waiting processes
free up scheduler time for other processes
implement more time-slice friendly socket reads/writes

Process-friendly sync primitives

At the moment we use Atomic, Mutex, and Weak throughout the Riot runtime to do synchronization of work across the schedulers.

This style of programming doesn't play very well with Riot's process model, but it can be more straightforward to code in than synchronizing things via message-passing directly.

For example, in building a Hashmap that needs to lock, but should only really lock the current process, we can't use Mutex.protect – this would potentially lock the entire scheduler.

Shadowing Mutex with a version that is process-based would be ideal, to allow for a similar programming model and at the same time making it harder to accidentally lock an entire scheduler.

Implementation Notes

I'd probably start with making a Mutex be a small process, so creating a Mutex spawn_links it. Operations on top then are implemented as messages to/from that process. This ensures that a call to Mutex.lock actually becomes an interrupt point for the process.

Another good option here is a 'value Rw_lock.t type for making it possible to have multiple readers or a single writer to a value.

Support `::string` type in bytestring construction patterns

It'd be much easier to construct bytestring from existing strings if we had a ::string type, so we can do things like:

let name = third_party_foo_that_returns_string () in
{%b| name::string |}

Implementation Notes

This involves work in the bytestring package, specifically in the bytepattern.ml module. We can start by:

extending the Parser.parse_size to support string
updating all the error messages accordingly
creating new tests at the parser and higher levels

I think that we can reuse the add_literal_string function in the transient builder.

Consider using sync-logger on single-threaded workloads

When using the Riot.Logger on a runtime configured to use no additional workers (so only the main-thread runs), we run into the issue that calling log functions sometimes does not log anything because the Logger.Formatter process isn't getting its share of CPU time.

In those cases, it would be good to switch to the sync-logger instead, so at least some logs are provided. Maybe this would be good under a feature flag.

let () = Riot.run ~use_sync_logger:true @ main

For ex. internally when calling Logger.start we'd just check some global setting and swap in the Logger.Formatter.write calls to use the internal Logs module.

Process-stealing dead lock

When running on a large number of cores, the current process stealing starts dead-locking schedulers and shows a few other bugs:

a process gets queued up in several schedulers, which is likely a bug in the Proc_queue or Proc_set, and once its terminated in one scheduler, the next scheduler that tries to run it will fail because finalized processes should never be put on a queue.
when moving timers around sometimes a timer will get triggered on a scheduler before its moved out of it – moving timers to the IO scheduler helps, and can improve the reliability of the timers since the polling workload has a strict deadline, but also means reworking the timeouts for receives and syscalls.

I've been unable to fix with additional safeguards (like more restrictive locking of the process queue), but I have identified that the Proc_set is not working as intended (likely due to the use of Atomics instead of a lock).

In the meantime main has disabled process-stealing until we figure out next steps here.

This is a good time to step back and maybe rewrite the scheduler into more module pieces that can be easier to reason about and test.

Support for UDP / datagram sockets

At the moment only TCP sockets are nicely supported, but I've had folks reach out with UDP use-cases so it'd be a good idea to at least have basic support for it.

Implementation Notes

So we'd need to extend the riot/runtime/net/addr.ml to make a distinction between unix / datagram / stream sockets with something like:

type datagram_addr = [ `Udp of ip_addr * int | `Unix of string ]
type stream_addr = [ `Tcp of ip_addr * int | `Unix of string ]

And then amend all the functions and interfaces, including riot/lib/net.ml which has a user-facing Addr module with functions like get_info.

Testing Ideas

I'm not very well versed in UDP tbh, so I'm very open to ideas on how to test the correctness and robustness of this. One thought would be to see if we can implement a low-level protocol like Wake-On-LAN as a test, if its not too much work.

Implement Process Priority

At the moment, processes are spawned on a random scheduler and are scheduled on a first-come first-serve basis.

In short, if a process schedules itself often, it'll run more often.

This also means that we don't have a lot of tools to let certain processes run before others.

For example, we'd like the Logger processes (specifically the underlying Formatter process) to have a higher priority than most things during development, to facilitate logging as close to when something happened.

Implementation notes
I think we can implement this in 4 small changes:

make room for a priority field in the Process.t record, and have that priority be a variant of High | Normal | Low
introduce a new process_flag for changing the priority level
extend the process queue (see proc_queue.ml) to support multiple sub-queues per priority, and stuff the logic to deal with priority queuing in there and keep the rest of the scheduler oblivious to it.

every time we call Proc_queue.next i'd expect the queue to do:
1. exhaust the High priority queue
2. if the High prio is empty, use the Normal prio queue
3. if the Normal prio is empty, use the Low prio queue

Testing ideas

A good test for this would be to mark the Logger processes with process_flag(Priority(High)) and see the behavior of the tests logs, especially when removing the small waits we've added to compensate for the runtime shutting down too fast.

Add support for _ in bytepattern captures

Currently the bytepattern syntax supports underscores in starting position, but we'd like to support it anywhere within an identifier.

For example, this is supported today:

match%b data with
| {| _hello |} -> ...

but this is not:

match%b data with
| {| hello_world |} -> ...

Implementation notes

We'd start by adding a few tests in the bytepattern_test.ml module, specifically we add a small test to every test group (lexer, parser, construction lower, construction, matching lower, matching).

Then we need to add support for this in the Lexer by extending the identifier definition (search for let ident inside of the Bytepattern.Lexer) and go correcting the tests.

Error in links and monitors example

There is a small issue with examples/5-links-and-monitors/main.ml where the monitor function is not calling Self (). I will open a quick PR to fix this.

Process dictionary

Having something like Erlang's process dictionary would be useful. Maybe with API like Eio.Fiber.with_binding or Lwt.with_value.

Rewrite Hashmap to play nicely with the scheduler

At the moment we are exposing our internal "Dashmap" as a Hashmap that should be prefered in userland Riot programs for concurrent access.

Sadly, this Hashmap is far from ideal as it is lock-based. We should design a version of this interface that plays along nicely with concurrent and parallel access, and that doesn't rely on locks directly, but rather uses Riot-friendly mutexes.

It'll be useful to have some of #45 in place to do this.

Implementation Notes

Something I had in mind was to mimic what Dashmap from Rust does, where the Hashmap has individual read-write locks for every single entry, which allows concurrent and parallel read/write access to separate entries.

This could be implemented on top of the Hashtbl.t (like the current dashmap.ml does), and just try to get a lock on a value at the very moment the value will be used.

Please fill in LICENSE in the `opam` file

https://github.com/leostera/riot/blob/69540a1a84cab2d467634999053c6772d3cef666/riot.opam#L7

Please do fill in LICENSE in the opam file otherwise opam says

[WARNING] Failed checks on riot package definition from source [...]
warning 62: License doesn't adhere to the SPDX standard, see https://spdx.org/licenses/: "LICENSE"

BTW I do notice a https://github.com/leostera/riot/blob/main/LICENSE.md file in the repository. So this issue is just to appease opam.

Relocate timers during process stealing phase

Since #40 we introduced a bug with timers, where a timer isn't relocated alongside its process during the work-stealing phase, so the process ends up hanging when waiting on a timeout for syscalls or receives.

We need to either move the timers with the processes or make the timers global, whatever's easiest right now.

core sublibrary clashes with core alternate standard library

Hey, I'm running into problems using this in a project that also depends on https://github.com/janestreet/core (inconsistent assumptions over interface for Core__Ref). Changing the riot files to reference riot.core doesn't fix this.

Could this be renamed to riot_core or something? That's how I fixed it locally. Possibly this is a bug in dune (I vaguely thought this was supposed to work), but even if it is I guess the fix for that will take a while, so it would be nice to fix it in riot.

Make Riot.run return a `(int, [> `Msg of string ]) result`

To make it easier to write quick programs and use let* bindings for results, it'd be best if Riot.run takes a function that returns a result of an int (status code), or a polyvariant for an error.

This would allow us to run programs like:

let () = Riot.run @@ fun () -> Ok 0

which means we can do programs like the following much more easily:

let () = Riot.run @@ fun () ->
  let* conn = Blink.connect "URL" in
  (* ... *)
  Ok 0

Today we have to write this program by extracting the function that requires a Result, and then forcing the value out with a Result.iter, Result.get_ok, etc.

let main () = 
  let* conn = Blink.connect "URL" in
  (* ... *)
  Ok ()

let () = Riot.run @@ fun () -> main () |> Result.get_ok

Implementation notes

Start by modifying the type signature of run in riot.mli and let the compiler show you where the function lives.

We do not want to change the behavior of the underlying spawn function, but rather wrap the parameter to run (which is called main) so that we unwrap the result and handle it appropriately there.

So in the riot.ml file you'll find a line like:

(* ... *)

let _pid = _spawn ~pool ~scheduler:sch0 main in
Scheduler.run pool sch0 ();

(* ... *)

And this variable main is the function passed as input. We want this one to be wrapped in a function that calls main and handles the result.

To shutdown the Riot runtime you can call the shutdown function defined in the same file.

[Feature] Dynamic supervisor and Registry

Hello.

I think Riot has a lot of potential.
GenServer in OCaml is a good mix.

While Riot is good as-is, dynamic supervisors and registry can be good additions.
But I guess, the typed registry can be difficult as we see Gleam (typed Erlang) sees same problem.

Thank you.

Scheduling on Debian seems not work correctly

I am testing Riot on Debian and I see some issues with the scheduling. When there is 0 job running do the CPU increase to 100% cpu usage but when there is tasks being spammed(many requests at the same time creating tasks) scheduled do it execute the jobs extremly late and the cpu usage drops to 0.1% of usage indicating that there is almost 0 jobs running.
I am running commit d12609a and did rebuild riot using this commit.

I have also sent @leostera a link with a video when i'm testing which shows the behavior.

Make timers cancellable

Since we can now start timers, its a good idea to be able to stop them.

The API we want looks something like this:

let Ok timer = Timer.send_after this msg ~after:0.1 in
Timer.cancel timer;

This can be done by:

extending riot_api.ml/mli to include the new cancel function
adding a remove_timer function to the Timer_wheel module
adding a test where we assert that the timer does send a message, but only one message (using a long-ish ~interval value followed by a receive should get us going here).

Add missing GenServer callbacks

handle_cast handles async messages to the genserver
handle_info handles messages from other processes, self, ...
handle_continue called immediately after previous callback, before handling any more messages

Make match%b expressions tail-recursive

currently when we write a recursive function that uses a match%b expression to pattern-match on bytestrings, we end up creating a stack, instead of tail-recursing.

My gut tells me this is the try-catches we have, and we may want to move to a result-based error-recovery pattern.

For example, i'd expect this function to be tail recursive, but it isn't:

let rec split ?(left = {%b||}) str =
  match%b str with
  | {| "\r\n"::bytes, rest::bytes |} -> [ left; rest ]
  | {| c::utf8, rest::bytes |} -> split ~left:(Bytestring.join c left) rest

riot-ml / riot Goto Github PK

riot's Introduction

Non-goals

Quick Start

Acknowledgments

riot's People

Contributors

Stargazers

Watchers

Forkers

riot's Issues

Problem

Environment Info

Implementation Notes

Implementation Notes

Implementation Notes

Implementation notes

Implementation Notes

Implementation notes

Recommend Projects

Recommend Topics

Recommend Org