wasmcloud / wasmcloud-otp
wasmCloud host runtime that leverages Elixir/OTP and Rust to provide simple, secure, distributed application development using the actor model
License: Apache License 2.0
Related to #6
We publish CloudEvents for events like `provider_started`, `actor_started`, `provider_stopped`, and more after the above issue. We should have a module that can be instantiated and monitor for these types of events, similar to what was implemented in wasmCloud/wasmCloud#183.
The goal of such a module would be to provide a way to monitor events that occur in a wasmcloud host / lattice, which could either be used to replay events leading up to an issue or assert that specific events had occurred. For example, when we write tests that start a provider, we can ensure that the provider has had enough time to spawn as a process and is ready for invocations.
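The assertion side of such a module could look like the following minimal sketch (Python, purely illustrative; the real host is Elixir/OTP, and names like `EventMonitor`, `record`, and `wait_for` are hypothetical, not part of any actual API):

```python
import threading

class EventMonitor:
    """Records lattice CloudEvents so tests can replay them or assert
    that specific events occurred. (Sketch; real events arrive over NATS.)"""

    def __init__(self):
        self._events = []
        self._cond = threading.Condition()

    def record(self, event_type, payload):
        # Called by the event subscriber for each published CloudEvent.
        with self._cond:
            self._events.append({"type": event_type, "data": payload})
            self._cond.notify_all()

    def wait_for(self, event_type, timeout=5.0):
        # Block until an event of the given type has been seen, or time out.
        with self._cond:
            return self._cond.wait_for(
                lambda: any(e["type"] == event_type for e in self._events),
                timeout=timeout,
            )

    def replay(self):
        # Return a copy of everything seen so far, in order.
        return list(self._events)
```

A test that starts a provider could then call `monitor.wait_for("provider_started")` before sending invocations, instead of sleeping for an arbitrary interval.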
Allow someone to upload or paste a manifest file into a web UI; the web UI should then perform all of the loads and configurations as defined in the file.
Complete all of the event publications on the lattice control interface event topic (currently only 4 events are published). This also includes handling those events to monitor and maintain the shared "lattice state"
Support the ability to specify custom host labels from the environment as well as the default/intrinsic labels like OS family, CPU, etc.
For managed (started as executable files, which could've been downloaded from OCI) providers, the host should mint user credentials for the lattice (if an account seed is available to the host). This is a follow-up to the issue where providers need to get the host data via environment variable #36
A logging provider needs to be implemented/ported for OTP. Rather than make it talk over NATS, which would add congestion to the messaging network, it can be embedded in the host process, and communicate via a separate TCP or UDP channel to a logging collector or proxy.
There's an Erlang/Elixir OpenTelemetry API that we could use to connect over UDP or TCP to an OTEL receiver. This would add a couple of configuration variables to the host for connecting to the receiver. I don't think it's necessary to make those variables part of the actor link binding - a single OTEL receiver configuration per runtime host should be sufficient for now.
At the same time as integrating OTEL, the logging capability api should be reviewed to see if we want to expose more OTEL features into that api (at a minimum, a span id).
Additional nice-to-have feature: maintain a per-actor logging level threshold (trace...critical). If this could be changed at runtime via washboard, then you could enable debug or trace logging for one actor that you're debugging, and still leave everything else at the system default logging level. This would improve the signal/noise ratio for a better dev experience.
Similar to the live update functionality in the current `0.18.0` runtime, we should support live updates for actors both with a direct invocation and via the lattice control interface.
A live update in this scenario is an update from one actor to a newer revision of the same actor, as indicated by the revision information on the actor's embedded JWT. We should enforce that the new actor has a revision that is greater than the previous actor's revision. We should also include an option for the Host itself to enforce strict claims checking on live updates, which simply means that actors cannot live update if the newer actor has a superset of the claims that are already included on the previous actor.
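The revision and strict-claims rules above could be sketched as follows (Python, illustrative only; the function name and parameters are hypothetical and do not reflect the actual host API):

```python
def can_live_update(old_rev, new_rev, old_caps, new_caps, strict=False):
    """Decide whether a live update is permitted (hypothetical sketch).

    old_caps / new_caps are the capability claims embedded in the
    previous and replacement actors' JWTs."""
    # The replacement actor must carry a strictly greater revision number.
    if new_rev <= old_rev:
        return False
    # Strict claims checking: reject if the new actor claims anything
    # beyond what the previous actor already had.
    if strict and not set(new_caps).issubset(set(old_caps)):
        return False
    return True
```

Dropping a claim under strict checking is still allowed; only *new* claims cause a rejection.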
Erlang has system APIs that allow for capturing metrics; these can be integrated into the web dashboard widgets to provide pretty usage statistics like uptime, memory usage, process count, etc.
Here are some of those APIs:
http://erlang.org/documentation/doc-5.7.4/erts-5.7.4/doc/html/erlang.html#erlang:system_info-1
To make a first-party capability provider for end-to-end testing, we'll need more than just the HTTP server provider. This should cover the creation of a Rust-based key-value capability provider that supports `wasmcloud:keyvalue` against a Redis database.
This is required before @brooksmtownsend and I can create an end-to-end test that invokes the `kvcounter.wasm` actor, which serves as a kind of basic smoke test for the surrounding runtime.
This should be an epic but github doesn't really support those.
This RFC was originally a simple implementation issue but, after much discussion, we think it needs some more commentary before we implement (or don't) anything.
In the current (`v0.18.x`) state of things, when a capability provider sends an invocation to an actor, it sends a raw payload to the wasmCloud host. The host then wraps that in an invocation, signs it with the host's private key, and then delivers the invocation directly to the actor.
Conversely, when an actor makes a host call, the call bubbles up as a raw payload, which is then wrapped in an invocation and then signed by the host's private key. There is a set of claims on the invocation itself that are signed, which include the target URL, the origin URL, and the operation name. This is, in theory, designed to prevent anyone who has compromised the system from sending fake invocations.
The current state of things requires that both actors and capability providers send raw payloads through the host, which signs the invocation prior to publication across a network. Upon receipt from the network, the destination host will then perform a signature verification against the signed claims on the invocation.
The proposed solution is to, quite simply, do nothing. Capability providers are able to send invocations directly to actors via NATS in the `0.20.x` versions and later, taking the host out of the communication channel as an unnecessary "person in the middle."
The assumption in the proposed solution is that if a malicious entity compromises enough of the system to have full and unfettered access to the NATS server (it has a valid TLS connection to NATS using a valid user JWT and valid user seed), then asking the provider or actor to sign their invocations is nothing more than theater.
Further, in the new (`v0.20.x`) version, capability providers and actors do not have access to the host's signing key. This quite literally means that it's impossible for a capability provider to generate signed credentials without an alternate key, like possibly the provider's signing key or the actor's signing key. This would require private keys to be floating around in the system in violation of our requirement that no long-term private keys are ever deployed to production.
There are actually only a few options here:
We feel that asking capability providers and actors to sign their own invocations in a way that can be verified by the host, when the entities on either end of the transmission do not have a host key, is an even greater security risk than simply not requiring signed claims. Therefore, we're thinking the latter option of not using anti-forgery tokens actually speeds up performance and avoids wasting effort and code on a false security vector.
When you launch a capability provider that is an executable, it is simply launched as a child process. In order for those providers to allow for flexible configuration, we should copy (minus some secrets) environment variables from wasmcloud's environment into the provider's environment.
TODOs were created in #73
Heartbeats are a way of understanding the health of actors, providers, and the host runtime. We currently have placeholder labels for this information in the UI, and event handlers that simply ignore the message. The health information should be merged with the actor, provider, and host state so that we can display accurate health information of the resources running in a host.
There's a subtle but important point with this one. If we implement the extras provider as a standard subscriber to NATS topics, then we'll basically have one extras provider per host, and each of them will respond to every request intended for the extras provider.
On the other hand, if the extras provider subscribes as a queue subscriber to the specific subject for it, and registers itself during startup, then an extras provider will be randomly chosen from among the hosts in the lattice. Given the way NATS works, we can at least guarantee that the one chosen will be in an optimal path, but it will never be local.
In short, the extras provider will incur a network latency penalty just as if it were a "regular" capability provider that was registered and managed externally.
Another alternative is to monitor the namespace and operation for each invocation, and if we see the extras provider, then we can defer those specific invocations to a custom provider.
I prefer the former, where we create a provider that's queue subscribed to the appropriate topic. This will ensure that all provider invocations go through the same code path and we don't end up with all kinds of one-off conditional branches like we had with Actix and the version before it.
In the current experiment, the capability provider is launched with no information. It doesn't know the host ID, doesn't know how to connect to NATS, doesn't know anything. I propose that a `HostData` JSON structure be passed to the capability provider as an environment variable called `WASMCLOUD_HOST_DATA`, with the following structure:
```json
{
  "host_id": "Nxxxxx",
  "lattice_rpc_user_jwt": "ed2xxxxx.xxxx",
  "lattice_rpc_user_seed": "Sxxxxx",
  "lattice_rpc_url": "nats://0.0.0.0:4222",
  "provider_key": "Vxxxxxx"
}
```
The fields are as follows:
- `host_id` - The public key of the host, in the "node" prefix form `Nxxxx`
- `lattice_rpc_user_jwt` - A User JWT to be used for connecting to NATS. This JWT will be minted by the host if the host has been provided a suitable account seed. Otherwise, this will be empty and anonymous authentication will be assumed.
- `lattice_rpc_user_seed` - A user seed required to authenticate against NATS. User credentials for RPC NATS are minted by the host if the host has been provided an account seed. Otherwise, this will be empty and anonymous authentication is assumed.
- `lattice_rpc_url` - The URL of the NATS host for the provider to connect to in order to establish connectivity to the lattice
- `provider_key` - The provider's public key. The rest of the claims are not needed for this provider to do its job, and so it is only given the public key, which is used when generating `Invocation`s to be sent to actors within the lattice.

Accept overridable NATS configuration data (e.g. JWT and seed authentication, configurable URL, etc.) for both the lattice connection and the RPC connection.
Implement the full suite of API functions available in the lattice control interface.
In an effort to keep things as simple and easy and error-free as possible, we'd like to be able to start the host even if NATS isn't running. The host would essentially sit in an idle, non-functional, "not ready" state until it detects that NATS is available. Once NATS is available, it'll connect, make all its subscriptions, enable functionality, and put itself in the "ready" state.
The `StartActor` and `StopActor` messages are potentially a bit outdated now that we support multiple actors of the same public key within a single host (OTP). It might be a better control interface to expose a single `scale` message that contains the desired instance count; if the actor isn't currently running and the scale count is > 0, then we start that actor for the first time.
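The proposed `scale` semantics reduce to a small reconciliation step, sketched here (Python, illustrative; the function and its return shape are hypothetical, not the actual control interface):

```python
def reconcile_scale(current_count, desired_count):
    """Compute the action needed to reach the desired instance count
    for an actor public key (hypothetical sketch of a `scale` handler)."""
    if desired_count < 0:
        raise ValueError("scale count must be >= 0")
    delta = desired_count - current_count
    if delta > 0:
        # Covers the "not currently running" case: current_count == 0.
        return ("start", delta)
    if delta < 0:
        return ("stop", -delta)
    return ("noop", 0)
```

Scaling to 0 then falls out naturally as a stop of all instances, making a separate `StopActor` message unnecessary.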
Currently, the dashboard shows an overview of a single `wasmcloud_host`, though you cannot access multiple hosts in the same dashboard. After #73, multiple hosts will be able to run on different ports on the same machine, but as it is today multiple hosts can be connected to the same lattice.
If it's technically possible, the wasmcloud dashboard should be able to display a "lattice" view that shows all hosts connected to the same lattice, allowing remote control of each of them. I can see this being technically challenging until #5 is implemented, so it may depend on that issue before we can issue commands to remote hosts via the dashboard.
We currently do not enforce this because at the moment there are no providers currently sending anti-forgery tokens on the invocations. Once we have at least the HTTP server and the Redis key-value provider (the ones we use for our automated tests) sending AFTs, then we will need to enable AFT checking.
Produce invocations that use the operation `HealthRequest` (or whatever the actual operation name is called) and periodically invoke actors and capability providers with it. Anything that fails a health request will emit a "health check failed" event. After a configurable number of failed events, the failing entity will be terminated.
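The failure-counting logic might look like this sketch (Python, illustrative; the class name, `max_failures` parameter, and returned action strings are all assumptions):

```python
class HealthWatcher:
    """Tracks consecutive health-check failures per entity and decides
    when the host should terminate it (hypothetical sketch)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # configurable threshold
        self.failures = {}

    def report(self, entity_id, healthy):
        # Record the latest check result and return the resulting action.
        if healthy:
            self.failures[entity_id] = 0  # success resets the counter
            return "ok"
        count = self.failures.get(entity_id, 0) + 1
        self.failures[entity_id] = count
        if count >= self.max_failures:
            return "terminate"
        return "health_check_failed"  # emit the failure event
```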
washboard: Support the ability to upload and start an actor by supplying a file and sending it through the browser.
To reproduce:
`iex -S mix`, then start the httpserver provider and bind it to an actor
Expected:
Comment:
`abort` from iex will do a clean shutdown. There needs to be some way to trigger a clean shutdown of the host, which would trigger the provider shutdown sequence, and could take a minute or two depending on the configured timeouts. [Edited: corrected info about shutdown api]
Implement a Gnat server that subscribes to the relevant lattice control interface subjects and responds accordingly, which should include functionality like responding to host auctions, various queries, mutations, etc.
Provider heartbeats are essential to being able to support #19 remote capability providers. The host needs to subscribe to, and react to, heartbeats coming from providers.
If a provider misses `n` heartbeats, it should be considered offline, and the host that detected the gap will remove the provider from all relevant data structures, as if the provider isn't running. This should not remove any link definitions, because remember that a link definition needs to be persistent.
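The heartbeat-gap rule could be sketched like so (Python, illustrative; the data shapes are hypothetical). Note that link definitions are deliberately left untouched:

```python
def missed_heartbeats(last_seen, now, interval, n):
    # A provider is considered offline once `n` heartbeat intervals
    # have elapsed since its last observed heartbeat.
    return (now - last_seen) >= n * interval

def prune_offline(providers, links, now, interval, n=3):
    """Drop offline providers from runtime state, keeping link
    definitions persistent. `providers` maps provider public key to
    the timestamp of its last heartbeat (hypothetical shapes)."""
    alive = {
        pid: ts
        for pid, ts in providers.items()
        if not missed_heartbeats(ts, now, interval, n)
    }
    return alive, links  # links are untouched by design
```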
The GitHub action we currently use to build `host_core` allows failures with the Credo step and the Dialyzer step, in order to let us cache results and continue with our builds and tests. We should resolve all of these failing issues to create a solid baseline for this repository once we're out of the rapid prototyping phase (e.g. when a TODO: can trigger a failure in the build pipeline).
As a part of this, we should also uncomment the additional Elixir versions in the build action so we test across a variety of environments.
Rather than building a totally generic authorization system, allow people to optionally use Open Policy Agent by providing an OPA policy execution URL and optional username and password for that URL. The policy will be sent the actor JWT claims and some other information (which we will specify/document) and then the "can it make this call" decision will be deferred to the OPA policy.
Upon startup, a host must make requests of the appropriate NATS topics so that it can refresh the cache (ETS tables) of the following types:
Emit host inventory heartbeats per the lattice control interface specification. These heartbeats contain an inventory of everything known to be running inside that host, e.g. actors and providers.
Providers should also emit heartbeats, and the host should support tracking them, "deleting" or removing the data for a provider after it has missed `n` heartbeat detections.
Actors must support health checks, and the host needs to perform the health check on all running actors every `n` (configurable) seconds.
To do:
- `.smithy` model file for the `wasmcloud:keyvalue` capability API, and generated shared library

All potential interactions between actors and capability providers, both internally within a single host (over NATS) and externally across multiple hosts (also over NATS), need to be tested in a test suite that can flag regressions.
Related to #3, supporting OCI artifacts should also only allow resources with embedded claims.
Currently, we allow consumers of the web UI to upload signed wasm files to launch as actors (good) and executables to run as a child process as providers (bad). There are numerous security implications with running an unverified executable, and we should mitigate the majority of those by only allowing providers in a provider-archive or similar format so we can validate claims before starting the executable.
Pick up the namespace prefix, if one is supplied, from the environment variables.
Right now all lattice events are published on `wasmbus.ctl.(prefix).events`; this should be changed to `wasmbus.evt.(prefix)`.
Ensure that when an invocation is requested, it supports the use of a call alias if that alias is stored in the cache.
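The alias lookup itself is a simple cache check, sketched here (Python, illustrative; the cache shape and function name are assumptions):

```python
def resolve_target(target, alias_cache):
    """Resolve an invocation target: if it matches a known call alias
    in the cache, substitute the actor's public key; otherwise assume
    the target is already a public key (hypothetical sketch)."""
    return alias_cache.get(target, target)

# Usage: an alias hits the cache, a public key passes through untouched.
cache = {"wasmcloud/echo": "MXXXX"}
resolved = resolve_target("wasmcloud/echo", cache)
passthrough = resolve_target("MYYYY", cache)
```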
The host runtime needs to monitor the process (`Port`) created when a managed capability provider is executed. If that process terminates, the host needs to react by "removing" the provider from anything that would indicate the provider is up and running in the runtime.
It should also publish the "provider terminated" event.
In any event, if a provider process terminates before the host asks it to do so, then this is an unscheduled termination and should be flagged as an error condition.
I'm not sure if this is a real race condition, but wanted to log this issue so we can come back to it.
The sequence of operations is:
NATS doesn't guarantee delivery order, for example when retry is enabled.
The race condition scenario would be:
I haven't dug into the implementation of `Halt()`, so I don't know exactly what happens in that scenario, but what should happen is that message X gets delivered.
Note that this can happen even if the provider does not send any messages after sending the shutdown ack - the provider could have flushed all output buffers and exited immediately after sending. The problem arises because the message could still be in some NATS message queue or retry queue somewhere.
I suggest adding a delay of `N` seconds (configurable) between the host receiving the shutdown response and calling `Halt`.
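The proposed delay could be sketched as follows (Python, illustrative; `send_shutdown` and `halt` are hypothetical callbacks standing in for the real host calls):

```python
import time

def shutdown_provider(send_shutdown, halt, drain_seconds=2.0):
    """Ask a provider to shut down, then wait a configurable drain
    period before halting, so in-flight NATS messages (queued or in a
    retry buffer) still have a chance to be delivered. Sketch only."""
    ack = send_shutdown()      # provider acks the shutdown request
    time.sleep(drain_seconds)  # grace period for queued messages
    halt()                     # only now tear the process down
    return ack
```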
Support the ability to respond to `liveness` and `readiness` probes. Such probes are commonly used in automated scheduler environments like Kubernetes.
If a liveness probe fails, it means that the process is not fully running and can be killed. Otherwise, the process is up and considered alive (can receive additional workloads like actors and providers).
If a readiness probe fails, it indicates that at least one provider or actor within the host reported a status of unhealthy as of the last internal health check. This readiness probe will return HTTP 500 until all actors and providers in the host report healthy, either through "becoming" healthy or through eviction (the unhealthy entity is de-scheduled).
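The readiness aggregation described above reduces to a small function (Python sketch; the shape of the health report map is an assumption):

```python
def readiness_status(health_reports):
    """Return the HTTP status code for a readiness probe: 200 only when
    every actor and provider in the host reported healthy on the last
    internal health check (hypothetical sketch)."""
    if all(health_reports.values()):
        return 200
    # At least one entity is unhealthy; stays 500 until it recovers
    # or is evicted (de-scheduled) from the host.
    return 500
```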
OCI file download (not necessarily the interpretation of said file) is currently available in here: https://github.com/wasmCloud/wasmCloud/blob/main/crates/wasmcloud-host/src/oci.rs
This issue will be considered closed when there is a tested function in a NIF that can download an OCI file by reference and correctly verify its contents. This elixir-wrapped function will then be used as the basis for starting capability providers and actors by OCI reference instead of by supplying raw bytes.
Support the ability to "start" a capability provider that is already sitting on the appropriate lattice topics. Such a provider would not be able to be terminated, but would still receive all appropriate messages about the creation and removal of link definitions. (Remember that link def assignment is no longer an invocation; providers must subscribe to the `.claims.put` topic.)
Support the following:
Note that we can use Rust for most of this; I suspect we'll just put our `oci_bytes` function inside the native NIF and grab a dependency on the `oci_distribution` crate.
Per issue #34, we discussed that anti-forgery tokens on invocations were actually more security theater than anything else, because anyone with access to the lattice via NATS could simply generate their own host key and in turn sign their invocations. Under the existing `0.18.0` code base, there would be no way of telling the difference between an intruder signing invocations and a legitimate host signing invocations.
The problem we have is one of defense in depth. Assuming that a malicious actor compromises access to NATS with sufficient privilege to post to the RPC or control interface topics, we need a way of telling the difference between legitimate RPC and control interface calls and those originating from malicious entities.
The proposed solution is to give the lattice (cluster) an identity of its own, including a list of valid signing keys. By injecting the lattice identity via environment variables, we can provide a solution to the defense-in-depth problem. When we receive a message either intended as an RPC invocation for an actor or as a message intended to be processed as part of the lattice control interface, we will verify a bearer token that will likely be in a header on the NATS message. The bearer token will be a signed JWT, where the `issuer` field is the public key corresponding to the private key that signed the JWT. The `issuer` must be one of the listed public keys injected with the cluster identity JWT via environment variables.
In this defense-in-depth model, a malicious actor would not only have to compromise enough credentials to gain access to NATS, but they would have to gain physical access to a virtual machine on which one of the wasmCloud host runtimes was running and then compromise that process to obtain a signing key for the invocation bearer tokens.
If we start the wasmCloud host with an environment variable containing the identity of the lattice in the form of a JWT, we automatically gain some beneficial features, such as a potential expiration and not-valid-before date on the cluster itself. The following information could be contained within a JWT to describe lattice claims:
- `signing_keys` - A list of public keys that can be used in the `issuer` field of a signed invocation claim. This allows each wasmCloud host runtime to be started with a potentially different seed key used for signing
- `subject` - The root public key of the cluster name

If the new host runtime is started without a supplied cluster identity, it will create its own so that it can run in isolation/standalone/"single player" mode without inconveniencing the developer. We'll also want to make sure that both `wash` and the web-based dashboard allow for the easy creation of the cluster JWT.
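The issuer check described above might be sketched as follows (Python, illustrative; the claim field names follow standard JWT registered claims as an assumption, and actual cryptographic signature verification, which a JWT library would perform, is omitted):

```python
def verify_invocation_claims(claims, cluster_signing_keys, now):
    """Check an invocation bearer token's claims against the cluster
    identity (hypothetical sketch; signature check not shown)."""
    # The issuer must be one of the signing keys injected with the
    # cluster identity JWT via environment variables.
    if claims.get("issuer") not in cluster_signing_keys:
        return False
    # Honor optional not-before and expiration timestamps.
    if "nbf" in claims and now < claims["nbf"]:
        return False
    if "exp" in claims and now >= claims["exp"]:
        return False
    return True
```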
The NATS topic used for health checks to capability providers is:

```elixir
topic = "wasmbus.rpc.#{state.public_key}.#{state.link_name}.health"
```

The topic name should have the lattice prefix as an additional term, after "rpc".
Add distillery to the project and run through the process of manually creating a release. Automate this into the CI pipeline so that when a `v0.x.y` tag is created, a new distillery release is created for Linux, Mac, and Windows and attached to the release as a `.tar.gz` file, e.g. `wasmcloud-host-x86_linux.tar.gz`.
Let's also figure out a way to automate the generation of the `aarch64` image.
wasmCloud actors are required to be WebAssembly modules with an embedded JWT that contains the following information:
The non-optional fields for token expiration and NBF date should be validated to ensure that an actor's claims are valid; if they are not, the actor should not be scheduled, and an appropriate error message should be emitted to denote that the claims are invalid. (Claims are validated in the Rust crate in this repository, and this should already be implemented. A test could be appropriate to ensure this works.)
When an actor attempts to make an invocation to a provider, its capability claims should be inspected for the appropriate capability contract ID, and the invocation should be rejected if the actor is not signed with that capability.
Capability providers get a JSON payload made available in the environment during startup. The problem is that there are a number of ways this JSON payload can cause the environment variable to not work/not read properly.
As a solution, the host should base64 encode the JSON before setting the environment variable.
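The encode/decode round trip is straightforward; here is a sketch (Python, illustrative only; the payload is a truncated example, and the real host and providers do this in Elixir and Rust respectively):

```python
import base64
import json
import os

# Hypothetical, truncated HostData payload.
payload = {"host_id": "Nxxxxx", "lattice_rpc_url": "nats://0.0.0.0:4222"}

# Host side: serialize and base64-encode before setting the variable,
# so quotes, braces, and whitespace in the JSON can't break how the
# environment variable is written or read.
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
os.environ["WASMCLOUD_HOST_DATA"] = encoded

# Provider side: decode and parse on startup.
decoded = json.loads(base64.b64decode(os.environ["WASMCLOUD_HOST_DATA"]))
```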