wasmcloud / wasmcloud-otp
wasmCloud host runtime that leverages Elixir/OTP and Rust to provide simple, secure, distributed application development using the actor model
License: Apache License 2.0
Related to #6
We publish CloudEvents for events like `provider_started`, `actor_started`, `provider_stopped`, and more after the above issue. We should have a module that can be instantiated and monitor for these types of events, similar to what was implemented in wasmCloud/wasmCloud#183.
The goal of such a module would be to provide a way to monitor events that occur in a wasmcloud host / lattice, which could either be used to replay events leading up to an issue or assert that specific events had occurred. For example, when we write tests that start a provider, we can ensure that the provider has had enough time to spawn as a process and is ready for invocations.
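The assertion side of such a module could look like the following minimal sketch (Python, purely illustrative; the real host is Elixir/OTP, and names like `EventMonitor`, `record`, and `wait_for` are hypothetical, not part of any actual API):

```python
import threading

class EventMonitor:
    """Records lattice CloudEvents so tests can replay them or assert
    that specific events occurred. (Sketch; real events arrive over NATS.)"""

    def __init__(self):
        self._events = []
        self._cond = threading.Condition()

    def record(self, event_type, payload):
        # Called by the event subscriber for each published CloudEvent.
        with self._cond:
            self._events.append({"type": event_type, "data": payload})
            self._cond.notify_all()

    def wait_for(self, event_type, timeout=5.0):
        # Block until an event of the given type has been seen, or time out.
        with self._cond:
            return self._cond.wait_for(
                lambda: any(e["type"] == event_type for e in self._events),
                timeout=timeout,
            )

    def replay(self):
        # Return a copy of everything seen so far, in order.
        return list(self._events)
```

A test that starts a provider could then call `monitor.wait_for("provider_started")` before sending invocations, instead of sleeping for an arbitrary interval.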
Allow someone to upload or paste a manifest file into a web UI; the web UI should then perform all of the loads and configurations as defined in the file.
Complete all of the event publications on the lattice control interface event topic (currently only 4 events are published). This also includes handling those events to monitor and maintain the shared "lattice state"
Support the ability to specify custom host labels from the environment as well as the default/intrinsic labels like OS family, CPU, etc.
For managed (started as executable files, which could've been downloaded from OCI) providers, the host should mint user credentials for the lattice (if an account seed is available to the host). This is a follow-up to the issue where providers need to get the host data via environment variable #36
A logging provider needs to be implemented/ported for OTP. Rather than make it talk over NATS, which would add congestion to the messaging network, it can be embedded in the host process, and communicate via a separate TCP or UDP channel to a logging collector or proxy.
There's an Erlang/Elixir OpenTelemetry API that we could use to connect over UDP or TCP to an OTEL receiver. This would add a couple of configuration variables to the host for connecting to the receiver. I don't think it's necessary to make those variables part of the actor link binding - a single OTEL receiver configuration per runtime host should be sufficient for now.
At the same time as integrating OTEL, the logging capability api should be reviewed to see if we want to expose more OTEL features into that api (at a minimum, a span id).
Additional nice-to-have feature: maintain a per-actor logging level threshold (trace...critical). If this could be changed at runtime via washboard, then you could enable debug or trace logging for one actor that you're debugging, and still leave everything else at the system default logging level. This would improve the signal/noise ratio for a better dev experience.
Similar to the live update functionality in the current `0.18.0` runtime, we should support live updates for actors both with a direct invocation and via the lattice control interface.
A live update in this scenario is an update from one actor to a newer revision of the same actor, as indicated by the revision information on the actor's embedded JWT. We should enforce that the new actor has a revision that is greater than the previous actor's revision. We should also include an option for the Host itself to enforce strict claims checking on live updates, which simply means that actors cannot live update if the newer actor has a superset of the claims that are already included on the previous actor.
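The revision and strict-claims rules above could be sketched as follows (Python, illustrative only; the function name and parameters are hypothetical and do not reflect the actual host API):

```python
def can_live_update(old_rev, new_rev, old_caps, new_caps, strict=False):
    """Decide whether a live update is permitted (hypothetical sketch).

    old_caps / new_caps are the capability claims embedded in the
    previous and replacement actors' JWTs."""
    # The replacement actor must carry a strictly greater revision number.
    if new_rev <= old_rev:
        return False
    # Strict claims checking: reject if the new actor claims anything
    # beyond what the previous actor already had.
    if strict and not set(new_caps).issubset(set(old_caps)):
        return False
    return True
```

Dropping a claim under strict checking is still allowed; only *new* claims cause a rejection.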
Erlang has system APIs that allow for capturing metrics; these can be integrated into the web dashboard widgets to provide pretty usage statistics like uptime, memory usage, process count, etc.
Here are some of those APIs:
http://erlang.org/documentation/doc-5.7.4/erts-5.7.4/doc/html/erlang.html#erlang:system_info-1
To make a first-party capability provider for end-to-end testing, we'll need more than just the HTTP server provider. This should cover the creation of a Rust-based key-value capability provider that supports `wasmcloud:keyvalue` against a Redis database.
This is required before @brooksmtownsend and I can create an end-to-end test that invokes the `kvcounter.wasm` actor, which serves as a kind of basic smoke test for the surrounding runtime.
This should be an epic but github doesn't really support those.
This RFC was originally a simple implementation issue but, after much discussion, we think it needs some more commentary before we implement (or don't) anything.
In the current (`v0.18.x`) state of things, when a capability provider sends an invocation to an actor, it sends a raw payload to the wasmCloud host. The host then wraps that in an invocation, signs it with the host's private key, and then delivers the invocation directly to the actor.
Conversely, when an actor makes a host call, the call bubbles up as a raw payload, which is then wrapped in an invocation and then signed by the host's private key. There is a set of claims on the invocation itself that are signed, which include the target URL, the origin URL, and the operation name. This is, in theory, designed to prevent anyone who has compromised the system from sending fake invocations.
The current state of things requires that both actors and capability providers send raw payloads through the host, which signs the invocation prior to publication across a network. Upon receipt from the network, the destination host will then perform a signature verification against the signed claims on the invocation.
The proposed solution is to, quite simply, do nothing. Capability providers are able to send invocations directly to actors via NATS in the `0.20.x` versions and later, taking the host out of the communication channel as an unnecessary "person in the middle."
The assumption in the proposed solution is that if a malicious entity compromises enough of the system to have full and unfettered access to the NATS server (it has a valid TLS connection to NATS using a valid user JWT and valid user seed), then asking the provider or actor to sign their invocations is nothing more than theater.
Further, in the new (`v0.20.x`) version, capability providers and actors do not have access to the host's signing key. This quite literally means that it's impossible for a capability provider to generate signed credentials without an alternate key, like possibly the provider's signing key or the actor's signing key. This would require private keys to be floating around in the system in violation of our requirement that no long-term private keys are ever deployed to production.
There are actually only a few options here:
We feel that asking capability providers and actors to sign their own invocations in a way that can be verified by the host, when the entities on either end of the transmission do not have a host key, is an even greater security risk than simply not requiring signed claims. Therefore, we're thinking the latter option of not using anti-forgery tokens actually speeds up performance and avoids wasting effort and code on a false security vector.
When you launch a capability provider that is an executable, it is simply launched as a child process. In order for those providers to allow for flexible configuration, we should copy (minus some secrets) environment variables from wasmcloud's environment into the provider's environment.
TODOs were created in #73
Heartbeats are a way of understanding the health of actors, providers, and the host runtime. We currently have placeholder labels for this information in the UI, and event handlers that simply ignore the message. The health information should be merged with the actor, provider, and host state so that we can display accurate health information of the resources running in a host.
There's a subtle but important point with this one. If we implement the extras provider as a standard subscriber to NATS topics, then we'll basically have one extras provider per host, and each of them will respond to every request intended for the extras provider.
On the other hand, if the extras provider subscribes as a queue subscriber to the specific subject for it, and registers itself during startup, then an extras provider will be randomly chosen from among the hosts in the lattice. Given the way NATS works, we can at least guarantee that the one chosen will be in an optimal path, but it will never be local.
In short, the extras provider will incur a network latency penalty just as if it were a "regular" capability provider that was registered and managed externally.
Another alternative is to monitor the namespace and operation for each invocation, and if we see the extras provider, then we can defer those specific invocations to a custom provider.
I prefer the former, where we create a provider that's queue subscribed to the appropriate topic. This will ensure that all provider invocations go through the same code path and we don't end up with all kinds of one-off conditional branches like we had with Actix and the version before it.
In the current experiment, the capability provider is launched with no information. It doesn't know the host ID, doesn't know how to connect to NATS, doesn't know anything. I propose that a `HostData` JSON structure be passed to the capability provider as an environment variable called `WASMCLOUD_HOST_DATA`, with the following structure:
```json
{
  "host_id": "Nxxxxx",
  "lattice_rpc_user_jwt": "ed2xxxxx.xxxx",
  "lattice_rpc_user_seed": "Sxxxxx",
  "lattice_rpc_url": "nats://0.0.0.0:4222",
  "provider_key": "Vxxxxxx"
}
```
The fields are as follows:
- `host_id` - The public key of the host, in the "node" prefix form `Nxxxx`
- `lattice_rpc_user_jwt` - A User JWT to be used for connecting to NATS. This JWT will be minted by the host if the host has been provided a suitable account seed. Otherwise, this will be empty and anonymous authentication will be assumed.
- `lattice_rpc_user_seed` - A user seed required to authenticate against NATS. User credentials for RPC NATS are minted by the host if the host has been provided an account seed. Otherwise, this will be empty and anonymous authentication is assumed.
- `lattice_rpc_url` - The URL of the NATS host for the provider to connect to in order to establish connectivity to the lattice
- `provider_key` - The provider's public key. The rest of the claims are not needed for this provider to do its job, and so it is only given the public key, which is used when generating `Invocation`s to be sent to actors within the lattice.

Accept overridable NATS configuration data (e.g. JWT and seed authentication, configurable URL, etc.) for both the lattice connection and the RPC connection.
Implement the full suite of API functions available in the lattice control interface.
In an effort to keep things as simple and easy and error-free as possible, we'd like to be able to start the host even if NATS isn't running. The host would essentially sit in an idle, non-functional, "not ready" state until it detects that NATS is available. Once NATS is available, it'll connect, make all its subscriptions, enable functionality, and put itself in the "ready" state.
The `StartActor` and `StopActor` messages are potentially a bit outdated now that we support multiple actors of the same public key within a single host (OTP). It might be a better control interface to expose a single `scale` message that contains the desired instance count; if the actor isn't currently running and the scale count is > 0, then we start that actor for the first time.
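The proposed `scale` semantics reduce to a small reconciliation step, sketched here (Python, illustrative; the function and its return shape are hypothetical, not the actual control interface):

```python
def reconcile_scale(current_count, desired_count):
    """Compute the action needed to reach the desired instance count
    for an actor public key (hypothetical sketch of a `scale` handler)."""
    if desired_count < 0:
        raise ValueError("scale count must be >= 0")
    delta = desired_count - current_count
    if delta > 0:
        # Covers the "not currently running" case: current_count == 0.
        return ("start", delta)
    if delta < 0:
        return ("stop", -delta)
    return ("noop", 0)
```

Scaling to 0 then falls out naturally as a stop of all instances, making a separate `StopActor` message unnecessary.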
Currently, the dashboard shows an overview of a single `wasmcloud_host`, though you cannot access multiple hosts in the same dashboard. After #73, multiple hosts will be able to run on different ports on the same machine, but as it is today multiple hosts can be connected to the same lattice.
If it's technically possible, the wasmcloud dashboard should be able to display a "lattice" view that shows all hosts connected to the same lattice, allowing remote control of each of them. I can see this being technically challenging until #5 is implemented, so it may depend on that issue before we can issue commands to remote hosts via the dashboard.
We currently do not enforce this because at the moment there are no providers currently sending anti-forgery tokens on the invocations. Once we have at least the HTTP server and the Redis key-value provider (the ones we use for our automated tests) sending AFTs, then we will need to enable AFT checking.
Produce invocations that use the operation `HealthRequest` (or whatever the actual operation name is called) and periodically invoke actors and capability providers with it. Anything that fails a health request will emit a "health check failed" event. After a configurable number of failed events, the failing entity will be terminated.
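The failure-counting logic might look like this sketch (Python, illustrative; the class name, `max_failures` parameter, and returned action strings are all assumptions):

```python
class HealthWatcher:
    """Tracks consecutive health-check failures per entity and decides
    when the host should terminate it (hypothetical sketch)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # configurable threshold
        self.failures = {}

    def report(self, entity_id, healthy):
        # Record the latest check result and return the resulting action.
        if healthy:
            self.failures[entity_id] = 0  # success resets the counter
            return "ok"
        count = self.failures.get(entity_id, 0) + 1
        self.failures[entity_id] = count
        if count >= self.max_failures:
            return "terminate"
        return "health_check_failed"  # emit the failure event
```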
washboard: Support the ability to upload and start an actor by supplying a file and sending it through the browser.
To reproduce:
`iex -S mix`, then start the httpserver provider and bind it to an actor
Expected:
Comment:
`abort` from iex will do a clean shutdown. There needs to be some way to trigger a clean shutdown of the host, which would trigger the provider shutdown sequence, and could take a minute or two depending on the configured timeouts. [Edited: corrected info about shutdown api]
Implement a Gnat server that subscribes to the relevant lattice control interface subjects and responds accordingly, which should include functionality like responding to host auctions, various queries, mutations, etc.
Provider heartbeats are essential to being able to support #19 remote capability providers. The host needs to subscribe to, and react to, heartbeats coming from providers.
If a provider misses `n` heartbeats, it should be considered offline, and the host that detected the gap will remove the provider from all relevant data structures, as if the provider isn't running. This should not remove any link definitions, because remember that a link definition needs to be persistent.
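The heartbeat-gap rule could be sketched like so (Python, illustrative; the data shapes are hypothetical). Note that link definitions are deliberately left untouched:

```python
def missed_heartbeats(last_seen, now, interval, n):
    # A provider is considered offline once `n` heartbeat intervals
    # have elapsed since its last observed heartbeat.
    return (now - last_seen) >= n * interval

def prune_offline(providers, links, now, interval, n=3):
    """Drop offline providers from runtime state, keeping link
    definitions persistent. `providers` maps provider public key to
    the timestamp of its last heartbeat (hypothetical shapes)."""
    alive = {
        pid: ts
        for pid, ts in providers.items()
        if not missed_heartbeats(ts, now, interval, n)
    }
    return alive, links  # links are untouched by design
```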
The GitHub action we currently use to build `host_core` allows failures with the Credo step and the Dialyzer step, in order to let us cache results and continue with our builds and tests. We should resolve all of these failing issues to create a solid baseline for this repository once we're out of the rapid prototyping phase (e.g. when a TODO: can trigger a failure in the build pipeline).
As a part of this, we should also uncomment the additional Elixir versions in the build action so we test across a variety of environments.
Rather than building a totally generic authorization system, allow people to optionally use Open Policy Agent by providing an OPA policy execution URL and optional username and password for that URL. The policy will be sent the actor JWT claims and some other information (which we will specify/document) and then the "can it make this call" decision will be deferred to the OPA policy.
Upon startup, a host must make requests of the appropriate NATS topics so that it can refresh the cache (ETS tables) of the following types:
Emit host inventory heartbeats per the lattice control interface specification. These heartbeats contain an inventory of everything known to be running inside that host, e.g. actors and providers.
Providers should also emit heartbeats, and the host should support tracking them, "deleting" or removing the data for a provider after it has missed `n` heartbeat detections.
Actors must support health checks, and the host needs to perform the health check on all running actors every `n` (configurable) seconds.
To do:
- `.smithy` model file for the `wasmcloud:keyvalue` capability API, and generated shared library

All potential interactions between actors and capability providers, both internally within a single host (over NATS) and externally across multiple hosts (also over NATS), need to be tested in a test suite that can flag regressions.
Related to #3, supporting OCI artifacts should also only allow resources with embedded claims.
Currently, we allow consumers of the web UI to upload signed wasm files to launch as actors (good) and executables to run as a child process as providers (bad). There are numerous security implications with running an unverified executable, and we should mitigate the majority of those by only allowing providers in a provider-archive or similar format so we can validate claims before starting the executable.
Pick up the namespace prefix, if one is supplied, from the environment variables.
Right now all lattice events are published on `wasmbus.ctl.(prefix).events`; this should be changed to `wasmbus.evt.(prefix)`.
Ensure that when an invocation is requested, it supports the use of a call alias if that alias is stored in the cache.
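The alias lookup itself is a simple cache check, sketched here (Python, illustrative; the cache shape and function name are assumptions):

```python
def resolve_target(target, alias_cache):
    """Resolve an invocation target: if it matches a known call alias
    in the cache, substitute the actor's public key; otherwise assume
    the target is already a public key (hypothetical sketch)."""
    return alias_cache.get(target, target)

# Usage: an alias hits the cache, a public key passes through untouched.
cache = {"wasmcloud/echo": "MXXXX"}
resolved = resolve_target("wasmcloud/echo", cache)
passthrough = resolve_target("MYYYY", cache)
```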
The host runtime needs to monitor the process (`Port`) created when a managed capability provider is executed. If that process terminates, the host needs to react by "removing" the provider from anything that would indicate the provider is up and running in the runtime.
It should also publish the "provider terminated" event.
In any event, if a provider process terminates before the host asks it to do so, then this is an unscheduled termination and should be flagged as an error condition.
I'm not sure if this is a real race condition, but wanted to log this issue so we can come back to it.
The sequence of operations is:
NATS doesn't guarantee delivery order, for example when retry is enabled.
The race condition scenario would be:
I haven't dug into the implementation of `Halt()`, so I don't know exactly what happens in that scenario, but what should happen is that message X gets delivered.
Note that this can happen even if the provider does not send any messages after sending the shutdown ack - the provider could have flushed all output buffers and exited immediately after sending. The problem arises because the message could still be in some NATS message queue or retry queue somewhere.
I suggest adding a delay of `N` seconds (configurable) between the host receiving the shutdown response and calling `Halt`.
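The proposed delay could be sketched as follows (Python, illustrative; `send_shutdown` and `halt` are hypothetical callbacks standing in for the real host calls):

```python
import time

def shutdown_provider(send_shutdown, halt, drain_seconds=2.0):
    """Ask a provider to shut down, then wait a configurable drain
    period before halting, so in-flight NATS messages (queued or in a
    retry buffer) still have a chance to be delivered. Sketch only."""
    ack = send_shutdown()      # provider acks the shutdown request
    time.sleep(drain_seconds)  # grace period for queued messages
    halt()                     # only now tear the process down
    return ack
```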
Support the ability to respond to `liveness` and `readiness` probes. Such probes are commonly used in automated scheduler environments like Kubernetes.
If a liveness probe fails, it means that the process is not fully running and can be killed. Otherwise, the process is up and considered alive (can receive additional workloads like actors and providers).
If a readiness probe fails, it indicates that at least one provider or actor within the host reported a status of unhealthy as of the last internal health check. This readiness probe will return HTTP 500 until all actors and providers in the host report healthy, either through "becoming" healthy or through eviction (the unhealthy entity is de-scheduled).
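The readiness aggregation described above reduces to a small function (Python sketch; the shape of the health report map is an assumption):

```python
def readiness_status(health_reports):
    """Return the HTTP status code for a readiness probe: 200 only when
    every actor and provider in the host reported healthy on the last
    internal health check (hypothetical sketch)."""
    if all(health_reports.values()):
        return 200
    # At least one entity is unhealthy; stays 500 until it recovers
    # or is evicted (de-scheduled) from the host.
    return 500
```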
OCI file download (not necessarily the interpretation of said file) is currently available in here: https://github.com/wasmCloud/wasmCloud/blob/main/crates/wasmcloud-host/src/oci.rs
This issue will be considered closed when there is a tested function in a NIF that can download an OCI file by reference and correctly verify its contents. This elixir-wrapped function will then be used as the basis for starting capability providers and actors by OCI reference instead of by supplying raw bytes.
Support the ability to "start" a capability provider that is already sitting on the appropriate lattice topics. Such a provider would not be able to be terminated, but would still receive all appropriate messages about the creation and removal of link definitions. (Remember that link def assignment is no longer an invocation; providers must subscribe to the `.claims.put` topic.)
Support the following:
Note that we can use Rust for most of this; I suspect we'll just put our `oci_bytes` function inside the native NIF and grab a dependency on the `oci_distribution` crate.
Per issue #34, we discussed that anti-forgery tokens on invocations were actually more security theater than anything else, because anyone with access to the lattice via NATS could simply generate their own host key and in turn sign their invocations. Under the existing `0.18.0` code base, there would be no way of telling the difference between an intruder signing invocations and a legitimate host signing invocations.
The problem we have is one of defense in depth. Assuming that a malicious actor compromises access to NATS with sufficient privilege to post to the RPC or control interface topics, we need a way of telling the difference between legitimate RPC and control interface calls and those originating from malicious entities.
The proposed solution is to give the lattice (cluster) an identity of its own, including a list of valid signing keys. By injecting the lattice identity via environment variables, we can provide a solution to the defense-in-depth problem. When we receive a message either intended as an RPC invocation for an actor or as a message intended to be processed as part of the lattice control interface, we will verify a bearer token that will likely be in a header on the NATS message. The bearer token will be a signed JWT, where the `issuer` field is the public key corresponding to the private key that signed the JWT. The `issuer` must be one of the listed public keys injected with the cluster identity JWT via environment variables.
In this defense-in-depth model, a malicious actor would not only have to compromise enough credentials to gain access to NATS, but they would have to gain physical access to a virtual machine on which one of the wasmCloud host runtimes was running and then compromise that process to obtain a signing key for the invocation bearer tokens.
If we start the wasmCloud host with an environment variable containing the identity of the lattice in the form of a JWT, we automatically gain some beneficial features, such as a potential expiration and not-valid-before date on the cluster itself. The following information could be contained within a JWT to describe lattice claims:
- `signing_keys` - A list of public keys that can be used in the `issuer` field of a signed invocation claim. This allows each wasmCloud host runtime to be started with a potentially different seed key used for signing
- `subject` - The root public key of the cluster name

If the new host runtime is started without a supplied cluster identity, it will create its own so that it can run in isolation/standalone/"single player" mode without inconveniencing the developer. We'll also want to make sure that both `wash` and the web-based dashboard allow for the easy creation of the cluster JWT.
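The issuer check described above might be sketched as follows (Python, illustrative; the claim field names follow standard JWT registered claims as an assumption, and actual cryptographic signature verification, which a JWT library would perform, is omitted):

```python
def verify_invocation_claims(claims, cluster_signing_keys, now):
    """Check an invocation bearer token's claims against the cluster
    identity (hypothetical sketch; signature check not shown)."""
    # The issuer must be one of the signing keys injected with the
    # cluster identity JWT via environment variables.
    if claims.get("issuer") not in cluster_signing_keys:
        return False
    # Honor optional not-before and expiration timestamps.
    if "nbf" in claims and now < claims["nbf"]:
        return False
    if "exp" in claims and now >= claims["exp"]:
        return False
    return True
```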
The NATS topic used for health checks to capability providers is:

```elixir
topic = "wasmbus.rpc.#{state.public_key}.#{state.link_name}.health"
```

The topic name should have the lattice prefix as an additional term, after "rpc".
Add distillery to the project and run through the process of manually creating a release. Automate this into the CI pipeline so that when a `v0.x.y` tag is created, a new distillery release is created for Linux, Mac, and Windows and attached to the release as a `.tar.gz` file, e.g. `wasmcloud-host-x86_linux.tar.gz`.
Let's also figure out a way to automate the generation of the `aarch64` image.
wasmCloud actors are required to be WebAssembly modules with an embedded JWT that contains the following information:
The non-optional fields for token expiration and NBF date should be validated to ensure that an actor's claims are valid; if they are not, the actor should not be scheduled, and an appropriate error message should be emitted to denote that the claims are invalid. (Claims are validated in the Rust crate in this repository, and this should already be implemented. A test could be appropriate to ensure this works.)
When an actor attempts to make an invocation to a provider, its capability claims should be inspected for the appropriate capability contract ID, and the invocation should be rejected if the actor is not signed with that capability.
Capability providers get a JSON payload made available in the environment during startup. The problem is that there are a number of ways this JSON payload can cause the environment variable to not work/not read properly.
As a solution, the host should base64 encode the JSON before setting the environment variable.
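The encode/decode round trip is straightforward; here is a sketch (Python, illustrative only; the payload is a truncated example, and the real host and providers do this in Elixir and Rust respectively):

```python
import base64
import json
import os

# Hypothetical, truncated HostData payload.
payload = {"host_id": "Nxxxxx", "lattice_rpc_url": "nats://0.0.0.0:4222"}

# Host side: serialize and base64-encode before setting the variable,
# so quotes, braces, and whitespace in the JSON can't break how the
# environment variable is written or read.
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
os.environ["WASMCLOUD_HOST_DATA"] = encoded

# Provider side: decode and parse on startup.
decoded = json.loads(base64.b64decode(os.environ["WASMCLOUD_HOST_DATA"]))
```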